Tuesday, March 27, 2012

Cursor looping versus set-based queries

I know this question has been asked. And the usual answer is don't use
cursors or any other looping method. Instead, try to find a solution
that uses set-based queries.

But this brings up several questions / scenarios:

* I created several stored procedures that take parameters and insert
the data into the appropriate tables. This was done for easy access/use
from client-side apps (i.e. web-based).

Proper development tactics say to try and do "code reuse". So, if I
already have stored procs that do my logic, should I be writing a
second way of handling the data? If I ever need to change the way the
data is handled, I now have to make the same change in two (or more)
places.

* Different data from the same row needs to be inserted into multiple
tables. "Common sense" (maybe "gut instinct" is better) says to handle
each row as a "unit". Seems weird to process the entire set for one
table, then to process the entire set AGAIN for another table, and then
YET AGAIN for a third table, and so on.

* Exception handling. Set-based processing means that if one row fails
the entire set fails. Looping through allows you to fail a row but
allow everything else to be processed properly. It also allows you to
gather statistics. (How many failed, how many worked, how many were
skipped, etc.)

?? Good idea ?? The alternative is to create a temporary table (sandbox
or workspace type thing), copy the data to there along with "status" or
"validation" columns, run through the set many times over looking for
any rows that may fail, marking them as such, and then at the end only
dealing with those rows which "passed" the testing. Of course, in order
for this to work you must know (and duplicate) all constraints so you
know what to look for in your testing.

Disclaimer: difficult to answer without more specific info.

JayCallas@.hotmail.com wrote:
> I know this question has been asked. And the usual answer is don't use
> cursors or any other looping method. Instead, try to find a solution
> that uses set-based queries.
> But this brings up several questions / scenarios:
> * I created several stored procedures that take parameters and inserts
> the data into the appropriate tables. This was done for easy access/use
> from client side apps (i.e. web-based).
> Proper development tactics says to try and do "code reuse". So, if I
> already have stored procs that do my logic, should I be writing a
> second way of handling the data? If I ever need to change the way the
> data is handled, I now have to make the same change in two (or more)
> places.

If a SP encapsulates a plain INSERT I'd throw it out. If there is more
complex logic involved I'd probably leave it in.

> * Different data from the same row needs to be inserted into multiple
> tables. "Common sense" (maybe "gut instinct" is better) says to handle
> each row as a "unit". Seems weird to process the entire set for one
> table, then to process the entire set AGAIN for another table, and then
> YET AGAIN for a third table, and so on.

This smells like a design issue. If you have to insert the exact same
data into multiple tables chances are that your table layout is flawed.

> * Exception handling. Set based processing means that if one row fails
> the entire set fails. Looping through allows you to fail a row but
> allow everything else to be processed properly. It also allows you to
> gather statistics. (How many failed, how many worked, how many were
> skipped, etc.)

Normally you want all or nothing. This one sounds as if you placed
business logic into the database. This might or might not be a good idea,
but there seems to be a chance that this is better done in a middle tier
(for example because it eases porting to another RDBMS). But it depends
on the error handling and operations you do in SPs.

> ?? Good idea ?? The alternative is to create a temporary table (sandbox
> or workspace type thing), copy the data to there along with "status" or
> "valdation" columns, run through the set many times over looking for
> any rows that may fail, marking them as such, and then at the end only
> dealing with those rows which "passed" the testing. Of course, in order
> for this to work you must know (and duplicate) all constraints so you
> know what to look for in your testing.

I wouldn't recommend one or the other direction with so little info.

Kind regards

robert

On 28 Mar 2006 12:06:31 -0800, JayCallas@.hotmail.com wrote:

(snip)
>Proper development tactics says to try and do "code reuse".

Hi Jay,

This is not always true when dealing with databases. You have to weigh
the benefit of code reuse (cheaper maintenance) against the benefit of
code duplication (better performance).

In databases where performance matters, I won't hesitate a single second
to duplicate a stored procedure and change it slightly to optimize for
specific situations.

>* Different data from the same row needs to be inserted into multiple
>tables. "Common sense" (maybe "gut instinct" is better) says to handle
>each row as a "unit". Seems weird to process the entire set for one
>table, then to process the entire set AGAIN for another table, and then
>YET AGAIN for a third table, and so on.

I agree with Robert's reaction about the validity of such a design. But
if we assume that this really is valid, then you'll probably find that
the second, third, etc. access to the data is faster than the first
access, because the data can all be fetched from cache and no physical
disk I/O is needed.

Of course, you can always test both versions against each other - I
think that you'll be hard-pressed to find a scenario where using a
cursor outperforms using the same SELECT statement two or three times in
a row.

>* Exception handling. Set based processing means that if one row fails
>the entire set fails. Looping through allows you to fail a row but
>allow everything else to be processed properly. It also allows you to
>gather statistics. (How many failed, how many worked, how many were
>skipped, etc.)

In an RDBMS, transactions should be atomic (the A in the ACID properties
of transactions) - they either succeed as a whole, or they fail as a
whole.

If you need to exclude rows that would violate a constraint, check the
constraint in the WHERE clause.
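For example (a rough sketch; the table and column names here are invented),
a NOT NULL rule and a foreign key check can be folded into the INSERT itself:

INSERT INTO TargetTable (Symbol, Price)
SELECT s.Symbol, s.Price
FROM StagingTable AS s
WHERE s.Price IS NOT NULL                      -- row would violate the NOT NULL constraint
  AND EXISTS (SELECT *
              FROM SecurityMaster AS m
              WHERE m.Symbol = s.Symbol)       -- row would violate the foreign key

Rows that fail either test are simply left out of the set instead of
aborting the whole statement.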

> The alternative is to create a temporary table (sandbox
>or workspace type thing), copy the data to there along with "status" or
>"valdation" columns, run through the set many times over looking for
>any rows that may fail, marking them as such,

No need to loop over the data many times. In most cases, you only need a
single UPDATE with a CASE to check the various constraints and set the
"status" column accordingly.

> and then at the end only
>dealing with those rows which "passed" the testing. Of course, in order
>for this to work you must know (and duplicate) all constraints so you
>know what to look for in your testing.

Yes. You need to know your constraints.

Isn't that a normal part of your job?

--
Hugo Kornelis, SQL Server MVP

JayCallas@.hotmail.com wrote:
> I know this question has been asked. And the usual answer is don't use
> cursors or any other looping method. Instead, try to find a solution
> that uses set-based queries.
(snip)

Another reason why code re-use is less of an issue in SQL compared to
general programming languages is that it's so easy to generate scripts
automatically from your database's metadata. That's particularly the
case for CRUD scripts, transformation scripts, archiving, auditing,
etc. If you have good constraints and naming conventions then you can
automate the production of thousands of lines of code in seconds. That
may not be as important as Robert's and Hugo's other sensible comments, but
it is a distinct advantage.
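As a rough illustration of the idea (this assumes SQL Server 2005 for the
FOR XML PATH trick; on SQL 2000 you would concatenate in a loop or in the
client), a column list per table can be pulled straight from the metadata
and spliced into templated CRUD procedures:

-- Build a comma-separated column list for every base table.
SELECT t.TABLE_SCHEMA, t.TABLE_NAME,
       STUFF((SELECT ', ' + c.COLUMN_NAME
              FROM INFORMATION_SCHEMA.COLUMNS AS c
              WHERE c.TABLE_SCHEMA = t.TABLE_SCHEMA
                AND c.TABLE_NAME = t.TABLE_NAME
              ORDER BY c.ORDINAL_POSITION
              FOR XML PATH('')), 1, 2, '') AS ColumnList
FROM INFORMATION_SCHEMA.TABLES AS t
WHERE t.TABLE_TYPE = 'BASE TABLE'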

--
David Portas, SQL Server MVP

Whenever possible please post enough code to reproduce your problem.
Including CREATE TABLE and INSERT statements usually helps.
State what version of SQL Server you are using and specify the content
of any error messages.

SQL Server Books Online:
http://msdn2.microsoft.com/library/...US,SQL.90).aspx
--

Thanks for replying. As you probably guessed, there are specific
reasons why I am asking these questions. Maybe a bit more info would be
helpful.

* I am not talking about the same data being written to multiple
tables. I am referring to situations where different columns from the same
row are written to different tables.

A concrete example would be security (stocks) data. There are different
types of securities. (Equities and Options are just two of them). Both
types of securities share common information like symbol, name, and
security type. But options have additional data (strike price, class,
and expiration).

My current schema has a table named SecurityMaster where common data
goes with the primary key being Symbol. I also have a
SecurityMasterDerivative table where I put Option specific data which
has a foreign key reference to the SecurityMaster table.

So when I get a new security master file, the set-based approach says
to insert all the new common security data into the SecurityMaster
table and then go back and reprocess the new data to insert the
option-specific data. My dilemma is that I feel it is more "natural" to
treat each row as a single entity.
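To make the discussion concrete, a sketch of what the set-based version
might look like (the staging table and the exact column names are my
assumptions, based on the description above):

-- Pass 1: the common columns for every new security
INSERT INTO SecurityMaster (Symbol, Name, SecurityType)
SELECT s.Symbol, s.Name, s.SecurityType
FROM SecurityStaging AS s
WHERE NOT EXISTS (SELECT *
                  FROM SecurityMaster AS m
                  WHERE m.Symbol = s.Symbol)

-- Pass 2: the same staging rows again, but only the option-specific columns
INSERT INTO SecurityMasterDerivative (Symbol, StrikePrice, Class, Expiration)
SELECT s.Symbol, s.StrikePrice, s.Class, s.Expiration
FROM SecurityStaging AS s
WHERE s.SecurityType = 'Option'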

* The stored procedure approach I mentioned in the first point is
related to the example I just gave. When inserting a new security into
the "Security Master database" (I do not mean the single table), there
is at least one table involved, but possibly up to 4 tables, depending
on the type of security that we are dealing with.

* As far as exception handling goes, why is it considered "normal" to
want all of the data or none of it? What about the idea of getting
"most" of the data in and then dealing with those "exceptions"? Is it
considered "business logic" to say "I'd rather have the majority of the
data as opposed to NONE of the data"?

I understand that I can make one large INSERT query which also performs
ALL the required checks in order to filter out "bad" rows. But I would
think that the query could get very complex very quickly. I regularly
deal with situations where the data can not only NOT be trusted to be
complete but where column names and formats are VERY different even for
"similar" information.

I find that I need to "backfill" a lot of data or perform a good number
of value conversions.

One example that comes to mind of using a single query resulted in code
that took 25+ minutes to run. In this situation I was dealing with
position data for securities.

For Equities, I trusted the CUSIP value they gave me but not the
symbol. So I decided to "verify" the symbol against what I knew it as
(i.e., backfill the column). But it was not enough to just do a left
join, because it was possible to have the CUSIP repeated in the lookup
table (the PK was CUSIP and Symbol). So my Symbol lookup logic
(remember, this was a single query) was to see if there was only one
row with the given CUSIP, then check to see if there was only one
active security row with the given CUSIP, THEN check to see if there
was only one North American active security with the given CUSIP,
otherwise use what the source gave us.

And that was only for equities... For options the logic is 100%
opposite... Talk about complex queries that SUCK in performance...
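For what it's worth, a sketch of that kind of prioritized lookup (the table
and column names are invented here; only the shape of the logic matters) is
to let COALESCE walk the preferences in order, where each scalar subquery
returns a symbol only when it is unambiguous:

SELECT p.CUSIP,
       COALESCE(
           -- exactly one row for this CUSIP
           (SELECT MIN(l.Symbol) FROM SecurityLookup AS l
            WHERE l.CUSIP = p.CUSIP
            HAVING COUNT(*) = 1),
           -- exactly one active row for this CUSIP
           (SELECT MIN(l.Symbol) FROM SecurityLookup AS l
            WHERE l.CUSIP = p.CUSIP AND l.IsActive = 1
            HAVING COUNT(*) = 1),
           -- exactly one active North American row for this CUSIP
           (SELECT MIN(l.Symbol) FROM SecurityLookup AS l
            WHERE l.CUSIP = p.CUSIP AND l.IsActive = 1 AND l.Region = 'NA'
            HAVING COUNT(*) = 1),
           -- otherwise trust what the source gave us
           p.Symbol) AS ResolvedSymbol
FROM PositionStaging AS p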

The reason why I suggested the multiple pass validation approach was to
be able to record or mark not only those rows that failed but WHY they
failed. (Assume that there is a [ValidationCode] and [ValidationReason]
column in the "sandbox" table.) I need to be able to create an
exception report so somebody can go back and "fix" the data.

--

I guess the bottom line is that I am probably trying to apply my many
years of application development approaches and opinions to database
development and the two do not mesh.

(JayCallas@.hotmail.com) writes:
> * I created several stored procedures that take parameters and inserts
> the data into the appropriate tables. This was done for easy access/use
> from client side apps (i.e. web-based).
> Proper development tactics says to try and do "code reuse". So, if I
> already have stored procs that do my logic, should I be writing a
> second way of handling the data? If I ever need to change the way the
> data is handled, I now have to make the same change in two (or more)
> places.

Undeniably, this is surely a common reason why people end up with
cursors. They already have a procedure that performs an operation on
a scalar set of data, and now they need to do it multiple times. We
have plenty of such examples in our database.

When you face this situation you have a tradeoff: speed up development
and run a cursor and take the performance cost. Or rewrite that
stored procedure so that it deals with data in a table. Typically
when we do this, we use a temp table for input or more often a process-
keyed table. (See http://www.sommarskog.se/share_data.html#usingtable for
a discussion on this.) You keep the old scalar procedure, but to not
duplicate logic, you make it a wrapper on the set-based procedure.
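A minimal sketch of that pattern (all names here are invented, and error
handling is omitted): the set-based procedure reads its input from a
process-keyed table, and the old scalar procedure becomes a thin wrapper
that stages one row and calls it.

-- Set-based procedure: works on whatever rows carry the given process key
CREATE PROCEDURE AddSecurities_Set @ProcessKey int
AS
    INSERT INTO SecurityMaster (Symbol, Name, SecurityType)
    SELECT Symbol, Name, SecurityType
    FROM SecurityInput
    WHERE ProcessKey = @ProcessKey
GO

-- Scalar wrapper: keeps the old one-row interface for the client apps
CREATE PROCEDURE AddSecurity @ProcessKey int,
                             @Symbol varchar(20),
                             @Name varchar(100),
                             @SecurityType varchar(10)
AS
    INSERT INTO SecurityInput (ProcessKey, Symbol, Name, SecurityType)
    VALUES (@ProcessKey, @Symbol, @Name, @SecurityType)

    EXEC AddSecurities_Set @ProcessKey

    DELETE FROM SecurityInput WHERE ProcessKey = @ProcessKey
GO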

Rewriting a scalar procedure into set-based is not a trivial task,
not least when you code with performance in mind. And not least
because, at least during a transitional period, there will be loops that
call the set-based procedure for one row at a time. There is certainly
an overhead to getting data from a table rather than from parameters.

So there is definitely a trade-off here. But if a loop today is the
best bet, it may not be tomorrow, because as the data grows in size,
performance becomes an issue.

I should add that I'm here talking of procedures where the logic is really
complex. As the others have said, code reuse is not as much a
virtue in SQL as it is in application code.

> * Different data from the same row needs to be inserted into multiple
> tables. "Common sense" (maybe "gut instinct" is better) says to handle
> each row as a "unit". Seems weird to process the entire set for one
> table, then to process the entire set AGAIN for another table, and then
> YET AGAIN for a third table, and so on.

I'm not sure that I understand this point. As I mentioned, I usually
have my input to my set-based procedures in a table, so I don't really
see the problem.

> * Exception handling. Set based processing means that if one row fails
> the entire set fails. Looping through allows you to fail a row but
> allow everything else to be processed properly. It also allows you to
> gather statistics. (How many failed, how many worked, how many were
> skipped, etc.)

Yes, this is a point that is often overlooked. Judging from some of
the other replies, some people appear to prefer to ignore this issue,
but depending on your business requirements and the likelihood of
errors, this may be a decisive reason to run a cursor.

For instance, I recently rewrote a stored procedure in our system
which makes a contract note definitive. A stock broker has thousands
of contract notes every day. Typically customer notes stay open all
day, but at the end of the day they should all be made definitive. In
conjunction with this there are a number of updates to perform. With
the old procedure, the notes were handled one by one, and if an occasional
note failed, that was no disaster. It could be sorted out the next day.

For the new procedure, I do perform some initial validations, and notes
that fail these validations will not be updated whereas the rest
will be. But from the point that I've started to perform updates, there
is not really an easy way out if there is some problem with a single note.
That means that the entire operation will fail. In our case, this should
only happen exceptionally, but if it happens the customer will be in dire
straits, not least because this will typically happen after hours.
(Please don't suggest that I should validate everything in advance.
We're not only talking constraints, but assertions in sub-procedures,
some of which are very complex.)

A possible fallback in this case, which I have not implemented but maybe
I should, would be that if there is an error, I would then start to take
the notes one by one.
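Such a fallback could be sketched like this with SQL 2005's TRY/CATCH (the
procedure and table names here are made up); the set-based attempt runs
first, and only if it fails does the code drop down to taking the notes
one at a time, logging the ones that still fail:

BEGIN TRY
    EXEC MakeNotesDefinitive_Set          -- all notes in one operation
END TRY
BEGIN CATCH
    -- Fall back to one note at a time, so a single bad note
    -- no longer blocks all the others.
    DECLARE @NoteID int
    DECLARE note_cur CURSOR LOCAL FAST_FORWARD FOR
        SELECT NoteID FROM OpenNotes
    OPEN note_cur
    FETCH NEXT FROM note_cur INTO @NoteID
    WHILE @@FETCH_STATUS = 0
    BEGIN
        BEGIN TRY
            EXEC MakeNoteDefinitive @NoteID
        END TRY
        BEGIN CATCH
            INSERT INTO NoteErrors (NoteID, ErrorMessage)
            VALUES (@NoteID, ERROR_MESSAGE())
        END CATCH
        FETCH NEXT FROM note_cur INTO @NoteID
    END
    CLOSE note_cur
    DEALLOCATE note_cur
END CATCH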

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx

(JayCallas@.hotmail.com) writes:
> So when I get a new security master file, the set-based approach says
> to insert all the new common security data into the SecurityMaster
> table and then go back and reprocess the new data to insert the
> option specific data. My dilemma is that I feel it is more "natural" to
> treat each row as a single entity.

That is a feeling that you should overcome. It's perfectly normal to
read the same row all over again. (Maybe one day we will get a multi-table
INSERT, but I am not holding my breath.)

> * As far as exception handling goes, why is it considered "normal" to
> want all the data or just some of it? What about the idea of getting
> "most" of the data in and then dealing with those "exceptions"? Is it
> considered "business logic" to say "I rather have the majority of the
> data as opposed to NONE of the data"?

Of course, sometimes that is the business rule: all or nothing. But
there are also lots of processes where it's OK that some data slips
through the cracks, as long as what is a unit is a unit. (You don't want
an option to be imported into SecurityMaster but then not get the
option-specific data in place. Then you would rather lose it.)
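Keeping such a unit together is just a matter of one transaction around
both inserts; a rough sketch (assuming the values arrive as parameters of
a hypothetical AddOption procedure):

SET XACT_ABORT ON       -- any runtime error rolls back everything below
BEGIN TRANSACTION

INSERT INTO SecurityMaster (Symbol, Name, SecurityType)
VALUES (@Symbol, @Name, 'Option')

INSERT INTO SecurityMasterDerivative (Symbol, StrikePrice, Class, Expiration)
VALUES (@Symbol, @StrikePrice, @Class, @Expiration)

COMMIT TRANSACTION      -- either both rows exist afterwards, or neither does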

In the case where you import data, validation errors may be anticipated,
and you could try to check for the most likely ones in advance. In the end,
what matters is of course whether the performance of the solution you
have is acceptable or not.

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx

Erland Sommarskog wrote:
>> * Different data from the same row needs to be inserted into multiple
>> tables. "Common sense" (maybe "gut instinct" is better) says to handle
>> each row as a "unit". Seems weird to process the entire set for one
>> table, then to process the entire set AGAIN for another table, and then
>> YET AGAIN for a third table, and so on.
> I'm not sure that I understand this point. As I mentioned, I usually
> have my input to my set-based procedures in a table, so I don't really
> see the problem.

I think what Jay is up to here is mapping an OO model to a relational
model. There is one "master" table which represents the root class of
his hierarchy and several tables for specific types. I've never worked
with user-defined types but maybe these are an option here as well.

Kind regards

robert

Robert Klemme (bob.news@.gmx.net) writes:
> Erland Sommarskog wrote:
>>> * Different data from the same row needs to be inserted into multiple
>>> tables. "Common sense" (maybe "gut instinct" is better) says to handle
>>> each row as a "unit". Seems weird to process the entire set for one
>>> table, then to process the entire set AGAIN for another table, and then
>>> YET AGAIN for a third table, and so on.
>>
>> I'm not sure that I understand this point. As I mentioned, I usually
>> have my input to my set-based procedures in a table, so I don't really
>> see the problem.
> I think what Jay is up to here is mapping of an OO model to a relational
> model. There is one "master" table which represents the root class of
> his hierarchy and several tables for specific types. I've never worked
> with user defined types but maybe these are an option here as well.

I suppose you are thinking of the CLR UDTs added to SQL 2005. No, they are
not an option here. CLR UDTs are for small values that always go together,
and where you rarely would search for the individual values. Personally, I
see CLR UDTs as a fairly marginal benefit.

As for the structure of Jay's tables, I have no problem understanding it,
as I work with securities myself, and different classes of securities
have different properties.

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx

Erland Sommarskog wrote:
> Robert Klemme (bob.news@.gmx.net) writes:
>> Erland Sommarskog wrote:
>>>> * Different data from the same row needs to be inserted into multiple
>>>> tables. "Common sense" (maybe "gut instinct" is better) says to handle
>>>> each row as a "unit". Seems weird to process the entire set for one
>>>> table, then to process the entire set AGAIN for another table, and then
>>>> YET AGAIN for a third table, and so on.
>>> I'm not sure that I understand this point. As I mentioned, I usually
>>> have my input to my set-based procedures in a table, so I don't really
>>> see the problem.
>> I think what Jay is up to here is mapping of an OO model to a relational
>> model. There is one "master" table which represents the root class of
>> his hierarchy and several tables for specific types. I've never worked
>> with user defined types but maybe these are an option here as well.
> I suppose you are thinking of the CLR UDTs added to SQL 2005. No, they are
> not an option here. CLR UDTs are for small values that always go together,
> and where you rarely would search for the individual values. Personally, I
> see CLR UDTs as a fairly marginal benefit.

Ah, good to know.

> As for the structure of Jay's tables, I have no problem to understand it,
> as I work with securities myself, and different classes of securities
> have different properties.

Then I probably misunderstood your posting. Sorry for the noise.

Kind regards

robert

JayCallas@.hotmail.com wrote:
> Thanks for replying. As you probably guessed, there are specific
> reasons why I am asking these questions. Maybe a bit more info would be
> helpful.
(snip)

You are describing some common data integration scenarios. SQL isn't
always an ideal integration tool. Certainly it is possible to do most
kinds of transformation and validation in SQL, especially if you build
a good framework and use a staging database to help you. However, it's
fair to say that SQL works best under the assumption that your data as
a whole will conform to a set of business rules enforced by
constraints. Validating data at row level can therefore be hard and
complex. This is a major reason why the market for enterprise
integration tools exists. Microsoft's offerings are DTS and Integration
Services, but there are a host of other solutions too.

--
David Portas, SQL Server MVP

Whenever possible please post enough code to reproduce your problem.
Including CREATE TABLE and INSERT statements usually helps.
State what version of SQL Server you are using and specify the content
of any error messages.

SQL Server Books Online:
http://msdn2.microsoft.com/library/...US,SQL.90).aspx

--

On 28 Mar 2006 13:50:45 -0800, JayCallas@.hotmail.com wrote:

>Thanks for replying. As you probably guessed, there are specific
>reasons why I am asking these questions. Maybe a bit more info would be
>helpful.
(snip)

Hi Jay,

David and Erland have already addressed many of your points. I'll skip
most of your message and only chime in where I feel that I have something
useful to add.

>* As far as exception handling goes, why is it considered "normal" to
>want all the data or just some of it? What about the idea of getting
>"most" of the data in and then dealing with those "exceptions"? Is it
>considered "business logic" to say "I rather have the majority of the
>data as opposed to NONE of the data"?

Because, in most cases, all the data that is being handled in the same
batch is in some way related. Like the classic example of a debit and a
credit booking in a double-entry bookkeeping system - you definitely
don't want one to fail and the other to succeed. The "all or nothing"
principle is the most common situation in databases.

There are exceptions, of course. Like yours. These examples are often found
in subsystems that deal with importing lots of (potentially corrupt)
data - again, like yours. David already said that a DB might not be the
ideal tool for this job. I recommend that you look into ETL tools. SQL
Server ships with DTS (for SQL Server 2000) or Integration Services
(SSIS - for SQL Server 2005), but there are many more available.

If you can't or won't use an ETL tool, then my recommendation would be
to import to a staging table, massage the data until you have a
completely valid set of data, then import into the main table. Most of
the massaging can (and should) be done with set-based operations - but
there might be exceptions where a cursor performs better; as I've already
indicated in my first reply, testing is the only way to find out.

>The reason why I suggested the multiple pass validation approach was to
>be able to record or mark not only those rows that failed but WHY they
>failed. (Assume that there is a [ValidationCode] and [ValidationReason]
>column in the "sandbox" table.) I need to be able to create an
>exception report so somebody can go back and "fix" the data.

Your proposed approach was, if I understood correctly, to just try and
insert the data and catch errors. You would then try to parse the error
and translate it to a valid ValidationCode and ValidationReason for your
application?

Using a set-based construct might be faster and is (IMO) definitely
easier:

UPDATE StagingTable
SET ValidationCode = 123,
ValidationReason = 'Negative amount'
WHERE Amount < 0

Or, for failures that would have a similar where clause:

UPDATE s
SET ValidationCode = CASE WHEN t.Symbol IS NULL
                          THEN 124
                          WHEN t.Market <> 'DJI'
                          THEN 125
                     END,
    ValidationReason = CASE WHEN t.Symbol IS NULL
                            THEN 'Unknown ticker symbol'
                            WHEN t.Market <> 'DJI'
                            THEN 'Not traded on Dow Jones'
                       END
FROM StagingTable AS s
LEFT JOIN TickerSymbols AS t      -- LEFT JOIN, so unknown symbols are not filtered out
       ON t.Symbol = s.Symbol
WHERE t.Symbol IS NULL
   OR t.Market <> 'DJI'

And then for the final import:

INSERT INTO RealTable (...)
SELECT ....
FROM StagingTable
WHERE ValidationCode IS NULL

--
Hugo Kornelis, SQL Server MVP

>> * Different data from the same row needs to be inserted into multiple tables. "Common sense" (maybe "gut instinct" is better) says to handle
each row as a "unit". Seems weird to process the entire set for one
table, then to process the entire set AGAIN for another table, and then
YET AGAIN for a third table, and so on. <<

You still think sequentially and procedurally. The SQL engine is free
to hold the data internally and parallelize the process within a single
transaction. It is not like you must open and close a file over and
over. In Standard SQL/PSM, the keywords are BEGIN ATOMIC .. END. What a
series of inserts must do is see that the order of insertion is
EFFECTIVELY preserved. That is a technical thing in "SQL-Speak" that
says if I can put data item X in tables T1, T2, ..Tn all at once, I can
do it in parallel. I might have to defer constraints, etc., but if the
effect is the same, sequential order is not required.

>> * Exception handling. Set based processing means that if one row fails the entire set fails. Looping through allows you to fail a row but allow everything else to be processed properly. It also allows you to gather statistics. (How many failed, how many worked, how many were skipped, etc.)<<

Look up the concept of SAVEPOINTs and transaction granularity. A
SAVEPOINT is a "mini-COMMIT"; when you hit an error, you can roll back
to the last savepoint, do something and try again.
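In T-SQL that looks roughly like this (the Deposits table is made up, and
the TRY/CATCH syntax assumes SQL 2005):

BEGIN TRANSACTION
SAVE TRANSACTION BeforeDeposit            -- the "mini-COMMIT" point

BEGIN TRY
    INSERT INTO Deposits (AccountID, Amount)
    VALUES (42, 100.00)
END TRY
BEGIN CATCH
    -- undo only the work done since the savepoint;
    -- the rest of the transaction stays alive
    ROLLBACK TRANSACTION BeforeDeposit
END CATCH

-- ... more work in the same transaction ...
COMMIT TRANSACTION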

The other method is to break the batch into the smallest possible
transaction granularity and run each one. The gimmick is that they
have to be separable. Example: making bank deposits to single
accounts.

Univac experimented with a concurrency control method they called
logical concurrency control. The idea was to look at the job queue, see
which statements were done on disjoint sets (posting deposits and
bowling league scores) and let them run at the same time without any
locks or further checking. I do not know what happened to the project.

>> The alternative is to create a temporary table (sandbox or workspace type thing), copy the data to there along with "status" or "validation" columns, run through the set many times over looking for any rows that may fail, marking them as such, and then at the end only dealing with those rows which "passed" the testing. Of course, in order for this to work you must know (and duplicate) all constraints so you
know what to look for in your testing. <<

This is the approach used for data scrubbing in Data Warehouses. It is
also the reason that we are seeing Business Rules engines that could be
used to generate code for the RDBMS, the DW, the front-end code, etc.,
with the certainty that they will all match.

I'm a little late to this discussion, hopefully still being monitored.
The ultimate goal is that the data supports the business needs, and
provides solutions for your customers.
For lack of a better word, let's use the term "business transaction." If
the customer wants to insert as much data from a business transaction
as is valid, then your code should do it.

For a complex example like yours, I can't help but believe you will
have to look at a given input row, and process and validate that across
all the tables and business rules that apply, and then implement the
changes.
Then, repeat for the rest of the input rows.

So you have cursors, and custom logic, and repeated validation across
many tables and business rules and foreign keys.
The primary goal is to solve the business needs. A secondary goal is
supportability. Perhaps you end up with a bunch of custom stored
procedures. That is ok, as long as they are documented, readable by
others, and supportable.

In your examples, the set-based answer really won't work. As an
example, there could be input Row Q and Row B which you processed and
inserted. Then, later you process Row T, which it turns out is prevented
from being inserted in part by the existence of Row Q and Row B. You
could have maybe figured it out ahead of time, but the odds say you are
better off not doing it that way. In general, I'd rather flunk data on
import than get bad data into my main database. It is easier to
find and understand data that flunked than it is to clean up a "mistake"
that has been in the database a few months.

The hardest part for me in projects like this is to come up with a
logical way to go about the big picture. Talking it over with
someone, and explaining it to someone is a great way to solidify your
thoughts, and make sure you are keeping the big picture in mind.
regards,
doug
