Mike Ault ran foul of a misunderstanding over the functionality of the EXCEPTIONS INTO claues of the constraint validation SQL command, and writes about it on his blog (skip the stuff about the generation of the test data — not valid to this issue).
The documented functionality of this clause is that it inserts into an exceptions table a record of all the rows in a table which prevent a constraint from being validated. The example here is of a unique constraint, but most of the same principles apply to other constraint types.
The problem that Mike has is that he is attempting to use this clause to identify the rows in a table that must be deleted in order to allow a unique constraint to be validated. Fortunately or unfortunately — and I shall later explain why I believe this to be fortunate — the clause will log all occurances of values which are duplicated in the source table — it wil not exclude rows which represent the original occurance of the value that is duplicated.
So why do i believe this to be fortunate? Because I believe that it is the only rational way that the functionality could possibly work.
I don’t think that I’m misrepresenting Mike when I say that his desire that the clause should not log all of the violations is based on the notion that there is an “ordinal value” involved — that there is a single row of which the others are duplicates. Now this is true in the case of a constraint violation caused by the addition of a duplicate value in a table wherein the constraint is already enabled — the ordinal row would be the one that has already passed the validation.
This is not the case here though. To the database all of the existing, non-validated rows are equivalent and the code has no insight into which of the multiple rows that contain a duplicated value represents the ordinal value in the eyes of the user. And nor should it do so, because for the code to make an assumption concerning which of the rows was the first to be inserted (in the absence of any application knowledge concerning sequence numbers or timestamps or whatever) would represent a violation of one of Codd’s rules — I forget which one. That would be a pretty serious issue, IMHO.
Now suppose that the EXCEPTIONS INTO clause did work as Mike desired. Then I would submit that it is of almost no use whatsoever, because even if the rationale by which some dupes were not tagged was fully documented then if it did not agree with your own deduplicating intentions then it is of no value to you — you’d still have to go and hand-code your own methodology.
I also believe that the faults that Mike believes this functionality has are exaggerated by the case at hand. If you took a one million row table out of which EXCEPTIONS INTO logged 100 as … erm … exceptions then that would probably appear to be reasonable — after all to identify the 99, or 48, or 86 rows out of one hundred that must be deleted from a one million row table to allow the constraint to be validated does not sound unreasonable at all, and some value has been added by use of the clause. In this case though, having run this functionality Mike is literally no closer to resolving the issue than he was beforehand (other than having learned something about the number of rows that are duplicates of another, of course). Even in the 100-out-of-1,000,000 example you still have to make your own choice and write your own code to deduplicate the original data set, so there is no difference in the amount of work to be done, just in the perceived value of the EXCEPTIONS INTO clause in the two cases.
There is a point on which I do agree, however, where Mike quotes the documentation … “All rows that violate a constraint must be either updated or deleted from the table containing the constraint.” and adds his own comment “Which is incorrect”. Well it is actually only partially incorrect. It is correct for what I would term “intra-row” constraints such as NOT NULL, check, or foreign keys which are based on the values of a single row in the table. For “inter-row” constraints such as unique and primary keys it is not correct, and follwoing as it does an example of an intra-row violation it is poorly phrased. It would probably behoove Mike to submit this as a documentation bug.
Other than that issue, which I’d suggest is a minor one, the documentation of the clause seems to accurately reflect it’s functionality, and its functionality seems to be in line with the only possible way in which it could make sense.
Comments welcome, as always.
Edit: TK has his own thoughts in the comments on Mike’s blog and also on his own blog here.
Edit: Dumb spelling errors corrected — thanks PN