2012-07-21

Aftermath of vacation

During
my vacation
, our Business intelligence system was hit by two severe crashes, the first one
currency rate load failure
I have blogged about, but we also had failed to load project costs from SAP. Two days before my vacation I enhanced the
job iterator
functionality in my job scheduling system. While trying to debug the currency rate problem on the German Autobahn on my way to France, I got a call from the office telling me there is also a problem with project costs. From the info I got I guessed it must have something to do with job iterators, the costs where to high 17 times to be exact. And the job that failed had a job iterator with 17 rows. I could not for my life figure out what the problem was and I had done extensive testing of my changes, but if you finds bugs look at the last changes and 17 was also pointing at my job iterator, so still on the road I backed out my changes, but the cost problem remained.

Back at the office we found out the problem had started before I made the changes, it was another job that was the culprit and not all projects were affected. Since all updates are done with SQL we suspected unwanted duplicate rows in some involved table. During my absence two colleagues had analyzed the problem and they had isolated the problem to SAP table EKPO (purchasing document items). So we checked our ‘copy’ of the EKPO table and found no indexes. EBELN (Purchasing Document) and EBELP (PD item) should be primary index in our EKPO table. By the query
select count(*),EBELN, EBELP from EKPO group by EBELN, EBELP having count(*) > 1
, we found
SEVENTEEN
invalid rows with blank EBELN, EBELP (these rows were inserted into SAP a few days before my vacation). We removed the offending lines, inserted the primary index and reloaded the costs, now right again to the last cent.

This crash could have been avoided if we have had proper input checks and correct indexes on the database table but we do make mistakes, ‘those who do not do mistakes tend not to do anything at all’. And as always, shit happens at the worst possible time. Murphy rules.

One interesting observation, the execution time for the failing SQL increased from 2 minutes to over 8 hours with these 17 bad EKPO rows! I end this post with the failing SQL.