Paper to database – third phase

When using OCR-scanned data as input, you run into dust problems, errors in the OCR interpretation and, as with AbleToExtract, some alignment problems regarding which column holds the data you are looking for. Dust should be removed, common errors in figures can be handled fairly easily, and floating columns create the need for an anchor.

So now we have picked up all rows in the correct order and have them in random columns, which is a good start. But what about the dust? It does occur, so why not remove it when it sits on its own in a cell?
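As a rough illustration of the idea, here is a minimal sketch in Python rather than in the spreadsheet macro language used in this project. It assumes that "dust" means a cell holding no letter or digit at all; the function name `remove_dust` and that definition are assumptions made for the example.

```python
def remove_dust(rows):
    """Return a copy of the rows where dust-only cells are blanked out.

    Assumption: a cell without a single letter or digit is dust.
    """
    return [
        [cell if any(ch.isalnum() for ch in cell) else "" for cell in row]
        for row in rows
    ]


# A made-up row: two specks of dust around real data.
rows = [[".", "AXX1234", " ' ", "3,5"]]
cleaned = remove_dust(rows)
print(cleaned)
```

Cells that hold real data are left untouched, so dust inside a figure still has to be handled separately.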

This is done by the sub “TaBortAlla”, Swedish for “RemoveThemAll”. Removing the dust from otherwise empty cells makes it easier to recognize empty cells.

One way of navigating a row is to count columns from the left. But sometimes new columns were inserted at a page break, so we went for a better approach. We found that columns seldom shifted in the middle, where we had the article numbers, i.e. neighbours stayed together. What we needed was an anchor point in each row, a place to feel comfortable around and to identify the row as a valid input row. In our case we were lucky: all article numbers started with “AXX” in one of the columns, so we believed it to be a good anchor. Some columns were to the left of it and some were to the right. So we scanned for “AXX”, which in most cases was successful, but it also appeared as “AXXC”, “AX0C”, “AX0(”, “A)(X”, “A)CX”… OCR is not easy! So far we have found this code useful for solving the anchor point problem; it is a bit of fuzzy logic, but it works…:
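One simple way to picture the anchor search is a lookup against the variants seen so far. The sketch below is in Python, not the macro code used in this project, and it is only one possible reading of the "fuzzy logic": the variant list is taken from the examples above, while the names `ANCHOR_VARIANTS` and `find_anchor` are made up for the illustration.

```python
# Variants taken from the examples in the text; extend the tuple as new ones appear.
ANCHOR_VARIANTS = ("AXX", "AX0C", "AX0(", "A)(X", "A)CX")


def find_anchor(row):
    """Return the index of the first cell starting with a known variant, else None."""
    for index, cell in enumerate(row):
        # str.startswith accepts a tuple and checks every variant in it.
        if cell.strip().upper().startswith(ANCHOR_VARIANTS):
            return index
    return None


# A made-up row where the anchor was misread as "A)(X".
row = ["7310", "", "A)(X4711", "2", "12,50"]
anchor = find_anchor(row)
print(anchor)
```

Everything else on the row can then be addressed relative to the returned index instead of by absolute column.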

We also added some lines of code to find new versions of “AXX”: rows with many columns where AXX was not found by the lines above are logged, so that we can pick up new alternatives as they arrive. So with these lines of code we pick them all up and have our anchor on a row:
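The hunt for new variants could look roughly like this in Python (again a sketch, not the macro code itself). The threshold of five filled cells, the names `has_anchor` and `unmatched_wide_rows`, and the misread “A><X” in the sample data are all assumptions for the example.

```python
# Known variants from the text; rows matching none of them are suspects.
ANCHOR_VARIANTS = ("AXX", "AX0C", "AX0(", "A)(X", "A)CX")


def has_anchor(row):
    """True if any cell on the row starts with a known anchor variant."""
    return any(cell.strip().upper().startswith(ANCHOR_VARIANTS) for cell in row)


def unmatched_wide_rows(rows, min_filled=5):
    """Return (row number, row) for rows with many filled cells but no anchor.

    min_filled is a hypothetical threshold for "many columns".
    """
    suspects = []
    for number, row in enumerate(rows, start=1):
        filled = sum(1 for cell in row if cell.strip())
        if filled >= min_filled and not has_anchor(row):
            suspects.append((number, row))
    return suspects


rows = [
    ["7310", "AXX1", "2", "12,50", "25,00"],   # anchor found
    ["7311", "A><X2", "1", "8,00", "8,00"],    # invented misread, not yet known
    ["Page 2", "", "", "", ""],                # too few cells to be a data row
]
suspects = unmatched_wide_rows(rows)
print(suspects)
```

Reading through the suspects by hand is what turns a new misreading into one more entry in the variant list.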

Sometimes our anchor is in F, sometimes in G and sometimes in … But we assume that the column to its right holds the quantity, then the amount per unit etc., and that the column to its left holds the wholesaler’s article number, etc. We seldom found that a column had changed its relative position to the anchor column, which would have made the logic a little trickier. All the data we were looking for on each row was supposed to be numeric. But what is numeric when working with OCR-scanned data? There are some common problems with numbers in OCR, so if we are almost certain that a column should contain only figures, we use the following code:
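A minimal Python sketch of such a clean-up follows; it is not the macro code itself. It removes dust, corrects a few typical misreadings and normalises the decimal separator to “,”. Which characters count as dust, and the O-for-0 replacement, are assumptions; `clean_number` is a made-up name.

```python
import re

# Assumed dust characters that may end up between the digits.
DUST = str.maketrans("", "", " '`\"*_")
# Typical misreadings: I or l for 1, and (an assumption) O or o for 0.
MISREADS = str.maketrans({"I": "1", "l": "1", "O": "0", "o": "0"})
# The decimal separator may come out as "." or ";"; we want the Swedish ",".
DECIMAL = str.maketrans({".": ",", ";": ","})
NUMERIC = re.compile(r"\d+(,\d+)?")


def clean_number(text):
    """Return (cleaned text, is_numeric) after the three replacement steps."""
    cleaned = text.translate(DUST).translate(MISREADS).translate(DECIMAL)
    return cleaned, NUMERIC.fullmatch(cleaned) is not None


value, ok = clean_number("1I'5;5O")
print(value, ok)
```

Only use this on columns that should hold nothing but figures, since the replacements would corrupt ordinary text. A value that still fails the check, such as one with a thousands separator, is better logged than guessed at.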

There are three types of replacement: first the dust among the figures, then common errors like I or l for 1 etc., and then the decimal separator, which could be any of “.;,”; since we use “,” in Swedish, that is what we want. If the value is still not numeric, the content is written to the log for the human eye to investigate.

For each valid row (i.e. one with an identified anchor column) we find, we produce an output row that will be used as input in the next step. If we suspect there are errors the code could not handle, we produce a message on a sheet we use for logging.

For each error found, the row where the problem was found is recorded. There are three problems reported in this sample, two in the first phase and one in the second, which we describe in the next part of this story. Each error also has a severity, indicating whether it is just informative or critical.
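A log of this kind can be pictured as a list of small records. The Python sketch below is only an illustration: the field names, the two severity labels “info” and “critical”, and the sample messages are assumptions, not the contents of the actual log sheet.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    row: int        # input row where the problem was found
    phase: int      # processing phase that reported it
    severity: str   # "info" or "critical" (assumed labels)
    message: str


def log_problem(log, row, phase, severity, message):
    """Append one problem to the log, rejecting unknown severities."""
    if severity not in ("info", "critical"):
        raise ValueError(f"unknown severity: {severity}")
    log.append(LogEntry(row, phase, severity, message))


def critical_rows(log):
    """Return the row numbers that need a human to look at them."""
    return [entry.row for entry in log if entry.severity == "critical"]


# Made-up entries for illustration only.
log = []
log_problem(log, 12, 1, "info", "dust removed inside a figure")
log_problem(log, 31, 1, "critical", "quantity is not numeric")
print(critical_rows(log))
```

Keeping the row number on every entry is what lets a person go straight from the log back to the scanned line.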