Extracting data from PDF forms
Recently I was asked if I could create a program to extract data from scanned forms saved as a PDF and write it out to a CSV data file for loading back into the customer's database. The information was printed on an A5 form (see the example below). The forms were then used to contact the customer's clients. Where a marketing call was successful, the form was marked and placed in a pile of successful calls. These forms were then scanned to a PDF document.
This is a very basic example of the form layout on an A5 page. (It is only a crude mock-up; the real form occupied most of the A5 page and held a lot more information.)
The main item of interest was the client number, but the phone number was also extracted as a check that the client number was correct.
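As a rough illustration of the output side, here is a minimal Python sketch that writes one CSV row per form, holding the client number and the phone number used as a check. The column names and the `write_results` helper are my own assumptions for the example; the real file layout depended on the customer's database.

```python
import csv
import io


def write_results(rows, out):
    """Write (client_no, phone) pairs to an open text stream as CSV."""
    writer = csv.writer(out)
    # Hypothetical column names; the real import format is not known.
    writer.writerow(["client_no", "phone"])
    for client_no, phone in rows:
        writer.writerow([client_no, phone])


# Made-up sample values, written to memory instead of a file on disk.
buffer = io.StringIO()
write_results([("104532", "5550100")], buffer)
print(buffer.getvalue())
```

When writing to a real file, open it with `newline=""` so the `csv` module controls the line endings itself.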
My first idea was to save the PDF to a text file and then have the custom program extract the data from the text file based on key words on the form. To save a PDF as a text file it needs to be an indexed (searchable) PDF, and this turned out to be just a setting on the scanner. NB: a non-indexed PDF contains only an image of the text and therefore cannot be searched for words or symbols.
Next I crafted a program to search the text from the indexed PDF for the static key words that appear just before the information I wanted to extract. This worked, but the extraction rate was very low, about 30 to 50%. On investigation I found this was due primarily to the form having lines or boxes drawn around the data.
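The keyword idea can be sketched in a few lines of Python (used here purely for illustration, not the language of the original program). The label text, the sample page text and the `find_client_number` helper are all assumptions made up for the example.

```python
import re

# Assumed label wording; the real form may print it differently.
CLIENT_LABEL = "Client No"


def find_client_number(text, label=CLIENT_LABEL):
    """Return the digits that directly follow the label, or None."""
    # Allow only punctuation or whitespace between the label and the digits.
    pattern = re.compile(re.escape(label) + r"\W*(\d+)")
    match = pattern.search(text)
    return match.group(1) if match else None


# Text read in the expected order: the number follows its label.
good_page = "Name: J Smith\nClient No: 104532\nPhone: 5550100"
# Text read in a different order: another label sits in between.
bad_page = "Client No\nPhone\n104532\n5550100"

print(find_client_number(good_page))
print(find_client_number(bad_page))
```

The second call finds nothing, because the number no longer sits directly after its label. That is the kind of failure that kept the extraction rate so low.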
When the documents were scanned, the alignment shifted slightly on many of the pages, i.e. horizontal lines varied from 170 to 190 degrees in the PDF. This affected the way the text was positioned in the file. Generally the form would be read column by column, with each column being output from the top row to the bottom. However, when the form was out of alignment it was read row by row, with each row being split by column. This made the “Client No” text in the text file unreliable as a tag for finding the client number.
There are some scanners on the market that will auto-align the page, and this helps reduce OCR errors. However, this was not an option in this case, as the customer had already purchased a scanner for the job.
After looking for other ideas and solutions on the Internet, I found mention of using the position of the text in the PDF to find the data. Using a Delphi PDF library from Gnostice I was able to process the PDF file directly and retrieve the text objects together with the coordinates of where each one was located. Each block of text on the page was returned with the coordinates of its location, i.e. the top, right, bottom and left coordinates in pixels.
Changing the custom program to extract the text elements with their coordinates allowed me to analyse the position of the text I needed. I was then able to calculate the minimum and maximum variation in the placement of each field. I then changed the program to extract the data from anywhere within this boxed area on the form. This resulted in successful extraction of 99% of the text required. I also needed to refine the program to correct some numbers that were still wrong, i.e. in some cases digits were interpreted as other characters; for example, 1 was saved as /, L or i. After adjusting the program to detect and correct these errors I was able to achieve 99% accuracy in extracting the client ID numbers from the PDF file.
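Here is a minimal Python sketch of the two steps just described: picking text by position, then repairing misread digits. It is not the Gnostice API. `TextElement`, `Region`, the coordinate values and the helper names are all hypothetical, and it assumes the origin is at the top left so that top is smaller than bottom. The substitutions for /, L and i come from the cases above; the entries for l, I, O and o are my own guesses at similar misreads.

```python
from dataclasses import dataclass


@dataclass
class TextElement:
    text: str
    left: float
    top: float
    right: float
    bottom: float


@dataclass
class Region:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, el):
        """True when the element lies entirely inside this region."""
        return (self.left <= el.left and el.right <= self.right
                and self.top <= el.top and el.bottom <= self.bottom)


def region_from_samples(samples, margin=0.0):
    """Smallest box enclosing every sample element, grown by margin."""
    return Region(
        left=min(s.left for s in samples) - margin,
        top=min(s.top for s in samples) - margin,
        right=max(s.right for s in samples) + margin,
        bottom=max(s.bottom for s in samples) + margin,
    )


# "/", "L" and "i" are from the article; the other entries are assumed.
OCR_DIGIT_FIXES = str.maketrans(
    {"/": "1", "L": "1", "i": "1", "l": "1", "I": "1", "O": "0", "o": "0"}
)


def clean_number(raw):
    """Apply the substitutions, then keep only the digits."""
    fixed = raw.translate(OCR_DIGIT_FIXES)
    digits = "".join(ch for ch in fixed if ch.isdigit())
    return digits or None


def extract_field(elements, region):
    """Return the cleaned text of the first element inside the region."""
    for el in elements:
        if region.contains(el):
            return clean_number(el.text)
    return None


# Made-up positions of the client number on two known-good scans.
samples = [
    TextElement("104532", 300, 40, 360, 52),
    TextElement("104777", 308, 46, 368, 58),
]
region = region_from_samples(samples, margin=2)

# A new page where the number was misread by the OCR step.
page = [
    TextElement("Client No", 220, 44, 290, 56),
    TextElement("L0453/", 304, 43, 364, 55),
]
print(extract_field(page, region))
```

Blind substitution is only safe for fields known to be purely numeric. Applied to a name or address field it would corrupt valid letters.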
Notes and things that affect reading the data:
- Use OCR-friendly fonts if possible. If you print the forms yourself, I suggest Verdana or Arial.
- If you can, buy a scanner that auto-aligns pages, e.g. the Brother SCB-ADS2600W.
- Any handwriting, ink marks or dirt on the form causes problems when reading the data.