Helik: retrieving data from PDF documents

Reference or Study

Unlike approaches based on artificial intelligence, Helik retrieves data from machine-generated PDF documents with 100% accuracy.

It is not always necessary to use artificial intelligence

Helik was originally developed to load data from delivery notes and invoices into a receipt in the HELIOS Red accounting system. It relies on the assumption that an invoice or delivery note from a given supplier always has the same structure. We teach Helik to recognize that supplier's document once, and from then on it reads every further document of the same kind. The extracted data is stored in a file that the information system can import. Helik works on machine-generated PDF documents. It handles multi-page documents as well as items that span a varying number of lines.

This gave us a powerful tool that can also be used in other types of projects, wherever a specific item needs to be read from a PDF, for example to rename or sort files automatically.



While developing Helik, we solved a standard problem with PDF documents such as invoices and delivery notes

Extracting information from a PDF document (for example, a PDF invoice) is not entirely straightforward. From a computer science point of view, a PDF is closer to a picture than to structured text. At the lowest level, the text in a PDF is composed of blocks of letters called chunks. A chunk is the smallest unit of text you can work with; it contains a varying number of letters together with the coordinates that place it on the page. Chunks can of course be organized into higher-level formatted units such as paragraphs and blocks. The first problem is therefore to reconstruct words and lines from the PDF at all.

The PDF libraries used in Helik handle the grouping of chunks into words well (the word "chunk" may consist of the chunks "ch", "un" and "k", for example). The problem was obtaining the lines as they are laid out on the page. Unfortunately, this is where the library functions fail: they do not follow the position of words on the page, but also group text according to other information. This causes trouble when the invoiced data is processed as a "table". When the description of goods fits on a single line, everything works correctly:

Item description Price

However, when a single invoice item spans several lines, the library's algorithm breaks the ordering of the table block. Instead of the required ordering of lines by coordinates:

Description Price

the "smart" grouping of chunks into higher-level blocks produces:

Item Price

Thus, without a programmed line correction, it is not possible to select the correct chunks from an invoice based on a single sample line of the "table".
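
To illustrate what such a line correction can look like, here is a minimal Python sketch. It is not Helik's actual code: the `Word` class, the `rebuild_lines` function and the `y_tolerance` value are hypothetical, and the sketch assumes that a PDF library has already merged chunks into words and that `y` is measured from the top of the page. Words are assigned to lines purely by their coordinates, ignoring any higher-level blocks.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A word with its position on the page (hypothetical structure)."""
    text: str
    x: float  # left edge of the word
    y: float  # vertical position, assumed to grow from the top of the page


def rebuild_lines(words, y_tolerance=2.0):
    """Group words into visual lines using only their coordinates."""
    rows = []  # each entry: (y of the first word in the row, list of words)
    # Walk the words top to bottom, left to right.
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if rows and abs(word.y - rows[-1][0]) <= y_tolerance:
            # Close enough to the current row: same visual line.
            rows[-1][1].append(word)
        else:
            # Otherwise start a new visual line.
            rows.append((word.y, [word]))
    # Inside each row, order words by x and join them into text.
    return [
        " ".join(w.text for w in sorted(row, key=lambda w: w.x))
        for _, row in rows
    ]


if __name__ == "__main__":
    # The words arrive in "block" order: the whole description first,
    # the price last, as a block-grouping library might return them.
    words = [
        Word("Item", 50, 100), Word("Widget", 50, 120),
        Word("large,", 50, 132), Word("blue", 100, 132),
        Word("Price", 400, 100), Word("12.50", 400, 120),
    ]
    for line in rebuild_lines(words):
        print(line)
```

With the sample data, the price ends up on the same line as the first line of the description, and the second description line follows on its own. The tolerance would have to be tuned to the font size of the real documents.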

Once the lines are corrected, processing the invoice is quite simple: we search the text for the pattern of fields defined by the "learned" invoice line.
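
The search step can be sketched in the same spirit. The following Python example assumes, purely for illustration, that the "learned" line of one supplier consists of a description, a quantity and a price, and that the item table ends at a line starting with "Total". The pattern `ITEM_LINE`, the function `extract_items` and the end marker are hypothetical and not taken from Helik. Lines that do not match the pattern are treated as continuations of the previous item's description, which covers items that span several lines.

```python
import re
from decimal import Decimal

# Hypothetical "learned" pattern for one supplier:
# description, integer quantity, price with two decimals.
ITEM_LINE = re.compile(
    r"^(?P<description>.+?)\s+(?P<quantity>\d+)\s+(?P<price>\d+\.\d{2})$"
)


def extract_items(lines, end_marker="Total"):
    """Collect invoice items from corrected lines of text."""
    items = []
    for raw in lines:
        line = raw.strip()
        if line.startswith(end_marker):
            break  # assumed end of the item table
        match = ITEM_LINE.match(line)
        if match:
            items.append({
                "description": match.group("description"),
                "quantity": int(match.group("quantity")),
                "price": Decimal(match.group("price")),
            })
        elif items and line:
            # No match: treat as another line of the previous description.
            items[-1]["description"] += " " + line
        # Lines before the first item (e.g. the header) are ignored.
    return items


if __name__ == "__main__":
    lines = [
        "Description Qty Price",
        "Widget 2 12.50",
        "large, blue",
        "Gadget 10 3.00",
        "Total 55.00",
    ]
    for item in extract_items(lines):
        print(item)
```

For the sample lines this yields two items, with "large, blue" appended to the description of the first one. A real pattern would be derived from the sample line of each supplier rather than written by hand.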

