Who is Selling to Whom – Feature Evaluation for Multi-block Classification in Invoice Information Extraction

Notice

This publication belongs to the Faculty of Informatics, not the Faculty of Arts. The official publication page is on the muni.cz website.
Authors

HA Hien Thi, HORÁK Aleš

Year of publication 2021
Type Article in proceedings
Conference SPECOM 2021: 23rd International Conference on Speech and Computer
MU Faculty / Workplace

Faculty of Informatics

Citation
www https://link.springer.com/chapter/10.1007/978-3-030-87802-3_23
Doi http://dx.doi.org/10.1007/978-3-030-87802-3_23
Keywords OCR; Invoice; Block type classification; Seller; Buyer; Delivery address
Description Invoice information extraction aims to unify the automated processing of invoices, whether they arrive in structured form or as scanned images. Recognizing pieces of information whose value is identified by a keyword (such as the invoice date) is a relatively well-managed task. Identifying multi-block information, such as distinguishing the seller, the buyer, and the delivery address, is much more challenging because invoice layouts vary widely. In this work, we present a new feature extraction and classification technique that recognizes the seller, buyer, and delivery address text blocks in scanned invoices using a combination of complex layout features and annotated text features. The method considers not only the positional features of each block but also the relations between blocks and the block contents at a higher level. The technique is implemented as a module of the OCRMiner system. We offer a detailed evaluation and error analysis on a dataset of more than five hundred Czech invoices, reaching an overall macro-average F1-score of 94%.
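
The description reports its headline result as a macro-average F1-score over the block classes. As a reminder of what that metric measures, the following is a minimal sketch, not the OCRMiner evaluation code: the function name `macro_f1`, the label strings, and the toy predictions are all assumptions made up for illustration and are not data from the paper.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1: the unweighted mean of the per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical block labels and toy predictions (illustration only).
LABELS = ["seller", "buyer", "delivery_address"]
y_true = ["seller", "buyer", "delivery_address", "buyer", "seller", "delivery_address"]
y_pred = ["seller", "buyer", "buyer", "buyer", "seller", "delivery_address"]

score = macro_f1(y_true, y_pred, LABELS)
print(f"macro F1 = {score:.3f}")
```

Because every class contributes equally to the mean, a rarely occurring block type such as the delivery address affects the score as much as a frequent one, which is presumably why a macro average suits a multi-block task with uneven class frequencies.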
