The J4L OCR tools are a set of components that can be used to add OCR
capabilities to Java applications. That means you can receive faxes or PDF files, or scan documents, and extract business information from them.
The three main components are:
- a Java wrapper for the Tesseract
OCR engine. Tesseract itself is delivered under the Apache
2.0 license, and we support only a version compiled for Windows.
- a PDF to text converter.
- a text document parser.
The document recognition process can therefore be divided into two steps:
- In the first step, the component takes an image file (tif, png, jpg, ...) or a PDF file and returns the
text contained in it. For image files, the Java wrapper performs this operation by using
Tesseract. Alternatively, you can use any other OCR engine. For PDF files,
you use our PDF to text converter instead.
- In the second step, your Java application needs to understand the text
returned by the OCR engine or PDF converter. This is done by the document parser. The
document parser takes as input a text string (the data) and an XML file that
describes the structure of the document. The output is a business document,
either as a Java object or as an XML file.
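
The two steps above can be illustrated with a minimal sketch. Note that this is not the J4L API: the class OcrPipelineSketch, the TextExtractor interface, the choose and parse methods, and the XML layout with "field" elements carrying "name" and "pattern" attributes are all hypothetical names invented for this example. The extractors are stubs that return fixed text where the Tesseract wrapper and the PDF to text converter would normally run, and the parser simply applies one regular expression per field using only standard JDK classes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class OcrPipelineSketch {

    /** Step 1 abstraction (hypothetical): turns an input file into plain text. */
    interface TextExtractor {
        String extractText(String path) throws Exception;
    }

    /** Picks the PDF converter for .pdf files and the OCR engine for everything else. */
    static TextExtractor choose(String path, TextExtractor ocr, TextExtractor pdf) {
        return path.toLowerCase(Locale.ROOT).endsWith(".pdf") ? pdf : ocr;
    }

    /**
     * Step 2 (hypothetical format): reads field definitions from the XML
     * description and applies each regular expression to the text.
     * The first capture group of each match becomes the field value.
     */
    static Map<String, String> parse(String text, String structureXml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(structureXml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        NodeList fields = root.getElementsByTagName("field");
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            Matcher m = Pattern.compile(field.getAttribute("pattern")).matcher(text);
            if (m.find()) {
                result.put(field.getAttribute("name"), m.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Stubs standing in for the Tesseract wrapper and the PDF to text converter.
        TextExtractor ocr = path -> "Invoice No: 12345\nTotal: 99.50";
        TextExtractor pdf = path -> "Invoice No: 67890\nTotal: 10.00";

        // Hypothetical structure description: one regex per business field.
        String xml = "<document>"
                + "<field name=\"invoiceNumber\" pattern=\"Invoice No:\\s*(\\d+)\"/>"
                + "<field name=\"total\" pattern=\"Total:\\s*([0-9.]+)\"/>"
                + "</document>";

        String path = "scan.tif";
        String text = choose(path, ocr, pdf).extractText(path); // step 1
        Map<String, String> document = parse(text, xml);        // step 2
        System.out.println(document);
    }
}
```

Keeping step 1 behind a small interface mirrors the point made above that the OCR engine is interchangeable: the parser in step 2 only sees a text string and does not care which engine or converter produced it.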