
OCR and extracting text using AWS Textract

We are developing an app that scans documents and extracts key information, which is then used to update the relevant user data. We evaluated several technologies, including AWS Textract, Azure Computer Vision, Google Lens, and the open-source Tesseract. All are feature-rich and each has its strengths, but our documents are multipage and heavy on tabular and form data. Because of that volume of structured data, we decided to go with Textract.

Textract uses OCR to automatically detect printed text, handwriting, and numbers. Every extracted element is returned with its location in the document, given as a bounding box and a polygon of coordinates. Textract can also detect key-value pairs and their context, which makes it easy to import the extracted data, and it preserves the row and column structure of data stored in tables. This is helpful for documents that are composed of structured data, such as financial reports and medical records, because the extracted data can then be loaded using a predefined database schema.

Amazon Textract also uses ML to understand the context of invoices and receipts, automatically extracting relevant data such as vendor, invoice number, price, total amount, and payment terms. It does the same for identity documents such as passports and driver's licenses, without the need for templates or configuration.

Textract can extract data with high confidence whether the text is free-form or embedded in tables, and it returns a confidence score with each extracted item so you can make informed decisions about how to use the results. It is directly integrated with Amazon Augmented AI (A2I), so you can implement a human review process when the confidence score is low.
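To make the key-value and confidence-score ideas concrete, here is a minimal sketch of how a response could be turned into form fields. It assumes the response is the JSON returned by a Textract analysis with the FORMS feature enabled, already loaded into a Python dict. The function names and the threshold of 90 are our own illustrative choices, not Textract settings.

```python
def _child_text(block, blocks_by_id):
    """Join the text of the WORD blocks that are children of `block`."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def extract_key_values(response, min_confidence=90.0):
    """Split form fields into accepted ones and ones needing human review.

    `min_confidence` is a hypothetical threshold on Textract's 0-100 score.
    A field's score is taken as the lower of its key and value scores.
    """
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    accepted, needs_review = {}, {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        key_text = _child_text(block, blocks_by_id)
        value_text = ""
        confidence = block.get("Confidence", 0.0)
        # A KEY block points at its VALUE block through a VALUE relationship.
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_block = blocks_by_id[value_id]
                    value_text = _child_text(value_block, blocks_by_id)
                    confidence = min(confidence,
                                     value_block.get("Confidence", 0.0))
        target = accepted if confidence >= min_confidence else needs_review
        target[key_text] = value_text
    return accepted, needs_review
```

Fields that land in the second dict are the ones you would likely route to a human reviewer, for example through A2I.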

In our UI/UX flow, the user scans a document with the device camera, and the file is uploaded to an S3 bucket. A Lambda function is then invoked to call the AWS Textract API. Behind the scenes, Textract processes the document and produces a very long JSON response that describes the contents, their location in the document, and metadata. The structured data in that JSON, such as tables and form fields, is also exported to a CSV file. When the job completes, Textract publishes a notification through Amazon SNS that triggers our callback function, which stores the extracted structured data. We then invoke another service that runs the data against our matching model, extracts what we need, and updates the database.
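The first half of this flow can be sketched roughly as follows. This is an illustration under assumptions, not our production code: it assumes the Lambda function is triggered by an S3 upload event, and the helper name and the two environment variable names are hypothetical. The Textract client is passed in as a parameter so the logic can be exercised without AWS access.

```python
import os
from urllib.parse import unquote_plus


def start_analysis_for_upload(event, textract_client, sns_topic_arn, role_arn):
    """Start one asynchronous Textract job per object in an S3 event."""
    job_ids = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = textract_client.start_document_analysis(
            DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
            FeatureTypes=["TABLES", "FORMS"],
            # Textract publishes the completion status to this SNS topic,
            # using the given IAM role.
            NotificationChannel={
                "SNSTopicArn": sns_topic_arn,
                "RoleArn": role_arn,
            },
        )
        job_ids.append(response["JobId"])
    return job_ids


def lambda_handler(event, context):
    # boto3 is imported here because it ships with the Lambda Python runtime.
    import boto3

    return start_analysis_for_upload(
        event,
        boto3.client("textract"),
        os.environ["TEXTRACT_SNS_TOPIC_ARN"],  # hypothetical variable name
        os.environ["TEXTRACT_ROLE_ARN"],       # hypothetical variable name
    )
```

The returned job IDs are what a callback would later use to fetch the results once the completion notification arrives.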

Textract supports both synchronous and asynchronous calls. The synchronous operations are designed for small, mostly single-page documents and return near real-time responses. However, we had to use the asynchronous operations because most of our documents span multiple pages. The main drawback of asynchronous processing is that it can take several minutes, which hurts the user experience. Splitting a document into single pages and processing each one with a synchronous call is possible, but that route adds a lot of overhead.
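One practical consequence of the asynchronous route is that results are fetched separately, and for long documents they arrive in several pages. The sketch below shows how a callback might gather them, assuming it already has the job ID from the completion notification. The function name is our own, and the client is again passed in as a parameter (in real use it would be `boto3.client("textract")`).

```python
def collect_blocks(textract_client, job_id):
    """Return every block of a finished job, following NextToken pagination."""
    blocks = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        page = textract_client.get_document_analysis(**kwargs)
        # Anything other than SUCCEEDED (for example IN_PROGRESS or FAILED)
        # is treated as an error in this simplified sketch.
        if page["JobStatus"] != "SUCCEEDED":
            raise RuntimeError(
                f"Textract job {job_id} is {page['JobStatus']}"
            )
        blocks.extend(page.get("Blocks", []))
        next_token = page.get("NextToken")
        if not next_token:
            return blocks
```

A real callback would probably handle the in-progress and partial-success states more gracefully; raising an exception keeps the example short.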

Contact us if you need help adding OCR to your application
