Is Tesseract currently the best open source OCR library? Best in terms of accura...

ianhawes · on Nov 5, 2023

Tesseract is the current best open source OCR library.

As for the “best” proprietary solution, there are a few worth mentioning:

- If you are looking for the best OCR to DOCX solution, ABBYY OCR SDK is the front runner. Their OCR engine is not as accurate as the others I’ll mention, but their output engine (i.e. capturing data beyond just the characters, such as bold, underline, or font name) is probably the best on the market.

- Google Document AI/Cloud Vision is probably the best all-around OCR. Which of the two flavors you pick depends on whether you want to handle scanned PDFs/images (Document AI) or general photos (Cloud Vision). I believe they also offer some level of training capability via Vertex, but I haven’t checked it out.

- IRIS OCR: meh.

- AWS Textract and Azure Vision are worth mentioning as contenders, but just like Google Document AI, they’re cloud-based, and that may factor into your decision.

- I haven’t tried docTR or PaddleOCR.

abdullahkhalids · on Nov 5, 2023

Thanks for the detailed answer.