How to improve Tesseract OCR accuracy? [duplicate] - java

I have a PDF containing a scanned document, and I need to read some parts of it. I already had this working with Google Cloud OCR, but I just noticed it might not be adequate because I'll exceed the monthly quota (1k requests/month), so I'm switching to Tesseract instead.
The project is done in Windows and Java, but currently I'm doing some tests using linux.
I am not uploading my original images, as I am not sure whether they contain sensitive information; instead I'm using some images from the internet which are VERY similar.
I have read that I can help Tesseract produce better results by preprocessing the original image (using TextCleaner?). I would like to know how to do that in a Windows/Java environment and, most importantly, how to successfully remove the dark background of the table and, if possible, the horizontal and vertical table lines, as they don't help at all during OCR.

Yes, you are right, you can clean the image to get a better recognition, see https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality .

You can use ImageMagick to sharpen the image and increase its resolution. Tesseract works better on high-resolution images. If you are using Python (I think you aren't), Pillow (PIL, the Python Imaging Library) works great for enhancing image quality.
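If you would rather stay inside Java than shell out to ImageMagick, a comparable sharpening step can be sketched with the JDK's own ConvolveOp. This is only a rough stand-in, not what ImageMagick does internally: the class name SharpenSketch and the 3x3 kernel are my own choices for illustration, and you would likely need to tune the kernel for your scans.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.ConvolveOp;
import java.awt.image.Kernel;

// Hypothetical helper, not part of Tesseract or ImageMagick.
final class SharpenSketch {

    // Applies a basic 3x3 sharpening kernel using only JDK classes.
    static BufferedImage sharpen(BufferedImage src) {
        // Copy into a plain RGB image so the convolution sees a predictable pixel layout.
        BufferedImage rgb = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = rgb.createGraphics();
        try {
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }

        // The kernel weights sum to 1, so flat regions keep their brightness
        // while edges (such as character strokes) are emphasized.
        float[] weights = {
             0f, -1f,  0f,
            -1f,  5f, -1f,
             0f, -1f,  0f
        };
        ConvolveOp op = new ConvolveOp(
                new Kernel(3, 3, weights), ConvolveOp.EDGE_NO_OP, null);
        return op.filter(rgb, null);
    }
}
```

Sharpening amplifies noise as well as text edges, so it is worth comparing OCR output with and without this step on a few sample pages.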

My textcleaner script will not help much with this image. It won't remove the dark background, especially since it is textured. For other images with large regions of nearly constant color, it can make that background white. But it runs only on Unix-like systems and not with Java, so on Windows you would need to use Windows 10's built-in Unix or install Cygwin.
Here is one example from http://www.fmwconcepts.com/imagemagick/textcleaner/index.php
Input:
textcleaner -g -e stretch -f 25 -o 10 -s 1 twinkle.jpg twinkle_g_stretch_f25_o10_s1.jpg

Text recognition depends on a variety of factors, but above all on the quality of the input image. This is why every OCR engine provides guidelines on the quality and size of the input image; following them helps the engine produce accurate results.
This is where image preprocessing comes into play: it improves the quality of the input image so that the OCR engine gives you more accurate output.
I have written a detailed article on image processing in Python. Follow the link below for more explanation.
https://medium.com/cashify-engineering/improve-accuracy-of-ocr-using-image-preprocessing-8df29ec3a033
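Since the linked article is in Python and the question is about Java, here is a minimal sketch of one common preprocessing step, converting to grayscale and binarizing, using only the JDK. I picked Otsu's global threshold for illustration; I am not claiming it is what the article uses. The class name BinarizeSketch is made up, and a global threshold may well struggle with a textured dark background like the one in the question, where an adaptive (local) threshold could work better.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper: grayscale conversion plus Otsu's global threshold.
final class BinarizeSketch {

    // Integer approximation of the usual luma weights (0.299, 0.587, 0.114).
    static int luminance(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (299 * r + 587 * g + 114 * b) / 1000;
    }

    // Returns the threshold that maximizes the between-class variance.
    static int otsuThreshold(int[] histogram, int totalPixels) {
        long sumAll = 0;
        for (int i = 0; i < 256; i++) {
            sumAll += (long) i * histogram[i];
        }
        long sumDark = 0;
        int countDark = 0;
        double bestVariance = -1.0;
        int bestThreshold = 0;
        for (int t = 0; t < 256; t++) {
            countDark += histogram[t];
            sumDark += (long) t * histogram[t];
            if (countDark == 0) continue;        // no pixels at or below t yet
            int countLight = totalPixels - countDark;
            if (countLight == 0) break;          // every pixel is at or below t
            double meanDark = (double) sumDark / countDark;
            double meanLight = (double) (sumAll - sumDark) / countLight;
            double diff = meanDark - meanLight;
            double variance = (double) countDark * countLight * diff * diff;
            if (variance > bestVariance) {
                bestVariance = variance;
                bestThreshold = t;
            }
        }
        return bestThreshold;
    }

    // Pixels brighter than the threshold become white, all others black.
    static BufferedImage binarize(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        int[] gray = new int[w * h];
        int[] histogram = new int[256];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int lum = luminance(src.getRGB(x, y));
                gray[y * w + x] = lum;
                histogram[lum]++;
            }
        }
        int threshold = otsuThreshold(histogram, w * h);
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                out.setRGB(x, y, gray[y * w + x] > threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return out;
    }
}
```

If the text is light on a dark background, you would invert the result before handing it to the OCR engine, since Tesseract's own guidance favors dark text on a light background.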

Related

Libdmtx vs ZXing for DataMatrix Decoding?

How reliable is ZXing's barcode localization for DataMatrix decoding compared to libdmtx?
I have a set of PNG image files of stickers (proprietary, so unfortunately I'm not able to share them) containing DataMatrix barcodes. These stickers sit on flat surfaces, have very nice quiet zones and are generally centered in the image, but suffer from uneven lighting and slight dust, which is likely the largest obstacle to reliable decoding.
I'd like to use a modifiable Java library to decode them, and it seems that ZXing is the only open-source option (I'm open to other suggestions). However, when I run these images through the ZXing online decoder, I consistently get NO BARCODE FOUND, even on the cleanest images. In contrast, when I run the same images through proprietary online decoders, like Inlite's Free Online Barcode Reader, I get reliable decodes for all the images. My company has implemented a library in C that also reliably decodes the barcode images by processing them and calling libdmtx. Similarly, this online DataMatrix decoder built on libdmtx can also reliably read my image files.
Is the barcode localization in ZXing significantly inferior to libdmtx?
If I attempt the same preprocessing on the image files before I run them through ZXing, could I achieve similar results? I have a strong preference for a Java library (ZXing), but I may have no choice but to use libdmtx. Would appreciate any insight, thanks!
I had a similar problem to yours, but on the encoding side. According to my findings, ZXing is certainly inferior to libdmtx. We use both libraries in-house, in C++ and Java projects.
There is a case where ZXing breaks while generating a barcode; see my comments here:
https://github.com/zxing/zxing/issues/624
Libdmtx, however, works flawlessly. The other free options you have in the Java world (these are for encoding) are:
barcode4j
OkapiBarcode
Another alternative is the relatively new ZXing cpp port here: https://github.com/nu-book/zxing-cpp.
It contains a completely new DataMatrix detector that was meant to fix serious limitations of the Java upstream version. It was specifically designed to deal with low-resolution images (module sizes as low as around 2 pixels) and symbols that have just the required 1-module quiet zone and a busy background.
The following comparison is certainly not 'fair', but I just had the dmtxread utility of libdmtx try my test set of images: it missed 3 of 17 samples and took a whopping 300 times as long as my code :).

Speed optimization of Website with a lot of images

I am currently working on a website which involves a lot of images. The problem is that all the images are uploaded by users, so I can't do anything to alter them. The website runs quite OK on my local system, but on the server it becomes far too slow.
I'd suggest you to use Timthumb. It creates a thumbnail by generating a URL on the fly and uses very minimal disk space.
If the users of your website are uploading the images, then I presume there must be an upload script. Inside that script, or directly after it runs, you could compress or rescale each image to the size needed on the website, which shortens loading time. PHP exposes ImageMagick through the Imagick extension, documented here:
http://php.net/manual/en/book.imagick.php
There is the PHP GD image processing library here:
http://php.net/manual/en/book.image.php
I don't have much personal experience with them, but from my knowledge it looks like one will do the job. Off the top of my head, that's the best solution I can think of, and hopefully it works. There is not a lot you can change about your problem if you don't compress/scale the images, and these are probably your best options. Wish you the best.

Java OCR for images with complicated backgrounds

I'm trying to get some text from images which look like this:
This example would actually be the best case scenario as most of them would have a colored and more complex background instead.
I don't need it to be 100% accurate since I know the possible outcomes and could try to do a partial match with them.
I tried Aspose OCR and Tess4j. Aspose gives me random characters and Tess4j gives nothing.
Is this doable with a free library?
Tesseract seems to be the best free library for this purpose.
I know some projects using Tesseract preprocess the images they OCR, for example by changing contrast, rotation, or resolution. They then OCR the same image multiple times with different preprocessing settings and compare the results.
More information here
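Because the asker already knows the set of possible outcomes, the "compare the results" step can be sketched as picking the known outcome closest to any of the OCR outputs. The sketch below uses Levenshtein edit distance, which is my choice for illustration and not something the projects mentioned above necessarily use; the class and method names are hypothetical, and the OCR calls themselves are left out.

```java
import java.util.List;

// Hypothetical helper: maps noisy OCR outputs onto a list of known outcomes.
final class ClosestMatchSketch {

    // Classic Levenshtein distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    // Returns the known outcome with the smallest distance to any OCR output,
    // or null if either list is empty.
    static String bestMatch(List<String> ocrOutputs, List<String> knownOutcomes) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String output : ocrOutputs) {
            String normalized = output.trim().toLowerCase();
            for (String known : knownOutcomes) {
                int d = editDistance(normalized, known.toLowerCase());
                if (d < bestDistance) {
                    bestDistance = d;
                    best = known;
                }
            }
        }
        return best;
    }
}
```

In practice you would probably also reject matches whose distance is too large relative to the word length, so that garbage output is not forced onto a valid outcome.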

Java JAI - creating 1 BIG jpg image from many smaller ones

Before you all say 'that's already on here somewhere...' :-)
PLEASE let me say I have looked and not found a simple example of using JAI to tile multiple JPGs and save the result to disk without Java heap errors or other memory problems.
I can't find a complete working set of code anywhere; the examples all seem to be miswritten, unchecked, or simply do not work...
Help me Some-BiWan Kenobi - you're my only hope!
How big are the images? You might be able to just increase the memory allocated to the JVM. If they are huge, then JPG might not be the right format, because you need to load the whole image into memory to compress it and write it out. You might have better luck writing a tiled TIFF, using JAI.
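To judge whether raising the heap is realistic, it can help to estimate the uncompressed size of the combined image first. The following is a back-of-the-envelope sketch that assumes all tiles have the same size and are laid out on a regular grid; the class and method names are made up for illustration.

```java
// Hypothetical helper: rough memory estimate for a mosaic of equally sized tiles.
final class MosaicMemorySketch {

    // Uncompressed size in bytes of the combined raster (for example 3 bytes per pixel for RGB).
    static long uncompressedBytes(int tileWidth, int tileHeight,
                                  int cols, int rows, int bytesPerPixel) {
        long width = (long) tileWidth * cols;   // long arithmetic avoids int overflow
        long height = (long) tileHeight * rows;
        return width * height * bytesPerPixel;
    }

    // True if the raster alone is smaller than the maximum heap this JVM may use.
    // Real usage needs extra headroom for the source tiles and the encoder.
    static boolean fitsInHeap(long bytes) {
        return bytes < Runtime.getRuntime().maxMemory();
    }
}
```

For example, a 10 x 10 grid of 4000 x 3000 tiles at 3 bytes per pixel comes to 3,600,000,000 bytes, which is also more than a single Java byte array can hold (arrays are indexed by int), so an image of that size has to be processed in tiles regardless of the heap setting.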
I asked a similar question here: Write swing component to large TIFF image using JAI

Using Java to capture an area of the screen and identify text found there

This question may be beyond the scope of a simple answer here at stack overflow, but my hope is that it will lead me to be able to formulate several more specific questions to get where I need to be.
I want to write a program that searches a buffered image for text and returns it as a string. I don't want to write an entire OCR program, but would rather use an API that is freely available such as tesseract. Unfortunately I've been unable to find a Java API for tesseract.
I know that the font is Arial and I know its size. I am wondering if that will help.
I've already managed to capture the screen, but I'm not sure how to accomplish the next step of identifying the text found in the image.
the question
How can I implement a simple OCR function into my java program?
You can use Tesjeract or Tess4J, both wrappers around the Tesseract API. Be sure to rescale your images to 300 DPI, since a screenshot's resolution (72 or 96 DPI) is generally not adequate for OCR.
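The rescaling step can be sketched with plain JDK classes. This assumes you know (or guess) the DPI of the captured screen; the class name DpiRescaleSketch and the choice of bicubic interpolation are mine, not something the wrappers require.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper: upscales a screenshot so its effective resolution matches a target DPI.
final class DpiRescaleSketch {

    static BufferedImage rescale(BufferedImage src, int sourceDpi, int targetDpi) {
        double factor = (double) targetDpi / sourceDpi;   // e.g. 300 / 96 = 3.125
        int w = (int) Math.round(src.getWidth() * factor);
        int h = (int) Math.round(src.getHeight() * factor);

        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        try {
            // Bicubic interpolation tends to keep glyph edges smoother than nearest-neighbor.
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return out;
    }
}
```

The resulting BufferedImage can then be handed to whichever wrapper you choose; with Tess4J, as far as I recall, that is the doOCR method that accepts a BufferedImage, but check the version you are using.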
The OCR implementation is complicated, but using an SDK like http://asprise.com/product/ocr/index.php?lang=java is simple.
