I want to search for a word in an image (a scanned copy), retrieve values from the image, and highlight the word's location. Is there any API or library available for processing images? I am using Swing to display the images.
You need something to convert the pixels into characters. That something is a program that provides OCR.
Keep in mind that any program you use will provide its best approximation of what it thinks each character is. While the technology has improved a lot, there are many fonts, enough noise, and various other confounding factors that could result in false input (where the character is not what you would have deemed it to be). There are also scenarios where the input cannot be mapped to a character at all. Write your software defensively to handle both cases, as this should be considered "non-validated input".
Check out "tesseract". It isn't Java, put available for most platforms open-source, and you can call the command-line program from java via System.exec()
https://code.google.com/p/tesseract-ocr/
Given images in the correct format, its recognition rate is even better than that of many commercial OCR products.
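If you go the external-process route, a minimal sketch might look like this (the image and output file names are placeholders, and it assumes the tesseract binary is installed and on the PATH):

```java
import java.io.IOException;

// Minimal sketch: invoke the tesseract CLI from Java. "scan.png" and "out"
// are placeholder names; tesseract itself must be installed and on the PATH.
public class TesseractRunner {
    public static void main(String[] args) throws IOException, InterruptedException {
        // tesseract <image> <outputBase> writes the recognized text to <outputBase>.txt
        ProcessBuilder pb = new ProcessBuilder("tesseract", "scan.png", "out");
        pb.inheritIO(); // pass tesseract's console output through
        int exitCode = pb.start().waitFor();
        if (exitCode != 0) {
            System.err.println("tesseract failed with exit code " + exitCode);
        }
        // On success the recognized text is in out.txt
    }
}
```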
I am working on a handwritten form recognition system. So far I have been able to detect text using Java with OpenCV, but now I want to read the text from each of these bounding boxes.
I have been researching how to do this using Java with OpenCV, but I was unable to find anything.
Please suggest some links, technologies, methods, or processes for performing this particular task with Java.
This answer is more general than question-specific; I will try to stick as closely as possible to the problem statement.
Although there is a lot of ongoing research on recognition of handwritten text, there is no foolproof method that works for all possible problems.
The sample image you posted here is relatively noisy, with extremely high variance between instances of the same letter. This is exactly where it gets tricky.
I would personally suggest that once you have the bounding boxes around the text (which you already do), you run contour extraction inside all these bounding boxes in order to extract single letters. Once you have them, you need to figure out relevant features that can capture the maximum variance (or at least a 95% confidence interval) of each particular letter.
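For the contour-extraction step, a rough sketch with OpenCV's Java bindings might look like this (the file name is a placeholder, and Otsu thresholding is just one reasonable binarization choice):

```java
import java.util.ArrayList;
import java.util.List;
import org.opencv.core.*;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

public class LetterContours {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);

        // Placeholder file name; in practice this would be one bounding-box crop.
        Mat box = Imgcodecs.imread("box.png", Imgcodecs.IMREAD_GRAYSCALE);

        // Binarize so the letters stand out from the background.
        Mat binary = new Mat();
        Imgproc.threshold(box, binary, 0, 255,
                Imgproc.THRESH_BINARY_INV | Imgproc.THRESH_OTSU);

        // Each external contour is (ideally) one letter.
        List<MatOfPoint> contours = new ArrayList<>();
        Imgproc.findContours(binary, contours, new Mat(),
                Imgproc.RETR_EXTERNAL, Imgproc.CHAIN_APPROX_SIMPLE);

        for (MatOfPoint contour : contours) {
            Rect r = Imgproc.boundingRect(contour);
            Mat letter = new Mat(binary, r); // crop a single letter
            // ... extract features from 'letter' here
        }
    }
}
```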
With these features, you need to train a supervised algorithm, using the letters as training data and their corresponding characters as labels. Once you have that, give it some data (both the easiest and the most difficult cases) to analyze the accuracy.
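To make that training step concrete, here is a hedged sketch using Weka (see the first link below); the "letters.arff" file and the k-nearest-neighbour classifier are illustrative assumptions, not recommendations:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Assumes weka.jar on the classpath. "letters.arff" is a placeholder:
// one row per letter image, feature columns plus a nominal class
// attribute holding the actual character.
public class LetterClassifier {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("letters.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // k-nearest-neighbour is just one choice; any Weka classifier fits here.
        IBk classifier = new IBk(3);

        // 10-fold cross-validation to estimate accuracy before going further.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(classifier, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```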
These links can help you get started:
One of the first tools I use to check the accuracy of a feature set before I start coding: Weka
Go through basic tutorials on machine learning and how they work - Personal Favorite
You could try TensorFlow.
Simple Digit Recognition OCR in OpenCV-Python - Great for beginners.
Hope it helps!
I know there are already some questions like this, but each covers only part of the answer.
I just want to get the current decibel level "recorded" by the microphone.
I got as far as opening a TargetDataLine, but the read method only returns confusing bytes. :/
Could you tell me how I can read the decibels?
If you are interested in measuring, for example, dB SPL, this is not possible, at least not in the sense you probably mean. Here is one of several answers about using a computer mic to measure absolute sound intensity: How can I calculate audio dB level?
If you are confused about what the bytes mean and are interested in, for example, measuring change in volume/sound intensity/something like that over time, that is doable, but it's a different question. There are many questions about how to interpret the raw data that comes out of javasound and other audio apis here on SO, but a better source is a tutorial. One good place to start is with some of the examples and tutorials over at java sound resources. You might also be interested in my slides from a talk on the basics of computer audio.
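For the relative-measurement case, a minimal sketch of reading a TargetDataLine and printing an RMS level in dBFS (decibels relative to digital full scale, not calibrated dB SPL) might look like this, assuming 16-bit signed little-endian mono PCM:

```java
import javax.sound.sampled.*;

// Sketch: read raw samples from the microphone and print an RMS level in
// dBFS. Assumes 16-bit signed little-endian mono PCM; NOT calibrated dB SPL.
public class MicLevel {
    public static void main(String[] args) throws LineUnavailableException {
        AudioFormat format = new AudioFormat(44100f, 16, 1, true, false);
        TargetDataLine line = AudioSystem.getTargetDataLine(format);
        line.open(format);
        line.start();

        byte[] buffer = new byte[4096];
        while (true) {
            int n = line.read(buffer, 0, buffer.length);
            int samples = n / 2;
            if (samples == 0) continue;
            double sumSquares = 0;
            for (int i = 0; i < samples; i++) {
                // Reassemble each 16-bit little-endian sample from two bytes.
                int lo = buffer[2 * i] & 0xff;
                int hi = buffer[2 * i + 1]; // keeps the sign bit
                double sample = ((hi << 8) | lo) / 32768.0;
                sumSquares += sample * sample;
            }
            double rms = Math.sqrt(sumSquares / samples);
            double db = 20 * Math.log10(Math.max(rms, 1e-9)); // 0 dBFS = maximum
            System.out.printf("%.1f dBFS%n", db);
        }
    }
}
```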
I spent quite some time researching for a library that allows me to compare images to one another in Java.
I didn't really find anything useful; maybe my Google-search skills aren't good enough, so I thought I'd ask you guys if you could point me in the direction of something like this.
Basically what I want to do is to compare two images with each other and get a value of how much the two are similar. Like a percentage or so.
I hope you guys have something I can use, I wouldn't know how to write something like that myself...
PS: It doesn't necessarily have to be in Java; that's just the environment my app will be running in.
You could take a look at two answers on SO itself: this one is about image comparison itself, offering links to stuff in C++ (if I read correctly), while this one offers links to broader approaches, one being in C.
I would suggest starting with the second link, since there are links in that discussion that lead to implementation code for some relevant techniques, which you might be able to "translate" into Java yourself.
That's the best my google skills could do, no Java though - sorry. I hope it's a good starting point!
EDIT:
Here's someone with your problem who wrote his own comparison class in Java. I didn't read the source code though. He expressly states that he couldn't find Java libraries for that purpose either, so that's why he wrote it himself.
Oh, and this question on SO has probably the best links on this, all regarding Java libraries of image processing. Hopefully there's one amongst them that can compare images for similarity.
Ok, last edit:
The Java Image Processing Cookbook shows a Java implementation of a basic algorithm to determine the difference between two pictures. It also has an email to contact the guy who wrote it as well as a host of references. No library though.
EDIT after reading your comment to your question:
Since what you want seems to be checking whether two images are equal, unless you've already checked all of the above links, I would suggest starting with the Java Image Processing Cookbook (since it has a Java implementation of an algorithm to check for equal images) and the last link to an SO question. Also, check out PerceptualImageDiff and the source code of that project (C++); it sounds really nifty: it's apparently supposed to check whether two images look equal to the human visual system.
Just off the top of my head, OpenCV is a great image processing library, but it might be overkill if you just want to compare images. If that's the case, I'd go with ImageJ.
Someone already asked how to do this using OpenCV here.
I'd use C++ for this, but if you must use Java, there is a project which made a Java wrapper for OpenCV, here.
I used the class in this link to compare two product images, and the results were cool. It's not very hard to adapt it just for comparing two images; you just need to delete the JAI and Swing lines and such. It resizes images to 300x300 and returns a difference value such as "1234". The maximum difference value is near "11041", as stated in the link. With a simple division you can get the percentage. If you're interested, I can post the modified code here later.
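A rough sketch of that same resize-and-diff idea (not the linked class itself, just the general shape, normalized to a 0..1 score instead of the raw value):

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class ImageDiff {
    // Scale both images to a common size, as the linked class does (300x300 there).
    static BufferedImage resize(BufferedImage src, int w, int h) {
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        return out;
    }

    /** Mean per-pixel RGB difference: 0.0 (identical) .. 1.0 (maximally different). */
    public static double difference(BufferedImage a, BufferedImage b) {
        BufferedImage x = resize(a, 300, 300), y = resize(b, 300, 300);
        long total = 0;
        for (int j = 0; j < 300; j++) {
            for (int i = 0; i < 300; i++) {
                int p = x.getRGB(i, j), q = y.getRGB(i, j);
                total += Math.abs(((p >> 16) & 0xff) - ((q >> 16) & 0xff))
                       + Math.abs(((p >> 8) & 0xff) - ((q >> 8) & 0xff))
                       + Math.abs((p & 0xff) - (q & 0xff));
            }
        }
        return total / (300.0 * 300.0 * 3 * 255);
    }

    public static void main(String[] args) throws Exception {
        BufferedImage a = ImageIO.read(new File(args[0]));
        BufferedImage b = ImageIO.read(new File(args[1]));
        System.out.printf("%.1f%% different%n", 100 * difference(a, b));
    }
}
```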
The results were cool, but I still got "digital camera photos" detected as similar to "TV photos". So I used ImageJ to detect edges in the pictures. Using the detect-edges operation, ImageJ converts the image into an edge-detected greyscale image. Then I put the two edge-detected images through the same comparator and multiplied the two values. The results got even more accurate.
Getting the edge-detected form of the images
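A rough sketch of that step using the ImageJ API (file names are placeholders; ij.jar must be on the classpath, and "Find Edges" is ImageJ's built-in Sobel operation):

```java
import ij.IJ;
import ij.ImagePlus;

public class EdgeForm {
    public static void main(String[] args) {
        // Placeholder file names; requires ij.jar on the classpath.
        ImagePlus imp = IJ.openImage("product.png");
        IJ.run(imp, "8-bit", "");        // convert to greyscale first
        IJ.run(imp, "Find Edges", "");   // ImageJ's built-in Sobel edge detector
        IJ.save(imp, "product-edges.png");
    }
}
```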
I am seriously considering writing an Optical Character Recognition program. I am well versed in Java and would love to know about the libraries available out there. Basically, I want to convert something like the following to text. I will need to allow manual intervention to specify a pattern; for example, I would need to ask the user to mark an f in this text, so that I know where f occurs.
I am a complete newbie to this, so I don't mind learning from scratch either. I need guidance.
If you are thinking of coding an OCR program from scratch, reading up on techniques may be useful. I found an OCR Survey from 1996 which reviews some of the popular techniques from a decade and a half ago. Reading that might be helpful; track down papers it cites or papers which cite it.
Usually the process goes as follows (a toy sketch of the matching steps appears after the list):
find text
find characters in the text
extract features from the characters found
do pattern matching
report suspected character
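Here is a toy sketch of steps 3-5: features are reduced to plain double arrays (e.g. pixel densities per zone, left as a stub), and nearest-neighbour matching stands in for whatever pattern matcher you end up choosing:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy sketch: compare a character's feature vector against labeled
 *  examples and report the nearest match as the suspected character. */
public class Matcher {
    static class Example {
        final double[] features;
        final char label;
        Example(double[] features, char label) {
            this.features = features;
            this.label = label;
        }
    }

    private final List<Example> known = new ArrayList<>();

    void learn(double[] features, char label) {
        known.add(new Example(features, label));
    }

    /** Nearest-neighbour matching; assumes at least one example was learned. */
    char match(double[] features) {
        Example best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (Example e : known) {
            double d = 0;
            for (int i = 0; i < features.length; i++) {
                double diff = features[i] - e.features[i];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = e; }
        }
        return best.label;
    }
}
```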
While getting a user to annotate text is fun and exciting, finding a collection of handwriting which is already annotated might save you a lot of time; that way you can focus on the nuts and bolts of doing OCR rather than building your own database of annotated text.
To start with a slightly easier task you might want to consider building a system to detect handwritten digits. The USPS produced a corpus for developing systems to do this for zip code processing. The link was something I found with a quick search.
If you want to use/look at a library, you could try the Google-endorsed Tesseract.
I am working on a somewhat large corpus, with articles numbering in the tens of thousands. I am currently using PDFBox for extraction, with varying success, and I am looking for a way to programmatically check each file to see whether the extraction was moderately successful or not. I'm currently thinking of running a spellchecker on each of them, but the languages can differ, and I am not yet sure which languages I'm dealing with. Natural-language detection with scores may also be an idea.
Oh, and any method also has to play nice with Java, be fast, and be relatively quick to integrate.
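For context, the extraction step is presumably something like the standard PDFBox call (shown here with the 2.x API; the file name is a placeholder):

```java
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractText {
    public static void main(String[] args) throws Exception {
        // Load the PDF and pull out its text layer (PDFBox 2.x API).
        try (PDDocument doc = PDDocument.load(new File("article.pdf"))) {
            String text = new PDFTextStripper().getText(doc);
            System.out.println(text);
        }
    }
}
```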
Try a spell checker that learns automatically. That's not as scary as it sounds: start with a big dictionary containing all the words you're likely to encounter. It can include words from several languages.
When scanning a PDF, allow for a certain number of unknown words (say 5%). If any of these words are repeated often enough (say 5 times), add them to the dictionary. If the PDF contains more than 5% unknown words, it's very likely something that couldn't be processed.
The scanner will learn over time, allowing you to reduce the number of unknown words if that should be necessary. If that is too much hassle, a very big dictionary should work well, too.
If you don't have a dictionary, manually process a couple of documents and have the scanner learn. After a dozen files or so, your new dictionary should be large enough to give a reasonable baseline.
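A minimal sketch of that idea (the thresholds follow the numbers above; the tokenization is deliberately crude):

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Self-learning dictionary check: flags extractions with too many
 *  unknown words, and learns unknown words that keep recurring. */
public class ExtractionChecker {
    private final Set<String> dictionary = new HashSet<>();
    private final Map<String, Integer> unknownCounts = new HashMap<>();

    public ExtractionChecker(Collection<String> seedWords) {
        dictionary.addAll(seedWords);
    }

    /** Returns true if the extracted text looks plausible. */
    public boolean looksValid(String text) {
        int total = 0, unknown = 0;
        for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
            if (token.isEmpty()) continue;
            total++;
            if (dictionary.contains(token)) continue;
            unknown++;
            // Unknown words that keep recurring are probably real: learn them.
            if (unknownCounts.merge(token, 1, Integer::sum) >= 5) {
                dictionary.add(token);
            }
        }
        // More than 5% unknown words suggests the extraction failed.
        return total > 0 && (double) unknown / total <= 0.05;
    }
}
```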
Of course no method will be perfect.
There are usually two classes of text-extraction problems:
1 - Nothing gets extracted.
This can be because you've got a scanned document or because something is invalid in the PDF.
These are usually easy to detect; you should not need complicated code to check for them.
2 - You get garbage.
Most of the time this is because the PDF file is weirdly encoded.
This can be due to a homemade encoding that is not properly declared, or because the PDF author needed characters not recognized by PDF (for example, the Turkish S with cedilla was missing for some time from the Adobe glyph list: you could not create a correctly encoded file containing it, so you had to cheat to get it to appear visually on the page).
I use an n-gram-based method to detect the languages of PDF files based on the extracted text (with different technologies, but the idea is the same). Files where the language is not recognized are usually good suspects for a problem...
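As a toy illustration of the n-gram idea (a real detector would use proper per-language profiles; this just measures trigram overlap against a known-good text sample and flags files that score low against every language):

```java
import java.util.HashMap;
import java.util.Map;

/** Toy n-gram profile: build a trigram set from a known-good sample per
 *  language, then score extracted text against each profile. Low scores
 *  across all profiles suggest an extraction failure. */
public class TrigramProfile {
    private final Map<String, Integer> counts = new HashMap<>();

    public TrigramProfile(String sampleText) {
        String s = sampleText.toLowerCase().replaceAll("\\s+", " ");
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
    }

    /** Fraction of the text's trigrams that also occur in this profile. */
    public double overlap(String text) {
        String s = text.toLowerCase().replaceAll("\\s+", " ");
        int hits = 0, total = 0;
        for (int i = 0; i + 3 <= s.length(); i++, total++) {
            if (counts.containsKey(s.substring(i, i + 3))) hits++;
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }
}
```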
As for spellchecking, I suspect it will give you tons of false positives, especially if you have multiple languages!
You could just run the corpus against a list of stop words (the most frequent words, which search engines ignore, like "and" and "the"), but then you obviously need stop-word lists for all possible/probable languages first.
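A minimal sketch of that check (the stop-word lists here are tiny examples only; real lists would be much longer, with one per probable language):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Garbage extractions contain almost no stop words, so a very low
 *  stop-word ratio is a cheap signal that extraction failed. */
public class StopWordCheck {
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "the", "and", "of", "to", "in",      // English
            "der", "die", "und", "das", "ist",   // German
            "le", "la", "et", "les", "des"));    // French

    /** Fraction of tokens that are known stop words. */
    public static double stopWordRatio(String text) {
        int hits = 0, total = 0;
        for (String t : text.toLowerCase().split("[^\\p{L}]+")) {
            if (t.isEmpty()) continue;
            total++;
            if (STOP_WORDS.contains(t)) hits++;
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }
}
```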