Where do I start for Text Pattern Recognition - Java Based

I am seriously considering writing an Optical Character Recognition program. I am well versed in Java and would love to know about the libraries available. Basically, I want to convert something like the following to text. I will need manual intervention to specify a pattern. For example, I would ask the user to mark "f" in this text, so that I know where "f" occurs.
I am entirely new to this, so I don't mind learning from scratch as well. I need guidance.

If you are thinking of coding an OCR program from scratch, reading up on techniques may be useful. I found an OCR Survey from 1996 which reviews some of the popular techniques from a decade and a half ago. Reading that might be helpful; track down papers it cites or papers which cite it.
Usually the process goes as follows:
1. find text
2. find characters in the text
3. extract features from the characters found
4. do pattern matching
5. report suspected character
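The last steps of the process above (features, pattern matching, reporting) can be sketched in plain Java. This is only a toy under stated assumptions: the characters are already cut out into small, equally sized binary grids, the "feature" is just the raw pixel grid, and matching is nearest template by Hamming distance. The class name `TemplateMatcher`, the `#`/`.` grid encoding and the templates are all made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy classifier: nearest template by Hamming distance (illustrative only). */
final class TemplateMatcher {
    private final Map<Character, String[]> templates = new LinkedHashMap<>();

    /** Register a labelled template; rows use '#' for ink and '.' for background. */
    void addTemplate(char label, String... rows) {
        templates.put(label, rows);
    }

    /** Number of differing pixels between two grids of identical size. */
    static int hammingDistance(String[] a, String[] b) {
        int distance = 0;
        for (int r = 0; r < a.length; r++) {
            for (int c = 0; c < a[r].length(); c++) {
                if (a[r].charAt(c) != b[r].charAt(c)) distance++;
            }
        }
        return distance;
    }

    /** Pattern matching: report the label of the closest template, or '?' if none exist. */
    char classify(String... glyph) {
        char best = '?';
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<Character, String[]> e : templates.entrySet()) {
            int d = hammingDistance(glyph, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

A real system would replace the raw grid with sturdier features and the hand-made templates with an annotated corpus, but the overall shape (extract, compare, report the best label) stays the same.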
While getting a user to annotate text is fun and exciting, finding a collection of handwriting which is already annotated might save you a lot of time. That way you can focus on the nuts and bolts of doing OCR rather than building your own database of annotated text.
To start with a slightly easier task you might want to consider building a system to detect handwritten digits. The USPS produced a corpus for developing systems to do this for zip code processing. The link was something I found with a quick search.

If you want to use/look at a library, you could try the Google-endorsed Tesseract.

Related

Scanning texts for specific words

I want to create an algorithm that searches job descriptions for given words (like Java, Angular, Docker, etc). My algorithm works, but it is rather naive. For example, it cannot detect the word Java if it is contained in another word (such as JavaEE). When I check for substrings, I have the problem that, for example, Java is recognized in the word JavaScript, which I want to avoid. I could of course make an explicit case distinction here, but I'm more looking for a general solution.
Are there any particular techniques or approaches that try to solve this problem?
Unfortunately, I don't have the amount of data necessary for data-driven approaches like machine learning.

Train a simple word2vec language model with your whole job description text data. Then use your own logic to find the keywords. When you find a match, if it's not an exact match use your similar words list.
For example, if you're searching for Java but also find JavaScript, use your word vectors to check whether there is any similarity between them (in other words, whether they have ever been used in a similar context). Java and JavaEE have probably been used in the same sentence before, but Java and JavaScript, or Angular and Angularentwicklung, have not.
It may seem a bit like over-engineering, but it's not :).
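The "similarity" between two word vectors is commonly measured as cosine similarity. A minimal sketch, assuming you already have the vectors as `double[]` arrays from whatever word2vec implementation you trained (the training itself is not shown, and the class name is made up):

```java
final class CosineSimilarity {
    /** Cosine of the angle between two equal-length, non-zero vectors; 1.0 means same direction. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

You would then accept a non-exact match only if the score is above some threshold that you pick by looking at your own data.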

I spent some time researching my problem, and I found that identifying certain words, even if they don't match 1:1, is not a trivial problem. You could solve the problem by listing synonyms for the words you are looking for, or you could build a rule-based named entity recognition service. But that is both error-prone and maintenance-intensive.
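For reference, the synonym-list approach can be sketched like this. It matches whole tokens rather than substrings, so JavaEE counts for Java only because it is listed explicitly, and JavaScript does not count because it is not listed. The class name and the alias table are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class KeywordScanner {
    // Canonical keyword -> lower-cased spellings that should count as a hit.
    private final Map<String, Set<String>> aliases = new LinkedHashMap<>();

    void addKeyword(String canonical, String... spellings) {
        Set<String> lower = new HashSet<>();
        for (String s : spellings) lower.add(s.toLowerCase(Locale.ROOT));
        aliases.put(canonical, lower);
    }

    /** Returns the canonical keywords found in the text, in registration order. */
    List<String> scan(String text) {
        // Tokenise on anything that is not a letter, digit, '+' or '#' (keeps "c++" and "c#").
        Set<String> tokens = new HashSet<>(Arrays.asList(
                text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}+#]+")));
        List<String> found = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : aliases.entrySet()) {
            for (String spelling : e.getValue()) {
                if (tokens.contains(spelling)) {
                    found.add(e.getKey());
                    break;
                }
            }
        }
        return found;
    }
}
```

The maintenance burden is visible right away: every new spelling (for example a compound like Angularentwicklung) has to be added by hand.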
Probably the best way to solve my problem is to build a named entity recognition service using machine learning. I am currently watching a video series that looks very promising for the given problem. --> https://www.youtube.com/playlist?list=PL2VXyKi-KpYs1bSnT8bfMFyGS-wMcjesM
I will comment on this answer when I am done with my work to give feedback to those who are facing the same problem.

How to Read Text From Bounding Box using Java With OpenCV

I am working on a handwritten form recognition system. So far I have been able to detect text using Java with OpenCV, but now I want to read the text from each of these bounding boxes.
I have been researching how to do this using Java with OpenCV, but I was unable to find anything.
Please suggest some links, technologies, methods or processes for performing this task with Java.

This answer is more general than question specific. I will try to stick as much as possible with the problem statement.
Although there is a lot of ongoing research on recognition of handwritten text, there is no foolproof method which works for all possible problems.
The sample image you posted here is relatively noisy, with extremely high variance between different instances of the same letter. This is exactly where it gets tricky.
I would personally suggest that once you have the bounding boxes around the text (which you already do), you run contour extraction inside each of these bounding boxes in order to extract single letters. Once you have them, you need to figure out the relevant feature(s) that can represent the maximum variance (or at least a 95% confidence interval) of the particular letter.
With these features, you need to train a supervised algorithm, with the letter images as training data and the characters they actually represent as labels. Once you have that, give it some data (both the easiest and the most difficult cases) to analyze the accuracy.
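To show what "extract single letters inside a bounding box" amounts to, here is a sketch that deliberately does not use OpenCV. It swaps contour extraction for a plain flood fill over connected ink pixels, so that it runs with nothing but the JDK. It assumes the box has already been binarised into a non-empty `boolean[][]` (true = ink); the class name is made up.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class GlyphSegmenter {
    /** Returns one {minRow, minCol, maxRow, maxCol} box per 4-connected blob of ink. */
    static List<int[]> segment(boolean[][] ink) {
        boolean[][] seen = new boolean[ink.length][ink[0].length];
        List<int[]> boxes = new ArrayList<>();
        for (int r = 0; r < ink.length; r++) {
            for (int c = 0; c < ink[0].length; c++) {
                if (ink[r][c] && !seen[r][c]) boxes.add(floodFill(ink, seen, r, c));
            }
        }
        return boxes;
    }

    /** Visits every ink pixel reachable from the start pixel and tracks the enclosing box. */
    private static int[] floodFill(boolean[][] ink, boolean[][] seen, int startR, int startC) {
        int[] box = {startR, startC, startR, startC};
        int[][] steps = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};
        Deque<int[]> stack = new ArrayDeque<>();
        stack.push(new int[] {startR, startC});
        seen[startR][startC] = true;
        while (!stack.isEmpty()) {
            int[] p = stack.pop();
            box[0] = Math.min(box[0], p[0]);
            box[1] = Math.min(box[1], p[1]);
            box[2] = Math.max(box[2], p[0]);
            box[3] = Math.max(box[3], p[1]);
            for (int[] s : steps) {
                int nr = p[0] + s[0], nc = p[1] + s[1];
                if (nr >= 0 && nr < ink.length && nc >= 0 && nc < ink[0].length
                        && ink[nr][nc] && !seen[nr][nc]) {
                    seen[nr][nc] = true;
                    stack.push(new int[] {nr, nc});
                }
            }
        }
        return box;
    }
}
```

Note that letters made of several strokes (an "i" with its dot, for example) come out as several blobs, and touching letters come out as one, which is part of why handwriting is hard.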
These links can help you for a start :
One of my first tools to check the accuracy with the set of features I use before I start coding: Weka
Go through basic tutorials on machine learning and how they work - Personal Favorite
You could try TensorFlow.
Simple Digit Recognition OCR in OpenCV-Python - Great for beginners.
Hope it helps!

In Java how to say speak/read instead of println/system.in

Just curious to know what it would take to give my Java programs human capabilities. Currently, to display a message I use System.out.println, and to read the user's input I may use something like System.in. I am wondering if there is a way for me to say System.out.speak() and System.hear();
If this is not possible with Java, I'm okay with learning other languages. Please help.

Wondering if there is a way for me to say System.out.speak() and System.hear();
Literally, no.
System.out is a PrintStream and there is no speak() method.
There is no System.hear() method.
Adding such methods would entail hacking on standard system classes ... making the resulting library "NOT Java(tm)".
Furthermore, there are no standard APIs in the Java libraries for text to speech or speech to text. (And I'm not aware of any other language that offers this functionality as a standard feature.)
However, I'm sure that if you looked hard enough you could find 3rd-party tools for doing this that could be integrated with Java, one way or another.
UPDATE
In fact, you have found the standard Android (as distinct from Java) APIs for this:
Speech recognition: android.speech
Text to speech: android.speech.tts
From a design perspective, I think it would be a better idea to support this kind of thing in the OS's user interface framework (where the user can control it), and not embed it in individual applications.

So it sounds like this is what you want:
"System.out.speak()" -- as you know by now, that's not a real thing. I think I could propose a high-level, temporary solution.
It sounds like you just want to be audibly notified when you reach a certain part in your code. Perhaps you could just record a wav or mp3 of yourself saying whatever it is you want to hear as an alert, and then import the wav/mp3 into your project directory. Refer to this article to figure out how to play back that audio:
Playing .mp3 and .wav in Java?
You could simply make a static method that takes in a string naming the desired recording and then plays it using whatever approach the link above suggests.
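A rough sketch of such a static method, using the `javax.sound.sampled` classes that ship with the JDK. Assumptions: the recordings are WAV files (Java Sound does not decode MP3 out of the box), and they live in a `sounds` directory as `<name>.wav`. That layout and the names `AudioAlert`, `fileFor` and `speak` are made up for illustration.

```java
import java.io.File;
import java.io.IOException;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.Clip;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.UnsupportedAudioFileException;

final class AudioAlert {
    /** Maps an alert name to its recording (hypothetical convention: sounds/<name>.wav). */
    static File fileFor(String name) {
        return new File("sounds", name + ".wav");
    }

    /** Plays the named recording and blocks until it has roughly finished. */
    static void speak(String name) throws IOException, UnsupportedAudioFileException,
            LineUnavailableException, InterruptedException {
        try (AudioInputStream in = AudioSystem.getAudioInputStream(fileFor(name));
             Clip clip = AudioSystem.getClip()) {
            clip.open(in);
            clip.start();
            // Playback is asynchronous, so wait for the length of the clip before closing it.
            Thread.sleep(clip.getMicrosecondLength() / 1000);
        }
    }
}
```

With that in place, a call like `AudioAlert.speak("done")` could replace a println at the point in the code you want to be told about.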
If you want it to take in a string, and then have some sort of computer voice (e.g. Microsoft Sam) speak that string, that's a lot more complicated. I have no idea how to do that haha. But I'm guessing it's not as hard as your idea of "System.in.hear()"
"System.in.hear()" -- This is definitely not a thing. This requires knowledge in the field of Speech-To-Text (STT). This is basically how Siri or Google Now parses what you say to them. I'm sure there are libraries you could find that do this, but I'm too lazy to look for you :(
I hope this helps a little bit. I'm doing a little bit of research right now on STT and I saw your question pop up. I'm not very knowledgeable in the area, but I hope you figure out a way to get audio feedback instead of having to put println's everywhere. You should figure that out and reuse it.
Happy programming!

Handwritten character (English letters, kanji,etc.) analysis and correction

I would like to know how practical it would be to create a program which takes handwritten characters in some form, analyzes them, and offers corrections to the user. The inspiration for this idea is to have elementary school students in other countries or University students in America learn how to write in languages such as Japanese or Chinese where there are a lot of characters and even the slightest mistake can make a big difference.
I am unsure how the program will analyze the character. My current idea is to get a single pixel width line to represent the stroke, compare how far each pixel is from the corresponding pixel in the example character loaded from a database, and output which area needs the most work. Endpoints will also be useful to know. I would also like to tell the user if their character could be interpreted as another character similar to the one they wanted to write.
I imagine I will need a library of some sort to complete this project in any sort of timely manner, but I have been unable to locate one which meets the standards I will need for the program. I looked into OpenCV, but it appears to be meant more for computer vision than for image processing. I would also prefer the library/module to be in Python or Java, but I can learn a new language if absolutely necessary.
Thank you for any help in this project.

Character Recognition is usually implemented using Artificial Neural Networks (ANNs). It is not a straightforward task to implement seeing that there are usually lots of ways in which different people write the same character.
The good thing about neural networks is that they can be trained. So, to change from one language to another all you need to change are the weights between the neurons, and leave your network intact. Neural networks are also able to generalize to a certain extent, so they are usually able to cope with minor variances of the same letter.
Tesseract is an open source OCR engine, originally developed at Hewlett-Packard between 1985 and 1994 and open-sourced in 2005. You might want to read about it to gain some pointers.

You can follow company links from this Wikipedia article:
http://en.wikipedia.org/wiki/Intelligent_character_recognition
I would not recommend that you attempt to implement a solution yourself, especially if you want to complete the task in less than a year or two of full-time work. It would be unfortunate if an incomplete solution provided poor guidance for students.
A word of caution: some companies that offer commercial ICR libraries may not wish to support you and/or may not provide a quote. That's their right. However, if you do not feel comfortable working with a particular vendor, either ask for a different sales contact and/or try a different vendor first.
My current idea is to get a single pixel width line to represent the stroke, compare how far each pixel is from the corresponding pixel in the example character loaded from a database, and output which area needs the most work.
The initial step of getting a stroke representation only a single pixel wide is much more difficult than you might guess. Although there are simple algorithms (e.g. Stentiford and Zhang-Suen) to perform thinning, stroke crossings and rough edges present serious problems. This is a classic (and unsolved) problem. Thinning works much of the time, but when it fails, it can fail miserably.
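To give a feel for what such a thinning pass looks like, here is a compact sketch of Zhang-Suen thinning as it is usually described: two alternating sub-passes, each removing boundary pixels that have 2 to 6 ink neighbours and exactly one background-to-ink transition around them. It assumes a non-empty binary image stored as `boolean[][]` (true = ink); the class and method names are made up, and pixels outside the image are treated as background.

```java
import java.util.ArrayList;
import java.util.List;

final class ZhangSuen {
    // Neighbour offsets P2..P9, clockwise, starting with the pixel directly above.
    private static final int[][] N =
            {{-1, 0}, {-1, 1}, {0, 1}, {1, 1}, {1, 0}, {1, -1}, {0, -1}, {-1, -1}};

    /** Thins the image in place until neither sub-pass removes a pixel. */
    static void thin(boolean[][] img) {
        boolean changed;
        do {
            changed = pass(img, 0) | pass(img, 1); // non-short-circuit: always run both
        } while (changed);
    }

    private static boolean pass(boolean[][] img, int step) {
        List<int[]> toClear = new ArrayList<>();
        for (int r = 0; r < img.length; r++) {
            for (int c = 0; c < img[0].length; c++) {
                if (!img[r][c]) continue;
                boolean[] p = new boolean[8];
                int count = 0;
                for (int i = 0; i < 8; i++) {
                    p[i] = at(img, r + N[i][0], c + N[i][1]);
                    if (p[i]) count++;
                }
                int transitions = 0; // background-to-ink changes going once around the pixel
                for (int i = 0; i < 8; i++) {
                    if (!p[i] && p[(i + 1) % 8]) transitions++;
                }
                // p[0] = north, p[2] = east, p[4] = south, p[6] = west
                boolean side = step == 0
                        ? !(p[0] && p[2] && p[4]) && !(p[2] && p[4] && p[6])
                        : !(p[0] && p[2] && p[6]) && !(p[0] && p[4] && p[6]);
                if (count >= 2 && count <= 6 && transitions == 1 && side) {
                    toClear.add(new int[] {r, c});
                }
            }
        }
        // Deletions are applied together after the scan, as the algorithm requires.
        for (int[] rc : toClear) img[rc[0]][rc[1]] = false;
        return !toClear.isEmpty();
    }

    private static boolean at(boolean[][] img, int r, int c) {
        return r >= 0 && r < img.length && c >= 0 && c < img[0].length && img[r][c];
    }
}
```

On a clean, thick bar this leaves a short centre line. The trouble described above starts with real handwriting, where crossings and ragged edges produce spurs and broken strokes.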
You could work with an open source library, and although that will help you learn algorithms and their uses, to develop a good solution you will almost certainly need to dig into the algorithms themselves and understand how they work. That requires quite a bit of study.
Here are some books that are useful as introductory textbooks:
Digital Image Processing by Gonzalez and Woods
Character Recognition Systems by Cheriet, Kharma, Siu, and Suen
Reading in the Brain by Stanislas Dehaene
Gonzalez and Woods is a standard textbook in image processing. Without some background knowledge of image processing it will be difficult for you to make progress.
The book by Cheriet, et al., touches on the state of the art in optical character recognition (OCR) and also covers handwriting recognition. The sooner you read this book, the sooner you can learn about techniques that have already been attempted.
The Dehaene book is a readable presentation of the mental processes involved in human reading, and could inspire development of interesting new algorithms.

Have you seen http://www.skritter.com? They do this in combination with spaced repetition scheduling.
I guess you want to classify features such as curves in your strokes (http://en.wikipedia.org/wiki/CJK_strokes), then as a next layer identify components, and then estimate the most likely character, statistically weighting the candidates all the while. Where there are two likely matches, you will want to show them as likely to be confused. You will also need to create a database of probably 3000 to 5000 characters, or up to 10000 for the ambitious.
See also http://www.tegaki.org/ for an open source program to do this.

preferred language/technique for sequence processing or parsing

I have come across similar problems a few times in the past and want to know what language (methodology) if any is used to solve similar problems (I am a J2EE/java developer):
Problem: out of a probable set of words, with a given rule (say the word can be a combination of A and X, always starts with an X, and each word is delimited by a space), you have to read a sequence of words and parse the input to decide which of the words are syntactically correct. In a nutshell, these are problems that involve parsing techniques. Another example: simulate the logic of a vending machine in Java.
So what I want to know is: what are the techniques or best approaches for solving problems that involve parsing input, like the Alien Language problem in Google Code Jam?
Google Code Jam problem
Do we use something like ANTLR or some other library in Java?
I know this question is slightly generic, but I had no other way of expressing it.
P.S: I do not want a solution, I am looking for best way to solve such recurring problems.

You can use JavaCC for complex parsing.
For relatively simple parsing and event processing I use enum(s) as a state machine, especially as a push parser.
For very simple parsing, you can use indexOf or split(" ") with equals, switch or startsWith
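As an illustration of the enum-as-state-machine idea, here is a small sketch for the rule in the question (a word starts with X and otherwise consists of A and X, words are separated by spaces). The class, state and method names are made up. Characters are pushed into the machine one at a time.

```java
import java.util.ArrayList;
import java.util.List;

final class WordValidator {
    /** Each state decides its successor for one input character. */
    enum State {
        START {   // beginning of a word: only 'X' is allowed
            @Override State next(char ch) { return ch == 'X' ? IN_WORD : INVALID; }
        },
        IN_WORD { // after the leading 'X': any mix of 'A' and 'X'
            @Override State next(char ch) { return ch == 'A' || ch == 'X' ? IN_WORD : INVALID; }
        },
        INVALID { // sink state: stays invalid until the word ends
            @Override State next(char ch) { return INVALID; }
        };

        abstract State next(char ch);
    }

    /** Returns the words of the space-delimited input that satisfy the rule. */
    static List<String> validWords(String input) {
        List<String> valid = new ArrayList<>();
        State state = State.START;
        StringBuilder word = new StringBuilder();
        for (char ch : (input + " ").toCharArray()) { // trailing space flushes the last word
            if (ch == ' ') {
                if (state == State.IN_WORD) valid.add(word.toString());
                state = State.START;
                word.setLength(0);
            } else {
                state = state.next(ch);
                word.append(ch);
            }
        }
        return valid;
    }
}
```

For example, `WordValidator.validWords("XAX AX X XBX XXA")` returns `[XAX, X, XXA]`.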

If you want to simulate the logic of something that is essentially a finite state automaton, you can simply code the FSA by hand. This is a standard computer science solution. A less obvious way to do this is to use a lexer-generator (there are lots of them) to generate the FSA from descriptions of the valid sequences of events (in lexer-generator speak, these are called "characters" but you can cheat and substitute event occurrences for characters).
If you have complex recursive rules about matching, you'll want a more traditional parser.
You can code these by hand, too, if the grammar isn't complicated; see my SO answer on "how to build a recursive descent parser". If your grammar is complex or it changes quickly, you'll want to use a standard parser generator. Other answers here suggest specific ones but there are many to choose from, all generally very capable.
[FWIW, I applied parser generators to recognizing valid transaction sequences in 1974 in TRW POS terminals for the May Company department store. Worked pretty well.]
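To show what a hand-coded recursive descent parser looks like, here is a minimal sketch for a made-up grammar of integer arithmetic (this is an illustration, not the code from the linked answer). Each grammar rule becomes one method, and the recursion in the grammar becomes recursion between methods.

```java
final class ExprParser {
    private final String src;
    private int pos;

    private ExprParser(String src) {
        this.src = src.replace(" ", ""); // spaces carry no meaning in this toy grammar
    }

    /** Parses and evaluates the whole input, rejecting trailing garbage. */
    static int evaluate(String text) {
        ExprParser p = new ExprParser(text);
        int value = p.expr();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("Unexpected input at position " + p.pos);
        }
        return value;
    }

    // expr := term (('+' | '-') term)*
    private int expr() {
        int value = term();
        while (peek() == '+' || peek() == '-') {
            char op = src.charAt(pos++);
            int rhs = term();
            value = op == '+' ? value + rhs : value - rhs;
        }
        return value;
    }

    // term := factor (('*' | '/') factor)*
    private int term() {
        int value = factor();
        while (peek() == '*' || peek() == '/') {
            char op = src.charAt(pos++);
            int rhs = factor();
            value = op == '*' ? value * rhs : value / rhs;
        }
        return value;
    }

    // factor := NUMBER | '(' expr ')'
    private int factor() {
        if (peek() == '(') {
            pos++;
            int value = expr();
            if (peek() != ')') throw new IllegalArgumentException("Expected ')' at position " + pos);
            pos++;
            return value;
        }
        int start = pos;
        while (Character.isDigit(peek())) pos++;
        if (start == pos) throw new IllegalArgumentException("Expected a number at position " + pos);
        return Integer.parseInt(src.substring(start, pos));
    }

    /** Current character, or '\0' once the input is exhausted. */
    private char peek() {
        return pos < src.length() ? src.charAt(pos) : '\0';
    }
}
```

`ExprParser.evaluate("2 + 3 * (4 - 1)")` returns 11; operator precedence falls out of which rule calls which.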

You can use ANTLR, which is good and will help with complex problems. But you can also use regular expressions, e.g. split("\\s+").
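For the example rule in the question, the regular expression route could look like the sketch below. The pattern `X[AX]*` is my reading of "starts with an X, then any combination of A and X"; the class name is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class RegexWordFilter {
    // A word starts with X, followed by any mix of A and X.
    private static final Pattern WORD = Pattern.compile("X[AX]*");

    /** Splits on runs of whitespace and keeps the tokens that match the rule. */
    static List<String> validWords(String input) {
        List<String> valid = new ArrayList<>();
        for (String token : input.trim().split("\\s+")) {
            if (WORD.matcher(token).matches()) valid.add(token);
        }
        return valid;
    }
}
```

This is fine as long as the rule stays regular. Once words can nest or refer to each other, a parser generator such as ANTLR is the better fit.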
