Word vectorization for neural network input? - java

I wanted to try some NLP things on a Neural-Network, but for the input I need vectors of word, "one-of-k" can't be used because of the big vocabulary. So I tried to do "multidimensional scaling",which for reasons unknown to me doesn't work. "Programming Collective Intelligence" was the book that I followed for this.
This isn't actually my Problem on which I wanted to work on. So if there would be a library available which will do this work, I could overcome this obstacle and experiment on my actual problem.

You can try OpenNLP and try the api's from it. Although, I must accept I have not clearly understood the question.

Related

Scanning texts for specific words

I want to create an algorithm that searches job descriptions for given words (like Java, Angular, Docker, etc). My algorithm works, but it is rather naive. For example, it cannot detect the word Java if it is contained in another word (such as JavaEE). When I check for substrings, I have the problem that, for example, Java is recognized in the word JavaScript, which I want to avoid. I could of course make an explicit case distinction here, but I'm more looking for a general solution.
Are there any particular techniques or approaches that try to solve this problem?
Unfortunately, I don't have the amount of data necessary for data-driven approaches like machine learning.
Train a simple word2vec language model with your whole job description text data. Then use your own logic to find the keywords. When you find a match, if it's not an exact match use your similar words list.
For example you're searching for Java but find also javascript, use your word vectors to find if there is any similarity between them (in another words, if they ever been used in a similar context). Java and JavaEE probably already used in a same sentence before but java and javascript or Angular and Angularentwicklung been not.
It may seems a bit like over-engineering, but its not :).
I spent some time researching my problem, and I found that identifying certain words, even if they don't match 1:1, is not a trivial problem. You could solve the problem by listing synonyms for the words you are looking for, or you could build a rule-based named entity recognition service. But that is both error-prone and maintenance-intensive.
Probably the best way to solve my problem is to build a named entity recognition service using machine learning. I am currently watching a video series that looks very promising for the given problem. --> https://www.youtube.com/playlist?list=PL2VXyKi-KpYs1bSnT8bfMFyGS-wMcjesM
I will comment on this answer when I am done with my work to give feedback to those who are facing the same problem.

How to design a high performance Key Matching algorithm for a Translation Memory/Cache?

Recently I've been assigned to build a translation memory for a new project. The idea is the TM is a cache layer on top of the RPC layer which will call the Google Translate API to translate if there is no match in the TM. I consider using the source text as key in TM and I need a fuzzy matching algorithm to match a query text with key in TM. If the result is higher than some threshold like 0.85 (range is 0 to 1) the cached translated text will be used instead of calling google service.
I've read a lot of articles/blogs/papers, but still don't know where to start.
TD-IDF+cosine similarity seems not good enough? Levenshtein distance?
What about semantic similarity? But how?
I read about this
In the comments #mbatchkarov seems provide a correct direction.
Does anyone has similar experience on the subject? Any suggestions are welcome.
A lot of the time the accepted answer to the question you linked to can get you quite far. You can compare the word (lemma) overlap between a query and all queries in the cache. To improve performance, you can incorporate word similarity to help you link semantically similar words. The thesaurus-building software I linked to in my is BSD-licensed, so you are free to use it as you see fit. If you need any help using it, the developers (disclaimer: I am a part of the team) will be happy to help out. In fact, I've got a few pre-built thesauri lying around. These should probably be a part of the software, but they are too large to upload to github.
Whichever approach you go for, be aware that there will be many cases where this does not work well. This is because the approaches discussed in that question are about semantic similarity, and your application may require semantic equivalence. For example, "I like big ginger cats" and "We like big ginger cats" or "We like small ginger cats" are very similar in meaning, but it would be wrong to use the translation of one as a translation of the other.

Handwriting Recognition Java

I know there are other posts about this, but I cannot seem to find one strictly for handwriting. I am going to have a form and all I need to read in is 8 squares in the left hand corner that will have 3 letters proceeded by 5 numbers.
The problem with most posts is that people either post about software for writing on the screen or software that doesn't recognize handwriting yet. I would prefer to have something in java, but something simple in another language would work.
What would really work is if people could scan their documents and just type the job number for the document name, but apparently they cant do that right...
Can you change the form? This problem will simplify a lot if you can change the form to be something that is easier for a machine to read. To recognize an arbitrary handwriting is hard as well as error prone.
What I have in mind is a form like this:
form example http://shareworldonline.com/w3/testprep/images/test%20form.jpg
However, if you have to have handwriting, check out the solutions in this thread.
if i got you correctly, you are doing offline hwr,
when i was doing offline hwr, i found most difficult separating characters in word, seems like you have them in squares, so all what you need to do is find your boxes (ie by using histogram)
and compare content of your box with each element in yours characters database (i used levenshtein distance for that)
I know it maybe not very helpful, but maybe push you on right track.

Handwritten character (English letters, kanji,etc.) analysis and correction

I would like to know how practical it would be to create a program which takes handwritten characters in some form, analyzes them, and offers corrections to the user. The inspiration for this idea is to have elementary school students in other countries or University students in America learn how to write in languages such as Japanese or Chinese where there are a lot of characters and even the slightest mistake can make a big difference.
I am unsure how the program will analyze the character. My current idea is to get a single pixel width line to represent the stroke, compare how far each pixel is from the corresponding pixel in the example character loaded from a database, and output which area needs the most work. Endpoints will also be useful to know. I would also like to tell the user if their character could be interpreted as another character similar to the one they wanted to write.
I imagine I will need a library of some sort to complete this project in any sort of timely manner but I have been unable to locate one which meets the standards I will need for the program. I looked into OpenCV but it appears to be meant for vision than image processing. I would also appreciate the library/module to be in python or Java but I can learn a new language if absolutely necessary.
Thank you for any help in this project.
Character Recognition is usually implemented using Artificial Neural Networks (ANNs). It is not a straightforward task to implement seeing that there are usually lots of ways in which different people write the same character.
The good thing about neural networks is that they can be trained. So, to change from one language to another all you need to change are the weights between the neurons, and leave your network intact. Neural networks are also able to generalize to a certain extent, so they are usually able to cope with minor variances of the same letter.
Tesseract is an open source OCR which was developed in the mid 90's. You might want to read about it to gain some pointers.
You can follow company links from this Wikipedia article:
http://en.wikipedia.org/wiki/Intelligent_character_recognition
I would not recommend that you attempt to implement a solution yourself, especially if you want to complete the task in less than a year or two of full-time work. It would be unfortunate if an incomplete solution provided poor guidance for students.
A word of caution: some companies that offer commercial ICR libraries may not wish to support you and/or may not provide a quote. That's their right. However, if you do not feel comfortable working with a particular vendor, either ask for a different sales contact and/or try a different vendor first.
My current idea is to get a single pixel width line to represent the stroke, compare how far each pixel is from the corresponding pixel in the example character loaded from a database, and output which area needs the most work.
The initial step of getting a stroke representation only a single pixel wide is much more difficult than you might guess. Although there are simple algorithms (e.g. Stentiford and Zhang-Suen) to perform thinning, stroke crossings and rough edges present serious problems. This is a classic (and unsolved) problem. Thinning works much of the time, but when it fails, it can fail miserably.
You could work with an open source library, and although that will help you learn algorithms and their uses, to develop a good solution you will almost certainly need to dig into the algorithms themselves and understand how they work. That requires quite a bit of study.
Here are some books that are useful as introduct textbooks:
Digital Image Processing by Gonzalez and Woods
Character Recognition Systems by Cheriet, Kharma, Siu, and Suen
Reading in the Brain by Stanislas Dehaene
Gonzalez and Woods is a standard textbook in image processing. Without some background knowledge of image processing it will be difficult for you to make progress.
The book by Cheriet, et al., touches on the state of the art in optical character recognition (OCR) and also covers handwriting recognition. The sooner you read this book, the sooner you can learn about techniques that have already been attempted.
The Dehaene book is a readable presentation of the mental processes involved in human reading, and could inspire development of interesting new algorithms.
Have you seen http://www.skritter.com? They do this in combination with spaced recognition scheduling.
I guess you want to classify features such as curves in your strokes (http://en.wikipedia.org/wiki/CJK_strokes), then as a next layer identify componenents, then estimate the most likely character. All the while statistically weighting the most likely character. Where there are two likely matches you will want to show them as likely to be confused. You will also need to create a database of probably 3000 to 5000 characters, or up to 10000 for the ambitious.
See also http://www.tegaki.org/ for an open source program to do this.

Algorithm for searching for an image in another image. (Collage)

Is this even possible? I have one huge image, 80mb with a lot of tiny pictures. They are tilted and turned around as well. How can i search for an image with programming? I know how to use java and c++. How would you go about this?
You might want to look up the Scale Invariant Feature Transform (SIFT) algorithm. Just for example, it's used in a fair number of programs for automatically generating panoramas, to recognize the parts of pictures that match up, despite differences in scaling, tilting, panning, and so on.
Edit: Quite true -- it is patented, and I probably should have mentioned that to start with. In case anybody care's it's US patent # 6,711,293.
One algorithm I've used before is SIFT. If you're interested in implementing the algorithm for yourself, you can see course notes for CPSC 425 at UBC, which describes in gentle detail how to implement SIFT in MATLAB. If you just want code that does this, take a look at VLFeat, a C library that does SIFT and a number of other algorithms.
Quotation from Jerry Coffin:
Edit: Quite true -- it is patented, and I probably should have mentioned that to start with. In case anybody care's it's US patent # 6,711,293.
How much do you know about the image? Exactly what it looks like? Do you have a copy of the image and you just need to figure out where in the large image it is?
Anyway, the branch of CS that deals with these kinds of questions is called Computer Vision.
Open CV and TINA are two open source libraries you might be able to use.
You should probably start out with the simplest ideas and see if they are sufficient for your needs. In the field of pattern matching the simplest idea is that of template matching. There is an efficient implementation of template matching found in OpenCv.
Note that template matching is rotation variant, meaning if the template you are trying to match can be rotated in the image you are trying to find it in, it won't work unless you pre-rotate the templates.

Categories