I am very new to sentiment analysis. How can I judge whether a given word or sentence is positive or negative? I have to implement it in Java. I tried reading tutorials for things like LingPipe and RapidMiner, but I do not understand them. In their examples they use a lot of data; in my case I do not have much data. All I have is, let's say, a word or a sentence. I tried reading the questions on Stack Overflow too, but they do not help me much.
Thanks in advance.
Computers don't know about a human thing like sentiment unless they learn it from examples that a human has labeled as positive or negative.
The goal of Machine Learning is in fact to make the most informed decision about a new example based on the empirical data of previous examples. Statistically, the more data, the better.
To "judge" the sentiment of a sentence, you'll need to have trained a model or classifier on some sentences labeled for sentiment. The classifier takes an unlabeled sentence as input and outputs a label: positive or negative.
First get training examples. I'm sure you can find some labeled sentiment data in the public domain. One of the best data set repositories is the UCI KDD Archive. You can then train a classifier on that data to judge new examples. There is a host of learning-algorithm resources available. My favorites are jBoost, which can output a classifier as Java code, and RapidMiner, which is better for visual analysis.
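If you only have a handful of labeled sentences to start with, even a tiny hand-rolled Naive Bayes classifier shows the idea before you move to LingPipe or RapidMiner. A minimal sketch in plain Java (the class name, training sentences and the crude add-one smoothing are mine, purely for illustration):

import java.util.*;

public class TinySentimentClassifier {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // label -> word -> count
    private final Map<String, Integer> labelCounts = new HashMap<>();

    public void train(String sentence, String label) {
        labelCounts.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : sentence.toLowerCase().split("\\W+")) {
            counts.merge(word, 1, Integer::sum);
        }
    }

    public String classify(String sentence) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        int totalExamples = labelCounts.values().stream().mapToInt(Integer::intValue).sum();
        for (String label : labelCounts.keySet()) {
            // log prior plus sum of log likelihoods, with crude add-one smoothing
            double score = Math.log((double) labelCounts.get(label) / totalExamples);
            Map<String, Integer> counts = wordCounts.get(label);
            int totalWords = counts.values().stream().mapToInt(Integer::intValue).sum();
            for (String word : sentence.toLowerCase().split("\\W+")) {
                score += Math.log((counts.getOrDefault(word, 0) + 1.0) / (totalWords + 1.0));
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }

    public static void main(String[] args) {
        TinySentimentClassifier c = new TinySentimentClassifier();
        c.train("this movie was great and I loved it", "positive");
        c.train("what a wonderful, brilliant film", "positive");
        c.train("terrible plot, I hated every minute", "negative");
        c.train("boring, awful and far too long", "negative");
        System.out.println(c.classify("a brilliant and wonderful story")); // likely "positive"
    }
}

The real libraries do essentially the same counting, just with proper tokenization, smoothing and far more training data, which is why the amount of data matters so much.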
You could use an existing web-service which is trained from prior data. For example:
Chatterbox Sentiment Detection API
Which has libraries for Java & Android.
(Disclosure: I work for the company that builds this API)
This is not really programming related (neuro-linguistic programming is not programming), and in general there is no reliable solution.
My best idea is to make it work like Google's "PigeonRank": collect words and sentences, then collect human feedback on whether they are positive or negative, and then use Bayesian matching against this data.
You can try to use WordNet to find a word's semantic orientation based on a "distance" calculation between your word and the words "good" and "bad". The shorter distance gives you the word's SO. The results will probably be a bit weak, but this approach does not need a lot of data (or time).
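A toy sketch of that distance idea: with real WordNet you would follow synset relations, but the tiny hand-built synonym graph, cap value and class name below are invented purely to illustrate the calculation:

import java.util.*;

public class SemanticOrientation {
    private static final int UNREACHABLE = 10; // cap for words with no path at all
    private static final Map<String, List<String>> GRAPH = new HashMap<>();
    static {
        link("good", "great"); link("great", "excellent"); link("good", "nice");
        link("bad", "awful");  link("awful", "terrible");  link("bad", "poor");
    }
    private static void link(String a, String b) {
        GRAPH.computeIfAbsent(a, k -> new ArrayList<>()).add(b);
        GRAPH.computeIfAbsent(b, k -> new ArrayList<>()).add(a);
    }

    // breadth-first search distance between two words, capped at UNREACHABLE
    static int distance(String from, String to) {
        Map<String, Integer> dist = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        dist.put(from, 0);
        queue.add(from);
        while (!queue.isEmpty()) {
            String word = queue.poll();
            if (word.equals(to)) return dist.get(word);
            for (String next : GRAPH.getOrDefault(word, Collections.emptyList())) {
                if (!dist.containsKey(next)) {
                    dist.put(next, dist.get(word) + 1);
                    queue.add(next);
                }
            }
        }
        return UNREACHABLE;
    }

    public static void main(String[] args) {
        String word = "excellent";
        // positive orientation = closer to "good" than to "bad"
        int so = distance(word, "bad") - distance(word, "good");
        System.out.println(word + " looks " + (so > 0 ? "positive" : "negative"));
    }
}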
Related
I am trying to develop an Android (Java) project for my Artificial Intelligence thesis. In short, it is based on story reading and a word quiz. A person reads a story and marks the words he doesn't know. These words are registered in a WordPortfolio table that has "Word_id", "Seen" (how many times seen), "Asked" (how many times asked in the quiz), and "Right" (how many times answered correctly).
I have a "Words" table in my db with three parameters that make a word unique: "Priority", "Level" and a specifier for whether it is a Verb, Noun, Adj, Adv, etc.
What I want to ask is:
Which algorithm can I use to classify these words so that "word-meaning" questions are asked wisely? I want the learner to see the words he encountered in the story-reading part more than once, to consolidate their meaning, and I also want him to learn new words.
There are many types of algorithms designed to do this. For instance, you could use linear regression, nearest neighbor, clustering, or a neural network. http://en.wikipedia.org/wiki/List_of_machine_learning_algorithms provides a pretty comprehensive list of the options out there.
I would also check whether your library has the book "Programming Collective Intelligence" by Toby Segaran (http://shop.oreilly.com/product/9780596529321.do) or something similar.
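Before reaching for a full learning algorithm, a simple scoring heuristic over the counters your WordPortfolio table already stores may be enough. A hedged sketch (the class, field names and weighting formula are invented; treat it as one plausible starting point rather than the answer):

import java.util.*;

public class WordRanker {
    static class PortfolioWord {
        String word;
        int seen, asked, right;
        PortfolioWord(String word, int seen, int asked, int right) {
            this.word = word; this.seen = seen; this.asked = asked; this.right = right;
        }
        // higher score = more urgent to quiz: often seen, rarely asked, often answered wrong
        double score() {
            double successRate = asked == 0 ? 0.0 : (double) right / asked;
            return seen * (1.0 - successRate) / (asked + 1);
        }
    }

    public static void main(String[] args) {
        List<PortfolioWord> words = Arrays.asList(
                new PortfolioWord("ubiquitous", 5, 2, 0),
                new PortfolioWord("meticulous", 3, 3, 3),
                new PortfolioWord("ephemeral", 4, 0, 0));
        words.stream()
             .sorted(Comparator.comparingDouble(PortfolioWord::score).reversed())
             .forEach(w -> System.out.println(w.word + " -> " + w.score()));
    }
}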
Classification and clustering algorithms are implemented in many artificial intelligence tools such as MATLAB, WEKA, etc. You can see a sample of this in "WEKA Text Classification for First Time & Beginner Users", but I think your problem would also perform well on a MapReduce framework. I suggest you use Mahout for your problem; it runs on a parallel framework, so it can offer better speed than the other platforms.
I extracted all the entities present in a particular sentence. For example, if my sentence is
infrastructure is good, Work-culture is pathetic,hikes are not good either
I have developed code that gives me the entities. Now I need the sentiment based on those entities. My output should be something like
infrastructure--> positive
work-culture--> negative
hikes--> negative
How am I supposed to do that?
If you are done with the coding, the next thing, which is the most challenging part, is to train the system with proper content. I have worked with the Google Prediction API for the same kind of sentiment analysis. You need content that matches the domain: if it is a movie review, then the training content should contain lots of movie reviews. I can tell you that I trained a system for movie-review analysis with 30 reviews (15 positive and 15 negative), and the system still does not give 80% correct results.
If you are using Stanford NLP package then it comes with a sentiment analyzer.
See http://nlp.stanford.edu/sentiment/
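A rough sketch of how that could look for the example sentence, splitting on commas so each entity's clause is scored on its own (annotation class names differ between CoreNLP releases, so treat this as an outline rather than copy-paste code):

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class EntitySentiment {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // treat each comma-separated clause as its own "sentence" so every entity gets its own label
        String review = "infrastructure is good, Work-culture is pathetic, hikes are not good either";
        for (String clause : review.split(",")) {
            Annotation annotation = pipeline.process(clause.trim());
            for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
                String sentiment = sentence.get(SentimentCoreAnnotations.SentimentClass.class);
                System.out.println(clause.trim() + " --> " + sentiment);
            }
        }
    }
}

Mapping the sentiment back to your extracted entity is then just a matter of matching the entity to the clause it appeared in.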
I am developing a financial manager in my free time with Java and a Swing GUI. When the user adds a new entry, he is prompted to fill in: money amount, Date, Comment and Section (e.g. Car, Salary, Computer, Food, ...)
The sections are created "on the fly". When the user enters a new section, it is added to the section JComboBox for later selection. The other point is that the comments could be in different languages, so a list of hard-coded words and synonyms would be enormous.
So, my question is: is it possible to analyse the comment (e.g. "Fuel", "Car service", "Lunch at **") and preselect a fitting Section?
My first thought was to do it with a neural network and learn from the input whenever the user selects another section.
But my problem is that I don't know how to start at all. I tried Encog with Eclipse and did some tutorials (XOR, ...), but all of them only use doubles as input/output.
Could anyone give me a hint on how to start, or any other possible solution for this?
Here is a runnable JAR (current development state, requires Java 7) and the SourceForge page.
Forget about neural networks. This is a highly technical and specialized field of artificial intelligence, which is probably not suitable for your problem and requires solid expertise. Besides, there are plenty of simpler and better solutions for your problem.
First obvious solution: build a list of words and synonyms for all your sections and parse for these synonyms. You can then collect comments online for synonym analysis, or parse the comments/sections provided by your users to statistically detect relations between words, etc.
There is an infinite number of possible solutions, ranging from the simplest to the most overkill. Now you need to define whether this feature of your system is critical (prefilling? probably not, then)... and what any development effort will bring you. One hour of work could bring you an 80% satisfying feature, while aiming for 90% would cost one week of work. Is it really worth it?
Go for the simplest solution and tackle the real challenge of any dev project: delivering. Once your app is delivered, then you can always go back and improve as needed.
String myString = new String(paramInput);
if (myString.toUpperCase().contains("FUEL")) {   // case-insensitive keyword check
    // do the fuel functionality, e.g. preselect the "Car" section
}
In a simple app, if you will only have a few specific sections, you can take the string from the comment, check whether it contains certain keywords, and set the value of Section accordingly, as in the snippet above.
If you have a lot of categories, I would use something like Apache Lucene, where you could index all the categories with their names and potential keywords/phrases that might appear in a user's description. Then you could simply run the description through Lucene and use the top matched category as a "best guess".
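A hedged sketch of that Lucene idea (exact constructors and directory classes vary between Lucene versions, and the section keywords below are invented):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class SectionGuesser {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();          // in-memory index; newer Lucene versions use a different class
        Analyzer analyzer = new StandardAnalyzer();

        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
        addSection(writer, "Car", "fuel gas petrol service repair tires insurance");
        addSection(writer, "Food", "lunch dinner restaurant groceries pizza coffee");
        writer.close();

        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        String comment = "Lunch at the Italian restaurant";
        ScoreDoc[] hits = searcher.search(
                new QueryParser("keywords", analyzer).parse(QueryParser.escape(comment)), 1).scoreDocs;
        if (hits.length > 0) {
            System.out.println("Best guess: " + searcher.doc(hits[0].doc).get("section"));
        }
    }

    private static void addSection(IndexWriter writer, String section, String keywords) throws Exception {
        Document doc = new Document();
        doc.add(new TextField("section", section, Field.Store.YES));
        doc.add(new TextField("keywords", keywords, Field.Store.YES));
        writer.addDocument(doc);
    }
}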
P.S. Neural Network inputs and outputs will always be doubles or floats with a value between 0 and 1. As for how to implement String matching I wouldn't even know where to start.
It seems to me that the following will do:
hard word statistics
maybe a stemming class (English/Spanish) which reduces a word like "lunches" to "lunch".
a list of most frequent non-words (the, at, a, for, ...)
Finding the best fit is a linear problem, so in theory a neural net would fit, but why not go straight for the numerically best fit.
A machine learning algorithm such as an Artificial Neural Network doesn't seem like the best solution here. ANNs can be used for multi-class classification (i.e. 'which of the provided pre-trained classes does the input belong to?', not just 'does the input represent an X?'), which fits your use case. The problem is that they are supervised learning methods, and as such you need to provide a list of pairs of keywords and classes (Sections) that spans every possible input your users will provide. This is impossible, and in practice ANNs are re-trained when more data becomes available to produce better results and a more accurate decision boundary / representation of the function that maps inputs to outputs. It also assumes that you know all possible classes before you start and that each of those classes has training input values that you provide.
The issue is that the input to your ANN (a list of characters or a numerical hash of the string) provides no context by which to classify. There's no higher level information provided that describes the word's meaning. This means that a different word that hashes to a numerically close value can be misclassified if there was insufficient training data.
(As maclema said, the output from an ANN will always be floats with each value representing proximity to a class - or a class with a level of uncertainty.)
A better solution would be to employ some kind of word-relation or synonym graph. A Bag of words model might be useful here.
Edit: In light of your comment that you don't know the Sections beforehand:
An easy solution to program would be to keep a list of keywords in a file that gets updated as people use the program. Simply storing a mapping of provided comments -> Sections, which you will already have in your database, lets you filter out non-keywords (and, or, the, ...). One option is then to find, for the typed keywords, the list of Sections they belong to, suggest several Sections, and let the user pick one; the feedback you get from user selections would improve the suggestions in the future. Another is to calculate a Bayesian probability (the probability that this word belongs to Section X, given the previously stored mappings) for all keywords and Sections, and either take the modal Section or normalise over each unique keyword and take the mean. The probability calculations will of course need to be updated as you gather more information; perhaps this could be done with every new addition in a background thread.
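A minimal sketch of that counting idea in plain Java (the stop-word list and class name are invented; a real version would persist the counts in your database and could normalise them into proper probabilities):

import java.util.*;

public class SectionSuggester {
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "at", "a", "for", "and", "or"));
    private final Map<String, Map<String, Integer>> wordToSectionCounts = new HashMap<>(); // word -> section -> count

    // call this every time the user confirms a Section for a comment
    public void learn(String comment, String section) {
        for (String word : comment.toLowerCase().split("\\W+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) continue;
            wordToSectionCounts.computeIfAbsent(word, k -> new HashMap<>())
                               .merge(section, 1, Integer::sum);
        }
    }

    // suggest the Section whose stored keywords best match a new comment
    public Optional<String> suggest(String comment) {
        Map<String, Integer> votes = new HashMap<>();
        for (String word : comment.toLowerCase().split("\\W+")) {
            Map<String, Integer> counts = wordToSectionCounts.get(word);
            if (counts != null) counts.forEach((section, n) -> votes.merge(section, n, Integer::sum));
        }
        return votes.entrySet().stream().max(Map.Entry.comparingByValue()).map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        SectionSuggester s = new SectionSuggester();
        s.learn("Fuel for the car", "Car");
        s.learn("Car service", "Car");
        s.learn("Lunch at the diner", "Food");
        System.out.println(s.suggest("fuel and car wash")); // Optional[Car]
    }
}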
I’m thinking of adding a feature to the TalkingPuffin Twitter client, where, after some training with the user, it can rank incoming tweets according to their predicted value. What solutions are there for the Java virtual machine (Scala or Java preferred) to do this sort of thing?
This is a classification problem, where you essentially want to learn a function y(x) which predicts whether 'x', an unlabeled tweet, belongs in the class 'valuable' or in the class 'not valuable'.
The trickiest bits here are not the algorithm (Naive Bayes is just counting and multiplying and is easy to code!) but:
Gathering the training data
Defining the optimal feature set
For the first, I suggest you track tweets that the user favorites, replies to, and retweets; for the second, look at qualities like who wrote the tweet, the words in the tweet, and whether it contains a link or not.
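As a sketch, turning a tweet into a numeric feature vector along those lines could look like this (the particular features and the normalisation are illustrative assumptions, not a recommendation):

import java.util.Set;

public class TweetFeatures {
    public static double[] extract(String author, String text,
                                   Set<String> followedAuthors, Set<String> favoriteWords) {
        String lower = text.toLowerCase();
        double authorFollowed = followedAuthors.contains(author) ? 1.0 : 0.0;
        double hasLink = (lower.contains("http://") || lower.contains("https://")) ? 1.0 : 0.0;
        double favoriteWordHits = 0;
        for (String word : lower.split("\\W+")) {
            if (favoriteWords.contains(word)) favoriteWordHits++;
        }
        double length = Math.min(text.length(), 140) / 140.0;  // normalised tweet length
        return new double[] { authorFollowed, hasLink, favoriteWordHits, length };
    }
}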
Doing this well is not easy. Google would love to be able to do such things ("What links will the user value"), as would Netflix ("What movies will they value") and many others. In fact, you'd probably do well to read through the notes about the winning entry for the Netflix Prize.
Then you need to extract a bunch of features, as #hmason says. And then you need an appropriate machine learning algorithm; you either need a function approximator (where you try to use your features to predict a value between, say, 0 and 1, where 1 is "best tweet ever" and 0 is "omg who cares") or a classifier (where you use your features to try to predict whether it's a "good" or "bad" tweet).
If you go for the latter--which makes user-training easy, since they just have to score tweets with "like" (to mix social network metaphors)--then you typically do best with support vector machines, for which there exists a fairly comprehensive Java library.
In the former case, there are a variety of techniques that might be worth trying; if you decide to use the LIBSVM library, they have variants for regression (i.e. parameter estimation) as well.
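For the classifier route with the LIBSVM Java bindings mentioned above, a minimal sketch could look like the following (the feature vectors, labels and parameter values are placeholders; you would build the vectors from real tweets and tune C and gamma properly):

import libsvm.*;

public class TweetSvm {
    static svm_node[] toNodes(double[] features) {
        svm_node[] nodes = new svm_node[features.length];
        for (int i = 0; i < features.length; i++) {
            nodes[i] = new svm_node();
            nodes[i].index = i + 1;      // LIBSVM feature indices start at 1
            nodes[i].value = features[i];
        }
        return nodes;
    }

    public static void main(String[] args) {
        // toy training data: four feature vectors with labels 1 = "good tweet", 0 = "bad tweet"
        double[][] trainingFeatures = { {1, 0, 2, 0.5}, {0, 1, 0, 0.9}, {1, 1, 1, 0.3}, {0, 0, 0, 0.2} };
        double[] labels = { 1, 0, 1, 0 };

        svm_problem problem = new svm_problem();
        problem.l = trainingFeatures.length;
        problem.y = labels;
        problem.x = new svm_node[trainingFeatures.length][];
        for (int i = 0; i < trainingFeatures.length; i++) problem.x[i] = toNodes(trainingFeatures[i]);

        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.C_SVC;
        param.kernel_type = svm_parameter.RBF;
        param.C = 1;
        param.gamma = 0.25;
        param.cache_size = 100;
        param.eps = 1e-3;

        svm_model model = svm.svm_train(problem, param);
        double prediction = svm.svm_predict(model, toNodes(new double[] {1, 0, 1, 0.4}));
        System.out.println(prediction == 1.0 ? "good tweet" : "bad tweet");
    }
}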
What books are there about how to build a natural language parsing program like this:
input: I got to TALL you
output: I got to TELL you
input: Big RAT box
output: Big RED box
in: hoo un thum zend three
out: one thousand three
It must have a language model that allows it to predict which words are misspelled!
What are the best books on how to build such a tool?
P.S. Are there free web services for spell checking? From Google, maybe?
Peter Norvig has written a terrific spell checker. Maybe that can help you.
You have at least three options:
You can write a program which understands the language (i.e. what a word means). This is a topic for research today. Expect the first results when you can buy a computer which is fast enough to run such a program (which is probably in 10 years when computers have become 1000 times faster than today).
Use a huge corpus (text documents) to train a Hidden Markov Model.
Use a huge corpus and generate n-gram statistics, i.e. how often a tuple of N words appears. I don't have a link handy for this, but the idea is that some words always appear in the context of other words. So when you split your text into 4-grams, look them up in your database, and can't find one, chances are that something is wrong with the current tuple. The next step is to find all possible matches (other 4-grams with a small Soundex or similar distance to the current one) and try the one with the highest frequency; a toy sketch of this lookup follows at the end of this answer.
Google has this data for quite a few languages and you might find more in Google labs about this.
[EDIT] After some googling, I finally found the link: On this page, you can buy English 1- to 5-grams which Google collected over the whole Internet on 6 DVDs.
Googling for "google spelling statistics n-grams" will also turn up some interesting links.
Soundex (wiki) is one option.
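For example, with the Soundex implementation in Apache Commons Codec (assuming commons-codec is on the classpath), words that sound alike get the same code, which gives you replacement candidates for a suspicious word:

import org.apache.commons.codec.language.Soundex;

public class SoundexExample {
    public static void main(String[] args) {
        Soundex soundex = new Soundex();
        System.out.println(soundex.encode("tall"));  // T400
        System.out.println(soundex.encode("tell"));  // T400 -> same code, so "tell" is a candidate for "tall"
        System.out.println(soundex.encode("rat"));   // R300
        System.out.println(soundex.encode("red"));   // R300
    }
}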
There are quite a few Java libraries for natural language processing that would help you implement a spelling corrector. But you asked about a book. Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze looks like a good option. The first author is a Stanford Professor leading a group that does natural language processing and developing Java libraries and NLP resources that many people use.
At Dev Days London, Michael Sparks presented a Python script coded for exactly that. It was surprisingly simple! See if you can find it on Google; maybe somebody here will have the link.