How to correlate similar messages using NLP


I have a couple of tweets that need to be processed. I am trying to find messages that express an intent to harm a person. How do I go about achieving this with NLP?
1. I bought my son a toy gun
2. I shot my neighbor with a gun
3. I don't like this gun
4. I would love to own this gun
5. This gun is a very good buy
6. Feel like shooting myself with a gun
Of the sentences above, the 2nd and 6th are the ones I would like to find.

If the problem is restricted only to guns and shooting, then you could use a dependency parser (like the Stanford Parser) to find verbs and their (prepositional) objects, starting with the verb and tracing its dependants in the parse tree. For example, in both 2 and 6 these would be "shoot, with, gun".
Then you can use a list of (near) synonyms for "shoot" ("kill", "murder", "wound", etc) and "gun" ("weapon", "rifle", etc) to check if they occur in this pattern (verb - preposition - noun) in each sentence.
There will be other ways to express the same idea, e.g. "I bought a gun to shoot my neighbor", where the dependency relation is different, and you'd need to detect these types of dependencies too.
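The synonym check itself can be sketched independently of the parser. This is only a minimal sketch: it assumes the dependency parser has already produced a (verb lemma, preposition, noun lemma) triple for each sentence. The class name HarmPatternMatcher, the method indicatesHarm, and the two word lists are hypothetical and not part of the Stanford Parser API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class HarmPatternMatcher {
    // Hypothetical near-synonym lists, taken from the examples in the answer above.
    static final Set<String> HARM_VERBS =
            new HashSet<>(Arrays.asList("shoot", "kill", "murder", "wound"));
    static final Set<String> WEAPON_NOUNS =
            new HashSet<>(Arrays.asList("gun", "weapon", "rifle"));

    /**
     * Checks one verb - preposition - noun pattern.
     * All three arguments are assumed to be lower-case lemmas extracted from a parse tree.
     */
    static boolean indicatesHarm(String verb, String preposition, String noun) {
        return HARM_VERBS.contains(verb)
                && "with".equals(preposition)
                && WEAPON_NOUNS.contains(noun);
    }

    public static void main(String[] args) {
        // Triple for "I shot my neighbor with a gun" (written out by hand here, not parsed).
        System.out.println(indicatesHarm("shoot", "with", "gun"));
        // A made-up harmless triple for comparison.
        System.out.println(indicatesHarm("play", "with", "gun"));
    }
}
```

Other dependency patterns, such as the "bought a gun to shoot" case, would need their own checks next to this one.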

All of vpekar's suggestions are good. Here is some Python code that will at least parse the sentences and check whether they contain verbs from a user-defined set of harm words. Note: most 'harm words' probably have multiple senses, many of which have nothing to do with harm. This approach does not attempt to disambiguate word senses.
(This code assumes you have NLTK and Stanford CoreNLP installed.)
import os
import subprocess
from xml.dom import minidom
from nltk.corpus import wordnet as wn

def StanfordCoreNLP_Plain(inFile):
    # Create the startup info so the Java program runs in the background (Windows only)
    startupinfo = None
    if os.name == 'nt':
        startupinfo = subprocess.STARTUPINFO()
        startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
    # Execute Stanford CoreNLP from the command line.
    # os.pathsep is ';' on Windows and ':' elsewhere, so the classpath works on both.
    classpath = os.pathsep.join(['stanford-corenlp-1.3.5.jar', 'stanford-corenlp-1.3.5-models.jar', 'xom.jar', 'joda-time.jar'])
    cmd = ['java', '-Xmx1g', '-cp', classpath, 'edu.stanford.nlp.pipeline.StanfordCoreNLP', '-annotators', 'tokenize,ssplit,pos', '-file', inFile]
    subprocess.Popen(cmd, stdout=subprocess.PIPE, startupinfo=startupinfo).communicate()
    # CoreNLP writes its result to <input file name>.xml in the current directory
    xmldoc = minidom.parse(os.path.basename(inFile) + '.xml')
    itemlist = xmldoc.getElementsByTagName('sentence')
    Document = []
    # Get the data out of the XML document and into Python lists
    for item in itemlist:
        sentList = []
        tokens = item.getElementsByTagName('token')
        for d in tokens:
            word = d.getElementsByTagName('word')[0].firstChild.data
            pos = d.getElementsByTagName('POS')[0].firstChild.data
            sentList.append([str(pos.strip()), str(word.strip())])
        Document.append(sentList)
    return Document

def FindHarmSentence(Document):
    # Loop through the sentences in the document. Look for verbs in the harm words set.
    VerbTags = ['VBN', 'VB', 'VBZ', 'VBD', 'VBG', 'VBP', 'V']
    HarmWords = ("shoot", "kill")
    ReturnSentences = []
    for Sentence in Document:
        for word in Sentence:
            if word[0] in VerbTags:
                # morphy returns the base form of the verb, or None if it is unknown
                wordRoot = wn.morphy(word[1], wn.VERB)
                if wordRoot in HarmWords:
                    print("This message could indicate harm:", str(Sentence))
                    ReturnSentences.append(Sentence)
                    break  # report each sentence only once
    return ReturnSentences

# Assuming your input is a string, we need to put the strings in some file.
Sentences = "I bought my son a toy gun. I shot my neighbor with a gun. I don't like this gun. I would love to own this gun. This gun is a very good buy. Feel like shooting myself with a gun."
ProcessFile = "ProcFile.txt"
with open(ProcessFile, 'w') as OpenProcessFile:
    OpenProcessFile.write(Sentences)
# Sentence split, tokenize, and part-of-speech tag the data using Stanford CoreNLP
Document = StanfordCoreNLP_Plain(ProcessFile)
# Find sentences in the document with harm words
HarmSentences = FindHarmSentence(Document)
This outputs the following:
This message could indicate harm: [['PRP', 'I'], ['VBD', 'shot'], ['PRP$', 'my'], ['NN', 'neighbor'], ['IN', 'with'], ['DT', 'a'], ['NN', 'gun'], ['.', '.']]
This message could indicate harm: [['NNP', 'Feel'], ['IN', 'like'], ['VBG', 'shooting'], ['PRP', 'myself'], ['IN', 'with'], ['DT', 'a'], ['NN', 'gun'], ['.', '.']]

I would have a look at SenticNet
http://sentic.net/sentics
It provides an open source knowledge base and parser that assigns emotional value to text fragments. Using the library, you could train it to recognize statements that you're interested in.

Related

PorterStemmer with verbs ending in -es and -ed java

I am using PorterStemmer in Java to get the base form of a verb, but I found a problem with the verbs "goes" and "gambles". Instead of stemming them to "go" and "gamble", it stems them to "goe" and "gambl". Is there a better tool that can handle verbs ending in -es and -ed and retrieve their base form? P.S. JAWS with WordNet in Java does that too.
Here is my code:
public class verb
{
    public static void main(String[] args)
    {
        PorterStemmer ps = new PorterStemmer();
        ps.setCurrent("gambles");
        ps.stem();
        System.out.println(ps.getCurrent());
    }
}
Here is the output in console:
gambl

Take a few minutes to read this tutorial from the Stanford NLP group:
https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
You will find that a stemmer does not work the way you might expect. It is a crude heuristic that chops off word endings, so it does not always give you a complete base form. Since you care about getting the complete base form of a verb, lemmatization is the better fit for you.
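To illustrate the difference, here is a minimal sketch of dictionary-backed lemmatization: a candidate base form is accepted only if it is a known word, which is roughly how WordNet-style lemmatizers avoid results like "goe". The class VerbLemmatizerSketch, its tiny vocabulary, and its rule list are hypothetical stand-ins; a real lemmatizer would rely on a full dictionary.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class VerbLemmatizerSketch {
    // Tiny stand-in vocabulary (an assumption for this example only).
    static final Set<String> KNOWN_VERBS =
            new HashSet<>(Arrays.asList("go", "gamble", "walk", "shoot"));

    // Irregular forms cannot be derived by suffix rules, so they are looked up directly.
    static final Map<String, String> IRREGULAR = new HashMap<>();
    static {
        IRREGULAR.put("went", "go");
        IRREGULAR.put("shot", "shoot");
    }

    // {suffix, replacement} pairs, tried in order.
    static final String[][] RULES = {
        {"es", "e"}, {"es", ""}, {"s", ""},
        {"ed", "e"}, {"ed", ""},
        {"ing", "e"}, {"ing", ""}
    };

    /** Returns the base form if one can be confirmed in the vocabulary, else the word unchanged. */
    static String lemma(String word) {
        String w = word.toLowerCase();
        if (KNOWN_VERBS.contains(w)) {
            return w;
        }
        String irregular = IRREGULAR.get(w);
        if (irregular != null) {
            return irregular;
        }
        for (String[] rule : RULES) {
            if (w.endsWith(rule[0])) {
                String candidate = w.substring(0, w.length() - rule[0].length()) + rule[1];
                // Unlike a stemmer, only accept candidates that are real words.
                if (KNOWN_VERBS.contains(candidate)) {
                    return candidate;
                }
            }
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(lemma("goes"));
        System.out.println(lemma("gambles"));
    }
}
```

The dictionary check is what makes the difference: "goes" first yields the candidate "goe", which is rejected, and only then "go", which is accepted.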

GATE how to annotate each verb except those in the list

I have this task: I have a .lst file in GATE with a list of verbs. They have the annotation Inner_predicates, and I need to annotate all other verbs as Outer_predicates. Can you help me write this rule?
I tried this:
Phase: Outer_Pred
Input: Morph Inner_Pred
Options: control = appelt
Rule: Outer_Pred
(
({Morph.pos == verb}, Morph.baseForm !=Inner_Pred)
):tag
-->
:tag.Outer_Pred = {rule = "Outer_Pred"}
But it doesn't work. How can I find a verb, check whether it already has an Inner_Pred annotation and, if not, annotate it as Outer_Pred?
In inner_pred.lst I have a list of verbs in base form.
Thanks in advance. It would also be great if you could tell me where I can look up this information myself. I found only the GATE JAPE manual, but it is quite short and doesn't provide many answers.

If you have a list of verbs in their base form, you should try a flexible gazetteer. It will create "Lookup" annotations where it matches (not only on verbs).
Then to match each verb that is not on the list:
Phase: Outer_Pred
Input: Morph Lookup
Options: control = appelt
Rule: Outer_Pred (
{Morph.pos == "verb", !Lookup.majorType == "yourInnerPredType"}
):tag
-->
:tag.Outer_Pred = {rule = "Outer_Pred"}
This will match every "verb" which does not start on the same offset as a Lookup with major type "yourInnerPredType".
Also, make sure you have the Morph annotations with the right pos feature.

You could write a simple JAPE rule using Java on the RHS which will:
a) match any verb (Token.category==VB | Token.category==VBD, etc.). A list of the possible verb POS tags can be found at the end of GATE's tao.pdf.
b) take the start and end nodes of this Token and check for Inner_Pred annotations covering the current Token
c) if the returned AnnotationSet for Inner_Pred is empty, create a new Outer_Pred annotation with the boundaries from b)

This is the source code of the JAPE rule:
Phase: Verb
Input: Token
Options: control = appelt
Macro: VERB
(
  ({Token.category==VBD}|{Token.category==VBG}|{Token.category==VBN}|{Token.category==VBP}|{Token.category==VB}|{Token.category==VBZ})
)
Rule: Verb
(
  (VERB)
):verb
-->
{
  AnnotationSet set = (AnnotationSet) bindings.get("verb");
  Node startNode = set.firstNode();
  Node stopNode = set.lastNode();
  AnnotationSet innerAnnotations = outputAS.get(startNode.getOffset(), stopNode.getOffset()).get("Inner_Pred");
  if (innerAnnotations == null || innerAnnotations.isEmpty()) {
    outputAS.add(startNode, stopNode, "Outer_Pred", Factory.newFeatureMap());
  }
}

Using mahout in java code, not cli

I want to be able to build a model using Java. I am able to do so with the CLI as follows:
./mahout trainlogistic --input Candy-Crush.twtr.csv \
--output ./model \
--target hd_click --categories 2 \
--predictors click_frequency country_code ctr device_price_range hd_conversion time_of_day num_clicks phone_type twitter is_weekend app_entertainment app_wallpaper app_widgets arcade books_and_reference brain business cards casual comics communication education entertainment finance game_wallpaper game_widgets health_and_fitness health_fitness libraries_and_demo libraries_demo lifestyle media_and_video media_video medical music_and_audio news_and_magazines news_magazines personalization photography productivity racing shopping social sports sports_apps sports_games tools transportation travel_and_local weather app_entertainment_percentage app_wallpaper_percentage app_widgets_percentage arcade_percentage books_and_reference_percentage brain_percentage business_percentage cards_percentage casual_percentage comics_percentage communication_percentage education_percentage entertainment_percentage finance_percentage game_wallpaper_percentage game_widgets_percentage health_and_fitness_percentage health_fitness_percentage libraries_and_demo_percentage libraries_demo_percentage lifestyle_percentage media_and_video_percentage media_video_percentage medical_percentage music_and_audio_percentage news_and_magazines_percentage news_magazines_percentage personalization_percentage photography_percentage productivity_percentage racing_percentage shopping_percentage social_percentage sports_apps_percentage sports_games_percentage sports_percentage tools_percentage transportation_percentage travel_and_local_percentage weather_percentage reads_magazine_sum reads_magazine_count interested_in_gardening_sum interested_in_gardening_count kids_birthday_coming_sum kids_birthday_coming_count job_seeker_sum job_seeker_count friends_sum friends_count married_sum married_count charity_donor_sum charity_donor_count student_sum student_count interested_in_real_estate_sum interested_in_real_estate_count sports_fan_sum sports_fan_count bascketball_sum bascketball_count interested_in_politics_sum 
interested_in_politics_count gamer_sum gamer_count activist_sum activist_count traveler_sum traveler_count likes_soccer_sum likes_soccer_count interested_in_celebs_sum interested_in_celebs_count auto_racing_sum auto_racing_count age_group_sum age_group_count healthy_lifestyle_sum healthy_lifestyle_count interested_in_finance_sum interested_in_finance_count sports_teams_usa_sum sports_teams_usa_count interested_in_deals_sum interested_in_deals_count business_oriented_sum business_oriented_count interested_in_cooking_sum interested_in_cooking_count music_lover_sum music_lover_count beauty_sum beauty_count follows_fashion_sum follows_fashion_count likes_wrestling_sum likes_wrestling_count name_sum name_count shopper_sum shopper_count golf_sum golf_count vegetarian_sum vegetarian_count dating_sum dating_count interested_in_fashion_sum interested_in_fashion_count interested_in_news_sum interested_in_news_count likes_tennis_sum likes_tennis_count male_sum male_count interested_in_cars_sum interested_in_cars_count follows_bloggers_sum follows_bloggers_count entertainment_sum entertainment_count interested_in_books_sum interested_in_books_count has_kids_sum has_kids_count interested_in_movies_sum interested_in_movies_count musicians_sum musicians_count tech_oriented_sum tech_oriented_count female_sum female_count has_pet_sum has_pet_count practicing_sports_sum practicing_sports_count \
--types numeric word numeric word word word numeric word word word numeric \
--features 100 --passes 1 --rate 50
I can't understand the 20 newsgroups example because it is too big to learn from.
Can anyone give me code that does the same as the CLI command?
To clarify, I need something like this:
model.train(1,0,"monday",6,44,1,7,4,6,78,7,3,4,6,........,"good");
model.train(1,0,"sunday",6,44,5,7,9,2,4,6,78,7,3,4,6,........,"bad");
model.train(1,0,"monday",4,99,2,4,6,3,4,6,........,"good");
model.writeTofile("myModel.model");
PLEASE DO NOT ANSWER IF YOU ARE NOT FAMILIAR WITH CLASSIFICATION AND ONLY WANT TO TELL ME HOW TO EXECUTE A CLI COMMAND FROM JAVA

I am not 100% familiar with the Mahout API (I agree that documentation is very sparse) so I can only give pointers, but I hope it helps:
The Java source code for the trainlogistic example can actually be found in the mahout-examples library - it's on maven [0] (in org.apache.mahout.classifier.sgd.TrainLogistic). I suppose if you wanted to, you could just use the exact same source code, but it depends on a couple of utility classes in the mahout-examples library (and it's not very clean, either).
The class performing the training in this example is org.apache.mahout.classifier.sgd.OnlineLogisticRegression [1], although considering the large number of predictor variables you have you might want to use the AdaptiveLogisticRegression [2] (same package), which uses a number of OnlineLogisticRegressions internally. But you have to see for yourself which works best with your data.
The API is fairly straightforward: there's a train method which takes a Vector of your input data and a classify method to test your model, as well as learningRate and other methods to change the model's parameters.
To save the model to disk like the command line tool does, use the org.apache.mahout.classifier.sgd.ModelSerializer, which has a straightforward API to write and read your model. (There are also write and readFields methods in the OLR class itself, but frankly, I'm not sure what they do or whether they differ from ModelSerializer; they're not documented either.)
Lastly, aside from the source code in mahout-examples, here are two other examples of using the Mahout API directly that might be useful [3, 4].
Sources:
[0] http://repo1.maven.org/maven2/org/apache/mahout/mahout-examples/0.8/
[1] http://archive.cloudera.com/cdh4/cdh/4/mahout/mahout-core/org/apache/mahout/classifier/sgd/OnlineLogisticRegression.html
[2] http://archive.cloudera.com/cdh4/cdh/4/mahout/mahout-core/org/apache/mahout/classifier/sgd/AdaptiveLogisticRegression.html
[3] http://mail-archives.apache.org/mod_mbox/mahout-user/201206.mbox/%3CCAJwFCa3X2fL_SRxT7f7v9uMjS3Tc9WrT7vuMQCVXyH71k0H0zQ#mail.gmail.com%3E
[4] http://skife.org/mahout/2013/02/14/first_steps_with_mahout.html

This blog has a good post about how to do training and classification with Mahout Java API: http://nigap.blogspot.com/2012/02/bayes-algorithm-with-apache-mahout.html

You could use Runtime.exec to execute the same command line from Java.
The simple approach is:
Process p = Runtime.getRuntime().exec("/usr/bin/bash -ic \"<path_to_mahout>/mahout trainlogistic --input Candy-Crush.twtr.csv "
+ "--output ./model "
+ "--target hd_click --categories 2 "
+ "--predictors click_frequency country_code ctr device_price_range hd_conversion time_of_day num_clicks phone_type twitter is_weekend app_entertainment app_wallpaper app_widgets arcade books_and_reference brain business cards casual comics communication education entertainment finance game_wallpaper game_widgets health_and_fitness health_fitness libraries_and_demo libraries_demo lifestyle media_and_video media_video medical music_and_audio news_and_magazines news_magazines personalization photography productivity racing shopping social sports sports_apps sports_games tools transportation travel_and_local weather app_entertainment_percentage app_wallpaper_percentage app_widgets_percentage arcade_percentage books_and_reference_percentage brain_percentage business_percentage cards_percentage casual_percentage comics_percentage communication_percentage education_percentage entertainment_percentage finance_percentage game_wallpaper_percentage game_widgets_percentage health_and_fitness_percentage health_fitness_percentage libraries_and_demo_percentage libraries_demo_percentage lifestyle_percentage media_and_video_percentage media_video_percentage medical_percentage music_and_audio_percentage news_and_magazines_percentage news_magazines_percentage personalization_percentage photography_percentage productivity_percentage racing_percentage shopping_percentage social_percentage sports_apps_percentage sports_games_percentage sports_percentage tools_percentage transportation_percentage travel_and_local_percentage weather_percentage reads_magazine_sum reads_magazine_count interested_in_gardening_sum interested_in_gardening_count kids_birthday_coming_sum kids_birthday_coming_count job_seeker_sum job_seeker_count friends_sum friends_count married_sum married_count charity_donor_sum charity_donor_count student_sum student_count interested_in_real_estate_sum interested_in_real_estate_count sports_fan_sum sports_fan_count bascketball_sum bascketball_count interested_in_politics_sum 
interested_in_politics_count gamer_sum gamer_count activist_sum activist_count traveler_sum traveler_count likes_soccer_sum likes_soccer_count interested_in_celebs_sum interested_in_celebs_count auto_racing_sum auto_racing_count age_group_sum age_group_count healthy_lifestyle_sum healthy_lifestyle_count interested_in_finance_sum interested_in_finance_count sports_teams_usa_sum sports_teams_usa_count interested_in_deals_sum interested_in_deals_count business_oriented_sum business_oriented_count interested_in_cooking_sum interested_in_cooking_count music_lover_sum music_lover_count beauty_sum beauty_count follows_fashion_sum follows_fashion_count likes_wrestling_sum likes_wrestling_count name_sum name_count shopper_sum shopper_count golf_sum golf_count vegetarian_sum vegetarian_count dating_sum dating_count interested_in_fashion_sum interested_in_fashion_count interested_in_news_sum interested_in_news_count likes_tennis_sum likes_tennis_count male_sum male_count interested_in_cars_sum interested_in_cars_count follows_bloggers_sum follows_bloggers_count entertainment_sum entertainment_count interested_in_books_sum interested_in_books_count has_kids_sum has_kids_count interested_in_movies_sum interested_in_movies_count musicians_sum musicians_count tech_oriented_sum tech_oriented_count female_sum female_count has_pet_sum has_pet_count practicing_sports_sum practicing_sports_count "
+ "--types numeric word numeric word word word numeric word word word numeric "
+ "--features 100 --passes 1 --rate 50\"");
If you opt for this, then I suggest reading this article first:
When Runtime.exec() won't
This way the application will run in a different process.
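As an alternative to passing one long command string through a shell, the same call can be sketched with ProcessBuilder, where each option and value is a separate list element and no shell quoting is involved. This is a minimal sketch: the class MahoutLauncher, its method names, and the mahoutPath parameter are hypothetical, and the option names are copied from the command line in the question.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MahoutLauncher {
    /** Builds the argument list for "mahout trainlogistic"; every option and value is its own element. */
    static List<String> buildCommand(String mahoutPath, String inputCsv,
                                     List<String> predictors, List<String> types) {
        List<String> cmd = new ArrayList<>();
        cmd.addAll(Arrays.asList(mahoutPath, "trainlogistic",
                "--input", inputCsv,
                "--output", "./model",
                "--target", "hd_click", "--categories", "2"));
        cmd.add("--predictors");
        cmd.addAll(predictors);
        cmd.add("--types");
        cmd.addAll(types);
        cmd.addAll(Arrays.asList("--features", "100", "--passes", "1", "--rate", "50"));
        return cmd;
    }

    /** Starts the process, forwards its output to this JVM's console, and returns the exit code. */
    static int run(List<String> cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        return p.waitFor();
    }
}
```

Keeping the predictor names in a List also avoids the very long string literal, since the names can be read from the CSV header or a configuration file instead.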
Additionally, you can follow the section 'Integration with your application' from the following site:
Recommender Documentation
This is also a good reference on writing a recommender:
Introducing Apache Mahout
Hope this helps.
Cheers

How to write a Ruby-regex pattern in Java (includes recursive named-grouping)?

Well... I have a file containing a TinTin script. I have already managed to grab all actions and substitutions from it and show them properly ordered on a website using Ruby, which helps me keep an overview.
Example TINTIN-script
#substitution {You tell {([a-zA-Z,\-\ ]*)}, %*$}
{<279>[<269> $sysdate[1]<279>, <269>$systime<279> |<219> Tell <279>] <269>to <219>%2<279> : <219>%3}
{4}
#substitution {{([a-zA-Z,\-\ ]*)} tells you, %*$}
{<279>[<269> $sysdate[1]<279>, <269>$systime<279> |<119> Tell <279>] <269>from <119>%2<279> : <119>%3}
{2}
#action {Your muscles suddenly relax, and your nimbleness is gone.}
{
#if {$sw_keepaon}
{
aon;
};
} {5}
#action {xxxxx}
{
#if {$sw_keepfamiliar}
{
familiar $familiar;
};
} {5}
To grab them in my Ruby app, I read my script file into a variable 'input' and then use the following pattern to scan 'input':
pattern = /(?<braces>{([^{}]|\g<braces>)*}){0}^#(?<type>action|substitution)\s*(?<b1>\g<braces>)\s*(?<b2>\g<braces>)\s*(?<b3>\g<braces>)/im
input = ""
File.open("/home/igambin/lmud/lmud.tt") { |file| input = file.read }
input.scan(pattern) { |prio, type, pattern, code|
## here i usually create objects, but for simplicity only output now
puts "Type : #{type}"
puts "Pattern : #{pattern}"
puts "Priority: #{prio}"
puts "Code :\n#{code}"
puts
}
Now my idea is to use the NetBeans Platform to write a module that not only gives an overview but also assists in editing the TinTin script file. So after opening the file in an editor window, I still need to parse it and have all 'actions' and 'substitutions' grabbed and displayed in an ETable, in which I could double-click an item to open a modification window.
I've set up the module and got everything ready so far; I just can't figure out how to translate the Ruby regex pattern I've written into a working Java regex pattern. It seems that named group capturing, and especially the recursive application of these groups, is not supported in Java. Without that I seem to be unable to find a working solution...
Here's the ruby pattern again...
pattern = /(?<braces>{([^{}]|\g<braces>)*}){0}^#(?<type>action|substitution)\s*(?<b1>\g<braces>)\s*(?<b2>\g<braces>)\s*(?<b3>\g<braces>)/im
Can anyone help me to create a java pattern that matches the same?
Many thanks in advance for tips, hints, and ideas, and especially for solutions (or close-to-solution comments)!

Your text format seems pretty simple; it's possible you don't really need recursive matching. This Java-compatible regex matches your sample data correctly, as far as I can tell:
(?s)#(substitution|action)\s*\{(.*?)\}\s*\{(.*?)\}\s*\{(\d+)\}
Would that work for you? If you run Java 7, you can even name the groups. ;)

Can anyone help me to create a java pattern that matches the same?
No, no one can: Java's regex engine does not support recursive patterns (as Ruby 1.9 does).
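What can be done instead is to let a regex find only the command keyword and match the nested braces with a small hand-written scanner. The following is a minimal sketch under that assumption, not a translation of the Ruby pattern: the class TintinScanner and its methods are hypothetical, and it does not treat escaped braces specially.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TintinScanner {
    // The regex only locates the keyword at the start of a line; braces are counted by hand.
    private static final Pattern HEADER =
            Pattern.compile("^#(action|substitution)", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    /** Returns the index just past the '}' that balances the '{' at index open, or -1 if unbalanced. */
    static int endOfGroup(String s, int open) {
        int depth = 0;
        for (int i = open; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                return i + 1;
            }
        }
        return -1;
    }

    /** Each result is {type, group1, group2, group3}; the groups keep their outer braces. */
    static List<String[]> scan(String input) {
        List<String[]> results = new ArrayList<>();
        Matcher m = HEADER.matcher(input);
        int from = 0;
        while (from <= input.length() && m.find(from)) {
            String[] entry = new String[4];
            entry[0] = m.group(1);
            int pos = m.end();
            boolean ok = true;
            for (int g = 1; g <= 3 && ok; g++) {
                // Skip whitespace (including newlines) between the brace groups.
                while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
                    pos++;
                }
                int end = (pos < input.length() && input.charAt(pos) == '{')
                        ? endOfGroup(input, pos) : -1;
                if (end < 0) {
                    ok = false;
                } else {
                    entry[g] = input.substring(pos, end);
                    pos = end;
                }
            }
            if (ok) {
                results.add(entry);
            }
            // Continue after the consumed groups, or right after the keyword if parsing failed.
            from = ok ? pos : m.end();
        }
        return results;
    }
}
```

Calling TintinScanner.scan(input) on the file contents returns one String[] per action or substitution, which could then be mapped to table rows.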

Way to view a parsed file output?

I am Vinod, and I am interested in using ANTLR v3.3 to generate a C parser in a Java project and to produce the parse tree in some viewable form. I got help writing the grammar from this tutorial.
ANTLR generates lexer and parser files for the grammar, but I don't quite understand how the output of these generated files can be viewed. E.g. in a few examples from the above article, the author generated output using ASTFrame. I found only an interpreter option in ANTLRWorks, which shows a tree but gives an error when there are more predicates.
Any good reference book or article would be really helpful.

There's only one book you need:
The Definitive ANTLR Reference: Building Domain-Specific Languages.
After that, many more excellent books exist (w.r.t. DSL creation), but this is the book for getting started with ANTLR.

As you saw, ANTLRWorks will print both parse trees and the AST but won't work with predicates and the C target. While not a nice picture like ANTLRWorks, this will print a text version of the AST when you pass it the root of your tree.
#include <stdio.h>
#include <antlr3.h>

/* Recursively prints the text of every descendant of thisNode, indented by depth */
void printNodes(pANTLR3_BASE_TREE thisNode, int level)
{
    ANTLR3_UINT32 numChildren = thisNode->getChildCount(thisNode);
    //printf("Child count %u\n", numChildren);
    pANTLR3_BASE_TREE loopNode;
    for (ANTLR3_UINT32 i = 0; i < numChildren; i++)
    {
        //Need to cast since these can hold anything
        loopNode = (pANTLR3_BASE_TREE)thisNode->getChild(thisNode, i);
        //Print this node, indented by one space per level
        pANTLR3_STRING thisText = loopNode->getText(loopNode);
        for (int j = 0; j < level; j++)
            printf(" ");
        printf("%s\n", thisText->chars);
        //If this node has a child
        if (loopNode->getChildCount(loopNode) > 0)
            printNodes(loopNode, level + 2);
    }
}
