NLP - can I find out who "it" is? - java

I am using the Stanford Parser to get the dependencies from a text, sentence by sentence like this:
// lp is a LexicalizedParser, e.g. loaded from the bundled English PCFG model
LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");
Reader reader = new StringReader("The room was not nice. It was bright, but cold.");
TreebankLanguagePack tlp = new PennTreebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
// the dependencies of the entire text
List<TypedDependency> textDependencies = new ArrayList<TypedDependency>();
// get the dependencies of each sentence and add them to the list
for (List<HasWord> sentence : new DocumentPreprocessor(reader)) {
    Tree parse = lp.apply(sentence);
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    textDependencies.addAll(gs.typedDependenciesCCprocessed());
}
After running the code above, the list textDependencies will contain the following dependencies:
det(room-2, The-1)
nsubj(nice-5, room-2)
cop(nice-5, was-3)
neg(nice-5, not-4)
root(ROOT-0, nice-5)
nsubj(bright-3, It-1)
nsubj(cold-6, It-1)
cop(bright-3, was-2)
root(ROOT-0, bright-3)
conj_but(bright-3, cold-6)
Is there a way to find out who "it" is, i.e. to get something showing that "it" is actually the room?

What you want is called coreference resolution. Stanford CoreNLP already does that. I couldn't find a demo of how it is done programmatically, but if you are running the precompiled executable, you need to add dcoref to the list of annotators, like this:
java -cp <all jars> edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt

Here is a Java code example of Stanford CoreNLP coreference resolution (as suggested by mbatchkarov):
import java.util.List;
import java.util.Map;
import java.util.Properties;

import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefChain.CorefMention;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class StanfordExample {

    protected StanfordCoreNLP pipeline;

    public StanfordExample() {
        // dcoref needs the full preprocessing chain up to parsing
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        this.pipeline = new StanfordCoreNLP(props);
    }

    public void getCoref(String documentText) {
        Annotation document = new Annotation(documentText);
        this.pipeline.annotate(document);
        // one CorefChain per entity; each chain holds all mentions of that entity
        Map<Integer, CorefChain> corefChains = document.get(CorefChainAnnotation.class);
        for (CorefChain chain : corefChains.values()) {
            List<CorefMention> mentionsInTextualOrder = chain.getMentionsInTextualOrder();
            for (CorefMention corefMention : mentionsInTextualOrder) {
                System.out.println(corefMention.toString());
            }
        }
    }

    public static void main(String[] args) {
        String text = "The room was not nice. It was bright, but cold.";
        StanfordExample slem = new StanfordExample();
        slem.getCoref(text);
    }
}
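If you want the resolution itself (i.e. that "It" refers to "the room"), each CorefChain exposes a representative mention. A minimal sketch, assuming the pipeline from the example above, placed inside getCoref after the corefChains map is retrieved:

// print which mentions in each chain resolve to the chain's representative mention
for (CorefChain chain : corefChains.values()) {
    CorefMention representative = chain.getRepresentativeMention();
    for (CorefMention mention : chain.getMentionsInTextualOrder()) {
        if (mention != representative) {
            System.out.println("\"" + mention.mentionSpan + "\" refers to \"" + representative.mentionSpan + "\"");
        }
    }
}

For the example text this should print a line linking "It" back to "The room" (the exact spans depend on the model version).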

Use StanfordCoreNlpDemo.java from the downloaded CoreNLP zip.

Related

IllegalArgumentException: PTBLexer: Invalid options key in constructor: asciiQuotes Stanford NLP

I'm trying to test the Hello World of the Stanford POS tagger API in Java (I used the same .jar in Python and it worked well) on French sentences.
Here is my code:
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class TextPreprocessor {

    private static MaxentTagger tagger = new MaxentTagger("../stanford-tagger-4.1.0/stanford-postagger-full-2020-08-06/models/french-ud.tagger");

    public static void main(String[] args) {
        String taggedString = tagger.tagString("Salut à tous, je suis coincé");
        System.out.println(taggedString);
    }
}
But I get the following exception:
Loading POS tagger from C:/Users/_Nprime496_/Downloads/Compressed/stanford-tagger-4.1.0/stanford-postagger-full-2020-08-06/models/french-ud.tagger ... done [0.3 sec].
Exception in thread "main" java.lang.IllegalArgumentException: PTBLexer: Invalid options key in constructor: asciiQuotes
at edu.stanford.nlp.process.PTBLexer.<init>(PTBLexer.java)
at edu.stanford.nlp.process.PTBTokenizer.<init>(PTBTokenizer.java:285)
at edu.stanford.nlp.process.PTBTokenizer$PTBTokenizerFactory.getTokenizer(PTBTokenizer.java:698)
at edu.stanford.nlp.process.DocumentPreprocessor$PlainTextIterator.<init>(DocumentPreprocessor.java:271)
at edu.stanford.nlp.process.DocumentPreprocessor.iterator(DocumentPreprocessor.java:226)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.tokenizeText(MaxentTagger.java:1148)
at edu.stanford.nlp.tagger.maxent.MaxentTagger$TaggerWrapper.apply(MaxentTagger.java:1332)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.tagString(MaxentTagger.java:999)
at modules.generation.preprocessing.TextPreprocessor.main(TextPreprocessor.java:19)
Can you help me?
You can use this code and the full CoreNLP package:
package edu.stanford.nlp.examples;

import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.util.*;

import java.util.*;

public class PipelineExample {

    public static String text = "Paris est la capitale de la France.";

    public static void main(String[] args) {
        // set up pipeline properties with the French defaults
        Properties props = StringUtils.argsToProperties("-props", "french");
        // set the list of annotators to run
        props.setProperty("annotators", "tokenize,ssplit,mwt,pos");
        // build pipeline
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // create a document object
        CoreDocument document = pipeline.processToCoreDocument(text);
        // display tokens with their part-of-speech tags
        for (CoreLabel tok : document.tokens()) {
            System.out.println(String.format("%s\t%s", tok.word(), tok.tag()));
        }
    }
}
You can download CoreNLP here: https://stanfordnlp.github.io/CoreNLP/
Make sure to download the latest French models.
I am not sure why your example with the standalone tagger does not work. What jars were you using?
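For completeness, a sketch of the invocation, assuming the example was compiled in the current directory; the jar names are illustrative and depend on the version you downloaded (on Windows use ; instead of : as the path separator):

java -cp ".:stanford-corenlp-4.5.4.jar:stanford-corenlp-4.5.4-models.jar:stanford-corenlp-4.5.4-models-french.jar" edu.stanford.nlp.examples.PipelineExample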

Set options in Stanford CoreNLP tokenizer

I adapted Prof. Manning's code sample from here to read in a file, tokenize it, part-of-speech tag it, and lemmatize it.
Now I have run into the issue of untokenizable characters, and I would like to use the "untokenizable" option and set it to "noneKeep".
Other questions on Stack Overflow explain that I would need to instantiate the tokenizer myself. However, I am not sure how to do that so that the subsequent tasks (POS tagging etc.) are still performed as needed. Can anyone point me in the right direction?
// expects two command line parameters: one file to be read, one to write to
import java.io.*;
import java.util.*;

import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.util.*;

public class StanfordCoreNlpDemo {

    public static void main(String[] args) throws IOException {
        PrintWriter out = new PrintWriter(args[1]);
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation annotation = new Annotation(IOUtils.slurpFileNoExceptions(args[0]));
        pipeline.annotate(annotation);
        pipeline.prettyPrint(annotation, out);
    }
}
Add this to your code:
props.setProperty("tokenize.options", "untokenizable=allKeep");
The 6 options for untokenizable are:
noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep
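In each value, the first half controls how many warnings are logged (for none, only the first, or all untokenizable characters) and the second half controls whether those characters are deleted or kept as single-character tokens. There is no need to instantiate the tokenizer yourself; the option goes through the pipeline properties, so the downstream annotators are unaffected. A minimal sketch with the noneKeep setting asked about:

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
// keep untokenizable characters as single-character tokens, log no warnings
props.setProperty("tokenize.options", "untokenizable=noneKeep");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);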

Why does Stanford dcoref stop throwing warnings on bad text when 2 Stanford NLP pipelines are created?

I am using Stanford CoreNLP 3.5.1 with the SR parser model.
When running the code below, the dcoref annotator logs warnings in the method "try1" but not in "try2", in which an additional (unused) pipeline is created in the same scope.
import java.util.Properties;

import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class NLPBug {

    static String badText = "Colonial Williamsburg A historical researcher has compiled beauty tips from the time and created a hog's lard moisturizer.";

    public static void main(String[] args) {
        Properties p1 = new Properties();
        p1.put("annotators", "tokenize, ssplit, cleanxml, pos, lemma, ner, entitymentions, parse, dcoref");
        p1.put("parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz");
        Properties p2 = new Properties();
        p2.put("annotators", "tokenize, ssplit, pos, lemma");
        try1(p1);
        try2(p1, p2);
    }

    public static void try1(Properties p1) {
        StanfordCoreNLP nlp = new StanfordCoreNLP(p1);
        System.out.println();
        System.out.println("Annotating (1)...");
        nlp.process(badText);
        System.out.print("Done");
    }

    public static void try2(Properties p1, Properties p2) {
        StanfordCoreNLP nlp = new StanfordCoreNLP(p1);
        StanfordCoreNLP notUsed = new StanfordCoreNLP(p2);
        System.out.println();
        System.out.println("Annotating (2)");
        nlp.process(badText);
        System.out.println("Done...");
    }
}
The warnings "RuleBasedCorefMentionFinder: Failed to find head token" and "RuleBasedCorefMentionFinder: Last resort: returning as head: moisturizer" occur only in try1. What is the reason for this behaviour, and is there a way to fix these warnings?

how to fix .props file error for Stanford NLP built with Eclipse?

I'm trying to build an Eclipse project for Stanford CoreNLP but am getting errors for my customized properties file, created as a class with a main method.
Could anyone help me solve the following errors with my new "properties (.prop)" code, please?
The errors returned by Eclipse are:
Multiple markers at this line (line 25):
The type List is not generic; it cannot be parameterized with arguments
Multiple markers at this line (line 27):
The method get(Class) is undefined for the type CoreMap
CollapsedCCProcessedDependenciesAnnotation cannot be resolved to a type
SemanticGraph cannot be resolved to a type
My properties code is:
import java.util.Properties;
import java.util.Hashtable;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import java.awt.List;
import java.io.FileWriter;
import java.io.IOException;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation;

public class NLP {

    /**
     * @param <CoreMap>
     * @param args
     */
    public static <CoreMap> void main(String[] args) {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, depparse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        String text = "This is a sentence I want to parse.";
        Annotation document = new Annotation(text);
        pipeline.annotate(document);
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
            System.out.println(dependencies.toDotFormat());
        }
    }
}
Thanks for your help!
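For reference, the markers are consistent with two problems in the class itself rather than in any .props file: java.awt.List is imported in place of java.util.List (the AWT List is not generic, hence "The type List is not generic"), and the <CoreMap> type parameter on main shadows edu.stanford.nlp.util.CoreMap, which breaks the get(Class) call. The "cannot be resolved to a type" markers additionally suggest the CoreNLP jar may be missing from the Eclipse build path. A minimal corrected sketch (untested; assumes a CoreNLP version in which the depparse annotator fills the collapsed-CC-processed graph, otherwise use parse):

import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation;
import edu.stanford.nlp.util.CoreMap;

public class NLP {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, depparse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation document = new Annotation("This is a sentence I want to parse.");
        pipeline.annotate(document);
        // java.util.List and the real CoreMap now resolve correctly
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
            System.out.println(dependencies.toDotFormat());
        }
    }
}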

Stanford CoreNLP sentiment

I'm trying to implement the CoreNLP sentiment analyzer in Eclipse. I am getting the error:
Unable to resolve "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz" as either class path, filename or URL
I installed all of the NLP files using Maven, so I am not sure why it is looking for something else. Here is the code I am getting the error on.
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.util.CoreMap;

public class StanfordSentiment {

    StanfordCoreNLP pipeline;

    public StanfordSentiment() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
        pipeline = new StanfordCoreNLP(props);
    }

    public float calculateSentiment(String text) {
        float mainSentiment = 0;
        int longest = 0;
        Annotation annotation = pipeline.process(text);
        for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
            Tree tree = sentence.get(SentimentCoreAnnotations.AnnotatedTree.class);
            // predicted class is 0..4; shift to -2..2 so 0 is neutral
            int sentiment = RNNCoreAnnotations.getPredictedClass(tree) - 2;
            String partText = sentence.toString();
            // keep the sentiment of the longest sentence
            if (partText.length() > longest) {
                mainSentiment = sentiment;
                longest = partText.length();
            }
        }
        return mainSentiment;
    }
}
import java.io.IOException;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

public class SentimentAnalysis {

    public static void main(String[] args) throws IOException {
        String text = "I am very happy";
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse, sentiment");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation annotation = pipeline.process(text);
        List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            // SentimentCoreAnnotations.ClassName holds the string label, e.g. "Positive"
            String sentiment = sentence.get(SentimentCoreAnnotations.ClassName.class);
            System.out.println(sentiment + "\t" + sentence);
        }
    }
}
Hope it will help..:)
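As an aside, the "Unable to resolve ... englishPCFG.ser.gz" error usually means the models jar is not on the classpath: with Maven, the models ship as a separate artifact distinguished by the models classifier. A pom.xml sketch (the version number is illustrative):

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.9.2</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.9.2</version>
    <classifier>models</classifier>
</dependency>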
