I adapted Prof. Manning's code sample from here to read in a file, tokenize, part-of-speech-tag, and lemmatize it.
Now I came across the issue of untokenizable characters, and I would like to use the "untokenizable" option and set it to "noneKeep".
Other questions on StackOverflow explain that I would need to instantiate the tokenizer myself. However, I am not sure how to do that so that the following tasks (POS tagging etc.) are still performed as needed. Can anyone point me in the right direction?
// expects two command line parameters: one file to be read, one to write to
import java.io.*;
import java.util.*;
import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.util.*;
public class StanfordCoreNlpDemo {
    public static void main(String[] args) throws IOException {
        PrintWriter out = new PrintWriter(args[1]);
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation annotation = new Annotation(IOUtils.slurpFileNoExceptions(args[0]));
        pipeline.annotate(annotation);
        pipeline.prettyPrint(annotation, out);
    }
}
Add this to your code before you construct the pipeline (substitute whatever value you want, e.g. the noneKeep from your question, for allKeep):
props.setProperty("tokenize.options", "untokenizable=allKeep");
The six possible values for untokenizable are:
noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep
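For context, here is a minimal sketch of where that property fits into the pipeline setup from the question, using the noneKeep value the question asks for; the rest of the pipeline (pos, lemma) is unchanged, so there is no need to instantiate the tokenizer yourself:
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
// keep untokenizable characters in the text and print no warnings about them
props.setProperty("tokenize.options", "untokenizable=noneKeep");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);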
Please keep in mind that I am completely new to Java. I don't know what 'classes' and stuff are.
When trying to compile (javac -g Sphinx.java) this code:
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.PrintWriter;
import api.Configuration;
import api.SpeechResult;
import api.LiveSpeechRecognizer;
public class Sphinx {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        configuration.setAcousticModelPath("models/en-us/en-us");
        configuration.setDictionaryPath("models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("models/en-us/en-us.lm.bin");

        PrintWriter pw = new PrintWriter(new PrintWriter("status.txt"));

        LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);
        recognizer.startRecognition(true);
        pw.print("running");

        SpeechResult result = recognizer.getResult();
        recognizer.stopRecognition();
        pw.print("stopped");
        pw.close();

        PrintWriter pw2 = new PrintWriter(new PrintWriter("result.txt"));
        pw2.println(result);
        pw2.close();
    }
}
I get this error:
Sphinx.java:8: error: cannot access Configuration
import api.Configuration;
^
bad source file: .\api\Configuration.java
file does not contain class api.Configuration
Please remove or make sure it appears in the correct subdirectory of the sourcepath.
I don't quite understand what 'file does not contain class api.Configuration' means, or how to fix it.
Looking at your error message, it seems like your ./api/Configuration.java file is missing a package declaration.
Make sure that the first line of ./api/Configuration.java is
package api;
This tells the compiler that the class belongs to the api package rather than the default package, so that import api.Configuration; can find it.
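For illustration, a minimal sketch of what ./api/Configuration.java would have to look like for the import to resolve (the field and setter here are purely illustrative, not the real Sphinx API):
// ./api/Configuration.java
package api; // must match the "api" directory on the sourcepath

public class Configuration {
    private String acousticModelPath;

    public void setAcousticModelPath(String path) {
        this.acousticModelPath = path;
    }
}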
I am trying to run the simple program available on this website: https://stanfordnlp.github.io/CoreNLP/api.html
My Program
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
public class StanfordClass {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        String text = "What is the Weather in Mumbai right now?";
        Annotation document = new Annotation(text);
        pipeline.annotate(document);

        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            // traversing the words in the current sentence
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                // this is the text of the token
                String word = token.get(TextAnnotation.class);
                // this is the POS tag of the token
                String pos = token.get(PartOfSpeechAnnotation.class);
                // this is the NER label of the token
                String ne = token.get(NamedEntityTagAnnotation.class);
                System.out.println(String.format("Print: word: [%s] pos: [%s] ne: [%s]", word, pos, ne));
            }
        }
    }
}
But I am getting Exception in thread "main" java.lang.OutOfMemoryError: Java heap space.
What I tried
1. If I remove the ner (named entity recognizer) annotator from the code above, i.e. props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse");, then the code runs fine.
2. But I require ner (named entity recognition), so I increased the heap size in the eclipse.ini file to 1g. I am sure that this is far more than enough for this program, so heap size should not be the problem here. I think something is missing, but I can't figure out what.
After a lot of searching I got the answer here:
Using Stanford CoreNLP
Use the following steps (note that eclipse.ini only sets the heap for the Eclipse IDE itself; the program you launch runs in its own JVM, so the -Xmx setting has to go into the JRE's default VM arguments or into the run configuration):
1. Windows -> Preferences
2. Java -> Installed JREs
3. Select the JRE and click on Edit
4. In the Default VM arguments field, type "-Xmx1024M" (or your memory preference; for 1GB of RAM it is 1024)
5. Click on Finish or OK.
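If you run the class from the command line instead of from Eclipse, the heap size can be passed straight to the JVM; the jar names below are placeholders for whatever CoreNLP jars are actually on your classpath:
java -Xmx1024m -cp stanford-corenlp.jar:stanford-corenlp-models.jar:. StanfordClass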
I am using Stanford CoreNLP 3.5.1 with the SR parser model.
When running the code below, the dcoref annotator logs warnings in the method "try1", but not in "try2", when another (unused) pipeline is created in the same scope.
import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
public class NLPBug {
    static String badText = "Colonial Williamsburg A historical researcher has compiled beauty tips from the time and created a hog's lard moisturizer.";

    public static void main(String[] args) {
        Properties p1 = new Properties();
        p1.put("annotators",
                "tokenize, ssplit, cleanxml, pos, lemma, ner, entitymentions, parse, dcoref");
        p1.put("parse.model",
                "edu/stanford/nlp/models/srparser/englishSR.ser.gz");
        Properties p2 = new Properties();
        p2.put("annotators", "tokenize, ssplit, pos, lemma");
        try1(p1);
        try2(p1, p2);
    }

    public static void try1(Properties p1) {
        StanfordCoreNLP nlp = new StanfordCoreNLP(p1);
        System.out.println();
        System.out.println("Annotating (1)...");
        nlp.process(badText);
        System.out.print("Done");
    }

    public static void try2(Properties p1, Properties p2) {
        StanfordCoreNLP nlp = new StanfordCoreNLP(p1);
        StanfordCoreNLP notUsed = new StanfordCoreNLP(p2);
        System.out.println();
        System.out.println("Annotating (2)");
        nlp.process(badText);
        System.out.println("Done...");
    }
}
The warnings "RuleBasedCorefMentionFinder: Failed to find head token", "RuleBasedCorefMentionFinder: Last resort: returning as head: moisturizer" occur only in try1. What is the reason for this behaviour and is there any way to fix these warnings?
I'm trying to build an Eclipse project for Stanford CoreNLP, but I am getting errors in my customized "properties" code, which I created as a class with a main method.
Could anyone help me solve the following errors in my new "properties (.prop)" code, please?
The errors returned by Eclipse are:
Line 25 - Multiple markers at this line:
  The type List is not generic; it cannot be parameterized with arguments
Line 27 - Multiple markers at this line:
  The method get(Class) is undefined for the type CoreMap
  CollapsedCCProcessedDependenciesAnnotation cannot be resolved to a type
  SemanticGraph cannot be resolved to a type
My properties code is:
import java.util.Properties;
import java.util.Hashtable;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import java.awt.List;
import java.io.FileWriter;
import java.io.IOException;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation;
public class NLP {
    /**
     * @param <CoreMap>
     * @param args
     */
    public static <CoreMap> void main(String[] args) {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, depparse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        String text = "This is a sentence I want to parse.";
        Annotation document = new Annotation(text);
        pipeline.annotate(document);

        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
            System.out.println(dependencies.toDotFormat());
        }
    }
}
Thanks for your help!
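For reference, the first two errors come from the import java.awt.List line (java.awt.List is not the generic java.util.List) and from the <CoreMap> type parameter on main (which shadows the real edu.stanford.nlp.util.CoreMap interface, so get(Class) is undefined on it), while the "cannot be resolved to a type" errors suggest the CoreNLP jars are not fully on the project's build path. A corrected sketch, assuming a recent CoreNLP 3.x with the depparse annotator on the classpath, might look like this:
import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation;
import edu.stanford.nlp.util.CoreMap;

public class NLP {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, depparse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("This is a sentence I want to parse.");
        pipeline.annotate(document);

        // java.util.List and the CoreNLP CoreMap interface, not java.awt.List or a type variable
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
            System.out.println(dependencies.toDotFormat());
        }
    }
}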
I am using the Stanford Parser to get the dependencies from a text, sentence by sentence like this:
// lp is the parser loaded earlier (e.g. a LexicalizedParser); its setup is not shown here
Reader reader = new StringReader("The room was not nice. It was bright, but cold.");
TreebankLanguagePack tlp = new PennTreebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();

// the dependencies of the entire text
List<TypedDependency> textDependencies = new ArrayList<TypedDependency>();

// get the dependencies of each sentence and add them to the list
for (List<HasWord> sentence : new DocumentPreprocessor(reader)) {
    Tree parse = lp.apply(sentence);
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    textDependencies.addAll(gs.typedDependenciesCCprocessed());
}
After running the code from above, the list called textDependencies will contain the following dependencies:
det(room-2, The-1)
nsubj(nice-5, room-2)
cop(nice-5, was-3)
neg(nice-5, not-4)
root(ROOT-0, nice-5)
nsubj(bright-3, It-1)
nsubj(cold-6, It-1)
cop(bright-3, was-2)
root(ROOT-0, bright-3)
conj_but(bright-3, cold-6)
Is there a way to find out what "it" refers to, i.e. to get something showing that it actually refers to the room?
What you want is called coreference resolution. Stanford CoreNLP already does that. I couldn't find a demo of how it is done programmatically, but if you are running the precompiled executable you need to add dcoref to the list of annotators, like this:
java -cp <all jars> edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt
Here is a Java code example of Stanford CoreNLP coreference resolution (as suggested by mbatchkarov):
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefChain.CorefMention;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
public class StanfordExample {

    protected StanfordCoreNLP pipeline;

    public StanfordExample() {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        this.pipeline = new StanfordCoreNLP(props);
    }

    public void getCoref(String documentText) {
        Annotation document = new Annotation(documentText);
        this.pipeline.annotate(document);

        // one CorefChain per entity; each chain holds all mentions of that entity
        Map<Integer, CorefChain> corefChains = document.get(CorefChainAnnotation.class);
        for (CorefChain chain : corefChains.values()) {
            List<CorefMention> mentionsInTextualOrder = chain.getMentionsInTextualOrder();
            for (CorefMention corefMention : mentionsInTextualOrder) {
                System.out.println(corefMention.toString());
            }
        }
    }

    public static void main(String[] args) {
        String text = "The room was not nice. It was bright, but cold.";
        StanfordExample slem = new StanfordExample();
        slem.getCoref(text);
    }
}
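As a possible follow-up to the example above: to map a pronoun back to its antecedent you can also print each chain's representative mention via CorefChain.getRepresentativeMention(); for the sample text, the chain containing "It" should report "The room" as its representative mention. A one-line sketch of that addition:
// inside the for (CorefChain chain : corefChains.values()) loop
System.out.println("Representative mention: " + chain.getRepresentativeMention());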
Use StanfordCoreNlpDemo.java from the downloaded CoreNLP zip.