I'm trying to test a "Hello world" of the Stanford POS tagger API in Java (I used the same .jar from Python and it worked well) on French sentences.
Here is my code:
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class TextPreprocessor {

    private static MaxentTagger tagger = new MaxentTagger("../stanford-tagger-4.1.0/stanford-postagger-full-2020-08-06/models/french-ud.tagger");

    public static void main(String[] args) {
        String taggedString = tagger.tagString("Salut à tous, je suis coincé");
        System.out.println(taggedString);
    }
}
But I get the following exception:
Loading POS tagger from C:/Users/_Nprime496_/Downloads/Compressed/stanford-tagger-4.1.0/stanford-postagger-full-2020-08-06/models/french-ud.tagger ... done [0.3 sec].
Exception in thread "main" java.lang.IllegalArgumentException: PTBLexer: Invalid options key in constructor: asciiQuotes
at edu.stanford.nlp.process.PTBLexer.<init>(PTBLexer.java)
at edu.stanford.nlp.process.PTBTokenizer.<init>(PTBTokenizer.java:285)
at edu.stanford.nlp.process.PTBTokenizer$PTBTokenizerFactory.getTokenizer(PTBTokenizer.java:698)
at edu.stanford.nlp.process.DocumentPreprocessor$PlainTextIterator.<init>(DocumentPreprocessor.java:271)
at edu.stanford.nlp.process.DocumentPreprocessor.iterator(DocumentPreprocessor.java:226)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.tokenizeText(MaxentTagger.java:1148)
at edu.stanford.nlp.tagger.maxent.MaxentTagger$TaggerWrapper.apply(MaxentTagger.java:1332)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.tagString(MaxentTagger.java:999)
at modules.generation.preprocessing.TextPreprocessor.main(TextPreprocessor.java:19)
Can you help me?
You can use this code and the full CoreNLP package:
package edu.stanford.nlp.examples;

import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.util.*;

import java.util.*;

public class PipelineExample {

    public static String text = "Paris est la capitale de la France.";

    public static void main(String[] args) {
        // set up pipeline properties
        Properties props = StringUtils.argsToProperties("-props", "french");
        // set the list of annotators to run
        props.setProperty("annotators", "tokenize,ssplit,mwt,pos");
        // build pipeline
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // create a document object
        CoreDocument document = pipeline.processToCoreDocument(text);
        // display tokens
        for (CoreLabel tok : document.tokens()) {
            System.out.println(String.format("%s\t%s", tok.word(), tok.tag()));
        }
    }
}
You can download CoreNLP here: https://stanfordnlp.github.io/CoreNLP/
Make sure to download the latest French models.
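If you manage the project with Maven instead of downloading the zip, the coordinates would look roughly like this (a sketch; the 4.1.0 version and the models-french classifier shown here are assumptions, adjust them to the release you actually use):

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>4.1.0</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>4.1.0</version>
    <classifier>models-french</classifier>
</dependency>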
I am not sure why your example with the standalone tagger does not work. What jars were you using?
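If you want to stay with the standalone tagger, one way to sidestep the problem is to skip its tokenizer entirely (the PTBLexer it constructs is what throws the exception) and pass pre-tokenized text. A minimal sketch, reusing your tagger field and assuming the CoreNLP 4.x class names:

import java.util.List;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.SentenceUtils;
import edu.stanford.nlp.ling.TaggedWord;

// Tokenization is done by hand here, so the PTBLexer is never invoked.
List<HasWord> sentence = SentenceUtils.toWordList("Salut", "à", "tous", ",", "je", "suis", "coincé");
List<TaggedWord> tagged = tagger.tagSentence(sentence);
System.out.println(SentenceUtils.listToString(tagged, false));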
I'm developing a DMS (document management system).
I'm currently working on the document-handling aspect, like managing PDFs and DOCs.
Now I want my application to show all the existing PDF and DOC files on the computer, so that they can be opened when the user clicks on them.
I'm currently just focusing on PDFs and DOCs.
import java.io.File;
import java.util.Collection;

import org.apache.commons.io.FileUtils;

public class SearchDocFiles {

    public static String[] EXTENSIONS = { "doc", "docx" };

    public Collection<File> searchFilesWithExtensions(final File directory, final String[] extensions) {
        // the third argument makes the search recurse into subdirectories
        return FileUtils.listFiles(directory, extensions, true);
    }

    public static void main(String... args) {
        Collection<File> documents = new SearchDocFiles().searchFilesWithExtensions(
                new File("/path/to/document/folder"),
                SearchDocFiles.EXTENSIONS);
        for (File document : documents) {
            System.out.println(document.getName() + " - " + document.length());
        }
    }
}
This uses Apache Commons IO, specifically FileUtils.
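For the "open when the user clicks" part, a minimal sketch using java.awt.Desktop (this assumes the platform supports Desktop and has an application associated with the file type):

import java.awt.Desktop;
import java.io.File;
import java.io.IOException;

public class DocumentOpener {

    public static void open(File document) throws IOException {
        // Delegates to the OS: the file opens in its default application,
        // e.g. a PDF reader or word processor.
        if (Desktop.isDesktopSupported() && Desktop.getDesktop().isSupported(Desktop.Action.OPEN)) {
            Desktop.getDesktop().open(document);
        }
    }
}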
I am trying to write my own program in Java to segment a set of text files into sentences. I searched the available NLP tools and found GATE, but I couldn't use it to just segment with the pipeline.
Any ideas how to limit the functionality of the pipeline?
Any piece of code that can help me write my program?
Adapted from a different answer:
import gate.*;
import gate.creole.SerialAnalyserController;

import java.io.File;
import java.util.*;

public class Segmenter {

    public static void main(String[] args) throws Exception {
        Gate.setGateHome(new File("C:\\Program Files\\GATE_Developer_8.0"));
        Gate.init();

        registerGatePlugin("ANNIE");

        // a pipeline containing only a tokeniser and a sentence splitter
        SerialAnalyserController pipeline = (SerialAnalyserController) Factory.createResource("gate.creole.SerialAnalyserController");
        pipeline.add((ProcessingResource) Factory.createResource("gate.creole.tokeniser.DefaultTokeniser"));
        pipeline.add((ProcessingResource) Factory.createResource("gate.creole.splitter.SentenceSplitter"));

        Corpus corpus = Factory.newCorpus("SegmenterCorpus");
        Document document = Factory.newDocument("Text to be segmented.");
        corpus.add(document);
        pipeline.setCorpus(corpus);
        pipeline.execute();

        AnnotationSet defaultAS = document.getAnnotations();
        AnnotationSet sentences = defaultAS.get("Sentence");
        for (Annotation sentence : sentences) {
            System.err.println(Utils.stringFor(document, sentence));
        }

        // Clean up
        Factory.deleteResource(document);
        Factory.deleteResource(corpus);
        for (ProcessingResource pr : pipeline.getPRs()) {
            Factory.deleteResource(pr);
        }
        Factory.deleteResource(pipeline);
    }

    public static void registerGatePlugin(String name) throws Exception {
        Gate.getCreoleRegister().registerDirectories(new File(Gate.getPluginsHome(), name).toURI().toURL());
    }
}
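Note that iterating over an AnnotationSet does not guarantee document order. If the sentences should come out in the order they appear in the text, a small adjustment using GATE's gate.Utils helper is possible (assuming a GATE version that provides it):

for (Annotation sentence : gate.Utils.inDocumentOrder(sentences)) {
    System.out.println(gate.Utils.stringFor(document, sentence));
}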
I'm trying to build an Eclipse project for Stanford CoreNLP but I am getting errors in my customized properties file, created as a class with a main method.
Could anyone help me solve the following errors with my new "properties (.prop)" code, please?
The errors returned by Eclipse are:
Multiple markers at this line (line 25):
The type List is not generic; it cannot be parameterized with arguments
Multiple markers at this line (line 27):
The method get(Class) is undefined for the type CoreMap
CollapsedCCProcessedDependenciesAnnotation cannot be resolved to a type
SemanticGraph cannot be resolved to a type
My properties code is:
import java.util.Properties;
import java.util.Hashtable;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import java.awt.List;
import java.io.FileWriter;
import java.io.IOException;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation;

public class NLP {

    /**
     * @param <CoreMap>
     * @param args
     */
    public static <CoreMap> void main(String[] args) {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, depparse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        String text = "This is a sentence I want to parse.";
        Annotation document = new Annotation(text);
        pipeline.annotate(document);
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
            System.out.println(dependencies.toDotFormat());
        }
    }
}
Thanks for your help!
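For reference, a corrected sketch of the class above, assuming the CoreNLP code and model jars are on the Eclipse build path (the "cannot be resolved to a type" markers usually mean they are not): the other two errors come from importing java.awt.List instead of java.util.List, and from the <CoreMap> type parameter on main, which hides edu.stanford.nlp.util.CoreMap:

import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation;
import edu.stanford.nlp.util.CoreMap;

public class NLP {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, depparse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("This is a sentence I want to parse.");
        pipeline.annotate(document);

        // java.util.List and edu.stanford.nlp.util.CoreMap, not java.awt.List
        // or a <CoreMap> type parameter on main
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
            System.out.println(dependencies.toDotFormat());
        }
    }
}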
I am using the Stanford Parser to get the dependencies from a text, sentence by sentence, like this:

// load the parser model
LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");

Reader reader = new StringReader("The room was not nice. It was bright, but cold.");
TreebankLanguagePack tlp = new PennTreebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();

// the dependencies of the entire text
List<TypedDependency> textDependencies = new ArrayList<TypedDependency>();

// get the dependencies of each sentence and add them to the list
for (List<HasWord> sentence : new DocumentPreprocessor(reader)) {
    Tree parse = lp.apply(sentence);
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    textDependencies.addAll(gs.typedDependenciesCCprocessed());
}
After running the code from above, the list called textDependencies will contain the following dependencies:
det(room-2, The-1)
nsubj(nice-5, room-2)
cop(nice-5, was-3)
neg(nice-5, not-4)
root(ROOT-0, nice-5)
nsubj(bright-3, It-1)
nsubj(cold-6, It-1)
cop(bright-3, was-2)
root(ROOT-0, bright-3)
conj_but(bright-3, cold-6)
Is there a way to find out who "it" is, to get something showing that it is actually the room?
What you want is called coreference resolution. Stanford CoreNLP does that already. I couldn't find a demo of how to do it programmatically, but if you are running the precompiled executable, you need to add dcoref to the list of annotators, like this:
java -cp <all jars> edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt
Here is a Java code example of Stanford CoreNLP coreference resolution (as suggested by mbatchkarov):
import java.util.List;
import java.util.Map;
import java.util.Properties;

import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefChain.CorefMention;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class StanfordExample {

    protected StanfordCoreNLP pipeline;

    public StanfordExample() {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        this.pipeline = new StanfordCoreNLP(props);
    }

    public void getCoref(String documentText) {
        Annotation document = new Annotation(documentText);
        this.pipeline.annotate(document);

        Map<Integer, CorefChain> corefChains = document.get(CorefChainAnnotation.class);
        for (CorefChain chain : corefChains.values()) {
            List<CorefMention> mentionsInTextualOrder = chain.getMentionsInTextualOrder();
            for (CorefMention corefMention : mentionsInTextualOrder) {
                System.out.println(corefMention.toString());
            }
        }
    }

    public static void main(String[] args) {
        String text = "The room was not nice. It was bright, but cold.";
        StanfordExample slem = new StanfordExample();
        slem.getCoref(text);
    }
}
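To actually map "It" back to "The room", each CorefChain has a representative mention, which is usually the most informative mention in the chain; a small variation of the inner loop above (mentionSpan is the surface text of a mention):

CorefMention representative = chain.getRepresentativeMention();
for (CorefMention corefMention : chain.getMentionsInTextualOrder()) {
    // prints e.g. "It -> The room"
    System.out.println(corefMention.mentionSpan + " -> " + representative.mentionSpan);
}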
You can also look at StanfordCoreNlpDemo.java in the downloaded CoreNLP zip.
I am trying to open a Microsoft Word document using Jacob.
Below is the code:
import com.jacob.activeX.ActiveXComponent;
import com.jacob.com.ComThread;
import com.jacob.com.Dispatch;
import com.jacob.com.Variant;

public class openWordDocument {

    private static final Integer wdNewBlankDocument = new Integer(0);
    private static final Variant vTrue = new Variant(true);
    private static final Variant vFalse = new Variant(false);
    private static ActiveXComponent activeXWord = null;
    private static Object activeXWordObject = null;

    public static void main(String[] args) {
        try {
            activeXWord = new ActiveXComponent("Word.Application");
            activeXWordObject = activeXWord.getObject();
            Dispatch.put(activeXWordObject, "Visible", vTrue);
            //activeXWordObject = null;
        } catch (Exception e) {
            quit();
        }
    }

    public static void quit() {
        if (activeXWord != null) {
            System.out.println("quit word");
            // calls the Quit method of MS Word, which will close MS Word
            activeXWord.invoke("Quit", new Variant[] {});
            ComThread.Release();
            activeXWord.release();
            System.out.println("quit word");
        }
    }
}
When I run the above code I get the error: Could not find or load main class openWordDocument
It was my mistake: I had added a .dll file to the classpath, so I was unable to compile the Java file. After I removed that .dll, the JVM compiled the code and was able to find the class file.
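Note that the native Jacob DLL belongs on the Java native library path, not on the classpath; only jacob.jar goes on the classpath. A sketch of the usual invocation (the DLL folder and jar name here are placeholders):

java -cp .;jacob.jar -Djava.library.path=C:\path\to\jacob\dlls openWordDocument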
Warning!
Check the paths of the external libraries (such as .jar files) added to your project.
A path should have a regular format: for example, it should not contain special characters such as "+" or spaces.
I ran into this problem before in the Eclipse IDE; I changed the path of my project library and then everything was okay again.