Text segmentation using GATE - Java

I am trying to write my own program in Java to segment a set of text files into sentences. I searched the available NLP tools and found GATE, but I couldn't get it to do just the segmentation step of the pipeline.
Any ideas how to limit the pipeline to that functionality?
Any piece of code that could help me write my program?

Adapted from a different answer:
import gate.*;
import gate.creole.SerialAnalyserController;
import java.io.File;
import java.util.*;

public class Segmenter {
    public static void main(String[] args) throws Exception {
        Gate.setGateHome(new File("C:\\Program Files\\GATE_Developer_8.0"));
        Gate.init();
        registerGatePlugin("ANNIE");

        // A reduced ANNIE pipeline: only the tokeniser and the sentence splitter.
        SerialAnalyserController pipeline = (SerialAnalyserController) Factory.createResource("gate.creole.SerialAnalyserController");
        pipeline.add((ProcessingResource) Factory.createResource("gate.creole.tokeniser.DefaultTokeniser"));
        pipeline.add((ProcessingResource) Factory.createResource("gate.creole.splitter.SentenceSplitter"));

        Corpus corpus = Factory.newCorpus("SegmenterCorpus");
        Document document = Factory.newDocument("Text to be segmented.");
        corpus.add(document);
        pipeline.setCorpus(corpus);
        pipeline.execute();

        // The sentence splitter writes "Sentence" annotations to the default set.
        AnnotationSet defaultAS = document.getAnnotations();
        AnnotationSet sentences = defaultAS.get("Sentence");
        for (Annotation sentence : sentences) {
            System.out.println(Utils.stringFor(document, sentence));
        }

        // Clean up
        Factory.deleteResource(document);
        Factory.deleteResource(corpus);
        for (ProcessingResource pr : pipeline.getPRs()) {
            Factory.deleteResource(pr);
        }
        Factory.deleteResource(pipeline);
    }

    public static void registerGatePlugin(String name) throws Exception {
        Gate.getCreoleRegister().registerDirectories(new File(Gate.getPluginsHome(), name).toURI().toURL());
    }
}
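Since the question mentions a set of text files, here is a minimal sketch of how the corpus could be filled from disk instead, placed in main before pipeline.execute() (the folder path and the .txt filter are assumptions):

// Hypothetical: load every .txt file from a folder into the corpus.
// Factory.newDocument(URL) reads the file contents into a GATE document.
File folder = new File("C:\\texts");
for (File f : folder.listFiles((dir, name) -> name.endsWith(".txt"))) {
    corpus.add(Factory.newDocument(f.toURI().toURL()));
}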

Related

How to make changes with tracked revisions in Apache POI

I am trying to edit a run using Apache POI, and I want this edit to show up as a tracked revision (so that I can accept/reject this suggestion) in a .docx file.
package apache;

import java.io.FileOutputStream;
import java.util.List;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class Main {
    public static void main(String[] args) throws Exception {
        XWPFDocument doc = new XWPFDocument(OPCPackage.open("input.docx"));
        doc.setTrackRevisions(true);
        for (XWPFParagraph p : doc.getParagraphs()) {
            List<XWPFRun> runs = p.getRuns();
            if (runs != null) {
                for (XWPFRun r : runs) {
                    String text = r.getText(0);
                    if (text != null && text.contains("needle")) {
                        text = text.replace("needle", "haystack");
                        r.setText(text, 0);
                    }
                }
            }
        }
        // Write the result once, after all paragraphs have been processed.
        doc.write(new FileOutputStream("/Users/srt/Desktop/output.docx"));
        doc.close();
    }
}
This code is able to replace the text perfectly, but I cannot show this edit as a tracked revision. I made use of the doc.setTrackRevisions(true) method, but it still does not track the revision. Any help here will be really appreciated!
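Note that setTrackRevisions(true) only sets the document-level w:trackChanges flag in settings.xml, which tells Word to track edits the user makes afterwards; it does not wrap edits made through POI in revision markup. A tracked insertion is a w:ins element in the paragraph XML, so it has to be created at the low-level ooxml-schemas layer. An untested sketch (the helper name is hypothetical; replacing text as a revision would additionally require wrapping the old run in w:del, whose text lives in w:delText rather than w:t):

import java.math.BigInteger;
import java.util.Calendar;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTR;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTRunTrackChange;

public class TrackedInsert {
    // Hypothetical helper: append text to a paragraph wrapped in <w:ins>,
    // which Word renders as an accept/reject insertion.
    static void appendTrackedInsertion(XWPFParagraph p, String text, String author) {
        CTRunTrackChange ins = p.getCTP().addNewIns();
        ins.setAuthor(author);
        ins.setDate(Calendar.getInstance());
        ins.setId(BigInteger.ONE); // revision ids should be unique within the document
        CTR run = ins.addNewR();
        run.addNewT().setStringValue(text);
    }
}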

IllegalArgumentException: PTBLexer: Invalid options key in constructor: asciiQuotes Stanford NLP

I'm trying to test the "Hello World" of the Stanford POS tagger API in Java (I used the same .jar in Python and it worked well) on French sentences.
Here is my code:
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class TextPreprocessor {
    private static MaxentTagger tagger = new MaxentTagger("../stanford-tagger-4.1.0/stanford-postagger-full-2020-08-06/models/french-ud.tagger");

    public static void main(String[] args) {
        String taggedString = tagger.tagString("Salut à tous, je suis coincé");
        System.out.println(taggedString);
    }
}
But I get the following exception:
Loading POS tagger from C:/Users/_Nprime496_/Downloads/Compressed/stanford-tagger-4.1.0/stanford-postagger-full-2020-08-06/models/french-ud.tagger ... done [0.3 sec].
Exception in thread "main" java.lang.IllegalArgumentException: PTBLexer: Invalid options key in constructor: asciiQuotes
at edu.stanford.nlp.process.PTBLexer.<init>(PTBLexer.java)
at edu.stanford.nlp.process.PTBTokenizer.<init>(PTBTokenizer.java:285)
at edu.stanford.nlp.process.PTBTokenizer$PTBTokenizerFactory.getTokenizer(PTBTokenizer.java:698)
at edu.stanford.nlp.process.DocumentPreprocessor$PlainTextIterator.<init>(DocumentPreprocessor.java:271)
at edu.stanford.nlp.process.DocumentPreprocessor.iterator(DocumentPreprocessor.java:226)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.tokenizeText(MaxentTagger.java:1148)
at edu.stanford.nlp.tagger.maxent.MaxentTagger$TaggerWrapper.apply(MaxentTagger.java:1332)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.tagString(MaxentTagger.java:999)
at modules.generation.preprocessing.TextPreprocessor.main(TextPreprocessor.java:19)
Can you help me?
You can use this code and the full CoreNLP package:
package edu.stanford.nlp.examples;

import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.util.*;
import java.util.*;

public class PipelineExample {

    public static String text = "Paris est la capitale de la France.";

    public static void main(String[] args) {
        // set up pipeline properties
        Properties props = StringUtils.argsToProperties("-props", "french");
        // set the list of annotators to run
        props.setProperty("annotators", "tokenize,ssplit,mwt,pos");
        // build pipeline
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // create a document object
        CoreDocument document = pipeline.processToCoreDocument(text);
        // display tokens
        for (CoreLabel tok : document.tokens()) {
            System.out.println(String.format("%s\t%s", tok.word(), tok.tag()));
        }
    }
}
You can download CoreNLP here: https://stanfordnlp.github.io/CoreNLP/
Make sure to download the latest French models.
I am not sure why your example with the standalone tagger does not work. What jars were you using?
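One guess (an assumption, since the classpath isn't shown): asciiQuotes is a newer tokenizer option, so an older PTBLexer picked up from a second Stanford jar on the classpath would reject it. A sketch that sidesteps the built-in tokenizer entirely by tagging pre-tokenized text (the model path is an assumption):

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.Sentence;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import java.util.List;

public class PreTokenizedTagging {
    public static void main(String[] args) {
        MaxentTagger tagger = new MaxentTagger("models/french-ud.tagger");
        // Pre-tokenized input never goes through PTBTokenizer, where the
        // asciiQuotes option is rejected.
        List<HasWord> sentence = Sentence.toWordList("Salut", "à", "tous", ",", "je", "suis", "coincé");
        List<TaggedWord> tagged = tagger.tagSentence(sentence);
        System.out.println(tagged);
    }
}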

Exception while calling Parser method outside main class

In my application I have a method that I can't execute without a main method; it only runs inside the main method. When I call that method from my servlet class, it throws an exception.
My class with the main method:
package com.books.servlet;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.util.HashSet;
import java.util.Set;
import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.parser.Parse;
import opennlp.tools.parser.Parser;
import opennlp.tools.parser.ParserFactory;
import opennlp.tools.parser.ParserModel;

public class ParserTest {

    public static Set<String> nounPhrases = new HashSet<>();
    private static String line = "The Moon is a barren, rocky world ";

    // download a file from a URL to the given destination
    public void download(String url, File destination) throws IOException, Exception {
        URL website = new URL(url);
        ReadableByteChannel rbc = Channels.newChannel(website.openStream());
        FileOutputStream fos = new FileOutputStream(destination);
        fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
        fos.close();
        rbc.close();
    }

    // collect the covered text of every noun node in the parse tree
    public void getNounPhrases(Parse p) {
        if (p.getType().equals("NN") || p.getType().equals("NNS") || p.getType().equals("NNP")
                || p.getType().equals("NNPS")) {
            nounPhrases.add(p.getCoveredText());
        }
        for (Parse child : p.getChildren()) {
            getNounPhrases(child);
        }
    }

    public void parserAction() throws Exception {
        // InputStream is = new FileInputStream("en-parser-chunking.bin");
        File modelFile = new File("en-parser-chunking.bin");
        if (!modelFile.exists()) {
            System.out.println("Downloading model.");
            download("https://drive.google.com/uc?export=download&id=0B4uQtYVPbChrY2ZIWmpRQ1FSVVk", modelFile);
        }
        ParserModel model = new ParserModel(modelFile);
        Parser parser = ParserFactory.create(model);
        Parse[] topParses = ParserTool.parseLine(line, parser, 1);
        for (Parse p : topParses) {
            // p.show();
            getNounPhrases(p);
        }
    }

    public static void main(String[] args) throws Exception {
        new ParserTest().parserAction();
        System.out.println("List of Noun Parse : " + nounPhrases);
    }
}
It gives me the output below:
List of Noun Parse : [barren,, world, Moon]
Then I commented out the main method and called the parserAction() method in my servlet class:
if (name.equals("bkDescription")) {
bookDes = value;
try {
new ParserTest().parserAction();
System.out.println("Nouns Are"+ParserTest.nounPhrases);
} catch (Exception e) {
}
It then gives me exceptions, and an error shows up in my browser.
Why is this happening? I can run this with the main method, but when I remove the main method and call it from my servlet, it gives an exception. Is there any way to fix this issue?
NOTE - I have read the instructions below in the OpenNLP documentation, but I have no clear idea about them. Please help me fix this issue.
Unlike the other components to instantiate the Parser a factory method
should be used instead of creating the Parser via the new operator.
The parser model is either trained for the chunking parser or the tree
insert parser the parser implementation must be chosen correctly. The
factory method will read a type parameter from the model and create an
instance of the corresponding parser implementation.
Either create an object of the ParserTest class, or remove the new keyword in this line: new ParserTest().parserAction();
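Since the actual stack trace isn't reproduced above, another possibility worth checking (a guess): inside a servlet container, the relative path en-parser-chunking.bin resolves against the container's working directory rather than the project folder, so the model download and lookup can fail. A sketch (the servlet class and the WEB-INF location are assumptions) that resolves the model inside the deployed webapp instead:

import java.io.File;
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ParserServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // Resolve the model file relative to the deployed webapp, not the
        // container's working directory.
        File model = new File(getServletContext().getRealPath("/WEB-INF/en-parser-chunking.bin"));
        resp.getWriter().println("Model present: " + model.exists());
    }
}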

Apache Commons CSV: quoted input doesn't work

I try to parse a simple CSV file with the Apache Commons CSV parser. It works fine as long as I don't use quotes. When I try to add a quote to the input
"a";42
it gives me the error:
invalid char between encapsulated token and delimiter
Here is a simple, complete code:
import java.io.IOException;
import java.io.StringReader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;

public class Test {
    public static void main(String[] args) throws IOException {
        String DATA = "\"a\";12";
        CSVParser csvParser =
            CSVFormat.EXCEL
                .withIgnoreEmptyLines()
                .withIgnoreHeaderCase()
                .withRecordSeparator('\n').withQuote('"')
                .withEscape('\\').withRecordSeparator(';').withTrim()
                .parse(new StringReader(DATA));
    }
}
I simply can't find out what I've missed in the code.
The problem was so trivial I missed it.
I used withRecordSeparator instead of withDelimiter to set the field separator.
This works as I expected:
import java.io.IOException;
import java.io.StringReader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;

public class Test {
    public static void main(String[] args) throws IOException {
        String DATA = "\"a\";12";
        CSVParser csvParser =
            CSVFormat.EXCEL
                .withIgnoreEmptyLines()
                .withIgnoreHeaderCase()
                .withRecordSeparator('\n').withQuote('"')
                .withEscape('\\').withDelimiter(';').withTrim()
                .parse(new StringReader(DATA));
    }
}
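To verify the result, the parser can be iterated over (CSVRecord comes from the same org.apache.commons.csv package; the expected output is an assumption based on the sample input):

// Sketch: each record should yield the quoted field without its quotes.
for (CSVRecord record : csvParser) {
    System.out.println(record.get(0) + " | " + record.get(1)); // a | 12
}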

How to Archive Document in Java project (DMS)

I'm developing a DMS (document management system).
I'm currently working on the document management aspect, like managing PDFs and DOCs.
Now I want my application to be able to show all the existing PDF and DOC files on the computer, so that they can be opened when the user clicks on them.
I'm currently just focusing on PDFs and DOCs.
import java.io.File;
import java.util.Collection;
import org.apache.commons.io.FileUtils;

public class SearchDocFiles {

    public static String[] EXTENSIONS = { "doc", "docx" };

    public Collection<File> searchFilesWithExtensions(final File directory, final String[] extensions) {
        // true = recurse into subdirectories
        return FileUtils.listFiles(directory, extensions, true);
    }

    public static void main(String... args) {
        Collection<File> documents = new SearchDocFiles().searchFilesWithExtensions(
                new File("/path/to/document/folder"),
                SearchDocFiles.EXTENSIONS);
        for (File document : documents) {
            System.out.println(document.getName() + " - " + document.length());
        }
    }
}
This uses Apache Commons IO, specifically FileUtils. To also pick up PDFs, add "pdf" to EXTENSIONS.
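The question also asks for the files to open on click; a minimal sketch of handing a file to the OS default application (assuming a desktop environment where java.awt.Desktop is supported):

import java.awt.Desktop;
import java.io.File;
import java.io.IOException;

public class OpenDocument {
    // Opens the document with the OS default application (e.g. Word, a PDF viewer).
    public static void open(File document) throws IOException {
        if (Desktop.isDesktopSupported()) {
            Desktop.getDesktop().open(document);
        }
    }
}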
