How to use OpenNLP with Java?

I want to POS-tag an English sentence and do some processing. I would like to use OpenNLP, and I have it installed.
When I execute the command
I:\Workshop\Programming\nlp\opennlp-tools-1.5.0-bin\opennlp-tools-1.5.0>java -jar opennlp-tools-1.5.0.jar POSTagger models\en-pos-maxent.bin < Text.txt
it POS-tags the input in Text.txt and gives this output:
Loading POS Tagger model ... done (4.009s)
My_PRP$ name_NN is_VBZ Shabab_NNP i_FW am_VBP 22_CD years_NNS old._.
Average: 66.7 sent/s
Total: 1 sent
Runtime: 0.015s
So I assume it installed properly?
Now how do I do this POS tagging from inside a Java application? I have added the OpenNLP tools, JWNL, and maxent JARs to the project, but how do I invoke the POS tagger?

Here's some (old) sample code I threw together, with modernized code to follow:
package opennlp;

import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

import java.io.File;
import java.io.IOException;
import java.io.StringReader;

public class OpenNlpTest {
    public static void main(String[] args) throws IOException {
        // Load the model file directly from disk.
        POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
        PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
        POSTaggerME tagger = new POSTaggerME(model);

        String input = "Can anyone help me dig through OpenNLP's horrible documentation?";
        ObjectStream<String> lineStream =
                new PlainTextByLineStream(new StringReader(input));

        perfMon.start();
        String line;
        while ((line = lineStream.read()) != null) {
            // Tokenize on whitespace, tag each token, and pair tokens with tags.
            String[] whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE.tokenize(line);
            String[] tags = tagger.tag(whitespaceTokenizerLine);
            POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
            System.out.println(sample.toString());
            perfMon.incrementCounter();
        }
        perfMon.stopAndPrintFinalResult();
    }
}
The output is:
Loading POS Tagger model ... done (2.045s)
Can_MD anyone_NN help_VB me_PRP dig_VB through_IN OpenNLP's_NNP horrible_JJ documentation?_NN
Average: 76.9 sent/s
Total: 1 sent
Runtime: 0.013s
This is basically working from the POSTaggerTool class included as part of OpenNLP. The sample.getTags() method returns a String array with the tags themselves.
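If you want the tokens and their tags programmatically rather than the printed form, you can read them straight off the sample; a small sketch, continuing the loop above:

// getSentence() and getTags() return parallel arrays of tokens and tags.
String[] tokens = sample.getSentence();
String[] sampleTags = sample.getTags();
for (int i = 0; i < tokens.length; i++) {
    System.out.println(tokens[i] + "/" + sampleTags[i]);
}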
This requires direct file access to the training data, which is really, really lame.
An updated codebase for this is a little different (and probably more useful).
First, a Maven POM:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.javachannel</groupId>
    <artifactId>opennlp-example</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.opennlp</groupId>
            <artifactId>opennlp-tools</artifactId>
            <version>1.6.0</version>
        </dependency>
        <dependency>
            <groupId>org.testng</groupId>
            <artifactId>testng</artifactId>
            <version>[6.8.21,)</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
And here's the code, written as a test, therefore located in ./src/test/java/org/javachannel/opennlp/example:
package org.javachannel.opennlp.example;

import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.util.stream.Stream;

public class POSTest {
    private void download(String url, File destination) throws IOException {
        // Stream the remote model straight to disk.
        URL website = new URL(url);
        ReadableByteChannel rbc = Channels.newChannel(website.openStream());
        FileOutputStream fos = new FileOutputStream(destination);
        fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
    }

    @DataProvider
    Object[][] getCorpusData() {
        // Each row's single element is an Object[] of sentences, which TestNG
        // hands to showPOS() as its Object[] parameter.
        return new Object[][][]{{{
                "Can anyone help me dig through OpenNLP's horrible documentation?"
        }}};
    }

    @Test(dataProvider = "getCorpusData")
    public void showPOS(Object[] input) throws IOException {
        File modelFile = new File("en-pos-maxent.bin");
        if (!modelFile.exists()) {
            System.out.println("Downloading model.");
            download("http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin", modelFile);
        }
        POSModel model = new POSModel(modelFile);
        PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
        POSTaggerME tagger = new POSTaggerME(model);
        perfMon.start();
        Stream.of(input).map(line -> {
            // Tokenize on whitespace, tag, and render as "token_TAG ...".
            String[] whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE.tokenize(line.toString());
            String[] tags = tagger.tag(whitespaceTokenizerLine);
            POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
            perfMon.incrementCounter();
            return sample.toString();
        }).forEach(System.out::println);
        perfMon.stopAndPrintFinalResult();
    }
}
This code doesn't actually test anything - it's a smoke test, if anything - but it should serve as a starting point. Another (potentially) nice thing is that it downloads a model for you if you don't have it downloaded already.

The URL http://bulba.sdsu.edu/jeanette/thesis/PennTags.html does not work anymore. I found the tag definitions on the 14th slide at http://www.slideshare.net/gagan1667/opennlp-demo
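The slide itself isn't reproduced here, but for reference, the tags appearing in the outputs above are standard Penn Treebank tags:
PRP$ - possessive pronoun
PRP - personal pronoun
NN - noun, singular or mass
NNS - noun, plural
NNP - proper noun, singular
VB - verb, base form
VBZ - verb, third-person singular present
VBP - verb, non-third-person singular present
MD - modal
FW - foreign word
CD - cardinal number
IN - preposition or subordinating conjunction
JJ - adjective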

The above answer does provide a way to use the existing models from OpenNLP, but if you need to train your own model, maybe the below can help:
Here is a detailed tutorial with full code:
https://dataturks.com/blog/opennlp-pos-tagger-training-java-example.php
Depending upon your domain, you can build a dataset either automatically or manually. Building such a dataset manually can be really painful; annotation tools like a POS tagging tool can make the process much easier.
Training data format
Training data is passed as a text file where each line is one data item. Each word in the line should be labeled in the format "word_LABEL"; the word and the label name are separated by an underscore '_'.
anki_Brand overdrive_Brand
just_ModelName dance_ModelName 2018_ModelName
aoc_Brand 27"_ScreenSize monitor_Category
horizon_ModelName zero_ModelName dawn_ModelName
cm_Unknown 700_Unknown modem_Category
computer_Category
Train model
The important class here is POSModel, which holds the actual model. We use the POSTaggerME class to build the model. Below is the code to build a model from a training data file:
public POSModel train(String filepath) {
    TrainingParameters parameters = TrainingParameters.defaultParams();
    parameters.put(TrainingParameters.ITERATIONS_PARAM, "100");
    try (InputStream dataIn = new FileInputStream(filepath)) {
        // Each line of the file is one "word_LABEL ..." sample.
        ObjectStream<String> lineStream = new PlainTextByLineStream(new InputStreamFactory() {
            @Override
            public InputStream createInputStream() throws IOException {
                return dataIn;
            }
        }, StandardCharsets.UTF_8);
        ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);
        return POSTaggerME.train("en", sampleStream, parameters, new POSTaggerFactory());
    } catch (Exception e) {
        e.printStackTrace();
    }
    return null;
}
Use the model to do tagging.
Finally, we can see how the model can be used to tag unseen queries:
public void doTagging(POSModel model, String input) {
    input = input.trim();
    POSTaggerME tagger = new POSTaggerME(model);
    Sequence[] sequences = tagger.topKSequences(input.split(" "));
    for (Sequence s : sequences) {
        List<String> tags = s.getOutcomes();
        System.out.println(Arrays.asList(input.split(" ")) + " =>" + tags);
    }
}
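Putting the two methods together, a hypothetical driver (the training file path is made up) might look like:

// Hypothetical usage of train() and doTagging() from above.
POSModel model = train("/data/pos-train.txt");
if (model != null) {
    doTagging(model, "anki overdrive");
}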

Related

Saving an H2O model directly from Java

I'm trying to create and save a generated model directly from Java. The documentation specifies how to do this in R and Python, but not in Java. A similar question was asked before, but no real answer was provided (beyond linking to the H2O docs, which don't contain a code example).
It'd be sufficient for my present purpose to get some pointers to be able to translate the following reference code to Java. I'm mainly looking for guidance on the relevant JAR(s) to import from the Maven repository.
import h2o
h2o.init()
path = h2o.system_file("prostate.csv")
h2o_df = h2o.import_file(path)
h2o_df['CAPSULE'] = h2o_df['CAPSULE'].asfactor()
model = h2o.glm(y = "CAPSULE",
                x = ["AGE", "RACE", "PSA", "GLEASON"],
                training_frame = h2o_df,
                family = "binomial")
h2o.download_pojo(model)
I think I've figured out an answer to my question. A self-contained sample code follows. However, I'll still appreciate an answer from the community since I don't know if this is the best/idiomatic way to do it.
package org.name.company;
import hex.glm.GLMModel;
import water.H2O;
import water.Key;
import water.api.StreamWriter;
import water.api.StreamingSchema;
import water.fvec.Frame;
import water.fvec.NFSFileVec;
import hex.glm.GLMModel.GLMParameters.Family;
import hex.glm.GLMModel.GLMParameters;
import hex.glm.GLM;
import water.util.JCodeGen;
import java.io.*;
import java.util.Map;
public class Launcher
{
    public static void initCloud() {
        String[] args = new String[]{"-name", "h2o_test_cloud"};
        H2O.main(args);
        H2O.waitForCloudSize(1, 10 * 1000);
    }

    public static void main(String[] args) throws Exception {
        // Initialize the cloud
        initCloud();

        // Create a Frame object from CSV
        File f = new File("/path/to/data.csv");
        NFSFileVec nfs = NFSFileVec.make(f);
        Key frameKey = Key.make("frameKey");
        Frame fr = water.parser.ParseDataset.parse(frameKey, nfs._key);

        // Create a GLM and output coefficients
        Key modelKey = Key.make("modelKey");
        try {
            GLMParameters params = new GLMParameters();
            params._train = frameKey;
            params._response_column = fr.names()[1];
            params._intercept = true;
            params._lambda = new double[]{0};
            params._family = Family.gaussian;

            GLMModel model = new GLM(params).trainModel().get();
            Map<String, Double> coefs = model.coefficients();
            for (Map.Entry<String, Double> entry : coefs.entrySet()) {
                System.out.format("%s: %f\n", entry.getKey(), entry.getValue());
            }

            String filename = JCodeGen.toJavaId(model._key.toString()) + ".java";
            StreamingSchema ss = new StreamingSchema(model.new JavaModelStreamWriter(false), filename);
            StreamWriter sw = ss.getStreamWriter();
            OutputStream os = new FileOutputStream("/base/path/" + filename);
            sw.writeTo(os);
        } finally {
            if (fr != null) {
                fr.remove();
            }
        }
    }
}
Would something like this do the trick?
public void saveModel(URI uri, Keyed<Frame> model)
{
    Persist p = H2O.getPM().getPersistForURI(uri);
    OutputStream os = p.create(uri.toString(), true);
    model.writeAll(new AutoBuffer(os, true)).close();
}
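For example, a call with an explicit scheme (the path here is hypothetical):

// Hypothetical call; the explicit file:// scheme keeps the URI well-formed.
saveModel(URI.create("file:///tmp/h2o_model.bin"), model);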
Make sure the URI has a proper form, otherwise H2O will break with an NPE. As for Maven, you should be able to get away with just h2o-core:
<dependency>
    <groupId>ai.h2o</groupId>
    <artifactId>h2o-core</artifactId>
    <version>3.14.0.2</version>
</dependency>

SlopeOneRecommender not working

I am following the book Apache Mahout Cookbook by Piero Giacomelli. When I download the Maven sources using NetBeans as the IDE, I guess the sources are from Mahout version 1.0 and not 0.8, as an error is shown on the SlopeOneRecommender import alone.
Here is the complete code:
package com.packtpub.mahout.cookbook.chapter01;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;
import org.apache.commons.cli2.OptionException;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.CachingRecommender;
import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public class App {
    static final String inputFile = "/home/hadoop/ml-1m/ratings.dat";
    static final String outputFile = "/home/hadoop/ml-1m/ratings.csv";

    public static void main(String[] args) throws IOException, TasteException, OptionException {
        CreateCsvRatingsFile();

        // create data source (model) - from the csv file
        File ratingsFile = new File(outputFile);
        DataModel model = new FileDataModel(ratingsFile);

        // create a simple recommender on our data
        CachingRecommender cachingRecommender = new CachingRecommender(new SlopeOneRecommender(model));

        // for all users
        for (LongPrimitiveIterator it = model.getUserIDs(); it.hasNext();) {
            long userId = it.nextLong();

            // get the recommendations for the user
            List<RecommendedItem> recommendations = cachingRecommender.recommend(userId, 10);

            // if empty write something
            if (recommendations.size() == 0) {
                System.out.print("User ");
                System.out.print(userId);
                System.out.println(": no recommendations");
            }

            // print the list of recommendations for each
            for (RecommendedItem recommendedItem : recommendations) {
                System.out.print("User ");
                System.out.print(userId);
                System.out.print(": ");
                System.out.println(recommendedItem);
            }
        }
    }

    private static void CreateCsvRatingsFile() throws FileNotFoundException, IOException {
        BufferedReader br = new BufferedReader(new FileReader(inputFile));
        BufferedWriter bw = new BufferedWriter(new FileWriter(outputFile));
        String line = null;
        String line2write = null;
        String[] temp;
        int i = 0;
        while ((line = br.readLine()) != null && i < 1000) {
            i++;
            temp = line.split("::");
            line2write = temp[0] + "," + temp[1];
            bw.write(line2write);
            bw.newLine();
            bw.flush();
        }
        br.close();
        bw.close();
    }
}
The error is shown only on the line import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender; and hence also on the line where I create an object of this class. The error shown is "package does not exist".
Please help. Is it because I am using a newer version of Mahout? I am not even certain whether I am using version 0.8 or higher, as I followed all the links given in the book.
SlopeOneRecommender was removed from Mahout in v0.8. If you want to use it, you can switch to an earlier version such as 0.7:
<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-core</artifactId>
    <version>0.7</version>
</dependency>
See http://permalink.gmane.org/gmane.comp.apache.mahout.user/20282
Exactly. SlopeOneRecommender was removed from Mahout as of v0.8. So either go back to version 0.7, or, if your purpose is only to try Mahout, use another recommender such as ItemAverageRecommender, as sketched below.
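A minimal sketch of that swap, reusing the question's imports plus org.apache.mahout.cf.taste.impl.recommender.ItemAverageRecommender, inside a method that declares TasteException like the question's main:

// ItemAverageRecommender is constructed from a DataModel just like the
// removed SlopeOneRecommender, so it drops straight into the same wiring.
DataModel model = new FileDataModel(new File("/home/hadoop/ml-1m/ratings.csv"));
CachingRecommender cachingRecommender =
        new CachingRecommender(new ItemAverageRecommender(model));
List<RecommendedItem> recommendations = cachingRecommender.recommend(1L, 10);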

Can't get correct Key-Value Pairs with Tika

I'm trying to get the metadata values from an Office document, and all it shows as a key-value pair is this one:
Content-Type: application/zip
I just can't tell the issue in this one. Why does it only show the Content-Type?
What I'm interested in are keys like the title.
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class App
{
    private static final String PATH = "C:/docs/myDocument.docx";

    public static void main(String[] args) throws IOException, SAXException, TikaException
    {
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        InputStream fileStream = new FileInputStream(PATH);
        BodyContentHandler handler = new BodyContentHandler();
        parser.parse(fileStream, handler, metadata);

        String[] metadataNames = metadata.names();
        for (String key : metadataNames) {
            String value = metadata.get(key);
            System.out.println(key + ": " + value);
        }
    }
}
Promoting a comment to an answer - you appear to be missing some key Apache Tika JARs or their dependencies.
If you're using Maven, then your POM (as of January 2015) should have something like:
<properties>
    <tika.version>1.7</tika.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>${tika.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>${tika.version}</version>
    </dependency>
</dependencies>
The tika-core artifact gives you everything you need to run Tika and develop your own parsers, but doesn't come with any parsers. It's the tika-parsers artifact (+ dependencies!) which provides all the built-in Tika parsers, which you need to process files like yours.
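Once tika-parsers and its dependencies are on the classpath, the loop in the question should print keys such as the title; you can also read well-known properties directly, e.g. via org.apache.tika.metadata.TikaCoreProperties. A small sketch, continuing the question's code after parser.parse(...):

// With the parsers present, dedicated document properties become available.
String title = metadata.get(TikaCoreProperties.TITLE);
System.out.println("title: " + title);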

Convert DOCX to HTML including IMAGES

I am using DOCX4J to convert DOCX to HTML. I have successfully done the conversion and got the HTML, which I will embed as an email body to send an email. But I have some issues, listed below:
Unable to display images in the email body
Losing the spaces and bullets
Please find the code which I have written:
WordprocessingMLPackage wordMLPackage;
wordMLPackage = Docx4J.load(new java.io.File(resourcePath2));
HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
htmlSettings.setImageDirPath(imageFolder + resourcePath2 + "_files");
htmlSettings.setImageTargetUri(imageFolder +resourcePath2.substring(resourcePath2.lastIndexOf("/")+1) + "_files");
htmlSettings.setWmlPackage(wordMLPackage);
OutputStream os;
os = new ByteArrayOutputStream();
Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML", true);
Docx4J.toHTML(htmlSettings, os, Docx4J.FLAG_SAVE_FLAT_XML);
DOCX = ((ByteArrayOutputStream)os).toString();
You may add something like this in your code:
package tcg.doc.web.managedBeans;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.poi.xwpf.converter.core.FileImageExtractor;
import org.apache.poi.xwpf.converter.core.FileURIResolver;
import org.apache.poi.xwpf.converter.xhtml.XHTMLConverter;
import org.apache.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.context.annotation.Scope;
import org.springframework.stereotype.Component;

@Component
@Scope("session")
@Qualifier("ConvertWord")
public class ConvertWord {
    private static final String docName = "TestDocx.docx";
    private static final String outputlFolderPath = "d:/";
    String htmlNamePath = "docHtml.html";
    String zipName = "_tmp.zip";
    File docFile = new File(outputlFolderPath + docName);
    File zipFile = new File(zipName);

    public void ConvertWordToHtml() {
        try {
            // 1) Load DOCX into XWPFDocument
            InputStream doc = new FileInputStream(new File(outputlFolderPath + docName));
            System.out.println("InputStream" + doc);
            XWPFDocument document = new XWPFDocument(doc);

            // 2) Prepare XHTML options (here we set the IURIResolver to load images from a "word/media" folder)
            XHTMLOptions options = XHTMLOptions.create(); //.URIResolver(new FileURIResolver(new File("word/media")));

            // Extract images to target/images/<document name>
            // (the original snippet concatenated the InputStream into the path
            // here, which yields a meaningless folder name; docName is used instead)
            String root = "target";
            File imageFolder = new File(root + "/images/" + docName);
            options.setExtractor(new FileImageExtractor(imageFolder));
            // URI resolver
            options.URIResolver(new FileURIResolver(imageFolder));

            OutputStream out = new FileOutputStream(new File(htmlPath()));
            XHTMLConverter.getInstance().convert(document, out, options);
            System.out.println("OutputStream " + out.toString());
        } catch (FileNotFoundException ex) {
        } catch (IOException ex) {
        }
    }

    public static void main(String[] args) {
        ConvertWord cwoWord = new ConvertWord();
        cwoWord.ConvertWordToHtml();
        System.out.println();
    }

    public String htmlPath() {
        // d:/docHtml.html
        return outputlFolderPath + htmlNamePath;
    }

    public String zipPath() {
        // d:/_tmp.zip
        return outputlFolderPath + zipName;
    }
}
For the Maven dependency, add this to your pom.xml:
<dependency>
    <groupId>fr.opensagres.xdocreport</groupId>
    <artifactId>org.apache.poi.xwpf.converter.xhtml</artifactId>
    <version>1.0.4</version>
</dependency>
or download it from here.
For images to work in an email body, I guess you need to use either a data URI or publish them to a web-reachable location.
In either case, you'll need to write an implementation of:
public interface ConversionImageHandler {
    /**
     * @param picture
     * @param relationship of the image
     * @param part of the image, if it is an internal image, otherwise null
     * @return uri for the image we've saved, or null
     * @throws Docx4JException this exception will be logged, but not propagated
     */
    public String handleImage(AbstractWordXmlPicture picture, Relationship relationship, BinaryPart part) throws Docx4JException;
}
and configure docx4j to use it with htmlSettings.setImageHandler.
You can look at some of the existing implementations in the docx4j source code, and take advantage of the helper methods in AbstractConversionImageHandler (e.g. createEncodedImage if you want data URIs).
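For illustration only, here is a rough sketch of a data-URI handler built on the interface above; the BinaryPart accessors are assumptions, so verify them against your docx4j version (or simply extend AbstractConversionImageHandler and call its createEncodedImage helper instead):

// Sketch only: inlines internal images as base64 data URIs.
public class DataUriImageHandler implements ConversionImageHandler {
    @Override
    public String handleImage(AbstractWordXmlPicture picture, Relationship relationship,
                              BinaryPart part) throws Docx4JException {
        if (part == null) {
            return null; // external image; nothing to inline
        }
        byte[] bytes = part.getBytes();       // assumed accessor; check your docx4j version
        String mime = part.getContentType();  // e.g. "image/png"
        return "data:" + mime + ";base64," + java.util.Base64.getEncoder().encodeToString(bytes);
    }
}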

Java : Microsoft word document to html converter style sheet

As required, I am trying to convert doc or docx (Microsoft Word) files to HTML with Apache Tika. I ended up with the following code, which works fine, but it does not add any style sheet to the resulting HTML.
import javax.xml.transform.OutputKeys;
import java.io.*;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.detect.DefaultDetector;
public class DocxConvert
{
    public static void main(String[] args)
    {
        InputStream input = null;
        try
        {
            StringWriter sw = new StringWriter();
            SAXTransformerFactory factory = (SAXTransformerFactory)
                    SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
            handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
            handler.setResult(new StreamResult(sw));

            input = new FileInputStream("f:\\file.doc");
            DefaultDetector detector = new DefaultDetector();
            Metadata metadata = new Metadata();
            org.apache.tika.parser.Parser parser = new AutoDetectParser(detector);
            parser.parse(input, handler, metadata, new ParseContext());
            System.out.print(sw.toString());
        }
        catch (Exception ex)
        {
            ex.printStackTrace();
        }
        finally
        {
            try
            {
                if (input != null) { // guard against NPE if the stream never opened
                    input.close();
                }
            }
            catch (IOException e)
            {
                e.printStackTrace();
            }
        }
    }
}
Is there any way to add or generate a style sheet for the output? Kindly help!
I used version 1.6 of Tika and that worked fine for me. Here is the POM dependency I used (downloads: http://tika.apache.org/download.html):
<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>1.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.6</version>
    </dependency>
</dependencies>
You can use unoconv, which requires OpenOffice or LibreOffice. Download it from here; it converts doc, docx, xls, etc. to PDF from the command line on your server. If you want to show an embedded PDF file with Apache or Apache Tomcat, I think pdf.js is a good solution.
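A typical unoconv invocation looks something like this (the input path is made up):

unoconv -f pdf /srv/docs/file.doc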
