Can't get correct Key-Value Pairs with Tika - java

I'm trying to get the metadata values from an Office document, but the only key-value pair it shows is this one:
Content-Type: application/zip
I just can't tell what the issue is here. Why does it only show the Content-Type?
What I'm interested in are keys like title.
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class App {

    private static final String PATH = "C:/docs/myDocument.docx";

    public static void main(String[] args) throws IOException, SAXException, TikaException {
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        InputStream fileStream = new FileInputStream(PATH);
        BodyContentHandler handler = new BodyContentHandler();
        parser.parse(fileStream, handler, metadata);

        String[] metadataNames = metadata.names();
        for (String key : metadataNames) {
            String value = metadata.get(key);
            System.out.println(key + ": " + value);
        }
    }
}

Promoting a comment to an answer - you appear to be missing some key Apache Tika jars or their dependencies.
If you're using Maven, then your pom should (as of January 2015) have something like:
<properties>
    <tika.version>1.7</tika.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>${tika.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>${tika.version}</version>
    </dependency>
</dependencies>
The tika-core artifact gives you everything you need to run Tika and develop your own parsers, but it doesn't come with any parsers. It's the tika-parsers artifact (plus its dependencies!) that provides all the built-in Tika parsers, which you need to process files like yours.
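Once tika-parsers and its dependencies are on the classpath, you can sanity-check which parser actually handled the document. A minimal sketch, assuming Tika 1.x and the same document path as in the question:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class ParserCheck {
    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(Paths.get("C:/docs/myDocument.docx"))) {
            new AutoDetectParser().parse(in, new BodyContentHandler(), metadata);
        }
        // With only tika-core on the classpath this names the fallback
        // EmptyParser; with tika-parsers present it should name the OOXML parser.
        System.out.println("Parsed by: " + metadata.get("X-Parsed-By"));
        System.out.println("Title: " + metadata.get(TikaCoreProperties.TITLE));
    }
}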

Related

Missing many metadata Key-value pairs with Apache Tika

I am trying to get the metadata of a file in Java using Apache Tika. The code for that is given below:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class tika {
    public static void main(final String[] args) throws IOException, SAXException, TikaException {
        File file = new File("C:\\HTML\\viewer\\files\\doc.doc");
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        FileInputStream inputstream = new FileInputStream(file);
        ParseContext context = new ParseContext();
        parser.parse(inputstream, handler, metadata, context);

        String[] metadataNames = metadata.names();
        System.out.println(metadataNames.length);
        for (String key : metadataNames) {
            String value = metadata.get(key);
            System.out.println(key + ": " + value);
        }
    }
}
The problem is that I am getting only the two key-value pairs given below:
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
2
X-TIKA:Parsed-By: org.apache.tika.parser.EmptyParser
Content-Type: application/x-tika-msoffice
I am also getting a warning from SLF4J, which I don't think is related to the missing output - please tell me if it is.
It's a Maven project, and the dependencies I have are given below:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.3.0</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>2.3.0</version>
    <type>pom</type>
</dependency>
I am getting only the 'X-TIKA:Parsed-By' and 'Content-Type' key-values. I want more metadata values like the author, comments, size, etc.
You have to use a specific parser to retrieve the Microsoft document properties.
First, add the following dependencies:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.3.0</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <version>2.3.0</version>
</dependency>
Next, you can do something like this:
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("C:\\HTML\\viewer\\files\\doc.doc"));
ParseContext pcontext = new ParseContext();

// Specific parser: OOXMLParser handles the XML-based formats (.docx, .xlsx, ...);
// for the legacy OLE2 formats such as .doc, use org.apache.tika.parser.microsoft.OfficeParser instead
OOXMLParser msOfficeParser = new OOXMLParser();
msOfficeParser.parse(inputstream, handler, metadata, pcontext);

System.out.println("Contents of the document: " + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();
for (String name : metadataNames) {
    System.out.println(name + ": " + metadata.get(name));
}
You will obtain something like this:
Metadata of the document:
cp:revision: 3
extended-properties:AppVersion: 16.0000
meta:paragraph-count: 1
meta:word-count: 83
extended-properties:Application: Microsoft Office Word
meta:last-author: Microsoft Office User
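Note that with tika-parsers-standard-package on the classpath, the AutoDetectParser from the question should also start returning the full metadata, since it can now delegate to the right Office parser. A sketch, assuming Tika 2.3.0:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class AutoDetectDemo {
    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(Paths.get("C:\\HTML\\viewer\\files\\doc.doc"))) {
            // Delegates to OfficeParser/OOXMLParser as appropriate, provided
            // tika-parsers-standard-package is on the classpath
            new AutoDetectParser().parse(in, new BodyContentHandler(), metadata, new ParseContext());
        }
        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }
    }
}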
References:
https://www.tutorialspoint.com/tika/tika_extracting_ms_office_files.htm
https://tika.apache.org/2.3.0/gettingstarted.html

How to determine extension from fileBytes

My application allows users to download files. While creating the headers, I am using Tika to set the file extension, as shown below.
This works fine for PDF files but fails for DOC and EXCEL files.
private HttpHeaders getHeaderData(byte[] fileBytes) throws IOException, MimeTypeException {
    final HttpHeaders headers = new HttpHeaders();
    TikaInputStream tikaStream = TikaInputStream.get(fileBytes);
    Tika tika = new Tika();
    String mimeType = tika.detect(tikaStream);
    headers.setContentType(MediaType.valueOf(mimeType));
    MimeTypes defaultMimeTypes = MimeTypes.getDefaultMimeTypes();
    String extension = defaultMimeTypes.forName(mimeType).getExtension();
    headers.add("file-ext", extension);
    return headers;
}
I see that the mimeType resolves to "application/pdf" for PDF files but to "application/x-tika-ooxml" for Excel and Word files, which is the problem.
How can I get the Word (.doc, .docx) and Excel (.xls, .xlsx) formats if I have the file as bytes?
And why does this work for PDF?
Summary
The short answer is: You have to use Tika's detector with its MediaType class - not MimeTypes.
The slightly longer answer is: Even that will not get you all the way, because of how older MS-Office files are structured. For those you have to also parse the files, and inspect their metadata.
The term "media type" has replaced the term "MIME type" - see here:
[RFC2046] specifies that Media Types (formerly known as MIME types) and Media
Subtypes will be assigned and listed by the IANA.
Office 97-2003
When Tika inspects Excel and Word 97-2003 files using its detector, it will return a media type of application/x-tika-msoffice. I assume (perhaps incorrectly) that this is its way of handling a file-type group, where the detector cannot determine the specific flavor of MS-Office 97-2003 file, based on its analysis. This is similar to the application/x-tika-ooxml in your question.
Expected Results
Based on the IANA list here, and a Mozilla list here, these are the media types we expect to get for the following file types:
.pdf :: application/pdf
.xls :: application/vnd.ms-excel
.doc :: application/msword
.xlsx :: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
.docx :: application/vnd.openxmlformats-officedocument.wordprocessingml.document
The Program
The program shown below uses the following Maven dependencies:
<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>1.23</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.23</version>
    </dependency>
    <dependency>
        <groupId>javax.ws.rs</groupId>
        <artifactId>javax.ws.rs-api</artifactId>
        <version>2.1.1</version>
    </dependency>
</dependencies>
The program (just for this demo - not production ready) is shown below. Specifically, look at the tikaDetect() and tikaParse() methods.
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MimeTypeException;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class Main {

    private final Set<File> msOfficeFiles = new HashSet<>();

    public static void main(String[] args) throws IOException, MimeTypeException,
            SAXException, TikaException {
        Main main = new Main();
        main.doFileDetection();
    }

    private void doFileDetection() throws IOException, MimeTypeException, SAXException, TikaException {
        File file1 = new File("C:/tmp/foo.pdf");
        File file2 = new File("C:/tmp/baz.xlsx");
        File file3 = new File("C:/tmp/bat.docx");
        // Excel 97-2003 format:
        File file4 = new File("C:/tmp/bar.xls");
        // Word 97-2003 format:
        File file5 = new File("C:/tmp/daz.doc");

        Set<File> files = new HashSet<>();
        files.add(file1);
        files.add(file2);
        files.add(file3);
        files.add(file4);
        files.add(file5);

        for (File file : files) {
            try (BufferedInputStream bis = new BufferedInputStream(
                    new FileInputStream(file))) {
                tikaDetect(file, bis);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        // The generic Office 97-2003 files need a parse step to identify the specific flavor:
        for (File file : msOfficeFiles) {
            tikaParse(file);
        }
    }

    private void tikaDetect(File file, BufferedInputStream bis)
            throws IOException, SAXException, TikaException {
        Detector detector = new DefaultDetector();
        Metadata metadata = new Metadata();
        MediaType mediaType = detector.detect(bis, metadata);
        if (mediaType.toString().equals("application/x-tika-msoffice")) {
            // Generic OLE2 container - defer to the parser for the exact type
            msOfficeFiles.add(file);
        } else {
            System.out.println("Media Type for " + file.getName()
                    + " is: " + mediaType.toString());
        }
    }

    private void tikaParse(File file) throws SAXException, TikaException {
        Parser parser = new AutoDetectParser();
        ContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        try (BufferedInputStream bis = new BufferedInputStream(
                new FileInputStream(file))) {
            parser.parse(bis, handler, metadata, context);
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println("Media Type for " + file.getName()
                + " is: " + metadata.get("Content-Type"));
    }
}
Actual Results
The program generates some warnings and information messages. If we ignore these for this exercise, we get the following print statements:
Media Type for bat.docx is: application/vnd.openxmlformats-officedocument.wordprocessingml.document
Media Type for baz.xlsx is: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Media Type for foo.pdf is: application/pdf
Media Type for bar.xls is: application/vnd.ms-excel
Media Type for daz.doc is: application/msword
These match the expected official media (MIME) types.
Tika official usages:
https://tika.apache.org/1.26/detection.html
Tika supported formats:
https://tika.apache.org/1.26/formats.html
You could get the answers by simply reading the above 2 pages.
Here are some key quotes:
Microsoft Office and some related applications produce documents in the generic OLE 2 Compound Document and Office Open XML (OOXML) formats. The older OLE 2 format was introduced in Microsoft Office version 97 and was the default format until Office version 2007 and the new XML-based OOXML format. The OfficeParser and OOXMLParser classes use Apache POI libraries to support text and metadata extraction from both OLE2 and OOXML documents.
That means you also need to include the Apache POI jars or Maven dependencies for MS Office files.
Tika provides a wrapping detector in the form of org.apache.tika.detect.DefaultDetector. This uses the service loader to discover all available detectors, including any available container aware ones, and tries them in turn. For container aware detection, include the Tika Parsers jar and its dependencies in your project, then use DefaultDetector along with a TikaInputStream.
That means you need to include the Tika Parsers jar or its Maven dependencies.
Then use:
new DefaultDetector().detect(TikaInputStream.get(file), new Metadata());
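Applied to the getHeaderData() method from the question, a sketch might look like this (assuming tika-parsers and its dependencies are on the classpath so the container-aware detectors are available; as explained in the first answer, the 97-2003 formats may still need a parse step):

import java.io.IOException;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MimeTypeException;

public class ExtensionResolver {
    // Container-aware detection: DefaultDetector plus a TikaInputStream can
    // distinguish .docx/.xlsx from a plain zip, unlike magic-only detection.
    public static String resolveExtension(byte[] fileBytes) throws IOException, MimeTypeException {
        TikaConfig config = TikaConfig.getDefaultConfig();
        Detector detector = config.getDetector();
        try (TikaInputStream stream = TikaInputStream.get(fileBytes)) {
            MediaType mediaType = detector.detect(stream, new Metadata());
            // Map the detected media type back to a file extension, e.g. ".xlsx"
            return config.getMimeRepository().forName(mediaType.toString()).getExtension();
        }
    }
}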

tika: disable looking inside zip files

Currently, Tika processes zip files by looking inside them.
I'd like to disable this feature and just get the application/zip MIME type.
I'm using this code right now:
public String getMimeType(InputStream is) throws IOException {
    TikaConfig tikaConfig = TikaConfig.getDefaultConfig();
    Detector detector = tikaConfig.getDetector(); // new DefaultDetector();
    Metadata metadata = new Metadata();
    MediaType mediaType = detector.detect(TikaInputStream.get(is), metadata);
    return mediaType.toString();
}
This code returns the MIME type of the file inside the zip, not application/zip.
Any ideas?
Based on your example I wrote a dummy app. I then used a large zip file that doesn't have the .zip extension. I do not see the behavior where Tika parses the whole file.
I checked with a debugger: Tika only reads 65536 bytes to determine the file type.
See Tika's MimeTypes class, line 154:
public MediaType detect(InputStream input, Metadata metadata) throws IOException {
    List<MimeType> possibleTypes = null;

    // Get type based on magic prefix
    if (input != null) {
        input.mark(getMinLength());
        try {
            byte[] prefix = readMagicHeader(input);
            possibleTypes = getMimeType(prefix);
        } finally {
            input.reset();
        }
    }
    // ...
}
dummy app
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class Main {
    public static void main(String[] args) throws IOException {
        System.out.println(new Main().getMimeType(new FileInputStream(new File("C:\\temp\\apache-tomcat-8.0.47-windows-x64"))));
    }

    public String getMimeType(InputStream is) throws IOException {
        final TikaConfig tikaConfig = TikaConfig.getDefaultConfig();
        Detector detector = tikaConfig.getDetector(); // new DefaultDetector();
        Metadata metadata = new Metadata();
        MediaType mediaType = detector.detect(TikaInputStream.get(is), metadata);
        // toString() gives the full type ("application/zip"); getType() would
        // only return the primary part ("application")
        return mediaType.toString();
    }
}
Maven:
<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>1.18</version>
    </dependency>
</dependencies>
Can you show me an example that reads and parses the whole zip file? I can understand that that would be a problem when files exceed a certain size.
Unfortunately, I can't help you if I can't reproduce the problem.
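If the goal is simply to avoid container inspection and always get application/zip back, one option is to use the magic-bytes-only MimeTypes detector instead of the container-aware DefaultDetector. A sketch, assuming the container detectors pulled in by tika-parsers are what is kicking in on your classpath:

import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MimeTypes;

public class MagicOnlyDetect {
    // MimeTypes only examines the magic header bytes, so a zip container is
    // reported as application/zip and its contents are never inspected.
    public String getMimeType(InputStream is) throws IOException {
        Metadata metadata = new Metadata();
        MediaType mediaType = MimeTypes.getDefaultMimeTypes().detect(TikaInputStream.get(is), metadata);
        return mediaType.toString();
    }
}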

Exception in thread "main" java.lang.reflect.InvocationTargetException

I am using iText PDF 5.5.11 to convert PDF to XML. I have already checked similar answers on Stack Overflow. I am getting the error below when I run the jar file from the command line on Ubuntu, with java version "1.8.0_101":
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:58)
Caused by: java.lang.NoClassDefFoundError: org/bouncycastle/asn1/ASN1Encodable
at com.itextpdf.text.pdf.PdfEncryption.<init>(PdfEncryption.java:147)
at com.itextpdf.text.pdf.PdfReader.readDecryptedDocObj(PdfReader.java:1063)
at com.itextpdf.text.pdf.PdfReader.readDocObj(PdfReader.java:1469)
at com.itextpdf.text.pdf.PdfReader.readPdf(PdfReader.java:751)
at com.itextpdf.text.pdf.PdfReader.<init>(PdfReader.java:198)
at com.itextpdf.text.pdf.PdfReader.<init>(PdfReader.java:236)
at com.itextpdf.text.pdf.PdfReader.<init>(PdfReader.java:224)
at com.itextpdf.text.pdf.PdfReader.<init>(PdfReader.java:214)
at test.pdfreader.readXml(pdfreader.java:34)
at test.pdfreader.main(pdfreader.java:30)
I am not very familiar with Java. I call this jar file from PHP using the PHP exec function.
Below is the code I use to convert PDF to XML:
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.AcroFields;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.XfaForm;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class pdfreader {
    public static void main(String[] args) throws IOException, DocumentException, TransformerException {
        String SRC = "";
        String DEST = "";
        for (String s : args) {
            SRC = args[0];
            DEST = args[1];
        }
        File file = new File(DEST);
        file.getParentFile().mkdirs();
        new pdfreader().readXml(SRC, DEST);
    }

    public void readXml(String src, String dest) throws IOException, DocumentException, TransformerException {
        PdfReader reader = new PdfReader(src);
        AcroFields form = reader.getAcroFields();
        XfaForm xfa = form.getXfa();
        Node node = xfa.getDatasetsNode();
        NodeList list = node.getChildNodes();
        for (int i = 0; i < list.getLength(); ++i) {
            if ("data".equals(list.item(i).getLocalName())) {
                node = list.item(i);
                break;
            }
        }
        list = node.getChildNodes();
        Transformer tf = TransformerFactory.newInstance().newTransformer();
        tf.setOutputProperty("encoding", "UTF-8");
        tf.setOutputProperty("indent", "yes");
        FileOutputStream os = new FileOutputStream(dest);
        tf.transform(new DOMSource(node), new StreamResult(os));
        reader.close();
    }
}
When you use Maven for your Java project, then all you need to do is add a dependency on iText, and Maven will take care of the transitive dependencies. Note, however, that iText marks BouncyCastle as optional (see its pom below), so you have to declare that dependency yourself, with the matching version. Maven still takes away all the heavy lifting.
The same principle applies for other build systems like Gradle etc.
Now, if you want to do it all manually and put the correct jars on your classpath, then you need to do some homework. This means looking at the pom.xml of each and every one of your dependencies, seeing which transitive dependencies they have, which dependencies those dependencies have, and so on ad nauseam.
In the case of iText, take a look at the pom.xml that you can find on Maven Central: https://search.maven.org/#artifactdetails%7Ccom.itextpdf%7Citextpdf%7C5.5.11%7Cjar
In particular this part:
<dependencies>
    <dependency>
        <groupId>org.bouncycastle</groupId>
        <artifactId>bcprov-jdk15on</artifactId>
        <version>1.49</version>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.bouncycastle</groupId>
        <artifactId>bcpkix-jdk15on</artifactId>
        <version>1.49</version>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.8.2</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.santuario</groupId>
        <artifactId>xmlsec</artifactId>
        <version>1.5.1</version>
        <optional>true</optional>
    </dependency>
</dependencies>
This tells you that iText 5.5.11 has an optional dependency on BouncyCastle 1.49.
BouncyCastle has a bad reputation of randomly changing and breaking their API even with minor updates, which is why you need to be very precise with your BouncyCastle version.
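Concretely, since optional dependencies are not pulled in transitively, a minimal dependency block might look like this (a sketch; coordinates from Maven Central, with the BouncyCastle version matching the one iText 5.5.11 declares):

<dependencies>
    <dependency>
        <groupId>com.itextpdf</groupId>
        <artifactId>itextpdf</artifactId>
        <version>5.5.11</version>
    </dependency>
    <!-- optional in iText's pom, so declare it explicitly with the matching version -->
    <dependency>
        <groupId>org.bouncycastle</groupId>
        <artifactId>bcprov-jdk15on</artifactId>
        <version>1.49</version>
    </dependency>
    <dependency>
        <groupId>org.bouncycastle</groupId>
        <artifactId>bcpkix-jdk15on</artifactId>
        <version>1.49</version>
    </dependency>
</dependencies>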

How to use OpenNLP with Java?

I want to POS-tag an English sentence and do some processing. I would like to use OpenNLP, and I have it installed.
When I execute the command
I:\Workshop\Programming\nlp\opennlp-tools-1.5.0-bin\opennlp-tools-1.5.0>java -jar opennlp-tools-1.5.0.jar POSTagger models\en-pos-maxent.bin < Text.txt
it POS-tags the input in Text.txt and gives this output:
Loading POS Tagger model ... done (4.009s)
My_PRP$ name_NN is_VBZ Shabab_NNP i_FW am_VBP 22_CD years_NNS old._.
Average: 66.7 sent/s
Total: 1 sent
Runtime: 0.015s
So I hope it is installed properly?
Now how do I do this POS tagging from inside a Java application? I have added the OpenNLP tools, jwnl, and maxent jars to the project, but how do I invoke the POS tagger?
Here's some (old) sample code I threw together, with modernized code to follow:
package opennlp;

import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import java.io.File;
import java.io.IOException;
import java.io.StringReader;

public class OpenNlpTest {
    public static void main(String[] args) throws IOException {
        POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
        PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
        POSTaggerME tagger = new POSTaggerME(model);

        String input = "Can anyone help me dig through OpenNLP's horrible documentation?";
        ObjectStream<String> lineStream =
                new PlainTextByLineStream(new StringReader(input));

        perfMon.start();
        String line;
        while ((line = lineStream.read()) != null) {
            String[] whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE.tokenize(line);
            String[] tags = tagger.tag(whitespaceTokenizerLine);
            POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
            System.out.println(sample.toString());
            perfMon.incrementCounter();
        }
        perfMon.stopAndPrintFinalResult();
    }
}
The output is:
Loading POS Tagger model ... done (2.045s)
Can_MD anyone_NN help_VB me_PRP dig_VB through_IN OpenNLP's_NNP horrible_JJ documentation?_NN
Average: 76.9 sent/s
Total: 1 sent
Runtime: 0.013s
This is basically working from the POSTaggerTool class included as part of OpenNLP. The sample.getTags() is a String array that has the tag types themselves.
This requires direct file access to the training data, which is really, really lame.
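By the way, the tokens and tags can be paired up programmatically via the POSSample accessors. A small fragment that could replace the println inside the loop above:

// Pair each token with its predicted tag instead of printing "word_TAG"
String[] tokens = sample.getSentence();
String[] predicted = sample.getTags();
for (int i = 0; i < tokens.length; i++) {
    System.out.println(tokens[i] + " -> " + predicted[i]);
}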
An updated codebase for this is a little different (and probably more useful.)
First, a Maven POM:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.javachannel</groupId>
    <artifactId>opennlp-example</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.opennlp</groupId>
            <artifactId>opennlp-tools</artifactId>
            <version>1.6.0</version>
        </dependency>
        <dependency>
            <groupId>org.testng</groupId>
            <artifactId>testng</artifactId>
            <version>[6.8.21,)</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
And here's the code, written as a test, therefore located in ./src/test/java/org/javachannel/opennlp/example:
package org.javachannel.opennlp.example;

import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.util.stream.Stream;

public class POSTest {
    private void download(String url, File destination) throws IOException {
        URL website = new URL(url);
        ReadableByteChannel rbc = Channels.newChannel(website.openStream());
        FileOutputStream fos = new FileOutputStream(destination);
        fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
    }

    @DataProvider
    Object[][] getCorpusData() {
        return new Object[][][]{{{
                "Can anyone help me dig through OpenNLP's horrible documentation?"
        }}};
    }

    @Test(dataProvider = "getCorpusData")
    public void showPOS(Object[] input) throws IOException {
        File modelFile = new File("en-pos-maxent.bin");
        if (!modelFile.exists()) {
            System.out.println("Downloading model.");
            download("http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin", modelFile);
        }
        POSModel model = new POSModel(modelFile);
        PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
        POSTaggerME tagger = new POSTaggerME(model);
        perfMon.start();
        Stream.of(input).map(line -> {
            String[] whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE.tokenize(line.toString());
            String[] tags = tagger.tag(whitespaceTokenizerLine);
            POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
            perfMon.incrementCounter();
            return sample.toString();
        }).forEach(System.out::println);
        perfMon.stopAndPrintFinalResult();
    }
}
This code doesn't actually test anything - it's a smoke test, if anything - but it should serve as a starting point. Another (potentially) nice thing is that it downloads a model for you if you don't have it downloaded already.
The URL http://bulba.sdsu.edu/jeanette/thesis/PennTags.html does not work anymore. I found the Penn Treebank tag list on the 14th slide at http://www.slideshare.net/gagan1667/opennlp-demo
The above answer does provide a way to use the existing models from OpenNLP, but if you need to train your own model, maybe the below can help.
Here is a detailed tutorial with full code:
https://dataturks.com/blog/opennlp-pos-tagger-training-java-example.php
Depending upon your domain, you can build a dataset either automatically or manually. Building such a dataset manually can be really painful; annotation tools like a POS tagger can help make the process much easier.
Training data format
Training data is passed as a text file where each line is one data item. Each word in the line should be labeled in a format like "word_LABEL", where the word and the label name are separated by an underscore '_'.
anki_Brand overdrive_Brand
just_ModelName dance_ModelName 2018_ModelName
aoc_Brand 27"_ScreenSize monitor_Category
horizon_ModelName zero_ModelName dawn_ModelName
cm_Unknown 700_Unknown modem_Category
computer_Category
Train model
The important class here is POSModel, which holds the actual model. We use the class POSTaggerME to do the model building. Below is the code to build a model from a training data file:
public POSModel train(String filepath) {
    POSModel model = null;
    TrainingParameters parameters = TrainingParameters.defaultParams();
    parameters.put(TrainingParameters.ITERATIONS_PARAM, "100");
    try (InputStream dataIn = new FileInputStream(filepath)) {
        ObjectStream<String> lineStream = new PlainTextByLineStream(new InputStreamFactory() {
            @Override
            public InputStream createInputStream() throws IOException {
                return dataIn;
            }
        }, StandardCharsets.UTF_8);
        ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);
        model = POSTaggerME.train("en", sampleStream, parameters, new POSTaggerFactory());
        return model;
    } catch (Exception e) {
        e.printStackTrace();
    }
    return null;
}
Use the model to do tagging
Finally, we can see how the model can be used to tag unseen queries:
public void doTagging(POSModel model, String input) {
    input = input.trim();
    POSTaggerME tagger = new POSTaggerME(model);
    Sequence[] sequences = tagger.topKSequences(input.split(" "));
    for (Sequence s : sequences) {
        List<String> tags = s.getOutcomes();
        System.out.println(Arrays.asList(input.split(" ")) + " => " + tags);
    }
}
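For completeness, a hypothetical driver tying the two snippets together (assuming both methods live in a class named PosTaggerTrainer, and a training file in the word_LABEL format shown above):

public class PosTaggerTrainer {
    // train(...) and doTagging(...) as shown above

    public static void main(String[] args) {
        PosTaggerTrainer trainer = new PosTaggerTrainer();
        // "pos-training-data.txt" is a placeholder path to a file in the
        // word_LABEL format described earlier
        POSModel model = trainer.train("pos-training-data.txt");
        if (model != null) {
            trainer.doTagging(model, "anki overdrive");
        }
    }
}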
