Missing many metadata Key-value pairs with Apache Tika - java

I am trying to get metadata of a file in JAVA using Apache Tika.The code for getting that is given below,
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class tika {
public static void main(final String[] args) throws IOException, SAXException, TikaException {
File file = new File("C:\\HTML\\viewer\\files\\doc.doc");
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(file);
ParseContext context = new ParseContext();
parser.parse(inputstream, handler, metadata, context);
String[] metadataNames = metadata.names();
System.out.println(metadataNames.length);
for (String key : metadataNames) {
String value = metadata.get(key);
System.out.println(key + ": " + value);
}
}
}
The problem is I am getting only two key-value pairs which are given below,
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
2
X-TIKA:Parsed-By: org.apache.tika.parser.EmptyParser
Content-Type: application/x-tika-msoffice
I am getting a warning from SLF4J which I think is not a problem for getting the desired output and If so please tell me.
Its a maven project and the dependencies I have are given below,
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>2.3.0</version>
<type>pom</type>
</dependency>
I am getting only the 'X-TIKA:Parsed-By' and 'Content-Type' key-values. I want more metadata values like the author, comments, size, etc.

You have to use a specific parser for retrieving the Microsoft Document properties.
First add the following dependencies:
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers-standard-package</artifactId>
<version>2.3.0</version>
</dependency>
Next, you can do something like this:
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("C:\\HTML\\viewer\\files\\doc.doc"));
ParseContext pcontext = new ParseContext();
// Specific parser
OOXMLParser msOfficeParser = new OOXMLParser();
msOfficeParser .parse(inputstream, handler, metadata, pcontext);
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();
for (String name : metadataNames) {
System.out.println(name + ": " + metadata.get(name));
}
You will obtain something like this:
Metadata of the document: cp:revision: 3
extended-properties:AppVersion: 16.0000 meta:paragraph-count: 1
meta:word-count: 83 extended-properties:Application: Microsoft Office
Word meta:last-author: Microsoft Office User
References:
https://www.tutorialspoint.com/tika/tika_extracting_ms_office_files.htm
https://tika.apache.org/2.3.0/gettingstarted.html

Related

How can i make this terrier example work?

Trying to figure out how to embody terrier Indexing and Retrieval in my app but cant even run properly the documentation demo (https://github.com/terrier-org/terrier-core/blob/5.x/doc/quickstart-integratedsearch.md)
I know there are some serious updates now but feels like mess to me.
Can anyone help?
This is the most recent example you can find***
import java.io.File;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Iterator;
import org.terrier.indexing.Document;
import org.terrier.indexing.TaggedDocument;
import org.terrier.indexing.tokenisation.Tokeniser;
import org.terrier.querying.LocalManager;
import org.terrier.querying.Manager;
import org.terrier.querying.ManagerFactory;
import org.terrier.querying.ScoredDoc;
import org.terrier.querying.ScoredDocList;
import org.terrier.querying.SearchRequest;
import org.terrier.realtime.memory.MemoryIndex;
import org.terrier.utility.ApplicationSetup;
import org.terrier.utility.Files;
public class IndexingAndRetrievalExample {
public static void main(String[] args) throws Exception {
// Directory containing files to index
String aDirectoryToIndex = "/my/directory/containing/files/";
// Configure Terrier
ApplicationSetup.setProperty("indexer.meta.forward.keys", "docno");
ApplicationSetup.setProperty("indexer.meta.forward.keylens", "30");
// Create a new Index
MemoryIndex memIndex = new MemoryIndex();
// For each file
for (String filename : new File(aDirectoryToIndex).list() ) {
String fullPath = aDirectoryToIndex+filename;
// Convert it to a Terrier Document
Document document = new TaggedDocument(Files.openFileReader(fullPath), new HashMap(), Tokeniser.getTokeniser());
// Add a meaningful identifier
document.getAllProperties().put("docno", filename);
// index it
memIndex.indexDocument(document);
}
// Set up the querying process
ApplicationSetup.setProperty("querying.processes", "terrierql:TerrierQLParser,"
+ "parsecontrols:TerrierQLToControls,"
+ "parseql:TerrierQLToMatchingQueryTerms,"
+ "matchopql:MatchingOpQLParser,"
+ "applypipeline:ApplyTermPipeline,"
+ "localmatching:LocalManager$ApplyLocalMatching,"
+ "filters:LocalManager$PostFilterProcess");
// Enable the decorate enhancement
ApplicationSetup.setProperty("querying.postfilters", "org.terrier.querying.SimpleDecorate");
// Create a new manager run queries
Manager queryingManager = ManagerFactory.from(memIndex.getIndexRef());
// Create a search request
SearchRequest srq = queryingManager.newSearchRequestFromQuery("search for document");
// Specify the model to use when searching
srq.setControl(SearchRequest.CONTROL_WMODEL, "BM25");
// Enable querying processes
srq.setControl("terrierql", "on");
srq.setControl("parsecontrols", "on");
srq.setControl("parseql", "on");
srq.setControl("applypipeline", "on");
srq.setControl("localmatching", "on");
srq.setControl("filters", "on");
// Enable post filters
srq.setControl("decorate", "on");
// Run the search
queryingManager.runSearchRequest(srq);
// Get the result set
ScoredDocList results = srq.getResults();
// Print the results
System.out.println("The top "+results.size()+" of documents were returned");
System.out.println("Document Ranking");
for(ScoredDoc doc : results) {
int docid = doc.getDocid();
double score = doc.getScore();
String docno = doc.getMetadata("docno")
System.out.println(" Rank "+i+": "+docid+" "+docno+" "+score);
}
}
}
Indexing part seems to be okay.
This lines of retrieval part seem problematic.
Manager queryingManager = ManagerFactory.from(memIndex.getIndexRef());
cursor message: Cannot resolve method 'getIndexRef' in 'MemoryIndex
srq.setControl(SearchRequest.CONTROL_WMODEL, "BM25");
cursor message: Cannot resolve symbol 'CONTROL_WMODEL'
ScoredDocList results = srq.getResults();
cursor message: Cannot resolve method 'getResults' in 'SearchRequest'
I think the problem is that there are new ways to do this and some methods are now deprecated.
Could anyone try this code and see if it works?
It is a Maven project.
These are the dependencies :
<dependency>
<groupId>org.terrier</groupId>
<artifactId>terrier-core</artifactId>
<version>5.5</version>
</dependency>
<dependency>
<groupId>org.terrier</groupId>
<artifactId>terrier-core</artifactId>
<version>5.4</version>
</dependency>
<dependency>
<groupId>org.terrier</groupId>
<artifactId>terrier-core</artifactId>
<version>5.1</version>
</dependency>
<dependency>
<groupId>org.terrier</groupId>
<artifactId>terrier-realtime</artifactId>
<version>5.1</version>
</dependency>
<dependency>
<groupId>org.terrier</groupId>
<artifactId>terrier-core</artifactId>
<version>4.4</version>
</dependency>
<dependency>
<groupId>org.terrier</groupId>
<artifactId>terrier-core</artifactId>
<version>4.2</version>
</dependency>
<dependency>
<groupId>org.terrier</groupId>
<artifactId>terrier-batch-indexers</artifactId>
<version>5.4</version>
</dependency>
<dependency>
<groupId>org.terrier</groupId>
<artifactId>terrier-batch-retrieval</artifactId>
<version>5.4</version>
</dependency>
<dependency>
<groupId>org.terrier</groupId>
<artifactId>terrier-index-api</artifactId>
<version>5.5</version>
</dependency>
</dependencies>

Exception in thread "main" java.lang.reflect.InvocationTargetException

I am using iText PDF 5.5.11 to convert PDF to XML.I already checked similar answers on stackoverflow. I am getting below error when I run jar file using command line on ubuntu. java version "1.8.0_101"
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:58)
Caused by: java.lang.NoClassDefFoundError: org/bouncycastle/asn1/ASN1Encodable
at com.itextpdf.text.pdf.PdfEncryption.<init>(PdfEncryption.java:147)
at com.itextpdf.text.pdf.PdfReader.readDecryptedDocObj(PdfReader.java:1063)
at com.itextpdf.text.pdf.PdfReader.readDocObj(PdfReader.java:1469)
at com.itextpdf.text.pdf.PdfReader.readPdf(PdfReader.java:751)
at com.itextpdf.text.pdf.PdfReader.<init>(PdfReader.java:198)
at com.itextpdf.text.pdf.PdfReader.<init>(PdfReader.java:236)
at com.itextpdf.text.pdf.PdfReader.<init>(PdfReader.java:224)
at com.itextpdf.text.pdf.PdfReader.<init>(PdfReader.java:214)
at test.pdfreader.readXml(pdfreader.java:34)
at test.pdfreader.main(pdfreader.java:30)
I am not much familiar with java. I call this jar file from PHP using PHP exec function.
Below is the code I use to convert PDF to XML.
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.AcroFields;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.XfaForm;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
public class pdfreader {
public static void main(String[] args) throws IOException, DocumentException, TransformerException {
String SRC = "";
String DEST = "";
for (String s : args) {
SRC = args[0];
DEST = args[1];
}
File file = new File(DEST);
file.getParentFile().mkdirs();
new pdfreader().readXml(SRC, DEST);
}
public void readXml(String src, String dest) throws IOException, DocumentException, TransformerException {
PdfReader reader = new PdfReader(src);
AcroFields form = reader.getAcroFields();
XfaForm xfa = form.getXfa();
Node node = xfa.getDatasetsNode();
NodeList list = node.getChildNodes();
for (int i = 0; i < list.getLength(); ++i) {
if ("data".equals(list.item(i).getLocalName())) {
node = list.item(i);
break;
}
}
list = node.getChildNodes();
Transformer tf = TransformerFactory.newInstance().newTransformer();
tf.setOutputProperty("encoding", "UTF-8");
tf.setOutputProperty("indent", "yes");
FileOutputStream os = new FileOutputStream(dest);
tf.transform(new DOMSource(node), new StreamResult(os));
reader.close();
}
}
When you use Maven for your Java project, then all you need to do, is add a dependency to iText. Maven will then take care of all transitive dependencies like BouncyCastle. Maven takes away all the heavy lifting.
The same principle applies for other build systems like Gradle etc.
Now, if you want to do it all manually and put the correct jars on your classpath, then you need to do some homework. This means looking at the pom.xml of each and every of your dependencies, see which transitive dependencies they have, which dependencies those dependencies have, and so on ad nauseam.
In case of iText, you take a look at the pom.xml that you can find on Maven Central: https://search.maven.org/#artifactdetails%7Ccom.itextpdf%7Citextpdf%7C5.5.11%7Cjar
In particular this part:
<dependencies>
<dependency>
<groupId>org.bouncycastle</groupId>
<artifactId>bcprov-jdk15on</artifactId>
<version>1.49</version>
<optional>true</optional>
</dependency>
<dependency>
<groupId>org.bouncycastle</groupId>
<artifactId>bcpkix-jdk15on</artifactId>
<version>1.49</version>
<optional>true</optional>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.8.2</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.santuario</groupId>
<artifactId>xmlsec</artifactId>
<version>1.5.1</version>
<optional>true</optional>
</dependency>
</dependencies>
This tells you that iText 5.5.11 has an optional dependency on BouncyCastle 1.49.
BouncyCastle has a bad reputation of randomly changing and breaking their API even with minor updates, that is why you need to be very precise with your BouncyCastle version.
Hi Just change in zookeeper.service file as Environment="KAFKA_ARGS=-javaagent:/home/ec2-user/prometheus/jmx_prometheus_javaagent-0.3.1.jar=8080:/home/ec2-user/prometheus/kafka-0-8-2.yml" to below and the issue resolved:
Environment="KAFKA_OPTS=-javaagent:/home/ec2-user/prometheus/jmx_prometheus_javaagent-0.3.1.jar=8080:/home/ec2-user/prometheus/zookeeper.yml"

Can't get correct Key-Value Pairs with Tika

I'm trying to get the Metadata Values from an Office Document and all it shows as key-value pair is this one:
Content-Type: application/zip
I just can't tell the issue in this one. Why does it only show the Content-Type?
What i'm interested in are Keys like title.
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class App
{
private static final String PATH = "C:/docs/myDocument.docx";
public static void main( String[] args ) throws IOException, SAXException, TikaException
{
Metadata metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser();
InputStream fileStream = new FileInputStream(PATH);
BodyContentHandler handler = new BodyContentHandler();
parser.parse(fileStream, handler, metadata);
String[] metadataNames = metadata.names();
for (String key : metadataNames) {
String value = metadata.get(key);
System.out.println(key + ": " + value);
}
}
}
Promoting a comment to an answer - you appear to be missing some key Apache Tika jars or their dependencies.
If you're using Maven, then your pom should have (as of January 2015) should have something like:
<properties>
<tika.version>1.7</tika.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>${tika.version}</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>${tika.version}</version>
</dependency>
</dependencies>
The tika-core artifact gives you everything you need to run Tika, and develop your own parsers, but doesn't come with any parsers. It's the tika-parsers artifact (+dependencies!) which provides all the built-in Tika parsers, which you need to process files liek yours

Java : Microsoft word document to html converter style sheet

As required I am trying to convert doc or docx (Microsoft word) files to html format with Apache tika
I end up with following code which works fine,
But its not adding any style sheet to result html.
import javax.xml.transform.OutputKeys;
import java.io.*;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.detect.DefaultDetector;
public class DocxConvert
{
public static void main(String []args)
{
InputStream input=null;
try
{
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory)
SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD,"html");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT,"yes");
handler.setResult(new StreamResult(sw));
input = new FileInputStream("f:\\file.doc");
DefaultDetector detector = new DefaultDetector();
Metadata metadata = new Metadata();
org.apache.tika.parser.Parser parser = new AutoDetectParser(detector);
parser.parse(input, handler, metadata, new ParseContext());
System.out.print(sw.toString());
}
catch (Exception ex)
{
ex.printStackTrace();
}
finally {
try {
input.close();
}
catch (IOException e)
{
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
}
Is there any way to add/generate style sheet to output? kindly help !
I used version 1.6 of Tika and that worked fine for me. Here is the pom dependency I used.
http://tika.apache.org/download.html
<dependencies>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>1.6</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>1.6</version>
</dependency>
</dependencies>
You can use unoconv and it requires Openoffice or Libreoffice. Download from here and it provides doc,docx,xls etc. to pdf conversion from command line in your server. if you want to show embedding pdf file with apache or apache tomcat, i think pdf.js is good solution.

How to use OpenNLP with Java?

I want to POStag an English sentence and do some processing. I would like to use openNLP. I have it installed
When I execute the command
I:\Workshop\Programming\nlp\opennlp-tools-1.5.0-bin\opennlp-tools-1.5.0>java -jar opennlp-tools-1.5.0.jar POSTagger models\en-pos-maxent.bin < Text.txt
It gives output POSTagging the input in Text.txt
Loading POS Tagger model ... done (4.009s)
My_PRP$ name_NN is_VBZ Shabab_NNP i_FW am_VBP 22_CD years_NNS old._.
Average: 66.7 sent/s
Total: 1 sent
Runtime: 0.015s
I hope it installed properly?
Now how do i do this POStagging from inside a java application? I have added the openNLPtools, jwnl, maxent jar to the project but how do i invoke the POStagging?
Here's some (old) sample code I threw together, with modernized code to follow:
package opennlp;
import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import java.io.File;
import java.io.IOException;
import java.io.StringReader;
public class OpenNlpTest {
public static void main(String[] args) throws IOException {
POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
POSTaggerME tagger = new POSTaggerME(model);
String input = "Can anyone help me dig through OpenNLP's horrible documentation?";
ObjectStream<String> lineStream =
new PlainTextByLineStream(new StringReader(input));
perfMon.start();
String line;
while ((line = lineStream.read()) != null) {
String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE.tokenize(line);
String[] tags = tagger.tag(whitespaceTokenizerLine);
POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
System.out.println(sample.toString());
perfMon.incrementCounter();
}
perfMon.stopAndPrintFinalResult();
}
}
The output is:
Loading POS Tagger model ... done (2.045s)
Can_MD anyone_NN help_VB me_PRP dig_VB through_IN OpenNLP's_NNP horrible_JJ documentation?_NN
Average: 76.9 sent/s
Total: 1 sent
Runtime: 0.013s
This is basically working from the POSTaggerTool class included as part of OpenNLP. The sample.getTags() is a String array that has the tag types themselves.
This requires direct file access to the training data, which is really, really lame.
An updated codebase for this is a little different (and probably more useful.)
First, a Maven POM:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.javachannel</groupId>
<artifactId>opennlp-example</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>1.6.0</version>
</dependency>
<dependency>
<groupId>org.testng</groupId>
<artifactId>testng</artifactId>
<version>[6.8.21,)</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
</plugins>
</build>
</project>
And here's the code, written as a test, therefore located in ./src/test/java/org/javachannel/opennlp/example:
package org.javachannel.opennlp.example;
import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.util.stream.Stream;
public class POSTest {
private void download(String url, File destination) throws IOException {
URL website = new URL(url);
ReadableByteChannel rbc = Channels.newChannel(website.openStream());
FileOutputStream fos = new FileOutputStream(destination);
fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
}
#DataProvider
Object[][] getCorpusData() {
return new Object[][][]{{{
"Can anyone help me dig through OpenNLP's horrible documentation?"
}}};
}
#Test(dataProvider = "getCorpusData")
public void showPOS(Object[] input) throws IOException {
File modelFile = new File("en-pos-maxent.bin");
if (!modelFile.exists()) {
System.out.println("Downloading model.");
download("http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin", modelFile);
}
POSModel model = new POSModel(modelFile);
PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
POSTaggerME tagger = new POSTaggerME(model);
perfMon.start();
Stream.of(input).map(line -> {
String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE.tokenize(line.toString());
String[] tags = tagger.tag(whitespaceTokenizerLine);
POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
perfMon.incrementCounter();
return sample.toString();
}).forEach(System.out::println);
perfMon.stopAndPrintFinalResult();
}
}
This code doesn't actually test anything - it's a smoke test, if anything - but it should serve as a starting point. Another (potentially) nice thing is that it downloads a model for you if you don't have it downloaded already.
The URL http://bulba.sdsu.edu/jeanette/thesis/PennTags.html does not work anymore. I found the below on the 14th slide at http://www.slideshare.net/gagan1667/opennlp-demo
The above answer does provide a way to use the existing models from OpenNLP but if you need to train your own model, maybe the below can help:
Here is a detailed tutorial with full code:
https://dataturks.com/blog/opennlp-pos-tagger-training-java-example.php
Depending upon your domain, you can build a dataset either automatically or manually. Building such a dataset manually can be really painful, tools like POS tagger can help make the process much easier.
Training data format
Training data is passed as a text file where each line is one data item. Each word in the line should be labeled in a format like "word_LABEL", the word and the label name is separated by an underscore '_'.
anki_Brand overdrive_Brand
just_ModelName dance_ModelName 2018_ModelName
aoc_Brand 27"_ScreenSize monitor_Category
horizon_ModelName zero_ModelName dawn_ModelName
cm_Unknown 700_Unknown modem_Category
computer_Category
Train model
The important class here is POSModel, which holds the actual model. We use class POSTaggerME to do the model building. Below is the code to build a model from training data file
public POSModel train(String filepath) {
POSModel model = null;
TrainingParameters parameters = TrainingParameters.defaultParams();
parameters.put(TrainingParameters.ITERATIONS_PARAM, "100");
try {
try (InputStream dataIn = new FileInputStream(filepath)) {
ObjectStream<String> lineStream = new PlainTextByLineStream(new InputStreamFactory() {
#Override
public InputStream createInputStream() throws IOException {
return dataIn;
}
}, StandardCharsets.UTF_8);
ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);
model = POSTaggerME.train("en", sampleStream, parameters, new POSTaggerFactory());
return model;
}
}
catch (Exception e) {
e.printStackTrace();
}
return null;
}
Use model to do tagging.
Finally, we can see how the model can be used to tag unseen queries:
public void doTagging(POSModel model, String input) {
input = input.trim();
POSTaggerME tagger = new POSTaggerME(model);
Sequence[] sequences = tagger.topKSequences(input.split(" "));
for (Sequence s : sequences) {
List<String> tags = s.getOutcomes();
System.out.println(Arrays.asList(input.split(" ")) +" =>" + tags);
}
}

Categories