I am using iText PDF 5.5.11 to convert a PDF to XML. I have already checked similar answers on Stack Overflow. I get the error below when I run the jar file from the command line on Ubuntu (java version "1.8.0_101"):
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:58)
Caused by: java.lang.NoClassDefFoundError: org/bouncycastle/asn1/ASN1Encodable
at com.itextpdf.text.pdf.PdfEncryption.<init>(PdfEncryption.java:147)
at com.itextpdf.text.pdf.PdfReader.readDecryptedDocObj(PdfReader.java:1063)
at com.itextpdf.text.pdf.PdfReader.readDocObj(PdfReader.java:1469)
at com.itextpdf.text.pdf.PdfReader.readPdf(PdfReader.java:751)
at com.itextpdf.text.pdf.PdfReader.<init>(PdfReader.java:198)
at com.itextpdf.text.pdf.PdfReader.<init>(PdfReader.java:236)
at com.itextpdf.text.pdf.PdfReader.<init>(PdfReader.java:224)
at com.itextpdf.text.pdf.PdfReader.<init>(PdfReader.java:214)
at test.pdfreader.readXml(pdfreader.java:34)
at test.pdfreader.main(pdfreader.java:30)
I am not very familiar with Java. I call this jar file from PHP using the PHP exec function.
Below is the code I use to convert PDF to XML.
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.AcroFields;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.XfaForm;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
public class pdfreader {
public static void main(String[] args) throws IOException, DocumentException, TransformerException {
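// Expect exactly two arguments: the source PDF and the destination XML file
if (args.length < 2) {
System.err.println("Usage: pdfreader <source.pdf> <dest.xml>");
return;
}
String SRC = args[0];
String DEST = args[1];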
File file = new File(DEST);
file.getParentFile().mkdirs();
new pdfreader().readXml(SRC, DEST);
}
public void readXml(String src, String dest) throws IOException, DocumentException, TransformerException {
PdfReader reader = new PdfReader(src);
AcroFields form = reader.getAcroFields();
XfaForm xfa = form.getXfa();
Node node = xfa.getDatasetsNode();
NodeList list = node.getChildNodes();
for (int i = 0; i < list.getLength(); ++i) {
if ("data".equals(list.item(i).getLocalName())) {
node = list.item(i);
break;
}
}
list = node.getChildNodes();
Transformer tf = TransformerFactory.newInstance().newTransformer();
tf.setOutputProperty("encoding", "UTF-8");
tf.setOutputProperty("indent", "yes");
// try-with-resources flushes and closes the output file even if the transform fails
try (FileOutputStream os = new FileOutputStream(dest)) {
tf.transform(new DOMSource(node), new StreamResult(os));
}
reader.close();
}
}
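When you use Maven for your Java project, all you need to do is add a dependency on iText. Maven then takes care of the transitive dependencies, although dependencies marked as optional (such as BouncyCastle in iText 5) are not pulled in automatically and have to be declared explicitly. Maven takes away all the heavy lifting.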
The same principle applies for other build systems like Gradle etc.
Now, if you want to do it all manually and put the correct jars on your classpath, then you need to do some homework. This means looking at the pom.xml of each and every one of your dependencies to see which transitive dependencies they have, which dependencies those dependencies have, and so on ad nauseam.
In case of iText, you take a look at the pom.xml that you can find on Maven Central: https://search.maven.org/#artifactdetails%7Ccom.itextpdf%7Citextpdf%7C5.5.11%7Cjar
In particular this part:
<dependencies>
<dependency>
<groupId>org.bouncycastle</groupId>
<artifactId>bcprov-jdk15on</artifactId>
<version>1.49</version>
<optional>true</optional>
</dependency>
<dependency>
<groupId>org.bouncycastle</groupId>
<artifactId>bcpkix-jdk15on</artifactId>
<version>1.49</version>
<optional>true</optional>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.8.2</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.santuario</groupId>
<artifactId>xmlsec</artifactId>
<version>1.5.1</version>
<optional>true</optional>
</dependency>
</dependencies>
This tells you that iText 5.5.11 has an optional dependency on BouncyCastle 1.49.
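To make the "homework" a little less tedious, here is a minimal sketch that lists the dependencies declared in a pom.xml fragment and flags the optional ones. It uses only the JDK's XML parser; the class name PomDependencyLister is made up for this illustration, and it simply reports every dependency element it finds (including any under dependencyManagement or plugins), so treat it as a reading aid rather than a replacement for Maven's own resolution.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of iText or Maven.
public class PomDependencyLister {

    /** Returns "groupId:artifactId:version" plus flags for each dependency element. */
    public static List<String> list(String pomXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(pomXml.getBytes(StandardCharsets.UTF_8)));
        NodeList deps = doc.getElementsByTagName("dependency");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < deps.getLength(); i++) {
            Element dep = (Element) deps.item(i);
            String line = text(dep, "groupId") + ":" + text(dep, "artifactId")
                    + ":" + text(dep, "version");
            if ("true".equals(text(dep, "optional"))) {
                line += " (optional)";
            }
            String scope = text(dep, "scope");
            if (!scope.isEmpty()) {
                line += " (scope: " + scope + ")";
            }
            result.add(line);
        }
        return result;
    }

    // Text of the first matching child element, or "" if there is none.
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Two entries copied from the iText 5.5.11 pom fragment quoted above
        String pom = "<dependencies>"
                + "<dependency><groupId>org.bouncycastle</groupId>"
                + "<artifactId>bcprov-jdk15on</artifactId>"
                + "<version>1.49</version><optional>true</optional></dependency>"
                + "<dependency><groupId>junit</groupId><artifactId>junit</artifactId>"
                + "<version>4.8.2</version><scope>test</scope></dependency>"
                + "</dependencies>";
        list(pom).forEach(System.out::println);
    }
}
```

Anything flagged as optional is a jar you have to supply yourself if you use the feature that needs it.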
BouncyCastle has a bad reputation for randomly changing and breaking its API even in minor updates, which is why you need to be very precise about your BouncyCastle version.
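If you are unsure whether the BouncyCastle jar actually ended up on the runtime classpath of your jar, a quick check along these lines can help. It is only a sketch: the class name ClasspathCheck is invented here, and the class being probed is the one named in the NoClassDefFoundError from the question.

```java
// Hypothetical diagnostic helper, not part of iText or BouncyCastle.
public class ClasspathCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    public static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the stack trace in the question
        String bouncyCastleClass = "org.bouncycastle.asn1.ASN1Encodable";
        System.out.println(bouncyCastleClass + " available: "
                + isOnClasspath(bouncyCastleClass));
    }
}
```

Run it with the same classpath (or package it into the same jar) as the failing program; if it reports false, the BouncyCastle jars are missing at runtime, whatever the build path in the IDE says.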
Hi. In the zookeeper.service file, change Environment="KAFKA_ARGS=-javaagent:/home/ec2-user/prometheus/jmx_prometheus_javaagent-0.3.1.jar=8080:/home/ec2-user/prometheus/kafka-0-8-2.yml" to the line below, and the issue is resolved:
Environment="KAFKA_OPTS=-javaagent:/home/ec2-user/prometheus/jmx_prometheus_javaagent-0.3.1.jar=8080:/home/ec2-user/prometheus/zookeeper.yml"
As a RPGLE programmer I was asked to convert an IMAP mail poller to EWS. So I'm a bit out of my comfort zone.
With all the documentation I managed to cobble a lot of pieces together, but it looks like I'm missing a vital piece, because the program keeps crashing at Folder folder = service.bindToFolder(folderId, propertySet);
At first I thought it had something to do with authorization. But then I found a little program, EwsEditor, on GitHub, which works just fine.
Could someone point me to what I am missing?
My testing code:
package test.ewstest;
import com.microsoft.aad.msal4j.ClientCredentialFactory;
import com.microsoft.aad.msal4j.ClientCredentialParameters;
import com.microsoft.aad.msal4j.ConfidentialClientApplication;
import java.io.File;
import java.io.FileInputStream;
import java.net.URI;
import java.util.Collections;
import java.util.Properties;
import microsoft.exchange.webservices.data.core.ExchangeService;
import microsoft.exchange.webservices.data.core.PropertySet;
import microsoft.exchange.webservices.data.core.enumeration.misc.ConnectingIdType;
import microsoft.exchange.webservices.data.core.enumeration.misc.ExchangeVersion;
import microsoft.exchange.webservices.data.core.enumeration.property.BasePropertySet;
import microsoft.exchange.webservices.data.core.enumeration.property.WellKnownFolderName;
import microsoft.exchange.webservices.data.core.enumeration.search.FolderTraversal;
import microsoft.exchange.webservices.data.core.service.schema.FolderSchema;
import microsoft.exchange.webservices.data.credential.TokenCredentials;
import microsoft.exchange.webservices.data.misc.ImpersonatedUserId;
import microsoft.exchange.webservices.data.property.complex.FolderId;
import microsoft.exchange.webservices.data.property.complex.Mailbox;
import microsoft.exchange.webservices.data.search.FindFoldersResults;
import microsoft.exchange.webservices.data.search.FolderView;
import microsoft.exchange.webservices.data.core.service.folder.Folder;
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;
public class EwsTest {
static String exchangeUser;
static String exchangeTenant;
static String exchangeClientId;
static String exchangeClientSecret;
static ExchangeService service;
static Logger logger = LogManager.getLogger(EwsTest.class.getName());
static Properties properties = new Properties();
public static void main(String[] args) {
try {
properties.load(new FileInputStream(System.getProperty("user.dir") + File.separator + "EwsTest.properties"));
exchangeUser = properties.getProperty("user");
exchangeTenant = properties.getProperty("tenant");
exchangeClientId = properties.getProperty("client");
exchangeClientSecret = properties.getProperty("secret");
createConnection();
displayFolders();
}
catch (Exception ex) {
logger.error(ex.getMessage(), ex);
}
}
static void createConnection() throws Exception {
service = new ExchangeService(ExchangeVersion.Exchange2010_SP2);
service.getHttpHeaders().put("X-AnchorMailbox",exchangeUser);
service.getHttpHeaders().put("X-PublicFolderMailbox",exchangeUser);
service.setCredentials(new TokenCredentials(createOauthToken()));
service.setImpersonatedUserId(new ImpersonatedUserId(ConnectingIdType.SmtpAddress, exchangeUser));
service.setUrl(new URI("https://outlook.office365.com/EWS/Exchange.asmx"));
}
static String createOauthToken() throws Exception {
ConfidentialClientApplication app = ConfidentialClientApplication.builder(
exchangeClientId,
ClientCredentialFactory.createFromSecret(exchangeClientSecret))
.authority("https://login.microsoftonline.com/" + exchangeTenant + "/")
.build();
ClientCredentialParameters clientCredentialParam = ClientCredentialParameters.builder(
Collections.singleton("https://outlook.office365.com/.default"))
.build();
return app.acquireToken(clientCredentialParam).get().accessToken();
}
static void displayFolders() throws Exception {
PropertySet propertySet = new PropertySet(BasePropertySet.IdOnly);
propertySet.add(FolderSchema.DisplayName);
FolderView view = new FolderView(100);
view.setPropertySet(propertySet);
view.setTraversal(FolderTraversal.Deep);
Mailbox mailbox = new Mailbox(exchangeUser);
FolderId folderId = new FolderId(WellKnownFolderName.MsgFolderRoot, mailbox);
logger.info("service.bindToFolder");
Folder folder = service.bindToFolder(folderId, propertySet);
logger.info("Ok");
logger.info("service.findFolders");
FindFoldersResults findFolderResults = service.findFolders(folder.getId(), view);
logger.info("Ok");
// find specific folder
for (Folder f : findFolderResults)
{
logger.info(f.getId());
}
}
}
And my Netbeans maven dependencies:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>test</groupId>
<artifactId>EwsTest</artifactId>
<version>1.0-SNAPSHOT</version>
<packaging>jar</packaging>
<dependencies>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-1.2-api</artifactId>
<version>2.11.2</version>
</dependency>
<dependency>
<groupId>javax.xml.ws</groupId>
<artifactId>jaxws-api</artifactId>
<version>2.3.1</version>
</dependency>
<dependency>
<groupId>com.microsoft.azure</groupId>
<artifactId>msal4j</artifactId>
<version>1.13.3</version>
</dependency>
<dependency>
<groupId>com.microsoft.ews-java-api</groupId>
<artifactId>ews-java-api</artifactId>
<version>2.0</version>
</dependency>
</dependencies>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>19</maven.compiler.source>
<maven.compiler.target>19</maven.compiler.target>
<exec.mainClass>test.ewstest.EwsTest</exec.mainClass>
</properties>
</project>
I created my EwsTest.properties file and filled it with the keys:
user=example@example.nl
client=9999x9xx-xx99-x99x-xx99-xx99x9999x9x
tenant=x99x9999-xx99-9999-xx99-9999xx9x9999
secret=XXxxX!9xx-...-
When I run the program, I see that the connection is being built and a token is being exchanged. But it crashes at the statement:
Folder folder = service.bindToFolder(folderId, propertySet);
With an error: The request failed. An internal server error occurred. The operation failed.
microsoft.exchange.webservices.data.core.exception.service.remote.ServiceRequestException: The request failed. An internal server error occurred. The operation failed.
at microsoft.exchange.webservices.data.core.request.SimpleServiceRequestBase.internalExecute(SimpleServiceRequestBase.java:74) ~[ews-java-api-2.0.jar:?]
at microsoft.exchange.webservices.data.core.request.MultiResponseServiceRequest.execute(MultiResponseServiceRequest.java:158) ~[ews-java-api-2.0.jar:?]
at microsoft.exchange.webservices.data.core.ExchangeService.bindToFolder(ExchangeService.java:504) ~[ews-java-api-2.0.jar:?]
at test.ewstest.EwsTest.displayFolders(EwsTest.java:91) ~[classes/:?]
at test.ewstest.EwsTest.main(EwsTest.java:49) [classes/:?]
Caused by: microsoft.exchange.webservices.data.core.exception.service.remote.ServiceResponseException: An internal server error occurred. The operation failed.
at microsoft.exchange.webservices.data.core.request.ServiceRequestBase.processWebException(ServiceRequestBase.java:548) ~[ews-java-api-2.0.jar:?]
at microsoft.exchange.webservices.data.core.request.ServiceRequestBase.validateAndEmitRequest(ServiceRequestBase.java:641) ~[ews-java-api-2.0.jar:?]
at microsoft.exchange.webservices.data.core.request.SimpleServiceRequestBase.internalExecute(SimpleServiceRequestBase.java:62) ~[ews-java-api-2.0.jar:?]
... 4 more
I tried several folders like inbox and root but they all give that same error:
FolderId folderId = new FolderId(WellKnownFolderName.MsgFolderRoot, mailbox);
But since EwsEditor seems to work, I'm clearly missing something. I tried looking at the code of EwsEditor but that's C# which is even harder to understand for me.
I am trying to figure out how to embed Terrier indexing and retrieval in my app, but I can't even get the documentation demo to run properly (https://github.com/terrier-org/terrier-core/blob/5.x/doc/quickstart-integratedsearch.md).
I know there have been some serious updates recently, but it feels like a mess to me.
Can anyone help?
This is the most recent example you can find:
import java.io.File;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Iterator;
import org.terrier.indexing.Document;
import org.terrier.indexing.TaggedDocument;
import org.terrier.indexing.tokenisation.Tokeniser;
import org.terrier.querying.LocalManager;
import org.terrier.querying.Manager;
import org.terrier.querying.ManagerFactory;
import org.terrier.querying.ScoredDoc;
import org.terrier.querying.ScoredDocList;
import org.terrier.querying.SearchRequest;
import org.terrier.realtime.memory.MemoryIndex;
import org.terrier.utility.ApplicationSetup;
import org.terrier.utility.Files;
public class IndexingAndRetrievalExample {
public static void main(String[] args) throws Exception {
// Directory containing files to index
String aDirectoryToIndex = "/my/directory/containing/files/";
// Configure Terrier
ApplicationSetup.setProperty("indexer.meta.forward.keys", "docno");
ApplicationSetup.setProperty("indexer.meta.forward.keylens", "30");
// Create a new Index
MemoryIndex memIndex = new MemoryIndex();
// For each file
for (String filename : new File(aDirectoryToIndex).list() ) {
String fullPath = aDirectoryToIndex+filename;
// Convert it to a Terrier Document
Document document = new TaggedDocument(Files.openFileReader(fullPath), new HashMap(), Tokeniser.getTokeniser());
// Add a meaningful identifier
document.getAllProperties().put("docno", filename);
// index it
memIndex.indexDocument(document);
}
// Set up the querying process
ApplicationSetup.setProperty("querying.processes", "terrierql:TerrierQLParser,"
+ "parsecontrols:TerrierQLToControls,"
+ "parseql:TerrierQLToMatchingQueryTerms,"
+ "matchopql:MatchingOpQLParser,"
+ "applypipeline:ApplyTermPipeline,"
+ "localmatching:LocalManager$ApplyLocalMatching,"
+ "filters:LocalManager$PostFilterProcess");
// Enable the decorate enhancement
ApplicationSetup.setProperty("querying.postfilters", "org.terrier.querying.SimpleDecorate");
// Create a new manager run queries
Manager queryingManager = ManagerFactory.from(memIndex.getIndexRef());
// Create a search request
SearchRequest srq = queryingManager.newSearchRequestFromQuery("search for document");
// Specify the model to use when searching
srq.setControl(SearchRequest.CONTROL_WMODEL, "BM25");
// Enable querying processes
srq.setControl("terrierql", "on");
srq.setControl("parsecontrols", "on");
srq.setControl("parseql", "on");
srq.setControl("applypipeline", "on");
srq.setControl("localmatching", "on");
srq.setControl("filters", "on");
// Enable post filters
srq.setControl("decorate", "on");
// Run the search
queryingManager.runSearchRequest(srq);
// Get the result set
ScoredDocList results = srq.getResults();
// Print the results
System.out.println("The top "+results.size()+" of documents were returned");
System.out.println("Document Ranking");
// Rank counter; the original snippet printed an undeclared variable i
int i = 0;
for(ScoredDoc doc : results) {
int docid = doc.getDocid();
double score = doc.getScore();
String docno = doc.getMetadata("docno");
System.out.println(" Rank "+i+": "+docid+" "+docno+" "+score);
i++;
}
}
}
The indexing part seems to be okay.
These lines of the retrieval part seem problematic:
Manager queryingManager = ManagerFactory.from(memIndex.getIndexRef());
cursor message: Cannot resolve method 'getIndexRef' in 'MemoryIndex'
srq.setControl(SearchRequest.CONTROL_WMODEL, "BM25");
cursor message: Cannot resolve symbol 'CONTROL_WMODEL'
ScoredDocList results = srq.getResults();
cursor message: Cannot resolve method 'getResults' in 'SearchRequest'
I think the problem is that there are new ways to do this and some methods are now deprecated.
Could anyone try this code and see if it works?
It is a Maven project.
These are the dependencies:
<dependencies>
<dependency>
<groupId>org.terrier</groupId>
<artifactId>terrier-core</artifactId>
<version>5.5</version>
</dependency>
<dependency>
<groupId>org.terrier</groupId>
<artifactId>terrier-core</artifactId>
<version>5.4</version>
</dependency>
<dependency>
<groupId>org.terrier</groupId>
<artifactId>terrier-core</artifactId>
<version>5.1</version>
</dependency>
<dependency>
<groupId>org.terrier</groupId>
<artifactId>terrier-realtime</artifactId>
<version>5.1</version>
</dependency>
<dependency>
<groupId>org.terrier</groupId>
<artifactId>terrier-core</artifactId>
<version>4.4</version>
</dependency>
<dependency>
<groupId>org.terrier</groupId>
<artifactId>terrier-core</artifactId>
<version>4.2</version>
</dependency>
<dependency>
<groupId>org.terrier</groupId>
<artifactId>terrier-batch-indexers</artifactId>
<version>5.4</version>
</dependency>
<dependency>
<groupId>org.terrier</groupId>
<artifactId>terrier-batch-retrieval</artifactId>
<version>5.4</version>
</dependency>
<dependency>
<groupId>org.terrier</groupId>
<artifactId>terrier-index-api</artifactId>
<version>5.5</version>
</dependency>
</dependencies>
I am trying to run Selenium code with @Test, but Eclipse does not run it and asks for a main method.
I've added Maven Project and added Selenium & TestNG dependencies to pom.xml
Please help with the issue I'm facing
Sample code:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Properties;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.testng.annotations.Test;
public class BaseTest {
@Test
public void openBrowser() {
File file = new File("./src/test/resources/config.properties");
FileInputStream fileInput = null;
try {
fileInput = new FileInputStream(file);
} catch (FileNotFoundException e) {
e.printStackTrace();
System.out.println("File not found");
}
Properties prop = new Properties();
//load properties file
try {
prop.load(fileInput);
} catch (IOException e) {
e.printStackTrace();
System.out.println("Values not found");
}
System.setProperty("webdriver.chrome.driver","./src/test/resources/drivers/chromedriver.exe");
WebDriver driver = new ChromeDriver();
driver.get(prop.getProperty("baseURL"));
}
}
On running it, I get:
Error: Main method not found in class com.digivalsolutions.digiassess.LoginTest, please define the main method as:
public static void main(String[] args)
or a JavaFX application class must extend javafx.application.Application
My pom.xml file:
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-java</artifactId>
<version>3.141.59</version>
</dependency>
<dependency>
<groupId>org.testng</groupId>
<artifactId>testng</artifactId>
<version>7.3.0</version>
<scope>test</scope>
</dependency>
</dependencies>
I think you should install TestNG for your IDE. You can check this link for more details.
I'm trying to get the Metadata Values from an Office Document and all it shows as key-value pair is this one:
Content-Type: application/zip
I just can't tell what the issue is here. Why does it only show the Content-Type?
What I'm interested in are keys like title.
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class App
{
private static final String PATH = "C:/docs/myDocument.docx";
public static void main( String[] args ) throws IOException, SAXException, TikaException
{
Metadata metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
// try-with-resources closes the file once parsing is done
try (InputStream fileStream = new FileInputStream(PATH)) {
parser.parse(fileStream, handler, metadata);
}
String[] metadataNames = metadata.names();
for (String key : metadataNames) {
String value = metadata.get(key);
System.out.println(key + ": " + value);
}
}
}
Promoting a comment to an answer - you appear to be missing some key Apache Tika jars or their dependencies.
If you're using Maven, then your pom should (as of January 2015) have something like:
<properties>
<tika.version>1.7</tika.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>${tika.version}</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>${tika.version}</version>
</dependency>
</dependencies>
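The tika-core artifact gives you everything you need to run Tika and develop your own parsers, but it doesn't come with any parsers. It's the tika-parsers artifact (plus its dependencies!) that provides all the built-in Tika parsers, which you need to process files like yours. Without them, Tika can only detect the outer container, which for a .docx is a zip file, hence Content-Type: application/zip.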
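To see why a .docx shows up as application/zip when no parser is available, it can help to look inside the file with nothing but the JDK. The sketch below is not Tika and not how Tika does it; it just treats the document as a zip archive and reads the title from docProps/core.xml, which is where Office documents conventionally keep their core properties (an assumption here, since the real location is resolved through the package relationships). The class name DocxTitleSketch and the sampleDocx helper are invented for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical illustration using only the JDK; it does not use Tika.
public class DocxTitleSketch {
    private static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Returns the dc:title stored in docProps/core.xml, or null if there is none. */
    public static String readTitle(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"docProps/core.xml".equals(entry.getName())) {
                    continue;
                }
                // Copy the entry first so the XML parser cannot close the zip stream
                ByteArrayOutputStream xml = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                for (int n; (n = zip.read(buf)) != -1; ) {
                    xml.write(buf, 0, n);
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                Document doc = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(xml.toByteArray()));
                NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
                return titles.getLength() == 0 ? null : titles.item(0).getTextContent();
            }
        }
        return null;
    }

    /** Builds a tiny in-memory zip that mimics the relevant part of a .docx (demo only). */
    public static byte[] sampleDocx(String title) throws Exception {
        String coreXml = "<cp:coreProperties xmlns:cp=\"http://schemas.openxmlformats.org/"
                + "package/2006/metadata/core-properties\" xmlns:dc=\"" + DC_NS + "\">"
                + "<dc:title>" + title + "</dc:title></cp:coreProperties>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("docProps/core.xml"));
            zip.write(coreXml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] docx = sampleDocx("My Document");
        System.out.println("title: " + readTitle(new ByteArrayInputStream(docx)));
    }
}
```

With the tika-parsers artifact on the classpath you don't need any of this; Tika's Office parser fills the Metadata object for you. The point is only that the title is in the file, and that something has to open the zip to reach it.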
I want to POS-tag an English sentence and do some processing. I would like to use OpenNLP. I have it installed.
When I execute the command
I:\Workshop\Programming\nlp\opennlp-tools-1.5.0-bin\opennlp-tools-1.5.0>java -jar opennlp-tools-1.5.0.jar POSTagger models\en-pos-maxent.bin < Text.txt
It gives output POSTagging the input in Text.txt
Loading POS Tagger model ... done (4.009s)
My_PRP$ name_NN is_VBZ Shabab_NNP i_FW am_VBP 22_CD years_NNS old._.
Average: 66.7 sent/s
Total: 1 sent
Runtime: 0.015s
I hope that means it is installed properly?
Now, how do I do this POS tagging from inside a Java application? I have added the OpenNLP tools, jwnl and maxent jars to the project, but how do I invoke the POS tagging?
Here's some (old) sample code I threw together, with modernized code to follow:
package opennlp;
import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import java.io.File;
import java.io.IOException;
import java.io.StringReader;
public class OpenNlpTest {
public static void main(String[] args) throws IOException {
POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
POSTaggerME tagger = new POSTaggerME(model);
String input = "Can anyone help me dig through OpenNLP's horrible documentation?";
ObjectStream<String> lineStream =
new PlainTextByLineStream(new StringReader(input));
perfMon.start();
String line;
while ((line = lineStream.read()) != null) {
String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE.tokenize(line);
String[] tags = tagger.tag(whitespaceTokenizerLine);
POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
System.out.println(sample.toString());
perfMon.incrementCounter();
}
perfMon.stopAndPrintFinalResult();
}
}
The output is:
Loading POS Tagger model ... done (2.045s)
Can_MD anyone_NN help_VB me_PRP dig_VB through_IN OpenNLP's_NNP horrible_JJ documentation?_NN
Average: 76.9 sent/s
Total: 1 sent
Runtime: 0.013s
This is basically working from the POSTaggerTool class included as part of OpenNLP. The sample.getTags() is a String array that has the tag types themselves.
This requires direct file access to the training data, which is really, really lame.
An updated codebase for this is a little different (and probably more useful.)
First, a Maven POM:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.javachannel</groupId>
<artifactId>opennlp-example</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>1.6.0</version>
</dependency>
<dependency>
<groupId>org.testng</groupId>
<artifactId>testng</artifactId>
<version>[6.8.21,)</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
</plugins>
</build>
</project>
And here's the code, written as a test, therefore located in ./src/test/java/org/javachannel/opennlp/example:
package org.javachannel.opennlp.example;
import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.util.stream.Stream;
public class POSTest {
private void download(String url, File destination) throws IOException {
URL website = new URL(url);
// try-with-resources closes both the download channel and the output file
try (ReadableByteChannel rbc = Channels.newChannel(website.openStream());
FileOutputStream fos = new FileOutputStream(destination)) {
fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
}
}
@DataProvider
Object[][] getCorpusData() {
return new Object[][][]{{{
"Can anyone help me dig through OpenNLP's horrible documentation?"
}}};
}
@Test(dataProvider = "getCorpusData")
public void showPOS(Object[] input) throws IOException {
File modelFile = new File("en-pos-maxent.bin");
if (!modelFile.exists()) {
System.out.println("Downloading model.");
download("http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin", modelFile);
}
POSModel model = new POSModel(modelFile);
PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
POSTaggerME tagger = new POSTaggerME(model);
perfMon.start();
Stream.of(input).map(line -> {
String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE.tokenize(line.toString());
String[] tags = tagger.tag(whitespaceTokenizerLine);
POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
perfMon.incrementCounter();
return sample.toString();
}).forEach(System.out::println);
perfMon.stopAndPrintFinalResult();
}
}
This code doesn't actually test anything - it's a smoke test, if anything - but it should serve as a starting point. Another (potentially) nice thing is that it downloads a model for you if you don't have it downloaded already.
The URL http://bulba.sdsu.edu/jeanette/thesis/PennTags.html does not work anymore. I found the below on the 14th slide at http://www.slideshare.net/gagan1667/opennlp-demo
The above answer does provide a way to use the existing models from OpenNLP but if you need to train your own model, maybe the below can help:
Here is a detailed tutorial with full code:
https://dataturks.com/blog/opennlp-pos-tagger-training-java-example.php
Depending upon your domain, you can build a dataset either automatically or manually. Building such a dataset manually can be really painful; tools like a POS tagger can help make the process much easier.
Training data format
Training data is passed as a text file where each line is one data item. Each word in the line should be labeled in a format like "word_LABEL", where the word and the label name are separated by an underscore '_'.
anki_Brand overdrive_Brand
just_ModelName dance_ModelName 2018_ModelName
aoc_Brand 27"_ScreenSize monitor_Category
horizon_ModelName zero_ModelName dawn_ModelName
cm_Unknown 700_Unknown modem_Category
computer_Category
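Before handing such a file to the trainer, it can be worth sanity-checking the lines. The following is a small sketch, independent of OpenNLP, that splits one training line into word/label pairs. It assumes the label is everything after the last underscore in each token (so a word like 27" or one containing underscores still works); the class name WordTagLine is made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for checking training lines; not part of OpenNLP.
public class WordTagLine {

    /** Splits "word_LABEL word_LABEL ..." into {word, label} pairs. */
    public static List<String[]> parse(String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank line
            }
            // Assumption: the label starts after the last underscore
            int sep = token.lastIndexOf('_');
            if (sep <= 0 || sep == token.length() - 1) {
                throw new IllegalArgumentException("Not in word_LABEL form: " + token);
            }
            pairs.add(new String[] { token.substring(0, sep), token.substring(sep + 1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        // One of the sample lines from the training data above
        for (String[] pair : parse("aoc_Brand 27\"_ScreenSize monitor_Category")) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Running every line of the training file through a check like this catches tokens with a missing label before training starts, which is easier to debug than a failure inside the trainer.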
Train model
The important class here is POSModel, which holds the actual model. We use the class POSTaggerME to do the model building. Below is the code to build a model from a training data file:
public POSModel train(String filepath) {
POSModel model = null;
TrainingParameters parameters = TrainingParameters.defaultParams();
parameters.put(TrainingParameters.ITERATIONS_PARAM, "100");
try {
try (InputStream dataIn = new FileInputStream(filepath)) {
ObjectStream<String> lineStream = new PlainTextByLineStream(new InputStreamFactory() {
@Override
public InputStream createInputStream() throws IOException {
return dataIn;
}
}, StandardCharsets.UTF_8);
ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);
model = POSTaggerME.train("en", sampleStream, parameters, new POSTaggerFactory());
return model;
}
}
catch (Exception e) {
e.printStackTrace();
}
return null;
}
Use model to do tagging.
Finally, we can see how the model can be used to tag unseen queries:
public void doTagging(POSModel model, String input) {
input = input.trim();
POSTaggerME tagger = new POSTaggerME(model);
Sequence[] sequences = tagger.topKSequences(input.split(" "));
for (Sequence s : sequences) {
List<String> tags = s.getOutcomes();
System.out.println(Arrays.asList(input.split(" ")) +" =>" + tags);
}
}