poi XWPF double space - java

I am using the Apache POI XWPF library to create Word documents. For the last couple of days I have been trying to find out how to double space the whole document. I have checked the Apache javadocs and searched the internet, but can't seem to find any answers.
I found the addBreak() method, but it won't work because the user will input multiple paragraphs, and breaking those paragraphs into single lines seemed unreasonable. If this method is used per paragraph, it won't create the double space between each line, only between each paragraph.
Here is a small part of the code I currently have.
public class Paper {

    public static void main(String[] args) throws IOException, XmlException {
        ArrayList<String> para = new ArrayList<String>();
        para.add("The first paragraph of a typical business letter is used to state the main point of the letter. Begin with a friendly opening; then quickly transition into the purpose of your letter. Use a couple of sentences to explain the purpose, but do not go in to detail until the next paragraph.");
        para.add("Beginning with the second paragraph, state the supporting details to justify your purpose. These may take the form of background information, statistics or first-hand accounts. A few short paragraphs within the body of the letter should be enough to support your reasoning.");

        XWPFDocument document = new XWPFDocument();

        // Calls on createParagraph() method which creates a single paragraph
        for (int i = 0; i < para.size(); i++) {
            createParagraph(document, para.get(i));
        }

        FileOutputStream outStream = null;
        try {
            outStream = new FileOutputStream("ResearchPaper.docx");
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        try {
            document.write(outStream);
            outStream.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Creates a single paragraph with a one tab indentation
    private static void createParagraph(XWPFDocument document, String para) {
        XWPFParagraph paraOne = document.createParagraph();
        paraOne.setFirstLineIndent(700); // Indents first line of paragraph to the equivalence of one tab
        XWPFRun one = paraOne.createRun();
        one.setFontSize(12);
        one.setFontFamily("Times New Roman");
        one.setText(para);
    }
}
Just to make sure my question is clear, I am trying to find out how to double space a Word document (.docx), so between each line there should be one line of space. This is the same thing as pressing Ctrl+2 when editing a Word document.
Thank you for any help.

It appears that there isn't a high-level method available for what you're trying to achieve. In that case, you'll need to delve into the low-level API of Apache POI. Below is a way of doing that. I'm not saying this is the best way to go about it; I've only found that it works for me when I want to recreate some bizarre feature of MS Word.
1. Find out where the information is stored in the document
If you need to tweak something manually, create 2 documents with as little content as possible: one that contains what you want to do, and one that doesn't. Save both as Office XML documents, because that makes it easy to read. Diff those files - there should only be a handful of changes, and you should have your location in the document structure.
For your case, this is the thing you're looking for.
Unmodified document:
<w:body><w:p> <!-- only included here so you know where to look -->
<w:pPr>
<w:jc w:val="both" />
<w:rPr>
<w:lang w:val="nl-BE" />
</w:rPr>
</w:pPr>
Modified document:
<w:body><w:p>
<w:pPr>
<w:spacing w:line="480" w:lineRule="auto" /> <!-- BINGO -->
<w:jc w:val="both" />
<w:rPr>
<w:lang w:val="nl-BE" />
</w:rPr>
</w:pPr>
So now you know that you need an object called spacing, that it has some properties, and that it's stored somewhere in the Paragraph object.
2. Get to that location through the low-level API, if at all possible
This part is tricky, because the XML node names are somewhat cryptic and maybe you don't know the terminology very well. Also, the API isn't always a 1:1 mapping of the node names, so you have to take some guesses and just try to step through the method calls. Pro tip: download the source code for Apache POI! You WILL hit some dead ends and you might not get where you want to be by the shortest path, but when you do, you feel like an arcane Master of POI. And then you write gloating posts about it on Q&A sites.
For your case in MS Word, this is a path you might take (not necessarily the best, I'm not an expert on the high-level API):
// you probably don't need this first line,
// but you'd need it if you were manipulating an existing document
IBody body = doc.getBodyElements().get(0).getBody();
for (XWPFParagraph par : body.getParagraphs()) {
    // work the crazy abbreviated API magic;
    // note that getPPr() itself can be null on a bare paragraph
    CTPPr ppr = par.getCTP().getPPr();
    CTSpacing spacing = (ppr == null) ? null : ppr.getSpacing();
    if (spacing == null) {
        // it looks hellish to create a CTSpacing object yourself,
        // so let POI do it by setting any spacing parameter
        par.setSpacingLineRule(LineSpacingRule.AUTO);
        // now the paragraph's spacing shouldn't be null anymore
        spacing = par.getCTP().getPPr().getSpacing();
    }
    // you can set your value, as demonstrated by the XML (w:line="480")
    spacing.setLine(BigInteger.valueOf(480));
    // not sure if this one is necessary
    spacing.setLineRule(STLineSpacingRule.Enum.forString("auto"));
}
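Applied to the createParagraph method from the question, the same idea could look roughly like this (a sketch, not tested against every POI version; it needs imports for CTSpacing, LineSpacingRule and java.math.BigInteger):

private static void createParagraph(XWPFDocument document, String para) {
    XWPFParagraph paraOne = document.createParagraph();
    paraOne.setFirstLineIndent(700);
    // let POI create the underlying pPr/spacing elements
    paraOne.setSpacingLineRule(LineSpacingRule.AUTO);
    CTSpacing spacing = paraOne.getCTP().getPPr().getSpacing();
    // 480 twentieths of a point with lineRule="auto" is what Word writes for double spacing
    spacing.setLine(BigInteger.valueOf(480));
    XWPFRun one = paraOne.createRun();
    one.setFontSize(12);
    one.setFontFamily("Times New Roman");
    one.setText(para);
}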
3. Bask in your glory !

Related

Replacing text in XWPFParagraph without changing format of the docx file

I am developing a font converter app which will convert Unicode font text to Krutidev/Shree Lipi (Marathi/Hindi) font text. In the original docx file there are formatted words (i.e. color, font, size of the text, hyperlinks, etc.).
I want to keep the format of the final docx the same as the original docx after converting words from Unicode to the other font.
PFA.
Here is my Code
try {
    fileInputStream = new FileInputStream("StartDoc.docx");
    document = new XWPFDocument(fileInputStream);
    XWPFWordExtractor extractor = new XWPFWordExtractor(document);
    List<XWPFParagraph> paragraph = document.getParagraphs();
    Converter data = new Converter();
    for (XWPFParagraph p : document.getParagraphs()) {
        for (XWPFRun r : p.getRuns()) {
            String string2 = r.getText(0);
            string2 = data.uniToShree(string2);
            r.setText(string2, 0);
        }
    }
    // Write the document to the file system
    FileOutputStream out = new FileOutputStream(new File("Output.docx"));
    document.write(out);
    out.close();
    System.out.println("Output.docx written successfully");
}
catch (IOException e) {
    System.out.println("We had an error while reading the Word Doc");
}
Thank you for the question.
I worked with POI some years ago, though on Excel workbooks, but I'll still try to help you reach the root cause of your error.
The Java runtime gives you good debugging information by itself!
A good first step to disambiguate the error is to not overwrite the exception message that is handed to you.
Try printing the results of e.getLocalizedMessage() or e.getMessage() and see what you get.
Getting the stack trace with the printStackTrace method is also often useful to pinpoint where your error lies.
Share your findings from the above method calls so we can help you debug the issue further.
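For example, a catch block along these lines (just a sketch) surfaces the real cause instead of the generic message:

catch (IOException e) {
    // print the actual cause before (or instead of) the generic message
    System.out.println("Error while processing the Word doc: " + e.getMessage());
    e.printStackTrace();
}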
[EDIT 1:]
So it seems, you are able to process the file just right with respect to the font conversion of the data, but you are not able to reconstruct the formatting of the original data in the converted data file.
(thus, "We had an error while reading the Word Doc", is a lie getting printed ;) )
Now, there are 2 elements to a Word document:
Content
Structure or Schema
You are able to convert the data because you are working only on the content of your doc files.
In order to retain the formatting of the contents, your solution needs to be aware of the formatting of the doc files as well and take care of it.
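For instance, in the XWPF object model the formatting of a piece of text lives on its run (the rPr element), so one way to keep it intact is to write the converted text back onto the existing runs instead of creating new ones. A rough sketch, assuming your uniToShree method returns the converted string:

for (XWPFParagraph p : document.getParagraphs()) {
    for (XWPFRun r : p.getRuns()) {
        String text = r.getText(0);
        if (text != null) {
            // overwrite the text in place; the run keeps its own rPr (formatting)
            r.setText(data.uniToShree(text), 0);
        }
    }
}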
MS Word, which defines the doc files and their extension (.docx), follows a particular set of schemas that define the rules of formatting. These schemas are defined in Microsoft's XML namespace packages [1].
You can obtain the XML (HTML) form of the doc file you want quite easily (see the steps in [1] or the code in [2]) and even apply different schemas, or possibly your own schema definitions based on the definitions provided by MS's namespaces. You can do this programmatically, for which you need to get versed in XML, XSL and XSLT concepts (w3schools [3] is a good starting point), though that method is no less complex than writing your own version of MS Word; or you can use MS Word's built-in tools as shown in [1].
[1]. https://www.microsoftpressstore.com/articles/article.aspx?p=2231769&seqNum=4#:~:text=During%20conversion%2C%20Word%20tags%20the,you%20can%20an%20HTML%20file.
[2]. https://svn.apache.org/repos/asf/poi/trunk/src/scratchpad/testcases/org/apache/poi/hwpf/converter/TestWordToHtmlConverter.java
[3]. https://www.w3schools.com/xml/
My answer provides you with a cursory overview of how to achieve what you want, but depending on your inclination and time availability, you may want to use your discretion before deciding to head down one path rather than the other.
Hope it helps!

How create a table of contents page in the PDF file from the bookmarks with iText?

I need to create a table of contents page in the PDF. I will create it by reading the bookmarks in the PDF.
With iText I use:
tmp = SimpleBookmark.getBookmark(reader);
Testing with this PDF:
Download file PDF
It returns this map:
[{Action=GoTo, Named=__WKANCHOR_2, Title=Secretariat Teste0}, {Action=GoTo, Named=__WKANCHOR_4, Title=Secretariat TestBook1}, {Action=GoTo, Named=__WKANCHOR_6, Title=Secretariat Test2}, {Action=GoTo, Named=__WKANCHOR_8 ...
Without the page number.
How can I show a table of contents with title and page number?
I would like to show this:
Please read the answer to this question: Java: Reading PDF bookmark names with itext
It explains how you can use the SimpleBookmark class to get the titles of an outline tree (which is what "bookmarks" are called in the PDF specification).
public void inspectPdf(String filename) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(filename);
    List<HashMap<String, Object>> bookmarks = SimpleBookmark.getBookmark(reader);
    for (int i = 0; i < bookmarks.size(); i++) {
        showTitle(bookmarks.get(i));
    }
    reader.close();
}

public void showTitle(HashMap<String, Object> bm) {
    System.out.println((String) bm.get("Title"));
    List<HashMap<String, Object>> kids = (List<HashMap<String, Object>>) bm.get("Kids");
    if (kids != null) {
        for (int i = 0; i < kids.size(); i++) {
            showTitle(kids.get(i));
        }
    }
}
Then read the answer to this question: Set inherit Zoom(action property) to bookmark in the pdf file
You'll see that the HashMap<String, Object> doesn't only contain an entry with key "Title", but that it can also contain an entry with key "Page". That is the case when the bookmark points at a page. The value will be an explicit destination. It will consist of the page number, a value such as Fit, FitH, FitB, XYZ, followed by some parameters that mark the position.
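For bookmarks that do carry a "Page" entry, the page number is the first token of that value (e.g. "3 FitH 600"), so you could extend showTitle along these lines (a sketch; showTitleAndPage is just an illustrative name):

public void showTitleAndPage(HashMap<String, Object> bm) {
    // "Page" is present for explicit destinations, absent for named destinations
    String page = (String) bm.get("Page");
    String pageNumber = (page == null) ? "?" : page.split(" ")[0];
    System.out.println(bm.get("Title") + " ..... " + pageNumber);
    List<HashMap<String, Object>> kids = (List<HashMap<String, Object>>) bm.get("Kids");
    if (kids != null) {
        for (HashMap<String, Object> kid : kids) {
            showTitleAndPage(kid);
        }
    }
}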
If you look at the CreateOutlineTree example, you'll see that you can also extract the bookmarks as an XML file:
public void createXml(String src, String dest) throws IOException {
    PdfReader reader = new PdfReader(src);
    List<HashMap<String, Object>> list = SimpleBookmark.getBookmark(reader);
    SimpleBookmark.exportToXML(list, new FileOutputStream(dest), "ISO8859-1", true);
    reader.close();
}
This is a screenshot from a book I wrote about iText that shows you which keys you can expect in a bookmark entry:
As you can tell from this table, a link can also be expressed as a named destination. In that case, you won't get the page number, but a name. To get the page number, you need to extract the list of named destinations. This list will get you the explicit destination corresponding with the named destination.
That is also explained in the book, as well as in the official documentation.
Once you have the titles and the page numbers (retrieved with code written based on the above pointers), you can insert pages to the PDF file using PdfStamper and the insertPage() method. You can put the TOC on those pages using ColumnText, or you can create a separate PDF for the TOC and merge it with the original one. See How to add a cover/PDF in a existing iText document to find out more about these two techniques.
You will also benefit from this example: Create Index File(TOC) for merged pdf using itext library in java
As for the dashed line between the title and the page number, that's done using a separator, more specifically a dotted line separator. You should read this question first: iTextSharp - Is it possible to set a different alignment in the same cell for text
Then read this question: How to Generate Table-of-Figures Dot Leaders in a PdfPCell for the Last Line of Wrapped Text (or this question It is possible with itext 5 which at the end of a paragraph justified the remaining space is filled with scripts?)
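Pulling those pieces together, inserting a TOC page with PdfStamper and writing an entry with a dotted leader could look roughly like this (a sketch with hard-coded values; remember that inserting a page at position 1 shifts all the page numbers you collected by one):

PdfReader reader = new PdfReader(src);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
// insert an empty page at position 1 to hold the TOC
stamper.insertPage(1, reader.getPageSizeWithRotation(1));
ColumnText ct = new ColumnText(stamper.getOverContent(1));
ct.setSimpleColumn(36, 36, 559, 806);
// one TOC entry: title, dotted leader, page number
Paragraph entry = new Paragraph("Secretariat Teste0");
entry.add(new Chunk(new DottedLineSeparator()));
entry.add("2");
ct.addElement(entry);
ct.go();
stamper.close();
reader.close();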
Note that your question is actually off-topic. It's phrased as a "homework" question. It invites people to do your work in your place. Now that you have all the elements you need, you should be able to do the work yourself. If you don't succeed, you should write an on-topic Stack Overflow question. That's a question in which you show what you've tried and explain the technical problem you experience.
Update:
You shared a document with the following outline tree:
As you can see, the bookmarks are defined using named destinations, such as /__WKANCHOR_2, /__WKANCHOR_4, and so on. As you can tell from the / character, the names are stored as PDF name objects (PDF 1.1), not as PDF string objects (since 1.2). The most recent PDF standards recommend using PDF string objects instead of PDF name objects, so you may want to ask the vendor of your PDF generation software to update it to meet the recommendations of the most recent PDF standards.
Nevertheless, we can easily get the explicit destinations that correspond with those named destinations. They are stored in the /Dests entry of the root dictionary:
When you look at the way the destinations are defined, you see another problem, one that should be reported to wkhtmltopdf. Let's take a look at what the ISO standard tells us about the syntax to be used for destinations:
The concept of page numbers doesn't exist in PDF. Pages are described using page dictionaries, and the page number is derived from the position of the page in the page tree. The first page that is encountered in the page tree is page 1, the second page that is encountered is page 2, and so on.
In your example, the explicit destinations are defined like this: [9/XYZ 30.2400000 524.179999 0], [9/XYZ 30.2400000 231.379999 0], and so on.
This is wrong. The ISO standard says that the first value in the array needs to be an indirect reference. An indirect reference has the form 9 0 R, not 9. I looked at the structure of the document, and I see that wkhtmltopdf uses the page number minus 1 instead of an indirect reference. If I look at /__WKANCHOR_2, it refers to [0/XYZ 30.240000 781.459999 0], and that 0 is supposed to point to page 1. As Adobe Reader tolerates crappy software, this works in Adobe Reader, but as the file is in violation of ISO 32000, iText doesn't know what to do with those misleading destinations; at least, the convenience class SimpleNamedDestination doesn't know what to do with them.
Fortunately, iText is a very versatile library that allows you to go deep under the hood of a PDF. In this case, we only have to go one level deeper. Instead of SimpleNamedDestination.getNamedDestination(reader, true), we can use the following approach:
HashMap<String, PdfObject> names = reader.getNamedDestinationFromNames();
for (Map.Entry<String, PdfObject> entry : names.entrySet()) {
    System.out.print(entry.getKey());
    System.out.print(": p");
    PdfArray arr = (PdfArray) entry.getValue();
    System.out.println(arr.getAsNumber(0).intValue() + 1);
}
reader.close();
The output of this method is:
__WKANCHOR_w: p7
__WKANCHOR_y: p7
__WKANCHOR_2: p1
__WKANCHOR_4: p1
__WKANCHOR_16: p9
__WKANCHOR_14: p8
__WKANCHOR_18: p9
__WKANCHOR_1s: p13
__WKANCHOR_a: p2
__WKANCHOR_1q: p13
__WKANCHOR_1o: p12
__WKANCHOR_12: p8
__WKANCHOR_1m: p12
__WKANCHOR_e: p3
__WKANCHOR_10: p7
__WKANCHOR_1k: p12
__WKANCHOR_c: p3
__WKANCHOR_1i: p11
__WKANCHOR_i: p4
__WKANCHOR_8: p2
__WKANCHOR_g: p3
__WKANCHOR_1g: p11
__WKANCHOR_6: p1
__WKANCHOR_1e: p10
__WKANCHOR_m: p5
__WKANCHOR_1c: p10
__WKANCHOR_k: p4
__WKANCHOR_q: p5
__WKANCHOR_1a: p9
__WKANCHOR_o: p5
__WKANCHOR_u: p6
__WKANCHOR_s: p6
If we check __WKANCHOR_2, we see that it correctly points at page 1. I checked the final link in the outlines, it points at the named destination with name __WKANCHOR_1s and indeed: that should link to page 13.
Your problem is a clear example of a "garbage in-garbage out" problem. Your tool produces PDFs that are in violation with the ISO standard for PDF, and as a result you lose plenty of time trying to figure out what's wrong. But what's even worse: you made me lose time because of someone else's fault.

JSoup seems to ignore character codes?

I'm building a small CMS-like application in Java that takes a .txt file with shirt names/descriptions and loads them into an ArrayList of customShirts (a small class I made). Then it iterates through the ArrayList, and uses JSoup to parse a template (template.html) and insert the unique details of each shirt into the HTML. Finally, it pumps out each shirt into its own HTML file in an output folder.
When the descriptions are loaded into the ArrayList of customShirts, I replace all special characters with the appropriate character codes so they can be properly displayed (for example, replacing apostrophes with &rsquo;). The issue is, I've noticed that JSoup seems to automatically turn the character codes back into the actual characters, which is a problem, since I need the output to be displayable (which requires character codes). Is there anything I can do to fix this? I've looked at other workarounds, like at: Jsoup unescapes special characters, but they seem to require parsing the file before insertion with replaceAll, and I insert the character-code-sensitive text with JSoup, which doesn't seem to make this an option.
Below is the code for the HTML generator I made:
public void generateShirtHTML() {
    for (int i = 0; i < arrShirts.size(); i++) {
        File input = new File("html/template/template.html");
        Document doc = null;
        try {
            doc = Jsoup.parse(input, "UTF-8", "");
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

        Element title = doc.select("title").first();
        title.append(arrShirts.get(i).nameToCapitalized());

        Element headingTitle = doc.select("h1#headingTitle").first();
        headingTitle.html(arrShirts.get(i).nameToCapitalized());

        Element shirtDisplay = doc.select("p#alt1").first();
        shirtDisplay.html(arrShirts.get(i).name);

        Element descriptionBox = doc.select("div#descriptionbox p").first();
        descriptionBox.html(arrShirts.get(i).desc);
        System.out.println(arrShirts.get(i).desc);

        PrintWriter output;
        try {
            output = new PrintWriter("html/output/" + arrShirts.get(i).URL);
            output.println(doc.outerHtml());
            //System.out.println(doc.outerHtml());
            output.close();
        } catch (FileNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        System.out.println("Shirt " + i + " HTML generated!");
    }
}
Thanks in advance!
Expanding a little on my comment (since Stephan encouraged me..), you can use
doc.outputSettings().escapeMode(Entities.EscapeMode.extended);
To tell Jsoup to escape / encode special characters in the output, e.g. left double quotes (“) as &ldquo;. To make Jsoup encode all special characters, you may also need to add
doc.outputSettings().charset("ASCII");
In order to ensure that all Unicode special characters will be HTML encoded.
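Putting both together in your generator, right before writing the file, would look something like this:

doc.outputSettings()
   .escapeMode(Entities.EscapeMode.extended)
   .charset("ASCII");
output.println(doc.outerHtml());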
For larger projects where you have to fill in data into HTML files, you can look at using a template engine such as Thymeleaf - it's easier to use for this kind of job (less code and such), and it offers many more features specifically for this process. For small projects (like yours), Jsoup is good (I've used it like this in the past), but for bigger (or even small) projects, you'll want to look into some more specialized tools.

docx4java's SectionWrapper.getHeaderFooterPolicy -- can I use this to remove headers & footers

Rewritten to look more like a programming question
Okay, so I have done a little more research and it looks like the Java package I need to use is docx4j. Unfortunately, my lack of familiarity with the package, as well as with the underpinnings of the PDF format, makes it difficult for me to figure out exactly how to make use of the headers and footers returned by SectionWrapper.getHeaderFooterPolicy(). It's not entirely clear whether the HeaderPart and FooterPart objects returned are writable or how to modify them.
There is this code which offers an example of how to create a header part but it creates a new HeaderPart and adds it to the document.
I want to find existing header/footer parts and either remove them if possible or empty them out. Ideally they would be entirely gone from the document.
This code is similar and allows you to set the text of a headerpart using setJaxbElement but so much of this terminology is unfamiliar and I'm concerned the end result will be me creating headers (albeit empty ones) in each document rather than removing them.
Original Question Below
I am dealing with a set of wildly varying MS Word documents. I am compiling them into a single PDF and want to make sure that none of them have headers or footers before doing so.
Ideally, I'd also like to override their default font if it isn't Times New Roman.
Is there any way to do this programmatically or using some sort of batch process?
I will be running this on a Windows server that doesn't currently have Office or Word installed (although I think it might have an install of OpenOffice, and of course it's easy to just add an install as well).
Right now I'm using some version of iText (Java) to convert the files to PDF. I know that apparently iText can't do things like removing headers/footers, but since the underlying structure of modern .doc files is XML, I'm wondering if there is an API (or even an XML parsing/editing API or, if all else fails, a regex [horrors]) for removing the headers and footers and setting some default styles.
Here is some code hot off the press to do what you want:
public class HeaderFooterRemove {

    public static void main(String[] args) throws Exception {
        // A docx or a dir containing docx files
        String inputpath = System.getProperty("user.dir") + "/testHF.docx";
        StringBuilder sb = new StringBuilder();
        File dir = new File(inputpath);
        if (dir.isDirectory()) {
            String[] files = dir.list();
            for (int i = 0; i < files.length; i++) {
                if (files[i].endsWith("docx")) {
                    sb.append("\n\n" + files[i] + "\n");
                    removeHFFromFile(new java.io.File(inputpath + "/" + files[i]));
                }
            }
        } else if (inputpath.endsWith("docx")) {
            sb.append("\n\n" + inputpath + "\n");
            removeHFFromFile(new java.io.File(inputpath));
        }
        System.out.println(sb.toString());
    }

    public static void removeHFFromFile(File f) throws Exception {
        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(f);
        MainDocumentPart mdp = wordMLPackage.getMainDocumentPart();
        // Remove from sectPr
        SectPrFinder finder = new SectPrFinder(mdp);
        new TraversalUtil(mdp.getContent(), finder);
        for (SectPr sectPr : finder.getSectPrList()) {
            sectPr.getEGHdrFtrReferences().clear();
        }
        // Remove rels
        List<Relationship> hfRels = new ArrayList<Relationship>();
        for (Relationship rel : mdp.getRelationshipsPart().getRelationships().getRelationship()) {
            if (rel.getType().equals(Namespaces.HEADER)
                    || rel.getType().equals(Namespaces.FOOTER)) {
                hfRels.add(rel);
            }
        }
        for (Relationship rel : hfRels) {
            mdp.getRelationshipsPart().removeRelationship(rel);
        }
        wordMLPackage.save(f);
    }
}
The above code relies on SectPrFinder, so copy that somewhere.
I've left the imports out for brevity, but you can copy those from GitHub.
When it comes to making the set of docx into a single PDF, obviously you can either merge them into a single docx, then convert that to PDF, or convert them all to PDF, then merge those PDFs. If you prefer the former approach (for example, because end-users want to be able to edit the package of documents), then you may wish to consider our commercial extension for docx4j, MergeDocx.
To remove the headers/footers, there is a quite easy solution:
Open the docx as a zip, and remove the files named header*.xml / footer*.xml (located in the word folder).
Structure of a unzipped docx: https://stackoverflow.com/tags/docx/info
To really remove the links (if you don't do it, the document will probably end up corrupted):
You need to edit the document.xml.rels file and remove all the Relationships that reference a footer/header. This is a relationship that you should remove:
<Relationship Id="rId13" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/footer" Target="footer2.xml"/>
and, more generally, all those whose Type contains 'footer' or 'header'.
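A rough sketch of that zip-based approach using Java's zip file system (the file name and the regex on the rels file are only illustrative, and the regex trick is crude; a real solution would edit the XML properly):

import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.HashMap;
import java.util.Map;

public class StripHeadersFooters {
    public static void main(String[] args) throws Exception {
        Map<String, String> env = new HashMap<>();
        env.put("create", "false");
        URI uri = URI.create("jar:" + Paths.get("input.docx").toUri());
        try (FileSystem zip = FileSystems.newFileSystem(uri, env)) {
            // delete word/header*.xml and word/footer*.xml
            try (DirectoryStream<Path> parts = Files.newDirectoryStream(zip.getPath("/word"), "{header,footer}*.xml")) {
                for (Path part : parts) {
                    Files.delete(part);
                }
            }
            // drop the header/footer relationships so the package stays consistent
            Path rels = zip.getPath("/word/_rels/document.xml.rels");
            String xml = new String(Files.readAllBytes(rels), StandardCharsets.UTF_8);
            xml = xml.replaceAll("<Relationship[^>]*relationships/(header|footer)[^>]*/>", "");
            Files.write(rels, xml.getBytes(StandardCharsets.UTF_8));
        }
    }
}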

How do I classify a word of a text into things like names, numbers, money, dates, etc.?

I asked some questions about text mining a week ago, but I was a bit confused and still am, though now I know what I want to do.
The situation: I have a lot of downloaded pages with HTML content. Some of them can be text from a blog, for example. They are not structured and come from different sites.
What I want to do: I will split all the words on whitespace and I want to classify each one, or a group of them, into some pre-defined items like names, numbers, phones, emails, URLs, dates, money, temperatures, etc.
What I know: I know the concepts of / have heard about Natural Language Processing, Named Entity Recognition, POS tagging, Naive Bayes, HMMs, training and a lot of other things to do classification, etc., but there are several different NLP libraries with different classifiers and ways to do this, and I don't know what to use or what to do.
WHAT I NEED: I need some code example of a classifier, NLP, whatever, that can classify each word from a text separately, and not an entire text. Something like this:
// This is pseudo-code for what I want, not an implementation
classifier.trainFromFile("file-with-train-words.txt");
words = text.split(" ");
for (String word : words) {
    classifiedWord = classifier.classify(word);
    System.out.println(classifiedWord.getType());
}
Can somebody help me? I'm confused by the various APIs, classifiers and algorithms.
You should try Apache OpenNLP. It is easy to use and customize.
If you are doing it for Portuguese, there is information in the project documentation on how to do it using the Amazonia Corpus. The types supported are:
Person, Organization, Group, Place, Event, ArtProd, Abstract, Thing, Time and Numeric.
Download the OpenNLP and the Amazonia Corpus. Extract both and copy the file amazonia.ad to the apache-opennlp-1.5.1-incubating folder.
Execute the TokenNameFinderConverter tool to convert the Amazonia corpus to the OpenNLP format:
bin/opennlp TokenNameFinderConverter ad -encoding ISO-8859-1 -data amazonia.ad -lang pt > corpus.txt
Train your model (change the encoding to the encoding of the corpus.txt file, which should be your system default encoding; this command can take several minutes):
bin/opennlp TokenNameFinderTrainer -lang pt -encoding UTF-8 -data corpus.txt -model pt-ner.bin -cutoff 20
Executing it from the command line (you should execute only one sentence at a time, and the tokens should be separated):
$ bin/opennlp TokenNameFinder pt-ner.bin
Loading Token Name Finder model ... done (1,112s)
Meu nome é João da Silva , moro no Brasil . Trabalho na Petrobras e tenho 50 anos .
Meu nome é <START:person> João da Silva <END> , moro no <START:place> Brasil <END> . <START:abstract> Trabalho <END> na <START:abstract> Petrobras <END> e tenho <START:numeric> 50 anos <END> .
Executing it using the API:
// load the trained model (declare it outside the try block so it stays in scope)
TokenNameFinderModel model = null;
InputStream modelIn = new FileInputStream("pt-ner.bin");
try {
    model = new TokenNameFinderModel(modelIn);
}
catch (IOException e) {
    e.printStackTrace();
}
finally {
    if (modelIn != null) {
        try {
            modelIn.close();
        }
        catch (IOException e) {
        }
    }
}
// load the name finder
NameFinderME nameFinder = new NameFinderME(model);
// pass the token array to the name finder
String[] toks = {"Meu","nome","é","João","da","Silva",",","moro","no","Brasil",".","Trabalho","na","Petrobras","e","tenho","50","anos","."};
// the Span objects will show the start and end of each name, also the type
Span[] nameSpans = nameFinder.find(toks);
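To see what the finder returned, you can walk the spans; each Span carries the start/end token index and the type (a quick sketch):

// print the entity type and the covered tokens for each span found
for (Span span : nameSpans) {
    StringBuilder entity = new StringBuilder();
    for (int i = span.getStart(); i < span.getEnd(); i++) {
        entity.append(toks[i]).append(" ");
    }
    System.out.println(span.getType() + ": " + entity.toString().trim());
}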
To evaluate your model you can use 10-fold cross validation: (only available in 1.5.2-INCUBATOR, to use it today you need to use the SVN trunk) (it can take several hours)
bin/opennlp TokenNameFinderCrossValidator -lang pt -encoding UTF-8 -data corpus.txt -cutoff 20
Improve the precision/recall by using the Custom Feature Generation (check documentation), for example by adding a name dictionary.
You can use a Named Entity Recognizer (NER) approach for this task. I would highly recommend you take a look at the Stanford CoreNLP page and use the NER functionality in its modules for your task. You can break up your sentences into tokens and then pass them to the Stanford NER system.
I think the Stanford CoreNLP page has lots of examples that can help you; otherwise, please let me know if you need a toy example.
Here goes the sample code; this is just a snippet of the whole code:
// creates a StanfordCoreNLP object with the annotators that NER depends on
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
String[] words = text.split(" ");
for (String word : words) {
    Annotation document = new Annotation(word);
    pipeline.annotate(document);
    // print the NER label assigned to each token
    for (CoreLabel token : document.get(CoreAnnotations.TokensAnnotation.class)) {
        System.out.println(token.word() + " -> " + token.get(CoreAnnotations.NamedEntityTagAnnotation.class));
    }
}
This problem falls at the intersection of several ideas from different areas. You mention named entity recognition, and that is one. However, you are probably looking at a mixture of part-of-speech tagging (for nouns, names and the like) and information extraction (for numbers, phone numbers, emails).
Unfortunately, doing this and making it work on real-world data will require some effort, and it is not as simple as using this or that API.
You have to create specific functions for extracting and detecting each data type and its errors.
Or, to use the well-known name, do it in an object-oriented way.
For example, to detect currency, check for a dollar sign at the beginning or end, and check whether there are non-numeric characters attached, which would mean an error.
You should write down what you already do in your head. It's not that hard if you follow the rules.
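A minimal sketch of that rule-based idea (the class name and patterns are only illustrative and would need tuning for real data):

import java.util.regex.Pattern;

public class SimpleWordClassifier {

    private static final Pattern MONEY = Pattern.compile("\\$\\d+([.,]\\d{1,2})?");
    private static final Pattern NUMBER = Pattern.compile("\\d+([.,]\\d+)?");
    private static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+\\.[\\w.]+");
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    // classify a single token with simple hand-written rules
    public static String classify(String word) {
        if (MONEY.matcher(word).matches()) return "MONEY";
        if (EMAIL.matcher(word).matches()) return "EMAIL";
        if (URL.matcher(word).matches()) return "URL";
        if (NUMBER.matcher(word).matches()) return "NUMBER";
        return "UNKNOWN"; // hand the rest to a statistical NER model (names, places, ...)
    }
}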
There are 3 golden rules in Robotics/AI:
analyse it,
simplify it,
digitalize it.
That way you can talk with computers.
