Parsing structured documents in Java

I would like to parse some legal documents with a Java library into pieces of text that represent headers, paragraphs, etc. Legal documents are usually well structured, so I would like to use something a bit easier than JavaCC (or other parser generators). Are there any libraries that can (nearly) automatically detect such a structure?
Thanks.

I think there is no tool that can "nearly automatically" extract such structures. If the structure is really easy to extract, you do not need a tool; you can easily code it yourself. If it is not so easy, you need a tool that is powerful enough (JavaCC, ANTLR, ...).
I think parsing the text yourself with custom code is the best way. Maybe read a bit about parsing beforehand (recursive descent, lexer/parser separation, ...). For simple structures it is not hard to get a working solution quickly.
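
To give an idea of how little custom code a simple structure needs, here is a minimal sketch using only the JDK. The class name LegalTextSplitter and the heading convention (lines starting with "Article N", "Section N" or a number such as "2.1" followed by a capitalised word) are assumptions made for illustration; real documents will need their own rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Splits plain text into heading and paragraph blocks (illustrative only). */
class LegalTextSplitter {

    /** One piece of the document: either a heading or a paragraph. */
    static final class Block {
        final boolean heading;
        final String text;

        Block(boolean heading, String text) {
            this.heading = heading;
            this.text = text;
        }
    }

    // Assumed heading convention: "Article 3", "Section 12" or "2.1 Title".
    private static final Pattern HEADING = Pattern.compile(
            "(?:(?:Article|Section)\\s+\\d+|\\d+(?:\\.\\d+)*\\.?\\s+\\p{Lu}).*");

    static List<Block> split(String text) {
        List<Block> blocks = new ArrayList<>();
        StringBuilder paragraph = new StringBuilder();
        for (String rawLine : text.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                flush(paragraph, blocks);          // a blank line ends a paragraph
            } else if (HEADING.matcher(line).matches()) {
                flush(paragraph, blocks);          // a heading ends a paragraph too
                blocks.add(new Block(true, line));
            } else {
                if (paragraph.length() > 0) {
                    paragraph.append(' ');
                }
                paragraph.append(line);            // join hard-wrapped lines
            }
        }
        flush(paragraph, blocks);
        return blocks;
    }

    // Emits the paragraph collected so far, if any, and resets the buffer.
    private static void flush(StringBuilder paragraph, List<Block> blocks) {
        if (paragraph.length() > 0) {
            blocks.add(new Block(false, paragraph.toString()));
            paragraph.setLength(0);
        }
    }
}
```

Calling LegalTextSplitter.split(text) returns the blocks in document order. A paragraph line that happens to start with a number and a capitalised word would be misread as a heading, which is exactly the kind of case where a real grammar (JavaCC, ANTLR) starts to pay off.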

Apache POI - the Java API for Microsoft Documents
Apache PDFBox - Java PDF Library
An easier option is Apache Tika, a content analysis toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. It uses PDFBox and POI internally.
Usage: java -jar tika-app-0.9.jar [option] [file]
With the -t option it will parse the file(s) specified on the command line and output the extracted text content.

Related

Difference between Apache POI api and Apache Tika Api?

I have a requirement to extract specific columns/rows from an Excel/CSV file. Somebody suggested using Tika for this task.
While going through Tika, I came across the POI API and found it friendlier to use.
We may also need to parse PDF files in the future.
I am new to this technology, and I would like to know the difference between the two and which is more suitable for my requirement.
Thanks,
Krishna

Apache Tika provides a common way to extract consistent text and metadata from a wide range of formats. It also provides content detection, language detection and a few other bits. If you write your code to work with Apache Tika, then your code will be able to work with a huge range of formats in the same way. You don't need to worry about whether one format has a Title, or another calls the same logical thing a LongTitle or a Subject. You don't need to worry about what library to use for what format. You call Tika, it does the hard work for you, and back comes your consistent Metadata and Textual Content.
Apache POI is one of the libraries that Tika uses. POI supports most of the main Microsoft Office formats, including Excel (.xls and .xlsx). It provides access to the whole of the file format, allowing you complete control over what information you read out. (It also supports writing). Tika uses POI to get text and metadata out of the various different Microsoft formats, but doesn't extract everything. Using POI directly would allow you to decide what you care about and get that.
If you want to support lots of file formats, use Tika. If you want full control of how you get the information out, use POI.

Apache POI is a full-blown parser/writer for most Microsoft document formats. It supports both the newer 2007 format (XSSF) and the older Microsoft 2003 file formats (HSSF). Apache POI provides two levels of API for parsing and generating Microsoft files: a higher-level API that is a bit memory intensive because it reads the whole file and keeps it in memory, similar to DOM parsing in XML, and a lower-level API for memory-constrained use, similar to SAX/StAX parsing.
Apache Tika, on the other hand, is a content analysis tool which, I guess, only supports extraction from Microsoft Excel and a lot of other formats through its extraction components. There is no support for writing new files or generating content with Tika; that is not its use case at all.
So you have to choose depending on your needs.

How to convert pdf to doc file in java

I need to convert a PDF file to a DOC file. I found different examples of generating a PDF file, but none for converting PDF to DOC.

What you're asking is actually very difficult.
I recommend you start here and look for a good parsing library. Then you would have to write the result out in .doc format. Inevitably a lot of the formatting and extra information would be lost. It would be a lot easier to output to .docx format, but I assume that's not what you're looking for.

I see a few possible solutions:
Davisor Publishor 6.2 can probably be used, but it is commercial, and it seems to generate only txt from pdf... just have a look.
Parse the pdf with iText, and then generate the doc with Apache POI. Another way to try (a free one ;)
Look for command line tools, like Convert PDF To DOC, and execute them from Java.
Otherwise take a look at Con's answer; there is a link to a list of Java PDF processing libraries. Maybe some library can do it directly, or can be used to parse the pdf (better than iText), and then you can just use Apache POI to generate the doc. Hope it helps ;)
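
For the command line route, a rough sketch of launching an external converter from Java with ProcessBuilder could look like the following. The tool name passed in (for example "pdf2doc") and its "tool input output" argument order are placeholders, not a real program; adapt them to whatever converter you actually install.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Launches an external PDF-to-DOC converter (tool name and arguments are hypothetical). */
class ExternalConverter {

    // Assumed command shape: <tool> <input> <output>. Check your tool's own usage text.
    static List<String> buildCommand(String tool, Path input, Path output) {
        return Arrays.asList(tool, input.toString(), output.toString());
    }

    /** Runs the converter and returns its exit code; gives up after the timeout. */
    static int convert(String tool, Path input, Path output, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(tool, input, output))
                .inheritIO()                       // show the tool's own messages
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();             // do not leave a hung converter behind
            throw new IOException("Converter timed out after " + timeoutSeconds + " s");
        }
        return process.exitValue();
    }
}
```

Passing the arguments as a list rather than one concatenated string avoids quoting problems with file names that contain spaces.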

Using the same API to write both Word and PDF documents

Hi all,
is there any kind of abstraction API over Apache POI/FOP that allows one to use the same API to write both Word and PDF documents?

I'm not aware of a unified API for the two libraries you have mentioned.
However you may still have a couple of options using a single API:
Use Apache POI to generate the documents in Word format and then use a Word to PDF conversion library to create a PDF from the Word document. Another commenter has suggested iText.
Use OpenOffice via its Java API to create documents and export them in Microsoft Word or PDF format.

Docmosis will do what you require, assuming you mean a Java (or command line) API. It reads doc and odt files as templates, populates/manipulates via the Java API, and produces the output formats OpenOffice supports. Have a look at the online demo on the web site which lets you see various output formats to render a document in.

When I was working on a previous project, I was sure that Apache POI can be used for Microsoft documents.
For PDF generation and alteration we have iText.jar. Please check whether this helps you.

Creating PDF for Java applications

How can I create PDFs with complex layouts in Java? I have tried using JasperReports. Are there any ideas for creating PDFs of income tax forms?

A commonly used Java API to create PDF files is iText. Give it a look. API documentation can be found here, code examples can be found here, a tutorial can be found here.
A good but less widely known Java API is OOo API wherein you can create any OOo document to your taste and finally export to PDF.

Have you taken a look at the Apache PDFBox project? I believe you can create PDFs using this library, although it is more commonly used in Lucene to convert PDFs to text to allow indexing.

You could also try Docmosis or JODConverter to do the conversion as long as you can install OpenOffice somewhere. They work on many platforms and can be Java controlled and will save you the hassle of learning the OOo UNO API.

Design your complex PDF Form with the appropriate tools, something like Acrobat Professional. Then from your Java code, you generate an FDF file (Form Data Format) and let the PDF Reader do the merging or you do it from the server-side and stream back the result.
Possible solutions to process FDF are Adobe Java FDF Toolkit or Apache PDFBox.
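
As a rough illustration of the FDF step, the sketch below builds a minimal FDF document as text with the JDK alone. The class name FdfWriter is made up for this example, the field names must match the ones defined in your PDF form, and only plain ASCII values are handled; a toolkit such as the ones named above is the safer choice for anything beyond that.

```java
import java.util.Map;

/** Builds a minimal FDF document from field name/value pairs (ASCII values only). */
class FdfWriter {

    static String toFdf(Map<String, String> fields) {
        StringBuilder fdf = new StringBuilder();
        fdf.append("%FDF-1.2\n");
        fdf.append("1 0 obj\n");
        fdf.append("<< /FDF << /Fields [\n");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // /T is the field name in the form, /V the value to fill in.
            fdf.append("<< /T (").append(escape(field.getKey()))
               .append(") /V (").append(escape(field.getValue())).append(") >>\n");
        }
        fdf.append("] >> >>\n");
        fdf.append("endobj\n");
        fdf.append("trailer\n");
        fdf.append("<< /Root 1 0 R >>\n");
        fdf.append("%%EOF\n");
        return fdf.toString();
    }

    // Backslashes and parentheses must be escaped inside PDF literal strings.
    private static String escape(String value) {
        return value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)");
    }
}
```

The resulting text can be written to a .fdf file for the PDF reader to merge, or merged with the form on the server side.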

One approach that requires very little programming is converting your Java object to XML using the Java Architecture for XML Binding (JAXB) and then using Apache FOP (XSL-FO) to create the PDF from the XML. The advantage of this approach is that it is almost 100% declarative, i.e. no programming is involved other than invoking JAXB and Apache FOP. If you want a tool to create the XSL-FO template, look at J4L FO Designer.

You can try ITextPDF.jar. Add this jar to your application and go through the examples to learn more about the tags and design procedure used for creating a PDF document. Check this link for a simple example: http://itextpdf.com/examples/iia.php?id=12

Spreadsheet Parser in Java/Groovy

Hi I'm looking to parse spreadsheets (xls/ods) in Groovy. I have been using the Roo library for Ruby and was looking to try the same tasks in Groovy, as Java is already installed on a development server I use, and I would like to keep the number of technologies on the server to a simple core few.
I am aware that the ods format is zipped XML, and so can be parsed as such, but I would like to process the file using spreadsheet concepts, not XML concepts.
The ability to process xls files is not of major importance, but would save me having to save multiple xls files to ods (as this is for parsing data from clients).
Thanks

I would suggest Apache POI for access to .xls files.
I've never had to work with the .ods format, so no information on that one.

There's also JExcelAPI, which has a nice, clean, simple interface (for the most part).
Can't help you with ODS Files though.

How about looking at 'odftoolkit' ? http://odftoolkit.openoffice.org/

Groovy in Action has a chapter named "Groovy on Windows" that discusses using Scriptom, a Groovy/COM bridge (using JACOB under the covers), to access several Windows apps including Excel.
For OpenOffice, you can use ODF Toolkit, as Amit pointed out.

I second jdmichal's vote for Apache POI. I have selected it as our library of choice to handle Excel file input (.XLS). The project is also working on the .XLSX file format if you ever decide you want to support that. Based on your specifications, I don't think you want to get into converting things into CSV, and it seems like you have established input and output paths. For anyone who hasn't had the joy of dealing with CSV to Excel conversion, it can get a bit dicey. I have spent hours dealing with issues created by Excel converting string data to numeric data. You can see other testimonies to this effect on the POI Case Studies page. Beyond these issues, I simply don't want to personally have to handle these inputs. I'd rather invest the programming effort and streamline the workflow for the future.
I too have not dealt with ODF and have no plans to support it in my current project. You might want to check out the OpenOffice.org ODF Toolkit Project.
Good luck and have fun,
- D.

I suggest you take a look at SimpleXlsBuilder and SimpleXlsSlurper; both are based on Apache POI and can fit your basic needs for reading from and writing to Excel 97 spreadsheets in a concise way.

If your spreadsheets are simple enough (without charts and other embedded content), you should simply convert the spreadsheet to CSV.
Pros:
Both xls and ods will produce the same CSV - You'll have to handle just one input type.
You won't have to mess with new versions of (Open) Office.
Handling plaintext is always more fun than other obscure formats.
Cons:
One that I can think of: finding a reliable converter from xls and ods to csv. Shouldn't be too hard; OpenOffice has a built-in one.
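
Once everything is CSV, reading a record takes little code. The following is a minimal sketch of splitting one CSV line with the JDK only; the class name CsvLine is made up for this example. It assumes comma separators and double-quote quoting, and it does not handle line breaks inside quoted fields, so a dedicated CSV library is the better choice for messy client data.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits a single CSV record into fields, honouring double-quoted fields. */
class CsvLine {

    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"');   // "" inside quotes is a literal quote
                    i++;
                } else if (c == '"') {
                    inQuotes = false;      // closing quote
                } else {
                    current.append(c);     // commas are plain text inside quotes
                }
            } else if (c == '"') {
                inQuotes = true;           // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());    // the last field has no trailing comma
        return fields;
    }
}
```

Every field comes back as a String, so values such as "007" keep their leading zeros until you decide how to convert them.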

A couple things:
1) I agree that using a CSV format can simplify some of the development work. OpenCSV can help with processing CSV files. There are other good CSV parsers for Java out there. Just remember that anything that's available for Java can be used by Groovy due to Groovy's unparalleled integration with Java.
2) I know you said you wanted to avoid handling XML, but Groovy makes XML processing exceedingly simple.
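
Since Java classes are usable from Groovy, the zipped-XML route can be sketched in plain Java with the JDK alone. This is a rough sketch under several assumptions: the class name OdsReader is made up, the sheet data is taken from the content.xml entry of the ods archive, repeated-cell attributes and nested tables are ignored, and all sheets are flattened into one list of rows.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Reads cell texts from an ods file by parsing its content.xml (simplified). */
class OdsReader {

    private static final String TABLE_NS =
            "urn:oasis:names:tc:opendocument:xmlns:table:1.0";

    /** Returns the cell texts of every row of every sheet, in document order. */
    static List<List<String>> readRows(InputStream ods)
            throws IOException, ParserConfigurationException, SAXException {
        try (ZipInputStream zip = new ZipInputStream(ods)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("content.xml")) {
                    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                    factory.setNamespaceAware(true);
                    // Files come from outside, so refuse DOCTYPE declarations.
                    factory.setFeature(
                            "http://apache.org/xml/features/disallow-doctype-decl", true);
                    return rowsOf(factory.newDocumentBuilder().parse(zip));
                }
            }
        }
        throw new IOException("content.xml not found; is this an ods file?");
    }

    // Collects the text of each table-cell, grouped by table-row.
    private static List<List<String>> rowsOf(Document content) {
        List<List<String>> rows = new ArrayList<>();
        NodeList rowNodes = content.getElementsByTagNameNS(TABLE_NS, "table-row");
        for (int r = 0; r < rowNodes.getLength(); r++) {
            NodeList cellNodes = ((Element) rowNodes.item(r))
                    .getElementsByTagNameNS(TABLE_NS, "table-cell");
            List<String> cells = new ArrayList<>();
            for (int c = 0; c < cellNodes.getLength(); c++) {
                cells.add(cellNodes.item(c).getTextContent());
            }
            rows.add(cells);
        }
        return rows;
    }
}
```

From Groovy the same class can be called directly, for example with OdsReader.readRows(new FileInputStream(file)). For anything beyond simple sheets, a spreadsheet-level library such as the ODF Toolkit mentioned above is likely less painful.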
