I have some .doc and .pdf documents, and I need to read one particular page from a file, with the page number supplied at run time. I could do this by reading page by page if every page ended with a page number, but some of the documents I receive have no page numbering. How can I handle those?
Is there an API, or some other approach, that would solve this problem?
Hello all,
I have a .doc file, but I am not supposed to read the entire file; instead I am given a page number,
so I have to read only that particular page from the file.
I am using the Apache POI API.
File file = new File("c://doc/assignment/afternoon_24.doc");
FileInputStream fis = new FileInputStream(file);
I need to read page X of this file and write it to a text file. How can I do that?
I guess there is a misunderstanding: you cannot read a DOC (or PDF) simply as an InputStream and skip pages (unless you know the file format and parse it yourself).
Both formats encode the formatting and meta information in a binary structure. Just try opening a PDF in Notepad or another plain-text editor and you will see it.
As mkl suggested: to access the contents of a DOC (or PDF) you need a library that can handle that file format. For Microsoft Office formats there is, for example, the open source library Apache POI; for PDF there is, for example, PDFBox among others, and a full thread about it. There are different libraries for each of the formats, with different features and licensing models.
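To make the point about file formats concrete, here is a minimal sketch that looks only at the first bytes of a file and reports which container format it appears to be. The class and method names are made up for illustration. The signatures checked are the standard ones: "%PDF-" for PDF, the OLE2 compound-file header used by legacy .doc files, and "PK" for ZIP-based formats such as .docx and .odt. It only identifies the format; reading pages still requires a real parser library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of POI or PDFBox.
final class FormatSniffer {
    // Header of an OLE2 compound file (legacy .doc/.xls/.ppt).
    private static final int[] OLE2 = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    private static final int[] PDF = {'%', 'P', 'D', 'F', '-'};
    private static final int[] ZIP = {'P', 'K', 3, 4};

    // Classifies a file by its leading bytes.
    static String sniff(byte[] head) {
        if (startsWith(head, PDF)) return "PDF";
        if (startsWith(head, OLE2)) return "OLE2 (legacy .doc/.xls/.ppt)";
        if (startsWith(head, ZIP)) return "ZIP (possibly .docx/.odt)";
        return "unknown";
    }

    private static boolean startsWith(byte[] data, int[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned byte values.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java FormatSniffer.java <path-to-file>
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(sniff(in.readNBytes(8)));
        }
    }
}
```

Everything after those first bytes is structured data that has to be interpreted according to the format, which is why skipping "pages" in the raw stream does not work.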
Related
I have a Data.xml file and a PDF file filled with information. I'm trying to embed the Data.xml file in the XMP metadata stream of the PDF because this data should be hidden.
I used iText to create the PDF and to add the usual metadata such as the author. But I can't work out how to add the XML as metadata in the XMP stream. Is there a function in the iText or XML Worker library that allows me to do this? I've tried, but I can't find a way to do it.
(I have no code to post because all the code written to create the PDF works perfectly; I just don't know how to proceed with what I described above. Is there something in the iText library that provides this, or should I use other tools?)
"In PDF/A-3, the data is added as a document-level attachment. That makes much more sense than to put it into an XMP stream.
The document-level attachment won't be visible on any page, but people will be able to select it in the attachment panel, just like they'll be able to see the contents of the XMP (it's easy to add a document-level attachment with iText). There are of course many other ways to add data to a PDF that isn't visible. For instance Adobe Illustrator adds proprietary artifacts as a /PieceInfo entry in the root dictionary of the PDF. That's also possible with iText. There are many solutions, all are better than abusing the XMP stream"
Attaching at the document level solved the problem.
I want to make a simple word counter for my LaTeX documents so that I can double-check that my word count is accurate. More generally, it would be useful to know whether Java can extract text from PDF files at all. Googling brought nothing up, so I am thinking maybe not? If not, why?
You can't read text from a .pdf without a PDF library that understands the format. Here are a couple of Java PDF libraries:
Apache PDFBox
iText
See also this link for an example of Java text extraction with PDFBox:
http://pdfbox.apache.org/userguide/text_extraction.html
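Once one of these libraries has turned the PDF into a plain String, the counting itself needs nothing beyond the standard library. The sketch below assumes the text has already been extracted (the extraction call is not shown, and the sample string in main is made up) and simply counts whitespace-separated tokens, so punctuation attached to a word is counted as part of that word.

```java
// Hypothetical helper for counting words in already-extracted text.
final class WordCountSketch {
    // Counts whitespace-separated tokens; returns 0 for blank input.
    static int countWords(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        // Stand-in for text produced by a PDF library.
        String extracted = "This sentence stands in for extracted PDF text.";
        System.out.println(countWords(extracted));
    }
}
```

Text extracted from a PDF often contains headers, footers, and hyphenated line breaks, so the result will usually differ a little from the count a LaTeX-aware tool reports.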
I am new to Java. I have to read a PDF, OpenOffice, or MS Word file, make changes to it, and render it as a PDF document on my web page. Can someone tell me which of these formats has the easiest API or SDK to use, and which SDK is best for this, so that I can read, update, and render easily? The file also contains tables, but no images.
We use Apache POI to read Microsoft Office files. There are many libraries for PDF in Java. iText is something I have used. Once you pick the tools, do a selective search on Stack Overflow. There are plenty of discussions around these tools.
Depending on the types of updates you are doing, modifying PDF is going to be a problem - it's not intended for editing. You might have to find some way of converting the PDF to something first, then edit. Depending on the types of changes you want to make and the documents you are working from even editing DOC and Writer files is going to be tricky. They are all different formats.
As Jayan mentioned, iText and POI may help you a little. OpenOffice Writer documents can be edited by unzipping them and modifying the XML, or by using the UNO API. Word documents can be edited by using MS Office automation (a bad idea), by converting to OpenOffice format first and then editing, or, if they are DOCX, by unzipping and processing the XML.
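As a rough illustration of the "unzip and process the XML" route for DOCX, the sketch below uses only the JDK: it opens the package as a ZIP, finds the word/document.xml part, and collects the text of the w:t elements paragraph by paragraph. The class and method names are made up for this example, and it ignores headers, footnotes, and other parts of the package, so treat it as a starting point rather than a complete reader.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper; reads only the main document part of a DOCX.
final class DocxTextSketch {
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of each paragraph followed by a newline.
    static String extractText(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                // readAllBytes stops at the end of the current ZIP entry.
                byte[] xml = zip.readAllBytes();
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // Refuse DOCTYPE declarations to avoid XXE on untrusted files.
                factory.setFeature(
                        "http://apache.org/xml/features/disallow-doctype-decl", true);
                Document dom = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(xml));
                StringBuilder text = new StringBuilder();
                NodeList paragraphs = dom.getElementsByTagNameNS(W_NS, "p");
                for (int i = 0; i < paragraphs.getLength(); i++) {
                    Element paragraph = (Element) paragraphs.item(i);
                    NodeList runs = paragraph.getElementsByTagNameNS(W_NS, "t");
                    for (int j = 0; j < runs.getLength(); j++) {
                        text.append(runs.item(j).getTextContent());
                    }
                    text.append('\n');
                }
                return text.toString();
            }
        }
        throw new java.io.IOException("word/document.xml not found in package");
    }
}
```

Modifying a document this way means changing the DOM and writing every entry back into a new ZIP, which is more work than reading. For anything beyond simple text changes a library such as POI is likely the easier path.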
Good luck.
I have an app that generates a docx file based on user input. It uses Apache POI to generate the docx file, and I can write it to a FileOutputStream; the document opens perfectly on a local machine when I write it to a file.
The webapp uses Dojo xhrPost to send the necessary data to the server to generate the document. What I am wondering is how to get the docx file to the client.
I know I can do it by creating a temp file and passing the location of that file to the client to download, but I would think there is a way to do it by piping the output stream straight to the client, which would be much cleaner.
Any suggestions?
The answer from Mr Shiny in this SO question has an example of streaming an Excel file; it should be very similar for a docx:
How can I get an Input Stream from HSSFWorkbook Object
Except that the content type for a docx should be application/vnd.openxmlformats-officedocument.wordprocessingml.document (application/msword is for the legacy .doc format).
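The idea of writing straight to the response instead of to a temp file can be sketched without a servlet container by using the HTTP server that ships with the JDK. This is only an illustration under stated assumptions: the DocumentWriter interface, the /report.docx path, and the file name are all invented here, and the placeholder bytes in main stand in for whatever POI's write(OutputStream) call would produce.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// Hypothetical sketch; a real webapp would do the same inside a servlet.
final class DocxStreamSketch {
    static final String DOCX_TYPE =
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

    // Stand-in for "something that can write a document to a stream".
    interface DocumentWriter {
        void writeTo(OutputStream out) throws IOException;
    }

    // Starts a server that streams the document directly into the response body.
    static HttpServer start(int port, DocumentWriter writer) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/report.docx", exchange -> {
            exchange.getResponseHeaders().set("Content-Type", DOCX_TYPE);
            exchange.getResponseHeaders().set(
                    "Content-Disposition", "attachment; filename=\"report.docx\"");
            // A length of 0 selects chunked encoding, so no temp file or buffer is needed.
            exchange.sendResponseHeaders(200, 0);
            try (OutputStream out = exchange.getResponseBody()) {
                writer.writeTo(out);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder bytes; with POI this would be out -> document.write(out).
        HttpServer server = start(8080, out -> out.write(new byte[] {'P', 'K', 3, 4}));
        System.out.println("Serving on port " + server.getAddress().getPort());
    }
}
```

In a servlet the same three steps apply: set the content type and Content-Disposition on the HttpServletResponse, then pass response.getOutputStream() to the document's write method. A response to an XHR call does not trigger the browser's download dialog by itself, so the client side typically has to navigate to the URL or submit a form instead.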
I need to convert a bundle of static HTML documents into a single PDF file programmatically on the server side on a Java/J2EE platform using a batch process preferably. The pdf files would be distributed to site users for offline browsing of the web pages.
The major points of the requirements are:
The banner at the top should not be present in the final pdf document.
The navigation bar on the left should be transformed into pdf bookmarks from html hyperlinks.
All hyperlinked contents (html/pdf/doc/docx etc.) present in the web pages should be part of the final pdf document with pdf bookmarks.
Is there any standard open source way of doing this?
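Try Apache FOP. I just used it to convert XML to PDF, and I think you can do the same with HTML/DOM. Note that FOP renders XSL-FO, so the HTML would first have to be transformed into XSL-FO, for example with an XSLT stylesheet. The website has a whole section on running FOP in a Java application, and there's example code for DOM to PDF.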
You can try iText - but I am not sure whether it handles all that you require.
Moreover, it is always better if you explore many options and then decide what you can and cannot do. In many cases there won't be any library/API that will out of the box support all that you ask for.
You can try www.alt-soft.com Xml2PDF for this