I am using XSLT to convert my HTML to the DOCX file format (which is Office Open XML). When I open some of the generated DOCX files in Word, it shows an error (maybe a mistake in the XML nodes). Is it possible to find out in advance whether the created document will open cleanly or show errors on opening? Or is it possible to recover the document programmatically (i.e. do what Word does when the document contains an error)? Or is there any Word API we could use in our code to recover it?
Please help me. Thanks in advance.
Try checking the relationships XML file within word/_rels and comparing it against a working docx. My docx files get corrupted when I forget to add the corresponding entries in there.
Update:
Also check that all your image file extensions are declared in the [Content_Types].xml file.
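If it helps, here is a rough sketch (plain java.util.zip, no libraries) that dumps those two parts from a generated file and from a known-good file so you can diff them; the file names are placeholders.

import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class DumpDocxParts {
    // Prints the two parts most often behind "corrupt" docx errors,
    // so you can diff the generated file against a working one.
    static void dump(String docxPath) throws Exception {
        try (ZipFile zip = new ZipFile(docxPath)) {
            for (String name : new String[] {"[Content_Types].xml", "word/_rels/document.xml.rels"}) {
                ZipEntry entry = zip.getEntry(name);
                System.out.println("==== " + docxPath + " : " + name + " ====");
                System.out.println(entry == null
                        ? "(missing!)"
                        : new String(zip.getInputStream(entry).readAllBytes()));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        dump("generated.docx"); // the file produced by your XSLT
        dump("working.docx");   // a known-good file saved by Word
    }
}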
Is it possible to find whether the created document will open or show errors while opening
In theory, you should be able to use a validating XML parser to validate your created document against the XML schemas for OOXML (see the sketch below). In practice:
You might need to do some searching to locate machine-readable versions of the relevant schemas.
It is not inconceivable that the problems are due to things that would not be picked up by schema validation.
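For illustration, a minimal sketch of that approach using Java's built-in javax.xml.validation. It assumes you have downloaded the ECMA-376 schema files locally; the path "ooxml-schemas/wml.xsd" and the docx file name are assumptions, not anything Word ships with.

import java.io.File;
import java.util.zip.ZipFile;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class DocxSchemaCheck {
    public static void main(String[] args) throws Exception {
        // wml.xsd is the WordprocessingML schema from the ECMA-376 download (assumed local path).
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new File("ooxml-schemas/wml.xsd"));
        Validator validator = schema.newValidator();

        // A docx is a ZIP archive; the main part to validate is word/document.xml.
        try (ZipFile zip = new ZipFile("generated.docx")) {
            validator.validate(new StreamSource(zip.getInputStream(zip.getEntry("word/document.xml"))));
            System.out.println("word/document.xml is schema-valid");
        }
        // A SAXException from validate() points at the first offending element.
    }
}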
Is it possible to recover the document programmatically(what word do if the document contains error)?
In general no. If the document is sufficiently different from what MS Office expects, it won't be able to "make head nor tail of it". (It ain't magical ...)
or any word api to use in our code to recover
Again, no. If the document is sufficiently different from the schema, a schema-conforming reader / writer API won't be able to cope with it.
The real solution is to figure out what the errors in your conversion software are and rectify them. Apart from validating against the schema, there are unlikely to be any real short-cuts.
Your file has probably become corrupted. To fix this, you may need to recover it using a third-party Word recovery tool.
Related
This is kind of a weird question and I don't expect to get a good answer, but I thought I'd try anyway.
I've developed a parser to parse the contents of a DOCX file into HTML and another parser to parse the contents back into DOCX format. I know there are libraries out there, but I developed my own so that information needed in the DOCX wouldn't be lost.
Long story short, I'm having trouble with corruption when the user adds an image to the html.
I've noticed that as the tags in the document.xml are read from top to bottom, the relationship IDs are generated for the document.xml.rels file. So I've made it so that when an image is added, the parser updates the relationship IDs for the rest of the tags as it creates the document.xml. Likewise, I re-create the document.xml.rels file so that its rIds correspond to the rIds in the document.xml.
After it's saved and I open it, it gives me the message that there's corruption. When I click on details it tells me "The file is corrupt and cannot be opened". It then proceeds to tell me the following:
I click yes and the document looks completely fine with no information lost. When I save that document and look at the document.xml and document.xml.rels files of both the corrupted docx and the fixed docx, I can't find any differences. I've looked through every file inside the file structure with a file-comparison tool and don't see any differences.
It makes me think it may be the way I'm zipping up the contents, but I'm not sure.
Here's a video that shows why I'm confused. I grab a corrupted file, extract the contents, then re-zip the contents to find that all the corruption is gone. There's a video in the description that shows how the files were created.
Any help is appreciated.
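If the zipping step really is the culprit, one common pitfall is wrapping the parts in an extra parent folder inside the archive or using backslash entry names. A rough sketch of re-zipping an extracted docx folder with plain java.util.zip, under those assumptions (class and parameter names are mine):

import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class DocxZipper {
    // Re-zips an extracted docx folder. Entry names must be relative to the
    // extracted root (e.g. "[Content_Types].xml", "word/document.xml") and use
    // forward slashes; zipping the parent folder itself yields a corrupt docx.
    public static void zip(Path extractedRoot, Path target) throws IOException {
        try (OutputStream os = Files.newOutputStream(target);
             ZipOutputStream zos = new ZipOutputStream(os);
             Stream<Path> files = Files.walk(extractedRoot)) {
            files.filter(Files::isRegularFile).forEach(p -> {
                String name = extractedRoot.relativize(p).toString().replace('\\', '/');
                try {
                    zos.putNextEntry(new ZipEntry(name));
                    Files.copy(p, zos);
                    zos.closeEntry();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}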
I have searched a lot before asking this question. I have a Java program which crawls some web pages looking for .doc and .pdf files, and it can download them, but a single .pdf or .doc can be 3-4 MB, which is not good because there are millions of files. So I decided to extract their text without downloading the whole file. Basically, I need to read a PDF or DOC file online and download only its text, but I could not figure out how to do that. If necessary I can provide my code.
Edit: This question can be closed now, since I got the idea and the (non-)solution.
Thanks for help.
And what's up with the downvotes on the question?
That is not possible. You can only start extracting the document once you download the bytes.
(unless you also have control over the server, you could do the extraction server-side and provide a txt download link)
Reading a file from a website on the Internet without downloading it is impossible.
If you have control of the server you could write a web service that can parse the files on demand and extract the parts you are interested in, which would then be sent to the client.
If not, and if you're up for a more challenging problem, you could write an HTTP client that starts downloading the file and parses it on the fly, downloading only as much as you need to extract the part(s) you need. This might or might not be feasible (or worthwhile) depending on where in the files the "interesting" bits are located. If they're close to the beginning in most cases, then you might be able to reduce the download size significantly.
A detailed explanation of how to accomplish this is probably beyond the guidelines for StackOverflow answer length.
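That said, a minimal sketch of the partial-download idea using an HTTP Range request; it only works if the server honours ranges, and the URL is a placeholder:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class PartialDownload {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/some-document.pdf"); // placeholder URL
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Ask for only the first 64 KB; a 206 status means the server honoured the range.
        conn.setRequestProperty("Range", "bytes=0-65535");
        try (InputStream in = conn.getInputStream()) {
            byte[] chunk = in.readAllBytes();
            System.out.println("HTTP " + conn.getResponseCode() + ", got " + chunk.length + " bytes");
            // A streaming parser would then try to pull text out of this prefix only.
        }
    }
}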
I get files from queues in Java. They may be in any of the following formats.
docx
pdf
doc
xls
xlsx
txt
rtf
After reading their extensions, I want to validate whether they are actually files of these types.
For example, I got a file and checked that it has the extension .xls. Afterwards, I want to check whether it is actually an .xls file or whether someone uploaded a file of some other format after changing its extension.
EDIT: I'd like to check the file's MIME type by actually checking its content, not its extension. How can it be done?
I don't think this is a problem you should be solving. Any solution to this problem would be brittle and based upon your current understanding of what constitutes a valid file of a particular type.
For example, take a XLS file. Do you know for sure what Excel accepts when opening such a file? Can you be sure you'll keep abreast of any changes in future releases that might support a different encoding style?
Ask yourself - what's the worst that could happen if the user uploads a file of the wrong type? Perhaps you'll pass the file to the application that handles that file extension and you'll get an error? Not a problem - just pass that error on to the user!
Without using external libraries:
You can get the file mimetype using MimetypesFileTypeMap:
import javax.activation.MimetypesFileTypeMap; // from the JavaBeans Activation Framework
File f = new File(...);
System.out.println(new MimetypesFileTypeMap().getContentType(f));
You can get a similar result with:
URLConnection.guessContentTypeFromName
Both these solutions, according to the documentation, look only at the extension.
A better option: URLConnection.guessContentTypeFromStream
File f= new File(...);
System.out.println(URLConnection.guessContentTypeFromStream(new FileInputStream(f)));
This tries to guess from the first bytes of the file - be warned that this is only a guess - I found it works in most cases, but fails to detect some obvious types.
I recommend a combination of both.
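For example, a small sketch of that combination (content sniffing first, extension map as a fallback; class and method names are mine):

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.net.URLConnection;
import javax.activation.MimetypesFileTypeMap;

public class MimeGuesser {
    // Try to sniff the type from the first bytes; fall back to the extension map.
    public static String guess(File f) throws IOException {
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(f))) {
            String fromContent = URLConnection.guessContentTypeFromStream(in);
            if (fromContent != null) {
                return fromContent;
            }
        }
        return new MimetypesFileTypeMap().getContentType(f); // extension-based fallback
    }
}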
I am new to Java. I have to read a PDF, OpenOffice or MS Word file, make changes in the file, and render it as a PDF document on my web page. Please can someone tell me which of these file formats' APIs or SDKs is easiest to use, and which SDK is best for this, so that I can read, update and render easily? The files also contain tables, but no images.
We use Apache POI to read Microsoft Office files. There are many libraries for PDF in Java. iText is something I have used. Once you pick the tools, do a selective search on Stack Overflow. There are plenty of discussions around these tools.
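As a starting point, a minimal sketch of reading a .docx with POI's XWPF API (the file name is a placeholder); iText would then be used separately on the PDF side:

import java.io.FileInputStream;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public class ReadDocx {
    public static void main(String[] args) throws Exception {
        // Open a .docx and print its paragraphs; tables are available via doc.getTables().
        try (FileInputStream in = new FileInputStream("input.docx");
             XWPFDocument doc = new XWPFDocument(in)) {
            for (XWPFParagraph p : doc.getParagraphs()) {
                System.out.println(p.getText());
            }
        }
    }
}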
Depending on the types of updates you are doing, modifying PDF is going to be a problem - it's not intended for editing. You might have to find some way of converting the PDF to something else first, then edit. Depending on the types of changes you want to make and the documents you are working from, even editing DOC and Writer files is going to be tricky. They are all different formats.
As Jayan mentioned, iText and POI may help you a little. OpenOffice Writer documents can be edited by unzipping them and modifying the XML, or by using the UNO API. Word documents can be edited by using MS Office automation (bad idea), by converting to OpenOffice first and then editing, or, if DOCX, by unzipping and processing the XML.
Good luck.
What are the steps to verify the integrity of these documents? doc, docx, docm, odt, rtf, pdf, odf, odp, xls, xlsx, xlsm, ppt, pptm
Or at least some of them - usually when uploaded to a content repository.
I guess the InputStream from a multipart HTTP request is read properly 99.99% of the time, otherwise an exception would be thrown and action taken. But a user can upload an already corrupted file - do I use third-party libraries to check for that? I didn't see anything like that in odftoolkit, itextpdf, pdfbox, apache poi or tika.
There are many kinds of "corrupt".
Some corruptions should be easy to detect. For instance a truncated ODF file will most likely fail when you attempt to open it because the ZIP reader can't read it.
Others will be literally impossible to detect. For instance a one character corruption in an RTF file will be undetectable, and so (I think) will most RTF file truncations.
I'd be surprised if you found a single (free) tool to do this job for all of those file types, even to the extent that it is technically possible. The current generation of open source libraries for reading / writing document formats tend to focus on one family of formats only. If you are serious about this, you probably need to use a commercial library.
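To make the ZIP point concrete, here is a rough check with plain java.util.zip. It only tells you the container is readable, not that the XML inside is valid, and it applies only to the ZIP-based formats (docx, xlsx, pptx, odt, odp); the class and method names are mine.

import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipSanityCheck {
    // Reading every entry to the end forces CRC verification, so truncated or
    // badly corrupted archives typically fail here with an IOException.
    public static boolean looksIntact(String path) {
        byte[] buf = new byte[8192];
        try (ZipInputStream zin = new ZipInputStream(new FileInputStream(path))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                while (zin.read(buf) != -1) {
                    // just consume the entry
                }
            }
            return true;
        } catch (IOException e) {
            return false; // could not be read as a well-formed ZIP
        }
    }
}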
For all of the above-listed file formats there are 3rd-party libraries which can open them. I don't know of a "verification only" library, but I think being able to open a file without exceptions is at least a basic check that it is within the specified format... One such (commercial) library is Aspose - not affiliated, just a happy customer...
You can do checksums/hashes (that is, a secure hash) of the file before uploading, then upload the checksum separately. If the subsequently downloaded file has the same checksum, it has not been changed (to a certain high probability, depending on the checksum/hash used) from the original.
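A small sketch of that, computing a SHA-256 hash with the JDK's MessageDigest (class and method names are mine); compute it on the original, send it alongside the upload, and compare on the other side:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class Checksum {
    // SHA-256 of a file, returned as a hex string.
    public static String sha256(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}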
Go and check the LibreOffice project (which already handles these formats); it has parts written in Java, and you could surely find and reuse its mechanisms to check for corrupted files.
I think you can get the code from here:
http://www.libreoffice.org/get-involved/developers/