I have generated two PDFs using this example (FirstPDF), removing the "new Date()" statement.
They look identical, but when I calculate an MD5 hash of each, they turn out to be quite different.
I've examined them and they both record a creation date, even though the statement document.addCreationDate() is not included in the source code.
The question is very simple: is it possible, in any way, with any API, to generate two PDFs that are exactly equal byte for byte?
This is how it SHOULD be. Apart from the date in the metadata, there's
also a unique ID that is added every time a PDF is generated from
scratch.
If you need two identical files giving you the same MD5 hash, why not copy one that's been created already?
If you need to create two identical files by two separate API calls, then you can use any PDF-creating API that's worth its money:
Because each of these APIs contains a call to set the creation and modification dates of the output PDF to any value you need... Just don't let this setting happen automatically! Use the same setting both times.
Attention! PDF also supports setting a document UUID. Some of these APIs also set an arbitrary UUID for each new document (which would break your MD5 comparison), unless you actively prevent this from happening.
As described here, the files are not equal because they have different identifiers (two files created at different moments should have different IDs, as defined in the PDF specification).
The file identifier is usually a hash created from the date, a path name, the size of the file, and part of the content of the PDF file (e.g. the entries in the information dictionary).
File identifiers are involved (and mandatory) in document encryption. As a result, encrypted PDF files with different file identifiers will have streams that are completely different.
By design, you should never be able to create two identical PDFs using the same code.
Related
I was writing a program that implements a dictionary.
What I actually did was write a Java applet to show the words defined in a .xml file, using the org.w3c.dom package.
Now I want to add a new feature: users can modify a word in the dictionary in the program, and the modification is then saved to the original .xml file.
Here is my question: what should I do to save the changes? Note that users can only modify one word at a time, so I don't want to load the whole file, modify the relevant part, and re-write the whole file to disk. Is there a smarter way to do that?
An XML file is a sequential text file. This means there is no formula or other convenient way to locate the n-th word in a dictionary stored in XML. Elements need to be written one after the other, character by character (and one character may not correspond to exactly one byte). Thus, what is called a random (in-place) update is out.
Look at JAXB for a most convenient way to read and write XML, and invest some work so that a user cannot update in memory and terminate the program without saving.
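For example, a rough sketch of that round trip (the Dictionary/Word classes, element names and updateWord method here are made up for illustration; this assumes the javax.xml.bind JAXB that ships with older JDKs):

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

public class DictionarySaver {

    @XmlRootElement(name = "dictionary")
    public static class Dictionary {
        private List<Word> words = new ArrayList<Word>();
        @XmlElement(name = "word")
        public List<Word> getWords() { return words; }
        public void setWords(List<Word> words) { this.words = words; }
    }

    public static class Word {
        private String name;
        private String definition;
        @XmlElement public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        @XmlElement public String getDefinition() { return definition; }
        public void setDefinition(String definition) { this.definition = definition; }
    }

    // Reads the whole dictionary, changes one entry, writes the whole file back.
    public static void updateWord(File xmlFile, String wordName, String newDefinition) throws Exception {
        JAXBContext ctx = JAXBContext.newInstance(Dictionary.class);

        // Load the complete document into memory.
        Dictionary dict = (Dictionary) ctx.createUnmarshaller().unmarshal(xmlFile);

        // Change only the entry the user edited.
        for (Word w : dict.getWords()) {
            if (wordName.equals(w.getName())) {
                w.setDefinition(newDefinition);
            }
        }

        // Re-write the complete document to disk.
        Marshaller m = ctx.createMarshaller();
        m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
        m.marshal(dict, xmlFile);
    }
}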
Reading and writing files in specific formats is a little trickier than you portray.
Seen with "XML eyes" you are only changing a portion of the file - but to do that on the file level you need to seek to the position of change and write new bytes from there. The problem with that is that the content after that position won't adjust according to the new portion you write.
TL;DR - no - you need to read+write the complete XML file when making changes.
I am using FileUtils to compare two identical pdfs. This is the code:
boolean comparison = FileUtils.contentEquals(pdfFile1, pdfFile2);
Despite the fact that both pdf files are identical, I keep getting false. I also noticed that when I execute:
byte[] byteArray = FileUtils.readFileToByteArray(pdfFile1);
byte[] byteArrayTwo = FileUtils.readFileToByteArray(pdfFile2);
System.out.println(byteArray);
System.out.println(byteArrayTwo);
I get the following bytecode for the two pdf files:
[B#3a56f631
[B#233d28e3
So even though both PDF files are absolutely identical visually, their byte-code is different and hence the boolean test fails. Is there any way to test whether two visually identical PDF files are identical?
Unfortunately for PDF there is a big difference between having "identical files" and having files that are "visually identical". So the first question is what you are looking for.
One very simple example, information in a PDF file can be compressed or not, and can be compressed with different compression filters. Taking a file where some of the content is not compressed, and compressing that content with a ZIP compression filter for example, would give you two files that are very different on a byte level, yet very much the same visually.
So you can do a number of different things to compare PDF files:
1) If you want to check whether you have "the same file", read them in and calculate some sort of checksum as answered before by Peter Petrov.
2) If you want to know whether or not files are visually identical, the most common method is some kind of rendering. Render all pages to images and compare the images. In practice this is not as simple as it sounds, and there are both simple (for example callas pdfToolbox) and complex (for example Global Vision DigitalPage) applications that implement some kind of "sameness" algorithm (caution, I'm related to both of those vendors).
So define very well what exactly you need first, then choose carefully which approach would work best.
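If you do go the rendering route and want to experiment yourself, here is a minimal sketch assuming Apache PDFBox 2.x (the class/method names are from that version; a strict pixel-by-pixel check like this is far less tolerant than the commercial "sameness" algorithms mentioned above):

import java.awt.image.BufferedImage;
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class VisualPdfCompare {
    // Returns true if every page renders to a pixel-identical image at the given DPI.
    public static boolean visuallyIdentical(File f1, File f2, float dpi) throws Exception {
        try (PDDocument d1 = PDDocument.load(f1); PDDocument d2 = PDDocument.load(f2)) {
            if (d1.getNumberOfPages() != d2.getNumberOfPages()) {
                return false;
            }
            PDFRenderer r1 = new PDFRenderer(d1);
            PDFRenderer r2 = new PDFRenderer(d2);
            for (int page = 0; page < d1.getNumberOfPages(); page++) {
                BufferedImage img1 = r1.renderImageWithDPI(page, dpi);
                BufferedImage img2 = r2.renderImageWithDPI(page, dpi);
                if (img1.getWidth() != img2.getWidth() || img1.getHeight() != img2.getHeight()) {
                    return false;
                }
                for (int y = 0; y < img1.getHeight(); y++) {
                    for (int x = 0; x < img1.getWidth(); x++) {
                        if (img1.getRGB(x, y) != img2.getRGB(x, y)) {
                            return false;
                        }
                    }
                }
            }
            return true;
        }
    }
}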
Yes, generate an MD5 sum from both files and see if the sums are identical. If they are, then your files are identical too, with a certainty that is practically 100%. If the sums are not identical, then your files are different for sure.
To generate the MD5 sums, on Linux there's an md5sum command; for Windows there's a small tool called fciv.
http://www.microsoft.com/en-us/download/details.aspx?id=11533
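If you'd rather do it from Java instead of the command line, here is a small sketch using only the standard library (the file names are placeholders):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class Md5Compare {
    // Reads the whole file and returns its MD5 digest as a hex string.
    public static String md5Hex(String path) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(path));
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String h1 = md5Hex("file1.pdf");
        String h2 = md5Hex("file2.pdf");
        System.out.println(h1.equals(h2) ? "identical (practically certain)" : "different");
    }
}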
Just to note, the two identifiers you wrote
[B#3a56f631
[B#233d28e3
are different because they belong to two different objects. What you printed is each array's default toString() output (a type tag plus an identity hash code), not bytecode. Two objects can be logically equal even if they are not the same object instance (i.e. they have different identities).
Otherwise, calculating an MD5 checksum as peter.petrov wrote is a good idea.
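For completeness, a quick fragment reusing the arrays from the question (only java.util.Arrays is needed):

import java.util.Arrays;

// Compare the contents of the arrays, not the references:
boolean sameBytes = Arrays.equals(byteArray, byteArrayTwo);
System.out.println(sameBytes);

// To see the actual bytes rather than the default toString() output:
System.out.println(Arrays.toString(byteArray));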
I have two large XML files (3 GB, 80,000 records). One is an updated version of the other. I want to identify which records changed (were added/updated/deleted). There are some timestamps in the files, but I am not sure they can be trusted. The same goes for the order of records within the files.
The files are too large to load into memory as XML (even one, never mind both).
The way I was thinking about it is to do some sort of parsing/indexing of content offsets within the first file at record level, with an in-memory map of IDs, then stream the second file and use random access to compare the records that exist in both. This would probably take 2 or 3 passes, but that's fine. But I cannot find an easy library/approach that would let me do it. vtd-xml with VTDNavHuge looks interesting, but I cannot understand (from the documentation) whether it supports random-access revisiting and loading of records based on pre-saved locations.
Java library/solution is preferred, but C# is acceptable too.
Just parse both documents simultaneously using SAX or StAX until you encounter a difference, then exit. It doesn't keep the document in memory. Any standard XML library will support S(t)AX. The only problem would be if you consider different order of elements to be insignificant...
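A rough sketch of that idea with StAX (standard javax.xml.stream only; note it treats element order and whitespace as significant, which may or may not be what you want):

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StreamingXmlDiff {
    // Streams both documents in parallel and stops at the first differing event.
    public static boolean sameContent(String path1, String path2) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in1 = new FileInputStream(path1);
             FileInputStream in2 = new FileInputStream(path2)) {
            XMLStreamReader r1 = factory.createXMLStreamReader(in1);
            XMLStreamReader r2 = factory.createXMLStreamReader(in2);
            while (r1.hasNext() && r2.hasNext()) {
                int e1 = r1.next();
                int e2 = r2.next();
                if (e1 != e2) {
                    return false;
                }
                if (e1 == XMLStreamConstants.START_ELEMENT
                        && !r1.getLocalName().equals(r2.getLocalName())) {
                    return false;
                }
                if (e1 == XMLStreamConstants.CHARACTERS
                        && !r1.getText().equals(r2.getText())) {
                    return false;
                }
            }
            // Both documents must end at the same time to be equal.
            return !r1.hasNext() && !r2.hasNext();
        }
    }
}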
I am currently writing a program which takes user input and creates rows of a comma-delimited .csv file. I need a way to save this data so that users cannot easily edit it. It does not need to be super secure, just enough so that it couldn't accidentally be edited. I also need another file (or the same file?) to be easily accessible (in the file system) by the user, so that they may then email it to a system admin who can then open the .csv file. I could provide this second person with a conversion program if necessary.
The file I save data in and the file to be sent can be two different files if there are any advantages to this. I am currently considering just using a file with an odd file extension, but saving it as a text file, so that the user will only be able to open it if they know to try that. The other option is some sort of encryption, but I'm not sure that is necessary, and even if it were, I wouldn't know where to start.
Thanks for the help :)
Edit: This file is meant to store the actual data being entered. Currently the data is being gathered on paper forms which are then sent to the admin to manually enter all of the data. This little app is meant to have someone else enter the data from the paper form and then tell them if they've entered it all correctly. After they've entered it all they then need to send the data to the admin. It would be preferable if the sending was handled automatically, but this app needs to be very simple and low budget and I don't want an internet connection to be a requirement.
You could store your data in a serializable object and save that. It would resist casual editing and be very simple to read and write from your app. This page should get you started: http://java.sun.com/developer/technicalArticles/Programming/serialization/
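A minimal sketch of that idea (the FormRecord class and the file name are made up for illustration):

import java.io.*;

public class RecordStore {
    // Hypothetical class holding one row of form data.
    static class FormRecord implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;
        String value;
    }

    public static void main(String[] args) throws Exception {
        FormRecord r = new FormRecord();
        r.name = "field1";
        r.value = "example";

        // Write the record as a binary blob that resists casual editing.
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("records.dat"))) {
            out.writeObject(r);
        }

        // Read it back later (or in the admin's conversion program).
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("records.dat"))) {
            FormRecord back = (FormRecord) in.readObject();
            System.out.println(back.name + " = " + back.value);
        }
    }
}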
From your question, I am guessing that the uneditable file's purpose is to store some kind of system config and you don't want it to get messed up easily. From your own suggestions, it seems that even knowing that the file has been edited would help you, since you can then avoid using it. If that is the case, then you can use simple checks, such as save the total number of characters in the line as the first or last comma delimited value. Then, before you use the file, you just run a small validation code on it to verify that the file is indeed unaltered.
Another approach may just be to use a ZIP (file) of a "plain text format" (CSV, XML, other serialization method, etc) and, optionally, utilize a well-known (to you) password.
This approach could be used with other stream/package types: the idea behind using a ZIP (as opposed to an object serializer directly) is so that one can open/inspect/modify said data/file(s) easily without special program support. This may or may not be a benefit and using a password may or may not even be required, see below.
Some advantages of using a ZIP (or CAB):
The ability to hold multiple resources (aids in extensibility)
The ability to save the actual data in a "text format" (XML, perhaps)
Maintain competitive file-sizes for "common data"
Re-use existing tooling support (also get checksum validation for free!)
Additionally, using a non-ZIP file extension will prevent most users from casually associating the file (a similar approach to what is presented in the original post, but subtly different because the ZIP format itself is not "plain text") with the ZIP format and being able to open it. A number of modern Microsoft formats utilize the fact that the file-extension plays an important role and use CAB (and sometimes ZIP) formats as the container format for the document. That is, an ".XSN" or ".WSP" or ".gadget" file can be opened with a tool like 7-zip, but are generally only done so by developers who are "in the know". Also, just consider ".WAR" and ".JAR" files as other examples of this approach, since this is Java we're in.
Traditional ZIP passwords are not secure, and even less so is a static password embedded in the program. However, if this is just a deterrent (i.e. not for real "security"), then those issues are not important. Coupled with an "un-associated" file type/extension, I believe this offers the protection asked for in the question while remaining flexible. It may be possible to drop the password entirely and still prevent "accidental modifications" just by using a ZIP (or other) container format, depending upon requirements/desires.
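A minimal sketch of the container idea using only java.util.zip (no password, which the JDK's ZIP classes don't support anyway; the entry name, file path and ".dat" extension are arbitrary):

import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ContainerWriter {
    // Writes the CSV text as a single entry inside a ZIP container
    // saved under a non-obvious extension.
    public static void save(String csvContent, String path) throws Exception {
        try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream(path))) {
            zip.putNextEntry(new ZipEntry("data.csv"));
            zip.write(csvContent.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
    }
}

// Usage: ContainerWriter.save(csvText, "report.dat");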
Happy coding.
Can you set file permissions to make it read-only?
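If that fits, on the Java side it's roughly a one-liner (the path is a placeholder, and whether the flag sticks depends on the OS and file system):

// Best-effort: asks the file system to mark the file read-only; returns false if it fails.
boolean readOnly = new java.io.File("data.csv").setReadOnly();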
Other than writing a binary output file, the NTFS file system that Windows uses (I know for sure this works from XP through x64 Windows 7) has a little trick you can use to hide data from anyone simply browsing through your files: alternate data streams.
Append a colon and an arbitrary value to your output and input file names, e.g. if your filename is "data.csv", use "data.csv:42" instead. Any existing or non-existent file can be extended this way to access a whole hidden area (and every value after the colon is a distinct stream, so "data.csv:42" != "data.csv:carrots" != "second.csv:carrots").
If the base file doesn't exist, it will be created with 0 bytes of visible data. If you open "data.csv" in Notepad you will see exactly the data it held before you wrote to the ":42" stream, no more, no less, yet data written to "data.csv:42" persists and can be read back later. This makes it a handy place to hide data from any annoying user!
Caveats: if you delete "data.csv", all associated hidden streams are deleted too. Also, there are programs that will find these streams, but if your user goes through all that trouble to manually edit some CSV file, I say let them.
I also have no idea if this will work on other platforms, I've never thought to try it.
I am making a Java program that has a collection of flash-card-like objects. I store the objects in a JTree composed of DefaultMutableTreeNodes. Each node has a user object attached to it which has a few String/primitive data type parameters. However, I also want each of these objects to have an image (typical formats: jpg, png, etc.).
I would like to be able to store all of this information, including the images and the tree data to the disk in a single file so the file can be transferred between users and the entire tree, including the images and parameters for each object, can be reconstructed.
I had not approached a problem like this before, so I was not sure what the best practices were. I found XMLEncoder (http://java.sun.com/j2se/1.4.2/docs/api/java/beans/XMLEncoder.html) to be a very effective way of storing my tree and the native data type information. However, I couldn't figure out how to save the image data itself inside the XML file, and I'm not sure it is possible since the data is binary (so restricted characters would be invalid). My next thought was to associate a hash string instead of an image with each user object, and then gzip together all of the images, with the hash strings as the names, and the XML-encoded tree in the same compressed file. That seemed really contrived though.
Does anyone know a good approach for this type of issue?
Thanks!
Assuming this isn't just a serializable graph, consider bundling the files together in Jar format. If you already have your data structures working with XMLEncoder, you can reuse this code by saving the data as a jar entry.
If memory serves, the jar library has better support for Unicode name entries than the zip package, which is why I would favour it.
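A rough sketch of that approach (the CardBundleWriter class, the entry names, and the treeData/imageFiles parameters are made up for illustration):

import java.beans.XMLEncoder;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.Map;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

public class CardBundleWriter {
    // treeData is whatever XMLEncoder-friendly object holds the card/tree data;
    // imageFiles maps entry names to image file paths on disk.
    public static void save(Object treeData, Map<String, String> imageFiles,
                            String bundlePath) throws Exception {
        JarOutputStream jar = new JarOutputStream(new FileOutputStream(bundlePath));

        // 1. Serialize the tree data with XMLEncoder into a single entry.
        ByteArrayOutputStream xml = new ByteArrayOutputStream();
        XMLEncoder encoder = new XMLEncoder(xml);
        encoder.writeObject(treeData);
        encoder.close();
        jar.putNextEntry(new JarEntry("cards.xml"));
        jar.write(xml.toByteArray());
        jar.closeEntry();

        // 2. Copy each image file, byte for byte, into its own entry.
        byte[] buffer = new byte[8192];
        for (Map.Entry<String, String> e : imageFiles.entrySet()) {
            jar.putNextEntry(new JarEntry("images/" + e.getKey()));
            FileInputStream in = new FileInputStream(e.getValue());
            int read;
            while ((read = in.read(buffer)) != -1) {
                jar.write(buffer, 0, read);
            }
            in.close();
            jar.closeEntry();
        }
        jar.close();
    }
}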
You might consider using an MS JET database (.mdb file) and storing all the stuff in there. That'll also make it easy to examine and edit the data in (for example) MS Access.
You can employ a virtual file system which stores its data in a single container. We develop and offer one such file system, SolFS, however right now there's no Java binding for it. We will release a Java JNI interface for SolFS within a month.