I was storing some files based on a checksum, but I found a flaw: two checksums can sometimes be identical.
I always try to look for an existing API instead of reinventing the wheel, but I can't find anything.
I know there's JSR 283 (the Java Content Repository standard) and Jackrabbit for content storage, but my app is light-years away from using such a thing.
So, are there approaches for single-instance file storage in Java, or should I just keep searching for new algorithms for my checksum?
EDIT:
When the checksum is not enough: two files are exactly the same, just in different file system locations. However, when they are sent from the client it is impossible on the server side to know the path they came from, so the server sees the same file twice with the same checksum.
If you want to retrieve either one, how do you check that?
I wanted to know if there was a standard approach, API, or algorithm that could help me spot the difference.
No matter how strong a hashing algorithm is, there is always a chance of a collision. A hashing algorithm generates a finite number of hashes from an infinite number of inputs.
The only way to ensure that two files are not identical is to compare them bit by bit. Hashing them is easier and faster, but carries with it the risk of collision.
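A minimal sketch of that hash-then-verify pattern, assuming Java 17+ (Files.mismatch, HexFormat) and SHA-256 purely as an example digest:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

class SingleInstanceCheck {
    // The hash is only a hint; equal hashes still get a byte-level confirmation.
    static boolean sameContent(Path a, Path b) throws IOException, NoSuchAlgorithmException {
        if (!hexDigest(a).equals(hexDigest(b))) {
            return false;                      // different digests: definitely different files
        }
        return Files.mismatch(a, b) == -1L;    // -1 means no differing byte was found
    }

    static String hexDigest(Path p) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        // readAllBytes is fine for modest files; stream the digest for very large ones.
        return HexFormat.of().formatHex(md.digest(Files.readAllBytes(p)));
    }
}

In a single-instance store you would normally key on the digest and fall back to the byte comparison only when two digests collide.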
Related
I have a continuously growing set of files and have to ensure that there are no duplicates. By duplicate I mean identical at byte level.
The files are collected from various systems, some of which also provide hash codes for the files (but some don't). Some files may exist on multiple systems but should be imported only once.
I want to avoid unnecessary file transfers, so I thought I would just compare hash codes before actually copying. However, as I said, some of these systems don't provide a hash code and some use MD5, which I read isn't secure anymore.
My questions:
Is comparing hash codes enough to determine identical files?
What should I do when systems use different hash codes?
What should I do when systems don't provide a hash code?
Firstly, the only way to conclusively prove two files are identical is to compare them bit for bit, so if you need absolute certainty you cannot avoid transferring the files. Unless you can make certain assumptions about the files, that's just a mathematical truth.
And then we have hash functions. What a hash function tries to do is calculate some value which is highly likely to be different when the files are different. How likely depends on the actual function: a really stupid hash function might have a chance of one in ten of producing the same hash for different files; for a good hash function those chances are insanely small. For MD5, the chance that two different files happen to have the same hash is one in 2^128. I'm guessing that's good enough for your system, so you can safely assume the files are the same when the hash is the same.
Now for secure hashes, and MD5 being broken. Hash functions are not just used as a quick way to check whether things are the same; they are also used in cryptographic systems to verify that things are the same. It's only in that sense that MD5 is broken: it is possible to deliberately construct two different files with the same MD5 hash relatively quickly. If you fear someone might intentionally create a file with the same hash as another file to trick you into skipping it, you shouldn't rely on MD5. But that doesn't seem to be the case here. If no one is deliberately messing with the files, MD5 still works fine.
So to your first question, theoretically no, but realistically yes.
To the second question, you should calculate all the different hashes that might be used for each file you stored locally. E.g. calculate the md5 hash and the sha1 hash (or whatever hashes are being used on the remote systems). That way you will always have the correct type of hash to check against for each file you already have.
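For example, a sketch that computes several digests in a single pass over each local file; MD5 and SHA-1 are stand-ins for whatever the remote systems use, and HexFormat needs Java 17+:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.LinkedHashMap;
import java.util.Map;

class MultiDigest {
    /** Computes MD5 and SHA-1 of a file in one read, keyed by algorithm name. */
    static Map<String, String> digests(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {   // single pass, both digests updated
                md5.update(buffer, 0, read);
                sha1.update(buffer, 0, read);
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        result.put("MD5", HexFormat.of().formatHex(md5.digest()));
        result.put("SHA-1", HexFormat.of().formatHex(sha1.digest()));
        return result;
    }
}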
For the files which don't have a hash you can't do anything to avoid transferring them. Until you do there is nothing you know about those files. Once you transferred them you can still calculate a hash yourself so you can quickly check if you got that file.
I have a file (size = ~1.9 GB) which contains ~220,000,000 (~220 million) words / strings. They have duplication, almost 1 duplicate word every 100 words.
In my second program, I want to read the file. I can successfully read the file line by line using BufferedReader.
Now, to remove duplicates we can use a Set (and its implementations), but a Set has problems, as described in the following 3 scenarios:
With default JVM size, Set can contain up to 0.7-0.8 million words, and then OutOfMemoryError.
With 512M JVM size, Set can contain up to 5-6 million words, and then OOM error.
With 1024M JVM size, Set can contain up to 12-13 million words, and then OOM error. After 10 million records have been added to the Set, operations become extremely slow; for example, adding the next ~4000 records took 60 seconds.
I have restrictions that I can't increase the JVM size further, and I want to remove duplicate words from the file.
Please let me know if you have any idea about any other ways/approaches to remove duplicate words using Java from such a gigantic file. Many Thanks :)
Additional info: my words are basically alphanumeric IDs which are unique in our system, so they are not plain English words.
Use merge sort and remove the duplicates in a second pass. You could even remove the duplicates while merging (just keep the latest word added to output in RAM and compare the candidates to it as well).
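As an illustration of just the merge step, a sketch that merges two already-sorted run files and drops duplicates on the way out (the external sort that produces the sorted runs is omitted):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class MergeDedup {
    /** Merges two sorted run files into one sorted, duplicate-free output file. */
    static void merge(Path sortedA, Path sortedB, Path out) throws IOException {
        try (BufferedReader a = Files.newBufferedReader(sortedA);
             BufferedReader b = Files.newBufferedReader(sortedB);
             BufferedWriter w = Files.newBufferedWriter(out)) {
            String la = a.readLine(), lb = b.readLine(), last = null;
            while (la != null || lb != null) {
                String next;
                if (lb == null || (la != null && la.compareTo(lb) <= 0)) {
                    next = la; la = a.readLine();
                } else {
                    next = lb; lb = b.readLine();
                }
                if (!next.equals(last)) {   // only the last word written is kept in RAM
                    w.write(next);
                    w.newLine();
                    last = next;
                }
            }
        }
    }
}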
Divide the huge file into 26 smaller files based on the first letter of the word. If any of the letter files are still too large, divide that letter file by using the second letter.
Process each of the letter files separately using a Set to remove duplicates.
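A rough sketch of the partitioning step; since the question's IDs are alphanumeric, this buckets on the first character rather than assuming exactly 26 letters:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

class Partitioner {
    static void partitionByFirstChar(Path input, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        Map<Character, BufferedWriter> writers = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            String word;
            while ((word = reader.readLine()) != null) {
                if (word.isEmpty()) continue;
                char bucket = word.charAt(0);
                BufferedWriter w = writers.get(bucket);
                if (w == null) {
                    // One output file per leading character; the code point keeps the name filesystem-safe.
                    w = Files.newBufferedWriter(outDir.resolve("bucket_" + (int) bucket + ".txt"));
                    writers.put(bucket, w);
                }
                w.write(word);
                w.newLine();
            }
        } finally {
            for (BufferedWriter w : writers.values()) w.close();
        }
    }
}

Each bucket file can then be deduplicated independently with an in-memory Set, as described above.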
You might be able to use a trie data structure to do the job in one pass. It has advantages that recommend it for this type of problem. Lookup and insert are quick. And its representation is relatively space efficient. You might be able to represent all of your words in RAM.
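A minimal HashMap-backed trie sketch; note that a production trie would use a more compact child representation than a HashMap per node to actually realize the space savings mentioned above:

import java.util.HashMap;
import java.util.Map;

// Minimal trie for deduplication: insert() returns false if the word was already present.
class Trie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }
    private final Node root = new Node();

    boolean insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        if (node.isWord) return false;   // duplicate
        node.isWord = true;
        return true;
    }
}

Reading the file once and writing out only the words for which insert() returns true gives the single-pass deduplication.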
If you sort the items, duplicates will be easy to detect and remove, as the duplicates will bunch together.
There is code here you could use to mergesort the large file:
http://www.codeodor.com/index.cfm/2007/5/10/Sorting-really-BIG-files/1194
For large files I try not to read the data into memory but instead operate on a memory mapped file and let the OS page in/out memory as needed. If your set structures contain offsets into this memory mapped file instead of the actual strings it would consume significantly less memory.
Check out this article:
http://javarevisited.blogspot.com/2012/01/memorymapped-file-and-io-in-java.html
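A bare-bones sketch of the mapping itself (the question's ~1.9 GB file still fits under the 2 GB limit of a single MappedByteBuffer; larger files would need to be mapped in chunks):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedInput {
    static MappedByteBuffer mapWhole(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed; the OS pages it in and out,
            // so the Java heap only holds the buffer object, not the file contents.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }
}

Your set structure can then hold integer offsets into this buffer instead of String objects, as suggested above.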
Question: Are these really WORDS, or are they something else -- phrases, part numbers, etc?
For WORDS in a common spoken language one would expect that after the first couple of thousand you'd have found most of the unique words, so all you really need to do is read a word in, check it against a dictionary, if found skip it, if not found add it to the dictionary and write it out.
In this case your dictionary is only a few thousand words large. And you don't need to retain the source file since you write out the unique words as soon as you find them (or you can simply dump the dictionary when you're done).
If you have the possibility of inserting the words into a temporary database table (using batch inserts), then it becomes a SELECT DISTINCT against that table.
One classic way to solve this kind of problem is a Bloom filter. Basically you hash your word a number of times and for each hash result set some bits in a bit vector. If you're checking a word and all the bits from its hashes are set in the vector you've probably (you can set this probability arbitrarily low by increasing the number of hashes/bits in the vector) seen it before and it's a duplicate.
This was how early spell checkers worked. They knew if a word was in the dictionary, but they couldn't tell you what the correct spelling was, because the filter only tells you whether the current word has been seen.
There are a number of open source implementations out there including java-bloomfilter
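A sketch using Guava's BloomFilter as one readily available implementation (the java-bloomfilter project mentioned above works similarly); remember that false positives mean a small fraction of genuinely new words would be dropped unless you add a second check:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class BloomDedup {
    static void dedupe(Path input, Path output) throws IOException {
        // ~220 million expected insertions (from the question) at a 1% false-positive rate
        // needs only a few hundred MB, and memory stays fixed no matter how many words stream through.
        BloomFilter<String> seen = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 220_000_000L, 0.01);
        try (BufferedReader in = Files.newBufferedReader(input);
             BufferedWriter out = Files.newBufferedWriter(output)) {
            String word;
            while ((word = in.readLine()) != null) {
                if (!seen.mightContain(word)) {   // "probably not seen" => keep it and remember it
                    seen.put(word);
                    out.write(word);
                    out.newLine();
                }
            }
        }
    }
}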
I'd tackle this in Java the same way as in every other language: Write a deduplication filter and pipe it as often as necessary.
This is what I mean (in pseudo code):
Input parameters: Offset, Size
Allocate searchable structure of size Size (=Set, but need not be one)
Read Offset elements from stdin (or until EOF is encountered) and just copy them to stdout
Read Size elements from stdin (or until EOF), store them in the Set. If duplicate, drop, else write to stdout.
Read elements from stdin until EOF, if they are in Set then drop, else write to stdout
Now pipe as many instances as you need (If storage is no problem, maybe only as many as you have cores) with increasing Offsets and sane Size. This lets you use more cores, as I suspect the process is CPU bound. You can even use netcat and spread processing over more machines, if you are in a hurry.
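A minimal Java version of one such stage, assuming one word per line on stdin and the Offset/Size parameters from the pseudocode above:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.util.HashSet;
import java.util.Set;

// Usage: java DedupStage <offset> <size>   (chain several instances with pipes)
public class DedupStage {
    public static void main(String[] args) throws IOException {
        long offset = Long.parseLong(args[0]);
        long size = Long.parseLong(args[1]);
        Set<String> window = new HashSet<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(System.out))) {
            String line;
            long passed = 0;
            // Phase 1: copy the first `offset` words through untouched.
            while (passed < offset && (line = in.readLine()) != null) {
                out.write(line); out.newLine(); passed++;
            }
            // Phase 2: remember the next `size` words, dropping duplicates among them.
            long taken = 0;
            while (taken < size && (line = in.readLine()) != null) {
                taken++;
                if (window.add(line)) { out.write(line); out.newLine(); }
            }
            // Phase 3: drop anything already in the window, pass the rest through.
            while ((line = in.readLine()) != null) {
                if (!window.contains(line)) { out.write(line); out.newLine(); }
            }
        }
    }
}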
Even in English, which has a huge number of words for a natural language, the upper estimates are only about 80,000 words. Based on that, you could just use a HashSet and add all your words to it (probably all lower-cased to avoid case issues):
Set<String> words = new HashSet<String>();
try (BufferedReader reader = new BufferedReader(new FileReader("words.txt"))) {  // "words.txt" stands in for the input file
    String word;
    while ((word = reader.readLine()) != null) {   // one word per line, as read above
        words.add(word.toLowerCase());
    }
}
If they are real words, this isn't going to cause memory problems and will be pretty fast too!
To not have to worry too much about implementation, you should use a database system, either plain old relational SQL or a NoSQL solution. I'm pretty sure you could use e.g. Berkeley DB Java Edition and then do (pseudo code):
for (word : stream) {
    if (!DB.exists(word)) {
        DB.put(word);
        outstream.add(word);
    }
}
The problem is in essence easy: you need to store things on disk because there is not enough memory, then either use sorting, O(N log N) (unnecessary), or hashing, O(N), to find the unique words.
If you want a solution that will very likely work but is not guaranteed to, use an LRU-type hash table. According to the empirical Zipf's law, you should be OK.
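A sketch of such a bounded "recently seen" set, built on a LinkedHashMap in access order; the 10-million capacity is an assumption sized against the 1024M heap limit. Because of the Zipf-like skew, most repeats occur close to the first occurrence, so evicting old entries rarely lets a duplicate through.

import java.util.LinkedHashMap;
import java.util.Map;

// Remembers only the most recently seen words; older entries are evicted,
// so a duplicate that reappears after a very long gap could slip through.
class LruSeenSet {
    private static final int CAPACITY = 10_000_000;
    private final Map<String, Boolean> map =
            new LinkedHashMap<String, Boolean>(CAPACITY, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                    return size() > CAPACITY;
                }
            };

    /** Returns true if the word was not seen recently (and records it). */
    boolean addIfUnseen(String word) {
        return map.put(word, Boolean.TRUE) == null;
    }
}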
A follow up question to some smart guy out there, what if I have 64-bit machine and set heap size to say 12GB, shouldn't virtual memory take care of the problem (although not in an optimal way) or is java not designed this way?
Quicksort would be a good option over Mergesort in this case because it needs less memory. This thread has a good explanation as to why.
The most performant solutions arise from omitting unnecessary stuff. You are only looking for duplicates, so don't store the words themselves, store hashes. But wait, you are not interested in the hashes either, only in whether they have already been seen, so don't store them either. Treat the hash as a really large number and use a bitset to see whether you have already seen this number.
So your problem boils down to a really big, sparsely populated bitmap, with its size depending on the hash width. If your hash is up to 32 bits, you can use a riak bitmap.
... gone thinking about really big bitmap for 128+ bit hashes %) (I'll be back )
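A sketch of the 32-bit variant using a flat long[] as the bitmap; it assumes roughly 512 MB of heap is available for the array, and any two different words that share the same 32-bit hash value would wrongly be treated as duplicates:

// One bit per possible 32-bit hash value: 2^32 bits = 512 MB as a long[].
class HashBitmap {
    private final long[] bits = new long[1 << 26];   // 2^26 longs * 64 bits = 2^32 bits

    /** Returns true the first time this hash value is seen. */
    boolean markIfUnseen(String word) {
        long h = word.hashCode() & 0xFFFFFFFFL;      // treat the hash as an unsigned 32-bit number
        int index = (int) (h >>> 6);
        long mask = 1L << (h & 63);
        if ((bits[index] & mask) != 0) return false; // this hash value was already seen
        bits[index] |= mask;
        return true;
    }
}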
I am doing some image processing code wherein I download some images (as BufferedImage) from URLs and pass them on to an image processor.
I want to avoid passing the same image more than once to the image processor (as the image processing operation is of high cost). The URL endpoints of the images (if they are the same image) may vary, and hence I cannot prevent this by looking at the URL. So I was planning to do a checksum or hash to identify whether the code is encountering the same image again.
For MD5 I tried Fast MD5, and it generated a 20K+ character hex checksum value for the image (some sample). Obviously storing this 20K+ character hash would be an issue when it comes to database storage. Hence I tried CRC32 (from java.util.zip.CRC32), and it did generate a much smaller checksum than the hash.
I do understand that a checksum and a hash are for different purposes. For the purpose explained above, can I just use CRC32? Would it serve the purpose, or do I have to try something more than these two?
Thanks,
Abi
The difference between CRC and, say, MD5, is that it is more difficult to tamper with a file so that it matches a "target" MD5 than to tamper with it so that it matches a "target" checksum. Since this does not seem to be a problem for your program, it should not matter which method you use. Maybe MD5 is a little more CPU-intensive, but I do not know if that difference will matter.
The main question should be the number of bytes of the digest.
If you compute a checksum that fits in a 32-bit integer, then for a 2K-bit file you are fitting 2^2048 combinations into 2^32 combinations --> for every CRC value there are on the order of 2^2016 possible files that match it. With a 128-bit MD5, that drops to about 2^1920 possible files per hash value.
The bigger the code that you compute, the fewer possible collisions (given that the computed codes are distributed evenly), so the safer the comparison.
Anyway, in order to minimize possible errors, I think the first classification should be by file size... first compare file sizes; if they match, then compare checksums/hashes.
A checksum and a hash are basically the same. You should be able to calculate any kind of hash. A regular MD5 would normally suffice. If you like, you could store the size and the md5 hash (which is 16 bytes, I think).
If two files have different sizes, they are different files. You will not even need to calculate a hash over the data. If it is unlikely that you have many duplicate files, and the files are of the larger kind (like JPG pictures taken with a camera), this optimization may spare you a lot of time.
If two or more files have the same size, you can calculate the hashes and compare them.
If two hashes are the same, you could compare the actual data to see whether it really is identical after all. A collision is very, very unlikely, but theoretically possible. The larger your hash (MD5 is 16 bytes, while CRC32 is only 4), the less likely that two different files will have the same hash.
It will take only 10 minutes of programming to perform this extra check though, so I'd say: better safe than sorry. :)
To further optimize this, if exactly two files have the same size, you can just compare their data. You will need to read the files anyway to calculate their hashes, so why not compare them directly if they are the only two with that specific size.
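A sketch of that order of checks, assuming the downloaded images have been written to disk as files (Files.mismatch needs Java 12+):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Collection;

class ImageDedup {
    /** Size check first, then MD5, then a byte-for-byte confirmation. */
    static boolean isDuplicate(Path candidate, Collection<Path> existing)
            throws IOException, NoSuchAlgorithmException {
        long size = Files.size(candidate);
        byte[] candidateMd5 = null;
        for (Path other : existing) {
            if (Files.size(other) != size) continue;            // cheap filter: different size, different file
            if (candidateMd5 == null) candidateMd5 = md5(candidate);
            if (!Arrays.equals(candidateMd5, md5(other))) continue;
            if (Files.mismatch(candidate, other) == -1L) return true;  // identical bytes confirmed
        }
        return false;
    }

    static byte[] md5(Path p) throws IOException, NoSuchAlgorithmException {
        return MessageDigest.getInstance("MD5").digest(Files.readAllBytes(p));
    }
}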
I need to compare two different File instances in Java and want to do this with a fast hash function.
Idea:
- Hashing the 20 first lines in File 1
- Hashing the 20 first lines in File 2
- Compare the two hashes and return true if those are equal.
I want to use the "fastest" hash function ever implemented in Java. Which one would you choose?
If you want speed, do not hash! Especially not a cryptographic hash like MD5. These hashes are designed to be impossible to reverse, not fast to calculate. What you should use is a Checksum - see java.util.zip.Checksum and its two concrete implementations. Adler32 is extremely fast to compute.
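For example, an Adler-32 checksum over the first 20 lines of each file, as the question proposes (equal checksums only show the files start the same way; they say nothing about the rest):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.Adler32;

class QuickChecksum {
    static long checksumFirstLines(Path file, int lineCount) throws IOException {
        Adler32 adler = new Adler32();
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            String line;
            int read = 0;
            while (read < lineCount && (line = reader.readLine()) != null) {
                adler.update(line.getBytes(StandardCharsets.UTF_8));   // feed each line into the checksum
                read++;
            }
        }
        return adler.getValue();
    }
    // Usage: checksumFirstLines(file1, 20) == checksumFirstLines(file2, 20)
}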
Any method based on checksums or hashes is vulnerable to collisions, but you can minimise the risk by using two different methods in the way RSYNC does.
The algorithm is basically:
Check file sizes are equal
Break the files into chunks of size N bytes
Compute checksum on each pair of matching blocks and compare. Any differences prove files are not the same.
This allows for early detection of a difference. You can improve it by computing two checksums at once with different algorithms, or different block sizes.
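A sketch of that block-by-block early exit using CRC32 and an arbitrary 8 KB block size (when both files are local you could just as well compare the raw blocks, but the checksum form mirrors the algorithm above):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

class ChunkedCompare {
    static boolean probablySame(Path a, Path b) throws IOException {
        if (Files.size(a) != Files.size(b)) return false;       // step 1: sizes must match
        byte[] bufA = new byte[8192], bufB = new byte[8192];
        try (InputStream inA = Files.newInputStream(a);
             InputStream inB = Files.newInputStream(b)) {
            int readA;
            while ((readA = inA.read(bufA)) != -1) {
                int readB = inB.readNBytes(bufB, 0, readA);      // same-sized chunk from the other file
                CRC32 crcA = new CRC32(), crcB = new CRC32();
                crcA.update(bufA, 0, readA);
                crcB.update(bufB, 0, readB);
                if (readA != readB || crcA.getValue() != crcB.getValue()) {
                    return false;                                // early exit on the first differing block
                }
            }
        }
        return true;   // no block differed; still only "probably" identical
    }
}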
More bits in the result mean less chance of a collision, but as soon as you go over 64 bits you are outside what Java (and the computer's CPU) can handle natively and hence get slow, so FNV-1024 is less likely to give you a false negative but is much slower.
If it is all about speed, just use Adler32 and accept that very rarely a difference will not be detected. It really is rare. Checksums like these are used to ensure the internet can spot transmission errors, and how often do you get the wrong data turning up?
If it is all about accuracy, you will have to compare every byte. Nothing else will work.
If you can compromise between speed and accuracy, there is a wealth of options out there.
If you're comparing two files at the same time on the same system, there's no need to hash both of them; just check that the bytes in both files are equal as you read them. If you're looking to compare them at different times, or they're in different places, then MD5 would be both fast and adequate. There's not much reason to need a faster one unless you're dealing with really large files. Even my laptop can hash hundreds of megabytes per second.
You also need to hash the whole file if you want to verify they're identical. Otherwise you might as well just check the size and last modified time if you want a really quick check. You could also check the beginning and end of the file if they're just really large and you trust that the middle won't be changing. If you're not dealing with hundreds of megabytes though, you may as well check every byte of each file.
I'm working on a Java project for class that stores workout information in a flat file. Each file will have the information for one exercise (BenchPress.data) that holds the time (milliseconds since epoch), weight and repetitions.
Example:
1258355921365:245:12
1258355921365:245:10
1258355921365:245:8
What's the most efficient way to store and retrieve this data? It will be graphed and searched through to limit exercises to specific dates or date ranges.
One idea I had was to write the most recent information at the top of the file instead of appending it at the end. This way, when I start reading from the top, I'll have the most recent information, which will match most of the searches (an assumption).
There's no guarantee on the order of the dates, though. A user could enter exercises for today and then go in and enter last week's exercises, for whatever reason. Should I take the hit upon saving and order all of the information by date?
Should I go a completely different direction? I know a database would be ideal, but this is a group project and managing a database installation and data synch amongst us all would not be ideal. The others have no experience with databases and it'll make grading difficult.
So thanks for any advice or suggestions.
-John
Don't overcomplicate things. Unless you are dealing with millions of records, you can just read the whole thing into memory and sort it any way you like. And always add records at the end; this way you are less likely to damage your file.
For simple projects, using an embedded database like JavaDB / Apache Derby may be a good idea. Configuration for the DB is absolutely minimal, and in your case you may need a maximum of just 2 tables (User and Workout). Exporting data to a file is also fairly simple for syncing between team members.
As yu_sha pointed out though, unless you expect to have a large dataset (for something running on a PC, more than about 50,000 records), you can just use the file and read everything into memory.
Read in every line via BufferedReader and parse with StringTokenizer. Looking at the data, I'd likely store an array of fields in a List that can be iterated and sorted according to your preference.
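A sketch of that reading/parsing step; the WorkoutEntry record, file name, and sort are illustrative (records need Java 16+):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.StringTokenizer;

record WorkoutEntry(long timestamp, int weight, int reps) {}

class WorkoutFileReader {
    static List<WorkoutEntry> readEntries(String fileName) throws IOException {
        List<WorkoutEntry> entries = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(Path.of(fileName))) {
            String line;
            while ((line = reader.readLine()) != null) {
                StringTokenizer st = new StringTokenizer(line, ":");   // time:weight:reps
                entries.add(new WorkoutEntry(
                        Long.parseLong(st.nextToken()),
                        Integer.parseInt(st.nextToken()),
                        Integer.parseInt(st.nextToken())));
            }
        }
        entries.sort(Comparator.comparingLong(WorkoutEntry::timestamp));   // sort by date if needed
        return entries;
    }
}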
If you must store the file in this format, you're likely best off just reading the entire thing into memory at startup and storing it in a TreeMap or some other sorted, searchable map. Then you can use TreeMap's convenience methods such as ceilingKey or the similar floorKey to find matches near certain dates/times.
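For instance, indexing by the epoch-millisecond key and reusing the hypothetical WorkoutEntry type from the previous sketch; since several sets can share the same millisecond (as in the sample data), each key maps to a small list:

import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

class WorkoutQueries {
    static NavigableMap<Long, List<WorkoutEntry>> indexByTime(List<WorkoutEntry> entries) {
        NavigableMap<Long, List<WorkoutEntry>> byTime = new TreeMap<>();
        for (WorkoutEntry e : entries) {
            byTime.computeIfAbsent(e.timestamp(), t -> new ArrayList<>()).add(e);
        }
        return byTime;
    }

    /** All entries recorded between two epoch-millisecond timestamps, both ends inclusive. */
    static SortedMap<Long, List<WorkoutEntry>> inDateRange(
            NavigableMap<Long, List<WorkoutEntry>> byTime, long fromMillis, long toMillis) {
        return byTime.subMap(fromMillis, true, toMillis, true);
    }
}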
Use flatworm, a Java library that allows you to parse and create flat files. Describe the format with a simple XML definition file, and there you go.