I have a continuously growing set of files and have to ensure that there are no duplicates. By duplicate I mean identical at byte level.
The files are collected from various systems, some of which also provide hash codes for the files (but some don't). Some files may exist on multiple systems but should be imported only once.
I want to avoid unnecessary file transfers, and I thought I would just compare hash codes before actually copying. However, as I said, some of these systems don't provide a hash code, and some use MD5, which I read isn't secure anymore.
My questions:
Is comparing hash codes enough to determine identical files?
What should I do when systems use different hash codes?
What should I do when systems don't provide a hash code?
Firstly, the only way to conclusively prove two files are identical is to compare them bit for bit. So if you need absolute certainty, you cannot avoid transferring the files; unless you can make certain assumptions about the files, that's just a mathematical truth.
And then we have hash functions. What a hash function tries to do is calculate a value that is highly likely to be different when the files are different. How likely depends on the actual function: a really stupid hash function might have a chance of one in ten of producing the same hash for different files; for a good hash function those chances are insanely small. For MD5, the chance that two given different files happen to share the same hash is about one in 2^128. I'm guessing that's good enough for your system, so you can safely assume the files are the same when the hash is the same.
Now for "a secure hash", and MD5 being broken. Hash functions are not just used as a quick way to check whether things are the same; they are also used in cryptographic systems to verify that things are the same. It's only in that sense that MD5 is broken: it is possible to deliberately construct two different files with the same MD5 hash (a collision) relatively quickly. If you fear someone might intentionally craft a file with the same hash as another file to trick you into skipping it, you shouldn't rely on MD5. But that doesn't seem to be the case here; if no one is deliberately messing with the files, MD5 still works fine.
So to your first question, theoretically no, but realistically yes.
To the second question: you should calculate all the different hashes that might be used, for each file you have stored locally - e.g. calculate both the MD5 hash and the SHA-1 hash (or whatever hashes are being used on the remote systems). That way you will always have the correct type of hash to check against for each file you already have.
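A minimal sketch of that idea, assuming the remote systems use MD5 and SHA-1; both digests are computed in a single pass over the file:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class MultiHash {

    /** Computes MD5 and SHA-1 of a file in a single pass. */
    public static String[] hashes(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md5.update(buf, 0, n);
                sha1.update(buf, 0, n);
            }
        }
        return new String[] { toHex(md5.digest()), toHex(sha1.digest()) };
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String[] h = hashes(Paths.get(args[0]));
        System.out.println("MD5:   " + h[0]);
        System.out.println("SHA-1: " + h[1]);
    }
}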
For the files which don't come with a hash you can't do anything to avoid transferring them; until you do, you know nothing about those files. Once you have transferred a file you can still calculate a hash yourself, so next time you can quickly check whether you already have it.
Related
Designing a system where a service endpoint (probably a simple servlet) will have to handle 3K requests per second (data will be HTTP POSTed).
These requests will then be stored in MySQL.
The key issue that I need guidance on is that there will be a high % of duplicate data posted to this endpoint.
I only need to store unique data to mysql, so what would you suggest I use to handle the duplication?
The posted data will look like:
<root>
<prop1></prop1>
<prop2></prop2>
<prop3></prop3>
<body>
maybe 10-30K of text in here
</body>
</root>
I will write a method that hashes prop1, prop2, prop3 to create a unique hash code (the body can differ; only prop1-prop3 determine uniqueness).
I was thinking of creating some sort of concurrent dictionary that will be shared across requests.
Duplicates are most likely to appear within a period of 24 hours, so I can purge data from this dictionary every x hours.
Any suggestions on the data structure for storing duplicates? And what about purging, and how many records should I hold, considering 3K requests per second? It will get large very fast.
Note: There are 10K different sources that will be posting, and duplication only occurs within a given source. That means I could have more than one dictionary, maybe one per group of sources, to spread things out. In other words, if source1 posts data and then source2 posts data, the chances of duplication are very, very low. But if source1 posts 100 times in a day, the chances of duplication are very high.
Note: please ignore for now the task of saving the posted data to MySQL, as that is another issue on its own; duplicate detection is the first hurdle I need help with.
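For reference, a minimal sketch of the kind of key described above, assuming the three properties are plain strings and SHA-256 is an acceptable hash; the class and method names are made up:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DedupKey {

    /** Builds a dedup key from prop1..prop3 only; the body is deliberately ignored. */
    public static String keyFor(String prop1, String prop2, String prop3) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            // A separator avoids ("ab","c") and ("a","bc") producing the same key.
            sha256.update(prop1.getBytes(StandardCharsets.UTF_8));
            sha256.update((byte) 0);
            sha256.update(prop2.getBytes(StandardCharsets.UTF_8));
            sha256.update((byte) 0);
            sha256.update(prop3.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : sha256.digest()) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available in the JDK
        }
    }
}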
Interesting question.
I would probably be looking at some kind of HashMap-of-HashMaps structure here, where the first-level HashMap uses the sources as keys and the second level contains the actual data (the minimum needed for detecting duplicates) and uses your hash-code function for hashing. For the actual implementation, Java's ConcurrentHashMap would probably be the choice.
This way you have also set up the structure to partition your incoming load depending on sources if you need to distribute the load over several machines.
With regard to purging, I think you have to measure the exact behaviour with production-like data. You need to learn how quickly the data grows when you successfully eliminate duplicates and how it becomes distributed across the HashMaps. With a good distribution and not-too-quick growth, I can imagine it is good enough to do a cleanup occasionally. Otherwise maybe an LRU policy would be good.
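A rough sketch of that layout, assuming string source ids and the hash key from the question; names are illustrative only:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class PerSourceDedup {

    // source id -> (dedup key -> insertion timestamp, useful later for purging)
    private final ConcurrentMap<String, ConcurrentMap<String, Long>> seen =
            new ConcurrentHashMap<String, ConcurrentMap<String, Long>>();

    /** Returns true if this (source, key) pair has not been seen before. */
    public boolean firstTime(String source, String key) {
        ConcurrentMap<String, Long> perSource = seen.get(source);
        if (perSource == null) {
            perSource = new ConcurrentHashMap<String, Long>();
            ConcurrentMap<String, Long> existing = seen.putIfAbsent(source, perSource);
            if (existing != null) {
                perSource = existing; // another thread created the per-source map first
            }
        }
        return perSource.putIfAbsent(key, System.currentTimeMillis()) == null;
    }
}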
Sounds like you need a hashing structure that can add and check the existence of a key in constant time. In that case, try to implement a Bloom filter. Be careful: this is a probabilistic structure, i.e. it may tell you that a key exists when it does not, but you can make the probability of such a false positive extremely low if you tweak the parameters carefully.
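As a sketch only: Guava ships a ready-made BloomFilter, so you wouldn't have to implement the filter yourself. The expected-insertions and false-positive figures below are made-up numbers to tune for your load:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomDedup {

    // Roughly a day of keys at 3K req/s; 0.1% false-positive rate as a starting point.
    // Depending on the Guava version, concurrent put() may need external synchronization.
    private final BloomFilter<CharSequence> seen = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8), 260_000_000, 0.001);

    /** May return true for a key that was never added (false positive), never the reverse. */
    public boolean probablySeen(String key) {
        return seen.mightContain(key);
    }

    public void markSeen(String key) {
        seen.put(key);
    }
}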
Edit: OK, so Bloom filters are not acceptable. To still maintain constant-time lookup (albeit not constant-time insertion), look into cuckoo hashing.
1) Set up your database like this:
ALTER TABLE Root ADD UNIQUE INDEX(Prop1, Prop2, Prop3);
INSERT INTO Root (Prop1, Prop2, Prop3, Body) VALUES (#prop1, #prop2, #prop3, #body)
ON DUPLICATE KEY UPDATE Body=#body
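From the application side that statement would be used through an ordinary JDBC PreparedStatement, roughly like this (a sketch with connection handling omitted, not the answer's own code):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class RootDao {

    private static final String UPSERT =
            "INSERT INTO Root (Prop1, Prop2, Prop3, Body) VALUES (?, ?, ?, ?) "
          + "ON DUPLICATE KEY UPDATE Body = VALUES(Body)";

    /** Inserts the row, or overwrites Body when (Prop1, Prop2, Prop3) already exists. */
    public void save(Connection con, String prop1, String prop2, String prop3, String body)
            throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(UPSERT)) {
            ps.setString(1, prop1);
            ps.setString(2, prop2);
            ps.setString(3, prop3);
            ps.setString(4, body);
            ps.executeUpdate();
        }
    }
}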
2) You don't need any algorithms or fancy hashing ADTs
shell> mysqlimport [options] db_name textfile1 [textfile2 ...]
http://dev.mysql.com/doc/refman/5.1/en/mysqlimport.html
Make use of the --replace or --ignore flags, as well as --compress.
3) All your Java will do is the following (a rough sketch comes after this list)...
a) generate CSV files: use the StringBuffer class, then every X seconds or so swap it with a fresh StringBuffer and pass the .toString() of the old one to a thread that flushes it to a file /temp/SOURCE/TIME_STAMP.csv
b) occasionally kick off a Runtime.getRuntime().exec of the mysqlimport command
c) delete the old CSV files if space is an issue, or archive them to network storage/backup device
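A rough sketch of steps a) and b), leaving out error handling and proper CSV escaping; the flush interval, file paths and database name are placeholders:

import java.io.FileWriter;
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CsvSpooler {

    private final Object lock = new Object();
    private StringBuffer buffer = new StringBuffer();

    public void append(String prop1, String prop2, String prop3, String body) {
        // Naive CSV line; real code must escape commas, quotes and newlines.
        synchronized (lock) {
            buffer.append(prop1).append(',').append(prop2).append(',')
                  .append(prop3).append(',').append(body).append('\n');
        }
    }

    public void start() {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(new Runnable() {
            public void run() {
                flush();
            }
        }, 10, 10, TimeUnit.SECONDS);
    }

    private void flush() {
        StringBuffer old;
        synchronized (lock) {
            old = buffer;
            buffer = new StringBuffer(); // a) swap in a fresh buffer
        }
        String path = "/temp/SOURCE/" + System.currentTimeMillis() + ".csv";
        try {
            FileWriter out = new FileWriter(path);
            out.write(old.toString());
            out.close();
            // b) hand the file to mysqlimport; --replace or --ignore handles duplicates
            Runtime.getRuntime().exec(new String[] {
                    "mysqlimport", "--ignore", "--compress", "db_name", path });
        } catch (IOException e) {
            e.printStackTrace(); // placeholder error handling
        }
    }
}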
Well, you're basically looking for some kind of extremely large HashMap and something like
if (map.put(key, val) == null) // first time this key is seen, so store/forward the data
There are lots of different HashMap implementations available, but you could look at NBHM (the NonBlockingHashMap). It has non-blocking puts and was designed with large, scalable problems in mind, so it could work just fine. The map also has iterators that do NOT throw a ConcurrentModificationException while you use them to traverse the map, which is basically a requirement for removing old data as I see it. Also, putIfAbsent is all you actually need - but no idea whether that's more efficient than a simple put; you'd have to ask Cliff or check the source.
The trick then is to avoid resizing of the map by making it large enough - otherwise the throughput will suffer while resizing (which could be a problem). And think about how to implement the removal of old data - probably an idle thread that traverses an iterator and removes old entries.
Use a java.util.concurrent.ConcurrentHashMap for building a map of your hashes, but make sure you have the correct initialCapacity and concurrencyLevel assigned to the map at creation time.
The api docs for ConcurrentHashMap have all the relevant information:
initialCapacity - the initial capacity. The implementation performs internal sizing to accommodate this many elements.
concurrencyLevel - the estimated number of concurrently updating threads. The implementation performs internal sizing to try to accommodate this many threads.
You should be able to use putIfAbsent for handling the 3K requests per second, as long as you have initialized the ConcurrentHashMap the right way - make sure this is tuned as part of your load testing.
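Roughly, creation and use would look like this; the capacity and concurrency numbers are guesses to be replaced by whatever your load testing shows:

import java.util.concurrent.ConcurrentHashMap;

public class HashIndex {

    // Sized generously so the map (ideally) never resizes under load; tune these numbers.
    private final ConcurrentHashMap<String, Boolean> seen =
            new ConcurrentHashMap<String, Boolean>(10_000_000, 0.75f, 64);

    /** Returns true exactly once per key, no matter how many threads call it. */
    public boolean firstTime(String key) {
        return seen.putIfAbsent(key, Boolean.TRUE) == null;
    }
}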
At some point, though, trying to handle all the requests in one server may prove to be too much, and you will have to load-balance across servers. At that point you may consider using memcached for storing the index of hashes, instead of the ConcurrentHashMap.
The interesting problems that you will still have to solve, though, are:
loading all of the hashes into memory at startup
determining when to knock off hashes from the in-memory map
If you use a strong hash function, such as MD5 or SHA-1, you will not need to store any data at all. The probability of two different inputs producing the same hash is virtually nil, so if you find the same hash result twice, the second occurrence is a duplicate.
Given that MD5 is 16 bytes and SHA-1 is 20 bytes, storing only the hash should decrease memory requirements, therefore keeping more elements in the CPU cache and dramatically improving speed.
Storing these keys requires little more than a small hash table, with trees to handle collisions.
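A sketch of that idea in its simplest form, keeping only hex-encoded digests in a set and no payload at all (a plain synchronized set here; swap in a concurrent one under heavy load):

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class DigestIndex {

    // 16-byte MD5 (or 20-byte SHA-1) digests, hex-encoded; the data itself is never stored.
    private final Set<String> digests = Collections.synchronizedSet(new HashSet<String>());

    /** Returns true if this digest was already recorded, i.e. the data is a duplicate. */
    public boolean isDuplicate(String hexDigest) {
        return !digests.add(hexDigest);
    }
}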
I was storing some files based on a checksum, but I found a flaw: two checksums can sometimes be identical.
I always try looking for an API instead of reinventing the wheel, but I can't find anything.
I know there's JSR 268 and Jackrabbit as a standard for content storage, but my app is light-years away from using such a thing.
So, are there approaches for single-instance file storage with Java, or should I just keep searching for new algorithms for my checksum?
EDIT:
When the checksum is not enough: two files are exactly the same, just in different file-system locations. However, when they are sent from the client it is impossible on the server side to know the path they came from, so it is the same file twice, with the same checksum.
If you want to retrieve either one, how do you check that?
I wanted to know if there is a standard approach, API, or algorithm that could help me spot the difference.
No matter how strong a hashing algorithm is, there is always a chance of a collision. A hashing algorithm generates a finite number of hashes from an infinite number of inputs.
The only way to ensure that two files are not identical is to compare them bit by bit. Hashing them is easier and faster, but carries with it the risk of collision.
I am writing some image-processing code in which I download images (as BufferedImage) from URLs and pass them on to an image processor.
I want to avoid passing the same image to the image processor more than once (as the image-processing operation is very costly). The URL endpoints of the images may vary even when the images are the same, so I cannot prevent this by looking at the URL alone. I was therefore planning to compute a checksum or hash to identify whether the code is encountering the same image again.
For MD5 I tried Fast MD5, and it generated a 20K+ character hex checksum value for the image (for some sample). Obviously, storing this 20K+ character hash would be an issue when it comes to database storage. So I tried CRC32 (from java.util.zip.CRC32), and it generated a much shorter checksum than that hash.
I do understand that checksums and hashes serve different purposes. For the purpose explained above, can I just use CRC32? Would it do the job, or do I have to try something beyond these two?
The difference between a CRC and, say, MD5, is that it is more difficult to tamper with a file so that it matches a "target" MD5 than to tamper with it so that it matches a "target" checksum. Since this does not seem to be a problem for your program, it should not matter which method you use. Maybe MD5 is a little more CPU-intensive, but I do not know whether that difference will matter.
The main question should be the number of bytes of the digest.
If you compute the checksum into a 32-bit integer, then for a 2048-bit file you are mapping 2^2048 possible contents onto 2^32 values --> for every CRC value, there are about 2^2016 possible files that match it. With a 128-bit MD5 there are still about 2^1920 possible files per hash value, but the digest space is 2^96 times larger, so accidental collisions are far less likely.
The bigger the code you compute, the fewer possible collisions (given that the computed codes are distributed evenly), and so the safer the comparison.
Anyway, in order to minimize possible errors, I think the first classification should be by file size: first compare file sizes, and only if they match compare checksums/hashes.
A checksum and a hash are basically the same thing. You should be able to calculate any kind of hash; a regular MD5 would normally suffice. If you like, you could store the size and the MD5 hash (which is 16 bytes).
If two files have different sizes, they are different files; you will not even need to calculate a hash over the data. If it is unlikely that you have many duplicate files, and the files are of the larger kind (like JPG pictures taken with a camera), this optimization may spare you a lot of time.
If two or more files have the same size, you can calculate the hashes and compare them.
If two hashes are the same, you could compare the actual data to see whether it is different after all. This is very, very unlikely, but theoretically possible. The larger your hash (MD5 is 16 bytes, while CRC32 is only 4), the less likely that two different files will have the same hash.
It will take only 10 minutes of programming to perform this extra check though, so I'd say: better safe than sorry. :)
To further optimize this, if exactly two files have the same size, you can just compare their data. You will need to read the files anyway to calculate their hashes, so why not compare them directly if they are the only two with that specific size.
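A sketch of the checks in that order (size, then hash, then bytes), assuming plain java.io access and MD5; the final byte comparison covers the theoretical collision case:

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class FileCompare {

    /** Cheap checks first: size, then MD5, then a byte-by-byte comparison. */
    public static boolean sameContent(File a, File b) throws IOException, NoSuchAlgorithmException {
        if (a.length() != b.length()) {
            return false;                       // different sizes: cannot be identical
        }
        if (!Arrays.equals(md5(a), md5(b))) {
            return false;                       // different hashes: definitely different
        }
        return sameBytes(a, b);                 // same hash: confirm, collisions are possible
    }

    private static byte[] md5(File f) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[8192];
        try (InputStream in = new FileInputStream(f)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return md.digest();
    }

    private static boolean sameBytes(File a, File b) throws IOException {
        try (InputStream ia = new BufferedInputStream(new FileInputStream(a));
             InputStream ib = new BufferedInputStream(new FileInputStream(b))) {
            int x;
            while ((x = ia.read()) != -1) {
                if (x != ib.read()) {
                    return false;
                }
            }
            return true;                        // sizes already matched, so both are at EOF
        }
    }
}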
I'm a beginner java programmer. I'm working on an application that decrypts some data.
The decryption key is hardcoded into the software and thus can be seen by analyzing the bytecode.
I know that reverse engineering cannot be prevented entirely so what I'm trying to do is to make the process as hard as possible.
My idea is not to directly put the key into my code but have it go through some kind of transformation.
For example, I could write -
private static final byte[] HC256A = Hex
.decode("8589075b0df3f6d82fc0c5425179b6a6"
+ "3465f053f2891f808b24744e18480b72"
+ "ec2792cdbf4dcfeb7769bf8dfa14aee4"
+ "7b4c50e8eaf3a9c8f506016c81697e32");
This way someone looking at the bytecode can't read the key straight away, but has to follow the logic and apply the transformations to it, which isn't that easy at the bytecode level.
So what do you guys think? Is this useful? What could be the best transformation other than hex decoding?
Are there any other methods available to protect hardcoded decryption keys?
Thanks for all your suggestions.
The right way to attack such obfuscation (especially in bytecode languages) is to attach a debugger at the place where the key is passed (if debugging is not possible, start analyzing the code from that place). This way the attacker doesn't need to look for the key at all, and he doesn't care how obfuscated the key is. So you need to re-think your design.
If you only want to protect against amateur lurkers, then splitting the key and XORing its parts (possibly with different keys) would be enough. One more trick: derive the key from text constants already present in the code (such as the application name). That makes the key less obvious than splitting or XORing.
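A minimal sketch of the split-and-XOR idea; the two byte arrays below are made-up values whose XOR only yields the real key at runtime:

public class KeyParts {

    // Neither array is the key on its own; XORing them reconstructs it at runtime.
    private static final byte[] PART_A = {
            (byte) 0x13, (byte) 0x7f, (byte) 0xa2, (byte) 0x0c };
    private static final byte[] PART_B = {
            (byte) 0x96, (byte) 0x1b, (byte) 0xee, (byte) 0x45 };

    static byte[] key() {
        byte[] key = new byte[PART_A.length];
        for (int i = 0; i < key.length; i++) {
            key[i] = (byte) (PART_A[i] ^ PART_B[i]);
        }
        return key;
    }
}

As the previous answer points out, this only slows down a casual reader; a debugger attached where the key is used still sees the reconstructed value.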
Don't code the key into the source code at all. Keep it separate, ship it separately, e.g. in a Java keystore, and only to customers/sites/clients you trust, and put some legalese in the licence that places the onus on them if they leak the keystore.
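A sketch of loading such a key at runtime, assuming a JCEKS keystore holding a secret-key entry; the path, alias and passwords are placeholders:

import java.io.FileInputStream;
import java.security.Key;
import java.security.KeyStore;

public class KeyLoader {

    /** Loads the decryption key from a keystore shipped separately from the code. */
    public static byte[] loadKey(String path, char[] storePass, char[] keyPass) throws Exception {
        KeyStore ks = KeyStore.getInstance("JCEKS"); // JCEKS can hold secret-key entries
        try (FileInputStream in = new FileInputStream(path)) {
            ks.load(in, storePass);
        }
        Key key = ks.getKey("decryption-key", keyPass); // alias is a placeholder
        return key.getEncoded();
    }
}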
Faced with a similar problem (in C), I went with single-use XOR pads. This is good because it looks like garbage... if you get really clever, you can snoop for that (incorrect) key in use. I would avoid anything that injects human-readable strings, as those will invariably draw attention to that bit of code.
I need to compare two different File instances in Java and want to do this with a fast hash function.
Idea:
- Hash the first 20 lines of File 1
- Hash the first 20 lines of File 2
- Compare the two hashes and return true if they are equal.
I want to use the "fastest" hash function ever implemented in Java. Which one would you choose?
If you want speed, do not hash! Especially not a cryptographic hash like MD5. These hashes are designed to be hard to reverse, not fast to calculate. What you should use is a checksum - see java.util.zip.Checksum and its two concrete implementations, Adler32 and CRC32. Adler32 is extremely fast to compute.
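For example, feeding the first 20 lines of a file to an Adler32 checksum could look like this (a sketch that ignores character-encoding subtleties):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;

public class HeadChecksum {

    /** Adler-32 over the first 20 lines of a file. */
    public static long first20Lines(String path) throws IOException {
        Adler32 adler = new Adler32();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            int count = 0;
            while (count < 20 && (line = reader.readLine()) != null) {
                adler.update(line.getBytes(StandardCharsets.UTF_8));
                count++;
            }
        }
        return adler.getValue();
    }
}

Two files whose first 20 lines differ will almost always produce different values; equal values only suggest equality, as the next answer explains.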
Any method based on checksums or hashes is vulnerable to collisions, but you can minimise the risk by using two different methods in the way RSYNC does.
The algorithm is basically:
Check file sizes are equal
Break the files into chunks of size N bytes
Compute checksum on each pair of matching blocks and compare. Any differences prove files are not the same.
This allows for early detection of a difference. You can improve it by computing two checksums at once with different algorithms, or different block sizes.
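A sketch of that block-by-block idea using CRC-32 and a single fixed block size (the block size here is arbitrary):

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

public class ChunkedCompare {

    private static final int BLOCK_SIZE = 64 * 1024; // arbitrary block size

    /** Returns false as soon as any pair of blocks has differing checksums. */
    public static boolean probablyEqual(File a, File b) throws IOException {
        if (a.length() != b.length()) {
            return false;
        }
        try (InputStream ia = new BufferedInputStream(new FileInputStream(a));
             InputStream ib = new BufferedInputStream(new FileInputStream(b))) {
            byte[] bufA = new byte[BLOCK_SIZE];
            byte[] bufB = new byte[BLOCK_SIZE];
            int nA;
            while ((nA = readFully(ia, bufA)) > 0) {
                int nB = readFully(ib, bufB);
                if (nA != nB || crc(bufA, nA) != crc(bufB, nB)) {
                    return false;               // early exit on the first differing block
                }
            }
            return true;
        }
    }

    private static long crc(byte[] buf, int len) {
        CRC32 crc = new CRC32();
        crc.update(buf, 0, len);
        return crc.getValue();
    }

    // Reads up to buf.length bytes, looping until the buffer is full or EOF.
    private static int readFully(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n == -1) break;
            total += n;
        }
        return total;
    }
}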
More bits in the result mean less chance of a collision, but as soon as you go over 64 bits you are outside what Java (and the computer's CPU) can handle natively and hence things get slow, so FNV-1024 is less likely to miss a difference but is much slower.
If it is all about speed, just use Adler32 and accept that very rarely a difference will not be detected. It really is rare. Checksums like these are what lets the internet spot transmission errors, and how often do you get the wrong data turning up?
If it is all about accuracy, you will have to compare every byte. Nothing else will work.
If you can compromise between speed and accuracy, there is a wealth of options out there.
If you're comparing two files at the same time on the same system, there's no need to hash both of them: just compare the bytes of the two files as you read them. If you're comparing them at different times, or they're in different places, then MD5 would be both fast and adequate. There's not much reason to need a faster hash unless you're dealing with really large files; even my laptop can hash hundreds of megabytes per second.
You also need to hash the whole file if you want to verify the files are identical. Otherwise, if you want a really quick check, you might as well just check the size and last-modified time. You could also check the beginning and end of the file if the files are really large and you trust that the middle won't change. If you're not dealing with hundreds of megabytes, though, you may as well check every byte of each file.