I have a continuous stream of integers across the space of all the 32-bit integers, and upon each update I want to know either the exact or approximate entropy of the distribution of integers I have encountered so far. It can be global entropy across the lifetime or a windowed approximation that attenuates older information as time passes.
Does anyone know of a library that does this already or an algorithm that has this property?
Clearly, this calls for a streaming algorithm, as it is too expensive to iterate over the whole range and recalculate the entropy on each update. Does anyone know of such an algorithm or a sketch data structure that can do this?
The motivation and use case is that I want to detect skew in the stream of integers. It is supposed to be uniform across the range of integers, but at certain times, due to other conditions, the uniformity may be disturbed, and I think entropy is the best way to detect this kind of condition. I'd ideally have the calculating component raise an alert on low entropy.
Thanks for any help!
EDIT: I actually found a paper that does exactly this but I know of no existing implementation. Reusing tested, verified code would be way better than having to implement it myself. :)
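For illustration, here is a rough sketch of the kind of windowed approximation I mean (this is not the paper's algorithm; the 2^12 coarse buckets over the 32-bit range and the decay factor are arbitrary choices):

// Approximate, decayed entropy over coarse buckets of the 32-bit range.
class DecayedEntropy {
    private static final int BUCKET_BITS = 12;
    private final double[] counts = new double[1 << BUCKET_BITS];
    private final double decay = 0.999;       // closer to 1.0 = longer memory
    private double total = 0.0;

    void update(int value) {
        // Decaying every bucket on every update is O(2^12); a real implementation
        // would decay lazily or in batches.
        for (int i = 0; i < counts.length; i++) counts[i] *= decay;
        total = total * decay + 1.0;
        counts[value >>> (32 - BUCKET_BITS)] += 1.0;
    }

    double entropyBits() {
        double h = 0.0;
        for (double c : counts) {
            if (c <= 0.0) continue;
            double p = c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;   // at most BUCKET_BITS bits for a perfectly uniform stream
    }
}

An alert would then fire whenever entropyBits() drops well below the roughly 12 bits expected for a uniform stream.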
I have a continuously growing set of files and have to ensure that there are no duplicates. By duplicate I mean identical at byte level.
The files are collected from various systems, some of which also provide hash codes for the files (but some don't). Some files may exist on multiple systems but should be imported only once.
I want to avoid unnecessary file transfers, and I thought I would just compare hash codes before actually copying. However, as I said, some of these systems don't provide a hash code, and some use MD5, which I read isn't secure anymore.
My questions:
Is comparing hash codes enough to determine identical files?
What should I do when systems use different hash codes?
What should I do when systems don't provide a hash code?
Firstly, the only way to conclusively prove two files are identical is to compare them bit for bit. So if you need absolute certainty, you cannot avoid transferring the files; unless you can make certain assumptions about the files, that's just a mathematical truth.
And then we have hash functions. What a hash function tries to do is calculate some value which is highly likely to be different when the files are different. How likely depends on the actual function: a really stupid hash function might have a chance of one in ten of producing the same hash for different files, while for a good hash function those chances are insanely small. For MD5 the chance that two random different files have the same hash is about one in 2^128. I'm guessing that's good enough for your system, so you can safely assume the files are the same when the hashes are the same.
Now for a secure hash, and MD5 being broken. Hash functions are not just used as a quick way to check if things are the same; they are also used in cryptographic systems to verify things. It's only in that sense that MD5 is broken: it is possible to deliberately construct two different files that share the same MD5 hash relatively quickly (a collision attack). If you fear someone might intentionally create a file with the same hash as another file to trick you into skipping it, you shouldn't rely on MD5. But that doesn't seem to be the case here. If no one is deliberately messing with the files, MD5 still works fine.
So to your first question, theoretically no, but realistically yes.
To the second question: you should calculate all the different hashes that might be used, for each file you have stored locally. E.g. calculate both the MD5 hash and the SHA-1 hash (or whatever hashes are being used on the remote systems). That way you always have the right type of hash to check against for every file you already have.
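For example, here is a sketch of computing both digests in a single pass over a local file (assuming MD5 and SHA-1 are the algorithms the remote systems use; both are supported by java.security.MessageDigest):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

// Compute both digests while reading the file once.
static String[] md5AndSha1(Path file) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
    try (InputStream in = Files.newInputStream(file)) {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            md5.update(buf, 0, n);
            sha1.update(buf, 0, n);
        }
    }
    return new String[] { toHex(md5.digest()), toHex(sha1.digest()) };
}

static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) sb.append(String.format("%02x", b));
    return sb.toString();
}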
For the files which don't have a hash you can't do anything to avoid transferring them at least once; until you do, you know nothing about them. Once you have transferred a file you can calculate the hashes yourself, so you can quickly check whether you already have it the next time it comes up.
I have a file (size = ~1.9 GB) which contains ~220,000,000 (~220 million) words / strings. They contain duplicates: roughly 1 duplicate word in every 100 words.
In my second program, I want to read the file. I can read the file line by line using BufferedReader.
Now to remove duplicates, we can use a Set (and its implementations), but a Set has problems, as described in the following 3 scenarios:
With the default JVM heap size, the Set can hold 0.7-0.8 million words before an OutOfMemoryError.
With a 512M heap, the Set can hold 5-6 million words before the OOM error.
With a 1024M heap, the Set can hold 12-13 million words before the OOM error. After about 10 million records, additions to the Set become extremely slow; for example, adding the next ~4,000 records took 60 seconds.
I have restrictions that I can't increase the JVM size further, and I want to remove duplicate words from the file.
Please let me know if you have any idea about any other ways/approaches to remove duplicate words using Java from such a gigantic file. Many Thanks :)
Added info: my words are basically alphanumeric IDs which are unique in our system, so they are not plain English words.
Use merge sort and remove the duplicates in a second pass. You could even remove the duplicates while merging (just keep the latest word added to output in RAM and compare the candidates to it as well).
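A sketch of what the merge step with deduplication could look like (assuming the sorted runs have already been written to temporary files, one word per line, and the readers wrap those files):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.util.List;
import java.util.PriorityQueue;

// k-way merge of pre-sorted run files, writing each distinct word exactly once.
static void mergeDistinct(List<BufferedReader> runs, BufferedWriter out) throws Exception {
    PriorityQueue<String[]> heap =                       // entries are [word, runIndex]
            new PriorityQueue<>((a, b) -> a[0].compareTo(b[0]));
    for (int i = 0; i < runs.size(); i++) {
        String w = runs.get(i).readLine();
        if (w != null) heap.add(new String[] { w, String.valueOf(i) });
    }
    String last = null;                                  // latest word written to output
    while (!heap.isEmpty()) {
        String[] top = heap.poll();
        if (!top[0].equals(last)) {                      // drop duplicates while merging
            out.write(top[0]);
            out.newLine();
            last = top[0];
        }
        int i = Integer.parseInt(top[1]);
        String next = runs.get(i).readLine();
        if (next != null) heap.add(new String[] { next, top[1] });
    }
}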
Divide the huge file into 26 smaller files based on the first letter of the word. If any of the letter files are still too large, divide that letter file by using the second letter.
Process each of the letter files separately using a Set to remove duplicates.
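A sketch of the partitioning pass (the part_* file names are placeholders; since the asker's IDs are alphanumeric rather than English words, there may be more than 26 partitions):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.HashMap;
import java.util.Map;

// Split the big file into one smaller file per leading character.
static void partitionByFirstChar(String inputFile) throws Exception {
    Map<Character, BufferedWriter> writers = new HashMap<>();
    try (BufferedReader in = new BufferedReader(new FileReader(inputFile))) {
        String word;
        while ((word = in.readLine()) != null) {
            if (word.isEmpty()) continue;
            char c = word.charAt(0);
            BufferedWriter w = writers.get(c);
            if (w == null) {
                w = new BufferedWriter(new FileWriter("part_" + (int) c + ".txt"));
                writers.put(c, w);
            }
            w.write(word);
            w.newLine();
        }
    }
    for (BufferedWriter w : writers.values()) w.close();
}

Each part_*.txt file can then be deduplicated independently with an in-memory Set and the results concatenated.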
You might be able to use a trie data structure to do the job in one pass. It has advantages that recommend it for this type of problem. Lookup and insert are quick. And its representation is relatively space efficient. You might be able to represent all of your words in RAM.
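A minimal trie sketch along those lines (the 36-way branching over [0-9a-z] is an assumption based on the alphanumeric IDs; insert() returns true only the first time a word is seen, so you write a word out only when it returns true):

// Minimal trie over lower-case alphanumeric IDs.
class Trie {
    private static final int RADIX = 36;                 // '0'-'9' plus 'a'-'z'
    private Trie[] children;                             // allocated lazily to save space
    private boolean endOfWord;

    boolean insert(String word) {                        // true if the word was not seen before
        Trie node = this;
        for (int i = 0; i < word.length(); i++) {
            char c = Character.toLowerCase(word.charAt(i));
            int idx = (c <= '9') ? c - '0' : 10 + (c - 'a');
            if (node.children == null) node.children = new Trie[RADIX];
            if (node.children[idx] == null) node.children[idx] = new Trie();
            node = node.children[idx];
        }
        boolean isNew = !node.endOfWord;
        node.endOfWord = true;
        return isNew;
    }
}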
If you sort the items, duplicates will be easy to detect and remove, as the duplicates will bunch together.
There is code here you could use to mergesort the large file:
http://www.codeodor.com/index.cfm/2007/5/10/Sorting-really-BIG-files/1194
For large files I try not to read the data into memory but instead operate on a memory mapped file and let the OS page in/out memory as needed. If your set structures contain offsets into this memory mapped file instead of the actual strings it would consume significantly less memory.
Check out this article:
http://javarevisited.blogspot.com/2012/01/memorymapped-file-and-io-in-java.html
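For instance, mapping the whole file read-only could look roughly like this (the file name is a placeholder; a single MappedByteBuffer is limited to 2 GB, which is just enough for the ~1.9 GB file here, otherwise you would map it in chunks):

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Map the file once and let the OS handle paging; keep only (offset, length) pairs
// in your own data structures instead of String objects, and re-read bytes on demand.
static void example() throws Exception {
    try (FileChannel ch = FileChannel.open(Paths.get("words.txt"), StandardOpenOption.READ)) {
        MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());

        int offset = 0, length = 5;              // hypothetical values recorded elsewhere
        byte[] word = new byte[length];
        buf.position(offset);
        buf.get(word);
    }
}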
Question: Are these really WORDS, or are they something else -- phrases, part numbers, etc?
For WORDS in a common spoken language one would expect that after the first couple of thousand you'd have found most of the unique words, so all you really need to do is read a word in, check it against a dictionary, if found skip it, if not found add it to the dictionary and write it out.
In this case your dictionary is only a few thousand words large. And you don't need to retain the source file since you write out the unique words as soon as you find them (or you can simply dump the dictionary when you're done).
If you have the possibility to insert the words into a temporary table of a database (using batch inserts), then it comes down to a SELECT DISTINCT on that table.
One classic way to solve this kind of problem is a Bloom filter. Basically you hash your word a number of times and for each hash result set some bits in a bit vector. If you're checking a word and all the bits from its hashes are set in the vector you've probably (you can set this probability arbitrarily low by increasing the number of hashes/bits in the vector) seen it before and it's a duplicate.
This was how early spell checkers worked. They knew if a word was in the dictionary, but they couldn't tell you what the correct spelling was, because the filter only tells you whether the current word has been seen.
There are a number of open source implementations out there, including java-bloomfilter.
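A hand-rolled sketch of the idea (the bit count and the two crude hash functions are arbitrary choices; a library such as java-bloomfilter picks better-tuned parameters):

import java.util.BitSet;

// Minimal Bloom filter: k = 2 hash functions over an m-bit BitSet.
class WordBloomFilter {
    private final BitSet bits;
    private final int m;

    WordBloomFilter(int bitCount) {
        this.m = bitCount;
        this.bits = new BitSet(bitCount);
    }

    // Returns true if the word was *probably* seen before, false if it is definitely new.
    boolean addAndCheck(String word) {
        int h1 = Math.floorMod(word.hashCode(), m);
        int h2 = Math.floorMod(word.hashCode() * 31 + word.length(), m);
        boolean seen = bits.get(h1) && bits.get(h2);
        bits.set(h1);
        bits.set(h2);
        return seen;
    }
}

Note that a false positive here means a genuinely unique ID gets dropped, so this only fits if a small, tunable error rate is acceptable.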
I'd tackle this in Java the same way as in every other language: Write a deduplication filter and pipe it as often as necessary.
This is what I mean (in pseudo code):
Input parameters: Offset, Size
Allocate searchable structure of size Size (=Set, but need not be one)
Read Offset elements from stdin (or until EOF is encountered) and just copy them to stdout
Read Size elements from stdin (or until EOF), store them in the Set. If duplicate, drop, else write to stdout.
Read elements from stdin until EOF, if they are in Set then drop, else write to stdout
Now pipe as many instances as you need (If storage is no problem, maybe only as many as you have cores) with increasing Offsets and sane Size. This lets you use more cores, as I suspect the process is CPU bound. You can even use netcat and spread processing over more machines, if you are in a hurry.
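Here is a rough Java version of that filter (a sketch; offset and size come from the command line, and the words are assumed to arrive one per line on stdin):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

// Usage: java DedupFilter <offset> <size>
// Copies the first <offset> lines through untouched, remembers and dedupes the next
// <size> lines, and afterwards drops every line that matches a remembered one.
public class DedupFilter {
    public static void main(String[] args) throws Exception {
        long offset = Long.parseLong(args[0]);
        int size = Integer.parseInt(args[1]);
        Set<String> window = new HashSet<>();
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        long n = 0;
        while ((line = in.readLine()) != null) {
            if (n < offset) {
                System.out.println(line);                              // before our window: just copy
            } else if (n < offset + size) {
                if (window.add(line)) System.out.println(line);        // inside window: remember and dedupe
            } else {
                if (!window.contains(line)) System.out.println(line);  // after window: drop known duplicates
            }
            n++;
        }
    }
}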
Even in English, which has a huge number of words for a natural language, the upper estimates are only about 80,000 words. Based on that, you could just use a HashSet and add all your words to it (probably in all lower case to avoid case issues):
Set<String> words = new HashSet<String>();
String word;
while ((word = reader.readLine()) != null) {   // reader: a BufferedReader over the input file
    words.add(word.toLowerCase());
}
If they are real words, this isn't going to cause memory problems and will be pretty fast too!
To not have to worry too much about implementation, you should use a database system, either plain old relational SQL or a No-SQL solution. I'm pretty sure you could use e.g. Berkeley DB Java Edition and then do (pseudo code):
for(word : stream) {
if(!DB.exists(word)) {
DB.put(word)
outstream.add(word)
}
}
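Roughly, with Berkeley DB Java Edition that pseudo code could look like this (untested sketch; the environment directory and database name are placeholders, and stream / outstream stand for whatever input and output you actually use):

import java.io.File;
import java.nio.charset.StandardCharsets;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.OperationStatus;

// putNoOverwrite doubles as the "exists?" check: it only succeeds the first time a key is stored.
EnvironmentConfig envConfig = new EnvironmentConfig();
envConfig.setAllowCreate(true);
Environment env = new Environment(new File("dedup-env"), envConfig);   // placeholder directory
DatabaseConfig dbConfig = new DatabaseConfig();
dbConfig.setAllowCreate(true);
Database db = env.openDatabase(null, "words", dbConfig);               // placeholder name

for (String word : stream) {
    DatabaseEntry key = new DatabaseEntry(word.getBytes(StandardCharsets.UTF_8));
    DatabaseEntry value = new DatabaseEntry(new byte[0]);
    if (db.putNoOverwrite(null, key, value) == OperationStatus.SUCCESS) {
        outstream.add(word);                                           // first occurrence only
    }
}
db.close();
env.close();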
The problem is in essence easy: you need to store things on disk because there is not enough memory, then use either sorting O(N log N) (unnecessary) or hashing O(N) to find the unique words.
If you want a solution that will very likely work but is not guaranteed to, use an LRU-type hash table. According to the empirical Zipf's law, you should be OK.
A follow up question to some smart guy out there, what if I have 64-bit machine and set heap size to say 12GB, shouldn't virtual memory take care of the problem (although not in an optimal way) or is java not designed this way?
Quicksort would be a good option over Mergesort in this case because it needs less memory. This thread has a good explanation as to why.
The most performant solutions arise from omitting unnecessary stuff. You are looking only for duplicates, so do not store the words themselves, store hashes. But wait, you are not interested in the hashes either, only in whether they have already been seen, so do not store them. Treat the hash as a really large number, and use a bitset to see whether you have already seen that number.
So your problem boils down to a really big, sparsely populated bitmap, with the size depending on the hash width. If your hash is up to 32 bits, you can use a riak bitmap.
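A sketch of the 32-bit variant (a bitmap over all 2^32 hash values takes 512 MB, so the heap has to accommodate that; distinct words that happen to share a hash value will be dropped, which makes the result approximate by nature):

// Bitmap over the full 32-bit hash space, held in a long[] (a single BitSet tops out just below 2^32 bits).
class HashBitmap {
    private final long[] words = new long[1 << 26];      // 2^26 longs * 64 bits = 2^32 bits = 512 MB

    // Marks the hash and returns true if it had not been seen before.
    boolean markIfNew(int hash) {
        long bit = hash & 0xFFFFFFFFL;                   // treat as unsigned 0 .. 2^32-1
        int idx = (int) (bit >>> 6);
        long mask = 1L << (bit & 63);
        boolean isNew = (words[idx] & mask) == 0;
        words[idx] |= mask;
        return isNew;
    }
}

The dedup loop is then just: read a word, and write it to the output only if markIfNew(word.hashCode()) returns true.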
... gone thinking about really big bitmap for 128+ bit hashes %) (I'll be back )
I was storing some files based on a checksum, but I found a flaw: 2 checksums can sometimes be identical.
I always try looking for an API instead of reinventing the wheel, but I can't find anything.
I know there's the JSR 268 and JackRabbit as a standard for content storage, but my app is light-years away from using such a thing.
So, are there approaches for single-instance file storage with Java, or should I just keep searching for new algorithms for my checksum?
EDIT:
When the checksum is not working: 2 files are exactly the same, just in different file system locations. However, when they are sent from the client it is impossible on the server side to know the path they were at before, so it is the same file twice, with the same checksum.
If you want to retrieve either one, how do you check that?
I wanted to know if there was a standard approach, API, or algorithm that could help me spot the difference.
No matter how strong a hashing algorithm is, there is always a chance of a collision. A hashing algorithm generates a finite number of hashes from an infinite number of inputs.
The only way to ensure that two files are not identical is to compare them bit by bit. Hashing them is easier and faster, but carries with it the risk of collision.
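For the final bit-by-bit check, here is a sketch using plain buffered streams (on Java 12+ Files.mismatch does the same in one call):

import java.io.BufferedInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Byte-for-byte comparison, typically run only after sizes and checksums already match.
static boolean sameContent(Path a, Path b) throws Exception {
    if (Files.size(a) != Files.size(b)) return false;
    try (InputStream ia = new BufferedInputStream(Files.newInputStream(a));
         InputStream ib = new BufferedInputStream(Files.newInputStream(b))) {
        int x;
        while ((x = ia.read()) != -1) {
            if (x != ib.read()) return false;
        }
        return true;
    }
}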
I am writing some image processing code in which I download images (as BufferedImage) from URLs and pass them on to an image processor.
I want to avoid passing the same image more than once to the image processor (as the image processing operation is of high cost). The URL end points of the images (if they are the same images) may vary, and hence I cannot prevent this by the URL alone. So I was planning to do a checksum or hash to identify if the code is encountering the same image again.
For MD5 I tried Fast MD5, and it generated a 20K+ character length hex checksum value for the image (some sample). Obviously storing this 20K+ character hash would be an issue when it comes to database storage. Hence I tried CRC32 (from java.util.zip.CRC32), and it did generate a much shorter checksum than the hash.
I do understand checksum and hash are for different purposes. For the purpose explained above, can I just use CRC32? Would it serve the purpose, or do I have to try something more than these two?
Thanks,
Abi
The difference between CRC and, say, MD5, is that it is more difficult to tamper with a file to match a "target" MD5 than to tamper with it to match a "target" checksum. Since this does not seem to be a problem for your program, it should not matter which method you use. Maybe MD5 might be a little more CPU intensive, but I do not know if that difference will matter.
The main question should be the number of bytes of the digest.
Storing a checksum in a 32-bit integer means that, for a file of 2K size, you are mapping 2^16384 possible file contents onto only 2^32 checksum values, so for every CRC value there is an enormous number of different files that produce it. With a 128-bit MD5 the value space is 2^128, vastly larger, so the chance that two files you actually encounter collide by accident is far smaller.
The bigger the code that you compute, the fewer possible collisions (given that the computed codes are distributed evenly), so the safer the comparison.
Anyway, in order to minimize possible errors, I think the first classification should be by file size: first compare file sizes, and only if they match compare checksums/hashes.
A checksum and a hash are basically the same. You should be able to calculate any kind of hash. A regular MD5 would normally suffice. If you like, you could store the size and the md5 hash (which is 16 bytes, I think).
If two files have different sizes, they are different files, and you will not even need to calculate a hash over the data. If it is unlikely that you have many duplicate files, and the files are of the larger kind (like JPG pictures taken with a camera), this optimization may spare you a lot of time.
If two or more files have the same size, you can calculate the hashes and compare them.
If two hashes are the same, you could compare the actual data to see if it is different after all. This is very, very unlikely, but theoretically possible. The larger your hash (MD5 is 16 bytes, while CRC32 is only 4), the less likely that two different files will have the same hash.
It will take only 10 minutes of programming to perform this extra check though, so I'd say: better safe than sorry. :)
To further optimize this, if exactly two files have the same size, you can just compare their data. You will need to read the files anyway to calculate their hashes, so why not compare them directly if they are the only two with that specific size.
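Putting that order together, a sketch (MD5 and the helper names are my own assumptions, not a fixed recipe):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Group candidate files by size first, hash only within a size group,
// and byte-compare the survivors if you want absolute certainty.
static List<List<Path>> findLikelyDuplicates(List<Path> files) throws Exception {
    Map<Long, List<Path>> bySize = new HashMap<>();
    for (Path f : files) bySize.computeIfAbsent(Files.size(f), k -> new ArrayList<>()).add(f);

    List<List<Path>> duplicates = new ArrayList<>();
    for (List<Path> sameSize : bySize.values()) {
        if (sameSize.size() < 2) continue;               // unique size: cannot be a duplicate
        Map<String, List<Path>> byHash = new HashMap<>();
        for (Path f : sameSize) byHash.computeIfAbsent(md5Hex(f), k -> new ArrayList<>()).add(f);
        for (List<Path> sameHash : byHash.values()) {
            if (sameHash.size() > 1) duplicates.add(sameHash);
        }
    }
    return duplicates;
}

static String md5Hex(Path f) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    try (InputStream in = Files.newInputStream(f)) {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) md.update(buf, 0, n);
    }
    StringBuilder sb = new StringBuilder();
    for (byte b : md.digest()) sb.append(String.format("%02x", b));
    return sb.toString();
}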