Related
I need a function in JAVA, something like this:
Input: .wav file (or byte[] fileBytes)
Output: true/false (the file consists of silence only)
What is the best way to do it?
Thank you.
UPDATE:
The command that I use for recording:
arecord --format=S16_LE --max-file-time=60 --rate=16000 --file-type=wav randomvalue_i.wav
Silent = no audio at all
Well, the short answer is you'll want to scan the .WAV data and do a min/max value on it. A "silent" file the values should essentially all be 0.
The longer answer is that you'll want to understand the .WAV format, which you can find described here (http://soundfile.sapp.org/doc/WaveFormat/). You can probably skip over the first 44 bytes (RIFF, 'fmt') to get down to the data, then start looking at the bytes. The 'bits-per-sample' value from the header might be important, as 16-bit samples would mean you'd need to consolidate 2 'bytes' together to get a single sample. But, even so, both bytes would be 0 for a silent, 16-bit sample file. Ditto for NumChannels - in theory you should understand it, but again, both should be 0 for true 'silent'. If all the data is '0', it's silent.
"Silent" is a bit ambiguous. Above, I was strict and assumed it meant true '0' only. However, in a silent room, there would still be very low levels of background ambient noise. In that case, you'd need to be a bit more forgiving about the comparison. e.g. calculate a min/max for each sample, and insure that the range is within some tolerance. It can still be determined, but it just adds code.
For completeness:
public boolean isSilent(byte[] info) {
for (int idx = 44; idx < info.length; ++idx) {
if (info[idx] != 0)
return false;
}
return true;
}
You could have a .wav file that is what you consider "silence" and compare it to the other .wav file to see if they have the same frequency.
I have the following data:
number1
I am writing line1 .
number2
First line .
number3
I am writing line2.
number4
Second line .
number5
I am writing line3 .
number6
Third line.
number7
I am writing line2 .
number8
Fourth line .
number9
I am writing line5 .
number10
Fifth line .
Now I want to remove the duplicate lines from this text file -- along with this I want to remove 1 preceding and 2 succeeding lines of the duplicate line. Such that after removal my data looks like:
number1
I am writing line1 .
number2
First line .
number3
I am writing line2.
number4
Second line .
number5
I am writing line3 .
number6
Third line.
number9
I am writing line5 .
number10
Fifth line .
The size of my file is 60 GB and I am using a server with 64 GB RAM. I am using the following code for removing the duplicates:
fOutput = open('myfile','w')
table_size = 2**16
seen = [False]*table_size
infile = open('test.ttl', 'r')
while True:
inFileLine1=infile.readline()
if not inFileLine1:
break #EOF
inFileLine2=infile.readline()
inFileLine3=infile.readline()
inFileLine4=infile.readline()
h = hash(inFileLine2) % table_size
if seen[h]:
dup = False
with open('test.ttl','r') as f:
for line1 in f:
if inFileLine2 == line1:
dup = True
break
if not dup:
fOutput.write(inFileLine1)
fOutput.write(inFileLine2)
fOutput.write(inFileLine3)
fOutput.write(inFileLine4)
else:
seen[h] = True
fOutput.write(inFileLine1)
fOutput.write(inFileLine2)
fOutput.write(inFileLine3)
fOutput.write(inFileLine4)
fOutput.close()
However, it turns out this code is very slow. Is there some way by which I may improve the efficiency of the code using parallelization i.e. using all 24 cores available to me on my system or using any other technique.
Although the above code is written in python -- but I am fine with efficient solutions in c++ or python or Java or using linux commands
Here test.ttl is my input file with size 60GB
It seems that your code is reading every line exactly once, and writing every line (that need to be written) also exactly once. Thus there is no way to optimize the algorithm on the file reading - writing part.
I strongly suspect that your code is slow because of the very bad use of Hash table. Your hash table only has size 2^16, while your file may contain about 2^28 lines, assuming an average of 240 bytes per line.
Since you have such a big RAM (enough to contain all the file), I suggest you change the hash table to a size of 2^30. This should help considerably.
Edit:
In this case, you could try to use some very simple Hash function. For example:
long long weight[] = {generate some random numbers};
long long Hash(char * s, int length)
{
long long result = 0;
int i = 0, j = 0;
while (i < length)
{
result += s[i] * weight[j ++];
i += j;
}
return result & ((1 << 30) - 1); // assume that your hash table has size 2^30
}
If duplicate lines are quite common, then I think the right way to solve the problem is similar to the one you have, but you must use a hash table that can grow on demand and will automatically handle collisions. Try using the Python set data type to store lines that were already reached. With set you will not need to confirm that duplicate lines really are duplicates; if they're in the set already, they are definitely duplicates. This will work, and be quite efficient. However, Python's memory management may not be very efficient, and the set data type might grow beyond the available memory, in which case a rethink will be required. Try it.
Edit: ok, so set grew too large.
For a good solution, you want to avoid repeatedly re-reading the input file. In your original solution, the input file is read again for each possible duplicate, so if there are N lines, the total number of lines read may be up to N^2. Optimization (profiling) and parallelism won't make this better. And, due to the massive file size, you also have a memory constraint which rules out simple tricks like storing all of the lines seen so far in a hash table (like set).
Here is my second suggestion. In this suggestion, memory requirements will scale to fit whatever you have available. You will need enough disk space for at least one copy of your input file. The steps form a pipeline - the output from one step is the input of the next.
Step 1. I think you are interested in working on groups of 4 lines. You want to keep the whole group of 4, or none of them. Your first step should be to combine each group of 4 lines into a single line. For example:
number1
I am writing line1 .
number2
First line .
number3
I am writing line2.
number4
Second line .
becomes
number1#I am writing line1 .#number2#First line .
number3#I am writing line2 .#number4#Second line .
Note that I used '#' to mark where the line breaks were. This is important. You can use any character here, provided it is not used in any other place in your input file.
Step 2. Prepend the line number to each line.
1#number1#I am writing line1 .#number2#First line .
2#number3#I am writing line2 .#number4#Second line .
Step 3. Use the Unix sort utility (or a Windows port of it). It's already highly optimized. There are even options to do the sort in parallel for extra speed. Sort with the following options:
sort '-t#' -k3
These sort options cause the program to consider only the 3rd field - which is the 2nd line in each group.
Step 4. Now step through the output of the previous stage, looking for duplicates, making use of the fact that they will be next to each other. Look at the 3rd field. If you find a duplicate line, discard it.
Step 5. Reconstruct the order of the original file using another sort:
sort '-t#' -k1 -n
This time, the sort uses the numerical value of the line number (the first field).
Step 6. Remove the line number from the start of each line.
Step 7. Turn each '#' character back into a newline character. Job done.
Though this seems like a lot of steps, all but steps 3 and 5 only involve a single pass through the input file, so they'll be very fast. N steps for N lines. The sorting steps (3 and 5) are also fast because the sort program has been heavily optimized and uses a good sorting algorithm (at most N log N steps for N lines).
fOutput = open('myfile','w')
infile = open('test.ttl', 'r')
all_line2 = {}
while True:
inFileLine1 = infile.readline()
if not inFileLine1:
break #EOF
inFileLine2 = infile.readline()
_ = infile.readline()
_ = infile.readline()
all_line2[inFileLine2] = False
infile.seek(0)
while True:
inFileLine1=infile.readline()
if not inFileLine1:
break #EOF
inFileLine2=infile.readline()
inFileLine3=infile.readline()
inFileLine4=infile.readline()
if not all_line2.get(inFileLine2):
fOutput.write(inFileLine1)
fOutput.write(inFileLine2)
fOutput.write(inFileLine3)
fOutput.write(inFileLine4)
all_line2[inFileLine2] = True
Look at java.util.concurrent.ConcurrentHashMap in Java. It is designed to perform well when used by multiple threads that access the map concurrently.
Also, read the file using Java NIO through an Executor fixed thread pool.
To start with you can use this code
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class Main {
private static final ConcurrentHashMap map = new ConcurrentHashMap();
public static class Task implements Runnable {
private final String line;
public Task(String line) {
this.line = line;
}
#Override
public void run() {
// if (!map.containsKey(line)) // not needed
map.put(line, true);
}
}
public static void main(String[] args) throws IOException {
ExecutorService service = Executors.newFixedThreadPool(10);
String dir_path, file_name;
Files.lines(Paths.get(dir_path, file_name)).forEach(l -> service.execute(new Task(l)));
service.shutdown();
map.keySet().forEach(System.out::println);
}
}
I would prefer to use Java for this. And given that the size of the file is 60 GB, Java provides a well suited API for this named MappedByteBuffer.
You load the file using a file channel and map the channel using the above API as follows:
FileChannel fileChannel = new RandomAccessFile(new File(inputFile), "r").getChannel();
mappedBuffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size());
This loads the entire file into memory. For the best efficient performance, map it into chunks (loads 50k bytes)
mappedBuffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, 50000);
Now you can iterate over the mappedBuffer and do your processing. Let know for any clarity.
I would like to read the file in a sequence manner. Let's consider some factors affecting performance and possible solutions:
Language: vote for C/C++.
IO: we can use Memory Mapping that is available on Windows and Linux, on Linux it is the mmap() function; basically, this will map the file content to a pointer e.g. char* data. Tell me if you are using Windows and need the code.
Searching for a key: I suggest to use Binary Search Tree, each time we take a new couple of lines => value, key; we need to traverse the tree to find the key. If found, then skip this and next couple. If not found, then insert this couple into the tree as a new node, at the ending position of the searching; and also write this couple to the output file. Of course, the searching takes O(logN).
Data structure of a node:
struct Node {
char* key;
unsigned short keyLen;
char* value;
unsigned short valueLen;
Node* leftNode;
Node* rightNode;
}
You can change unsigned short to unsigned char if relevant. The pointers key and value actually point to certain positions of the memory block hold by data, thus no new memory is allocated to store key and value.
The searching can be further improved by using Bloom Filter. If the filter answers NO (very quickly) then definitively the key is not existed in our tree, no need to traverse the tree anymore. If the answer is YES, then traverse the tree normally. Bloom Filter is implemented in Redis and HBase, please take a look at these open source database systems, if needed.
UPDATE: I solved the problem with a great external library - https://code.google.com/p/xdeltaencoder/. The way I did it is posted below as the accepted answer
Imagine I have two separate pcs who both have an identical byte[] A.
One of the pcs creates byte[] B, which is almost identical to byte[] A but is a 'newer' version.
For the second pc to update his copy of byte[] A into the latest version (byte[] B), I need to transmit the whole byte[] B to the second pc. If byte[] B is many GB's in size, this will take too long.
Is it possible to create a byte[] C that is the 'difference between' byte[] A and byte[] B? The requirements for byte[] C is that knowing byte[] A, it is possible to create byte[] B.
That way, I will only need to transmit byte[] C to the second PC, which in theory would be only a fraction of the size of byte[] B.
I am looking for a solution to this problem in Java.
Thankyou very much for any help you can provide :)
EDIT: The nature of the updates to the data in most circumstances is extra bytes being inserted into parts of the array. Ofcourse it is possible that some bytes will be changed or some bytes deleted. the byte[] itself represents a tree of the names of all the files/folders on a target pc. the byte[] is originally created by creating a tree of custom objects, marshalling them with JSON, and then compressing that data with a zip algorithm. I am struggling to create an algorithm that can intelligently create object c.
EDIT 2: Thankyou so much for all the help everyone here has given, and I am sorry for not being active for such a long time. I'm most probably going to try to get an external library to do the delta-encoding for me. A great part about this thread is that I now know what I want to achieve is called! I believe that when I find an appropriate solution I will post it and accept it so others can see as to how I solved my problem. Once again, thankyou very much for all your help.
Using a collection of "change events" rather than sending the whole array
A solution to this would be to send a serialized object describing the change rather than the actual array all over again.
public class ChangePair implements Serializable{
//glorified struct
public final int index;
public final byte newValue;
public ChangePair(int index, byte newValue) {
this.index = index;
this.newValue = newValue;
}
public static void main(String[] args){
Collection<ChangePair> changes=new HashSet<ChangePair>();
changes.add(new ChangePair(12,(byte)2));
changes.add(new ChangePair(1206,(byte)3));
}
}
Generating the "change events"
The most efficient method for achieving this would be to track changes as you go, but assuming thats not possible you can just brute force your way through, finding which values are different
public static Collection<ChangePair> generateChangeCollection(byte[] oldValues, byte[] newValues){
//validation
if (oldValues.length!=newValues.length){
throw new RuntimeException("new and old arrays are differing lengths");
}
Collection<ChangePair> changes=new HashSet<ChangePair>();
for(int i=0;i<oldValues.length;i++){
if (oldValues[i]!=newValues[i]){
//generate a change event
changes.add(new ChangePair(i,newValues[i]));
}
}
return changes;
}
Sending and recieving those change events
As per this answer regarding sending serialized objects over the internet you could then send your object using the following code
Collection<ChangePair> changes=generateChangeCollection(oldValues,newValues);
Socket s = new Socket("yourhostname", 1234);
ObjectOutputStream out = new ObjectOutputStream(s.getOutputStream());
out.writeObject(objectToSend);
out.flush();
On the other end you would recieve the object
ServerSocket server = new ServerSocket(1234);
Socket s = server.accept();
ObjectInputStream in = new ObjectInputStream(s.getInputStream());
Collection<ChangePair> objectReceived = (Collection<ChangePair>) in.readObject();
//use Collection<ChangePair> to apply changes
Using those change events
This collection can then simply be used to modify the array of bytes on the other end
public static void useChangeCollection(byte[] oldValues, Collection<ChangePair> changeEvents){
for(ChangePair changePair:changeEvents){
oldValues[changePair.index]=changePair.newValue;
}
}
Locally log the changes to the byte array, like a little version control system. In fact you could use a VCS to create patch files, send them to the other side and apply them to get an up-to-date file;
If you cannot log changes, you would need to double the array locally, or (not so 100% safe) use an array of checksums on blocks.
The main problem here is data compression.
Kamikaze offers you good compression algorithms for data arrays. It uses Simple16 and PForDelta coding. Simple16 is a good and (as the name says) simple list compression option. Or you can use Run Lenght Encoding. Or you can experiment with any compression algorithm you have available in Java...
Anyway, any method you use will be optimized if you first preprocess the data.
You can reduce the data calculating differences or, as #RichardTingle pointed, creating pairs of different data locations.
You can calculate C as B - A. A will have to be an int array, since the difference between two byte values can be higher than 255. You can then restore B as A + C.
The advantage of combining at least two methods here is that you get much better results.
E.g. if you use the difference method with A = { 1, 2, 3, 4, 5, 6, 7 } and B = { 1, 2, 3, 5, 6, 7, 7 }. The difference array C will be { 0, 0, 0, 1, 1, 1, 0 }. RLE can compress C in a very effective way, since it is good for compressing data when you have many repeated numbers in sequence.
Using the difference method with Simple16 will be good if your data changes in almost every position, but the difference between values is small. It can compress an array of 28 single-bit values (0 or 1) or an array of 14 two-bit values to a single 32-byte integer.
Experiment, it all will depend on how your data behaves. And compare the data compression ratios for each experiment.
EDIT: You will have to preprocess the data before JSON and zip compressing.
Create two sets old and now. The latter contains all files that exists now. For the former, the old files, you have at least two options:
Should contain all files that existed before you sent them to the other PC. You will need to keep a set of what the other PC knows to calculate what has changed since the last synchronization, and send only the new data.
Contains all files since you last checked for changes. You can keep a local history of changes and give each version an "id". Then, when you sync, you send the "version id" together with the changed data to the other PC. Next time, the other PC first sends its "version id" (or you keed the "version id" of each PC locally), then you can send the other PC all the new changes (all the versions that come after the one that PC had).
The changes can be represented by two other sets: newFiles, and deleted files. (What about files that changed in content? Don't you need to sync these too?) The newFiles contains the ones that only exist in set now (and do not exist in old). The deleted set contains the files that only exist in set old (and do not exist in now).
If you represent each file as an String with the full pathname, you safely will have unique representations of each file. Or you can use java.io.File.
After you reduced your changes to newFiles and deleted files set, you can convert them to JSON, zip and do anything else to serialize and compress the data.
So, what I ended up doing was using this:
https://code.google.com/p/xdeltaencoder/
From my test it works really really well. However, you will need to make sure to checksum the source (in my case fileAJson), as it does not do it automatically for you!
Anyways, code below:
//Create delta
String[] deltaArgs = new String[]{fileAJson.getAbsolutePath(), fileBJson.getAbsolutePath(), fileDelta.getAbsolutePath()};
XDeltaEncoder.main(deltaArgs);
//Apply delta
deltaArgs = new String[]{"-d", fileAJson.getAbsolutePath(), fileDelta.getAbsolutePath(), fileBTarget.getAbsolutePath()};
XDeltaEncoder.main(deltaArgs);
//Trivia, Surpisingly this also works
deltaArgs = new String[]{"-d", fileBJson.getAbsolutePath(), fileDelta.getAbsolutePath(), fileBTarget.getAbsolutePath()};
XDeltaEncoder.main(deltaArgs);
I have a general question on your opinion about my "technique".
There are 2 textfiles (file_1 and file_2) that need to be compared to each other. Both are very huge (3-4 gigabytes, from 30,000,000 to 45,000,000 lines each).
My idea is to read several lines (as many as possible) of file_1 to the memory, then compare those to all lines of file_2. If there's a match, the lines from both files that match shall be written to a new file. Then go on with the next 1000 lines of file_1 and also compare those to all lines of file_2 until I went through file_1 completely.
But this sounds actually really, really time consuming and complicated to me.
Can you think of any other method to compare those two files?
How long do you think the comparison could take?
For my program, time does not matter that much. I have no experience in working with such huge files, therefore I have no idea how long this might take. It shouldn't take more than a day though. ;-) But I am afraid my technique could take forever...
Antoher question that just came to my mind: how many lines would you read into the memory? As many as possible? Is there a way to determine the number of possible lines before actually trying it?
I want to read as many as possible (because I think that's faster) but I've ran out of memory quite often.
Thanks in advance.
EDIT
I think I have to explain my problem a bit more.
The purpose is not to see if the two files in general are identical (they are not).
There are some lines in each file that share the same "characteristic".
Here's an example:
file_1 looks somewhat like this:
mat1 1000 2000 TEXT //this means the range is from 1000 - 2000
mat1 2040 2050 TEXT
mat3 10000 10010 TEXT
mat2 20 500 TEXT
file_2looks like this:
mat3 10009 TEXT
mat3 200 TEXT
mat1 999 TEXT
TEXT refers to characters and digits that are of no interest for me, mat can go from mat1 - mat50 and are in no order; also there can be 1000x mat2 (but the numbers in the next column are different). I need to find the fitting lines in a way that: matX is the same in both compared lines an the number mentioned in file_2 fits into the range mentioned in file_1.
So in my example I would find one match: line 3 of file_1and line 1 of file_2 (because both are mat3 and 10009 is between 10000 and 10010).
I hope this makes it clear to you!
So my question is: how would you search for the matching lines?
Yes, I use Java as my programming language.
EDIT
I now divided the huge files first so that I have no problems with being out of memory. I also think it is faster to compare (many) smaller files to each other than those two huge files. After that I can compare them the way I mentioned above. It may not be the perfect way, but I am still learning ;-)
Nonentheless all your approaches were very helpful to me, thank you for your replies!
I think, your way is rather reasonable.
I can imagine different strategies -- for example, you can sort both files before compare (where is efficient implementation of filesort, and unix sort utility can sort several Gbs files in minutes), and, while sorted, you can compare files sequentally, reading line by line.
But this is rather complex way to go -- you need to run external program (sort), or write comparable efficient implementation of filesort in java by yourself -- which is by itself not an easy task. So, for the sake of simplicity, I think you way of chunked read is very promising;
As for how to find reasonable block -- first of all, it may not be correct what "the more -- the better" -- I think, time of all work will grow asymptotically, to some constant line. So, may be you'll be close to that line faster then you think -- you need benchmark for this.
Next -- you may read lines to buffer like this:
final List<String> lines = new ArrayList<>();
try{
final List<String> block = new ArrayList<>(BLOCK_SIZE);
for(int i=0;i<BLOCK_SIZE;i++){
final String line = ...;//read line from file
block.add(line);
}
lines.addAll(block);
}catch(OutOfMemory ooe){
//break
}
So you read as many lines, as you can -- leaving last BLOCK_SIZE of free memory. BLOCK_SIZE should be big enouth to the rest of you program to run without OOM
In an ideal world, you would be able to read in every line of file_2 into memory (probably using a fast lookup object like a HashSet, depending on your needs), then read in each line from file_1 one at a time and compare it to your data structure holding the lines from file_2.
As you have said you run out of memory however, I think a divide-and-conquer type strategy would be best. You could use the same method as I mentioned above, but read in a half (or a third, a quarter... depending on how much memory you can use) of the lines from file_2 and store them, then compare all of the lines in file_1. Then read in the next half/third/quarter/whatever into memory (replacing the old lines) and go through file_1 again. It means you have to go through file_1 more, but you have to work with your memory constraints.
EDIT: In response to the added detail in your question, I would change my answer in part. Instead of reading in all of file_2 (or in chunks) and reading in file_1 a line at a time, reverse that, as file_1 holds the data to check against.
Also, with regards searching the matching lines. I think the best way would be to do some processing on file_1. Create a HashMap<List<Range>> that maps a String ("mat1" - "mat50") to a list of Ranges (just a wrapper for a startOfRange int and an endOfRange int) and populate it with the data from file_1. Then write a function like (ignoring error checking)
boolean isInRange(String material, int value)
{
List<Range> ranges = hashMapName.get(material);
for (Range range : ranges)
{
if (value >= range.getStart() && value <= range.getEnd())
{
return true;
}
}
return false;
}
and call it for each (parsed) line of file_2.
Now that you've given us more specifics, the approach I would take relies upon pre-partitioning, and optionally, sorting before searching for matches.
This should eliminate a substantial amount of comparisons that wouldn't otherwise match anyway in the naive, brute-force approach. For the sake of argument, lets peg both files at 40 million lines each.
Partitioning: Read through file_1 and send all lines starting with mat1 to file_1_mat1, and so on. Do the same for file_2. This is trivial with a little grep, or should you wish to do it programmatically in Java it's a beginner's exercise.
That's one pass through two files for a total of 80million lines read, yielding two sets of 50 files of 800,000 lines each on average.
Sorting: For each partition, sort according to the numeric value in the second column only (the lower bound from file_1 and the actual number from file_2). Even if 800,000 lines can't fit into memory I suppose we can adapt 2-way external merge sort and perform this faster (fewer overall reads) than a sort of the entire unpartitioned space.
Comparison: Now you just have to iterate once through both pairs of file_1_mat1 and file_2_mat1, without need to keep anything in memory, outputting matches to your output file. Repeat for the rest of the partitions in turn. No need for a final 'merge' step (unless you're processing partitions in parallel).
Even without the sorting stage the naive comparison you're already doing should work faster across 50 pairs of files with 800,000 lines each rather than with two files with 40 million lines each.
there is a tradeoff: if you read a big chunk of the file, you save the disc seek time, but you may have read information you will not need, since the change was encountered on the first lines.
You should probably run some experiments [benchmarks], with varying chunk size, to find out what is the optimal chunk to read, in the average case.
No sure how good an answer this would be - but have a look at this page: http://c2.com/cgi/wiki?DiffAlgorithm - it summarises a few diff algorithms. Hunt-McIlroy algorithm is probably the better implementation. From that page there's also a link to a java implementation of the GNU diff. However, I think an implementation in C/C++ and compiled into native code will be much faster. If you're stuck with java, you may want to consider JNI.
Indeed, that could take a while. You have to make 1,200.000,000 line comparisions.
There are several possibilities to speed that up by an order of magnitute:
One would be to sort file2 and do kind of a binary search on file level.
Another approach: compute a checksum of each line, and search that. Depending on average line length, the file in question would be much smaller and you really can do a binary search if you store the checksums in a fixed format (i.e. a long)
The number of lines you read at once from file_1 does not matter, however. This is micro-optimization in the face of great complexity.
If you want a simple approach: you can hash both of the files and compare the hash. But it's probably faster (especially if the files differ) to use your approach. About the memory consumption: just make sure you use enough memory, using no buffer for this kind a thing is a bad idea..
And all those answers about hashes, checksums etc: those are not faster. You have to read the whole file in both cases. With hashes/checksums you even have to compute something...
What you can do is sort each individual file. e.g. the UNIX sort or similar in Java. You can read the sorted files one line at a time to perform a merge sort.
I have never worked with such huge files but this is my idea and should work.
You could look into hash. Using SHA-1 Hashing.
Import the following
import java.io.FileInputStream;
import java.security.MessageDigest;
Once your text file etc has been loaded have it loop through each line and at the end print out the hash. The example links below will go into more depth.
StringBuffer myBuffer = new StringBuffer("");
//For each line loop through
for (int i = 0; i < mdbytes.length; i++) {
myBuffer.append(Integer.toString((mdbytes[i] & 0xff) + 0x100, 16).substring(1));
}
System.out.println("Computed Hash = " + sb.toString());
SHA Code example focusing on Text File
SO Question about computing SHA in JAVA (Possibly helpful)
Another sample of hashing code.
Simple read each file seperatley, if the hash value for each file is the same at the end of the process then the two files are identical. If not then something is wrong.
Then if you get a different value you can do the super time consuming line by line check.
Overall, It seems that reading line by line by line by line etc would take forever. I would do this if you are trying to find each individual difference. But I think hashing would be quicker to see if they are the same.
SHA checksum
If you want to know exactly if the files are different or not then there isn't a better solution than yours -- comparing sequentially.
However you can make some heuristics that can tell you with some kind of probability if the files are identical.
1) Check file size; that's the easiest.
2) Take a random file position and compare block of bytes starting at this position in the two files.
3) Repeat step 2) to achieve the needed probability.
You should compute and test how many reads (and size of block) are useful for your program.
My solution would be to produce an index of one file first, then use that to do the comparison. This is similar to some of the other answers in that it uses hashing.
You mention that the number of lines is up to about 45 million. This means that you could (potentially) store an index which uses 16 bytes per entry (128 bits) and it would use about 45,000,000*16 = ~685MB of RAM, which isn't unreasonable on a modern system. There are overheads in using the solution I describe below, so you might still find you need to use other techniques such as memory mapped files or disk based tables to create the index. See Hypertable or HBase for an example of how to store the index in a fast disk-based hash table.
So, in full, the algorithm would be something like:
Create a hash map which maps Long to a List of Longs (HashMap<Long, List<Long>>)
Get the hash of each line in the first file (Object.hashCode should be sufficient)
Get the offset in the file of the line so you can find it again later
Add the offset to the list of lines with matching hashCodes in the hash map
Compare each line of the second file to the set of line offsets in the index
Keep any lines which have matching entries
EDIT:
In response to your edited question, this wouldn't really help in itself. You could just hash the first part of the line, but it would only create 50 different entries. You could then create another level in the data structure though, which would map the start of each range to the offset of the line it came from.
So something like index.get("mat32") would return a TreeMap of ranges. You could look for the range preceding the value you are looking for lowerEntry(). Together this would give you a pretty fast check to see if a given matX/number combination was in one of the ranges you are checking for.
try to avoid memory consuming and make it disc consuming.
i mean divide each file into loadable size parts and compare them, this may take some extra time but will keep you safe dealing with memory limits.
What about using source control like Mercurial? I don't know, maybe it isn't exactly what you want, but this is a tool that is designed to track changes between revisions. You can create a repository, commit the first file, then overwrite it with another one an commit the second one:
hg init some_repo
cd some_repo
cp ~/huge_file1.txt .
hg ci -Am "Committing first huge file."
cp ~/huge_file2.txt huge_file1.txt
hg ci -m "Committing second huge file."
From here you can get a diff, telling you what lines differ. If you could somehow use that diff to determine what lines were the same, you would be all set.
That's just an idea, someone correct me if I'm wrong.
I would try the following: for each file that you are comparing, create temporary files (i refer to it as partial file later) on disk representing each alphabetic letter and an additional file for all other characters. then read the whole file line by line. while doing so, insert the line into the relevant file that corresponds to the letter it starts with. since you have done that for both files, you can now limit the comparison for loading two smaller files at a time. a line starting with A for example can appear only in one partial file and there will not be a need to compare each partial file more than once. If the resulting files are still very large, you can apply the same methodology on the resulting partial files (letter specific files) that are being compared by creating files according to the second letter in them. the trade-of here would be usage of large disk space temporarily until the process is finished. in this process, approaches mentioned in other posts here can help in dealing with the partial files more efficiently.
I find myself needing to generate a checksum for a string of data, for consistency purposes. The broad idea is that the client can regenerate the checksum based on the payload it recieves and thus detect any corruption that took place in transit. I am vaguely aware that there are all kinds of mathematical principles behind this kind of thing, and that it's very easy for subtle errors to make the whole algorithm ineffective if you try to roll it yourself.
So I'm looking for advice on a hashing/checksum algorithm with the following criteria:
It will be generated by Javascript, so needs to be relatively light computationally.
The validation will be done by Java (though I cannot see this actually being an issue).
It will take textual input (URL-encoded Unicode, which I believe is ASCII) of a moderate length; typically around 200-300 characters and in all cases below 2000.
The output should be ASCII text as well, and the shorter it can be the better.
I'm primarily interested in something lightweight rather than getting the absolute smallest potential for collisions possible. Would I be naive to imagine that an eight-character hash would be suitable for this? I should also clarify that it's not the end of the world if corruption isn't picked up at the validation stage (and I do realise that this will not be 100% reliable), though the rest of my code is markedly less efficient for every corrupt entry that slips through.
Edit - thanks to all that contributed. I went with the Adler32 option and given that it was natively supported in Java, extremely easy to implement in Javascript, fast to calculate at both ends and have an 8-byte output it was exactly right for my requirements.
(Note that I realise that the network transport is unlikely to be responsible for any corruption errors and won't be folding my arms on this issue just yet; however adding the checksum validation removes one point of failure and means we can focus on other areas should this reoccur.)
CRC32 is not too hard to implement in any language, it is good enough to detect simple data corruption and when implemted in a good fashion, it is very fast. However you can also try Adler32, which is almost equally good as CRC32, but it's even easier to implement (and about equally fast).
Adler32 in the Wikipedia
CRC32 JavaScript implementation sample
Either of these two (or maybe even both) are available in Java right out of the box.
Are aware that both TCP and UDP (and IP, and Ethernet, and...) already provide checksum protection to data in transit?
Unless you're doing something really weird, if you're seeing corruption, something is very wrong. I suggest starting with a memory tester.
Also, you receive strong data integrity protection if you use SSL/TLS.
Javascript implementation of MD4, MD5 and SHA1. BSD license.
Other people have mentioned CRC32 already, but here's a link to the W3C implementation of CRC-32 for PNG, as one of the few well-known, reputable sites with a reference CRC implementation.
(A few years back I tried to find a well-known site with a CRC algorithm or at least one that cited the source for its algorithm, & was almost tearing my hair out until I found the PNG page.)
[UPDATE 30/5/2013: The link to the old JS CRC32 implementation died, so I've now linked to a different one.]
Google CRC32: fast, and much lighter weight than MD5 et al. There is a Javascript implementation here.
In my search for a JavaScript implementation of a good checksum algorithm I came across this question. Andrzej Doyle rightfully chose Adler32 as the checksum, as it is indeed easy to implement and has some excellent properties. DroidOS then provided an actual implementation in JavaScript, which demonstrated the simplicity.
However, the algorithm can be further improved upon as detailed in the Wikipedia page and as implemented below. The trick is that you need not determine the modulo in each step. Rather, you can defer this to the end. This considerably increases the speed of the implementation, up to 6x faster on Chrome and Safari. In addition, this optimalisation does not affect the readability of the code making it a win-win. As such, it definitely fits in well with the original question as to having an algorithm / implementation that is computationally light.
function adler32(data) {
var MOD_ADLER = 65521;
var a = 1, b = 0;
var len = data.length;
for (var i = 0; i < len; i++) {
a += data.charCodeAt(i);
b += a;
}
a %= MOD_ADLER;
b %= MOD_ADLER;
return (b << 16) | a;
}
edit: imaya created a jsperf comparison a while back showing the difference in speed when running the simple version, as detailed by DroidOS, compared to an optimised version that defers the modulo operation. I have added the above implementation under the name full-length to the jsperf page showing that the above implementation is about 25% faster than the one from imaya and about 570% faster than the simple implementation (tests run on Chrome 30): http://jsperf.com/adler-32-simple-vs-optimized/6
edit2: please don't forget that, when working on large files, you will eventually hit the limit of your JavaScript implementation in terms of the a and b variables. As such, when working with a large data source, you should perform intermediate modulo operations as to ensure that you do not exceed the maximum value of the integer that you can reliably store.
Use SHA-1 JS implementation. It's not as slow as you think (Firefox 3.0 on Core 2 Duo 2.4Ghz hashes over 100KB per second).
Here's a relatively simple one I've 'invented' - there's no mathematical research behind it but it's extremely fast and works in practice. I've also included the Java equivalent that tests the algorithm and shows that there's less than 1 in 10,000,000 chance of failure (it takes a minute or two to run).
JavaScript
function getCrc(s) {
var result = 0;
for(var i = 0; i < s.length; i++) {
var c = s.charCodeAt(i);
result = (result << 1) ^ c;
}
return result;
}
Java
package test;
import java.util.*;
public class SimpleCrc {
public static void main(String[] args) {
final Random randomGenerator = new Random();
int lastCrc = -1;
int dupes = 0;
for(int i = 0; i < 10000000; i++) {
final StringBuilder sb = new StringBuilder();
for(int j = 0; j < 1000; j++) {
final char c = (char)(randomGenerator.nextInt(128 - 32) + 32);
sb.append(c);
}
final int crc = crc(sb.toString());
if(lastCrc == crc) {
dupes++;
}
lastCrc = crc;
}
System.out.println("Dupes: " + dupes);
}
public static int crc(String string) {
int result = 0;
for(final char c : string.toCharArray()) {
result = (result << 1) ^ c;
}
return result;
}
}
This is a rather old thread but I suspect it is still viewed quite often so - if all you need is a short but reliable piece of code to generate a checksum the Adler32 bit algorithm has to be your choice. Here is the JavaScript code
function adler32(data)
{
var MOD_ADLER = 65521;
var a = 1, b = 0;
for (var i = 0;i < data.length;i++)
{
a = (a + data.charCodeAt(i)) % MOD_ADLER;
b = (b + a) % MOD_ADLER;
}
var adler = a | (b << 16);
return adler;
}
The corresponding fiddle demonsrating the algorithm in action is here.