I find myself needing to generate a checksum for a string of data, for consistency purposes. The broad idea is that the client can regenerate the checksum based on the payload it receives and thus detect any corruption that took place in transit. I am vaguely aware that there are all kinds of mathematical principles behind this kind of thing, and that it's very easy for subtle errors to make the whole algorithm ineffective if you try to roll it yourself.
So I'm looking for advice on a hashing/checksum algorithm with the following criteria:
It will be generated by Javascript, so needs to be relatively light computationally.
The validation will be done by Java (though I cannot see this actually being an issue).
It will take textual input (URL-encoded Unicode, which I believe is ASCII) of a moderate length; typically around 200-300 characters and in all cases below 2000.
The output should be ASCII text as well, and the shorter it can be the better.
I'm primarily interested in something lightweight rather than getting the absolute smallest potential for collisions possible. Would I be naive to imagine that an eight-character hash would be suitable for this? I should also clarify that it's not the end of the world if corruption isn't picked up at the validation stage (and I do realise that this will not be 100% reliable), though the rest of my code is markedly less efficient for every corrupt entry that slips through.
Edit - thanks to all that contributed. I went with the Adler32 option: it is natively supported in Java, extremely easy to implement in JavaScript, fast to calculate at both ends, and its 32-bit value renders as eight hex characters, so it was exactly right for my requirements.
(Note that I realise that the network transport is unlikely to be responsible for any corruption errors and won't be folding my arms on this issue just yet; however adding the checksum validation removes one point of failure and means we can focus on other areas should this reoccur.)
CRC32 is not too hard to implement in any language, it is good enough to detect simple data corruption and, when implemented well, it is very fast. However, you could also try Adler32, which is almost as good as CRC32 but even easier to implement (and about equally fast).
Adler32 on Wikipedia
CRC32 JavaScript implementation sample
Both of these are available in Java right out of the box, as java.util.zip.CRC32 and java.util.zip.Adler32.
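For example, here is a minimal sketch of the Java validation side using those built-in classes (the hex formatting is just one convenient way to get a short ASCII representation):

import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

public class ChecksumDemo {
    // Renders a checksum of the given text as eight hex characters.
    static String hexChecksum(Checksum checksum, String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII); // URL-encoded payloads are ASCII
        checksum.update(bytes, 0, bytes.length);
        return String.format("%08x", checksum.getValue());
    }

    public static void main(String[] args) {
        String payload = "example%20url-encoded%20payload";
        System.out.println("Adler-32: " + hexChecksum(new Adler32(), payload));
        System.out.println("CRC-32:   " + hexChecksum(new CRC32(), payload));
    }
}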
Are you aware that both TCP and UDP (and IP, and Ethernet, and...) already provide checksum protection for data in transit?
Unless you're doing something really weird, if you're seeing corruption, something is very wrong. I suggest starting with a memory tester.
Also, you receive strong data integrity protection if you use SSL/TLS.
Javascript implementation of MD4, MD5 and SHA1. BSD license.
Other people have mentioned CRC32 already, but here's a link to the W3C implementation of CRC-32 for PNG, as one of the few well-known, reputable sites with a reference CRC implementation.
(A few years back I tried to find a well-known site with a CRC algorithm, or at least one that cited the source for its algorithm, and was almost tearing my hair out until I found the PNG page.)
[UPDATE 30/5/2013: The link to the old JS CRC32 implementation died, so I've now linked to a different one.]
Google CRC32: fast, and much lighter weight than MD5 et al. There is a Javascript implementation here.
In my search for a JavaScript implementation of a good checksum algorithm I came across this question. Andrzej Doyle rightly chose Adler32 as the checksum, as it is indeed easy to implement and has some excellent properties. DroidOS then provided an actual implementation in JavaScript, which demonstrated the simplicity.
However, the algorithm can be further improved upon, as detailed on the Wikipedia page and as implemented below. The trick is that you need not apply the modulo in each step; rather, you can defer it to the end. This considerably increases the speed of the implementation, up to 6x faster on Chrome and Safari. In addition, this optimisation does not affect the readability of the code, making it a win-win. As such, it definitely fits the original question's request for an algorithm/implementation that is computationally light.
function adler32(data) {
    var MOD_ADLER = 65521;
    var a = 1, b = 0;
    var len = data.length;
    // Defer the modulo until the end: for short inputs the running sums
    // stay well within the range JavaScript numbers represent exactly.
    for (var i = 0; i < len; i++) {
        a += data.charCodeAt(i);
        b += a;
    }
    a %= MOD_ADLER;
    b %= MOD_ADLER;
    // >>> 0 forces an unsigned 32-bit result, so the value matches what
    // Java's java.util.zip.Adler32.getValue() reports.
    return ((b << 16) | a) >>> 0;
}
edit: imaya created a jsperf comparison a while back showing the difference in speed when running the simple version, as detailed by DroidOS, compared to an optimised version that defers the modulo operation. I have added the above implementation under the name full-length to the jsperf page showing that the above implementation is about 25% faster than the one from imaya and about 570% faster than the simple implementation (tests run on Chrome 30): http://jsperf.com/adler-32-simple-vs-optimized/6
edit2: please don't forget that, when working on large inputs, the deferred a and b sums will eventually exceed the range in which JavaScript numbers can represent integers exactly (2^53). As such, when working with a large data source, you should perform intermediate modulo operations to ensure you do not exceed that limit.
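For illustration, here is a sketch of that chunked-deferral idea in Java over a byte array. The NMAX constant of 5552 is taken from zlib: it is the longest run for which the deferred sums cannot overflow unsigned 32-bit arithmetic (in JavaScript, with exact integers up to 2^53, the safe run is correspondingly longer):

public class Adler32Chunked {
    private static final int MOD_ADLER = 65521;
    private static final int NMAX = 5552; // zlib's bound for 32-bit arithmetic

    // Adler-32 with the modulo applied once per chunk instead of once per byte.
    public static long adler32(byte[] data) {
        long a = 1, b = 0;
        int i = 0;
        while (i < data.length) {
            int end = Math.min(i + NMAX, data.length);
            for (; i < end; i++) {
                a += data[i] & 0xff; // treat each byte as unsigned
                b += a;
            }
            a %= MOD_ADLER; // fold the deferred sums back into range
            b %= MOD_ADLER;
        }
        return (b << 16) | a;
    }
}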
Use this SHA-1 JS implementation. It's not as slow as you think (Firefox 3.0 on a Core 2 Duo at 2.4GHz hashes over 100KB per second).
Here's a relatively simple one I've 'invented' - there's no mathematical research behind it, but it's extremely fast and works in practice. I've also included the Java equivalent that tests the algorithm and shows that there's less than a 1 in 10,000,000 chance of failure (it takes a minute or two to run).
JavaScript
function getCrc(s) {
    var result = 0;
    for (var i = 0; i < s.length; i++) {
        var c = s.charCodeAt(i);
        // Shift the accumulator and fold in the next character. Note that each
        // shift discards the top bit, so in a 32-bit result only roughly the
        // last 32 characters of the string influence the final value.
        result = (result << 1) ^ c;
    }
    return result;
}
Java
package test;

import java.util.*;

public class SimpleCrc {

    public static void main(String[] args) {
        final Random randomGenerator = new Random();
        int lastCrc = -1;
        int dupes = 0;
        for (int i = 0; i < 10000000; i++) {
            // Build a random 1000-character printable-ASCII string.
            final StringBuilder sb = new StringBuilder();
            for (int j = 0; j < 1000; j++) {
                final char c = (char) (randomGenerator.nextInt(128 - 32) + 32);
                sb.append(c);
            }
            final int crc = crc(sb.toString());
            // Note: this only counts collisions between *consecutive* random
            // strings, not across all pairs.
            if (lastCrc == crc) {
                dupes++;
            }
            lastCrc = crc;
        }
        System.out.println("Dupes: " + dupes);
    }

    public static int crc(String string) {
        int result = 0;
        for (final char c : string.toCharArray()) {
            result = (result << 1) ^ c;
        }
        return result;
    }
}
This is a rather old thread, but I suspect it is still viewed quite often, so if all you need is a short but reliable piece of code to generate a checksum, the Adler-32 algorithm has to be your choice. Here is the JavaScript code:
function adler32(data)
{
    var MOD_ADLER = 65521;
    var a = 1, b = 0;
    for (var i = 0; i < data.length; i++)
    {
        a = (a + data.charCodeAt(i)) % MOD_ADLER;
        b = (b + a) % MOD_ADLER;
    }
    // >>> 0 keeps the combined value an unsigned 32-bit integer.
    var adler = ((b << 16) | a) >>> 0;
    return adler;
}
The corresponding fiddle demonstrating the algorithm in action is here.
Related
I want to write a relatively simple program that can back up files from my computer to a remote location and encrypt them in the process, while also computing a diff (well, not really... I'm content with seeing if anything changed at all, not so much what has changed) between the local and the remote files to see which ones have changed and need updating.
I am aware that there are perfectly good programs out there to do this (rsync, or others based on duplicity). I'm not trying to reinvent the wheel; it's just supposed to be a learning experience for myself.
My question regards the diff part of the project. I have made some assumptions and written some sample code to test them out, but I would like to know if you see anything I might have missed, if the assumptions are just plain wrong, or if there's something that could go wrong in a particular constellation.
Assumption 1: If files are not of equal length, they cannot be the same (i.e. some modification must have taken place).
Assumption 2: If two files are the same (i.e. no modification has taken place), any corresponding byte sub-sets of the two files will have the same hash.
Assumption 3: If a byte sub-set of the two files is found which does not result in the same hash, the two files are not the same (i.e. have been modified).
The code is written in Java and the hashing algorithm used is BLAKE-512, using the Java implementation from Marc Greim.
_File1 and _File2 are two files > 1.5 GB of type java.io.File.
public boolean compareStream() throws IOException {
    long i = 0;
    final int step = 4096;
    // Assumption 1: files of different length cannot be equal.
    // (This also makes two empty files compare as equal.)
    boolean equal = _File1.length() == _File2.length();
    FileInputStream fi1 = new FileInputStream(_File1);
    FileInputStream fi2 = new FileInputStream(_File2);
    byte[] fi1Content = new byte[step];
    byte[] fi2Content = new byte[step];
    while (equal && i * step < _File1.length()) {
        // Note: read() may return fewer bytes than requested; a robust
        // version should loop until each buffer is actually filled.
        fi1.read(fi1Content, 0, step); // Assumption 2
        fi2.read(fi2Content, 0, step); // Assumption 2
        equal = BLAKE512.isEqual(fi1Content, fi2Content); // Assumptions 2 and 3
        ++i;
    }
    fi1.close();
    fi2.close();
    return equal;
}
The calculation for two equal 1.5 GB files takes around 4.2 seconds. Times are of course much shorter when the files differ, especially when they are of different length since it returns immediately.
Thank you for your suggestions :)
..I hope this isn't too broad
While the assumptions are correct, they won't protect you from rare false positives (when the method says the files are equal even though they aren't):
Assumption 2: If two files are the same (i.e. no modification has taken place), any corresponding byte sub-sets will have the same hash.
This is right, but because of hash collisions you can have a situation where the hashes of two chunks are the same even though the chunks themselves differ.
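Since the code above already has both chunks in memory, one simple way to rule this out entirely (for the local-comparison case at least) is to compare the raw bytes instead of hashes; java.util.Arrays makes this a one-liner:

import java.util.Arrays;

class ChunkCompare {
    // Comparing raw bytes cannot produce a false positive; hashing only pays
    // off when one side is remote and you cannot ship the bytes themselves.
    static boolean chunksEqual(byte[] chunk1, byte[] chunk2) {
        return Arrays.equals(chunk1, chunk2);
    }
}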
I have a question; I'm doing some research on some programming languages.
The research is about the efficiency of the substring functions in C# and Java.
Questions like: does C# use a brute-force approach, or does it implement Boyer-Moore like a good boy?
I need the source code for this. I already found it for Java (which uses a brute-force implementation in the indexOf() method, for those who wonder).
Does anyone have an idea how I can retrieve the source code for methods like these in C#?
I have Visual Studio installed on my laptop but I can't find any source code...
Your help would be much appreciated!
Microsoft has published the complete framework source code, including comments. You will find the actual implementation over here on referencesource. For Substring, it comes down to some unmanaged code:
[System.Security.SecurityCritical]  // auto-generated
unsafe string InternalSubString(int startIndex, int length) {
    Contract.Assert(startIndex >= 0 && startIndex <= this.Length, "StartIndex is out of range!");
    Contract.Assert(length >= 0 && startIndex <= this.Length - length, "length is out of range!");

    String result = FastAllocateString(length);

    fixed (char* dest = &result.m_firstChar)
    fixed (char* src = &this.m_firstChar) {
        wstrcpy(dest, src + startIndex, length);
    }

    return result;
}
As you can see, they are using wstrcpy which probably is as fast as it gets.
I do not have a teacher whom I can ask questions about efficiency, so I will ask them here.
If I am only looking to have fast working code, not paying attention to RAM use, only CPU:
I assume that checking an 'if' once is faster than writing a variable once. But what is the ratio? When is it worth always checking whether the variable is already at the value I am going to set it to?
For example:
// ex. 1
int a = 5;
while (true) {
    a = 5;
}

// ex. 2
int a = 5;
while (true) {
    if (a != 5) a = 5;
}

// ex. 3
int a = 6;
while (true) {
    if (a != 5) a = 5;
    a = 6;
}
I guess ex. 2 will work faster than ex. 1 because 'a' always stays at '5'. In this case the 'if' speeds up the process by not writing a new value to 'a' every time. But if 'a' changes often, as in ex. 3, then checking if (a != 5) is unnecessary and slows down the process. So the check is worth it if the variable stays the same most of the time, and not worth it if the variable changes most of the time. But where is the ratio? Maybe writing a variable takes 1000 times longer than just checking it? Or maybe writing takes almost the same time as checking? I'm not asking for an exact answer, I just always wonder what is best for my code.
Short answer: it doesn't matter.
Long answer: it really doesn't matter at that low level. Even if you were to actually compare the executed machine code, there are so many things in between (the JIT compiler for one, all sorts of CPU caches for another).
Gone are the times when you needed to micro-optimize things like this. What you need to make sure is that you're using effective algorithms. And as always, premature optimization is the root of all evil.
I noted that you wrote "I just always wonder what is best for my code". The best way is to write clear code, so that other people can understand what you're doing (if they saw code like in your examples, they would think you're insane). Another old adage is that, in order for the JVM to optimize your code in the best way, you should write "dumb code". The JIT optimizer can then understand the code better and convert it to a more efficient form.
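If you ever do want to measure a micro-question like this, use a proper harness such as JMH rather than hand-rolled timing loops (JMH is not mentioned above; it's just one common choice). A hypothetical benchmark comparing the two variants might look like this:

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

// A minimal JMH sketch: the harness handles warmup, so the JIT-compiled
// code is what actually gets measured. Returning the field prevents the
// optimizer from removing the work as dead code.
@State(Scope.Thread)
public class WriteVsCheckBenchmark {
    int a = 5;

    @Benchmark
    public int alwaysWrite() {
        a = 5;
        return a;
    }

    @Benchmark
    public int checkThenWrite() {
        if (a != 5) a = 5;
        return a;
    }
}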
I am calling the Google Protocol Buffers Java API from Matlab. This works pretty well, but I have hit a big performance bottleneck. The bulk of the data are returned as objects of type:
java.util.Collections$UnmodifiableRandomAccessList
They actually contain a list of floats. I need to convert this to a Matlab matrix. The best approach I have found so far is to call:
cell2mat(cell(Q.toArray()))
However, that one line is a huge performance bottleneck in the code.
Note that I am aware of the FarSounder Matlab parser generators for the Google Protocol Buffers; unfortunately these are very slow. See below for some rough benchmark speeds for my problem (YMMV). Higher is better.
Farsounder Matlab: 0.03
Pure Python: 1
Java API called from Matlab (parsing and extracting metadata only): 10
Java API called from Matlab (parsing and extracting both metadata and data): 0.25
If it weren't for the overhead of converting the java.util.Collections$UnmodifiableRandomAccessList to a Matlab matrix, the approach of calling the Java API from Matlab would look quite promising.
Is there a better way of converting this Java object into a Matlab matrix?
Bear in mind that the method returning this type is in automatically generated code.
You might be best off writing a tiny piece of extra Java code, like so:
import java.util.List;
import java.util.ListIterator;

class Helper {
    public static float[] toFloatArray(List l) {
        float retValue[] = new float[l.size()];
        ListIterator iterator = l.listIterator();
        for (int idx = 0; idx < retValue.length; ++idx) {
            // List had better contain float values,
            // or else the following line will ClassCastException.
            retValue[idx] = (float) iterator.next();
        }
        return retValue;
    }
}
with which I see:
>> j = java.util.LinkedList;
>> for idx = 1:1e5, j.add(single(idx)); end
>> tic, out = Helper.toFloatArray(j); toc
Elapsed time is 0.006553 seconds.
>> tic, cell2mat(cell(j.toArray)); toc
Elapsed time is 0.305973 seconds.
In my experience, the most performant solution is to write a small set of Java helpers that convert the lists to plain arrays of primitive types.
These map well to Matlab matrices.
If the above gives, for example, an array of java.lang.Float, the helper could look like this:
public static float[] toFloats(Float[] floats) {
    float[] rv = new float[floats.length];
    for (int i = 0; i < floats.length; i++) rv[i] = (float) floats[i];
    return rv;
}
In Matlab, cell2mat(cell(Q.toArray())) would hence become:
some.package.toFloats(Q.toArray());
Obviously you could also modify the helper function to take your list directly, avoiding the need for the toArray() call (does this actually make a copy?).
I'm currently developing a Java-based library for network coding (http://en.wikipedia.org/wiki/Network_coding). This is very CPU-intensive, so I need some help optimizing the encoding stage. What I'm essentially doing is creating random linear combinations of the original data, where addition is XOR and multiplication is a Galois-field multiplication (in GF(2^16)).
I've come as far as I can with the optimizations. For instance, I'm using tricks like this: http://groups.google.com/group/comp.dsp/browse_thread/thread/cba57ae9db9971fd/7cd21eec39ddae1a?hl=en&lnk=gst&q=Sarwate+Galois#7cd21eec39ddae1a to make the multiplications faster.
I'm therefore looking for tips on how to optimize this further. It's hard to profile, since the profilers I've used don't give any hints on which operation is the most expensive (e.g. is it the array lookup or the XOR?). So I'm at the point where I'm randomly trying out different ideas and testing whether they improve the overall performance.
More specifically, some potential areas of improvement that I need help on are:
How can I make sure that Java can skip the bounds checking on the array operations?
How can I see the native code that actually executes once HotSpot has finished optimizing?
Here's the core of the algorithm. It might be hard to understand out of context but if you see any unnecessarily expensive operations I'm doing then please let me know!
int messageFragmentStart = 0;
int messageFragmentEnd = fragmentCharSize;
int coefficientIndex = fragmentID * messageFragmentsPerDataBlock;
final int resultArrayIndexStart = fragmentID * fragmentCharSize;

for (int messageFragmentIndex = 0; messageFragmentIndex < messageFragmentsPerDataBlock; messageFragmentIndex++) {
    // Multiply the whole fragment by one coefficient (log/exp-table
    // Galois multiplication) and XOR it into the result.
    final int coefficientLogValue = coefficientLogValues[coefficientIndex++];
    int resultArrayIndex = resultArrayIndexStart;
    for (int i = messageFragmentStart; i < messageFragmentEnd; i++) {
        final int logSum = coefficientLogValue + logOfDataToEncode[i];
        final int messageMultipliedByCoefficient = expTable[logSum];
        resultArray[resultArrayIndex++] ^= messageMultipliedByCoefficient;
    }
    messageFragmentStart += fragmentCharSize;
    messageFragmentEnd = Math.min(messageFragmentEnd + fragmentCharSize, maxTotalLength);
}
You can't make Java forgo the bounds checking, as it's specified in the JLS. But in most cases the JIT is able to avoid it, as long as the bounds check is simple (e.g. i < array.length); if not, there's no way to avoid it (well, I assume one could play with unsafe objects?).
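Concretely, the shape HotSpot handles best is a loop whose index runs directly against the array's own length. A sketch (this is typical JIT behaviour, not something the JLS guarantees):

// When the index provably stays within [0, data.length), the JIT can hoist
// the bounds check out of the loop or eliminate it entirely.
static int xorAll(int[] data) {
    int acc = 0;
    for (int i = 0; i < data.length; i++) {
        acc ^= data[i];
    }
    return acc;
}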
For your second problem there's this here, which should fulfill the purpose just fine.
But anyhow, from your code it seems like this problem is trivial to vectorize, and sadly the JVM isn't very good at that (if it does it at all). Hence implementing the same code in C/C++ using compiler intrinsics (you could even try the auto-vectorization of ICC/GCC) could lead to some quite noticeable speedups, assuming we're not completely memory-bound. So I'd implement it in C++ and use JNI, just for reference.
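For what it's worth, the Java side of such a JNI hook stays tiny. A sketch with hypothetical names and assumed array types (the actual encoder would live in the C++ library):

public class NativeEncoder {
    static {
        // Loads libencoder.so / encoder.dll, built from the C++ implementation.
        System.loadLibrary("encoder");
    }

    // Hypothetical native entry point: the vectorized C++ code reads the log
    // values of the data plus the coefficient and exp tables, and XORs the
    // encoded fragment into resultArray.
    public static native void encodeFragment(
            int fragmentID,
            int[] logOfDataToEncode,
            int[] coefficientLogValues,
            int[] expTable,
            int[] resultArray);
}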