Is there any hash function which has the following properties? - java

I want a hash function which is fast, collision resistant and can give unique output. The primary requirement is that it should be persistable, i.e. its progress (the hashing state) can be saved to a file and resumed later. You can also provide your own implementation in Python.
Implementations in other languages are also acceptable if they can be used from Python without getting my hands dirty in their internals.
Thanks in advance :)

Because of the pigeonhole principle, no hash function can generate hashes which are unique / collision-proof. A good hashing function is collision-resistant and makes it difficult to generate a file that produces a specified hash. Designing a good hash function is an advanced topic, and I'm certainly no expert in that field. However, since my code is based on SHA-256 it should be fairly collision-resistant, and hopefully it's also difficult to generate a file that produces a specified hash, but I can make no guarantees in that regard.
Here's a resumable hash function based on SHA-256 which is fairly fast: it takes about 44 seconds to hash a 1.4 GB file on my 2 GHz machine with 2 GB of RAM.
persistent_hash.py
#! /usr/bin/env python

''' Use SHA-256 to make a resumable hash function

    The file is divided into fixed-sized chunks, which are hashed separately.
    The hash of each chunk is combined into a hash for the whole file.

    The hashing process may be interrupted by Control-C (SIGINT) or SIGTERM.
    When a signal is received, hashing continues until the end of the
    current chunk, then the file position and current hex digest is saved
    to a file. The name of this file is formed by appending '.hash' to the
    name of the file being hashed.

    Just re-run the program to resume hashing. The '.hash' file will be deleted
    once hashing is completed.

    Written by PM 2Ring 2014.11.11
'''

import sys
import os
import hashlib
import signal

quit = False

blocksize = 1<<16   # 64kB
blocksperchunk = 1<<10

chunksize = blocksize * blocksperchunk

def handler(signum, frame):
    global quit
    print "\nGot signal %d, cleaning up." % signum
    quit = True


def do_hash(fname):
    hashname = fname + '.hash'
    if os.path.exists(hashname):
        with open(hashname, 'rt') as f:
            data = f.read().split()
        pos = int(data[0])
        current = data[1].decode('hex')
    else:
        pos = 0
        current = ''

    finished = False
    with open(fname, 'rb') as f:
        f.seek(pos)
        while not (quit or finished):
            full = hashlib.sha256(current)
            part = hashlib.sha256()
            for _ in xrange(blocksperchunk):
                block = f.read(blocksize)
                if block == '':
                    finished = True
                    break
                part.update(block)
            full.update(part.digest())
            current = full.digest()
            pos += chunksize
            print pos
            if finished or quit:
                break

    hexdigest = full.hexdigest()
    if quit:
        with open(hashname, 'wt') as f:
            f.write("%d %s\n" % (pos, hexdigest))
    elif os.path.exists(hashname):
        os.remove(hashname)

    return (not quit), pos, hexdigest


def main():
    if len(sys.argv) != 2:
        print "Calculate resumable hash of a file."
        print "Usage:\npython %s filename\n" % sys.argv[0]
        exit(1)

    fname = sys.argv[1]
    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)
    print do_hash(fname)

if __name__ == '__main__':
    main()
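Since the question title mentions Java: here is a rough, untested sketch of the same chunked idea using java.security.MessageDigest. The class and method names are mine, and a stopAfterChunks parameter stands in for the signal handling above (the caller decides when to pause).
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class ResumableHash {
    private static final int CHUNK_SIZE = 64 * 1024 * 1024;  // 64 MB per chunk

    /** Hashes up to stopAfterChunks chunks, saving/clearing "<pos> <hex>" in fileName + ".hash". */
    public static String hash(String fileName, long stopAfterChunks)
            throws IOException, NoSuchAlgorithmException {
        Path stateFile = Paths.get(fileName + ".hash");
        long pos = 0;
        byte[] running = new byte[0];
        if (Files.exists(stateFile)) {   // resume from the saved position and digest
            String[] parts = new String(Files.readAllBytes(stateFile), StandardCharsets.UTF_8).trim().split(" ");
            pos = Long.parseLong(parts[0]);
            running = fromHex(parts[1]);
        }
        boolean finished = false;
        try (RandomAccessFile in = new RandomAccessFile(fileName, "r")) {
            in.seek(pos);
            byte[] buf = new byte[CHUNK_SIZE];
            long chunks = 0;
            while (!finished && chunks++ < stopAfterChunks) {
                MessageDigest full = MessageDigest.getInstance("SHA-256");
                full.update(running);                       // chain in the previous state
                MessageDigest part = MessageDigest.getInstance("SHA-256");
                int read = 0;
                while (read < CHUNK_SIZE) {                 // fill one chunk
                    int n = in.read(buf, read, CHUNK_SIZE - read);
                    if (n < 0) { finished = true; break; }
                    read += n;
                }
                part.update(buf, 0, read);
                full.update(part.digest());                 // fold the chunk hash into the running hash
                running = full.digest();
                pos += read;
            }
        }
        String hex = toHex(running);
        if (finished) {
            Files.deleteIfExists(stateFile);                // done, no state to keep
        } else {
            Files.write(stateFile, (pos + " " + hex).getBytes(StandardCharsets.UTF_8));
        }
        return hex;
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b & 0xFF));
        return sb.toString();
    }

    private static byte[] fromHex(String s) {
        byte[] out = new byte[s.length() / 2];
        for (int i = 0; i < out.length; i++)
            out[i] = (byte) Integer.parseInt(s.substring(2 * i, 2 * i + 2), 16);
        return out;
    }
}
As with the Python version, the resulting value is not a plain SHA-256 of the file but a chained hash of chunk digests, so both ends of a comparison must use the same scheme and chunk size.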

Related

Java call from Python without loading classpath

I am making a Java jar file call from Python.
def extract_words(file_path):
    """
    Extract words and bounding boxes

    Arguments:
        file_path {[str]} -- [Input file path]

    Returns:
        [Document]
    """
    extractor = PDFBoxExtractor(file_path=file_path,
                                jar_path="external/pdfbox-app-2.0.15.jar",
                                class_path="external")
    document = extractor.run()
    return document
And somewhere:
pipe = subprocess.Popen(['java',
                         '-cp',
                         '.:%s:%s' % (self._jar_path, self._class_path),
                         'PrintTextLocations',
                         self._file_path],
                        stdout=subprocess.PIPE)
output = pipe.communicate()[0].decode()
This is working fine. But the problem is that the jar is heavy, and when I have to call this multiple times in a loop it takes 3-4 seconds to load the jar file each time. If I run this in a loop for 100 iterations, it adds 300-400 seconds to the process.
Is there any way to keep the classpath alive for Java and not load the jar file every time? What's the best way to do it in a time-optimised manner?
You can encapsulate your PDFBoxExtractor in a class by making it a class member. Initialize the PDFBoxExtractor in the constructor of the class, like below:
class WordExtractor:
    def __init__(self, file_path):
        self.extractor = PDFBoxExtractor(file_path=file_path,
                                         jar_path="external/pdfbox-app-2.0.15.jar",
                                         class_path="external")

    def extract_words(self):
        """
        Extract words and bounding boxes

        Returns:
            [Document]
        """
        document = self.extractor.run()
        return document
The next step would be to create an instance of the WordExtractor class outside the loop.
word_extractor = WordExtractor(file_path)

# your loop would go here
while True:
    document = word_extractor.extract_words()
This is just example code to explain the concept. You may tweak it the way you want as per your requirements.
Hope this helps!
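If most of the 3-4 seconds per call is JVM startup rather than actual PDF work, another option (not part of the answer above; all class names here are hypothetical) is to start the Java process once and stream file paths to it over stdin, reading results back from stdout, so the jar is only loaded a single time:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

/**
 * Hypothetical long-running worker: launched once with the PDFBox jar on the
 * classpath, it reads one PDF path per line from stdin and writes the result
 * for that file to stdout, followed by a sentinel line.
 */
public class ExtractionServer {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String path;
        while ((path = in.readLine()) != null) {
            // The real PDFBox-based extraction (e.g. the logic currently in
            // PrintTextLocations) would run here and print its output.
            System.out.println("processed: " + path);  // placeholder output
            System.out.println("### DONE ###");        // sentinel so the caller knows this file is finished
            System.out.flush();
        }
    }
}
On the Python side you would then create the subprocess.Popen once with stdin=subprocess.PIPE and stdout=subprocess.PIPE, write one path per line, and read until the sentinel, instead of launching java on every call.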

Why is ReversedLinesFileReader so slow?

I have a file that is 21.6GB and I want to read it from the end to the start rather than from the beginning to the end as you would usually do.
If I read each line of the file from the start to the end using the following code, then it takes 1 minute, 12 seconds.
val startTime = System.currentTimeMillis()
File("very-large-file.xml").forEachLine {
    val i = 0
}
val diff = System.currentTimeMillis() - startTime
println(diff.timeFormat())
Now, I have read that to read in a file in reverse then I should use ReversedLinesFileReader from Apache Commons. I have created the following extension function to do just this:
fun File.forEachLineFromTheEndOfFile(action: (line: String) -> Unit) {
    val reader = ReversedLinesFileReader(this, Charset.defaultCharset())
    var line = reader.readLine()
    while (line != null) {
        action.invoke(line)
        line = reader.readLine()
    }
    reader.close()
}
and then call it in the following way, which is the same as the previous way only with a call to forEachLineFromTheEndOfFile function:
val startTime = System.currentTimeMillis()
File("very-large-file.xml").forEachLineFromTheEndOfFile {
    val i = 0
}
val diff = System.currentTimeMillis() - startTime
println(diff.timeFormat())
This took 17 minutes and 50 seconds to run!
Am I using ReversedLinesFileReader in the correct way?
I am running Linux Mint with an Ext4 file system on an SSD. Could this have anything to do with it?
Is it just the case that files should not be read from the end to the start?
You are asking for a very expensive operation. Not only are you using random access in blocks to read the file while going backwards (so if the file system is reading ahead, it is reading in the wrong direction), you are also reading an XML file which is UTF-8, and decoding a variable-length encoding is slower than a fixed-width one.
Then, on top of that, you are using a less-than-efficient algorithm. It reads one block at a time of an inconvenient size (is it disk-block-size aware? are you setting the block size to match your file system?) backwards while processing the encoding, makes an (unnecessary?) copy of the partial byte array, and then turns it into a string (do you need a string to parse?). It could create the string without the copy; really, creating the string could probably be deferred so that you work directly from the buffer, only decoding if you need to (XML parsers, for example, also work from ByteArrays or buffers). And there are other array copies that just are not needed, but are more convenient for the code.
It also might have a bug, in that it checks for newlines without considering that the character might mean something different if it is actually part of a multi-byte sequence. It would have to look back a few extra characters to check this for variable-length encodings, and I don't see it doing that.
So instead of a nice, forward-only, heavily buffered sequential read of a file (the fastest thing you can do on your filesystem), you are doing random reads of one block at a time. It should at least read multiple disk blocks so that it can use the forward momentum (setting the block size to some multiple of your disk block size will help) and also avoid the number of "left over" copies made at buffer boundaries.
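For what it's worth, commons-io does let you pass the block size explicitly. Whether a larger value (some multiple of your filesystem block size) actually helps is something to benchmark rather than a guarantee, but the call looks like this (shown in Java, since the reader itself is a plain Java class; the file name and block size are placeholders):
import java.io.File;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.input.ReversedLinesFileReader;

public class ReverseReadDemo {
    public static void main(String[] args) throws Exception {
        // 256 KB blocks instead of the default; tune to a multiple of your filesystem block size.
        try (ReversedLinesFileReader reader =
                new ReversedLinesFileReader(new File("very-large-file.xml"), 256 * 1024, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // process the line here
            }
        }
    }
}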
There are probably faster approaches. But it'll not be as fast as reading a file in forward order.
UPDATE:
Ok, so I tried an experiment with a rather silly version that processes around 27 GB of data by reading the first 10 million lines from a wikidata JSON dump and reversing those lines.
Timings on my 2015 MacBook Pro (with all my dev stuff and many Chrome windows open eating memory and some CPU all the time, about 5 GB of total memory free, VM size at the defaults with no parameters set at all, not run under a debugger):
reading in reverse order: 244,648 ms = 244 secs = 4 min 4 secs
reading in forward order: 77,564 ms = 77 secs = 1 min 17 secs
temp file count: 201
approx char count: 29,483,478,770 (line content not including line endings)
total line count: 10,050,000
The algorithm is to read the original file by lines, buffering 50,000 lines at a time and writing the lines in reverse order to a numbered temp file. Then, after all the files are written, they are read in reverse numerical order, forwards by lines: basically dividing the original into reverse-sort-order fragments. It could be optimized, because this is the most naive version of the algorithm with no tuning, but it does do what file systems do best: sequential reads and sequential writes with good-sized buffers.
So this is a lot faster than the reader you were using, and it could be tuned from here to be more efficient. You could trade CPU for disk I/O size and try using gzipped files as well, maybe with a two-threaded model that gzips the next buffer while the previous one is processed. Then fewer string allocations, checking each file function to make sure nothing extra is going on, making sure there is no double buffering, and more.
The ugly but functional code is:
package com.stackoverflow.reversefile

import java.io.File
import java.util.*

fun main(args: Array<String>) {
    val maxBufferSize = 50000
    val lineBuffer = ArrayList<String>(maxBufferSize)
    val tempFiles = ArrayList<File>()
    val originalFile = File("/data/wikidata/20150629.json")
    val tempFilePrefix = "/data/wikidata/temp/temp"
    val maxLines = 10000000

    var approxCharCount: Long = 0
    var tempFileCount = 0
    var lineCount = 0

    val startTime = System.currentTimeMillis()

    println("Writing reversed partial files...")

    try {
        fun flush() {
            val bufferSize = lineBuffer.size
            if (bufferSize > 0) {
                lineCount += bufferSize
                tempFileCount++
                File("$tempFilePrefix-$tempFileCount").apply {
                    bufferedWriter().use { writer ->
                        ((bufferSize - 1) downTo 0).forEach { idx ->
                            writer.write(lineBuffer[idx])
                            writer.newLine()
                        }
                    }
                    tempFiles.add(this)
                }
                lineBuffer.clear()
            }
            println("  flushed at $lineCount lines")
        }

        // read and break into backward sorted chunks
        originalFile.bufferedReader(bufferSize = 4096 * 32)
            .lineSequence()
            .takeWhile { lineCount <= maxLines }.forEach { line ->
                lineBuffer.add(line)
                if (lineBuffer.size >= maxBufferSize) flush()
            }
        flush()

        // read backward sorted chunks backwards
        println("Reading reversed lines ...")
        tempFiles.reversed().forEach { tempFile ->
            tempFile.bufferedReader(bufferSize = 4096 * 32).lineSequence()
                .forEach { line ->
                    approxCharCount += line.length
                    // a line has been read here
                }
            println("  file $tempFile current char total $approxCharCount")
        }
    } finally {
        tempFiles.forEach { it.delete() }
    }

    val elapsed = System.currentTimeMillis() - startTime

    println("temp file count: $tempFileCount")
    println("approx char count: $approxCharCount")
    println("total line count: $lineCount")
    println()
    println("Elapsed: ${elapsed}ms ${elapsed / 1000}secs ${elapsed / 1000 / 60}min ")

    println("reading original file again:")
    val againStartTime = System.currentTimeMillis()
    var againLineCount = 0
    originalFile.bufferedReader(bufferSize = 4096 * 32)
        .lineSequence()
        .takeWhile { againLineCount <= maxLines }
        .forEach { againLineCount++ }
    val againElapsed = System.currentTimeMillis() - againStartTime
    println("Elapsed: ${againElapsed}ms ${againElapsed / 1000}secs ${againElapsed / 1000 / 60}min ")
}
The correct way to investigate this problem would be:
1. Write a version of this test in pure Java.
2. Benchmark it to make sure that the performance problem is still there.
3. Profile it to figure out where the performance bottleneck is.
Q: Am I using ReversedLinesFileReader in the correct way?
Yes. (Assuming that it is an appropriate thing to use a line reader at all. That depends on what it is you are really trying to do. For instance, if you just wanted to count lines backwards, then you should be reading 1 character at a time and counting the newline sequences.)
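If counting lines really were the goal, a sketch of that byte-scanning idea in plain Java could look like the following: read fixed-size blocks backwards and count '\n' bytes (the file name and block size here are placeholders).
import java.io.RandomAccessFile;

public class BackwardsNewlineCount {
    public static void main(String[] args) throws Exception {
        long newlines = 0;
        try (RandomAccessFile file = new RandomAccessFile("very-large-file.xml", "r")) {
            byte[] block = new byte[1 << 20];        // scan in 1 MB blocks
            long pos = file.length();
            while (pos > 0) {
                int toRead = (int) Math.min(block.length, pos);
                pos -= toRead;
                file.seek(pos);
                file.readFully(block, 0, toRead);
                for (int i = toRead - 1; i >= 0; i--) {
                    // '\n' bytes never occur inside UTF-8 multi-byte sequences,
                    // so counting raw bytes is safe here.
                    if (block[i] == '\n') newlines++;
                }
            }
        }
        System.out.println("line count (approx): " + newlines);
    }
}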
Q: I am running Linux Mint with an Ext4 file system on an SSD. Could this have anything to do with it?
Possibly. Reading a file in reverse means that the read-ahead strategies used by the OS to give fast I/O may not work. It could be interacting with the characteristics of an SSD.
Q: Is it just the case that files should not be read from the end to the start?
Possibly. See above.
The other thing that you have not considered is that your file could actually contain some extremely long lines. The bottleneck could be the assembly of the characters into (long) lines.
Looking at the source code, it would seem that there is potential for O(N^2) behavior when lines are very long. The critical part is (I think) in the way that "rollover" is handled by FilePart. Note the way that the "left over" data gets copied.

Java Sha-512 Message Digest with salting not matching linux shadow file hashed passwords

I'm trying to produce the same hashes found in the Linux shadow file using MessageDigest, given the password, salt value and hashing algorithm, but the expected output does not match what I get from the function below.
Hash Algorithm = 6
Password = mandar
Salt Value = 5H0QpwprRiJQR19Y
Expected Output = $6$5H0QpwprRiJQR19Y$bXGOh7dIfOWpUb/Tuqr7yQVCqL3UkrJns9.7msfvMg4ZOPsFC5Tbt32PXAw9qRFEBs1254aLimFeNM8YsYOv.
Actual Output = ca0d04319f273d36f246975a4f9c71d0184c4ca7f3ba54bc0b3e0b4106f0eefca1e9a122a536fb17273b1077367bf68365c10fa8a2b18285a6825628f3614194
I have this function for generating the hash value
public String getSha512Hash(String password, String saltValue) throws NoSuchAlgorithmException {
    String text = saltValue + password;
    MessageDigest messageDigest = MessageDigest.getInstance("SHA-512");
    byte[] bytes = messageDigest.digest(text.getBytes());
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < bytes.length; ++i) {
        sb.append(Integer.toHexString((bytes[i] & 0xFF) | 0x100).substring(1, 3));
    }
    return sb.toString();
}
I'm referring to this website.
The passwords in /etc/shadow are hashed using the crypt(3) library function (man crypt), not a plain SHA-512 digest of salt + password.
You can use the Apache Commons implementation, which should mimic the same behaviour.
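For example, with Commons Codec's Crypt class (which implements the glibc-compatible "$6$" SHA-512 scheme; the password, salt and expected hash below are the ones from the question, and the output should match assuming the implementation behaves like the system crypt):
import org.apache.commons.codec.digest.Crypt;

public class ShadowHashDemo {
    public static void main(String[] args) {
        // crypt(3)-style SHA-512 ("$6$") hashing; the salt string carries the algorithm id.
        String hash = Crypt.crypt("mandar", "$6$5H0QpwprRiJQR19Y");
        System.out.println(hash);
        // Expected (from the question):
        // $6$5H0QpwprRiJQR19Y$bXGOh7dIfOWpUb/Tuqr7yQVCqL3UkrJns9.7msfvMg4ZOPsFC5Tbt32PXAw9qRFEBs1254aLimFeNM8YsYOv.
    }
}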
The fundamental problem is that the site you are referring to uses Perl's crypt(), which appears to be a direct call to libc's crypt(). The crypt manual does not specify how the SHA-512 hash is actually computed, but I searched GitHub and found this ~400 LOC source file, sha512-crypt.c.
I read through it and can't tell whether it follows some standard or whether it's the only program using that algorithm. Since the SHA-512 scheme also seems to be a proprietary extension to the POSIX standard, that's not at all unlikely.
You could ask the maintainer or the mailing list and report your findings back; otherwise, if you absolutely need that functionality, you could write a native extension (I don't know whether there are Java libraries already available).

How to get File both MD5 and SHA1 checksum at the same time when upload a new file?

I am working on a storage system. Users upload files to the server.
On the server side, I want to implement a program to get the checksums of the file using both MD5 and SHA1.
I know how to calculate checksums using DigestInputStream, but it seems to only support one algorithm (either MD5 or SHA1) at a time.
How can I calculate both MD5 and SHA1 at the same time when dealing with the upload stream in Java?
Thanks guys
Use two MessageDigest instances (one for MD5 and one for SHA1) and feed the bytes you read into both.
roughly, in Java (the same idea works with OpenSSL or BSafe; look up the API details and add error handling yourself)...
MessageDigest md5 = MessageDigest.getInstance("MD5");
MessageDigest sha1 = MessageDigest.getInstance("SHA-1");

InputStream in = ...;   // the upload stream
byte[] buf = new byte[8192];
while (true) {
    int count = in.read(buf);
    if (count == -1) { break; }
    /* feed the same bytes into both digests */
    md5.update(buf, 0, count);
    sha1.update(buf, 0, count);
}
String md5sum = Base64.getEncoder().encodeToString(md5.digest());   // java.util.Base64
String sha1sum = Base64.getEncoder().encodeToString(sha1.digest());
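If you would rather keep the DigestInputStream style from the question, another option (not part of the answer above; the class and method names here are mine) is to nest two DigestInputStreams, so that both digests are updated during the single pass you already make over the upload stream:
import java.io.IOException;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class DualChecksum {
    /** Reads the stream once; both digests are updated as a side effect of the reads. */
    public static String[] md5AndSha1(InputStream upload) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        try (DigestInputStream in = new DigestInputStream(new DigestInputStream(upload, md5), sha1)) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // normally you would also write buf to storage here
            }
        }
        return new String[] { toHex(md5.digest()), toHex(sha1.digest()) };
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) sb.append(String.format("%02x", b & 0xFF));
        return sb.toString();
    }
}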

fastest packing of data in Python (and Java)

(Sometimes our host is wrong; nanoseconds matter ;)
I have a Python Twisted server that talks to some Java servers; profiling shows that it spends ~30% of its runtime in the JSON encoder/decoder, and its job is handling thousands of messages per second.
This talk by YouTube raises some interesting, applicable points:
Serialization formats - no matter which one you use, they are all expensive. Measure. Don't use pickle. Not a good choice. Found protocol buffers slow. They wrote their own BSON implementation which is 10-15 times faster than the one you can download.
You have to measure. Vitess swapped out one of its protocols for an HTTP implementation. Even though it was in C it was slow. So they ripped out HTTP and did a direct socket call using Python, and that was 8% cheaper on global CPU. The enveloping for HTTP is really expensive.
Measurement. In Python measurement is like reading tea leaves. There are a lot of things in Python that are counter-intuitive, like the cost of garbage collection. Most chunks of their apps spend their time serializing. Profiling serialization depends very much on what you are putting in. Serializing ints is very different than serializing big blobs.
Anyway, I control both the Python and Java ends of my message-passing API and can pick a different serialisation than JSON.
My messages look like:
a variable number of longs; anywhere between 1 and 10K of them
and two already-UTF8 text strings; both between 1 and 3KB
Because I am reading them from a socket, I want libraries that can cope gracefully with streams - it's irritating if a library doesn't tell me how much of a buffer it consumed, for example.
The other end of this stream is a Java server, of course; I don't want to pick something that is great for the Python end but moves problems to the Java end e.g. performance or torturous or flaky API.
I will obviously be doing my own profiling. I ask here in the hope you describe approaches I wouldn't think of e.g. using struct and what the fastest kind of strings/buffers are.
Some simple test code gives surprising results:
import time, random, struct, json, sys, pickle, cPickle, marshal, array

def encode_json_1(*args):
    return json.dumps(args)

def encode_json_2(longs,str1,str2):
    return json.dumps({"longs":longs,"str1":str1,"str2":str2})

def encode_pickle(*args):
    return pickle.dumps(args)

def encode_cPickle(*args):
    return cPickle.dumps(args)

def encode_marshal(*args):
    return marshal.dumps(args)

def encode_struct_1(longs,str1,str2):
    return struct.pack(">iii%dq"%len(longs),len(longs),len(str1),len(str2),*longs)+str1+str2

def decode_struct_1(s):
    i, j, k = struct.unpack(">iii",s[:12])
    assert len(s) == 3*4 + 8*i + j + k, (len(s),3*4 + 8*i + j + k)
    longs = struct.unpack(">%dq"%i,s[12:12+i*8])
    str1 = s[12+i*8:12+i*8+j]
    str2 = s[12+i*8+j:]
    return (longs,str1,str2)

struct_header_2 = struct.Struct(">iii")

def encode_struct_2(longs,str1,str2):
    return "".join((
        struct_header_2.pack(len(longs),len(str1),len(str2)),
        array.array("L",longs).tostring(),
        str1,
        str2))

def decode_struct_2(s):
    i, j, k = struct_header_2.unpack(s[:12])
    assert len(s) == 3*4 + 8*i + j + k, (len(s),3*4 + 8*i + j + k)
    longs = array.array("L")
    longs.fromstring(s[12:12+i*8])
    str1 = s[12+i*8:12+i*8+j]
    str2 = s[12+i*8+j:]
    return (longs,str1,str2)

def encode_ujson(*args):
    return ujson.dumps(args)

def encode_msgpack(*args):
    return msgpacker.pack(args)

def decode_msgpack(s):
    msgunpacker.feed(s)
    return msgunpacker.unpack()

def encode_bson(longs,str1,str2):
    return bson.dumps({"longs":longs,"str1":str1,"str2":str2})

def from_dict(d):
    return [d["longs"],d["str1"],d["str2"]]

tests = [ #(encode,decode,massage_for_check)
    (encode_struct_1,decode_struct_1,None),
    (encode_struct_2,decode_struct_2,None),
    (encode_json_1,json.loads,None),
    (encode_json_2,json.loads,from_dict),
    (encode_pickle,pickle.loads,None),
    (encode_cPickle,cPickle.loads,None),
    (encode_marshal,marshal.loads,None)]

try:
    import ujson
    tests.append((encode_ujson,ujson.loads,None))
except ImportError:
    print "no ujson support installed"

try:
    import msgpack
    msgpacker = msgpack.Packer()
    msgunpacker = msgpack.Unpacker()
    tests.append((encode_msgpack,decode_msgpack,None))
except ImportError:
    print "no msgpack support installed"

try:
    import bson
    tests.append((encode_bson,bson.loads,from_dict))
except ImportError:
    print "no BSON support installed"

longs = [i for i in xrange(10000)]
str1 = "1"*5000
str2 = "2"*5000

random.seed(1)
encode_data = [[
    longs[:random.randint(2,len(longs))],
    str1[:random.randint(2,len(str1))],
    str2[:random.randint(2,len(str2))]] for i in xrange(1000)]

for encoder,decoder,massage_before_check in tests:
    # do the encoding
    start = time.time()
    encoded = [encoder(i,j,k) for i,j,k in encode_data]
    encoding = time.time()
    print encoder.__name__, "encoding took %0.4f,"%(encoding-start),
    sys.stdout.flush()
    # do the decoding
    decoded = [decoder(e) for e in encoded]
    decoding = time.time()
    print "decoding %0.4f"%(decoding-encoding)
    sys.stdout.flush()
    # check it
    if massage_before_check:
        decoded = [massage_before_check(d) for d in decoded]
    for i,((longs_a,str1_a,str2_a),(longs_b,str1_b,str2_b)) in enumerate(zip(encode_data,decoded)):
        assert longs_a == list(longs_b), (i,longs_a,longs_b)
        assert str1_a == str1_b, (i,str1_a,str1_b)
        assert str2_a == str2_b, (i,str2_a,str2_b)
gives:
encode_struct_1 encoding took 0.4486, decoding 0.3313
encode_struct_2 encoding took 0.3202, decoding 0.1082
encode_json_1 encoding took 0.6333, decoding 0.6718
encode_json_2 encoding took 0.5740, decoding 0.8362
encode_pickle encoding took 8.1587, decoding 9.5980
encode_cPickle encoding took 1.1246, decoding 1.4436
encode_marshal encoding took 0.1144, decoding 0.3541
encode_ujson encoding took 0.2768, decoding 0.4773
encode_msgpack encoding took 0.1386, decoding 0.2374
encode_bson encoding took 55.5861, decoding 29.3953
bson, msgpack and ujson all installed via easy_install
I would love to be shown I'm doing it wrong; that I should be using cStringIO interfaces or however else you speed it all up!
There must be a way to serialise this data that is an order of magnitude faster surely?
While JSON is flexible, it is one of the slowest serialization formats in Java (and possibly in Python as well). If nanoseconds matter, I would use a binary format in native byte order (likely to be little-endian).
Here is a library where I do exactly that: AbstractExcerpt and UnsafeExcerpt. A typical message takes 50 to 200 ns to serialize and send, or to read and deserialize.
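Purely as an illustration of the "simple binary format" idea (this is not the library mentioned above, and the class name is mine), the layout used by encode_struct_1 in the question - three int lengths, then the longs, then the two UTF-8 strings, big-endian to match the ">" in the struct format - maps directly onto java.nio.ByteBuffer on the Java side:
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public final class MessageCodec {
    /** Layout: int count, int len1, int len2, count longs, str1 bytes, str2 bytes (big-endian). */
    public static byte[] encode(long[] longs, String str1, String str2) {
        byte[] b1 = str1.getBytes(StandardCharsets.UTF_8);
        byte[] b2 = str2.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(12 + 8 * longs.length + b1.length + b2.length);
        buf.putInt(longs.length).putInt(b1.length).putInt(b2.length);
        for (long l : longs) {
            buf.putLong(l);
        }
        buf.put(b1).put(b2);
        return buf.array();
    }

    /** Decodes a message produced by encode(); returns { long[], String, String }. */
    public static Object[] decode(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);   // ByteBuffer is big-endian by default
        long[] longs = new long[buf.getInt()];
        byte[] b1 = new byte[buf.getInt()];
        byte[] b2 = new byte[buf.getInt()];
        for (int i = 0; i < longs.length; i++) {
            longs[i] = buf.getLong();
        }
        buf.get(b1).get(b2);
        return new Object[] { longs, new String(b1, StandardCharsets.UTF_8), new String(b2, StandardCharsets.UTF_8) };
    }
}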
In the end, we chose to use msgpack.
If you go with JSON, your choice of library on Python and Java is critical to performance:
On Java, http://blog.juicehub.com/2012/11/20/benchmarking-web-frameworks-for-games/ says:
Performance was absolutely atrocious until we swapped out the JSON lib (json-simple) for Jackson's ObjectMapper. This brought RPS from 35 to 300+ - a 10x increase.
You may be able to speed up the struct case
def encode_struct(longs,str1,str2):
    return struct.pack(">iii%dq"%len(longs),len(longs),len(str1),len(str2),*longs)+str1+str2
Try using the python array module and the method tostring to convert your longs into a binary string. Then you can append it like you did with the strings
Create a struct.Struct object and use that. I believe it's more efficient
You can also look into:
http://docs.python.org/library/xdrlib.html#module-xdrlib
Your fastest method encodes 1000 elements in .1222 seconds. That's 1 element in .1222 milliseconds. That's pretty fast. I doubt you'll do much better without switching languages.
I know this is an old question, but it is still interesting. My most recent choice was to use Cap'n Proto, written by the same guy who did protobuf for Google. In my case, that led to a decrease of around 20% in both time and volume compared to Jackson's JSON encoder/decoder (server to server, Java on both sides).
Protocol Buffers are pretty fast and have bindings for both Java and Python. It's quite a popular library and used inside Google so it should be tested and optimized quite well.
Since the data you're sending is already well defined, non-recursive, and non-nested, why not just use a simple delimited string? You just need a delimiter that isn't contained in your string variables - maybe '\n'.
"10\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\nFoo Foo Foo\nBar Bar Bar"
Then just use a simple String Split method.
String[] temp = str.split("\n");
ArrayList<Long> longs = new ArrayList<Long>(Integer.parseInt(temp[0]));
String string1 = temp[temp.length - 2];
String string2 = temp[temp.length - 1];
for (int i = 1; i < temp.length - 2; i++)
    longs.add(Long.parseLong(temp[i]));
Note: the above was written in the web browser and is untested, so syntax errors may exist.
For a text-based approach, I'd assume the above would be the fastest method.
