I'm currently creating a web application that requires passwords to be encrypted and stored in a database. I found the following guide that encrypts passwords using PBKDF2WithHmacSHA1.
In the example provided the getEncryptedPassword method returns a byte array.
Are there any advantages to Base64-encoding the result?
Any disadvantages?
The byte[] array is the smallest mechanism for storing the value (storage-space wise). If you have lots of these values it may make sense to store them as bytes. Depending on where you store the result, the format may make a difference too. Most databases will accommodate byte[] values fairly well, but it can be cumbersome (depending on the database). Stores like text files and XML documents obviously will struggle with a raw byte[] array.
In most circumstances I feel there are two formats that make sense, Hexadecimal representation, or byte[]. I seldom think that the advantages of Base64 for short values (less than 32 characters) are worth it (for larger items then sure, use base64, and there's a fantastic library for it too).
This is obviously all subjective.....
Converting values to hexadecimal is quite easy: see How to convert a byte array to a hex string in Java?
Hex output is convenient and easier to manage than Base64, which has a more complicated encoding algorithm and is thus slightly slower.
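To make the size trade-off concrete, here is a minimal sketch that encodes the same bytes both ways. The class name HashEncodingDemo and the 20-byte length are assumptions for illustration; the actual length depends on the key length requested from PBKDF2.

```java
import java.util.Base64;

final class HashEncodingDemo {
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    // Hex: two characters per byte.
    static String toHex(byte[] bytes) {
        char[] out = new char[bytes.length * 2];
        for (int i = 0; i < bytes.length; i++) {
            int v = bytes[i] & 0xFF;
            out[2 * i] = HEX[v >>> 4];
            out[2 * i + 1] = HEX[v & 0x0F];
        }
        return new String(out);
    }

    // Base64: four characters per three bytes, padded.
    static String toBase64(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        byte[] hash = new byte[20]; // placeholder for a 20-byte hash (assumed length)
        System.out.println(toHex(hash).length());    // 40 characters
        System.out.println(toBase64(hash).length()); // 28 characters
    }
}
```

For 20 raw bytes, hex needs 40 characters and Base64 needs 28, so both cost more than storing the bytes directly.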
Assuming a reasonable database there is no advantage, since it's just an encoding scheme. There is a size increase as a consequence of base 64 encoding, which is a disadvantage. If your database reliably stores 8-bit bytes, just store the hash in the database.
Alright, so we need to store a list of words and their respective position in a much bigger text. We've been asked if it's more efficient to save the position represented as text or represented as bits (data streams in Java).
I think that a bitwise representation is best since the text "1024" takes up 4*8=32 bits while only 11 if represented as bits.
The follow up question is should the index be saved in one or two files. Here I thought "perhaps you can't combine text and bitwise-representation in one file?" and that's the reason you'd need two files?
So the question first and foremost is: can I store text information (the word) combined with bitwise information (its position) in one file?
Too vague in terms of what's really needed.
If you have up to a few million words + positions, don't even bother thinking about it. Store in whatever format is the simplest to implement; space would only be an issue if you need to send the data over a low-bandwidth network.
Then there is general data compression available, by just wrapping your Input/OutputStreams with deflater or gzip (already built in the JRE) you will get reasonably good compression (50% or more for text). That easily beats what you can quickly write yourself. If you need better compression there is XZ for java (implements LZMA compression), open source.
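Wrapping a stream in gzip could look roughly like the following sketch. It works in memory so it is easy to try; the class name GzipTextDemo is made up, and for a real index you would wrap a FileOutputStream / FileInputStream the same way.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipTextDemo {
    // Compress text by wrapping the target stream in a GZIPOutputStream.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(new GZIPOutputStream(buffer), StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Decompress by wrapping the source stream in a GZIPInputStream.
    static String gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The actual ratio depends heavily on the data, so measure it on a real index before deciding.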
If you need random access, you're on the wrong track: you will want to carefully design the data layout for the access patterns, and storage size should be only of tertiary concern.
The number 1024 would take at least 2-4 bytes (so 16-32 bits), as you need to know where the number starts and where it ends, and so it must have a fixed size. If your positions are very big, like 124058936, you would need to use 4 bytes per number (which would be better than 9 bytes as a string representation).
Using binary files you'll also need a way to know where the string starts and ends. You can do this by storing a byte with its length before it, and reading the string like this:
int length = in.readUnsignedByte(); // in is a DataInputStream / RandomAccessFile; use length * 2 if the string is encoded in 16 bits
byte[] arr = new byte[length];
in.readFully(arr); // unlike read(), readFully() does not return until the whole array is filled
String yourString = new String(arr, "US-ASCII");
The other possibility would be terminating your string with a null character (00), but you would need to create your own implementation for that, as no readers support it by default (AFAIK).
Now, is it really worth storing it as binary data? That really depends on how big your positions are (because the strings, if in the text version are separated from their position with a space, would take the same amount of bytes).
My recommendation is that you use the text version, as it will probably be easier to parse and more readable.
About using one or two files, it doesn't really matter. You can combine text and binary in the same file, and it would take the same space (though splitting it into two separate files will always take a bit more space, and it might make it messier to edit).
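As a sketch of text and binary data living in one file, DataOutputStream can write each word with a length prefix (writeUTF) followed by a fixed 4-byte position (writeInt). The class name WordIndexFile and the leading entry count are illustrative choices, not something the question prescribes, and the example uses an in-memory buffer where a real program would use a file stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class WordIndexFile {
    // Layout: [int count] then, per entry, [2-byte length + UTF bytes][4-byte position].
    static byte[] write(String[] words, int[] positions) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(words.length);
            for (int i = 0; i < words.length; i++) {
                out.writeUTF(words[i]);      // text part, length-prefixed
                out.writeInt(positions[i]);  // binary part, always 4 bytes
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the entries back and renders them as "word@position" lines.
    static String dump(byte[] data) {
        StringBuilder sb = new StringBuilder();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                sb.append(in.readUTF()).append('@').append(in.readInt()).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

With this layout the position 124058936 costs 4 bytes instead of 9 characters, at the price of a file you can no longer read in a text editor.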
I have a Java application which works with MySQL database.
I want to be able to store long texts and check whether the table contains them. For this I want to use an index, and search by a reduced "hash" of full_text.
MY_TABLE [
full_text: TEXT
text_hash: varchar(255) - indexed
]
Thing is, I cannot use String.hashCode() as:
Implementation may vary across JVM versions.
Value is too short, which means many collisions.
I want to find a fast hashing function that will read the long text value and produce a long hash value for it, say 64 symbols long.
Such reliable hash methods are not fast. They're probably fast enough, though. You're looking for a cryptographic message digest method (like the ones used to identify files in P2P networks or commits in Git). Look for the MessageDigest class, and pick your algorithm (SHA-1, MD5, SHA-256, etc.).
Such a hash function will take bytes as argument, and produce bytes as a result, so make sure to convert your strings using a constant encoding (UTF-8, for example), and to transform the produced byte array (typically of 16 or 20 bytes) to a readable String using hexadecimal or Base64 encoding.
I'd suggest that you revisit String.hashCode().
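A minimal sketch of that approach, assuming SHA-256 and hex output (which happens to give exactly 64 characters); the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class TextHash {
    // Hashes the UTF-8 bytes of the text and renders the digest as lowercase hex.
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xFF));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }
}
```

The result of TextHash.sha256Hex(fullText) would go into the indexed text_hash column, and lookups would compare on that column first.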
First, it does not vary across implementations. The exact hash is specified; see the String.hashCode javadoc specification.
Second, while the String hash algorithm isn't the best there possibly is (and certainly it will have more collisions than a cryptographic hash) it does do a reasonably good job of spreading the hashes over the 32-bit result space. For example, I did a quick check of a text file on my machine (/usr/share/dict/web2a) which has 235,880 words, and there were six collisions.
Third and fourth: String.hashCode() should be considerably faster, and the storage required for the hash values should be considerably smaller, than a cryptographic hash.
If you're storing strings in a database table, and their hash values are indexed, having a few collisions shouldn't matter. Looking up a string should get you the right database rows really quickly, and having to (maybe) check a couple actual strings should be very fast compared to the database I/O.
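The lookup pattern described here (find candidates by hash, then compare the actual strings) can be sketched in memory as follows. HashCodeIndex is a hypothetical stand-in for the indexed table, not a database API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HashCodeIndex {
    // Plays the role of the indexed hash column: hash -> rows sharing that hash.
    private final Map<Integer, List<String>> rowsByHash = new HashMap<>();

    void add(String text) {
        rowsByHash.computeIfAbsent(text.hashCode(), k -> new ArrayList<>()).add(text);
    }

    // The hash narrows the search; the full comparison resolves any collision.
    boolean contains(String text) {
        List<String> candidates = rowsByHash.get(text.hashCode());
        return candidates != null && candidates.contains(text);
    }
}
```

"Aa" and "BB" are a well-known colliding pair (both hash to 2112), which makes them handy for checking that the second comparison step really happens.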
I have written an algorithm to implement Huffman Coding for compressing text files. It basically takes in a string as an input and generates a string of bits as output. However, I am having trouble storing this binary data as it is being stored as a string where each bit is a character and consumes 2 bytes of memory for storage. End result, output file is larger than the input, making the whole program worthless. How should I store this binary output such that each bit takes only one bit of memory for storage?
ps. I have tried using a BitSet but that did not change the size of the output at all
Once you have your result in the BitSet, you can call
BitSet.toByteArray() to save your data to a file, i.e.:
FileUtils.writeByteArrayToFile(new File(...), bitSet.toByteArray());
And BitSet.valueOf(byte[]) to read your data from file:
BitSet bitSet = BitSet.valueOf(FileUtils.readFileToByteArray(new File(...)));
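One detail worth keeping in mind: BitSet.toByteArray() only covers bits up to the highest set bit, so trailing zero bits are dropped and the original bit count has to be stored separately. A minimal sketch, with a made-up class name and the bit string as input:

```java
import java.util.BitSet;

final class BitPacking {
    // Packs a string of '0'/'1' characters into bytes, eight bits per byte.
    static byte[] pack(String bits) {
        BitSet set = new BitSet(bits.length());
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                set.set(i);
            }
        }
        return set.toByteArray();
    }

    // bitCount must be stored alongside the bytes, because trailing zeros are not kept.
    static String unpack(byte[] packed, int bitCount) {
        BitSet set = BitSet.valueOf(packed);
        StringBuilder sb = new StringBuilder(bitCount);
        for (int i = 0; i < bitCount; i++) {
            sb.append(set.get(i) ? '1' : '0');
        }
        return sb.toString();
    }
}
```

Writing the bit count as an int ahead of the packed bytes is one simple way to keep the two together in the output file.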
I have a Java application that persists byte[] structures to a DB (using Hibernate). I'm writing a C++ application that reads these structures at a later time.
As you might expect, I'm having problems.... The byte[] structure written to the DB is longer than the original number of bytes - 27 bytes, by the looks of things.
Is there a way of determining (in C++) the byte[] header structure to properly find the beginning of the true data?
I spent some time looking at the JNI source code (GetArrayLength(jbytearray), etc.) to determine how that works, but got quickly mired in the vagaries of JVM code. yuck...
Ideas?
The object is probably being serialized using the Java Object Serialization Protocol. You can verify this by looking for the magic number 0xACED at the beginning. If this is the case, it's just wrapped with some meta information about the class and length, and you can easily parse the actual byte values off the end.
In particular, you would see 0xAC 0xED 0x00 0x05 for the header followed by a classDesc element that would be 0x75 ...bytes... 0x70, followed by a 4-byte length, and then the bytes themselves. Java serializes the length and other multibyte values in big-endian format.
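The layout can be checked from the Java side with a small sketch like the one below. It assumes the stored value is a single top-level byte[] written with one writeObject call, in which case the metadata adds up to 27 bytes with the length in the last four of them; the class name and the fixed offset are assumptions for this case only, and a C++ reader would apply the same offsets.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.Arrays;

final class SerializedByteArray {
    // Stream header, array marker, class descriptor for "[B", and the 4-byte length.
    static final int HEADER_LENGTH = 27;

    static byte[] serialize(byte[] payload) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static byte[] extractPayload(byte[] stored) {
        ByteBuffer bb = ByteBuffer.wrap(stored); // ByteBuffer is big-endian by default
        if (bb.getShort(0) != (short) 0xACED) {
            throw new IllegalArgumentException("not a Java serialization stream");
        }
        int length = bb.getInt(HEADER_LENGTH - 4); // length sits right before the data
        return Arrays.copyOfRange(stored, HEADER_LENGTH, HEADER_LENGTH + length);
    }
}
```

If the column holds anything other than a bare byte[] (a wrapper object, for example), the offsets will differ and the stream has to be parsed element by element.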
What would be the most efficient way (optimal for performance and storage space) to store the MD5 sum of a file in a Java (or Groovy) object, considering the following use cases:
I need to compare with thousands of other md5 sums.
I may need to store this in HSQLDB, so that records can be pulled/grouped based on md5
It may be stored in Maps as keys
I am trying to avoid storing it as String as String comparisons will be more costly and take more space. Will BigInteger(string,radix) be more efficient? Also, what datatype should be selected if persisting in database?
Create a class that wraps a byte[] and provides no mutation. If you want to use it as a key in a map, then it needs to either be comparable, or have a hash code. With a byte[] you'll have an easier time computing a simple hashcode from the first 32 bits.
For comparison speed in Java, storing it as two long values will likely be fastest. For persistence, storage as a byte array makes the most sense, if your database and persistence tools support it. Otherwise, storage as hexadecimal or Base-64–encoded text is fairly common and will inter-operate well with other applications that access the same database.
If you need to perform a lot of comparisons, you could store the MD5 value as 2 long integers, that way you only need to perform at most 4 logical operations to check against another MD5 value.
Basically, provide a class that will accept an input, a raw digest data as byte[] and use
ByteBuffer bb = ByteBuffer.wrap(digestData);
long[] bits = new long[] {
bb.getLong(),
bb.getLong()
};
Compare with another long[] MD5 array with
boolean eq = ((bits[0] ^ otherBits[0]) | (bits[1] ^ otherBits[1])) == 0;
Reconstruct the MD5 with
ByteBuffer bb = ByteBuffer.allocate(16);
bb.putLong(bits[0]);
bb.putLong(bits[1]);
bb.flip(); // rewind to the start; without this, get() throws BufferUnderflowException
byte[] digestData = new byte[16];
bb.get(digestData);
Note: I am not suggesting converting the byte[] into long[] for every comparison; this is simply how to store the digest for comparisons. The last reconstruction snippet is optional; you should keep the data as byte[] and compare the long[] arrays only. In the database, store the data as a 32-character hexadecimal value.
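Putting the pieces together, an immutable wrapper that can serve as a Map key might look like this sketch. The class name Md5Key and its methods are hypothetical; it stores the digest as two longs and converts to bytes or hex only when persisting.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Key {
    private final long hi;
    private final long lo;

    Md5Key(byte[] digest) {
        if (digest.length != 16) {
            throw new IllegalArgumentException("MD5 digests are 16 bytes");
        }
        ByteBuffer bb = ByteBuffer.wrap(digest);
        hi = bb.getLong();
        lo = bb.getLong();
    }

    // Convenience factory for the example; real code would digest the file contents.
    static Md5Key of(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return new Md5Key(md.digest(text.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    byte[] toBytes() {
        return ByteBuffer.allocate(16).putLong(hi).putLong(lo).array();
    }

    // 32-character hex form, suitable for a text column.
    String toHex() {
        return String.format("%016x%016x", hi, lo);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Md5Key)) {
            return false;
        }
        Md5Key other = (Md5Key) o;
        return hi == other.hi && lo == other.lo;
    }

    @Override
    public int hashCode() {
        return Long.hashCode(hi ^ lo);
    }
}
```

Because equals and hashCode work on the two longs, instances behave correctly as HashMap keys without any string comparison.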