Get length of base64 decoded data - java

I need to calculate the length of base64 decoded data.
I have Base64-encoded data, and I am sending the decoded data as the body of an HTTP request.
I need to send a Content-Length header.
In the interest of memory usage and performance I'd rather not actually Base-64 decode the data all at once, but rather stream it.
Given Base64 data, how do I calculate what the length of the decoded data will be? I need either a general algorithm or a Java/Scala solution.
EDIT: This is similar to, but not a duplicate of Calculate actual data size from Base64 encoded string length, where the OP asks
...can I calculate the length of the raw data that has been encoded only by looking at the length of the Base64-encoded string?
The answer is no. It is necessary to look at the padding as well.
I want to know how the length and the base64 data can be used to calculate the original length.

Assuming that you can't just use chunked encoding (and thereby avoid sending a Content-Length header), you need to consult the padding thus:
Base64 encodes three binary octets into four characters. You have N Base64 characters, where N is a multiple of 4. Let k be the number of trailing '=' chars (i.e. padding chars: 0, 1 or 2).
Let M = 3*floor((N-k)/4), i.e. the number of octets in "complete" 3-octet chunks.
If you have 2 padding chars then you have M + 1 bytes.
If you have 1 padding char then you have M + 2 bytes.
If you have 0 padding chars then you have M bytes.
Of course, floor() in this case means truncating integer division, i.e. the normal / operator.
Presumably you can count padding characters relatively easily (e.g. by seeking to the end of a file, or by looking at the end of a byte array), without having to read the whole Base64-encoded thing sequentially.
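The steps above can be sketched in Java roughly as follows. This is only a minimal sketch: it assumes the Base64 text sits in a byte array, is padded, and contains no line separators. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64DecodedLength {
    /** Decoded length of padded Base64 text without line breaks (assumption). */
    static int decodedLength(byte[] encoded) {
        int n = encoded.length;
        // Count trailing '=' padding characters (0, 1 or 2) by looking only at the end.
        int k = 0;
        while (k < 2 && k < n && encoded[n - 1 - k] == '=') {
            k++;
        }
        int m = 3 * ((n - k) / 4); // octets in complete 3-octet chunks
        if (k == 2) {
            return m + 1;
        }
        if (k == 1) {
            return m + 2;
        }
        return m;
    }

    public static void main(String[] args) {
        // Compare the computed length with the real length for a few inputs.
        for (String text : new String[] {"", "a", "ab", "abc", "abcd"}) {
            byte[] encoded = Base64.getEncoder().encode(text.getBytes(StandardCharsets.US_ASCII));
            System.out.println(text.length() + " bytes, computed " + decodedLength(encoded));
        }
    }
}
```

Only the last two bytes of the array are inspected, so the same idea should carry over to a file by seeking to its end.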

I arrived at this simple calculation.
If L is the length of the Base-64 encoded data, and p is the number of padding characters (which will be 0, 1, or 2), then the length of the unencoded data is
L * 3 / 4 - p
In my case (with Scala),
bytes.length * 3 / 4 - bytes.reverseIterator.takeWhile(_ == '=').length
NOTE: This assumes that the data does not have line separators. (Often, Base-64 data has a line break every 64 or 76 characters, depending on the format.) If it does, exclude the line separators from the length L.
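For Java, the same calculation with line separators excluded might look like the sketch below. It assumes the only non-Base64 bytes are CR and LF, and it scans the whole input once (cheap compared with decoding). The class name is illustrative.

```java
import java.util.Base64;

class Base64LengthWithLineBreaks {
    /** Computes L * 3 / 4 - p, where L excludes CR/LF line separators. */
    static long decodedLength(byte[] encoded) {
        long length = 0;  // L: Base64 characters, line separators excluded
        long padding = 0; // p: number of '=' characters
        for (byte b : encoded) {
            if (b == '\r' || b == '\n') {
                continue; // line separators do not count towards L
            }
            length++;
            if (b == '=') {
                padding++;
            }
        }
        return length * 3 / 4 - padding;
    }

    public static void main(String[] args) {
        // The MIME encoder inserts CRLF after every 76 characters.
        byte[] mime = Base64.getMimeEncoder().encode(new byte[100]);
        System.out.println(decodedLength(mime)); // prints 100
    }
}
```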

Related

Differences between Crypt.crypt() and DigestUtils.md5() in apache.commons.Codec

I am writing a basic password cracker for the MD5 hashing scheme against a Linux /etc/shadow file. When I use commons.codec's DigestUtils or Crypt libraries, the hash length for them are different (among other things).
When I use the Crypt.crypt(passwordToHash, "$1$Jhe937$") the output is a 22-character string. When I use the DigestUtils.md5[Hex](passwordToHash + "Jhe937")(or the Java MessageDigest class) the output is a 32-character string (after converted). This makes no sense to me.
aside: is there no easy way to convert the DigestUtils.md5(passwordToHash)'s byte[] to a String. I've tried all* the ways and I get all non-valid output: Nz_èJÓ_µù[î¬y
*all being: new String(byte[], "UTF-8") and convert to char then to String
The executive summary is that both are built on MD5, but the output format is different between the two, so the lengths will be different. Read on for details.
MD5 is a message digest algorithm that always produces a 16-byte hash value (assuming valid input, etc.). Those bytes aren't all printable characters: each byte can take any value from 0-255, while the printable characters in ASCII are in the range 32-126.
DigestUtils.md5(String) generates the MD5 of the string and returns a 16 element byte array. DigestUtils.md5Hex(String) is a convenience wrapper (I'm assuming, I haven't looked at the source, but that's how I'd write it :-) ) around DigestUtils.md5 that takes the 16 element byte array md5 produces and base16 encodes it (also known as hex encoding). That replaces each byte with the equivalent two hex characters, which is why you get a 32 character String out of it.
Crypt.crypt uses a special format that goes back to the original Unix method of storing passwords. It's been extended over the years to use different hash/encryption algorithms, longer salts, and additional features. It also encodes its output to be printable text, which is where the length difference is coming from. By using a salt of "$1$...", you're saying to use the MD5-based crypt scheme, so the password and the salt are run through MD5 (many times over, rather than as a single MD5 of password + salt, so the digest itself will not match DigestUtils.md5 either), resulting in 16 bytes as expected. Because those bytes aren't necessarily printable, the hash is base64 encoded (using a slightly different alphabet than the standard base64 encoding), which replaces 3 bytes with 4 printable characters. So 16 bytes becomes 16 / 3 * 4 = 21-1/3 characters, rounded up to 22.
On your aside, DigestUtils.md5 produces 16 bytes, but those bytes can have any value from 0 to 255 and are (effectively) random. new String(byte[], "UTF-8") says the bytes in the byte array are a UTF-8 encoding, which is a very specific format. new String does its best to treat the bytes as a UTF-8 encoded string, but because they're really not, you generally get gibberish out. If you want something printable, you'll have to use something that takes random bytes, not bytes in a specific format (like UTF-8). Two popular options are base16/hex encoding, which you can get with DigestUtils.md5Hex, or base64, which you can get with Base64.encodeBase64String(DigestUtils.md5(pwd + salt)).
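To see the lengths side by side without any third-party library, here is a small sketch using only the JDK (java.security.MessageDigest and java.util.Base64) in place of the commons-codec calls named above. It uses the standard Base64 alphabet, not the crypt variant, and a single MD5 pass, so its output will not match Crypt.crypt. The class and method names are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

class Md5Encodings {
    /** Plain single-pass MD5 of the UTF-8 bytes of the input: always 16 bytes. */
    static byte[] md5(String input) {
        try {
            return MessageDigest.getInstance("MD5").digest(input.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Hex (base16): two characters per byte. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    /** Standard Base64 without '=' padding: four characters per three bytes, rounded up. */
    static String toBase64NoPadding(byte[] bytes) {
        return Base64.getEncoder().withoutPadding().encodeToString(bytes);
    }

    public static void main(String[] args) {
        byte[] digest = md5("passwordToHash" + "Jhe937");
        System.out.println(digest.length);                     // 16 raw bytes
        System.out.println(toHex(digest).length());             // 32 hex characters
        System.out.println(toBase64NoPadding(digest).length()); // 22 Base64 characters
    }
}
```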

Apache commons base64 decode and Sun base64 decode

byte[] commonsDecode = Base64.decodeBase64(data);
Log.debug("The data is " + commonsDecode.length + " bytes long for the apache commons base64 decoder.");
BASE64Decoder decoder = new BASE64Decoder();
byte[] sunDecode = decoder.decodeBuffer(data);
Log.debug("The data is " + sunDecode.length + " bytes long for the SUN base64 decoder.");
Please explain to me why these two method calls would produce different length for the resulting byte arrays. I initially thought it might have to do with character encodings but if so I don't understand all of the issues properly. The above code was executed on the same system and in the same application, in the order shown above. So the default character encoding on that system would be the same.
The input (test) data:
The below is a System.out.println of the Java String.
qFkIQgDq jk3ScHpqx8BPVS97YE4pP/nBl5Qw7mBnpSGqNqSdGIkLPVod0pBl Uz7NgpizHDicGzNCaauefAdwGklpPr0YdwCu4wRkwyAuvtDmL0BYASOn2tDw72LMz5FChtSa0CoCBQ2ARsFG2GdflnIWsUuBQapX73ZBMiqqm ZCOnMRv9Ol8zT1TECddlKZMYAvmjANgq0sBPyUMF7co XY9BYAjV3L/cA8CGQpXGdrsAgjPKMhzk4hh1GAoQ1soX2Dva8p3erPJ4sy2Vcb6lS1Hap9FR0AZFawbJ10FFSTg10wxc24539kYA6xxq/TFqkhaEoSyTqjXjvo1SA==
Apache commons decoder says it's 252 length byte array.
Java Sun decoder says 256.
Your input data is not valid Base64 data.
Valid Base64 data can contain whitespace. Usually, it has a newline every 64 or 76 characters. However, your data contains spaces in random places. If they are removed (as every Base64 decoder is supposed to do), 339 characters remain. Yet the length of valid, padded Base64 data has to be a multiple of 4 characters.
Interestingly, your data contains no plus signs. I suspect it once contained them but they have probably been replaced with spaces somewhere in transmission. If you replace all spaces with plus signs, the Base64 data is valid and the decoded data will have a length of 256 bytes: 344 characters / 4 * 3 - 2 padding characters.
I further suspect that the Base64 data was used in a URL without proper URL encoding. That's a probable cause for the missing plus signs. Note that Base64 encoded data is not URL safe. Both the plus and the equal signs need to be escaped.
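A small Java sketch of both points, using java.util.Base64 rather than either decoder from the question. The repair step is only a guess that every space was originally a plus sign, so it is not safe for data that may contain legitimate whitespace. The class and method names are illustrative.

```java
import java.util.Arrays;
import java.util.Base64;

class Base64PlusRepair {
    /** Assumes every space in the input was originally a '+' lost in transit. */
    static byte[] decodeRepairingSpaces(String mangled) {
        String repaired = mangled.replace(' ', '+');
        return Base64.getDecoder().decode(repaired);
    }

    /** The URL-safe alphabet uses '-' and '_' instead of '+' and '/'. */
    static String encodeUrlSafe(byte[] data) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(data);
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xFB, (byte) 0xEF, (byte) 0xBE};
        String standard = Base64.getEncoder().encodeToString(data); // "++++"
        String mangled = standard.replace('+', ' ');                // what a naive URL decode leaves
        System.out.println(Arrays.equals(data, decodeRepairingSpaces(mangled)));
        System.out.println(encodeUrlSafe(data));                    // "----", no escaping needed
    }
}
```

Encoding with the URL-safe alphabet in the first place avoids the problem for '+' and '/', and dropping the padding avoids it for '='.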

Simple java algorithm to encode/decode the following string

Suppose I have
String input = "1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,2,3,0,4,0,0,0,4,0,3";
I want to encode it into a string with fewer characters that also hides the actual information by representing it in Roman characters, i.e. the above encodes to something like "Adqwqkjlhs". It must be possible to decode the encoded string back to the original string.
The string input is actually something I parse from the hash of an URL, but the original format is lengthy and open to manipulation.
Any ideas?
Thanks
Edit #1
Each number can be from 0 to 99, and the numbers are separated by commas so that String.split(",") can retrieve the String[].
Edit #2 (Purpose of encoded string)
Suppose the above string encodes to bmtwva1131gpefvb1xv, then I can have a URL like www.shortstring.com/input#bmtwva1131gpefvb1xv. From there I would decode bmtwva1131gpefvb1xv into comma-separated numbers.
This isn't really much of an improvement from Nathan Hughes' solution, but the longer the Strings are, the more of a savings you get.
Encoding: create a String starting with "1", making each of the numbers in the source string 2 digits, thus "0" becomes "00", "5" becomes "05", "99" becomes "99", etc. Represent the resulting number in base 36.
Decoding: Take the base 36 number/string, change it back to base 10, skip the first "1", then turn every 2 numbers/letters into an int and rebuild the original string.
Example Code:
String s = "1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,2,3,0,4,0,0,0,4,0,3";

// ENCODE the string
StringTokenizer tokenizer = new StringTokenizer(s,",");
StringBuilder b = new StringBuilder();
b.append("1"); // This is a primer character, in case we end up with a bunch of zeroes at the beginning
while(tokenizer.hasMoreTokens()) {
    String token = tokenizer.nextToken().trim();
    if(token.length()==1) {
        b.append("0");
        b.append(token);
    }
    else {
        b.append(token);
    }
}
System.out.println(b);
// We get this String: 101020000000000000000000000000000000000010202030004000000040003
String encoded = (new BigInteger(b.toString())).toString(36);
System.out.println(encoded);
// We get this String: kcocwisb8v46v8lbqjw0n3oaad49dkfdbc5zl9vn

// DECODE the string
String decoded = (new BigInteger(encoded, 36)).toString();
System.out.println(decoded);
// We should get this String: 101020000000000000000000000000000000000010202030004000000040003
StringBuilder p = new StringBuilder();
int index = 1; // we skip the first "1", it was our primer
while(index<decoded.length()) {
    if(index>1) {
        p.append(",");
    }
    p.append(Integer.parseInt(decoded.substring(index,index+2)));
    index = index+2;
}
System.out.println(p);
// We should get this String: 1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,2,3,0,4,0,0,0,4,0,3
I don't know of an easy way to turn a large number into base 64. Carefully chosen symbols (like +, _, -) are ok to be URL encoded, so 0-9, a-z, A-Z, with a "_" and "-" makes 64. The BigInteger.toString() method only takes up to Character.MAX_RADIX which is 36 (no uppercase letters). If you can find a way to take a large number and change to base 64, then the resulting encoded String will be even shorter.
EDIT: looks like this does it for you: http://commons.apache.org/codec/apidocs/org/apache/commons/codec/binary/Base64.html
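One possible way to get from the big number to a 64-symbol string is sketched below. It swaps the commons-codec class linked above for the JDK's java.util.Base64 with its URL-safe alphabet, and it assumes the input is the primed, zero-padded digit string from the example (so the number is always positive). The class and method names are illustrative.

```java
import java.math.BigInteger;
import java.util.Base64;

class BigIntegerBase64 {
    /** Encodes a decimal digit string (starting with the "1" primer) as URL-safe Base64. */
    static String encode(String decimalDigits) {
        byte[] bytes = new BigInteger(decimalDigits).toByteArray(); // big-endian two's complement
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    /** Reverses encode(): Base64 text back to the decimal digit string. */
    static String decode(String encoded) {
        byte[] bytes = Base64.getUrlDecoder().decode(encoded);
        return new BigInteger(bytes).toString();
    }

    public static void main(String[] args) {
        String digits = "101020000000000000000000000000000000000010202030004000000040003";
        String encoded = encode(digits);
        System.out.println(encoded + " (" + encoded.length() + " characters)");
        System.out.println(decode(encoded).equals(digits));
    }
}
```

The result uses only letters, digits, "-" and "_", so it can go into a URL without escaping.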
How about saving it as a base 36 number?
In Java that would be
new java.math.BigInteger("120000000000000000012230400403").toString(36)
which would evaluate to "bmtwva1131gpefvb1xv"
You would get the original number back with
new java.math.BigInteger("bmtwva1131gpefvb1xv", 36)
It's a good point that this doesn't handle leading 0s (Thilo's suggestion of adding a leading 1 would work). About the commas: if the numbers were equally sized (01 instead of 1), then I think there would be no need for commas.
Suggest you look at base64 which provides 6 bits of information per character -- in general your encoding efficiency is log2(K) bits per symbol where K is the number of symbols in the set of allowable symbols.
For 8-bit character set, many of these are impermissible in URLs, so you need to choose some subset that are legal URL characters.
Just to clarify: I didn't mean encode your "1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,2,3,0,4,0,0,0,4,0,3" string as base64 -- I meant figure out what information you really want to encode, expressed as a string of raw binary bytes, and encode that in base64. It will exclude control characters (although you might want to use an alternate form where all 64 characters can be used in URLs without escaping) and be more efficient than converting numbers to a printable number form.
Each number can be from 0 to 99, and the numbers are separated by commas so that String.split(",") can retrieve the String[].
OK, now you have a clear definition. Here's a suggestion:
Convert your information from its original form to a binary number / byte array. If all you have is a string of comma-separated numbers from 0-99, then here are two options:
(slow) -- treat as numbers in base 100, convert to a BigInteger (e.g. n = n * 100 + x[i] for each number x in the array), convert to a byte array, and be sure to precede the whole thing by its length, so that "0,0,0,0" can be distinguished from "0,0" (numerically equal in base 100, but with a different length). Then convert the result to base64.
(more efficient) -- treat as numbers in base 128 (since that is a power of 2), and use any number from 100-127 as a termination character. Each block of 6 numbers therefore contains 42 (=6*7) bits of information, which can be encoded as a string of 7 characters using base64. (Pad with termination characters as needed to reach an even multiple of 6 of the original numbers.)
Because you have a potentially variable-length array of numbers as inputs, you need to encode the length somehow -- either directly as a prefix, or indirectly by using a termination character.
For the inverse algorithm, just reverse the steps and you'll get an array of numbers from 0 to 99 -- using either the prefixed length or termination character to determine the size of the array -- which you can convert to a human-readable string separated with commas.
If you have access to the original information in a raw binary form before it's encoded as a string, use that instead. (but please post a question with the input format requirements for that information)
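The "more efficient" option could be sketched in Java as below. This is an illustration under assumptions, not a definitive implementation: the terminator value 127, the URL-safe alphabet, and the class name are all choices made here, not part of the suggestion above.

```java
import java.util.ArrayList;
import java.util.List;

class Base128Packer {
    // URL-safe 64-symbol alphabet (assumption: letters, digits, '-' and '_').
    static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
    static final int TERMINATOR = 127; // any value from 100 to 127 would do

    /** Packs numbers 0-99 as 7-bit values: 6 numbers = 42 bits = 7 characters. */
    static String encode(int[] numbers) {
        for (int n : numbers) {
            if (n < 0 || n > 99) {
                throw new IllegalArgumentException("out of range: " + n);
            }
        }
        StringBuilder out = new StringBuilder();
        for (int block = 0; block < numbers.length; block += 6) {
            long bits = 0;
            for (int i = 0; i < 6; i++) {
                int index = block + i;
                // Pad the last block with the termination value.
                int value = index < numbers.length ? numbers[index] : TERMINATOR;
                bits = (bits << 7) | value;
            }
            for (int shift = 36; shift >= 0; shift -= 6) {
                out.append(ALPHABET.charAt((int) ((bits >>> shift) & 0x3F)));
            }
        }
        return out.toString();
    }

    /** Reverses encode(), stopping at the first termination value. */
    static int[] decode(String text) {
        if (text.length() % 7 != 0) {
            throw new IllegalArgumentException("length must be a multiple of 7");
        }
        List<Integer> out = new ArrayList<>();
        for (int block = 0; block < text.length(); block += 7) {
            long bits = 0;
            for (int i = 0; i < 7; i++) {
                int symbol = ALPHABET.indexOf(text.charAt(block + i));
                if (symbol < 0) {
                    throw new IllegalArgumentException("bad character at " + (block + i));
                }
                bits = (bits << 6) | symbol;
            }
            for (int shift = 35; shift >= 0; shift -= 7) {
                int value = (int) ((bits >>> shift) & 0x7F);
                if (value >= 100) {
                    return out.stream().mapToInt(Integer::intValue).toArray();
                }
                out.add(value);
            }
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

With this layout every 6 numbers cost 7 characters, so "0,0" and "0,0,0,0" encode differently because the padding positions hold the terminator rather than zero.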
If the numbers are between 0 and 255, you can create a byte array out of them. Once you have a byte array, you have many choices:
Use base64 on the byte array, which will create a compact string (almost) URL compatible
Convert them to chars, using your own algorithm based on maximum values
Convert them to longs, and then use Long.toString(x,31).
To convert back, you'll obviously have to apply the chosen algorithm in the opposite way.
Modified UUENCODE:-
Split the binary into groups of 6 bits
Make an array of 64 characters (choose ones allowable and keep in ASCII order for easy search):- 0..9, A..Z, _, a..z, ~
Map between the binary and the characters.

Base64 vs HEX for sending binary content over the internet in XML doc

What is the best way of sending binary content between systems inside an XML document?
I know of Base64 and hex; what is the real difference? I am currently using Base64 but need to include an external commons library for this, whereas with hex I think I could just create a function.
You could just write your own method for Base64 as well... but I'd generally recommend using external, well-tested libraries for both. (It's not like there's any shortage of them.)
The difference between Base64 and hex is really just how bytes are represented. Hex is another way of saying "Base16". Hex will take two characters for each byte - Base64 takes 4 characters for every 3 bytes, so it's more efficient than hex. Assuming you're using UTF-8 to encode the XML document, a 100K file will take 200K to encode in hex, or 133K in Base64. Of course it may well be that you don't care about the space efficiency - in many cases it won't matter. If it does matter, then clearly Base64 is better on that front. (There are alternatives which are even more efficient, but they're not as common.)
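The size arithmetic above can be checked with a couple of lines of Java. The helper names are illustrative; the Base64 figure includes '=' padding and assumes no line breaks are inserted.

```java
import java.util.Base64;

class EncodedSizes {
    /** Hex: two characters per input byte. */
    static long hexLength(long byteCount) {
        return byteCount * 2;
    }

    /** Padded Base64: four characters per started group of three input bytes. */
    static long base64Length(long byteCount) {
        return 4 * ((byteCount + 2) / 3);
    }

    public static void main(String[] args) {
        long size = 100_000;
        System.out.println(hexLength(size));    // prints 200000
        System.out.println(base64Length(size)); // prints 133336
        // Cross-check against the JDK encoder.
        System.out.println(Base64.getEncoder().encode(new byte[(int) size]).length);
    }
}
```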
I was curious how on EARTH base64 can convert 3 input bytes into 4 output bytes for just 33% space growth (whereas hex converts 1 input byte into 2 output bytes for 100% space growth). Why specifically 3 input bytes?
The answer is:
3 bytes = 3 x 8 bits = 24 bits.
Why that magic "24 bits" number? Well, base 64 represents the numbers 0 to 63. How are those represented in binary? With 000000 (0) to 111111 (63).
Bingo! Each base64 character represents 6 bits of input data using a single output byte (a single character such as "Z", etc).
So 24 bits (3 full 8-bit bytes of input) / 6 bits (base64 alphabet) = 4 bytes of base64. That's it!
Or, described another way, every Base64 character (which is 1 byte (8 bits)) encodes 6 bits of real data. And if we divide 8bits/6bits we see where the 33% growth comes from, as mentioned at the top of this post... So yes, Base64 always increases data size by 33% (plus some potential padding by the = characters that are sometimes added at the end of the base64 output).
You may think "Why not base128 (7 bits of input = 8 bits of output), at just 14% size growth when encoding?". The answer for that is that base64 is the best we can find, since the lower 128 ASCII characters aren't all printable. Many are control characters such as NULL etc.
There are obviously ways to create other systems such as perhaps "base81" etc, since you can do anything you want if you create a custom encoding algorithm. But the beauty of base64 is how it encodes data so cleanly in chunks of 6 bits, and how you simply have to "read 3 bytes and output 4" to encode, and "read 4 bytes and output 3" to decode. So that encoding scheme became popular.
Now you are hopefully wiser after having read this.
Fun Update: Speaking of other encoding styles with more characters... It's come to my attention that Ascii85 aka Base85 exists and is slightly more efficient (25% data size growth when encoding as Base85 instead of 33% for Base64): https://en.wikipedia.org/wiki/Ascii85
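To make the "read 3 bytes and output 4" step described earlier in this answer concrete, here is a hand-rolled sketch of encoding a single complete group with the standard alphabet. It is for illustration only (no padding handling); real code should use a library such as java.util.Base64. The class name is made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64Group {
    static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    /** Encodes exactly three bytes (24 bits) as four 6-bit symbols. */
    static String encodeGroup(byte b0, byte b1, byte b2) {
        // Concatenate the three bytes into one 24-bit value.
        int bits = ((b0 & 0xFF) << 16) | ((b1 & 0xFF) << 8) | (b2 & 0xFF);
        char[] out = new char[4];
        for (int i = 0; i < 4; i++) {
            // Take the next 6 bits, most significant first.
            out[i] = ALPHABET.charAt((bits >>> (18 - 6 * i)) & 0x3F);
        }
        return new String(out);
    }

    public static void main(String[] args) {
        byte[] man = "Man".getBytes(StandardCharsets.US_ASCII);
        System.out.println(encodeGroup(man[0], man[1], man[2]));  // prints TWFu
        System.out.println(Base64.getEncoder().encodeToString(man)); // prints TWFu
    }
}
```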
There are only two 'real differences':
The radix. Base64 is base-64, surprise, and hex is base-16.
The encoding: base-64 encodes 3 source bytes into 4 base-64 characters (http://en.wikipedia.org/wiki/Base64#Examples); hex encodes 1 byte into 2 hex characters.
So base64 is more compact than hex.
Other answers made clear the efficiency difference between base16 and base64.
There is more to base selection than efficiency.
Base64 uses more than just letters and numbers. Different implementations use different punctuation characters for indicating padding and for making up the last two characters of the set of 64. These can include plus "+" and equals "=", both problematic in HTTP query strings.
So one reason to favour base16 over base64 is that base16 values can be composed directly into HTTP query strings without requiring additional encoding. Is that important to you?
Notice that this is an additional concern, over and above efficiency. Neither base is inherently better or worse; they're just two different points on a scale, at which you'll find different properties that will be more or less attractive in different situations.
For example, consider base32. It's 20% less efficient than base64, but is still suitable for use in HTTP query strings. Most of its inefficiency comes from being case-insensitive and avoiding zero "0" and one "1", to reduce mistakes in reproduction by humans.
So base32 introduces a new concern: ease of reproduction for humans. Is that a concern for you? If it's not, you could go for something like base62, which is still convenient in HTTP query strings, but is case sensitive and includes zero "0" and one "1".
Hopefully, I've clarified that the selection of your encoding base is a matter of sliding along a scale until you get the best efficiency you can have before sacrificing what's important to you.
Wikipedia has a fun list of numeral systems.
Is size important to you?
Base64 is more space efficient, using 4 characters to represent 3 bytes, whereas hex uses 2 characters for each byte. In other words: hex increases the size of the data by 100%. For small strings that fit as params in URL requests I wouldn't mind the extra cost/size.
Is ease of use important to you?
Hex is easier to use than Base64 because you don't need to escape (it may contain +, = and /) when using the string as a get parameter in url requests.
Is widespread use important to you?
I don't have the numbers, but Base64 might be more known to the general developer than hex depending on several factors. I knew about base64 long before hex (base16).
base64 has less overhead (base64 produces 4 characters for every 3 bytes of original data while hex produces 2 characters for every byte of original data). Hex is more readable - you just look at the two characters and immediately know what byte is behind, but with base64 you need effort decoding the 4-characters group, so debugging will be easier with hex.

SHA-1 Hashes Mixed with Strings

I have to parse something like the following: "some text <40 byte hash>". Can I read this whole thing into a String without corrupting the 40-byte hash part?
The thing is hash is not going to be there so i don't want to process it while reading.
EDIT: I forgot to mention that the 40-byte hash is two 20-byte hashes stored as raw bytes, with no encoding.
Read it from your input stream as a byte stream, and then strip the String out of the stream like this:
String s = new String(Arrays.copyOfRange(bytes, 0, bytes.length-40));
Then get your bytes as:
byte[] hash = Arrays.copyOfRange(bytes, bytes.length - 40, bytes.length);
SHA-1 hashes are 20 bytes (160 bits) in length. If you are dealing with 40 character hashes, then they are probably an ASCII representation of the hash, and therefore only contain the characters 0-9 and a-f. If this is the case, then you should be able to read and manipulate the strings in Java without any trouble.
Some more details could be useful, but I think the answer is that you should be okay.
You didn't say how the SHA-1 hash was encoded (common possibilities include "none" (the raw bytes), Base64 and hex). Since SHA-1 produces a 20 byte (160 bit) hash, I am guessing that it will be encoded using hex, since that doubles the space needed to the 40 bytes you mentioned. With that encoding, 2 characters will be used to encode each byte from the hash, using the symbols 0 through 9 and A through F. Those are all ASCII characters so you are safe.
Base64 encoding would also work (though probably not what you asked about since it increases the size by about 1/3 leaving you at well less than 40 bytes) as each of the characters used in Base64 are also ASCII.
If the raw bytes were used directly, you would have a problem, as some of the values are not valid characters.
OK, now that you've clarified that these are raw bytes
No, you cannot read this into Java as a string, you will need to read it as raw bytes.
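A short sketch of why this is so, and of splitting the record on the byte level instead. It assumes the text part is UTF-8 and that the hash is always the last 40 bytes of the record. The class and method names are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class TextWithRawHash {
    static final int HASH_LENGTH = 40; // two raw 20-byte SHA-1 hashes

    /** Decodes only the leading text bytes (assumed to be UTF-8). */
    static String textPart(byte[] record) {
        return new String(record, 0, record.length - HASH_LENGTH, StandardCharsets.UTF_8);
    }

    /** Copies the trailing hash bytes untouched. */
    static byte[] hashPart(byte[] record) {
        return Arrays.copyOfRange(record, record.length - HASH_LENGTH, record.length);
    }

    /** Shows whether raw bytes come back unchanged after a trip through String. */
    static boolean survivesStringRoundTrip(byte[] raw) {
        byte[] back = new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(raw, back);
    }
}
```

Bytes such as 0xFF are not valid UTF-8, so the decoder replaces them and the original hash cannot be recovered from the String.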
WORKING CODE:
Converts byte string inputs into hex characters which should be safe in almost all string encodings. Use the code I posted in your other question to decode the hex chars back to raw bytes.
/** Lookup table: character for a half-byte */
static final char[] CHAR_FOR_BYTE = {'0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F'};

/** Encode byte data as a hex string... hex chars are UPPERCASE */
public static String encode(byte[] data){
    if(data == null || data.length==0){
        return null;
    }
    char[] store = new char[data.length*2];
    for(int i=0; i<data.length; i++){
        final int val = (data[i]&0xFF);
        final int charLoc=i<<1;
        store[charLoc]=CHAR_FOR_BYTE[val>>>4];
        store[charLoc+1]=CHAR_FOR_BYTE[val&0x0F];
    }
    return new String(store);
}
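The decoding code referred to above is not reproduced in this thread. As a stand-in, here is a minimal sketch of the reverse direction; it is not the author's code, and the class name is made up. It accepts both uppercase and lowercase hex digits.

```java
class HexDecoder {
    /** Decodes a hex string (two characters per byte) back to raw bytes. */
    static byte[] decode(String hex) {
        if (hex == null || hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have an even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int high = Character.digit(hex.charAt(2 * i), 16);
            int low = Character.digit(hex.charAt(2 * i + 1), 16);
            if (high < 0 || low < 0) {
                throw new IllegalArgumentException("not a hex digit near index " + (2 * i));
            }
            out[i] = (byte) ((high << 4) | low);
        }
        return out;
    }
}
```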
