How to encode and shorten hash for url safety? - java

I use SHA-512 to hash a token, which results in a 128 character string. How can I encode it so that it is shorter but still URL safe (no need to escape like '+' or '&')?
Hex and Base32 are both suboptimal, and Base64 is unsafe. I don't want to hack up a custom Base62 codec, as that will only become a maintenance burden.

Google has you covered:
com.google.common.io.BaseEncoding#base64Url
This is an efficient Base64 codec that uses only URL-safe characters. It produces an 88-character String for a SHA-512 hash (86 characters if you omit the padding).
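For example, a minimal sketch (assuming Guava is on the classpath; the token value is just a placeholder):

import com.google.common.io.BaseEncoding;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class UrlSafeHash {
    public static void main(String[] args) throws Exception {
        // SHA-512 produces 64 bytes of raw hash data.
        byte[] hash = MessageDigest.getInstance("SHA-512")
                .digest("some-token".getBytes(StandardCharsets.UTF_8));

        // URL-safe Base64: 88 characters with padding, 86 with omitPadding().
        System.out.println(BaseEncoding.base64Url().encode(hash));
        System.out.println(BaseEncoding.base64Url().omitPadding().encode(hash));
    }
}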

Java 8 added Base64 encoders, with different flavors. See java.util.Base64
For example:
Base64.getUrlEncoder().encode(...);
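A short sketch of how that might be used end-to-end (the token value is just a placeholder):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

byte[] hash = MessageDigest.getInstance("SHA-512")
        .digest("some-token".getBytes(StandardCharsets.UTF_8));

// The URL-safe encoder uses '-' and '_' instead of '+' and '/';
// withoutPadding() drops the trailing "==", leaving 86 characters.
String urlSafe = Base64.getUrlEncoder().withoutPadding().encodeToString(hash);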

Related

How to know what variant of Base64 encoding Java class uses?

In my Java program I use the com.ibm.xml.enc.dom.Base64 class for encoding/decoding
binary files. How can I know which variant of Base64 encoding this class uses?
There are indeed several implementations of Base64.
The idea behind this encoding is to find a way to carry raw bytes through the different network layers without them being altered.
Each layer reads bytes, and you don't want your raw data to be cut (corrupted) because a random sequence of bytes happens to be interpreted as an "end-of-request". That's why application data is encoded as printable characters.
(more details: http://www.comptechdoc.org/independent/networking/protocol/protlayers.html)
Most Base64 tables use A-Z, a-z and 0-9 for the first 62 characters. The differences among implementations lie in the last two characters and the padding character.
The most common implementations use + and / for the last two characters of the table. But you may also find - and _, which are used to make the encoding URL-safe.
For your class com.ibm.xml.enc.dom.Base64, nothing is specified in the doc:
https://www.ibm.com/support/knowledgecenter/en/SSYKE2_6.0.0/com.ibm.java.security.api.doc/xmlsec/com/ibm/xml/enc/dom/Base64.html#Base64()
So you can assume that it uses the most common Base64 implementation. If you have doubts, just generate examples from random raw bytes and double-check whether the output uses + and / as the 63rd and 64th characters.
If you need to write a generic Base64 decoder able to handle the different variants, you would need to check for these special characters, the length of the string, and the characters used for padding. From this information you can deduce which implementation to use.
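For example, a rough sketch of such a check using java.util.Base64 (the method name and the set of variants handled here are illustrative):

import java.util.Base64;

// Pick a decoder based on the variant-specific characters found in the input.
static byte[] decodeAnyVariant(String s) {
    if (s.indexOf('-') >= 0 || s.indexOf('_') >= 0) {
        return Base64.getUrlDecoder().decode(s);   // URL-safe alphabet: - and _
    }
    if (s.indexOf('\r') >= 0 || s.indexOf('\n') >= 0) {
        return Base64.getMimeDecoder().decode(s);  // MIME variant: line-wrapped output
    }
    return Base64.getDecoder().decode(s);          // standard alphabet: + and /
}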
You have much more details about the variants on wikipedia:
https://en.wikipedia.org/wiki/Base64

Differences between Crypt.crypt() and DigestUtils.md5() in apache.commons.Codec

I am writing a basic password cracker for the MD5 hashing scheme against a Linux /etc/shadow file. When I use Commons Codec's DigestUtils or Crypt libraries, the hash lengths differ (among other things).
When I use Crypt.crypt(passwordToHash, "$1$Jhe937$") the output is a 22-character string. When I use DigestUtils.md5[Hex](passwordToHash + "Jhe937") (or the Java MessageDigest class) the output is a 32-character string (after conversion). This makes no sense to me.
aside: is there no easy way to convert DigestUtils.md5(passwordToHash)'s byte[] to a String? I've tried all* the ways and I get invalid output: Nz_èJÓ_µù[î¬y
*all being: new String(byte[], "UTF-8") and convert to char then to String
The executive summary is that while they'll perform the same hashing, the output format is different between the two so the lengths will be different. Read on for details.
MD5 is a message digesting algorithm that always produces a 16-byte hash value (assuming valid input, etc.). Those bytes aren't all printable characters; each byte can take any value from 0-255, while the printable characters in ASCII are in the range 32-126.
DigestUtils.md5(String) generates the MD5 of the string and returns a 16 element byte array. DigestUtils.md5Hex(String) is a convenience wrapper (I'm assuming, I haven't looked at the source, but that's how I'd write it :-) ) around DigestUtils.md5 that takes the 16 element byte array md5 produces and base16 encodes it (also known as hex encoding). That replaces each byte with the equivalent two hex characters, which is why you get a 32 character String out of it.
Crypt.crypt uses a special format that goes back to the original Unix method of storing passwords. It's been extended over the years to use different hash/encryption algorithms, longer salts, and additional features. It also encodes its output as printable text, which is where the length difference comes from. By using a salt of "$1$...", you're saying to use MD5, so the password plus the salt will be hashed using MD5, resulting in 16 bytes as expected; but because those bytes aren't necessarily printable, the hash is base64 encoded (using a slightly different alphabet than the standard base64 encoding), which replaces 3 bytes with 4 printable characters. So 16 bytes become 16 / 3 * 4 ≈ 21.3 characters, rounded up to 22.
On your aside, DigestUtils.md5 produces 16 bytes, but those bytes can have any value from 0 to 255 and are (effectively) random. new String(byte[], "UTF-8") says the bytes in the byte array are a UTF-8 encoding, which is a very specific format. new String does its best to treat the bytes as a UTF-8 encoded string, but because they're really not, you generally get gibberish out. If you want something printable, you'll have to use something that takes arbitrary bytes, not bytes in a specific format (like UTF-8). Two popular options are base16/hex encoding, which you can get with DigestUtils.md5Hex, or base64, which you can get with Base64.encodeBase64String(DigestUtils.md5(pwd + salt)).
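To make the aside concrete, here is a sketch using Commons Codec (the password and salt values are placeholders taken from the question):

import org.apache.commons.codec.binary.Base64;
import org.apache.commons.codec.digest.DigestUtils;

String input = "passwordToHash" + "Jhe937";

// 16 raw digest bytes -> 32 hex characters
String hex = DigestUtils.md5Hex(input);

// 16 raw digest bytes -> 24 Base64 characters (including the "==" padding)
String b64 = Base64.encodeBase64String(DigestUtils.md5(input));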

Is there a drastic difference between UTF-8 and UTF-16

I call a webservice that gives me back a response XML with UTF-8 encoding. I checked that in Java using the getAllHeaders() method.
Now, in my java code, I take that response and then do some processing on it. And later, pass it on to a different service.
Now, I googled a bit and found out that by default the encoding in Java for strings is UTF-16.
In my response xml, one of the elements had a character É. Now this got screwed in the post processing request that I make to a different service.
Instead of sending É, it sent some gibberish. Now I wanted to know: will there really be a lot of difference between these two encodings? And if I wanted to know what É converts to from UTF-8 to UTF-16, how can I do that?
Both UTF-8 and UTF-16 are variable length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 character length starts with 16 bits.
Main UTF-8 pros:
- Basic ASCII characters like digits and Latin letters with no accents occupy one byte, identical to their US-ASCII representation. This way all US-ASCII strings become valid UTF-8, which provides decent backwards compatibility in many cases.
- No null bytes, which allows the use of null-terminated strings; this introduces a great deal of backwards compatibility too.
Main UTF-8 cons:
- Many common characters have different lengths, which slows indexing and calculating a string's length terribly.
Main UTF-16 pros:
- Most reasonable characters, like Latin, Cyrillic, Chinese and Japanese, can be represented with 2 bytes. Unless really exotic characters are needed, this means that the 16-bit subset of UTF-16 can be used as a fixed-length encoding, which speeds up indexing.
Main UTF-16 cons:
- Lots of null bytes in US-ASCII strings, which means no null-terminated strings and a lot of wasted memory.
In general, UTF-16 is usually better for in-memory representation, while UTF-8 is extremely good for text files and network protocols.
There are two things:
the encoding in which you exchange data;
the internal string representation of Java.
You should not be preoccupied with the second point ;) The thing is to use the appropriate methods to convert from your data (byte arrays) to Strings (char arrays, ultimately), and to convert from Strings back to your data.
The most basic classes you can think of are CharsetDecoder and CharsetEncoder, but there are plenty of others: String.getBytes() and the various Readers and Writers are just two possible approaches, and there are all the static methods of Character as well.
If you see gibberish at some point, it means you failed to decode or encode from the original byte data to Java strings. But again, the fact that Java strings use UTF-16 is not relevant here.
In particular, you should be aware that when you create a Reader or Writer, you should specify the encoding; if you fail to do so, the default JVM encoding will be used, and it may, or may not, be UTF-8.
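A minimal illustration (the file names are placeholders):

import java.io.*;
import java.nio.charset.StandardCharsets;

// Explicit UTF-8 on both sides; leaving the charset out falls back to the JVM default.
try (Reader in = new InputStreamReader(new FileInputStream("response.xml"), StandardCharsets.UTF_8);
     Writer out = new OutputStreamWriter(new FileOutputStream("request.xml"), StandardCharsets.UTF_8)) {
    int c;
    while ((c = in.read()) != -1) {
        out.write(c);
    }
}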
This website provides UTF-to-UTF conversion:
http://www.fileformat.info/convert/text/utf2utf.htm
UTF-32 is arguably the most human-readable of the Unicode Encoding Forms, because its big-endian hexadecimal representation is simply the Unicode Scalar Value without the "U+" prefix, zero-padded to eight digits. While a UTF-32 representation does make the programming model somewhat simpler, the increased average storage size has real drawbacks, making a complete transition to UTF-32 less compelling.
HOWEVER
UTF-32 is the same as the old UCS-4 encoding and remains fixed-width. Why can this remain fixed-width? Because UTF-16 is now the format that can encode the fewest characters, it sets the limit for all formats: it was defined that 1,112,064 is the total number of code points that will ever be defined by either Unicode or ISO 10646. Since Unicode is now only defined from 0 to 10FFFF, UTF-32 sounds like a somewhat pointless encoding: it is 32 bits wide, but only about 21 bits are ever used, which makes it very wasteful.
UTF-8: Generally speaking, you should use UTF-8. Most HTML documents use this encoding.
It uses at least 8 bits of data to store each character. This can lead to more efficient storage, especially when the text contains mostly English ASCII characters. But higher-order characters, such as non-ASCII characters, may require two to four bytes each!
UTF-16:
This encoding uses at least 16 bits to encode characters, including lower-order ASCII characters and higher-order non-ASCII characters.
If you are encoding text consisting of mostly non-English or non-ASCII characters, UTF-16 may result in smaller file size. But if you use UTF-16 to encode mostly ASCII text, it will use up more space.
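A quick way to see these size differences for yourself (the sample strings are arbitrary):

import java.nio.charset.StandardCharsets;

String ascii = "hello";
String accented = "É";

System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);     // 5
System.out.println(ascii.getBytes(StandardCharsets.UTF_16).length);    // 12 (2-byte BOM + 2 bytes per char)
System.out.println(accented.getBytes(StandardCharsets.UTF_8).length);  // 2
System.out.println(accented.getBytes(StandardCharsets.UTF_16).length); // 4 (2-byte BOM + 2 bytes)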

How to use UTF-16 in URL encoding?

Currently I am using utf-8 for URL encoding. I want to convert it to UTF-16.
How can I achieve this?
When encoding Unicode characters in URLs, it's necessary to encode them in such a fashion that all URL parsers and consumers can understand your URLs.
To that end, when the URL specification was extended by RFCs in the wake of the development of Unicode and related standards and tools, it was decided that the encoding to employ for encoding characters (using percent escapes) was to be UTF-8, as this would mean that established ASCII escapes would Just Work™.
Consequently, even if you could generate URLs with UTF-16-based percent escapes, no other program would be able to understand them, making them useless. In fact, by matter of definition, they wouldn't even be URLs.
There's also the question of why on earth you would want to use UTF-16 for anything, it being silly and all.
Remember: Never Don't Use UTF-8! (N'DUUH!)
URL escapes, as in %nn hex values, encode bytes. 8-bit bytes. If for some very nonstandard reason you want to encode the bytes of UTF-16 instead of UTF-8, you must first pick a byte order (BE or LE). Then you have to write code in your program to take the two bytes of each 16-bit UTF-16 code unit and represent each of them as %nn in hex.
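In Java, the standard way to produce those UTF-8-based percent escapes is URLEncoder; a small sketch:

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// É (U+00C9) is two bytes in UTF-8, so it becomes the two escapes %C3%89.
// The Charset overload requires Java 10+; on older versions pass "UTF-8" as a String.
String encoded = URLEncoder.encode("É", StandardCharsets.UTF_8);
System.out.println(encoded); // %C3%89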

Why does the Blowfish output in Java and PHP differ by only 2 chars?

I have Blowfish encryption code in both PHP and Java that was working fine until today, when I came across a problem.
The same content is encrypted differently in Java vs PHP by only 2 chars, which is really weird.
PHP
wTHzxfxLHdMm/JMFnoh0hciS/JADvFFg
Java
wTHzxfxLHdMm/JMFnoh0hciS/D8DvFFg
-------------------------^^
As you see, those two positions do not match. Unfortunately the value is a real email address and I can't share it. Also, I was not able to reproduce the problem with the few other values I've tested. I've tried changing the Base64 encoding classes on the Java side, and that did not help either.
The source code for PHP is here, and for Java is here.
What could I do to resolve this problem?
Let's have a look at your Java code:
String c = new String(Test.encrypt((new String("thevalue")).getBytes(),
                                   (new String("mykey")).getBytes()));
...
System.out.println("Base64 encoded String:" +
        new sun.misc.BASE64Encoder().encode(c.getBytes()));
What you are doing here is:
1. Convert the plaintext string to bytes, using the system's default encoding
2. Convert the key to bytes, using the system's default encoding
3. Encrypt the bytes
4. Convert the encrypted bytes back to a string, using the system's default encoding
5. Convert the encrypted string back to bytes, using the system's default encoding
6. Encode these encrypted bytes using Base64.
The problem is in step 4. It assumes that an arbitrary byte array represents a string in your system's default encoding, and that encoding this string back gives the same byte[]. This is valid for some encodings (the ISO-8859 series, for example), but not for others. In Java, when some byte (or byte sequence) is not representable in the given encoding, it will be replaced by some other character, which on the way back will be mapped to byte 63 (ASCII '?'). Actually, the documentation even says:
The behavior of this constructor when the given bytes are not valid in the default charset is unspecified.
In your case, there is no reason to do this at all - simply use the bytes which your encrypt method outputs directly to convert them to Base64.
byte[] encrypted = Test.encrypt("thevalue".getBytes(),
                                "mykey".getBytes());
System.out.println("Base64 encoded String:" + new sun.misc.BASE64Encoder().encode(encrypted));
(Also note that I removed the superfluous new String("...") constructor calls here, though this does not relate to your problem.)
The point to remember: Never ever convert an arbitrary byte[], which did not come from encoding a string, to a string. Output of an encryption algorithm (and most other cryptographic algorithms, except decryption) certainly belongs to the category of data which should not be converted to a string.
And never ever use the system's default encoding if you want portable programs.
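As a side note, sun.misc.BASE64Encoder is an internal API; since Java 8 the same one-liner can be written with java.util.Base64 (a sketch reusing the Test.encrypt method from the question):

import java.nio.charset.StandardCharsets;
import java.util.Base64;

byte[] encrypted = Test.encrypt("thevalue".getBytes(StandardCharsets.UTF_8),
                                "mykey".getBytes(StandardCharsets.UTF_8));
System.out.println("Base64 encoded String:" + Base64.getEncoder().encodeToString(encrypted));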
Your code seems right to me.
It looks like you have a trailing white space in the input to one of these programs, and it is only one. I'll tell you why:
Each of these 4-char blocks represents 3 bytes of the encrypted data. The different part (JA vs D8 in the 7th block) actually comes from a single differing byte.
wTHz xfxL HdMm /JMF noh0 hciS /JAD vFFg
wTHz xfxL HdMm /JMF noh0 hciS /D8D vFFg
If I have got it right your email address is 19 characters long. The 20th character in one of your input strings is a white space.
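If you want to confirm this, you can decode both Base64 strings and compare the raw ciphertext bytes; only one byte should differ (a quick check, not part of any fix):

import java.util.Base64;

byte[] a = Base64.getDecoder().decode("wTHzxfxLHdMm/JMFnoh0hciS/JADvFFg");
byte[] b = Base64.getDecoder().decode("wTHzxfxLHdMm/JMFnoh0hciS/D8DvFFg");

for (int i = 0; i < a.length; i++) {
    if (a[i] != b[i]) {
        System.out.println("byte " + i + " differs: " + a[i] + " vs " + b[i]);
    }
}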
Question: Have you tried the associated PHP decryption library to decrypt the PHP-generated encrypted text? Have you tried the associated Java decryption library to decrypt the Java-encrypted text?
If both produce differing outputs, then one MUST fail decrypting.
Is that one PHP, or Java?
Whichever one it is -- I would try to duplicate another such failure with a publicly shareable string... give that string as a unit test to the developer or developers who created the encrypt/decrypt code in the language that the round-trip encrypt/decrypt fails in.
Then... wait for them to fix it.
Not sure of any faster solutions -- except maybe change encryption/decryption library providers... or roll your own...
