Java SE 8 Base64 class: encoding of byte[] parameter

Background: I'm working with the java.util.Base64 class that's new with Java 1.8.
The documentation specifies that encodeToString takes a byte array (there are some other options, but byte[] is the one I'm using). However, it doesn't specify how the byte array needs to be encoded. Here's my working code:
import java.util.Base64;
import java.util.Base64.Encoder;
public class Test64 {
public static void main(String[] args){
try{
System.out.println(print64("This should be base64"));
} catch(Exception e) {
e.printStackTrace();
}
}
public static String print64(String test) throws Exception {
String test64 = "";
byte[] testBytes = test.getBytes("US-ASCII");
Base64.Encoder encoder64 = Base64.getUrlEncoder();
test64 = encoder64.encodeToString(testBytes);
return test64;
}
}
The question I have is whether the Base64 encodeToString will accept a byte[] with ANY encoding. I've tried US-ASCII and UTF-8, and those both work, but I'm hoping for a general conclusion.
Link to Javadoc for Base64.Encoder

The documentation does not specify an encoding because none is required: any byte[] data will work. Base64 conversion operates on raw byte values, not on characters, so whoever decodes the Base64 text has to know what the resulting bytes mean. As long as your documentation is clear about how to interpret the bytes, you can use a Base64 string to serialize any data.
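To illustrate the point, here is a minimal sketch (the class and method names are made up for this example) that pushes UTF-8 text, ISO-8859-1 text and plain binary data through the URL-safe encoder. The charset only matters when turning a String into bytes and back; the Base64 step returns exactly the bytes it was given.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class Base64AnyBytesDemo {

    // Encode and immediately decode. Base64 never inspects what the bytes mean.
    static byte[] roundTrip(byte[] input) {
        String encoded = Base64.getUrlEncoder().encodeToString(input);
        return Base64.getUrlDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        String text = "Jyv\u00e4skyl\u00e4"; // contains two non-ASCII characters
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] binary = {0, (byte) 0xFF, (byte) 0x80, 0x7F}; // not text at all

        // Every input comes back byte-for-byte identical.
        System.out.println(Arrays.equals(utf8, roundTrip(utf8)));
        System.out.println(Arrays.equals(latin1, roundTrip(latin1)));
        System.out.println(Arrays.equals(binary, roundTrip(binary)));

        // The receiver needs to know the charset to turn the bytes back into text.
        String restored = new String(roundTrip(latin1), StandardCharsets.ISO_8859_1);
        System.out.println(text.equals(restored));
    }
}
```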


Base64 string from JWT to json

When I try to decode the header of a JWT from Base64 to a String, the output is:
{"alg":"RS256","typ":"JWT","kid":"1234"
without the closing brace. But when I decode the same Base64 string with an online tool, for example https://www.base64decode.org/, the JSON is complete.
The function that I use:
public void test() {
String encodedToken = "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IjEyMzQifQ";
System.out.println(new String(DatatypeConverter.parseBase64Binary(encodedToken)));
}
What can be wrong?
EDIT: Java 7 is mandatory.
Try encoding {"alg":"RS256","typ":"JWT","kid":"1234"} in Base64.
You will see eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IjEyMzQifQ==
The trailing == is padding.
I think the problem is that DatatypeConverter.parseBase64Binary uses the xsd:base64Binary representation (RFC 2045), in which padding is mandatory, and your token has none.
You can decode it this way instead (java.util.Base64):
public void test() {
String encodedToken = "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IjEyMzQifQ";
System.out.println(new String(Base64.getDecoder().decode(encodedToken.getBytes())));
}
java.util.Base64 follows RFC 4648, and its decoder does not require padding.
Welcome to Stack Overflow.
According to this answer on GitHub, DatatypeConverter.parseBase64Binary() has some bugs and doesn't output the correct decoded string.
If you're using Java 8 or higher you can decode this way:
String base64 = "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IjEyMzQifQ";
byte[] temp = Base64.getDecoder().decode(base64.getBytes());
System.out.println(new String(temp));
This requires importing the class java.util.Base64.
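Putting the two answers together, a minimal sketch (class and helper names are hypothetical): JWT segments are base64url without padding, so the URL decoder is the natural fit on Java 8. For a Java 7 code base, the assumption is that restoring the padding first, as pad() does here, lets DatatypeConverter.parseBase64Binary see the whole input. That part follows from the padding diagnosis above and is not run here. A token containing '-' or '_' would also need those mapped to '+' and '/' before an RFC 2045 style decoder could read it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class JwtHeaderDecodeDemo {

    // Hypothetical helper: restore the '=' padding that JWT segments omit.
    static String pad(String unpadded) {
        StringBuilder sb = new StringBuilder(unpadded);
        while (sb.length() % 4 != 0) {
            sb.append('=');
        }
        return sb.toString();
    }

    // Java 8+: the URL-safe decoder accepts input with or without padding.
    static String decodeSegment(String segment) {
        byte[] bytes = Base64.getUrlDecoder().decode(segment);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String header = "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IjEyMzQifQ";
        System.out.println(pad(header));           // same token with "==" appended
        System.out.println(decodeSegment(header)); // complete JSON, closing brace included
    }
}
```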

Convert UTF-8 to Shift-JIS

I have written some simple code to convert Japanese text from UTF-8 to Shift-JIS.
private static String convertUTF8ToShiftJ(String uft8Strg) {
String shftJStrg = null;
try {
byte[] b = uft8Strg.getBytes(UTF_8);
shftJStrg = new String(b, Charset.forName("SHIFT-JIS"));
logger.info("Converted to the string :" + shftJStrg);
} catch (Exception e) {
e.printStackTrace();
return uft8Strg;
}
return shftJStrg;
}
But the output is garbled:
convertUTF8ToShiftJ START !!
uft8Strg=*** abc000.sh ����started�
*** abc000.sh ��中�executing...�
*** abc000.sh ����ended��*
Does anybody have an idea where I made a mistake, or whether I need some additional logic? Any help would be appreciated!
Your input is already a String, so your method is "wrong". UTF-8 is an encoding: UTF-8 data is a byte[], which can be decoded into a String in Java.
It should read:
private static byte[] convertUTF8ToShiftJ(byte[] utf8) {
If you want to convert a UTF-8 byte[] to a Shift-JIS byte[]:
private static byte[] convertUTF8ToShiftJ(byte[] utf8) {
// Decode the UTF-8 bytes into a String, then encode that String as Shift-JIS
String s = new String(utf8, StandardCharsets.UTF_8);
return s.getBytes( Charset.forName("SHIFT-JIS"));
}
A String can be converted to a byte[] later, by mystring.getBytes(encoding)
Please see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) for more detail.
It seems you have a conceptual misunderstanding about String encodings.
See for example Byte Encodings and Strings.
Converting a String from one encoding to another encoding doesn't make sense,
because String is a thing independent of encoding.
However, a String can be represented by byte arrays in various encodings
(like for example UTF-8 or Shift-JIS).
Therefore, it would make sense to convert a UTF-8 encoded byte array
to a Shift-JIS encoded byte array.
private static byte[] convertUTF8ToShiftJ(byte[] utf8Bytes) throws IllegalCharsetNameException {
String s = new String(utf8Bytes, StandardCharsets.UTF_8);
byte[] shftJBytes = s.getBytes(Charset.forName("SHIFT-JIS"));
return shftJBytes;
}
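For completeness, a self-contained sketch of the byte[]-to-byte[] conversion with a round-trip check. The class name and the sample character are my own choices, and it assumes the JDK in use ships the Shift_JIS charset (the standard distributions do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf8ToShiftJisDemo {

    static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    // Decode the UTF-8 bytes into a String, then re-encode it as Shift_JIS.
    static byte[] utf8ToShiftJis(byte[] utf8Bytes) {
        String s = new String(utf8Bytes, StandardCharsets.UTF_8);
        return s.getBytes(SHIFT_JIS);
    }

    public static void main(String[] args) {
        String original = "\u4e2d"; // a single kanji
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        byte[] sjis = utf8ToShiftJis(utf8);

        // Same character, different byte representations.
        System.out.println(utf8.length + " UTF-8 bytes, " + sjis.length + " Shift_JIS bytes");

        // Decoding with the matching charset gives the original text back.
        System.out.println(original.equals(new String(sjis, SHIFT_JIS)));
    }
}
```

Decoding the Shift_JIS bytes with any other charset is exactly what produces garbled output like that shown in the question.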

How to unescape html special characters in Java?

I have some text strings that I need to process and inside the strings there are HTML special characters. For example:
10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂
I would like to convert those characters to utf-8.
I used org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4 but didn't have any luck. Is there an easy way to deal with this problem?
Apache commons-text library has the StringEscapeUtils class that has the unescapeHtml4() utility method.
String utf8Str = StringEscapeUtils.unescapeHtml4(htmlStr);
You may also need unescapeXml()
@Bohemian's code is correct; it works for me. Your un-encoded string is 10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂.
Now, I'm adding another answer instead of commenting on Bohemian's answer because there are two things that still need to be mentioned:
I copy-pasted your string into HTML code and the browser can't render your characters properly, because your string is incorrectly encoded: for each character that needs two UTF-16 code units, it encodes the high surrogate and the low surrogate separately instead of encoding the whole code point (it seems the original string was UTF-16 encoded, maybe a Java String?).
You want the string to be re-encoded to UTF-8.
Once you have un-encoded your String with StringEscapeUtils.unescapeHtml(htmlStr) (which succeeds despite the incorrect encoding), it doesn't make much sense to talk about "string encodings", because Java strings are "unaware" of encodings (they use UTF-16 internally, though).
If you need a group of bytes containing a UTF-8 encoded "string", you need to get the "raw" bytes from a String encoded as UTF-8:
String javaStr = StringEscapeUtils.unescapeHtml(htmlStr);
byte[] rawUtf8String = javaStr.getBytes("UTF-8");
And do with such byte array whatever you need.
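As a rough illustration of the surrogate issue, here is a sketch using only the JDK. The two decimal values are an assumed example of what one such pair of numeric entities might look like (the original entity text isn't visible here), and the class and method names are made up. The two UTF-16 code units combine into one code point, which then encodes to four UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

class SurrogatePairDemo {

    // Build a String from two UTF-16 code units given as numeric values.
    static String fromCodeUnits(int high, int low) {
        return new String(new char[] {(char) high, (char) low});
    }

    public static void main(String[] args) {
        // Assumed example: entities such as &#55357;&#56877; (0xD83D, 0xDE2D)
        String s = fromCodeUnits(55357, 56877);

        // Two chars in the String, but together they form one code point.
        System.out.println(Character.isSurrogatePair(s.charAt(0), s.charAt(1)));
        System.out.printf("U+%X%n", s.codePointAt(0));

        // Encoding the String as UTF-8 yields the bytes for the whole code point.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length + " UTF-8 bytes");
    }
}
```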
Now if what you need is to write a UTF-8 encoded string to a File, instead of that byte array you need to specify the encoding when you create the proper java.io.Writer.
Try this code to un-encode your string (change the file path first) and then open the resulting file in any editor that supports UTF-8:
java.io.Writer approach (better):
public static void main(String[] args) throws IOException {
String str = "10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂";
String javaString = StringEscapeUtils.unescapeHtml(str);
try(Writer output = new OutputStreamWriter(
new FileOutputStream("/path/to/testing.txt"), "UTF-8")) {
output.write(javaString);
}
}
java.io.OutputStream approach (if you already have a "raw string"):
public static void main(String[] args) throws IOException {
String str = "10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂";
String javaString = StringEscapeUtils.unescapeHtml(str);
try(OutputStream output = new FileOutputStream("/path/to/testing.txt")) {
for (byte b : javaString.getBytes(Charset.forName("UTF-8"))) {
output.write(b);
}
}
}

Why do those calls to base64 classes return different results?

My code:
private static String convertToBase64(String string)
{
final byte[] encodeBase64 =
org.apache.commons.codec.binary.Base64.encodeBase64(string
.getBytes());
System.out.println(Hex.encodeHexString(encodeBase64));
final byte[] data = string.getBytes();
final String encoded =
javax.xml.bind.DatatypeConverter.printBase64Binary(data);
System.out.println(encoded);
return encoded;
}
Now I'm calling it: convertToBase64("stackoverflow"); and get following result:
6333526859327476646d56795a6d787664773d3d
c3RhY2tvdmVyZmxvdw==
Why do I get different results?
I think Hex.encodeHexString encodes your Base64 bytes as hex, while the second one is a normal String.
From the API doc of Base64.encodeBase64():
byte[] containing Base64 characters in their UTF-8 representation.
So instead of
System.out.println(Hex.encodeHexString(encodeBase64));
you should write
System.out.println(new String(encodeBase64, "UTF-8"));
BTW: You should never use the String.getBytes() version without explicit encoding, because the result depends on the default platform encoding (for Windows this is usually "Cp1252" and Linux "UTF-8").
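To see that both lines of output describe the same data, here is a small sketch that uses java.util.Base64 and a hand-written hex helper as stand-ins for the commons-codec and JAXB classes in the question (that substitution, and the class name, are my assumptions). The hex line is simply the ASCII codes of the Base64 characters.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64VersusHexDemo {

    // Render each byte as two lowercase hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Base64-encode the UTF-8 bytes of the input; the result is ASCII text as bytes.
    static byte[] base64Bytes(String input) {
        return Base64.getEncoder().encode(input.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] encoded = base64Bytes("stackoverflow");
        // Reading the bytes as characters gives the familiar Base64 text.
        System.out.println(new String(encoded, StandardCharsets.US_ASCII));
        // Hex-dumping the same bytes gives the first line from the question.
        System.out.println(toHex(encoded));
    }
}
```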

MD5 Hash of ISO-8859-1 string in Java

I'm implementing an interface for a digital payment service called Suomen Verkkomaksut. The information about the payment is sent to them via an HTML form. To ensure that no one tampers with the information during the transfer, an MD5 hash is calculated at both ends with a special key that is not sent to them.
My problem is that for some reason they seem to decide that the incoming data is encoded with ISO-8859-1 and not UTF-8. The hash that I send to them is calculated with UTF-8 strings, so it differs from the hash that they calculate.
I tried this with following code:
String prehash = "6pKF4jkv97zmqBJ3ZL8gUw5DfT2NMQ|13466|123456||Testitilaus|EUR|http://www.esimerkki.fi/success|http://www.esimerkki.fi/cancel|http://www.esimerkki.fi/notify|5.1|fi_FI|0412345678|0412345678|esimerkki@esimerkki.fi|Matti|Meikäläinen||Testikatu 1|40500|Jyväskylä|FI|1|2|Tuote #101|101|1|10.00|22.00|0|1|Tuote #202|202|2|8.50|22.00|0|1";
String prehashIso = new String(prehash.getBytes("ISO-8859-1"), "ISO-8859-1");
String hash = Crypt.md5sum(prehash).toUpperCase();
String hashIso = Crypt.md5sum(prehashIso).toUpperCase();
Unfortunately both hashes are identical with value C83CF67455AF10913D54252737F30E21. The correct value for this example case is 975816A41B9EB79B18B3B4526569640E according to Suomen Verkkomaksut's documentation.
Is there a way to calculate MD5 hash in Java with ISO-8859-1 strings?
UPDATE: While waiting for an answer from Suomen Verkkomaksut, I found an alternative way to make the hash. Michael Borgwardt corrected my understanding of String and encodings, and I looked for a way to make the hash from a byte[].
Apache Commons is an excellent source of libraries, and I found their DigestUtils class, which has an md5Hex function that takes a byte[] and returns a 32-character hex string.
For some reason this still doesn't work. Both of these return the same value:
DigestUtils.md5Hex(prehash.getBytes());
DigestUtils.md5Hex(prehash.getBytes("ISO-8859-1"));
You seem to misunderstand how string encoding works, and your Crypt class's API is suspect.
Strings don't really "have an encoding" - an encoding is what you use to convert between Strings and bytes.
Java Strings are internally stored as UTF-16, but that does not really matter, as MD5 works on bytes, not Strings. Your Crypt.md5sum() method has to convert the Strings it's passed to bytes first - what encoding does it use to do that? That's probably the source of your problem.
Your example code is pretty nonsensical as the only effect this line has:
String prehashIso = new String(prehash.getBytes("ISO-8859-1"), "ISO-8859-1");
is to replace characters that cannot be represented in ISO-8859-1 with question marks.
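A quick sketch to confirm this (class and method names are mine): text that fits in ISO-8859-1 survives the round trip unchanged, which is why both hashes in the question came out identical, while a character outside ISO-8859-1 turns into a question mark.

```java
import java.nio.charset.StandardCharsets;

class Latin1RoundTripDemo {

    // Same operation as the line in the question: encode to ISO-8859-1 and decode again.
    static String roundTrip(String s) {
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String finnish = "Jyv\u00e4skyl\u00e4"; // every character exists in ISO-8859-1
        String kanji = "\u4e2d";                 // not representable in ISO-8859-1

        System.out.println(roundTrip(finnish).equals(finnish)); // unchanged String
        System.out.println(roundTrip(kanji));                   // replaced by "?"
    }
}
```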
Java has a standard java.security.MessageDigest class, for calculating different hashes.
Here is the sample code:
import java.security.MessageDigest;
// Exception handling not shown
String prehash = ...
final byte[] prehashBytes= prehash.getBytes( "iso-8859-1" );
System.out.println( prehash.length( ) );
System.out.println( prehashBytes.length );
final MessageDigest digester = MessageDigest.getInstance( "MD5" );
digester.update( prehashBytes );
final byte[] digest = digester.digest( );
final StringBuilder hexString = new StringBuilder();
for ( final byte b : digest ) {
final int intByte = 0xFF & b;
// Pad values below 0x10 so that every byte yields exactly two hex digits
if ( intByte < 0x10 )
{
hexString.append( "0" );
}
hexString.append(
Integer.toHexString( intByte )
);
}
System.out.println( hexString.toString( ).toUpperCase( ) );
Unfortunately for you, it produces the same "C83CF67455AF10913D54252737F30E21" hash. So, I guess your Crypt class is exonerated. I specifically added the prehash and prehashBytes length printouts to verify that 'ISO-8859-1' is indeed used. In this case both are 328.
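When I did prehash.getBytes( "utf-8" ) it produced "9CC2E0D1D41E67BE9C2AB4AABDB6FD3" (and the length of the byte array became 332). Note that this value is only 31 characters long, so the hex formatting evidently dropped a leading zero somewhere. Again, not the result you are looking for.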
So, I guess Suomen Verkkomaksut does some massaging of the prehash string that they did not document, or you have overlooked.
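For anyone repeating this experiment, a compact sketch of hashing with an explicit charset (the class and method names are hypothetical, and "%02X" formatting is my choice for getting two hex digits per byte). It also shows why an ASCII-only test string helps isolate encoding problems: its MD5 is the same under UTF-8 and ISO-8859-1, whereas a string containing "ä" hashes differently.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Md5CharsetDemo {

    // MD5 over the bytes of the text in the given charset, as 32 uppercase hex digits.
    static String md5Hex(String text, Charset charset) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(text.getBytes(charset));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02X", b & 0xFF)); // always two digits
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String ascii = "abc";
        String accented = "Jyv\u00e4skyl\u00e4";

        // ASCII-only text has identical bytes in both charsets, so identical hashes.
        System.out.println(md5Hex(ascii, StandardCharsets.UTF_8));
        System.out.println(md5Hex(ascii, StandardCharsets.ISO_8859_1));

        // Non-ASCII text has different bytes, so the hashes differ.
        System.out.println(md5Hex(accented, StandardCharsets.UTF_8));
        System.out.println(md5Hex(accented, StandardCharsets.ISO_8859_1));
    }
}
```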
Not sure if you solved your problem, but I had a similar problem with ISO-8859-1 encoded strings with nordic ä & ö characters and calculating a SHA-256 hash to compare with stuff in documentation. The following snippet worked for me:
import java.security.MessageDigest;
//imports omitted
@Test
public void test() throws ProcessingException{
String test = "iamastringwithäöchars";
System.out.println(this.digest(test));
}
public String digest(String data) throws ProcessingException {
MessageDigest hash = null;
try{
hash = MessageDigest.getInstance("SHA-256");
}
catch(Throwable throwable){
throw new ProcessingException(throwable);
}
byte[] digested = null;
try {
digested = hash.digest(data.getBytes("ISO-8859-1"));
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
String ret = BinaryUtils.BinToHexString(digested);
return ret;
}
To transform bytes to hex string there are many options, including the apache commons codec Hex class mentioned in this thread.
If you send UTF-8 encoded data that they treat as ISO-8859-1, then that could be the source of your problem. I suggest you either send the data in ISO-8859-1 or try to communicate to Suomen Verkkomaksut that you're sending UTF-8. In an HTTP-based protocol you do this by adding charset=utf-8 to Content-Type in the HTTP header.
A way to rule out some issues would be to try a prehash String that only contains characters that are encoded the same in UTF-8 and ISO-8859-1. From what I can see, you can achieve this by removing all "ä" characters from the string you've used.
