decode UTF-16 text - java

I have a Java servlet that receives data from an upstream system via an HTTP GET request. This request includes a parameter named "text" and another named "charset" that indicates how the text parameter was encoded.
If I instruct the upstream system to send me the text TĀ and debug the servlet request params, I see the following:
request.getParameter("charset") == "UTF-16LE"
request.getParameter("text").getBytes() == [0, 84, 1, 0]
The code points (in hex) for the two characters in this string are:
[T] 0054
[Ā] 0100
I cannot figure out how to convert this byte[] back to the String "TĀ". I should mention that I don't entirely trust the charset and suspect it may be using UTF-16BE.

Use the String(byteArray, charset) constructor:
byte[] bytes = { 0, 84, 1, 0 };
String string = new String(bytes, "UTF-16BE");
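For reference, a minimal self-contained sketch (the byte values are the ones shown in the question) confirming the suspicion that the bytes are big-endian rather than the declared UTF-16LE:
import java.nio.charset.StandardCharsets;

public class Utf16Demo {
    public static void main(String[] args) {
        byte[] bytes = { 0, 84, 1, 0 };
        // Decoded with the charset the upstream system claims it used:
        System.out.println(new String(bytes, StandardCharsets.UTF_16LE)); // not "TĀ": code units 0x5400 and 0x0001
        // Decoded as big-endian, matching the actual byte order:
        System.out.println(new String(bytes, StandardCharsets.UTF_16BE)); // prints "TĀ"
    }
}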

Why are you calling getBytes() at all? You already have the parameter as a String. Calling getBytes(), without specifying a charset, is just an opportunity to mangle the data.


Java - decode base64 - Illegal base64 character 1

I have the following data in a file:
I want to decode the UserData. After reading it into the string comment, I do the following:
String[] split = comment.split("=");
if (split[0].equals("UserData")) {
    System.out.println(split[1]);
    byte[] callidArray = Arrays.copyOf(java.util.Base64.getDecoder().decode(split[1]), 9);
    System.out.println("UserData:" + Hex.encodeHexString(callidArray).toString());
}
But I'm getting the following exception:
java.lang.IllegalArgumentException: Illegal base64 character 1
What could be the reason?
The image suggests that the string you are trying to decode contains characters like SOH and BEL. These are ASCII control characters and will never appear in a Base64-encoded string.
(Base64 typically consists of letters, digits, and the characters +, / and =. There are some variant formats, but control characters are never included.)
This is confirmed by the exception message:
java.lang.IllegalArgumentException: Illegal base64 character 1
The SOH character has ASCII code 1.
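For illustration, a minimal sketch (not part of the original answer) that reproduces the error by feeding the JDK decoder a byte with value 1 (SOH):
public class BadBase64Demo {
    public static void main(String[] args) {
        byte[] input = { 1 }; // an SOH control character, not part of any Base64 alphabet
        java.util.Base64.getDecoder().decode(input); // throws IllegalArgumentException: Illegal base64 character 1
    }
}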
Conclusions:
You cannot decode that string as if it was Base64. It won't work.
It looks like the string is not "encoded" at all ... in the normal sense of what "encoding" means in Java.
We can't advise you on what you should do with it without a clear explanation of:
where the (binary) data comes from,
what you expected it to contain, and
how you read the data and turned it into a Java String object: show us the code that did that!
The UserData field in the picture in the question actually contains the byte representation of hexadecimal characters.
So I don't need to decode Base64. I need to copy the string to a byte array and get the equivalent hexadecimal characters of the byte array.
String[] split = comment.split("=");
if (split[0].equals("UserData")) {
    System.out.println(split[1]);
    byte[] callidArray = Arrays.copyOf(split[1].getBytes(), 9);
    System.out.println("UserData:" + Hex.encodeHexString(callidArray).toString());
}
Output:
UserData:010a20077100000000
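For completeness, a self-contained sketch of the same approach using only the JDK (String.format in place of commons-codec's Hex, and an explicit charset so getBytes() does not depend on the platform default); the raw bytes are a hypothetical reconstruction of the output shown above:
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class UserDataHexDemo {
    public static void main(String[] args) {
        // Hypothetical raw bytes reconstructed from the output shown above
        byte[] rawUserData = { 0x01, 0x0a, 0x20, 0x07, 0x71, 0x00, 0x00, 0x00, 0x00 };
        String comment = "UserData=" + new String(rawUserData, StandardCharsets.ISO_8859_1);

        String[] split = comment.split("=");
        if (split[0].equals("UserData")) {
            byte[] callidArray = Arrays.copyOf(split[1].getBytes(StandardCharsets.ISO_8859_1), 9);
            StringBuilder hex = new StringBuilder("UserData:");
            for (byte b : callidArray) {
                hex.append(String.format("%02x", b)); // two hex digits per byte
            }
            System.out.println(hex); // prints UserData:010a20077100000000
        }
    }
}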

Response has 2 bytes per character in Netty

In Netty, I create a response by feeding a String into the body:
DefaultFullHttpResponse res = new DefaultFullHttpResponse(HTTP_1_1, httpResponse.getHttpResponseStatus());
if (body != null) {
    ByteBuf buf = Unpooled.copiedBuffer(body, CharsetUtil.UTF_8);
    res.content().writeBytes(buf);
    buf.release();
    res.headers().set(HttpHeaderNames.CONTENT_LENGTH, res.content().readableBytes());
}
When I look at the response, I see the content-length being twice the number of characters in the String. I understand that a Java String uses 2 bytes per character internally, but I can't figure out how to prevent this in Netty when returning the response.
When I look at Cloudflare responses, they contain one byte per character. So there must be a way to change this. Ideas?
As @Chris O'Toole shows in How to convert a netty ByteBuf to a String and vice versa, we must
first convert the String to a byte array using the desired charset (UTF-8 works fine) via String.getBytes(Charset),
then use Unpooled.wrappedBuffer(byte[]) with that byte array instead of the String.
That gives one byte per character for most characters, as @rossum stated.
Alternatively, use the US_ASCII charset instead of UTF-8. I haven't tested it, but give it a try.
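Putting those steps together, a minimal sketch (the class and method names are hypothetical; it assumes body is non-null and that status is the HttpResponseStatus the question obtains from httpResponse.getHttpResponseStatus()):
import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import io.netty.handler.codec.http.DefaultFullHttpResponse;
import io.netty.handler.codec.http.HttpHeaderNames;
import io.netty.handler.codec.http.HttpResponseStatus;
import io.netty.util.CharsetUtil;
import static io.netty.handler.codec.http.HttpVersion.HTTP_1_1;

class ResponseFactory {
    static DefaultFullHttpResponse buildResponse(String body, HttpResponseStatus status) {
        byte[] bodyBytes = body.getBytes(CharsetUtil.UTF_8); // one byte per character for ASCII text
        ByteBuf buf = Unpooled.wrappedBuffer(bodyBytes);      // wraps the array, no extra copy
        DefaultFullHttpResponse res = new DefaultFullHttpResponse(HTTP_1_1, status, buf);
        res.headers().set(HttpHeaderNames.CONTENT_LENGTH, buf.readableBytes());
        return res;
    }
}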

Why am I not able to send a header with a character like / (forward slash)?

I am getting a cookie value in jstring. The server sends it as a Base64-encoded UTF-8 string. I compared the string from the server with the one on my end, and they are exactly the same.
Now I need to decorate this value with n= as a prefix and ; as a suffix (which I am doing on line no. 2 of the code).
If I do not use line no. 1, the string arrives as null at the Java server. Otherwise the server gets the value.
jstring = [jstring stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
NSString *cookieVal = [NSString stringWithFormat:@"n=%@%@", jstring, @";"];
[self.requestSerializer setValue:cookieVal forHTTPHeaderField:@"Cookie"];
We are using AFNetworking in iOS for the request and response. We have observed a very strange pattern:
if the string contains a / (forward slash), we get a padding error on the Java server; if the string doesn't contain /, the string goes through as required.
As you can see on line no. 3, we are sending this value as a header of an HTTP/HTTPS request.
I have tried many things, like this (I tried the very last code with my string). I also tried using a different encoding, but the problem still persists.
That URL conversion does not convert all of the special characters available on the iOS device keyboard.
We have to convert the string with the function below; use it as a category.
- (NSString *)URLEncodedString_ch {
    return (NSString *)CFBridgingRelease(CFURLCreateStringByAddingPercentEscapes(NULL, (CFStringRef)self, NULL,
        (CFStringRef)@"!*'\"();:@&=+$,/?%#[]%~_. ",
        CFStringConvertNSStringEncodingToEncoding(NSUTF8StringEncoding)));
}

Convert a Java string to an XML that contains valid UTF-8 characters

Here is what I was doing:
1. Take a document (JSON) from MongoDB
2. Write the key/value pairs as XML
3. Send this XML to Apache Solr for indexing
Here is how I was doing step #2. Given a key, say "key1", and a value "value1", the output of step #2 is
"<"+ key1 + ">" + value1 + "</"+ key1 + ">"
Now when I send this XML to Solr, I get StAX exceptions like:
Invalid UTF-8 start byte 0xb7
Invalid UTF-8 start byte 0xa0
Invalid UTF-8 start byte 0xb0
Invalid UTF-8 start byte 0x96
So here is how I am thinking of fixing it:
key1New = new String(key1.getBytes("UTF-8"), "UTF-8");
value1New = new String(value1.getBytes("UTF-8"), "UTF-8");
Should this work, or should I rather do this:
key1New = new String(key1.getBytes("UTF-8"), "ISO-8859-1");
value1New = new String(value1.getBytes("UTF-8"), "ISO-8859-1");
Java String objects don't have encodings. An encoding, in this context, makes sense when associated with a byte[]. Try something like this:
byte[] utf8xmlBytes = originalxmlString.getBytes("UTF8");
and send these bytes.
EDIT: Also, consider the comment of Jon Skeet. It is usually a good idea to create XML using an API unless you have a very small amount of XML.
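A minimal sketch of that advice (not from the original answer), using the JDK's StAX writer so escaping and byte encoding are handled for you; the element name and value are placeholders standing in for the key/value pairs from MongoDB:
import java.io.ByteArrayOutputStream;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class XmlUtf8Demo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        XMLStreamWriter w = XMLOutputFactory.newFactory().createXMLStreamWriter(out, "UTF-8");
        w.writeStartDocument("UTF-8", "1.0");
        w.writeStartElement("key1");                 // placeholder key
        w.writeCharacters("value1 with a £ sign");   // special characters are escaped/encoded for you
        w.writeEndElement();
        w.writeEndDocument();
        w.close();

        byte[] utf8xmlBytes = out.toByteArray();     // send these bytes to Solr, not a platform-encoded String
        System.out.println(out.toString("UTF-8"));
    }
}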

UTF-8 conversion for text obtained from the internet

ElasticSearch is a search server which accepts data only in UTF-8.
When I try to give ElasticSearch the following text
"Small businesses potentially in line for a lighter reporting load include those with an annual turnover of less than £440,000, net assets of less than £220,000 and fewer than ten employees"
through my Java application - basically my Java application takes this info from a webpage and gives it to ElasticSearch - ES complains that it can't understand £ and it fails. After filtering through the code below:
byte bytes[] = s.getBytes("ISO-8859-1");
s = new String(bytes, "UTF-8");
Here £ is converted to �
But when I copy it to a file in my home directory using bash, it goes in fine. Any pointers would help.
You have ISO-8859-1 octets in bytes, which you then tell String to decode as if they were UTF-8. When it does that, it doesn't recognize the illegal 0xA3 sequence and replaces it with the replacement character.
To fix this, you have to construct the string with the encoding the data actually uses, then convert it to the encoding that you want. See How do I convert between ISO-8859-1 and UTF-8 in Java?.
UTF-8 is easier than one thinks. In a String, everything is Unicode characters.
Bytes/string conversion is done as follows.
(Note: Cp1252, or Windows-1252, is the Windows Latin-1 extension of ISO-8859-1; better to use that one.)
// Read bytes as Windows-1252 (Cp1252) and write them back out as UTF-8:
BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), "Cp1252"));
PrintWriter out = new PrintWriter(
        new OutputStreamWriter(new FileOutputStream(file), "UTF-8"));

// For a servlet response:
response.setContentType("text/html; charset=UTF-8");
response.setCharacterEncoding("UTF-8");

// Non-ASCII characters can be escaped in source code:
String s = "20 \u00A3"; // £
To see why Cp1252 is more suitable than ISO-8859-1:
http://en.wikipedia.org/wiki/Windows-1252
A String s is a series of characters that are basically independent of any character encoding (ok, not exactly independent, but close enough for our needs here). Whatever encoding your data was in when you loaded it into a String has already been decoded. The decoding was done either with the system default encoding (which is practically ALWAYS AN ERROR; do not ever use the system default encoding, trust me, I have over 10 years of experience dealing with bugs related to wrong default encodings) or with the encoding you explicitly specified when you loaded the data.
When you call getBytes("ISO-8859-1") for a String, you request that the String is encoded into bytes according to ISO-8859-1 encoding.
When you create a String from a byte array, you need to specify the encoding in which the characters in the byte array are represented. Here you create a string from a byte array as if it had been encoded in UTF-8, but just above you encoded it in ISO-8859-1; that is your error.
What you want to do is:
byte bytes[] = s.getBytes("UTF-8");
s = new String(bytes, "UTF-8");
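To see the difference, a minimal self-contained sketch (the sample text is a shortened version of the sentence from the question): mismatched encodings turn £ (0xA3 in ISO-8859-1, which is not a valid UTF-8 start byte) into the replacement character, while matched encodings round-trip the text unchanged.
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    public static void main(String[] args) {
        String s = "less than £440,000";

        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);        // £ becomes the single byte 0xA3
        System.out.println(new String(latin1, StandardCharsets.UTF_8)); // prints "less than �440,000"

        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);               // £ becomes the two bytes 0xC2 0xA3
        System.out.println(new String(utf8, StandardCharsets.UTF_8));   // round-trips correctly
    }
}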
