Java UTF-8 string serialization \xF2 instead of \u00F2 - java

I have a string with UTF8 characters in it, and I'm using StringEntity to put it into an HttpEntityEnclosingRequestBase and send it to a server.
My problem is that the UTF8 characters are encoded as \xF2, whereas the server expects \u00f2. How can I fix this? Or how can I easily convert a UTF-8 string to a string that contains \u00f2-style substrings instead of the UTF8 chars?
Solution:
In the end, the solution was to pass the charset explicitly:
new StringEntity(string, "UTF-8");
Thanks in advance, David

You can convert between Java's internal character encoding (UTF-16) and UTF-8 byte sequences in a variety of ways. The simplest is:
byte[] utf8data = "my string".getBytes("UTF-8");
String myString = new String(utf8data, "UTF-8");
There are also stream-oriented classes that can translate between byte streams and character streams using an encoding. See the java.io package.
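As a rough sketch of the stream-oriented approach mentioned above, the snippet below wraps an InputStream in an InputStreamReader and an OutputStream in an OutputStreamWriter, each with an explicit charset. The class name StreamTranscoder and the helper transcode are made up for illustration; only the java.io and java.nio.charset classes are real.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class StreamTranscoder {
    // Hypothetical helper: decodes bytes from `in` using `from` and
    // re-encodes the resulting characters to `out` using `to`.
    static void transcode(InputStream in, Charset from, OutputStream out, Charset to) {
        try {
            Reader reader = new InputStreamReader(in, from);
            Writer writer = new OutputStreamWriter(out, to);
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
            writer.flush(); // push any buffered characters through the encoder
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xF2}; // o with grave accent in ISO-8859-1
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transcode(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1,
                  out, StandardCharsets.UTF_8);
        for (byte b : out.toByteArray()) {
            System.out.printf("%02X ", b); // hex dump of the UTF-8 bytes
        }
        System.out.println();
    }
}
```

The same pattern works with file or socket streams; the point is that the charset is named on both sides rather than left to the platform default.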

Just do
String str = ...;
str = str.replace("\\x", "\\u00");
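If the goal is to produce the \u00f2-style escapes directly from the original string rather than patching \x sequences afterwards, something like the following might do. It is only a sketch: UnicodeEscaper and escapeNonAscii are invented names, and it assumes the server wants every char outside printable ASCII written as a four-digit lowercase hex escape.

```java
final class UnicodeEscaper {
    // Hypothetical helper: keeps printable ASCII as-is and writes every other
    // UTF-16 code unit as a backslash-u escape with four hex digits.
    static String escapeNonAscii(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= 0x20 && c < 0x7F) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The literal below contains U+00F2 (o with grave accent).
        System.out.println(escapeNonAscii("per\u00F2"));
    }
}
```

Characters outside the Basic Multilingual Plane come out as two escapes (a surrogate pair), which is what JSON and Java source expect. Whether the server accepts that form is an assumption to check.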

Related

Input byte array has incorrect ending byte at 40

I have a string that is base64 encoded. It looks like this:
eyJibGExIjoiYmxhMSIsImJsYTIiOiJibGEyIn0=
Any online tool can decode this to the proper string which is {"bla1":"bla1","bla2":"bla2"}. However, my Java implementation fails:
import java.util.Base64;
System.out.println("payload = " + payload);
String json = new String(Base64.getDecoder().decode(payload));
I'm getting the following error:
payload = eyJibGExIjoiYmxhMSIsImJsYTIiOiJibGEyIn0=
java.lang.IllegalArgumentException: Input byte array has incorrect ending byte at 40
What is wrong with my code?
Okay, I found out. The original string is encoded on an Android device using android.util.Base64, via Base64.encodeToString(json.getBytes("UTF-8"), Base64.DEFAULT), which uses the android.util.Base64.DEFAULT encoding scheme.
On the server side, java.util.Base64 then has to decode it with Base64.getMimeDecoder().decode(payload), not with Base64.getDecoder().decode(payload).
I was trying to use the strings from the args. I found that using arg[0].trim() made it work, e.g.:
Base64.getDecoder().decode(arg[0].trim());
I guess there's some sort of whitespace that gets it messed up.
Maybe too late, but I also had this problem.
By default, the Android Base64 util adds a newline character to the end of the encoded string.
The Base64.NO_WRAP flag tells the util to create the encoded string without the newline character.
Your Android app should encode src something like this:
String encode = Base64.encodeToString(src.getBytes(), Base64.NO_WRAP);
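The behaviour described in these answers can be reproduced on the server side with a small sketch like the one below. It assumes the payload from the question arrives with a trailing newline; Base64NewlineDemo and its two helpers are illustrative names, not part of any library.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Base64NewlineDemo {
    // Decodes with the MIME decoder, which skips characters outside the
    // Base64 alphabet (such as line separators).
    static String decodeMime(String payload) {
        return new String(Base64.getMimeDecoder().decode(payload), StandardCharsets.UTF_8);
    }

    // Returns false if the basic decoder rejects the payload.
    static boolean basicDecoderAccepts(String payload) {
        try {
            Base64.getDecoder().decode(payload);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Payload from the question plus the assumed trailing newline.
        String payload = "eyJibGExIjoiYmxhMSIsImJsYTIiOiJibGEyIn0=\n";
        System.out.println("basic decoder accepts raw payload: " + basicDecoderAccepts(payload));
        System.out.println("basic decoder accepts trimmed:     " + basicDecoderAccepts(payload.trim()));
        System.out.println("MIME decoder result:               " + decodeMime(payload));
    }
}
```

So there are three ways out: trim on the server, use the MIME decoder on the server, or encode with Base64.NO_WRAP on the device.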

UTF-8 -- ISO 8859-1 mapping tool

When I convert a UTF-8 string containing characters that do not exist in ISO 8859-1 to ISO 8859-1, I get question marks here and there. Of course, what else could the encoder do?
Is there a Java tool that can map a string like "İKEA" to "IKEA", avoiding the ? and making the best of it?
For the specific example, you can:
decompose the letters and diacritics using compatibility form Unicode normalization
instruct the encoder to drop unsupported characters (the diacritics)
Example:
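import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.text.Normalizer.Form;

ByteArrayOutputStream out = new ByteArrayOutputStream();
// create an encoder that silently drops characters ISO 8859-1 cannot represent
CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
encoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
// decompose "İ" into "I" plus a combining dot, then encode
String ikea = "\u0130KEA";
String decomposed = Normalizer.normalize(ikea, Form.NFKD);
CharBuffer cbuf = CharBuffer.wrap(decomposed);
ByteBuffer bbuf = encoder.encode(cbuf);
// write only the encoded bytes; the backing array can be longer than the output
out.write(bbuf.array(), bbuf.arrayOffset() + bbuf.position(), bbuf.remaining());
// verify
String decoded = new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
System.out.println(decoded);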
You're still transcoding from a character set that defines 109,384 values (Unicode 6) to one that supports 256, so there will always be limitations.
Also consider a more sophisticated transformation API like ICU for features like transliteration.

Base64 InputStream to String

I have been trying to read a file through an input stream. The file is plain text with some images and other files embedded in base64, and I want to write it into a String while keeping the encoding. I mean, I want the String to contain something like:
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAoHBwgHBgoICAgLCgoLDhgQDg0NDh0VFhEYIx8lJCIf
IiEmKzcvJik0KSEiMEExNDk7Pj4+JS5ESUM8SDc9Pjv/2wBDAQoLCw4NDhwQEBw7KCIoOzs7Ozs7
I have been trying classes such as Base64InputStream from packages like org.apache.commons.codec, but I just cannot figure it out. Any kind of help would be really appreciated. Thanks in advance!
Edit
Piece of code using a reader:
BufferedReader br= new BufferedReader(new InputStreamReader(bodyPart.getInputStream()));
StringBuilder sb = new StringBuilder();
String line;
while ((line = br.readLine()) != null) {
sb.append(line);
}
br.close();
Getting as a result something like: .DIC;ÿÛC;("(;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;ÿÀ##"ÿÄ
Have you tried doing this:
final byte[] bytes64bytes = Base64.encodeBase64(IOUtils.toByteArray(is));
final String content = new String(bytes64bytes);
A text file containing some base64 data can be read with the charset of the rest of the file.
Base64 encoding is a means of encoding bytes in a limited set of characters that are unchanged in almost all character encodings, for example ASCII or UTF-8.
Base64 isn't a charset encoding, you don't have to specify you have some base64 encoded data when reading a file into a string.
So if your text file is generally UTF-8 (that's probable), you can read it without problem even if it contains a base64 encoded stream. Simply use a basic reader and don't use a Base64InputStream if you don't want to decode it.
When opening a file with a reader, you have to specify the encoding. If you don't know it, I suggest you test with the probable ones, like UTF-8, US-ASCII or ISO-8859-1.
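To make that concrete, here is a minimal sketch of reading a stream into a String with an explicitly named charset and no Base64 decoding at all. TextWithBase64Reader and readAll are hypothetical names, and UTF-8 is only an assumption about what the file actually uses.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class TextWithBase64Reader {
    // Hypothetical helper: reads the whole stream as text in the given charset.
    // Base64 sections pass through untouched because they are plain ASCII.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n'); // keep the line breaks
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample text: one accented word followed by a Base64 line.
        String sample = "Caf\u00E9 image:\n/9j/4AAQSkZJRgABAQEAYABgAAD/2wBD\n";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.print(readAll(in, StandardCharsets.UTF_8));
    }
}
```

If the stream already delivers decoded binary (as the garbled output in the question suggests), this will not bring the Base64 text back; in that case the data has to be re-encoded or read from the raw, undecoded source.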
If you have a normal InputStream object, then you can get a Base64-encoded stream from it directly using the Base64InputStream constructor from the Apache Commons Codec library.
I found the solution, inspired by this post: "getting base64 content string of an image from a mimepart in Java".
I think it is kind of stupid to decode and encode the base64 again, but it is the only way I found to manage this issue. If someone could give a better solution, it would be really appreciated.
Thanks

UTF8 conversion for text obtained from internet

ElasticSearch is a search server which accepts data only in UTF8.
When I try to give ElasticSearch the following text
Small businesses potentially in line for a lighter reporting load include those with an annual turnover of less than £440,000, net assets of less than £220,000 and fewer than ten employees
through my Java application, it fails. Basically, my Java application takes this text from a web page and gives it to ElasticSearch. ES complains that it can't understand £ and fails. After filtering through the code below:
byte bytes[] = s.getBytes("ISO-8859-1");
s = new String(bytes, "UTF-8");
Here £ is converted to �.
But when I copy the text to a file in my home directory using bash, it goes in fine. Any pointers would help.
You have ISO-8859-1 octets in bytes, which you then tell String to decode as if they were UTF-8. A lone 0xA3 byte is not a valid UTF-8 sequence, so the decoder replaces it with the substitution character.
To do this, you have to construct the string with the encoding it uses, then convert it to the encoding that you want. See How do I convert between ISO-8859-1 and UTF-8 in Java?.
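A minimal sketch of that idea, assuming the incoming bytes really are ISO-8859-1 (the class and method names are made up for illustration): decode with the charset the bytes are in, then encode with the one you want.

```java
import java.nio.charset.StandardCharsets;

final class Latin1ToUtf8 {
    // Hypothetical helper: decode the bytes with the charset they are
    // actually in, then encode the resulting String as UTF-8.
    static byte[] latin1ToUtf8(byte[] latin1Bytes) {
        String text = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] pound = {(byte) 0xA3}; // pound sign in ISO-8859-1

        // Wrong: treating the ISO-8859-1 byte as UTF-8 yields U+FFFD.
        String wrong = new String(pound, StandardCharsets.UTF_8);
        System.out.println("decoded as UTF-8 gives replacement char: " + wrong.equals("\uFFFD"));

        // Right: decode as ISO-8859-1, then encode as UTF-8.
        for (byte b : latin1ToUtf8(pound)) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```

If the web page is actually served as Windows-1252 or UTF-8, the first charset needs to change accordingly; guessing it wrong produces exactly the kind of corruption shown in the question.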
UTF-8 is easier than one thinks. In String everything is Unicode characters.
Bytes/string conversion is done as follows.
(Note Cp1252 or Windows-1252 is the Windows Latin1 extension of ISO-8859-1; better use that one.)
// reading bytes as Cp1252 text
BufferedReader in = new BufferedReader(
new InputStreamReader(new FileInputStream(file), "Cp1252"));
// writing text as UTF-8 bytes
PrintWriter out = new PrintWriter(
new OutputStreamWriter(new FileOutputStream(file), "UTF-8"));
// servlet response
response.setContentType("text/html; charset=UTF-8");
response.setCharacterEncoding("UTF-8");
String s = "20 \u00A3"; // Escaping
To see why Cp1252 is more suitable than ISO-8859-1:
http://en.wikipedia.org/wiki/Windows-1252
String s is a series of characters that are basically independent of any character encoding (OK, not exactly independent, but close enough for our needs now). Whatever encoding your data was in when you loaded it into a String has already been decoded. The decoding was done either using the system default encoding (which is practically ALWAYS AN ERROR; do not ever use the system default encoding, trust me, I have over 10 years of experience in dealing with bugs related to wrong default encodings) or the encoding you explicitly specified when you loaded the data.
When you call getBytes("ISO-8859-1") for a String, you request that the String is encoded into bytes according to ISO-8859-1 encoding.
When you create a String from a byte array, you need to specify the encoding in which the characters in the byte array are represented. You create a string from a byte array that has been encoded in UTF-8 (and just above you encoded it in ISO-8859-1, that is your error).
What you want to do is:
byte bytes[] = s.getBytes("UTF-8");
s = new String(bytes, "UTF-8");

Java Internationalization

I have a Java string that I'm having trouble manipulating. I have a String, s, that has a value of 丞 (a Chinese character I chose at random, I don't speak Chinese). If I call
String t = new String(s.getBytes());
if (s.equals(t))
System.out.println("String unchanged");
else
System.out.println("String changed");
Then I get the String changed result. Does anyone know what's going on?
Because that method:
Encodes this String into a sequence of bytes using the platform's default charset
If your default charset is, e.g., US-ASCII, you won't get the same bytes used by that Chinese letter.
I imagine an extra bit/byte may be added/dropped in the process.
Try using getBytes( String charSetName )
public byte[] getBytes(String charsetName)
Using the correct charsetName
The getBytes() method uses the default encoding. According to the docs:
The CharsetEncoder class should be used when more control over the encoding process is required.
Actually, I figured this out, sorry for the post. I was using the default Java Charset instead of explicitly specifying UTF-8. It works now.
String t = new String(s.getBytes()); may create the string using ASCII as the default charset. Use the following constructor to create the string with charsetName set to UTF-8:
String(byte[] bytes, int offset, int length, String charsetName)
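The answers above boil down to naming the charset on both the getBytes call and the String constructor. A small sketch (RoundTripDemo and survivesRoundTrip are invented names) that compares a round trip through US-ASCII with one through UTF-8:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RoundTripDemo {
    // Hypothetical helper: encodes and decodes with the same explicit charset
    // and reports whether the string came back unchanged.
    static boolean survivesRoundTrip(String s, Charset charset) {
        String t = new String(s.getBytes(charset), charset);
        return s.equals(t);
    }

    public static void main(String[] args) {
        String s = "\u4E1E"; // the Chinese character from the question

        // US-ASCII cannot represent the character, so the encoder substitutes '?'.
        System.out.println("US-ASCII: " + survivesRoundTrip(s, StandardCharsets.US_ASCII));

        // UTF-8 can represent every Unicode character.
        System.out.println("UTF-8:    " + survivesRoundTrip(s, StandardCharsets.UTF_8));
    }
}
```

Using the StandardCharsets constants instead of a charset name string also avoids the checked UnsupportedEncodingException.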
