Java Convert strange string to Burmese language string - java

Hi, my example code is like this:
String ln = "á€á€­á€•á€¹á€•á€¶á€”ဲ့";
try {
    byte[] b = ln.getBytes("UTF-8");
    String s = new String(b, "US-ASCII");
    System.out.println(s);
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
When I run it, it does not print Burmese. Is there a solution for that? Thanks

The real problem is that the server is sending back content either with the wrong charset, or double-encoded. If at all possible, you should get that fixed.
In the meantime, you have the right idea—converting the mis-encoded text to the correct charset.
Each character in your String was apparently supposed to be a single byte of a UTF-8 byte sequence. What you are actually seeing is each of those bytes being treated as a character in the Windows cp1252 charset and converted to a Java char accordingly.
So, you first want to convert the chars from cp1252 back into the proper bytes:
byte[] b = ln.getBytes("cp1252");
Now you have a true UTF-8 byte sequence, which you can convert into the proper String:
String s = new String(b, StandardCharsets.UTF_8);
// In Java 6, you must use:
//String s = new String(b, "UTF-8");
You should never use US-ASCII if you are decoding, or trying to generate, Burmese characters, or any non-English characters. ASCII consists of codepoints 0 through 127 only.
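Putting the two steps together, here is a minimal self-contained sketch of the repair. The class and variable names are illustrative, and Cyrillic "Ж" stands in for the Burmese text since the principle is identical: re-encode the mis-decoded chars back to bytes with cp1252, then decode those bytes as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FixMojibake {

    // Reverse a cp1252-for-UTF-8 mix-up: the chars of `garbled` are really
    // the bytes of a UTF-8 sequence that were decoded as windows-1252.
    static String fix(String garbled) {
        byte[] utf8Bytes = garbled.getBytes(Charset.forName("windows-1252"));
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate the damage: the UTF-8 bytes of "Ж" (0xD0 0x96),
        // decoded as cp1252, come out as "Ð–".
        String good = "Ж";
        String garbled = new String(good.getBytes(StandardCharsets.UTF_8),
                                    Charset.forName("windows-1252"));
        System.out.println(garbled);      // Ð–
        System.out.println(fix(garbled)); // Ж
    }
}
```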

Related

Convert string representation of a hexadecimal byte array to a string with non ascii characters in Java

I have a String being sent in the request payload by a client as:
"[0xc3][0xa1][0xc3][0xa9][0xc3][0xad][0xc3][0xb3][0xc3][0xba][0xc3][0x81][0xc3][0x89][0xc3][0x8d][0xc3][0x93][0xc3][0x9a]Departms"
I want to get a String which is "áéíóúÁÉÍÓÚDepartms". How can I do this in Java?
The problem is that I have no control over the way client encodes this string. Seems like the client is just encoding the non-ascii characters in this format and sends the ascii chars as it is(see 'Departms' at the end).
The stuff within the square brackets seems to be characters encoded in UTF-8 but converted into a hexadecimal string in a weird way. What you can do is find each instance that looks like [0xc3], convert it into the corresponding byte, and then create a new string from the bytes.
Unfortunately the standard library has no good tools for this kind of byte-array surgery. Here's a quick-and-dirty solution that uses a regex to replace each hex code with the corresponding latin-1 character, and then fixes the result by re-interpreting the bytes as UTF-8.
String bracketDecode(String str) {
    Pattern p = Pattern.compile("\\[(0x[0-9a-f]{2})\\]");
    Matcher m = p.matcher(str);
    StringBuilder sb = new StringBuilder();
    while (m.find()) {
        int decode = Integer.decode(m.group(1));
        // assume latin-1 encoding for now
        m.appendReplacement(sb, String.valueOf((char) decode));
    }
    m.appendTail(sb);
    // oh no, latin-1 is not correct! re-interpret the bytes as UTF-8
    byte[] bytes = sb.toString().getBytes(StandardCharsets.ISO_8859_1);
    return new String(bytes, StandardCharsets.UTF_8);
}
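As a quick check, the helper can be exercised on the sample payload from the question. This is a self-contained demo (the class name is just for the demo) that embeds the same decoding logic:

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BracketDecodeDemo {

    static String bracketDecode(String str) {
        Pattern p = Pattern.compile("\\[(0x[0-9a-f]{2})\\]");
        Matcher m = p.matcher(str);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            int b = Integer.decode(m.group(1));
            // stash each byte as a latin-1 char for now
            m.appendReplacement(sb, String.valueOf((char) b));
        }
        m.appendTail(sb); // plain text such as "Departms" passes through
        // re-interpret the stashed bytes as UTF-8
        byte[] bytes = sb.toString().getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String payload = "[0xc3][0xa1][0xc3][0xa9][0xc3][0xad][0xc3][0xb3][0xc3][0xba]"
                       + "[0xc3][0x81][0xc3][0x89][0xc3][0x8d][0xc3][0x93][0xc3][0x9a]Departms";
        System.out.println(bracketDecode(payload)); // áéíóúÁÉÍÓÚDepartms
    }
}
```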

FileInputStream read method keeps returning 194

I'm teaching myself Java IO currently and I'm able to read basic ASCII characters from a .txt file, but when I get to other Latin-1 characters in the 0-255 range it prints 194 instead of the character's correct decimal number.
For example, I can read abcdefg from the txt file, but if I throw in a character like © I don't get 169; for some reason I get 194. I tried testing this out by just printing all chars between 1 and 255 with a loop, and that works; reading this input does not, so I'm a little perplexed. I understand I can use a Reader object, but I want to cover the basics first by learning the byte streams. Here is what I have though:
InputStream io = null;
try {
    io = new FileInputStream("thing.txt");
    int yeet = io.read();
    System.out.println(yeet);
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
UTF-8 encoding table and Unicode characters
You can see there that the hex code for © is c2 a9, i.e. 194 169. This is not an endianness problem; © is simply encoded as two bytes in UTF-8, and read() returns one byte per call, so your single call returns the first byte, 194. A second read() would return 169.
P.S. Read a file character by character/UTF8 is another good example of Java encodings, code points, etc.
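To see the two UTF-8 bytes of © and the character-level read side by side, here is a small sketch (the temp-file handling is mine; the original used a fixed "thing.txt"):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadCopyright {
    public static void main(String[] args) throws IOException {
        // Write "©" to a temp file; UTF-8 encodes it as the two bytes 0xC2 0xA9.
        Path file = Files.createTempFile("thing", ".txt");
        Files.write(file, "©".getBytes(StandardCharsets.UTF_8));

        // Byte stream: read() returns one byte per call.
        try (InputStream in = new FileInputStream(file.toFile())) {
            System.out.println(in.read()); // 194 (0xC2, first byte of the sequence)
            System.out.println(in.read()); // 169 (0xA9, second byte)
        }

        // Character stream: the reader decodes the whole sequence into one char.
        try (Reader r = new InputStreamReader(new FileInputStream(file.toFile()),
                                              StandardCharsets.UTF_8)) {
            System.out.println(r.read()); // 169 (the code point of '©' is U+00A9)
        }
        Files.delete(file);
    }
}
```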
I have some solutions for you.
The first solution
There is a book covering this in full, linked on this site.
The second solution
I have some sample code for you:
public class Example {
    public static void main(String[] args) throws Exception {
        String str = "hey\u6366";
        byte[] charset = str.getBytes("UTF-8");
        String result = new String(charset, "UTF-8");
        System.out.println(result);
    }
}
Output:
hey捦
Let us understand the above program. First we encoded the given Unicode string to UTF-8 bytes, for later verification, using the getBytes() method:
String str = "hey\u6366";
byte[] charset = str.getBytes("UTF-8");
Then we decoded the charset byte array back into a Unicode String by creating a new String object as follows:
String result = new String(charset, "UTF-8");
System.out.println(result);
Good luck

Java convert encoding

I have a string which used to be an xml tag where mojibakes are contained:
<Applicant_Place_Born>Москва</Applicant_Place_Born>
I know that exactly the same string but in correct encoding is:
<Applicant_Place_Born>Москва</Applicant_Place_Born>
I know this because using Tcl utility I can convert it into proper string:
# The original string
set s "Москва"
# substituting the html escapes
set t "Ð\x9cоÑ\x81ква"
# decode from utf-8 into Unicode
encoding convertfrom utf-8 "Ð\x9cоÑ\x81ква"
Москва
I tried different variations of this:
System.out.println(new String(original.getBytes("UTF-8"), "CP1251"));
but I always got other mojibakes or question marks instead of characters.
Q: How can I do the same as Tcl does but using Java code?
EDIT:
I have tried @Joop Eggen's approach:
import java.nio.charset.StandardCharsets;
import org.apache.commons.lang3.StringEscapeUtils;

public class s {
    static String s;

    public static void main(String[] args) {
        try {
            System.setProperty("file.encoding", "CP1251");
            System.out.println("JVM encoding: " + System.getProperty("file.encoding"));
            s = "Москва";
            System.out.println("Original text: " + s);
            s = StringEscapeUtils.unescapeHtml4(s);
            byte[] b = s.getBytes(StandardCharsets.ISO_8859_1);
            s = new String(b, "UTF-16BE");
            System.out.println("Result: " + s);
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
The converted string was something Chineese:
JVM encoding: CP1251
Original text: Москва
Result: 킜킾톁킺킲킰
A String in Java should always be correct Unicode. In your case you seem to have UTF-16BE interpreted as some single-byte encoding.
A patch would be
String string = StringEscapeUtils.unescapeHtml4(s);
byte[] b = string.getBytes(StandardCharsets.ISO_8859_1);
string = new String(b, "UTF-16BE");
Now string should be a correct Unicode String.
System.out.println(string);
If the operating system for instance is in Cp1251, the Cyrillic text should be converted correctly.
The characters in s are actually the bytes of UTF-16BE, I guess.
By getting the bytes of the string in a single-byte encoding, hopefully no conversion takes place.
Then make a String from those bytes as UTF-16BE; internally it is converted to Unicode (which is UTF-16 too).
You were pretty close. However, getBytes encodes a String to bytes; encoding and immediately decoding with the same charset is a no-op. To undo the mojibake you need what the Tcl encoding convertfrom utf-8 call does: first map the garbled chars back to their raw bytes with a single-byte charset, then decode those bytes as UTF-8. Something along the lines of
String string = StringEscapeUtils.unescapeHtml4(original);
byte[] bytes = string.getBytes(StandardCharsets.ISO_8859_1);
System.out.println(new String(bytes, StandardCharsets.UTF_8));
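As a self-contained check of that recipe (the names are illustrative, and the garbled input is simulated here by decoding UTF-8 bytes as latin-1, which is what the server appears to have done before escaping):

```java
import java.nio.charset.StandardCharsets;

public class CyrillicFix {

    // Undo "UTF-8 bytes decoded as a single-byte charset" mojibake.
    static String fix(String garbled) {
        byte[] rawBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String good = "Москва";
        // Simulate the damage: UTF-8 bytes misread as latin-1.
        String garbled = new String(good.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        System.out.println(garbled);      // mojibake (includes control chars)
        System.out.println(fix(garbled)); // Москва
    }
}
```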

How to convert into cyrillic

Good day.
I got string like this from server
\u041a\u0438\u0441\u0435\u043b\u0435\u0432 \u0410\u043d\u0434\u0440\u0435\u0439
I need to convert it into cyrillic cp-1251 string.
How do i do it? Thank you.
If that is a literal sequence of characters that must decoded, you'll need to first start with something like this (assuming your input is in the string input):
StringBuffer decodedInput = new StringBuffer();
Matcher match = Pattern.compile("\\\\u([0-9a-fA-F]{4})| ").matcher(input);
while (match.find()) {
    String character = match.group(1);
    if (character == null)
        decodedInput.append(match.group()); // a plain space
    else
        decodedInput.append((char) Integer.parseInt(character, 16));
}
At this point, you should have java string representation of your input in decodedInput.
If your system supports the cp1251 charset, you can then convert that to cp1251 with something like this (note that Java's alias for it is "cp1251" or "windows-1251", without the extra hyphen):
Charset cp1251charset = Charset.forName("windows-1251");
ByteBuffer output = cp1251charset.encode(decodedInput.toString());
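Putting the decoding loop and the encoding step together into one runnable sketch (the class and method names are just for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeEscapes {

    // Turn literal \uXXXX escapes (and spaces) into the chars they denote.
    static String decode(String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = Pattern.compile("\\\\u([0-9a-fA-F]{4})| ").matcher(input);
        while (m.find()) {
            if (m.group(1) == null)
                out.append(m.group()); // a plain space
            else
                out.append((char) Integer.parseInt(m.group(1), 16));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "\\u041a\\u0438\\u0441\\u0435\\u043b\\u0435\\u0432"
                     + " \\u0410\\u043d\\u0434\\u0440\\u0435\\u0439";
        String name = decode(input);
        System.out.println(name); // Киселев Андрей
        // Encode to windows-1251 bytes if a legacy consumer needs them.
        ByteBuffer cp1251 = Charset.forName("windows-1251").encode(name);
        System.out.println(cp1251.remaining()); // 14 single-byte characters
    }
}
```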

Letter with trema being shown as percentage sign

In my program I convert a byte stream I get as input to a String. But when the bytestream contains words with a ë, this letter is converted to a %. How do I fix this?
Thx
The % almost certainly means the byte stream is being decoded with the wrong (or platform default) charset, so always name the charset explicitly when converting between bytes and Strings.
To encode a String as UTF-8, invoke its getBytes method and pass the encoding identifier as a parameter; it returns an array of bytes in UTF-8 format. To create a String object from an array of non-Unicode bytes, invoke the String constructor with the encoding parameter. Refer to this:
try {
    byte[] utf8Bytes = original.getBytes("UTF8");
    byte[] defaultBytes = original.getBytes(); // platform default charset
    String roundTrip = new String(utf8Bytes, "UTF8");
    System.out.println("roundTrip = " + roundTrip);
    System.out.println();
    // printBytes is a helper (from the Java tutorial this snippet is based on)
    // that prints the numeric value of each byte in the array.
    printBytes(utf8Bytes, "utf8Bytes");
    System.out.println();
    printBytes(defaultBytes, "defaultBytes");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
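A minimal sketch of the charset-explicit conversion that keeps the ë intact (the word and class name are my own examples; the input bytes are simulated rather than read from a real stream):

```java
import java.nio.charset.StandardCharsets;

public class TremaDemo {

    // Decode an incoming byte stream explicitly as UTF-8.
    static String decode(byte[] streamBytes) {
        return new String(streamBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] input = "Citroën".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(input)); // Citroën - the ë survives
        // Decoding the same bytes as US-ASCII would mangle the two bytes of ë
        // into replacement characters instead:
        System.out.println(new String(input, StandardCharsets.US_ASCII));
    }
}
```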
