Java Convert strange string to Burmese language string - java

Hi, my example code is like this:
String ln = "á€á€­á€•á€¹á€•á€¶á€”ဲ့";
try {
    byte[] b = ln.getBytes("UTF-8");
    String s = new String(b, "US-ASCII");
    System.out.println(s);
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
When I run it, it does not print Burmese. Is there a solution for that? Thanks

The real problem is that the server is sending back content either with the wrong charset, or double-encoded. If at all possible, you should get that fixed.
In the meantime, you have the right idea—converting the mis-encoded text to the correct charset.
Each character in your String was apparently supposed to be a single byte of a UTF-8 byte sequence. What you are actually seeing is each of those bytes being treated as a character in the Windows cp1252 charset and converted to a Java char accordingly.
So, you first want to convert the chars from cp1252 back into the proper bytes:
byte[] b = ln.getBytes("cp1252");
Now you have a true UTF-8 byte sequence, which you can convert into the proper String:
String s = new String(b, StandardCharsets.UTF_8);
// In Java 6, you must use:
//String s = new String(b, "UTF-8");
You should never use US-ASCII if you are decoding, or trying to generate, Burmese characters, or any non-English characters. ASCII consists of codepoints 0 through 127 only.
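Putting the two steps together, here is a minimal self-contained sketch of the repair. The class and variable names are illustrative, and Cyrillic "Ж" stands in for the Burmese text since the principle is identical: re-encode the mis-decoded chars back to bytes with cp1252, then decode those bytes as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FixMojibake {

    // Reverse a cp1252-for-UTF-8 mix-up: the chars of `garbled` are really
    // the bytes of a UTF-8 sequence that were decoded as windows-1252.
    static String fix(String garbled) {
        byte[] utf8Bytes = garbled.getBytes(Charset.forName("windows-1252"));
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate the damage: the UTF-8 bytes of "Ж" (0xD0 0x96),
        // decoded as cp1252, come out as "Ð–".
        String good = "Ж";
        String garbled = new String(good.getBytes(StandardCharsets.UTF_8),
                                    Charset.forName("windows-1252"));
        System.out.println(garbled);      // Ð–
        System.out.println(fix(garbled)); // Ж
    }
}
```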

Related

Convert string representation of a hexadecimal byte array to a string with non ascii characters in Java

I have a String being sent in the request payload by a client as:
"[0xc3][0xa1][0xc3][0xa9][0xc3][0xad][0xc3][0xb3][0xc3][0xba][0xc3][0x81][0xc3][0x89][0xc3][0x8d][0xc3][0x93][0xc3][0x9a]Departms"
I want to get a String which is "áéíóúÁÉÍÓÚDepartms". How can I do this in Java?
The problem is that I have no control over the way client encodes this string. Seems like the client is just encoding the non-ascii characters in this format and sends the ascii chars as it is(see 'Departms' at the end).
The stuff within the square brackets seems to be characters encoded in UTF-8 but converted into a hexadecimal string in a weird way. What you can do is find each instance that looks like [0xc3], convert it into the corresponding byte, and then create a new string from the bytes.
Unfortunately the standard library has no good tools for this kind of byte-array surgery. Here's a quick-and-dirty solution that uses a regex to replace each hex code with the corresponding latin-1 character, and then fixes the result by re-interpreting the bytes as UTF-8.
String bracketDecode(String str) {
    Pattern p = Pattern.compile("\\[(0x[0-9a-f]{2})\\]");
    Matcher m = p.matcher(str);
    StringBuilder sb = new StringBuilder();
    while (m.find()) {
        int decode = Integer.decode(m.group(1));
        // assume latin-1 encoding for now
        m.appendReplacement(sb, String.valueOf((char) decode));
    }
    m.appendTail(sb);
    // oh no, latin-1 is not correct! re-interpret the bytes as UTF-8
    byte[] bytes = sb.toString().getBytes(StandardCharsets.ISO_8859_1);
    return new String(bytes, StandardCharsets.UTF_8);
}
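As a quick check, the helper can be exercised on the sample payload from the question. This is a self-contained demo (the class name is just for the demo) that embeds the same decoding logic:

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BracketDecodeDemo {

    static String bracketDecode(String str) {
        Pattern p = Pattern.compile("\\[(0x[0-9a-f]{2})\\]");
        Matcher m = p.matcher(str);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            int b = Integer.decode(m.group(1));
            // stash each byte as a latin-1 char for now
            m.appendReplacement(sb, String.valueOf((char) b));
        }
        m.appendTail(sb); // plain text such as "Departms" passes through
        // re-interpret the stashed bytes as UTF-8
        byte[] bytes = sb.toString().getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String payload = "[0xc3][0xa1][0xc3][0xa9][0xc3][0xad][0xc3][0xb3][0xc3][0xba]"
                       + "[0xc3][0x81][0xc3][0x89][0xc3][0x8d][0xc3][0x93][0xc3][0x9a]Departms";
        System.out.println(bracketDecode(payload)); // áéíóúÁÉÍÓÚDepartms
    }
}
```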

FileInputStream read method keeps returning 194

I'm teaching myself Java IO currently and I'm able to read basic ASCII characters from a .txt file, but when I get to other Latin-1 characters in the 0-255 range it prints 194 instead of the character's correct decimal number.
For example, I can read abcdefg from the txt file, but if I throw in a character like © I don't get 169; for some reason I get 194. I tried testing this out by just printing all chars between 1 and 255 with a loop, and that works; reading this input does not, so I'm a little perplexed. I understand I can use a Reader object, but I want to cover the basics first by learning the byte streams. Here is what I have though:
InputStream io = null;
try {
    io = new FileInputStream("thing.txt");
    int yeet = io.read();
    System.out.println(yeet);
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
UTF-8 encoding table and Unicode characters
You can see there that the hex code for © is c2 a9, i.e. 194 169. This is not an endianness problem; © is simply encoded as two bytes in UTF-8, and read() returns one byte per call, so your single call returns the first byte, 194. A second read() would return 169.
P.S. Read a file character by character/UTF8 is another good example of Java encodings, code points, etc.
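To see the two UTF-8 bytes of © and the character-level read side by side, here is a small sketch (the temp-file handling is mine; the original used a fixed "thing.txt"):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadCopyright {
    public static void main(String[] args) throws IOException {
        // Write "©" to a temp file; UTF-8 encodes it as the two bytes 0xC2 0xA9.
        Path file = Files.createTempFile("thing", ".txt");
        Files.write(file, "©".getBytes(StandardCharsets.UTF_8));

        // Byte stream: read() returns one byte per call.
        try (InputStream in = new FileInputStream(file.toFile())) {
            System.out.println(in.read()); // 194 (0xC2, first byte of the sequence)
            System.out.println(in.read()); // 169 (0xA9, second byte)
        }

        // Character stream: the reader decodes the whole sequence into one char.
        try (Reader r = new InputStreamReader(new FileInputStream(file.toFile()),
                                              StandardCharsets.UTF_8)) {
            System.out.println(r.read()); // 169 (the code point of '©' is U+00A9)
        }
        Files.delete(file);
    }
}
```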
I have some solutions for you.
The first solution
There is a book covering this in full, linked on this site.
The second solution
I have some sample code for you:
public class Example {
    public static void main(String[] args) throws Exception {
        String str = "hey\u6366";
        byte[] charset = str.getBytes("UTF-8");
        String result = new String(charset, "UTF-8");
        System.out.println(result);
    }
}
Output:
hey捦
Let us understand the above program. First we encoded the given Unicode string to UTF-8 bytes, for later verification, using the getBytes() method:
String str = "hey\u6366";
byte[] charset = str.getBytes("UTF-8");
Then we decoded the charset byte array back into a Unicode String by creating a new String object as follows:
String result = new String(charset, "UTF-8");
System.out.println(result);
Good luck

Java convert encoding

I have a string which used to be an xml tag where mojibakes are contained:
<Applicant_Place_Born>Москва</Applicant_Place_Born>
I know that exactly the same string but in correct encoding is:
<Applicant_Place_Born>Москва</Applicant_Place_Born>
I know this because using Tcl utility I can convert it into proper string:
# The original string
set s "Москва"
# substituting the html escapes
set t "Ð\x9cоÑ\x81ква"
# decode from utf-8 into Unicode
encoding convertfrom utf-8 "Ð\x9cоÑ\x81ква"
Москва
I tried different variations of this:
System.out.println(new String(original.getBytes("UTF-8"), "CP1251"));
but I always got other mojibakes or question marks instead of characters.
Q: How can I do the same as Tcl does but using Java code?
EDIT:
I have tried @Joop Eggen's approach:
import java.nio.charset.StandardCharsets;
import org.apache.commons.lang3.StringEscapeUtils;

public class s {
    static String s;

    public static void main(String[] args) {
        try {
            System.setProperty("file.encoding", "CP1251");
            System.out.println("JVM encoding: " + System.getProperty("file.encoding"));
            s = "Москва";
            System.out.println("Original text: " + s);
            s = StringEscapeUtils.unescapeHtml4(s);
            byte[] b = s.getBytes(StandardCharsets.ISO_8859_1);
            s = new String(b, "UTF-16BE");
            System.out.println("Result: " + s);
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
The converted string was something Chineese:
JVM encoding: CP1251
Original text: Москва
Result: 킜킾톁킺킲킰
A String in Java should always be correct Unicode. In your case you seem to have UTF-16BE interpreted as some single-byte encoding.
A patch would be
String string = StringEscapeUtils.unescapeHtml4(s);
byte[] b = string.getBytes(StandardCharsets.ISO_8859_1);
string = new String(b, "UTF-16BE");
Now string should be a correct Unicode String.
System.out.println(string);
If the operating system for instance is in Cp1251, the Cyrillic text should be converted correctly.
The characters in s are actually the bytes of UTF-16BE, I guess.
By getting the bytes of the string in a single-byte encoding, hopefully no conversion takes place.
Then make a String from those bytes as UTF-16BE; internally it is converted to Unicode (which is UTF-16 too).
You were pretty close. However, getBytes encodes a String to bytes; encoding and immediately decoding with the same charset is a no-op. To undo the mojibake you need what the Tcl encoding convertfrom utf-8 call does: first map the garbled chars back to their raw bytes with a single-byte charset, then decode those bytes as UTF-8. Something along the lines of
String string = StringEscapeUtils.unescapeHtml4(original);
byte[] bytes = string.getBytes(StandardCharsets.ISO_8859_1);
System.out.println(new String(bytes, StandardCharsets.UTF_8));
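As a self-contained check of that recipe (the names are illustrative, and the garbled input is simulated here by decoding UTF-8 bytes as latin-1, which is what the server appears to have done before escaping):

```java
import java.nio.charset.StandardCharsets;

public class CyrillicFix {

    // Undo "UTF-8 bytes decoded as a single-byte charset" mojibake.
    static String fix(String garbled) {
        byte[] rawBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String good = "Москва";
        // Simulate the damage: UTF-8 bytes misread as latin-1.
        String garbled = new String(good.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        System.out.println(garbled);      // mojibake (includes control chars)
        System.out.println(fix(garbled)); // Москва
    }
}
```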

How to convert into cyrillic

Good day.
I got string like this from server
\u041a\u0438\u0441\u0435\u043b\u0435\u0432 \u0410\u043d\u0434\u0440\u0435\u0439
I need to convert it into cyrillic cp-1251 string.
How do i do it? Thank you.
If that is a literal sequence of characters that must decoded, you'll need to first start with something like this (assuming your input is in the string input):
StringBuffer decodedInput = new StringBuffer();
Matcher match = Pattern.compile("\\\\u([0-9a-fA-F]{4})| ").matcher(input);
while (match.find()) {
    String character = match.group(1);
    if (character == null)
        decodedInput.append(match.group()); // a plain space
    else
        decodedInput.append((char) Integer.parseInt(character, 16));
}
At this point, you should have java string representation of your input in decodedInput.
If your system supports the cp1251 charset, you can then convert that to cp1251 with something like this (note that Java's alias for it is "cp1251" or "windows-1251", without the extra hyphen):
Charset cp1251charset = Charset.forName("windows-1251");
ByteBuffer output = cp1251charset.encode(decodedInput.toString());
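Putting the decoding loop and the encoding step together into one runnable sketch (the class and method names are just for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeEscapes {

    // Turn literal \uXXXX escapes (and spaces) into the chars they denote.
    static String decode(String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = Pattern.compile("\\\\u([0-9a-fA-F]{4})| ").matcher(input);
        while (m.find()) {
            if (m.group(1) == null)
                out.append(m.group()); // a plain space
            else
                out.append((char) Integer.parseInt(m.group(1), 16));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "\\u041a\\u0438\\u0441\\u0435\\u043b\\u0435\\u0432"
                     + " \\u0410\\u043d\\u0434\\u0440\\u0435\\u0439";
        String name = decode(input);
        System.out.println(name); // Киселев Андрей
        // Encode to windows-1251 bytes if a legacy consumer needs them.
        ByteBuffer cp1251 = Charset.forName("windows-1251").encode(name);
        System.out.println(cp1251.remaining()); // 14 single-byte characters
    }
}
```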

Letter with trema being shown as percentage sign

In my program I convert a byte stream I get as input to a String. But when the bytestream contains words with a ë, this letter is converted to a %. How do I fix this?
Thx
The % almost certainly means the byte stream is being decoded with the wrong (or platform default) charset, so always name the charset explicitly when converting between bytes and Strings.
To encode a String as UTF-8, invoke its getBytes method and pass the encoding identifier as a parameter; it returns an array of bytes in UTF-8 format. To create a String object from an array of non-Unicode bytes, invoke the String constructor with the encoding parameter. Refer to this:
try {
    byte[] utf8Bytes = original.getBytes("UTF8");
    byte[] defaultBytes = original.getBytes(); // platform default charset
    String roundTrip = new String(utf8Bytes, "UTF8");
    System.out.println("roundTrip = " + roundTrip);
    System.out.println();
    // printBytes is a helper (from the Java tutorial this snippet is based on)
    // that prints the numeric value of each byte in the array.
    printBytes(utf8Bytes, "utf8Bytes");
    System.out.println();
    printBytes(defaultBytes, "defaultBytes");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
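A minimal sketch of the charset-explicit conversion that keeps the ë intact (the word and class name are my own examples; the input bytes are simulated rather than read from a real stream):

```java
import java.nio.charset.StandardCharsets;

public class TremaDemo {

    // Decode an incoming byte stream explicitly as UTF-8.
    static String decode(byte[] streamBytes) {
        return new String(streamBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] input = "Citroën".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(input)); // Citroën - the ë survives
        // Decoding the same bytes as US-ASCII would mangle the two bytes of ë
        // into replacement characters instead:
        System.out.println(new String(input, StandardCharsets.US_ASCII));
    }
}
```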
