I have a string that used to be an XML tag and that contains mojibake:
<Applicant_Place_Born>ÐоÑква</Applicant_Place_Born>
I know that the same string in the correct encoding is:
<Applicant_Place_Born>Москва</Applicant_Place_Born>
I know this because I can convert it into the proper string with a Tcl utility:
# The original string
set s "ÐоÑква"
# substituting the html escapes
set t "Ð\x9cоÑ\x81ква"
# decode from utf-8 into Unicode
encoding convertfrom utf-8 "Ð\x9cоÑ\x81ква"
Москва
I tried different variations of this:
System.out.println(new String(original.getBytes("UTF-8"), "CP1251"));
but I always got more mojibake or question marks instead of the characters.
Q: How can I do the same as Tcl does but using Java code?
EDIT:
I have tried @Joop Eggen's approach:
import java.nio.charset.StandardCharsets;
import org.apache.commons.lang3.StringEscapeUtils;

public class s {
    static String s;

    public static void main(String[] args) {
        try {
            System.setProperty("file.encoding", "CP1251");
            System.out.println("JVM encoding: " + System.getProperty("file.encoding"));
            s = "ÐоÑква";
            System.out.println("Original text: " + s);
            s = StringEscapeUtils.unescapeHtml4(s);
            byte[] b = s.getBytes(StandardCharsets.ISO_8859_1);
            s = new String(b, "UTF-16BE");
            System.out.println("Result: " + s);
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
The converted string looked like something Chinese:
JVM encoding: CP1251
Original text: ÐоÑква
Result: 킜킾톁킺킲킰
A String in Java should always hold correct Unicode. In your case, it looks like UTF-16BE bytes were interpreted as some single-byte encoding.
A patch would be:
String string = StringEscapeUtils.unescapeHtml4(s);
byte[] b = string.getBytes(StandardCharsets.ISO_8859_1);
string = new String(b, "UTF-16BE");
Now string should be a correct Unicode String.
System.out.println(string);
If the operating system uses, for instance, Cp1251, the Cyrillic text should then be converted correctly.
The characters in s are actually bytes of UTF-16BE, I guess.
Getting the bytes of the string in a single-byte encoding hopefully avoids any conversion.
The bytes are then turned into a String as UTF-16BE, which is converted internally to Unicode (actually UTF-16BE too).
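You were pretty close. However, getBytes encodes a String into bytes rather than decoding it, so encoding to UTF-8 and decoding as UTF-8 again just returns the same text. Java also has no \x escape; use \u009c and \u0081 instead. What you want is to recover the raw bytes with a single-byte charset and then decode them as UTF-8, along the lines of:
// Each char below stands for one byte of the UTF-8 encoding of "Москва"
String string = "\u00D0\u009C\u00D0\u00BE\u00D1\u0081\u00D0\u00BA\u00D0\u00B2\u00D0\u00B0";
byte[] bytes = string.getBytes("ISO-8859-1");
System.out.println(new String(bytes, "UTF-8"));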
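The Tcl session in the question decodes the bytes as UTF-8, so a Java equivalent can be sketched as a small helper. This is a minimal sketch, not a definitive fix: the class and method names are made up for illustration, and it assumes the damage is exactly "UTF-8 bytes that were decoded as ISO-8859-1", with any HTML escapes already substituted.

```java
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    // Undo "UTF-8 bytes decoded as ISO-8859-1": map every char back to the
    // single byte it came from, then decode the byte sequence as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String correct = "\u041C\u043E\u0441\u043A\u0432\u0430"; // "Москва"
        // Reproduce the damage: encode as UTF-8, then misread as ISO-8859-1.
        String garbled = new String(correct.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(repair(garbled).equals(correct)); // prints "true"
    }
}
```

ISO-8859-1 is used for the first step because it maps all 256 byte values to the first 256 code points one-to-one, so no byte is lost on the way back.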
I have a constraint: I cannot save some characters (like & and =) in a particular storage.
The problem is that I have strings (user input) containing these disallowed characters, and I'd like to save them to that storage.
I'd like to convert such a string to another string that doesn't contain these special characters.
I'd also like to be able to convert back to the original string without ambiguity.
Any idea how to implement the conversion in both directions? Thanks.
Convert the user input to hex before saving it, and convert the hex value back to a string when reading it. Use these methods:
import java.math.BigInteger;
import java.nio.charset.Charset;
import javax.xml.bind.DatatypeConverter;

public static String stringToHex(String arg) {
    return String.format("%x", new BigInteger(1, arg.getBytes(Charset.forName("UTF-8"))));
}

public static String hexToString(String arg) {
    byte[] bytes = DatatypeConverter.parseHexBinary(arg);
    return new String(bytes, Charset.forName("UTF-8"));
}
Usage:
String h = stringToHex("Perera & Sons");
System.out.println(h);
System.out.println(hexToString(h));
OUTPUT
506572657261202620536f6e73
Perera & Sons
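Two caveats apply to the methods above: formatting a BigInteger with %x drops leading zero digits (so input starting with a tab or newline yields an odd-length hex string), and javax.xml.bind.DatatypeConverter is no longer part of the JDK since Java 11. A minimal sketch that avoids both, using only the standard library, might look like this; the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

class HexCodec {
    // Encode every UTF-8 byte as exactly two lowercase hex digits.
    static String stringToHex(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            hex.append(Character.forDigit((b >> 4) & 0xF, 16));
            hex.append(Character.forDigit(b & 0xF, 16));
        }
        return hex.toString();
    }

    // Decode pairs of hex digits back into bytes, then into a String.
    static String hexToString(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd-length hex string");
        }
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String h = stringToHex("Perera & Sons");
        System.out.println(h);              // prints "506572657261202620536f6e73"
        System.out.println(hexToString(h)); // prints "Perera & Sons"
    }
}
```

The output contains only the characters 0-9 and a-f, so it is safe for a storage that rejects & and =, at the cost of doubling the length.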
As already pointed out in the comments, URL encoding looks like the way to go.
In Java this is done simply with URLEncoder and URLDecoder:
String encoded = URLEncoder.encode("My string &with& illegal = characters ", "UTF-8");
System.out.println("Encoded String:" + encoded);
String decoded = URLDecoder.decode(encoded, "UTF-8");
System.out.println("Decoded String:" + decoded);
See the documentation for java.net.URLEncoder and java.net.URLDecoder.
Hi, my example code looks like this:
String ln="á€á€á€•á€¹á€•á€¶á€”ဲ့";
try {
byte[] b = ln.getBytes("UTF-8");
String s = new String(b, "US-ASCII");
System.out.println(s);
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
When I run it, it does not print Burmese. Is there a solution for that? Thanks.
The real problem is that the server is sending back content either with the wrong charset, or double-encoded. If at all possible, you should get that fixed.
In the meantime, you have the right idea—converting the mis-encoded text to the correct charset.
Each character in your String was apparently supposed to be a single byte that was part of a UTF-8 byte sequence. What you're actually seeing is each of those single bytes being treated as a character in the Windows cp1252 charset, and converted to a Java char accordingly.
So, you first want to convert the chars from cp1252 back into the proper bytes:
byte[] b = ln.getBytes("cp1252");
Now you have a true UTF-8 byte sequence, which you can convert into the proper String:
String s = new String(b, StandardCharsets.UTF_8);
// In Java 6, you must use:
//String s = new String(b, "UTF-8");
You should never use US-ASCII if you are decoding, or trying to generate, Burmese characters, or any non-English characters. ASCII consists of codepoints 0 through 127 only.
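Put together, the two steps above can be sketched as one helper. This is only an illustration under the assumption that the text really is UTF-8 misread as cp1252; the class name is made up, and the demo builds its own damaged input from a single Burmese letter instead of using the string from the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1252Repair {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Step 1: chars -> original bytes via cp1252. Step 2: bytes -> text via UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(CP1252), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u1015"; // MYANMAR LETTER PA, UTF-8 bytes E1 80 95
        // Reproduce the damage: UTF-8 bytes misread as cp1252.
        String garbled = new String(original.getBytes(StandardCharsets.UTF_8), CP1252);
        System.out.println(repair(garbled).equals(original)); // prints "true"
    }
}
```

Note that cp1252 leaves a few byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) unassigned, so text whose UTF-8 form contains those bytes may not survive the misreading and cannot always be recovered this way. That is one more reason to fix the charset at the source.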
Here is the code for my class:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5tester {
    private String licenseMd5 = "?jZ2$??f???%?";

    public Md5tester() {
        System.out.println(isLicensed());
    }

    public static void main(String[] args) {
        new Md5tester();
    }

    public boolean isLicensed() {
        File f = new File("C:\\Some\\Random\\Path\\toHash.txt");
        if (!f.exists()) {
            return false;
        }
        try {
            BufferedReader read = new BufferedReader(new InputStreamReader(new FileInputStream(f)));
            //get line from txt
            String line = read.readLine();
            //output what line is
            System.out.println("Line read: " + line);
            //get utf-8 bytes from line
            byte[] lineBytes = line.getBytes("UTF-8");
            //declare messagedigest for hashing
            MessageDigest md = MessageDigest.getInstance("MD5");
            //hash the bytes of the line read
            String hashed = new String(md.digest(lineBytes), "UTF-8");
            System.out.println("Hashed as string: " + hashed);
            System.out.println("LicenseMd5: " + licenseMd5);
            System.out.println("Hashed as bytes: " + hashed.getBytes("UTF-8"));
            System.out.println("LicenseMd5 as bytes: " + licenseMd5.getBytes("UTF-8"));
            if (hashed.equalsIgnoreCase(licenseMd5)) {
                return true;
            } else {
                return false;
            }
        } catch (FileNotFoundException e) {
            return false;
        } catch (IOException e) {
            return false;
        } catch (NoSuchAlgorithmException e) {
            return false;
        }
    }
}
Here's the output I get:
Line read: Testing
Hashed as string: ?jZ2$??f???%?
LicenseMd5: ?jZ2$??f???%?
Hashed as bytes: [B@5fd1acd3
LicenseMd5 as bytes: [B@3ea981ca
false
I'm hoping someone can clear this up for me, because I have no clue what the issue is.
A byte[] returned by MD5 conversion is an arbitrary byte[], therefore you cannot treat it as a valid representation of String in some encoding.
In particular, ?s in ?jZ2$??f???%? correspond to bytes that cannot be represented in your output encoding. It means that content of your licenseMd5 is already damaged, therefore you cannot compare your MD5 hash with it.
If you want to represent your byte[] as String for further comparison, you need to choose a proper representation for arbitrary byte[]s. For example, you can use Base64 or hex strings.
You can convert byte[] into hex string as follows:
public static String toHex(byte[] in) {
    StringBuilder out = new StringBuilder(in.length * 2);
    for (byte b : in) {
        out.append(String.format("%02X", b));
    }
    return out.toString();
}
Also note that byte[] uses the default implementation of toString(). Its result (such as [B@5fd1acd3) is not related to the content of the byte[], so it is meaningless in your case.
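As a sketch of how the comparison could look with a hex representation, the hash can be turned into a hex string and compared with a stored hex string. The class and method names here are hypothetical, and the "stored" value is computed in the demo only to keep the example self-contained; in practice it would be a hex literal kept in the license field.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Md5Hex {
    // MD5 of the UTF-8 bytes of text, as 32 lowercase hex digits.
    static String md5Hex(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder out = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                out.append(String.format("%02x", b));
            }
            return out.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    // Compare hex strings, never the raw digest forced into a String.
    static boolean matches(String line, String expectedHex) {
        return md5Hex(line).equalsIgnoreCase(expectedHex);
    }

    public static void main(String[] args) {
        String stored = md5Hex("Testing"); // stands in for the stored license hash
        System.out.println(matches("Testing", stored)); // prints "true"
        System.out.println(matches("testing", stored)); // prints "false"
    }
}
```

If the raw digests themselves are available on both sides, MessageDigest.isEqual(byte[], byte[]) compares them directly without any String conversion.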
The ? symbols in the printed representation of hashed aren't literal question marks, they're unprintable characters.
You get this result when your Java source file is not saved as UTF-8 while you encode the string using UTF-8. Try removing the "UTF-8" argument: the MD5 string will come out differently, and if you copy that output into licenseMd5 the comparison returns true.
Another way is to set the source file encoding to UTF-8, in which case the encoded string will also be different.
In my program I convert a byte stream I get as input to a String. But when the bytestream contains words with a ë, this letter is converted to a %. How do I fix this?
Thx
To encode these characters:
To convert a String to UTF-8, invoke the getBytes method and specify the appropriate encoding identifier as a parameter. The getBytes method returns an array of bytes in UTF-8 format. To create a String object from an array of non-Unicode bytes, invoke the String constructor with the encoding parameter. For example:
// original is the String to convert; printBytes is a helper (not shown)
// that prints each byte of the array under the given label.
try {
    byte[] utf8Bytes = original.getBytes("UTF8");
    byte[] defaultBytes = original.getBytes();
    String roundTrip = new String(utf8Bytes, "UTF8");
    System.out.println("roundTrip = " + roundTrip);
    System.out.println();
    printBytes(utf8Bytes, "utf8Bytes");
    System.out.println();
    printBytes(defaultBytes, "defaultBytes");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
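For the byte-stream case in the question, what matters is that the bytes are decoded with the same charset they were written in. The sketch below shows how one letter, ë, looks in two encodings and what happens when the charsets are mixed up. The class name is made up, and which encoding the asker's stream actually uses is unknown, so treat both byte arrays as assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StreamDecoding {
    // Decode raw bytes with an explicit charset instead of the platform default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xEB};            // "ë" in ISO-8859-1
        byte[] utf8 = {(byte) 0xC3, (byte) 0xAB}; // "ë" in UTF-8

        // Matching charset: both give the same one-character String.
        System.out.println(decode(latin1, StandardCharsets.ISO_8859_1).equals("\u00EB")); // prints "true"
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals("\u00EB"));        // prints "true"

        // Mismatched charset: UTF-8 bytes read as ISO-8859-1 become two chars.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).length()); // prints "2"
    }
}
```

When reading from an InputStream, the same idea applies by passing the charset to the reader, for example new InputStreamReader(in, StandardCharsets.UTF_8).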
When I receive JSON, it contains \u003c and \u003e instead of < and >. I want to convert them back to UTF-8 in Java. Any help will be highly appreciated. Thanks.
try {
    // Convert from Unicode to UTF-8
    String string = "\u003c";
    byte[] utf8 = string.getBytes("UTF-8");
    // Convert from UTF-8 to Unicode
    string = new String(utf8, "UTF-8");
} catch (UnsupportedEncodingException e) {
}
See http://www.exampledepot.com/egs/java.lang/unicodetoutf8.html
You can try converting the string into a byte array:
byte[] utfString = str.getBytes("UTF-8");
and then converting that back to a String object by specifying the UTF-8 encoding:
str = new String(utfString, "UTF-8");
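You can also try this (the String constructor takes a byte array, not a String, so the bytes are obtained first):
String s = "Hello World!";
String convertedInUTF8 = new String(s.getBytes(StandardCharsets.US_ASCII), StandardCharsets.UTF_8);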
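Note that in Java source code the literal "\u003c" is already the character <, so re-encoding it changes nothing. If the received text really contains the six characters backslash, u, 0, 0, 3, c, they have to be unescaped. A JSON parser does this automatically when it reads string values; where none is used, a minimal regex-based sketch could look like the following. The class name is hypothetical, and the sketch ignores edge cases such as an escaped backslash directly before the u.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeUnescape {
    // A literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static String unescape(String text) {
        Matcher m = ESCAPE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and '\' in the result from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes put real backslash characters in the String.
        String raw = "\\u003cb\\u003ebold\\u003c/b\\u003e";
        System.out.println(unescape(raw)); // prints "<b>bold</b>"
    }
}
```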