I am receiving a string that is not properly encoded, like mystring%201 where it should be mystring 1. How can I replace all sequences that should be interpreted as UTF-8? I have read a lot of posts but found no complete solution. Please note that the string arrives already wrongly encoded; I am not asking how to encode a char sequence. I asked about the same issue for iOS a few days ago and it was solved using stringByReplacingPercentEscapesUsingEncoding:NSUTF8StringEncoding. Thank you.
You can use the URLDecoder.decode() method, like this:
String s = URLDecoder.decode(myString, "UTF-8");
Looks like your string is partially URL-encoded, so how about this:
try {
    System.out.println(URLDecoder.decode("mystring%201", "UTF-8"));
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
I am receiving a string that is not properly encoded, like
"mystring%201"
Well, this string is already encoded; you have to decode it:
String sDecoded = URLDecoder.decode("mystring%201", "UTF-8");
sDecoded should now have the value "mystring 1".
Encoding a string works the other way around:
String sEncoded = URLEncoder.encode("mystring 1", "UTF-8");
Note that sEncoded will have the value "mystring+1", not "mystring%201": URLEncoder encodes a space as '+', while URLDecoder decodes both '+' and '%20' back to a space.
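For reference, a complete round trip of the two calls as a runnable sketch (the class name is arbitrary; note again that URLEncoder emits '+' for a space, which URLDecoder also accepts):

```java
import java.net.URLDecoder;
import java.net.URLEncoder;

public class UrlCodecDemo {
    public static void main(String[] args) throws Exception {
        // Decode the percent-encoded input back to plain text
        String decoded = URLDecoder.decode("mystring%201", "UTF-8");
        System.out.println(decoded);  // mystring 1

        // Re-encode it; the space becomes '+', not %20
        String encoded = URLEncoder.encode(decoded, "UTF-8");
        System.out.println(encoded);  // mystring+1
    }
}
```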
So basically I'm trying to send an email with Japanese characters, something like "𥹖𥹖𥹖", but I get "???" instead. What should I do to encode this? I have looked at a bunch of solutions but none of them has helped me solve this.
Here's the method I've been using to do the encoding:
public String encoding(String str) throws UnsupportedEncodingException {
    String encoding = "Shift_JIS";
    return this.changeCharset(str, encoding);
}

public String changeCharset(String str, String newCharset) throws UnsupportedEncodingException {
    if (str != null) {
        byte[] jis = str.getBytes("Shift_JIS");
        return new String(jis, newCharset);
    }
    return null;
}
You're making this too complicated...
First, make sure you have the Japanese text in a proper Java String object, using proper Unicode characters.
Then, set the content of the body part (a JavaMail MimeBodyPart) using this method:
htmlPart.setText(japaneseString, "Shift_JIS", "html");
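As a sanity check that Japanese text survives a Shift_JIS round trip once it is a proper Unicode String, here is a standalone sketch (no mail code involved; the sample string is a stand-in):

```java
public class ShiftJisRoundTrip {
    public static void main(String[] args) throws Exception {
        String japanese = "こんにちは";  // a proper Unicode String
        // Encode to Shift_JIS bytes, then decode them again
        byte[] sjis = japanese.getBytes("Shift_JIS");
        String back = new String(sjis, "Shift_JIS");
        System.out.println(japanese.equals(back));  // true
    }
}
```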
I'm teaching myself Java IO currently, and I'm able to read basic ASCII characters from a .txt file, but when I get to other Latin-1 characters within the 255 range it prints 194 instead of the correct decimal number for the character.
For example, I can read abcdefg from the txt file, but if I throw in a character like © I don't get 169; for some reason I get 194. I tried testing this by just printing all chars between 1 and 255 with a loop, and that works. Reading this input does not, though, so I'm a little perplexed. I understand I can use a Reader object or whatever, but I want to cover the basics first by learning the byte streams. Here is what I have, though:
InputStream io = null;
try {
    io = new FileInputStream("thing.txt");
    int yeet = io.read();
    System.out.println(yeet);
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
UTF-8 encoding table and Unicode characters
You can see there that the hex code for © is c2 a9, i.e. 194 169. Your file is UTF-8 encoded, so © is stored as a two-byte sequence, and you read only the first byte, which is 194.
P.S. Read a file character by character/UTF8 is another good example of Java encodings, code points, etc.
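The byte-vs-character distinction can be demonstrated without a file, using a ByteArrayInputStream in place of the FileInputStream (the class name here is arbitrary):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class CopyrightBytes {
    public static void main(String[] args) throws Exception {
        byte[] data = "©".getBytes(StandardCharsets.UTF_8);  // two bytes: 0xC2 0xA9

        // Reading the raw bytes one at a time, as FileInputStream.read() does:
        InputStream in = new ByteArrayInputStream(data);
        System.out.println(in.read());  // 194 (0xC2), the first byte of the sequence
        System.out.println(in.read());  // 169 (0xA9), the second byte

        // A Reader decodes the complete sequence into one char:
        InputStreamReader reader =
                new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8);
        System.out.println(reader.read());  // 169, the code point of ©
    }
}
```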
I have some solutions for you.
The first solution: there is a full explanation of this in the book on this site.
The second solution: here is a sample code for you.
public class Example {
    public static void main(String[] args) throws Exception {
        String str = "hey\u6366";
        byte[] charset = str.getBytes("UTF-8");
        String result = new String(charset, "UTF-8");
        System.out.println(result);
    }
}
Output:
hey捦
Let us walk through the above program. First, we converted the given Unicode string to UTF-8 bytes, for later verification, using the getBytes() method:
String str = "hey\u6366";
byte[] charset = str.getBytes("UTF-8");
Then we converted the charset byte array back to Unicode by creating a new String object, as follows:
String result = new String(charset, "UTF-8");
System.out.println(result);
Good luck
I have a constraint: I cannot save some chars (like & and =) in a certain special storage.
The problem is that I have strings (user input) containing these disallowed special chars, which I'd like to save to that storage.
I'd like to convert such a string to another string that doesn't contain these special characters, and still be able to convert it back to the original string without ambiguity.
Any idea how to implement the conversion in both directions? Thanks.
Convert the user input to hex and save it, then convert the hex value back to a string. Use these methods (encoding byte by byte keeps leading zeros, which a BigInteger-based conversion would silently drop):
public static String stringToHex(String arg) {
    StringBuilder sb = new StringBuilder();
    for (byte b : arg.getBytes(StandardCharsets.UTF_8)) {
        sb.append(String.format("%02x", b));
    }
    return sb.toString();
}

public static String hexToString(String arg) {
    byte[] bytes = DatatypeConverter.parseHexBinary(arg);
    return new String(bytes, StandardCharsets.UTF_8);
}
Usage:
String h = stringToHex("Perera & Sons");
System.out.println(h);
System.out.println(hexToString(h));
OUTPUT
506572657261202620536f6e73
Perera & Sons
Already pointed out in the comments, but URL encoding looks like the way to go.
In Java it is done simply with URLEncoder and URLDecoder:
String encoded = URLEncoder.encode("My string &with& illegal = characters ", "UTF-8");
System.out.println("Encoded String:" + encoded);
String decoded = URLDecoder.decode(encoded, "UTF-8");
System.out.println("Decoded String:" + decoded);
URLEncoder
URLDecoder
I have a string which used to be an XML tag and which contains mojibake:
<Applicant_Place_Born>ÐоÑква</Applicant_Place_Born>
I know that exactly the same string but in correct encoding is:
<Applicant_Place_Born>Москва</Applicant_Place_Born>
I know this because using Tcl utility I can convert it into proper string:
# The original string
set s "ÐоÑква"
# substituting the html escapes
set t "Ð\x9cоÑ\x81ква"
# decode from utf-8 into Unicode
encoding convertfrom utf-8 "Ð\x9cоÑ\x81ква"
Москва
I tried different variations of this:
System.out.println(new String(original.getBytes("UTF-8"), "CP1251"));
but I always got other mojibakes or question marks instead of characters.
Q: How can I do the same as Tcl does but using Java code?
EDIT:
I have tried @Joop Eggen's approach:
import java.nio.charset.StandardCharsets;
import org.apache.commons.lang3.StringEscapeUtils;

public class s {
    static String s;

    public static void main(String[] args) {
        try {
            System.setProperty("file.encoding", "CP1251");
            System.out.println("JVM encoding: " + System.getProperty("file.encoding"));
            s = "ÐоÑква";
            System.out.println("Original text: " + s);
            s = StringEscapeUtils.unescapeHtml4(s);
            byte[] b = s.getBytes(StandardCharsets.ISO_8859_1);
            s = new String(b, "UTF-16BE");
            System.out.println("Result: " + s);
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
The converted string came out as something Chinese:
JVM encoding: CP1251
Original text: ÐоÑква
Result: 킜킾톁킺킲킰
A String in Java should always be correct Unicode. In your case you seem to have UTF-16BE interpreted as some single-byte encoding.
A patch would be:
String string = StringEscapeUtils.unescapeHtml4(s);
byte[] b = string.getBytes(StandardCharsets.ISO_8859_1);
string = new String(b, "UTF-16BE");
Now string should be a correct Unicode String.
System.out.println(string);
If the operating system, for instance, uses Cp1251, the Cyrillic text should be converted correctly.
The characters in s are, I guess, actually the bytes of UTF-16BE.
By getting the bytes of the string in a single-byte encoding, hopefully no conversion takes place.
Then make a String of those bytes interpreted as UTF-16BE; internally it is converted to Unicode (which is UTF-16 as well).
You were pretty close. However, getBytes is used to encode a String to bytes rather than decode. To undo the damage you want something along the lines of
String string = "Ð\u009CоÑ\u0081ква"; // \x9c and \x81 from the Tcl example become \u009C and \u0081 in Java
byte[] bytes = string.getBytes("ISO-8859-1"); // recover the raw bytes
System.out.println(new String(bytes, "UTF-8")); // decode them as UTF-8
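Putting the pieces together, here is a runnable sketch, on the assumption (which the Tcl session suggests) that the mojibake characters are the Latin-1 characters corresponding to the UTF-8 bytes of "Москва":

```java
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    public static void main(String[] args) {
        // Each char below is the Latin-1 character for one UTF-8 byte of "Москва"
        String mojibake = "\u00D0\u009C\u00D0\u00BE\u00D1\u0081\u00D0\u00BA\u00D0\u00B2\u00D0\u00B0";
        byte[] raw = mojibake.getBytes(StandardCharsets.ISO_8859_1); // recover the raw bytes
        String repaired = new String(raw, StandardCharsets.UTF_8);   // decode them as UTF-8
        System.out.println(repaired); // Москва
    }
}
```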
Hi, my example code is:
String ln = "á€á€á€•á€¹á€•á€¶á€”ဲ့";
try {
    byte[] b = ln.getBytes("UTF-8");
    String s = new String(b, "US-ASCII");
    System.out.println(s);
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
When I run it, it does not print Burmese. Is there a solution for that? Thanks.
The real problem is that the server is sending back content either with the wrong charset, or double-encoded. If at all possible, you should get that fixed.
In the meantime, you have the right idea: converting the mis-encoded text to the correct charset.
Each character in your String was apparently supposed to be a single byte in a UTF-8 byte sequence. What you're actually seeing is each of those single bytes being treated as a character in the Windows cp1252 charset, and converted to a Java char accordingly.
So, you first want to convert the chars from cp1252 back into the proper bytes:
byte[] b = ln.getBytes("cp1252");
Now you have a true UTF-8 byte sequence, which you can convert into the proper String:
String s = new String(b, StandardCharsets.UTF_8);
// In Java 6, you must use:
//String s = new String(b, "UTF-8");
You should never use US-ASCII if you are decoding, or trying to generate, Burmese characters, or any non-English characters. ASCII consists of codepoints 0 through 127 only.
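The same two steps on a smaller, self-contained case (a hypothetical input showing the same kind of damage: "é", whose UTF-8 bytes C3 A9 mis-decoded as cp1252 display as "Ã©"):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Cp1252Repair {
    public static void main(String[] args) {
        String garbled = "\u00C3\u00A9";  // "Ã©", the cp1252 reading of the bytes of "é"
        byte[] b = garbled.getBytes(Charset.forName("windows-1252"));  // back to the raw bytes
        String repaired = new String(b, StandardCharsets.UTF_8);       // decode as UTF-8
        System.out.println(repaired);  // é
    }
}
```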