I have a String called
String s = "Constitución GarantÃa";
I want to convert it to Constitución Garantía.
This is a Spanish keyword. How can I convert it?
What you have described is an XY problem. This is an encoding issue, and more characters than this one may turn out to need replacing. Instead of replacing them one by one, reinterpret the whole String: take its bytes as ISO-8859-1 and decode them as UTF-8.
String s = "Constitución GarantÃa";
byte[] ptext = s.getBytes(StandardCharsets.ISO_8859_1);
String string = new String(ptext, StandardCharsets.UTF_8);
System.out.println(string); // Constitución Garantía
Consider fixing the encoding at the source the string comes from before you start working with it.
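A minimal sketch of the same re-decoding idea with a safety check, assuming the damage is UTF-8 bytes that were read as ISO-8859-1. The class and method names (MojibakeRepair, repairIfMisdecoded) are made up for illustration. It uses a strict decoder so that text which is already correct, such as "Constitución", is left alone instead of being turned into replacement characters. Non-ASCII characters are written as \u escapes so the example does not depend on the source file encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    // Hypothetical helper: re-decodes text only when its ISO-8859-1 bytes form valid UTF-8.
    static String repairIfMisdecoded(String text) {
        // Characters outside ISO-8859-1 cannot come from this kind of damage.
        if (!StandardCharsets.ISO_8859_1.newEncoder().canEncode(text)) {
            return text;
        }
        byte[] raw = text.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // REPORT makes the decoder throw instead of silently inserting U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // The bytes are not valid UTF-8, so the text was probably fine already.
            return text;
        }
    }

    public static void main(String[] args) {
        // "Garant" + U+00C3 + U+00AD + "a" is "Garantía" whose UTF-8 bytes were read as ISO-8859-1.
        System.out.println(repairIfMisdecoded("Garant\u00C3\u00ADa"));
        // Already correct text is returned unchanged.
        System.out.println(repairIfMisdecoded("Constituci\u00F3n"));
    }
}
```

This is only a heuristic: genuine ISO-8859-1 text whose bytes happen to form valid UTF-8 would still be rewritten, so fixing the encoding where the data enters the program remains the better option.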
I have this string
"=?UTF-8?B?VGLNBGNDQA==?="
that I need to decode into a standard Java String.
I wrote this quick-and-dirty main to get the String, but I'm having trouble:
String s = "=?UTF-8?B?VGLNBGNDQA==?=";
s = s.split("=\\?UTF-8\\?B\\?")[1].split("\\?=")[0];
System.out.println(s);
byte[] decoded = Base64.getDecoder().decode(s);
String x = new String(decoded, "UTF8");
System.out.println(decoded);
System.out.println(x);
It is actually printing a strange string
"Tb�cC@"
I do not know what text is behind the encoded string, but I assume my program works, since I can convert any other encoded string without problems, for example
"=?UTF-8?B?SGlfR3V5cyE="
That is "Hi_Guys!".
Should I assume that string is malformed?
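One way to check this rather than guess is to decode the Base64 payload and run the bytes through a strict UTF-8 decoder, which throws on malformed input instead of substituting "�". The sketch below is only an illustration (StrictUtf8Check and decodeBase64Utf8Strict are made-up names) and takes the payload that sits between "=?UTF-8?B?" and "?=".

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Optional;

class StrictUtf8Check {
    // Hypothetical helper: returns the decoded text, or empty if the bytes are not valid UTF-8.
    static Optional<String> decodeBase64Utf8Strict(String base64Payload) {
        byte[] raw = Base64.getDecoder().decode(base64Payload);
        try {
            String text = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
            return Optional.of(text);
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Payload from the working example.
        System.out.println(decodeBase64Utf8Strict("SGlfR3V5cyE="));
        // Payload from the problematic string; its bytes include 0xCD followed by 0x04.
        System.out.println(decodeBase64Utf8Strict("VGLNBGNDQA=="));
    }
}
```

For the second payload the decoded bytes are 54 62 CD 04 63 43 40. The byte 0xCD starts a two-byte UTF-8 sequence but is not followed by a continuation byte, so the strict decoder rejects it. That suggests the payload is not UTF-8 text, whatever its header claims.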
I have some text strings that I need to process and inside the strings there are HTML special characters. For example:
10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂
I would like to convert those characters to utf-8.
I used org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4 but didn't have any luck. Is there an easy way to deal with this problem?
Apache commons-text library has the StringEscapeUtils class that has the unescapeHtml4() utility method.
String utf8Str = StringEscapeUtils.unescapeHtml4(htmlStr);
You may also need unescapeXml()
@Bohemian's code is correct; it works for me. Your un-encoded string is 10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂.
Now, I'm adding another answer instead of commenting on Bohemian's answer because there are two things that still need to be mentioned:
I copy-pasted your string into HTML code and the browser can't render your characters properly, because your String is incorrectly encoded, i.e. the high surrogate and the low surrogate of each two-char code point were encoded separately, instead of the whole code point being encoded once (it seems the original string was UTF-16 encoded, maybe a Java String?).
You want the string to be re-encoded to UTF-8.
Once you have your String unencoded by StringEscapeUtils.unescapeHtml(htmlStr) (which un-encodes your string successfully despite it being encoded incorrectly), it doesn't make much sense to talk about "string encodings", as Java strings are "unaware" of encodings (they use UTF-16 internally, though).
If you need a group of bytes containing a UTF-8 encoded "string", you need to get the "raw" bytes from a String encoded as UTF-8:
String javaStr = StringEscapeUtils.unescapeHtml(htmlStr);
byte[] rawUtf8String = javaStr.getBytes(StandardCharsets.UTF_8);
And do with such byte array whatever you need.
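To illustrate the point about surrogates, here is a small standard-library sketch that is not the commons-lang implementation. It assumes the entities are decimal numeric references such as &#55357;&#56877; (the two surrogate halves of U+1F62D), and the names NumericEntitySketch and unescapeDecimalEntities are invented for the example. Each half is appended as one char, so the two halves end up adjacent in the Java String and form a single code point, which UTF-8 then encodes as four bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumericEntitySketch {
    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d{1,7});");

    // Hypothetical helper: replaces decimal numeric entities with the characters they name.
    static String unescapeDecimalEntities(String html) {
        Matcher m = DECIMAL_ENTITY.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start());
            int value = Integer.parseInt(m.group(1));
            if (value <= 0xFFFF) {
                // A BMP value or a single surrogate half: append it as one char.
                out.append((char) value);
            } else if (Character.isValidCodePoint(value)) {
                out.appendCodePoint(value);
            } else {
                // Not a Unicode code point: keep the entity text untouched.
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String javaStr = unescapeDecimalEntities("10&#55357;&#56877;");
        // The two halves now form the single code point U+1F62D.
        System.out.println(Integer.toHexString(javaStr.codePointAt(2))); // 1f62d
        // "10" takes 2 bytes and the emoji 4 bytes in UTF-8.
        System.out.println(javaStr.getBytes(StandardCharsets.UTF_8).length); // 6
    }
}
```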
Now if what you need is to write a UTF-8 encoded string to a File, instead of that byte array you need to specify the encoding when you create the proper java.io.Writer.
Try this code to un-encode your string (change the file path first) and then open the resulting file in any editor that supports UTF-8:
java.io.Writer approach (better):
public static void main(String[] args) throws IOException {
    String str = "10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂";
    String javaString = StringEscapeUtils.unescapeHtml(str);
    // The Writer encodes the chars as UTF-8 while writing them to the file
    try (Writer output = new OutputStreamWriter(
            new FileOutputStream("/path/to/testing.txt"), StandardCharsets.UTF_8)) {
        output.write(javaString);
    }
}
java.io.OutputStream approach (if you already have a "raw string"):
public static void main(String[] args) throws IOException {
    String str = "10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂";
    String javaString = StringEscapeUtils.unescapeHtml(str);
    try (OutputStream output = new FileOutputStream("/path/to/testing.txt")) {
        // Encode to UTF-8 bytes first, then write the raw bytes in one call
        output.write(javaString.getBytes(StandardCharsets.UTF_8));
    }
}
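For completeness, a sketch of the Writer approach with java.nio.file, which takes the charset directly. The helper name writeAndReadBack and the temporary file are only there to make the example self-checking; they are not part of the answer above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8FileWriteSketch {
    // Hypothetical helper: writes text as UTF-8 to a temp file and returns the bytes on disk.
    static byte[] writeAndReadBack(String text) {
        try {
            Path tmp = Files.createTempFile("utf8-sketch", ".txt");
            try {
                try (Writer out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
                    out.write(text);
                }
                return Files.readAllBytes(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "10" plus one emoji (a surrogate pair in the Java String).
        byte[] onDisk = writeAndReadBack("10\uD83D\uDE2D");
        System.out.println(onDisk.length); // 6
    }
}
```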
I am working on a mail application and I have some trouble decoding MIME-encoded text. I am using MimeUtility.decodeText(), but it doesn't work for every encoded text. Some texts are decoded properly but others are not.
The encoded texts that can't be decoded mostly use the utf-8 and iso-8859-9 charsets.
How can I solve this issue?
This is the code I used for decoding
MimeUtility.decodeText(text);
These are examples of failing text:
****Solution***** (Thanks to @user_xtech007)
I solved the problem by using a regex to split out the individual encoded parts and decoding each one separately.
Here is the method I am using:
private final String ENCODED_PART_REGEX_PATTERN = "=\\?([^?]+)\\?([^?]+)\\?([^?]+)\\?=";

private String decode(String s) {
    Pattern pattern = Pattern.compile(ENCODED_PART_REGEX_PATTERN);
    Matcher m = pattern.matcher(s);
    // Collect every encoded word (=?charset?encoding?text?=) found in the input
    ArrayList<String> encodedParts = new ArrayList<String>();
    while (m.find()) {
        encodedParts.add(m.group(0));
    }
    if (encodedParts.isEmpty()) {
        return s;
    }
    try {
        // Decode each encoded word on its own and substitute it in place
        for (String encoded : encodedParts) {
            s = s.replace(encoded, MimeUtility.decodeText(encoded));
        }
        return s;
    } catch (Exception ex) {
        // On failure, return whatever has been decoded so far
        return s;
    }
}
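The same split-and-decode idea can be sketched with the standard library only, for cases where pulling in a mail dependency is not an option. This is a simplified illustration, not a replacement for MimeUtility: the names EncodedWordSketch, decodeAll and decodeQ are invented, it handles only the B and Q encodings, and it does not drop the whitespace between adjacent encoded words as RFC 2047 asks.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EncodedWordSketch {
    // =?charset?B-or-Q?payload?=
    private static final Pattern ENCODED_WORD =
            Pattern.compile("=\\?([^?]+)\\?([bBqQ])\\?([^?]*)\\?=");

    // Hypothetical helper: decodes every encoded word, wherever it appears in the input.
    static String decodeAll(String input) {
        Matcher m = ENCODED_WORD.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            out.append(decodeOne(m.group(), m.group(1), m.group(2), m.group(3)));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    private static String decodeOne(String whole, String charsetName, String encoding, String payload) {
        try {
            Charset charset = Charset.forName(charsetName);
            byte[] bytes = encoding.equalsIgnoreCase("B")
                    ? Base64.getDecoder().decode(payload)
                    : decodeQ(payload);
            return new String(bytes, charset);
        } catch (IllegalArgumentException e) {
            // Unknown charset, bad Base64 or bad hex escape: keep the encoded word as-is.
            return whole;
        }
    }

    // Q encoding: '_' means space, "=XX" is a hex-encoded byte, anything else is literal.
    private static byte[] decodeQ(String payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < payload.length(); i++) {
            char c = payload.charAt(i);
            if (c == '_') {
                out.write(' ');
            } else if (c == '=') {
                if (i + 2 >= payload.length()) {
                    throw new IllegalArgumentException("Truncated escape in Q payload");
                }
                out.write(Integer.parseInt(payload.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(decodeAll("abc=?UTF-8?B?SGlfR3V5cyE=?=def")); // abcHi_Guys!def
    }
}
```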
Convert the string you receive into a byte array, then use this to decode UTF-8 text:
String s2 = new String(bytes, "UTF-8");
For ISO-8859-1 text, first convert it into a byte array, then convert that to a string:
byte[] b2 = s.getBytes("ISO-8859-1");
To extract the encoded string from the URI, you can use a regex.
You can also decode this string by putting
System.setProperty("mail.mime.decodetext.strict", "false");
before you call MimeUtility.decodeText(text);
This will ensure that also "inner words" get decoded:
The mail.mime.decodetext.strict property controls decoding of MIME
encoded words. The MIME spec requires that encoded words start at the
beginning of a whitespace separated word. Some mailers incorrectly
include encoded words in the middle of a word. If the
mail.mime.decodetext.strict System property is set to "false", an
attempt will be made to decode these illegal encoded words. The
default is true.
https://docs.oracle.com/javaee/7/api/javax/mail/internet/MimeUtility.html
We have a Java lib accepting a UTF-8 string as input. But if the input contains any non-ANSI character, the lib may crash. So we want to remove all non-ANSI characters from the string. How can we do that in Java?
Try this. I pulled it from elsewhere, so I haven't tested it:
// Create an encoder and a decoder for the target character encoding
Charset charset = Charset.forName("US-ASCII");
CharsetDecoder decoder = charset.newDecoder();
CharsetEncoder encoder = charset.newEncoder();
// This line is the key to removing "unmappable" characters.
encoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
String result = inString;
try {
    // Encode the string to bytes in a ByteBuffer; unmappable characters are dropped
    ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(inString));
    // Decode the bytes back into a CharBuffer and then into a string
    CharBuffer cbuf = decoder.decode(bbuf);
    result = cbuf.toString();
} catch (CharacterCodingException cce) {
    // Malformed input (e.g. an unpaired surrogate) still throws; result keeps the original string
    String errorMessage = "Exception during character encoding/decoding: " + cce.getMessage();
    System.err.println(errorMessage);
    cce.printStackTrace();
}
Take a look at String.codePointAt(index). That can give you the Unicode code point for a given character, and from there you could remove those outside your range.
How you handle the fact that a character has been removed is on your end, but keep in mind that the string you'll be sending to the library isn't necessarily the same as that provided by the client. This may or may not cause problems.
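A minimal sketch of the code point approach, assuming "non-ANSI" really means "outside 7-bit ASCII" (an assumption; adjust the range check if Windows-1252 or ISO-8859-1 is what the library expects). The names AsciiFilterSketch and keepAscii are invented for the example.

```java
class AsciiFilterSketch {
    // Hypothetical helper: keeps only code points in the 7-bit ASCII range.
    static String keepAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        // codePoints() yields whole code points, so surrogate pairs are dropped as a unit.
        input.codePoints()
                .filter(cp -> cp < 0x80)
                .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // The accented letter and the emoji are removed.
        System.out.println(keepAscii("Garant\u00EDa \uD83D\uDE02!")); // Garanta !
    }
}
```

Iterating by code point rather than by char avoids leaving half of a surrogate pair behind.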
I'm not sure what you mean by ANSI here. Do you mean the Windows-1252 character encoding that people typically call ANSI? That's not ASCII, and it's also not ISO-8859-1, so make sure you get your code pages correct.
To convert a string, I am turning it into a byte array as follows:
byte[] nameByteArray = cityName.getBytes();
To convert back, I did: String retrievedString = new String(nameByteArray); which obviously doesn't work. How would I convert it back?
What characters are there in your original city name? Try the UTF-8 version, like this:
byte[] nameByteArray = cityName.getBytes("UTF-8");
String retrievedString = new String(nameByteArray, "UTF-8");
"which obviously doesn't work."
Actually that's exactly how you do it. The only thing that can go wrong is that you're implicitly using the platform default encoding, which could differ between systems, and might not be able to represent all characters in the string.
The solution is to explicitly use an encoding that can represent all characters, such as UTF-8:
byte[] nameByteArray = cityName.getBytes("UTF-8");
String retrievedString = new String(nameByteArray, "UTF-8");
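To see why the choice of charset matters, the sketch below round-trips the same city name through two charsets. RoundTripSketch and roundTrip are illustrative names, and US-ASCII merely stands in for a platform default that cannot represent every character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripSketch {
    // Hypothetical helper: encodes and decodes with the same explicit charset.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String cityName = "Z\u00FCrich";
        // UTF-8 can represent every character, so the round trip is lossless.
        System.out.println(roundTrip(cityName, StandardCharsets.UTF_8).equals(cityName)); // true
        // US-ASCII cannot represent U+00FC; getBytes substitutes '?' for it.
        System.out.println(roundTrip(cityName, StandardCharsets.US_ASCII)); // Z?rich
    }
}
```

Using the StandardCharsets constants instead of the "UTF-8" string also avoids the checked UnsupportedEncodingException.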