How to unescape html special characters in Java?


I have some text strings that I need to process and inside the strings there are HTML special characters. For example:
10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂
I would like to convert those characters to utf-8.
I used org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4 but didn't have any luck. Is there an easy way to deal with this problem?

The Apache commons-text library has a StringEscapeUtils class with an unescapeHtml4() utility method.
String utf8Str = StringEscapeUtils.unescapeHtml4(htmlStr);
You may also need unescapeXml().

@Bohemian's code is correct; it works for me. Your un-encoded string is 10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂.
Now, I'm adding another answer instead of commenting on Bohemian's answer because there are two things that still need to be mentioned:
I copy-pasted your string into HTML code and the browser can't render your characters properly, because your string is incorrectly encoded, i.e. the high surrogate and the low surrogate of each supplementary character are escaped separately, instead of the whole code point being escaped at once (it seems the original string was a UTF-16 encoded string, maybe a Java String?).
You want the string to be re-encoded to UTF-8.
Once your String has been un-encoded by StringEscapeUtils.unescapeHtml(htmlStr) (which succeeds despite the incorrect encoding), it doesn't make much sense to talk about "string encodings", because Java Strings are "unaware" of encodings (they use UTF-16 internally, though).
If you need a group of bytes containing a UTF-8 encoded "string", you need to get the "raw" bytes from a String encoded as UTF-8:
String javaStr = StringEscapeUtils.unescapeHtml(htmlStr);
byte[] rawUtf8String = javaStr.getBytes("UTF-8");
Then do whatever you need with that byte array.
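The two points above can be illustrated with a small stand-alone sketch. It does not use commons-lang or commons-text: unescapeDecimalRefs is a hypothetical hand-written helper that only understands decimal numeric references, and the sample input (one emoji whose two surrogate halves are escaped separately) is an assumption about what such a string looks like. It shows that the halves end up as a single code point in the Java String, and that UTF-8 only enters the picture when bytes are requested.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumericEntityDemo {
    // Matches decimal references such as &#55357; (hex and named entities are ignored).
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d{1,7});");

    // Hypothetical helper: replaces each decimal reference with the character it names.
    static String unescapeDecimalRefs(String input) {
        Matcher m = DECIMAL_REF.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            int value = Integer.parseInt(m.group(1));
            if (Character.isValidCodePoint(value)) {
                // A lone surrogate value is appended as a single char.
                out.appendCodePoint(value);
            } else {
                // Out-of-range number: keep the reference as written.
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed sample: high surrogate 55357 and low surrogate 56877 escaped separately.
        String escaped = "10&#55357;&#56877;";
        String javaStr = unescapeDecimalRefs(escaped);
        byte[] utf8 = javaStr.getBytes(StandardCharsets.UTF_8);

        System.out.println(javaStr.length());                            // 4 UTF-16 code units
        System.out.println(javaStr.codePointCount(0, javaStr.length())); // 3 code points
        System.out.println(utf8.length);                                 // 6 bytes: 1 + 1 + 4
    }
}
```

Passing StandardCharsets.UTF_8 instead of the name "UTF-8" avoids the checked UnsupportedEncodingException.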
Now if what you need is to write a UTF-8 encoded string to a File, instead of that byte array you need to specify the encoding when you create the proper java.io.Writer.
Try this code to un-encode your string (change the file path first) and then open the resulting file in any editor that supports UTF-8:
java.io.Writer approach (better):
public static void main(String[] args) throws IOException {
    String str = "10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂";
    String javaString = StringEscapeUtils.unescapeHtml(str);
    try (Writer output = new OutputStreamWriter(
            new FileOutputStream("/path/to/testing.txt"), "UTF-8")) {
        output.write(javaString);
    }
}
java.io.OutputStream approach (if you already have a "raw string"):
public static void main(String[] args) throws IOException {
    String str = "10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂";
    String javaString = StringEscapeUtils.unescapeHtml(str);
    try (OutputStream output = new FileOutputStream("/path/to/testing.txt")) {
        for (byte b : javaString.getBytes(Charset.forName("UTF-8"))) {
            output.write(b);
        }
    }
}

Related

UTF-8 cannot properly pass Japanese character (Hiragana and Katakana) strings as an argument

For example the file that I needed is found at this filepath and it will be passed as an argument:
"C:\Users\user.name\docs\jap\あああいいいうううえええおおおダウンロード\filename.txt"
I used this code to decode the characters:
String new_path = new String(args[0].getBytes("Shift_JIS"), StandardCharsets.UTF_8);
System.out.println(new_path);
However, the output is:
C:\Users\user.name\docs\jap\あああい�?�?�?�?�?えええおおお�?ウンロード\filename.txt
Some of the characters have not been decoded properly. I already changed the text encoding and encoding of the console to UTF-8 but it still didn't work.
But if I would just print it regularly, it displays just fine.
System.out.println("C:\\Users\\user.name\\docs\\jap\\あああいいいうううえええおおおダウンロード\\filename.txt");
which displays:
C:\Users\user.name\docs\jap\あああいいいうううえええおおおダウンロード\filename.txt
Please tell me how to read the other characters; it would really be a great help. Thanks!

public static void main(String[] args) throws UnsupportedEncodingException {
    // this is your code
    String newPath = new String(args[0].getBytes("Shift_JIS"), StandardCharsets.UTF_8);
    System.out.println(newPath);
    // use this instead of your code
    newPath = args[0];
    System.out.println(newPath);
}
If you print args[0] directly, you should see "あああいいいうううえええおおおダウンロード".
If you create a String from a byte array, pass the charset that the bytes were actually encoded with; once you have the String, you can encode it to any charset you need.
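A minimal sketch of why the original line damages the path, assuming nothing beyond the JDK: roundTrip is a hypothetical helper that encodes with one charset and decodes with another. Shift_JIS is not guaranteed to be present in every Java runtime, so the sketch checks for it first; the sample text is written with Unicode escapes so that the source file encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    // Hypothetical helper: encodes text with one charset and decodes the bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        // Hiragana "a i u" followed by katakana "daunroodo" (download).
        String text = "\u3042\u3044\u3046\u30c0\u30a6\u30f3\u30ed\u30fc\u30c9";

        // Same charset in both directions: the text survives unchanged.
        String same = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println("UTF-8 -> UTF-8 intact: " + text.equals(same));

        // Different charsets: byte sequences that are invalid in UTF-8
        // are replaced with U+FFFD and the original text is lost.
        if (Charset.isSupported("Shift_JIS")) {
            Charset sjis = Charset.forName("Shift_JIS");
            String broken = roundTrip(text, sjis, StandardCharsets.UTF_8);
            System.out.println("Shift_JIS -> UTF-8 intact: " + text.equals(broken));
            System.out.println("contains U+FFFD: " + (broken.indexOf('\uFFFD') >= 0));
        }
    }
}
```

The argument array already holds decoded Strings, so there are no Shift_JIS bytes left to reinterpret by the time main runs.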

how to remove special characters from string

I have a String called
String s = "ConstituciÃ³n GarantÃa";
I want to convert it to Constitución garantía.
This is a Spanish keyword. How can I convert it?

What you have described is an XY problem. This is an encoding issue, and more characters that need replacing are likely to turn up. Instead of replacing them one by one, take the bytes the String was wrongly decoded from (ISO-8859-1) and decode the whole thing as UTF-8:
String s = "ConstituciÃ³n GarantÃa";
byte[] ptext = s.getBytes(StandardCharsets.ISO_8859_1);
String string = new String(ptext, StandardCharsets.UTF_8);
System.out.println(string); // Constitución Garantía
Consider fixing the encoding at the source the string comes from before you start working with it.
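To see why this works, the sketch below first produces the damaged text and then repairs it. The names corrupt and repair are hypothetical, and the sample is written with Unicode escapes so the result does not depend on the source file encoding. It assumes the wrong decoding step was ISO-8859-1, as in the answer above.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Simulates the faulty source: UTF-8 bytes wrongly decoded as ISO-8859-1.
    static String corrupt(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverses it: recover the original bytes, then decode them as UTF-8.
    static String repair(String mojibake) {
        return new String(mojibake.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Constituci\u00f3n Garant\u00eda"; // "Constitución Garantía"
        String mojibake = corrupt(original);

        // Each accented letter is two bytes in UTF-8, so it turns into two chars.
        System.out.println(mojibake.length() - original.length()); // 2
        System.out.println(repair(mojibake).equals(original));     // true
    }
}
```

The repair is lossless here because ISO-8859-1 maps every byte value to a character. If the text was instead mis-decoded as windows-1252, a few byte values have no mapping there and may not survive the round trip.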

MimeUtility.decode() doesn't work for every encoded text

I am working on a mail application and I have some trouble decoding MIME-encoded text. I am using MimeUtility.decodeText() but it doesn't work for every encoded text. Some texts are decoded properly but others are not.
The encoded texts that can't be decoded mostly use the utf-8 and iso-8859-9 charsets.
How can I solve this issue?
This is the code I used for decoding:
MimeUtility.decodeText(text);
These are examples of failing text:

Solution (thanks to @user_xtech007)
I solved the problem by using a regex to split the text into its individual encoded parts and decoding each part separately.
Here is the code of the method I am using:
private final String ENCODED_PART_REGEX_PATTERN="=\\?([^?]+)\\?([^?]+)\\?([^?]+)\\?=";

private String decode(String s)
{
    Pattern pattern=Pattern.compile(ENCODED_PART_REGEX_PATTERN);
    Matcher m=pattern.matcher(s);
    ArrayList<String> encodedParts=new ArrayList<String>();
    while(m.find())
    {
        encodedParts.add(m.group(0));
    }
    if(encodedParts.size()>0)
    {
        try
        {
            for(String encoded:encodedParts)
            {
                s=s.replace(encoded, MimeUtility.decodeText(encoded));
            }
            return s;
        } catch(Exception ex)
        {
            return s;
        }
    }
    else
        return s;
}
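For readers who want to see what each matched part contains, here is a rough JDK-only sketch that reuses the same regex and decodes the three captured groups (charset, encoding, payload) by hand. It is not how MimeUtility is implemented, and the class and method names are hypothetical. It handles only the "B" (Base64) and "Q" encodings and, as a simplification, does not drop the whitespace between adjacent encoded words.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EncodedWordSketch {
    // Same pattern as above: =?charset?encoding?payload?=
    private static final Pattern ENCODED_WORD =
            Pattern.compile("=\\?([^?]+)\\?([^?]+)\\?([^?]+)\\?=");

    static String decodeEncodedWords(String input) {
        Matcher m = ENCODED_WORD.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            out.append(decodeOne(m.group(1), m.group(2), m.group(3), m.group(0)));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    private static String decodeOne(String charsetName, String encoding,
                                    String payload, String original) {
        try {
            Charset charset = Charset.forName(charsetName);
            byte[] bytes;
            if (encoding.equalsIgnoreCase("B")) {
                bytes = Base64.getDecoder().decode(payload);
            } else if (encoding.equalsIgnoreCase("Q")) {
                bytes = decodeQ(payload);
            } else {
                return original; // unknown encoding: leave the word untouched
            }
            return new String(bytes, charset);
        } catch (IllegalArgumentException e) {
            // Unknown charset, malformed Base64 or bad hex digits: leave the word untouched.
            return original;
        }
    }

    // "Q" encoding: '_' means space, "=XX" is a byte in hex, anything else is literal.
    private static byte[] decodeQ(String payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < payload.length(); i++) {
            char c = payload.charAt(i);
            if (c == '_') {
                out.write(' ');
            } else if (c == '=' && i + 2 < payload.length()) {
                out.write(Integer.parseInt(payload.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // The encoded word sits in the middle of other text, as in the failing cases.
        System.out.println(decodeEncodedWords("Re:=?UTF-8?B?w7Z6ZWw=?=!"));
        System.out.println(decodeEncodedWords("=?ISO-8859-1?Q?g=FCn_a?="));
    }
}
```

Because the matcher finds encoded words anywhere in the input, words embedded inside other text are decoded as well, which is the effect the regex approach above is after.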

Convert the string you receive into a byte array and then use this to decode it as UTF-8 text:
String s2 = new String(bytes, "UTF-8");
To get the byte array, first convert the ISO-8859-1 text into bytes, then convert those to a String as above:
byte[] b2 = s.getBytes("ISO-8859-1");
To extract the encoded string from the URI, you can use a regex.

You can also decode this string by setting
System.setProperty("mail.mime.decodetext.strict", "false");
before you call MimeUtility.decodeText(text);
This ensures that "inner words" also get decoded:
The mail.mime.decodetext.strict property controls decoding of MIME
encoded words. The MIME spec requires that encoded words start at the
beginning of a whitespace separated word. Some mailers incorrectly
include encoded words in the middle of a word. If the
mail.mime.decodetext.strict System property is set to "false", an
attempt will be made to decode these illegal encoded words. The
default is true.
https://docs.oracle.com/javaee/7/api/javax/mail/internet/MimeUtility.html

Encode french character in Java over smpp?

This is my code. I am trying to send a message over SMPP, but "?" appears in the output:
public class Encoding
{
    public static void main(String[] args) throws SocketTimeoutException, AlreadyBoundException, VersionException, SMPPProtocolException, UnsupportedOperationException, IOException
    {
        SubmitSM sm=new SubmitSM();
        String strMessage="Pour se désinscrire du service TT ZONE, envoyez GRATUITEMENT « DTTZ » ";
        String utf8 = new String(strMessage.getBytes("UTF-8"));
        UCS2Encoding uc = UCS2Encoding.getInstance(true);
        sm.setDataCoding(2);
        sm.setMessageText(utf8);
        System.out.println(sm.getMessageText());
    }
}

Your problem is here:
String strMessage="Pour se désinscrire du service TT ZONE, envoyez GRATUITEMENT « DTTZ » ";
String utf8 = new String(strMessage.getBytes("UTF-8"));
Why do you do that at all? Since the UCS2Encoding class accepts a String as an argument, it will take care of the encoding itself.
Just do:
sm.setMessageText(strMessage);
As I mentioned in the other question you asked, you are mixing a LOT of concepts. Remember that a String is a sequence of chars; it is independent of the encoding you use. The fact that internally Java uses UTF-16 is totally irrelevant here. It could use UTF-32 or EBCDIC, or even use carrier pigeons, the process itself would not change:
                 encode           decode
String (char[]) --------> byte[] --------> String (char[])
And by using the String constructor taking a byte array as an argument, you create a sequence of chars from these bytes using the default JVM encoding. Which may, or may not, be UTF-8.
In particular, if you are using Windows, the default encoding will typically be windows-1252 (Java 18 and later default to UTF-8 instead). Let us replace encode and decode above with the charset names. What you do is:
                 UTF-8           windows-1252
String (char[]) -------> byte[] --------------> String (char[])
"Houston, we have a problem!"
For more details, see the javadocs for Charset, CharsetEncoder and CharsetDecoder.
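As a small illustration of the encode/decode diagram, the sketch below encodes one word from the message with three charsets and decodes it again with an explicit charset. The class and method names are made up for the example, and it says nothing about what the SMPP library does internally.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeDecodeDemo {
    // Number of bytes the text occupies once encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String word = "d\u00e9sinscrire"; // "désinscrire": 11 chars, whatever the encoding

        System.out.println(encodedLength(word, StandardCharsets.ISO_8859_1)); // 11
        System.out.println(encodedLength(word, StandardCharsets.UTF_8));      // 12
        System.out.println(encodedLength(word, StandardCharsets.UTF_16BE));   // 22

        // Decoding with the same charset that was used for encoding restores the String.
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(word)); // true

        // new String(byte[]) without a charset uses this value, which varies between systems.
        System.out.println(Charset.defaultCharset());
    }
}
```

The same String yields three different byte arrays, which is why the charset must be named at the point where bytes are produced or consumed, not before.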

MD5 Hash of ISO-8859-1 string in Java

I'm implementing an interface for a digital payment service called Suomen Verkkomaksut. The information about the payment is sent to them via an HTML form. To ensure that no one tampers with the information during the transfer, an MD5 hash is calculated at both ends with a special key that is not sent to them.
My problem is that for some reason they seem to decide that the incoming data is encoded with ISO-8859-1 and not UTF-8. The hash that I send to them is calculated with UTF-8 strings, so it differs from the hash that they calculate.
I tried this with the following code:
String prehash = "6pKF4jkv97zmqBJ3ZL8gUw5DfT2NMQ|13466|123456||Testitilaus|EUR|http://www.esimerkki.fi/success|http://www.esimerkki.fi/cancel|http://www.esimerkki.fi/notify|5.1|fi_FI|0412345678|0412345678|esimerkki@esimerkki.fi|Matti|Meikäläinen||Testikatu 1|40500|Jyväskylä|FI|1|2|Tuote #101|101|1|10.00|22.00|0|1|Tuote #202|202|2|8.50|22.00|0|1";
String prehashIso = new String(prehash.getBytes("ISO-8859-1"), "ISO-8859-1");
String hash = Crypt.md5sum(prehash).toUpperCase();
String hashIso = Crypt.md5sum(prehashIso).toUpperCase();
Unfortunately both hashes are identical with value C83CF67455AF10913D54252737F30E21. The correct value for this example case is 975816A41B9EB79B18B3B4526569640E according to Suomen Verkkomaksut's documentation.
Is there a way to calculate MD5 hash in Java with ISO-8859-1 strings?
UPDATE: While waiting for an answer from Suomen Verkkomaksut, I found an alternative way to make the hash. Michael Borgwardt corrected my understanding of Strings and encodings, and I looked for a way to make the hash from a byte[].
Apache Commons is an excellent source of libraries, and I found their DigestUtils class, which has an md5Hex method that takes a byte[] input and returns a 32-character hex string.
For some reason this still doesn't work. Both of these return the same value:
DigestUtils.md5Hex(prehash.getBytes());
DigestUtils.md5Hex(prehash.getBytes("ISO-8859-1"));

You seem to misunderstand how string encoding works, and your Crypt class's API is suspect.
Strings don't really "have an encoding" - an encoding is what you use to convert between Strings and bytes.
Java Strings are internally stored as UTF-16, but that does not really matter, as MD5 works on bytes, not Strings. Your Crypt.md5sum() method has to convert the Strings it's passed to bytes first - what encoding does it use to do that? That's probably the source of your problem.
Your example code is pretty nonsensical as the only effect this line has:
String prehashIso = new String(prehash.getBytes("ISO-8859-1"), "ISO-8859-1");
is to replace characters that cannot be represented in ISO-8859-1 with question marks.
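To make the point about bytes concrete, here is a minimal JDK-only sketch. Md5CharsetDemo and md5Hex are hypothetical names, and the short inputs are made up for the example rather than taken from the payment data. It hashes the bytes produced by an explicitly chosen charset, so the charset becomes a visible input of the hash.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Md5CharsetDemo {
    // Hashes the bytes obtained by encoding the text with the given charset.
    static String md5Hex(String text, Charset charset) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(charset));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // %02X always emits two hex digits, including a leading zero.
                hex.append(String.format("%02X", b & 0xFF));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    public static void main(String[] args) {
        // ASCII-only text encodes to the same bytes in both charsets, so the hashes match.
        System.out.println(md5Hex("abc", StandardCharsets.UTF_8));      // 900150983CD24FB0D6963F7D28E17F72
        System.out.println(md5Hex("abc", StandardCharsets.ISO_8859_1)); // same value

        // "ä" is one byte in ISO-8859-1 and two in UTF-8, so the hashes differ.
        String name = "Jyv\u00e4skyl\u00e4";
        System.out.println(md5Hex(name, StandardCharsets.UTF_8)
                .equals(md5Hex(name, StandardCharsets.ISO_8859_1)));    // false
    }
}
```

A method that takes only a String, like Crypt.md5sum, has to pick a charset internally, which is why its result cannot be influenced by re-creating the String beforehand.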

Java has a standard java.security.MessageDigest class, for calculating different hashes.
Here is the sample code
import java.security.MessageDigest;
// Exception handling not shown
String prehash = ...
final byte[] prehashBytes = prehash.getBytes( "iso-8859-1" );
System.out.println( prehash.length( ) );
System.out.println( prehashBytes.length );
final MessageDigest digester = MessageDigest.getInstance( "MD5" );
digester.update( prehashBytes );
final byte[] digest = digester.digest( );
final StringBuffer hexString = new StringBuffer();
for ( final byte b : digest ) {
    final int intByte = 0xFF & b;
    // Pad values below 0x10 so that every byte yields two hex digits
    if ( intByte < 0x10 )
    {
        hexString.append( "0" );
    }
    hexString.append(
        Integer.toHexString( intByte )
    );
}
System.out.println( hexString.toString( ).toUpperCase( ) );
Unfortunately for you, it produces the same "C83CF67455AF10913D54252737F30E21" hash. So, I guess your Crypt class is exonerated. I specifically added the prehash and prehashBytes length printouts to verify that 'ISO-8859-1' is indeed used. In this case both are 328.
When I did prehash.getBytes( "utf-8" ) it produced "9CC2E0D1D41E67BE9C2AB4AABDB6FD3" (and the length of the byte array became 332). Again, not the result you are looking for.
So, I guess Suomen Verkkomaksut does some massaging of the prehash string that they did not document, or you have overlooked.

Not sure if you solved your problem, but I had a similar problem with ISO-8859-1 encoded strings containing the Nordic ä and ö characters, when calculating a SHA-256 hash to compare with values in the documentation. The following snippet worked for me:
import java.security.MessageDigest;
//imports omitted

@Test
public void test() throws ProcessingException{
    String test = "iamastringwithäöchars";
    System.out.println(this.digest(test));
}

public String digest(String data) throws ProcessingException {
    MessageDigest hash = null;
    try{
        hash = MessageDigest.getInstance("SHA-256");
    }
    catch(Throwable throwable){
        throw new ProcessingException(throwable);
    }
    byte[] digested = null;
    try {
        digested = hash.digest(data.getBytes("ISO-8859-1"));
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
    String ret = BinaryUtils.BinToHexString(digested);
    return ret;
}
To transform bytes to hex string there are many options, including the apache commons codec Hex class mentioned in this thread.

If you send UTF-8 encoded data that they treat as ISO-8859-1, then that could be the source of your problem. I suggest you either send the data in ISO-8859-1 or try to communicate to Suomen Verkkomaksut that you're sending UTF-8. In an HTTP-based protocol you do this by adding charset=utf-8 to Content-Type in the HTTP header.
A way to rule out some issues would be to try a prehash String that only contains characters that are encoded the same in UTF-8 and ISO-8859-1. From what I can see, you can achieve this by removing all "ä" characters from the string you've used.
