Replacing Unicode character codes with characters in String in Java


I have a Java String like this: "peque\u00f1o". Note that it has an embedded Unicode character: '\u00f1'.
Is there a method in Java that will replace these Unicode character sequences with the actual characters? That is, a method that would return "pequeño" if you gave it "peque\u00f1o" as input?
Note that I have a string that has 12 chars (those that we see, that happen to be in the ASCII range).

Actually the string is "pequeño".
String s = "peque\u00f1o";
System.out.println(s.length());
System.out.println(s);
yields
7
pequeño
i.e. seven chars and the correct representation on System.out.

I remember giving the same response last week: use org.apache.commons.lang.StringEscapeUtils. Its unescapeJava method decodes these escape sequences.
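If the string really does contain the literal characters backslash, u, 0, 0, f, 1 (for example because it was read from a file rather than typed into Java source), the escapes can also be decoded with the JDK alone. The following is only a minimal sketch, not the Commons Lang implementation: the class and method names are made up for illustration, and it handles nothing but the four-hex-digit form.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeUnescapeSketch {
    // Matches a literal backslash, the letter 'u', and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Hypothetical helper: replaces each escape with the char it names.
    static String decode(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and backslashes in the result literal.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "peque\\u00f1o";             // the escape is literal text here
        System.out.println(raw.length());          // 12
        System.out.println(decode(raw).length());  // 7
    }
}
```

Supplementary characters written as two consecutive escapes (a surrogate pair) come out correctly as well, since each half is decoded to its own char.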

If you have the appropriate fonts, a println or setting the string in a JLabel or JTextArea should do the trick. The escaping is only for the compiler.
If you plan to copy-paste the readable strings into source code, remember to also choose a suitable file encoding such as UTF-8.

Related

Fill in placeholders in an HTML file using Java

I got an HTML file that looks like this:
<body>
<p>Hello! <b>[NAME]%</b></p>
</body>
And what I got in my Java file is that:
String name = "John";
My question is:
How do I fill John into the [NAME]% placeholder in Java?
After doing so, how do I convert it to a base64-encoded string in Java?
Thank you for your help!

You are using a lot of characters that Java's regular-expression processor treats specially. If you have programmed Java text-processing before, the String.replace(CharSequence, CharSequence) method should accomplish what you are attempting to do.
There are three String-based replace methods. Two of them, replaceAll and replaceFirst, take regular expressions, and a regular expression would expect you to "escape" the brackets that you have typed.
Here is the text, copied from Oracle/Sun's Java documentation for: java.lang.String
String replace(CharSequence target, CharSequence replacement)
Replaces each substring of this string that matches the literal target
sequence with the specified literal replacement sequence.
String replaceAll(String regex, String replacement)
Replaces each substring of this string that matches the given regular
expression with the given replacement.
String replaceFirst(String regex, String replacement)
Replaces the first substring of this string that matches the given
regular expression with the given replacement.
Just so you are aware - the two that say "regex" in the parameter-list would expect the regex String to follow this format for pattern-matching purposes:
// Regular-Expression Programming with java.lang.String - Several "Escaped" Characters!
// ALSO NOTE: Back-slashes need to be twice-escaped!
String replacePattern = "\\[NAME\\]%";
yourText = yourText.replaceFirst(replacePattern, "John"); // Strings are immutable, so keep the result
These "back-slashes from hell" are required because the regular-expression processor wants you to escape the '[' and the ']', which are special characters in regex syntax. Please review Regular Expressions in the Java 7/8/9 documentation to understand how String.replaceFirst and String.replaceAll work vis-a-vis the regex variable. Alternatively, if you use String.replace, all Java would expect is a direct character match, specifically:
yourText = yourText.replace("[NAME]%", "John");
Here is a link to Sun/Oracle's page on java.util.regex.Pattern:
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
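If a regex-based method is needed anyway, the hand-written backslashes can be avoided. The sketch below is one possible approach rather than the only one: the class and helper names are invented, and it relies on Pattern.quote and Matcher.quoteReplacement from java.util.regex so that both the placeholder and the inserted value are treated literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderQuoteSketch {
    // Hypothetical helper: fills the first occurrence of a placeholder.
    static String fillFirst(String template, String placeholder, String value) {
        // Pattern.quote makes the placeholder a literal pattern, so '[' and ']' need no escaping.
        // Matcher.quoteReplacement does the same for '$' and backslashes in the value.
        return template.replaceFirst(Pattern.quote(placeholder), Matcher.quoteReplacement(value));
    }

    public static void main(String[] args) {
        String html = "<p>Hello! <b>[NAME]%</b></p>";
        System.out.println(fillFirst(html, "[NAME]%", "John"));
        // <p>Hello! <b>John</b></p>
    }
}
```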
NOTE: The answer below is copied from Google's answer about Base64 encoding. I personally do not quite understand your question. Are you talking about UTF-8? Unicode? What do you mean by a "Base64 encoded String"?
What is the use of base64 encoding in Java?
String encodeToString(byte[] src)
Encodes the specified byte array into a String using the Base64 encoding scheme.
Base64.Encoder withoutPadding()
Returns an encoder instance that encodes equivalently to this one, but without
adding any padding character at the end of the encoded byte data.
OutputStream wrap(OutputStream os)
Wraps an output stream for encoding byte data using the Base64
encoding scheme.
What is base64 encoding in Java?
Base64 is a binary-to-text encoding scheme that represents binary data in a printable ASCII string format by translating it into a radix-64 representation. Each Base64 digit represents exactly 6 bits of binary data.
Here is a link to Sun's Page on the issue:
https://docs.oracle.com/javase/8/docs/api/java/util/Base64.Encoder.html
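Putting both parts of the question together, a minimal sketch might look like the following. It assumes Java 8 or later (for java.util.Base64) and assumes the text should be turned into bytes as UTF-8 before encoding, since the question does not say; the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class FillAndEncodeSketch {
    // Literal replacement of the placeholder, then Base64 over the UTF-8 bytes.
    static String fillAndEncode(String html, String name) {
        String filled = html.replace("[NAME]%", name);
        byte[] utf8 = filled.getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(utf8);
    }

    // Reverses the encoding step, to check the round trip.
    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String html = "<body>\n<p>Hello! <b>[NAME]%</b></p>\n</body>";
        String encoded = fillAndEncode(html, "John");
        System.out.println(encoded);
        System.out.println(decode(encoded)); // the HTML with John filled in
    }
}
```

Base64 works on bytes, not characters, which is why the charset has to be chosen explicitly before encoding.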

Binary string (11110010) to char - Java

I have a "binary" string, String s1 = "10011000", and want to print the corresponding character (Ф) of this byte. How can I do this?
I have read and tested so many solutions and tutorials and can't find exactly what I want! Moreover, I think there is an encoding problem.
For example, this code doesn't work, but why? (I get "?" in the output, so is it an encoding problem?)
int j = Integer.parseInt("10011000", 2);
System.out.println(new Character ((char)j));

10011000 is decimal 152, i.e. Unicode code point U+0098. That is a C1 control character with no visible glyph, so what appears depends on what your console's encoding does with it.

The character Ф is a Cyrillic capital letter; in Unicode, the hexadecimal value is \u0424. The binary string you are trying to parse is 152 decimal. The binary string for \u0424 is 010000100100 (1060 decimal) and so I would fix that first. And as others noted, until your environment character set supports Unicode output, Java will substitute a "?" character for any character that the current character set doesn't support. See Unicode characters in Eclipse for setting up Eclipse console to Unicode.

You have used the wrong binary value. If you want to see Ф in the output, you need to change your code to this:
int j = Integer.parseInt("10000100100", 2);
System.out.println((char) j);
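As a self-contained sketch of the same fix: the class and helper names below are hypothetical, and printing through a UTF-8 PrintStream is an assumption about the console, which still has to be able to display UTF-8 for Ф to show up.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class BinaryToCharSketch {
    // Parses a string of 0s and 1s as a UTF-16 code unit.
    static char fromBinary(String bits) {
        return (char) Integer.parseInt(bits, 2);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Encode output as UTF-8 instead of relying on the platform default.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(fromBinary("10000100100"));          // U+0424, decimal 1060
        out.println(Integer.toBinaryString('\u0424'));   // 10000100100
    }
}
```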

why '?' appears as output while Printing unicode characters in java

While printing certain unicode characters in java we get output as '?'. Why is it so and is there any way to print these characters?
This is my code
String symbol1="\u200d";
StringBuilder strg = new StringBuilder("unicodecharacter");
strg.insert(5,symbol1);
System.out.println("After insertion...");
System.out.println(strg.toString());
Output is
After insertion...
unico?decharacter

Here's a great article, written by Joel Spolsky, on the topic. It won't directly help you solve your problem, but it will help you understand what's going on. It'll also show you how involved the situation really is.

You have a character encoding which doesn't match the character you have or the supported characters on the screen.
I would check which encoding you are using throughout and try to determine whether you are reading, storing or printing the value correctly.

Are you sure which encoding you need? You may need to explicitly encode your output as UTF-8 or ISO 8859-1 if you are dealing with European characters.
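To see where the '?' can come from, here is a small sketch (class and method names invented) that encodes the string from the question with two charsets and decodes it again. It assumes the problem is on the output side: String.getBytes substitutes the charset's replacement byte, '?' for US-ASCII, whenever a character cannot be represented.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class QuestionMarkSketch {
    // Encodes and decodes with the same charset; unmappable chars are lost on the way.
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String s = "unico\u200ddecharacter"; // contains ZERO WIDTH JOINER
        System.out.println(roundTrip(s, StandardCharsets.US_ASCII));        // unico?decharacter
        System.out.println(roundTrip(s, StandardCharsets.UTF_8).equals(s)); // true
    }
}
```

System.out behaves the same way with whatever encoding the console reports, so wrapping it in a PrintStream with an explicit charset such as UTF-8 is one way to avoid the substitution, provided the console can display that encoding.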

Java's default behaviour when reading an invalid unicode character is to replace it with the Replacement Character (\uFFFD). This character is often rendered as a question mark.
In your case, the text you're reading is not encoded as unicode, it's encoded as something else (Windows-1252 or ISO-8859-1 are probably the most common alternatives if your text is in English).

I wrote an open-source library with a utility that converts any String to a Unicode sequence and vice versa. It helps to diagnose such issues. For instance, to print your String you can use something like this:
String str= StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString("\\u0197" +
StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("Test"));
You can read about the library, where to download it, and how to use it in "Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison". See the paragraph "String Unicode converter".

escaped html won't unescaped (now: unescaped html won't escape back)

So I'm currently using the commons lang apache library.
When I tried unescaping this string: &#128512;
This returns the same string: &#128512;
String characters = "&#128512;"
StringEscapeUtils.unescapeHtml(characters);
Output: &#128512;
But when I tried unescaping a String with fewer characters, it works:
String characters = "&#12851;"
StringEscapeUtils.unescapeHtml(characters);
Output: ㈳
Any ideas? When I tried unescaping the String "&#128512;" with an online unescaping utility, it worked, so maybe it's a bug in the Apache Commons Lang library? Or can anyone recommend another library?
Thanks.
UPDATES:
I'm now able to unescape the String successfully. The problem now is that when I try to escape the result of that unescape, it won't bring back the original String (&#128512;).

unescapeHtml() leaves &#128512; untouched because, as the documentation says, it only unescapes HTML 4.0 entities, which are limited to 65,536 characters. Unfortunately, 128,512 is far beyond that limit.
Have you tried using unescapeXml()?
XML supports character references up to 1,114,111 (10FFFFh).

This is a unicode character whose index is U+1F600 (128512) - GRINNING FACE
Refer the URL for details
The String you have mentioned is HTML Escape of U+1F600, If you unescape it using Apache commons lang it will draw you the required smiley as provided in screenshot
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
Regarding your update that it's not converting back to &#128512;:
You can also represent a character using a Numeric Character Reference, of the form &#dddd;, where dddd is the decimal value representing the character's Unicode scalar value. You can alternatively use a hexadecimal representation &#xhhhh;, where hhhh is the hexadecimal value equivalent to the decimal value.
A good site for this
Have added few SoP to help you understand this unicode better.
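The surrogate-pair and numeric-reference points above can be checked with the JDK alone. This is a rough sketch with invented names, not what Commons Lang does internally; it handles only decimal references and does no error checking.

```java
class NumericReferenceSketch {
    // Decodes one decimal reference such as "&#128512;" (hypothetical helper).
    static String decode(String ref) {
        int codePoint = Integer.parseInt(ref.substring(2, ref.length() - 1));
        // toChars yields one char for BMP code points and a surrogate pair above U+FFFF.
        return new String(Character.toChars(codePoint));
    }

    // Encodes every code point (not every char) as a decimal reference.
    static String encode(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.append("&#").append(cp).append(';');
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String face = decode("&#128512;");
        System.out.println(face.length());                                            // 2
        System.out.printf("%04X %04X%n", (int) face.charAt(0), (int) face.charAt(1)); // D83D DE00
        System.out.println(encode(face));                                             // &#128512;
    }
}
```

Iterating by code point is what makes the round trip work: escaping char by char would emit the two surrogates as separate references instead.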

Well - the solution is pretty easy:
use org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4 instead! (unless you're using Java <1.5, which you probably won't)
String characters = "&#128512;";
StringEscapeUtils.unescapeHtml4(characters);

I think the problem is that there is no Unicode character "&#128512;",
so the method simply returns this string.
The doc of the function says only:
Returns: a new unescaped String, null if null string input

If it's a HTML specific question, then you can just use JavaScript for this purpose.
You can do
escape("&#128512;") which gives you %26%23128512%3B
unescape("%26%23128512%3B") which gives you back &#128512;

How to parse word-created special chars in java

I am trying to parse some Word documents in Java. Some of the values are things like a date range, and instead of showing up as StartDate - EndDate I am getting some funky characters, like so:
StartDate ΓÇô EndDate
This is where Word puts in a special hyphen character. Can you search for these characters and replace them with a regular - or something in the string, so that I can then tokenize on a "-"? And what is that character: ASCII, Unicode, or what?
Edited to add some code:
String projDateString = "08/2010 ΓÇô Present"
Charset charset = Charset.forName("Cp1252");
CharsetDecoder decoder = charset.newDecoder();
ByteBuffer buf = ByteBuffer.wrap(projDateString.getBytes("Cp1252"));
CharBuffer cbuf = decoder.decode(buf);
String s = cbuf.toString();
println ("S: " + s)
println("projDatestring: " + projDateString)
Outputs the following:
S: 08/2010 ΓÇô Present
projDatestring: 08/2010 ΓÇô Present
Also, using the same projDateString above, if I do:
projDateString.replaceAll("\u0096", "\u2013");
projDateString.replaceAll("\u0097", "\u2014");
and then print out projDateString, it still prints as
projDatestring: 08/2010 ΓÇô Present

You are probably getting Windows-1252 which is a character set, not an encoding. (Torgamus - Googling for Windows-1232 didn't give me anything.)
Windows-1252 (called "Cp1252" in older Java APIs) is almost ISO-8859-1, whose values match the first 256 Unicode code points, but it puts printable characters in the 0x80 to 0x9F range. The En Dash is character 150 (0x96), which in Unicode falls within the C1 reserved control character range and shouldn't be there.
You can search for char 150 and replace it with \u2013 which is the proper Unicode code point for En Dash.
There are quite a few other character that MS has in the 0x80 to 0x9f range, which is reserved in the Unicode standard, including Em Dash, bullets, and their "smart" quotes.
Edit: By the way, Java represents characters internally as Unicode (UTF-16 code units). UTF-8 is an encoding; it is the default when writing Strings to files or network connections on Java 18 and later, while older versions use the platform's default charset.
Say you have
String stuff = MSWordUtil.getNextChunkOfText();
Where MSWordUtil would be something that you've written to somehow get pieces of an MS-Word .doc file. It might boil down to
File myDocFile = new File(pathAndFileFromUser);
InputStream input = new FileInputStream(myDocFile);
// and then start reading chunks of the file
By default, as you read byte buffers from the file and make Strings out of them, Java decodes them with its default charset. There are ways, as Lord Torgamus says, to tell it what encoding should be used, but without doing that the text may be decoded as something other than Windows-1252. Windows-1252 agrees with ISO-8859-1 everywhere except for those pesky characters that land in the C1 control range.
After getting some String like stuff above, you won't find \u2013 or \u2014 in it, you'll find 0x96 and 0x97 instead.
At that point you should be able to do
stuff = stuff.replaceAll("\u0096", "\u2013"); // assign the result; Strings are immutable
I don't do that in my code where I've had to deal with this issue. I loop through an input CharSequence one char at a time, decide based on 0x80 <= charValue <= 0x9f if it has to be replaced, and look up in an array what to replace it with. The above replaceAll() is far easier if all you care about is the 1252 En Dash vs. the Unicode En Dash.
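The char-by-char loop described above could be sketched roughly as follows. This is not the author's code: instead of a hand-written lookup array it borrows the JDK's own windows-1252 decoder for the mapping, which assumes that charset is available in the running JVM, and the class and method names are invented.

```java
import java.nio.charset.Charset;

class C1RepairSketch {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Replaces chars in the C1 range with the character Windows-1252 assigns to that byte.
    static String repair(CharSequence input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= 0x80 && c <= 0x9F) {
                String decoded = new String(new byte[] {(byte) c}, CP1252);
                // Bytes that Windows-1252 leaves undefined decode to U+FFFD; keep the original then.
                out.append(decoded.equals("\uFFFD") ? String.valueOf(c) : decoded);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String fixed = repair("08/2010 \u0096 Present");
        System.out.println(fixed.indexOf('\u2013')); // 8
    }
}
```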

s = s.replace( (char)145, (char)'\'');
s = s.replace( (char)8216, (char)'\''); // left single quote
s = s.replace( (char)146, (char)'\'');
s = s.replace( (char)8217, (char)'\''); // right single quote
s = s.replace( (char)147, (char)'\"');
s = s.replace( (char)148, (char)'\"');
s = s.replace( (char)8220, (char)'\"'); // left double
s = s.replace( (char)8221, (char)'\"'); // right double
s = s.replace( (char)8211, (char)'-' ); // en dash (U+2013)
s = s.replace( (char)150, (char)'-' );
http://www.coderanch.com/how-to/java/WeirdWordCharacters

Your problem almost certainly has to do with your encoding scheme not matching the encoding scheme Word saves in. Your code is probably using the Java default, likely UTF-8 if you haven't done anything to it. Your input, on the other hand, is likely Windows-1252, the default for Microsoft Word's .doc documents. See this site for more info. Notably,
Within Windows, ISO-8859-1 is replaced by Windows-1252, which often means that text copied from, say, a Microsoft Word document and pasted straight into a web page produces HTML validation errors.
So what does this mean for you? You'll have to tell your program that the input is using Windows-1252 encoding, and convert it to UTF-8. You can do this in varying flavors of "manually." Probably the most natural way is to take advantage of Java's built-in Charset class.
Windows-1252 is recognized by the IANA Charset Registry
Name: windows-1252
MIBenum: 2252
Source: Microsoft (http://www.iana.org/assignments/charset-reg/windows-1252) [Wendt]
Alias: None
so it should be Charset-compatible. I haven't done this before myself, so I can't give you a code sample, but I will point out that there is a String constructor that takes a byte[] and a Charset as arguments.
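A small sketch of that constructor in use, under the assumption that the raw bytes really are Windows-1252 and that the JVM ships that charset (the class name and sample bytes are made up for illustration):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeCp1252Sketch {
    // Decodes raw bytes as Windows-1252 using the String(byte[], Charset) constructor.
    static String decode(byte[] raw) {
        return new String(raw, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // "08 ", then byte 0x96 (the Windows-1252 en dash), then " X"
        byte[] raw = {'0', '8', ' ', (byte) 0x96, ' ', 'X'};
        String text = decode(raw);
        System.out.println(text.indexOf('\u2013'));                        // 3
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);  // 8
    }
}
```

Once the text is a String, the original encoding no longer matters; converting to UTF-8 is just a getBytes call with the target charset.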

Probably, that character is an en dash, and the strange blurb you see is due to a difference between the way Word encodes that character and the way that character is decoded by whatever (other) system you are using to display it.
If I remember correctly from when I did some work on character encodings in Java, String instances always internally use UTF-16; so, within such an instance, you may search and replace a single character by its Unicode form. For example, let's say you would like to substitute smart quotes with plain double quotes: given a String s, you may write
s = s.replace('\u201c', '"');
s = s.replace('\u201d', '"');
where 201c and 201d are the Unicode code points for the opening and closing smart quotes. According to the link above on Wikipedia, the Unicode code point for the en dash is 2013.
