Automatic Unicode string formatting in Java

Automatic Unicode string formatting in Java - java

I just came across something like this:
String sample = "somejunk+%3cfoobar%3e+morestuff";
Printed out, sample looks like this:
somejunk+<foobar>+morestuff
How does that work? U+003c and U+003e are the Unicode codes for the less than and greater than signs, respectively, which seems like more than a coincidence, but I've never heard of Java automatically doing something like this. I figured it'd be an easy thing to pop into Google, but it turns out Google doesn't like the percent sign.

That string is probably URL encoded You'd decode that in java using the URLDecoder
String res = java.net.URLDecoder.decode(sample, "UTF8");

You can do something like this,
String sample = "somejunk+%3cfoobar%3e+morestuff";
String result = URLDecoder.decode(sample.replaceAll("\\+", "%2B"), "UTF8");

Java does support Unicode escapes in char and String literals, but not URL encoding.
The Unicode escapes use '\uXXXX', where XXXX is the Unicode point in hexadecimal.
Curious tidbit: The grammar allows 'u' to occur multiple times, so that '\uuuuuuuu0041' is a valid Unicode escape (for 'A').

Related

? is the only output for all unicode above U+0080 in java [duplicate]

While printing certain unicode characters in java we get output as '?'. Why is it so and is there any way to print these characters?
This is my code
String symbol1="\u200d";
StringBuilder strg = new StringBuilder("unicodecharacter");
strg.insert(5,symbol1);
System.out.println("After insertion...");
System.out.println(strg.toString());
Output is
After insertion...
unico?decharacter

Here's a great article, written by Joel Spolsky, on the topic. It won't directly help you solve your problem, but it will help you understand what's going on. It'll also show you how involved the situation really is.

You have a character encoding which doesn't match the character you have or the supported characters on the screen.
I would check which encoding you are using through out and try to determine whether you are reading, storing or printing the value correctly.

Are you sure which encoding you need? You may need to explicitly encode your output as UTF-8 or ISO 8859-1 if you are dealing with European characters.

Java's default behaviour when reading an invalid unicode character is to replace it with the Replacement Character (\uFFFD). This character is often rendered as a question mark.
In your case, the text you're reading is not encoded as unicode, it's encoded as something else (Windows-1252 or ISO-8859-1 are probably the most common alternatives if your text is in English).

I wrote an Open Source Library that has a utility that converts any String to Unicode sequence and vise-versa. It helps to diagnose such issues. So for instance to print your String you can use something like this:
String str= StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString("\\u0197" +
StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("Test"));
You can read about the library and where to download it and how to use it at Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison See the paragraph "String Unicode converter"

why '?' appears as output while Printing unicode characters in java

While printing certain unicode characters in java we get output as '?'. Why is it so and is there any way to print these characters?
This is my code
String symbol1="\u200d";
StringBuilder strg = new StringBuilder("unicodecharacter");
strg.insert(5,symbol1);
System.out.println("After insertion...");
System.out.println(strg.toString());
Output is
After insertion...
unico?decharacter

Here's a great article, written by Joel Spolsky, on the topic. It won't directly help you solve your problem, but it will help you understand what's going on. It'll also show you how involved the situation really is.

You have a character encoding which doesn't match the character you have or the supported characters on the screen.
I would check which encoding you are using through out and try to determine whether you are reading, storing or printing the value correctly.

Are you sure which encoding you need? You may need to explicitly encode your output as UTF-8 or ISO 8859-1 if you are dealing with European characters.

Java's default behaviour when reading an invalid unicode character is to replace it with the Replacement Character (\uFFFD). This character is often rendered as a question mark.
In your case, the text you're reading is not encoded as unicode, it's encoded as something else (Windows-1252 or ISO-8859-1 are probably the most common alternatives if your text is in English).

I wrote an Open Source Library that has a utility that converts any String to Unicode sequence and vise-versa. It helps to diagnose such issues. So for instance to print your String you can use something like this:
String str= StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString("\\u0197" +
StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("Test"));
You can read about the library and where to download it and how to use it at Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison See the paragraph "String Unicode converter"

Encoding pinyin

I'm currently developing a program in java, and I want to display Chinese pinyin, which I get from a distant website.
But I have the following problem: Chinese pinyin is displayed this way: jiǎ
Whereas it should be displayed this way: jiǎ
(I just typed the same sequence, except I stripped the slashes).
I think the answer to this question is really simple but I'm struggling to find it.

The problem is you have an HTML encoded Unicode character and what you want is the decoded version of it. A library like commons-lang3 (part of Apache Commons) will take your HTML encoded string and decode it for Java to display like this:
String decoded = StringEscapeUtils.unescapeHtml("jiǎ");
You can also escape Unicode characters in Java like this:
String jia = "ji\u01ce";
This clever one-liner will take a Unicode character and show you its escaped form:
System.out.println( "\\u" + Integer.toHexString('ǎ' | 0x10000).substring(1) );

Android- remove URL percent symbols from string

I have a URL that looks like this:
Liberty%21%20ft.%20Whiskey%20Pete%20-%20Thunderfist%20%28Original%20Mix%29.mp3
I'm trying to extract just the words from it. Right now, I'm using string.replace("%21", "!") for each and every %20, %29, etc. because each segment represent different characters or spaces. Is there a way to just covert those symbols and numbers to what they actually mean?
Thanks.

Those symbols are URLEncoded representations of characters that can't legally exist in a URL. (%20 = a single space, etc)
You need to UrlDecode those strings:
http://icfun.blogspot.com/2009/08/java-urlencode-and-urldecode-options.html
Official documentation here:
http://download.oracle.com/javase/6/docs/api/java/net/URLDecoder.html

It seems the input string is written using the URL encoding. Instead of writing all possible replacements manually (you can hardly cover all possibilities), you can use URLDecoder class in Java.
String input = "Liberty%21%20ft.%20Whiskey%20Pete...";
String decoded = URLDecoder.decode(input, "UTF-8");

Print string literal unicode as the actual character

In my Java application I have been passed in a string that looks like this:
"\u00a5123"
When printing that string into the console, I get the same string as the output (as expected).
However, I want to print that out by having the unicode converted into the actual yen symbol (\u00a5 -> yen symbol) - how would I go about doing this?
i.e. so it looks like this: "[yen symbol]123"

I wrote a little program:
public static void main(String[] args) {
System.out.println("\u00a5123");
}
It's output:
¥123
i.e. it output exactly what you stated in your post. I am not sure there is not something else going on. What version of Java are you using?
edit:
In response to your clarification, there are a couple of different techniques. The most straightforward is to look for a "\u" followed by 4 hex-code characters, extract that piece and replace with a unicode version with the hexcode (using the Character class). This of course assumes the string will not have a \u in front of it.
I am not aware of any particular system to parse the String as though it was an encoded Java String.

As has been mentioned before, these strings will have to be parsed to get the desired result.
Tokenize the string by using \u as separator. For example: \u63A5\u53D7 => { "63A5", "53D7" }
Process these strings as follows:
String hex = "63A5";
int intValue = Integer.parseInt(hex, 16);
System.out.println((char)intValue);

You're probably going to have to write a parse for these, unless you can find one in a third party library. There is nothing in the JDK to parse these for you, I know because I fairly recently had an idea to use these kind of escapes as a way to smuggle unicode through a Latin-1-only database. (I ended up doing something else btw)
I will tell you that java.util.Properties escapes and unescapes Unicode characters in this manner when reading and writing files (since the files have to be ASCII). The methods it uses for this are private, so you can't call them, but you could use the JDK source code to inspire your solution.

Could replace the above with this:
System.out.println((char)0x63A5);
Here is the code to print all of the box building unicode characters.
public static void printBox()
{
for (int i=0x2500;i<=0x257F;i++)
{
System.out.printf("0x%x : %c\n",i,(char)i);
}
}

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Automatic Unicode string formatting in Java - java

That string is probably URL encoded You'd decode that in java using the URLDecoder String res = java.net.URLDecoder.decode(sample, "UTF8");

You can do something like this, String sample = "somejunk+%3cfoobar%3e+morestuff"; String result = URLDecoder.decode(sample.replaceAll("\\+", "%2B"), "UTF8");

Java does support Unicode escapes in char and String literals, but not URL encoding. The Unicode escapes use '\uXXXX', where XXXX is the Unicode point in hexadecimal. Curious tidbit: The grammar allows 'u' to occur multiple times, so that '\uuuuuuuu0041' is a valid Unicode escape (for 'A').

Related

? is the only output for all unicode above U+0080 in java [duplicate]

why '?' appears as output while Printing unicode characters in java

Encoding pinyin

Android- remove URL percent symbols from string

Print string literal unicode as the actual character

Categories

Resources