php json_encode euro symbol for android - java

I am loading data in an Android app from a PHP service.
In PHP I use json_encode to convert my data.
Now I have a string with a € character in it.
json_encode converts this to \u0080, but as far as I know the correct Unicode escape should be \u20AC.
Usually that's not a problem, but the Droid Sans font only renders \u20AC as the euro symbol.
My question: is there a way to make the € character convert correctly (I don't care whether that happens in Java or in PHP, although I would prefer a PHP solution) without using string replaces, regexes, etc.?
Replacing seems ugly, and there might be more symbols that don't get converted properly that I don't know of yet.

\u0080 means that the input character was \x80, which is the Euro sign in Windows-1252. So I assume your string is encoded in this charset; you should convert it to UTF-8, because json_encode only works with UTF-8 input:
$string = iconv('Windows-1252', 'UTF-8', $string);
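If the PHP side cannot be changed, the same reinterpretation can be sketched on the Java side. This is only a fallback under a strong assumption: every character in the damaged string is at most U+00FF, meaning each one stands for a single Windows-1252 byte. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EuroRepair {
    // Windows-1252 maps the byte 0x80 to the Euro sign (U+20AC).
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Turns each char back into one byte (ISO-8859-1 is a 1:1 mapping for
    // U+0000..U+00FF) and decodes those bytes as Windows-1252.
    // Assumption: the input contains no char above U+00FF; such chars
    // would be replaced by '?' during the getBytes step.
    static String repair(String damaged) {
        byte[] raw = damaged.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, CP1252);
    }

    public static void main(String[] args) {
        String fixed = repair("10 \u0080");
        System.out.println(fixed.equals("10 \u20AC"));
    }
}
```

Converting to UTF-8 in PHP before calling json_encode remains the cleaner fix, because it repairs the data at the source instead of guessing on the client.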

Related

How do I store accented characters in S3 metadata?

I am trying to store accented characters such as ò in the metadata of an S3 object. I am using the REST API which according to this page only accepts US-ASCII: http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html
Is there a way to convert Strings in Scala or Java from Bòrd to B\u00F2rd?
I have tried using Normalizer.normalize(str, Normalizer.Form.NFD) but the character when submitted to S3 is still causing an error because it appears as ò. When I try to print out the returned String it is also showing ò.
A normalized Unicode string is just normalized in terms of composing characters, not necessarily converted to ASCII. Using NFKC would be more likely to convert characters to ASCII forms, but certainly would not do so reliably.
It sounds like what you want is to escape non-ASCII characters. You could use e.g. UnicodeEscaper from commons-lang, and UnicodeUnescaper to translate back.
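For illustration, the escaping step can also be sketched in plain Java without commons-lang. The class and method names here are hypothetical, and the sketch works on UTF-16 code units, so a character outside the Basic Multilingual Plane comes out as two escapes (one per surrogate).

```java
final class AsciiEscaper {
    // Copies ASCII chars unchanged and writes every other UTF-16 code unit
    // as a backslash, the letter u, and four uppercase hex digits.
    static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The argument is the word from the question, with o-grave as U+00F2.
        System.out.println(escapeNonAscii("B\u00F2rd"));
    }
}
```

Whatever reads the metadata back has to reverse the escaping, which is the job UnicodeUnescaper does in the library-based approach.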

Convert Freebase Unicode codepoints to Java String

I'm doing some Freebase queries. Sometimes the result of the query contains Unicode characters. How could I convert those characters into a Java String? (e.g., The_Police_$0028band$0029 → The_Police_(band)). I've tried:
new String(arg_in_byte,"UTF-8")
but it doesn't work. I saw in another question that one solution is the method replaceAll but I think that there is some other method that will be cleaner.
Those aren't UTF-8 encoded; they use a private encoding of Unicode code points. If your Java client library for Freebase doesn't include the necessary decoding method, you'll need to write one yourself: take the four digits after the dollar sign ($), interpret them as a hexadecimal integer, and convert that to a Java character (which also uses Unicode code points internally).
Here is some documentation on the encoding:
http://wiki.freebase.com/wiki/MQL_key_escaping
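A minimal sketch of such a decoder, assuming every escape is a dollar sign followed by exactly four hexadecimal digits as described above. The class and method names are hypothetical and not part of any Freebase client library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MqlKeyDecoder {
    // A dollar sign followed by exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\$([0-9A-Fa-f]{4})");

    static String decode(String key) {
        Matcher m = ESCAPE.matcher(key);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Interpret the four digits as a hexadecimal code unit.
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against a decoded '$' or backslash
            // being treated as a group reference.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("The_Police_$0028band$0029")); // The_Police_(band)
    }
}
```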

Encoding pinyin

I'm currently developing a program in java, and I want to display Chinese pinyin, which I get from a distant website.
But I have the following problem: Chinese pinyin is displayed this way: ji&#462;
Whereas it should be displayed this way: jiǎ
(I just typed the same sequence, except I stripped the slashes).
I think the answer to this question is really simple but I'm struggling to find it.
The problem is you have an HTML encoded Unicode character and what you want is the decoded version of it. A library like commons-lang3 (part of Apache Commons) will take your HTML encoded string and decode it for Java to display like this:
String decoded = StringEscapeUtils.unescapeHtml4("ji&#462;");
You can also escape Unicode characters in Java like this:
String jia = "ji\u01ce";
This clever one-liner will take a Unicode character and show you its escaped form:
System.out.println( "\\u" + Integer.toHexString('ǎ' | 0x10000).substring(1) );
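To show what the decoding step does, here is a library-free sketch that handles only numeric character references (decimal like `&#462;` and hexadecimal like `&#x1CE;`). It is an illustration with hypothetical names, not a replacement for StringEscapeUtils, which also covers named entities such as `&amp;`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumericEntityDecoder {
    // Group 1: hex digits after "&#x"; group 2: decimal digits after "&#".
    // Digit counts are capped so Integer.parseInt cannot overflow.
    private static final Pattern ENTITY =
            Pattern.compile("&#(?:[xX]([0-9A-Fa-f]{1,6})|([0-9]{1,7}));");

    static String decode(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            // Leave the reference untouched if it is not a valid code point.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("ji&#462;").equals("ji\u01CE"));
    }
}
```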

Java: Advise on Charset Conversion

I have been working on a scenario that does the following:
Get input data in Unicode format; [UTF-8]
Convert to ISO-8559;
Detect & replace unsupported characters for encoding; [Based on user-defined key-value pairs]
My question is: I have been trying to find in-depth information on ISO-8559, with no luck yet. Does anybody happen to know more about this? How different is it from ISO-8859? Any details would be very helpful.
Secondly, keeping the ISO-8559 requirement aside, I went ahead and wrote my program to convert the incoming data to ISO-8859 in Java. While I am able to achieve what is needed using character-based replacement, it obviously seems to be time-consuming when the data size is huge. [in MBs]
I am sure there must be a better way to do this. Can someone advise me, please?
I assume you want to convert UTF-8 to ISO-8859-1, that is, Western Latin-1. There are many charset tables on the net.
In general, for web browsers and Windows, it would be better to convert to Windows-1252, which is an extension that redefines the range 0x80 - 0x9F, among other things with the special quotes seen in MS Word. Browsers are de facto capable of interpreting these codes in content labeled ISO-8859-1, even on a Mac.
The standard Java conversion, like new OutputStreamWriter(new FileOutputStream("..."), "Windows-1252"), already does much of the work. You can either write a kind of filter, or look for the ? that is introduced in place of special characters that could not be translated. You could translate Latin letters with accents that are not in Windows-1252 to plain ASCII letters:
String s = ...
s = Normalizer.normalize(s, Normalizer.Form.NFD);
return s.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
For other scripts like Hindi or Cyrillic the keyword to search for is transliteration.
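The "filter" idea, combined with the question's user-defined key-value pairs, could look roughly like the sketch below. It makes a single pass over the input and asks a CharsetEncoder whether each code point is representable in the target charset. The class name, method name, and the fallback parameter are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Map;

final class UnmappableReplacer {
    // Replaces every code point the target charset cannot encode with the
    // user-supplied replacement, or with the fallback if none is defined.
    static String replaceUnmappable(String input, Charset target,
                                    Map<Integer, String> replacements, String fallback) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            String piece = new String(Character.toChars(cp));
            if (encoder.canEncode(piece)) {
                out.append(piece);
            } else {
                out.append(replacements.getOrDefault(cp, fallback));
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> userMap = Collections.singletonMap(0x20AC, "EUR");
        // The Euro sign is not in ISO-8859-1, e-acute is.
        System.out.println(replaceUnmappable("5 \u20AC caf\u00E9",
                StandardCharsets.ISO_8859_1, userMap, "?"));
    }
}
```

A single pass like this avoids running one replaceAll per mapping over megabytes of text, which is likely where the time goes in a replacement-per-character approach.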

String Encoding doesn't output all characters

My client uses InputStreamReader/BufferedReader to fetch text from the Internet.
However, when I save the text to a *.txt file, it shows extra weird special symbols like 'Â'.
I've tried converting the string to ASCII, but that messes up å, ä, ö, Ø, which I use.
I've tried food = food.replace("Â", ""); and indexOf();
But the string search won't find it, even though it's there in a hex editor.
So, in summary: when I use text.setText (Android), the output looks fine with no weird symbols, but when I save the text to *.txt I get about 4 of 'Â'. I do not want ASCII because I use other non-ASCII characters.
The 'Â' is displayed as whitespace on my Android device and in Notepad.
Thanks!
Have A great Weekend!
EDIT:
Solved it by removing all Non-breaking-spaces:
myString = myString.replaceAll("\\u00a0"," ");
You say that you are fetching like this:
in = new BufferedReader(new InputStreamReader(url.openStream(),"UTF-8"));
There is a fair chance that the stuff you are fetching is not encoded in UTF-8.
You need to call getContentType() on the HttpURLConnection object, and if it is non-null, extract the encoding and use it when you create the InputStreamReader. Only assume "UTF-8" if the response doesn't supply a content type with a valid encoding.
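A rough sketch of the extraction step, assuming the header has the usual `type/subtype; charset=name` shape. The helper name is hypothetical; the string passed in would be whatever getContentType() returned.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ContentTypeCharset {
    // Returns the charset named in a Content-Type value, or UTF-8 when the
    // value is null, has no charset parameter, or names an unknown charset.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                String p = param.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Illegal or unsupported charset name: use the default.
                        break;
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1"));
        System.out.println(charsetOf(null));
    }
}
```

The result would then be passed to the InputStreamReader constructor in place of the hard-coded "UTF-8".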
On reflection, while you SHOULD pay attention to the content type returned by the server, the real problem is either in the way that you are writing the *.txt file, or in the display tool that is showing strange characters.
It is not clear what encoding you are using to write the file. Perhaps you have chosen the wrong one.
It is possible that the display tool is assuming that the file has a different encoding. Maybe it only detects that a file is UTF-8 or UTF-16 if there is a BOM.
It is possible that the display tool is plain broken, and doesn't understand non-breaking spaces.
When you display files using a HEX editor, it is most likely using an 8-bit character set to render bytes, and that character set is most likely Latin-1. But apparently, the file is actually encoded differently.
Anyway, the approach of replacing non-breaking spaces is (IMO) a hack, and it won't deal with other stuff that you might encounter in the future. So I recommend that you take the time to really understand the problem, and fix it properly.
Finally, I think I understand why you might be getting Â characters. A Unicode NON-BREAKING-SPACE character is U+00A0. When you encode that as UTF-8, you get C2 A0. But C2 in Latin-1 is CAPITAL-A-CIRCUMFLEX, and A0 in Latin-1 is NON-BREAKING-SPACE. So the "confusion" is most likely that your program is writing the *.txt file in UTF-8 and the tool is reading it as Latin-1.
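The byte-level explanation can be reproduced in a few lines. The sketch below encodes a string as UTF-8 and then decodes the same bytes as Latin-1, which is what the display tool is presumed to be doing. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class NbspMojibake {
    // Encodes text as UTF-8, then misreads those bytes as ISO-8859-1.
    static String misreadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Print the UTF-8 bytes of a single non-breaking space.
        for (byte b : "\u00A0".getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
        // One char in, two chars out: A-circumflex followed by NBSP.
        String garbled = misreadAsLatin1("\u00A0");
        System.out.println(garbled.length());
        System.out.println(garbled.charAt(0) == '\u00C2');
    }
}
```

Writing the file and reading it back with the same explicitly named charset on both sides removes the mismatch without deleting any characters.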
