When I convert a UTF-8 String containing characters that do not exist in ISO-8859-1 to ISO-8859-1, I get question marks here and there. Sure, what else should it do!
Is there a Java tool that can map a string like "İKEA" to "IKEA" and avoid the ?, to make the best of it?
For the specific example, you can:
- decompose the letters and diacritics using compatibility-form Unicode normalization
- instruct the encoder to drop unsupported characters (the diacritics)
Example:
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.text.Normalizer.Form;

ByteArrayOutputStream out = new ByteArrayOutputStream();
// create an encoder that drops characters it cannot map
CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
encoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
// write data: NFKD splits "İ" into "I" plus a combining dot, which the encoder then drops
String ikea = "\u0130KEA";
String decomposed = Normalizer.normalize(ikea, Form.NFKD);
CharBuffer cbuf = CharBuffer.wrap(decomposed);
ByteBuffer bbuf = encoder.encode(cbuf);
// write only the encoded bytes, not the buffer's whole backing array
out.write(bbuf.array(), 0, bbuf.limit());
// verify
String decoded = new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
System.out.println(decoded); // IKEA
You're still transcoding from a character set that defines 109,384 values (Unicode 6) to one that supports 256, so there will always be limitations.
Also consider a more sophisticated transformation API like ICU for features like transliteration.
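For instance, with ICU4J (a sketch, assuming the com.ibm.icu dependency is available and that its built-in "Any-Latin; Latin-ASCII" transform suits your data):

import com.ibm.icu.text.Transliterator;

// decomposes, strips accents, and folds to the closest ASCII equivalents
Transliterator toAscii = Transliterator.getInstance("Any-Latin; Latin-ASCII");
System.out.println(toAscii.transliterate("\u0130KEA")); // prints IKEA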
Can we write Unicode Data in a File with ByteStreams?
My code is:
import java.io.FileOutputStream;

public static void main(String[] args) throws Exception {
    String str = "Русский язык ";
    FileOutputStream fos = new FileOutputStream("file path");
    fos.write(str.getBytes());
    fos.flush();
    fos.close();
}
Here I am using a byte stream to write Unicode data, and it writes properly. I am new to Java, but I have read that byte streams do not support Unicode characters. So why does it work in this case?
I have read that byte streams do not support Unicode characters.
Either you have used a bad source of information or you have probably misunderstood something. Byte streams support bytes, and therefore byte streams support anything that can be represented in bytes: videos, text, pictures, music... If a byte stream couldn't carry it, it couldn't be used in a digital computer at all.
The trick to representing those things in what is simply a sequence of 1s and 0s is to use agreed-upon rules. You encode your text according to certain rules, and the receiver can then decode it back using the same rules.
"Русский язык" can be represented as bytes in any encoding that supports cyrillic characters. In any of the encodings of unicode: UTF-8, UTF-16, UTF-32; Windows-1251, KOI8-R, KOI8-U, ISO-8859-5...
That doesn't mean these encodings are compatible with each other. They are all mutually incompatible when it comes to encoding Cyrillic script, so text encoded in one of these encodings must be decoded as exactly that encoding.
.getBytes() uses the platform default encoding, which happened to be one that supports Cyrillic script. You might believe it's UTF-8, but if you are on Windows it's far more likely to be Cp1251. Don't fall into the trap of thinking that just because you used "Unicode characters", your files are physically encoded in a UTF encoding. That will lead to encoding problems.
So always be explicit about the encoding, so that your program works the same on every platform and you always know what encoding the files your program creates are in. With your code, you could have done this:
String str = "Русский язык ";
FileOutputStream fos = new FileOutputStream("file path");
fos.write(str.getBytes("UTF-8"));
fos.flush();
fos.close();
Or as suggested by the other answer:
String str = "Русский язык ";
OutputStreamWriter osw = new OutputStreamWriter(
new FileOutputStream("file path"), "UTF-8"
);
osw.write(str);
osw.flush();
osw.close();
These are technically exactly the same; text is being converted to bytes according to UTF-8 rules.
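On Java 7 and later, the same write can be done in one call with NIO (a sketch; "file path" is a placeholder as in the question):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

String str = "Русский язык ";
// encodes the string as UTF-8 and writes the resulting bytes
Files.write(Paths.get("file path"), str.getBytes(StandardCharsets.UTF_8));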
I've got a method where one of the input attributes is a String xml. I just want to add a check of that XML's encoding: if any character is in an encoding other than UTF-8, an error should be thrown.
Can you please tell me the easiest way to create and test this?
I've used something like this:
String xml = IOUtils.toString(new FileInputStream("c:/encoding.xml"));
Document doc = builder.parse(IOUtils.toInputStream(xml, "UTF-8"));
I added letters like Ľ,Š,Ť,Ž,ľ,š,ť,ž and saved it as a cp1250 file,
but got no error.
What am I doing wrong?
This cannot be done natively in Java. A file is just a string of bytes; they can be interpreted however you like, and Java by default has no way to attach meaning to them. I recommend using this library (no, I didn't write it):
http://code.google.com/p/juniversalchardet/
Follow these instructions (copy pasted from that link):
How to use it
1. Construct an instance of org.mozilla.universalchardet.UniversalDetector.
2. Feed some data (typically several thousand bytes) to the detector by calling UniversalDetector.handleData().
3. Notify the detector of the end of data by calling UniversalDetector.dataEnd().
4. Get the detected encoding name by calling UniversalDetector.getDetectedCharset().
5. Don't forget to call UniversalDetector.reset() before you reuse the detector instance.
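A minimal sketch of those steps, assuming the juniversalchardet jar is on the classpath and the data comes from the file in the question:

import java.io.FileInputStream;
import org.mozilla.universalchardet.UniversalDetector;

FileInputStream fis = new FileInputStream("c:/encoding.xml");
UniversalDetector detector = new UniversalDetector(null);

// feed chunks until the detector is confident or the file ends
byte[] buf = new byte[4096];
int nread;
while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
    detector.handleData(buf, 0, nread);
}
detector.dataEnd();

String encoding = detector.getDetectedCharset(); // null if nothing was detected
System.out.println("Detected encoding = " + encoding);
detector.reset();
fis.close();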
String xml = IOUtils.toString(new FileInputStream("c:/encoding.xml"));
If this IOUtils is org.apache.commons.io.IOUtils then its Javadoc says
"Get the contents of an InputStream as a String using the default character encoding of the platform."
As you are saving as cp1250, I guess cp1250 is also your platform character encoding. What your code does is:
1. Read the file as a byte stream
2. Convert the byte stream to chars using cp1250 (the platform encoding)
3. Transform the chars to Java's internal representation (UTF-16)
4. Convert from UTF-16 to UTF-8
5. Create the XML document
That will always work, as cp1250 really is your file's encoding, UTF-16 has every character in cp1250, and UTF-8 has every character in UTF-16.
If you want to read the bytes as UTF-8 and avoid automatic conversions, you should use one of the two-parameter variants of IOUtils.toString():
public static String toString(InputStream input, Charset encoding)
public static String toString(InputStream input, String encoding)
So I would try:
// Helper import: I always forget if the constant is "UTF8" or "UTF-8"
import org.apache.commons.lang.CharEncoding;
String xml = IOUtils.toString(new FileInputStream("c:/encoding.xml"), CharEncoding.UTF_8);
Document doc = builder.parse(IOUtils.toInputStream(xml, CharEncoding.UTF_8));
The rule of thumb here is: NEVER do any byte-to-string / string-to-byte conversion without specifying the source / destination encoding.
A minor rule of thumb would be: Unless you need to use some other encoding, use UTF-8 everywhere.
Both of those rules of thumb are independent of your programming language of choice.
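Note that decoding via String charset parameters quietly replaces malformed input rather than failing. If you want an error actually thrown on non-UTF-8 input, as the question asks, a sketch using a strict decoder (rawBytes is a placeholder for the input bytes):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
try {
    // throws CharacterCodingException if rawBytes is not valid UTF-8
    String xml = decoder.decode(ByteBuffer.wrap(rawBytes)).toString();
} catch (CharacterCodingException e) {
    throw new IllegalArgumentException("Input is not valid UTF-8", e);
}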
I am developing a Java application where I am consuming a web service. The web service is created using a SAP server, which encodes the data automatically in Unicode. I get a Unicode string from the web service.
"
倥䙄ㄭ㌮쿣ී㈊〠漠橢圯湩湁楳湅潣楤杮湥潤橢″‰扯൪㰊഼┊敄瑶灹佐呓′†䘠湯⁴佃剕䕉⁒渠牯慭慌杮䔠ൎ⼊祔数⼠潆瑮匯扵祴数⼠祔数റ⼊慂敳潆瑮⼠潃牵敩൲⼊慎敭⼠う䔯据摯湩′‰㸊ാ攊摮扯൪㐊〠漠橢㰼䰯湥瑧‵‰㸊ാ猊牴慥൭ 䘯〰‱⸱2
"
Above is the response.
I want to convert it to a readable text format, like a String. I am using core Java.
倥䙄ㄭ㌮쿣ී㈊〠漠橢圯湩湁楳湅潣楤杮湥潤橢″‰扯൪㰊഼┊敄瑶灹佐呓′†䘠湯⁴佃剕䕉⁒渠牯慭慌杮䔠ൎ⼊祔数⼠潆瑮匯扵祴数⼠祔数റ⼊慂敳潆瑮⼠潃牵敩൲⼊慎敭⼠う䔯据摯湩′‰㸊ാ攊摮扯൪㐊〠漠橢㰼䰯湥瑧‵‰㸊ാ猊牴慥൭ 䘯〰‱⸱2
That's a PDF file that has been interpreted as UTF-16LE.
You need to look at what component is receiving the response and how it's dealing with the input to stop it being decoded as UTF-16LE, but ultimately there isn't a 'readable' version of it as such, as it's a binary file. Extracting the document text out of a PDF file is a much bigger problem!
(Note: Unicode is a character set, UTF-16LE is an encoding of that set into bytes. Microsoft call the UTF-16LE encoding "Unicode" due to a historical accident, but that's misleading.)
If you have byte[] or an InputStream (both binary data) you can get a String or a Reader (both text) with:
final String encoding = "UTF-8"; // or "UTF-16LE", "UTF-16BE"
byte[] b = ...;
String s = new String(b, encoding);
InputStream is = ...;
BufferedReader reader = new BufferedReader(new InputStreamReader(is, encoding));
String line;
while ((line = reader.readLine()) != null) {
    // process the line
}
The reverse process uses:
byte[] b = s.getBytes(encoding);
OutputStream os = ...;
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(os, encoding));
writer.write(s);
Unicode is a numbering system for all characters. The UTF variants implement Unicode as bytes.
Your problem:
Normally (from a web service) you would already have received a String. You could write that string to a file using the Writer above, for instance, either to check it yourself with a full Unicode font or to pass the file on for a check.
You need (?) to check which UTF variant the text is in. For Asian scripts, UTF-16 (little endian or big endian) is optimal. In XML it would already be declared.
Addition:
FileWriter writes to a file using the default encoding (taken from the operating system on your machine). Instead, use:
new OutputStreamWriter(new FileOutputStream(new File("...")), "UTF-8")
If it is a binary PDF, as #bobince said, just use a FileOutputStream on a byte[] or an InputStream.
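For the binary case, a minimal sketch (assuming the raw response is already available as a byte[] named data; the file name is a placeholder):

import java.io.FileOutputStream;

// write the raw PDF bytes untouched; no character decoding involved
FileOutputStream fos = new FileOutputStream("response.pdf");
fos.write(data);
fos.close();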
This is definitely not a valid string. This looks like mangled UTF-16.
UPDATE
Indeed, #bobince is right: this is a PDF file (most probably in UTF-8 or plain ASCII) displayed as UTF-16. When displayed in UTF-8, this string indeed shows PDF source code. Good catch.
I have a string with UTF-8 characters in it, and I'm using StringEntity to put it into an HttpEntityEnclosingRequestBase and send it to a server.
My problem is that the UTF-8 characters are coded as \xF2, but the server would like \u00f2. How can I fix this? Or how can I easily convert a UTF-8 string into one that has \u00f2-style substrings instead of the raw UTF-8 chars?
Solution:
In the end, the solution was:
new StringEntity(string, "UTF-8");
Thanks in advance, David
You can convert between Java's internal character encoding (UTF-16) and UTF-8 byte sequences in a variety of ways. The simplest is:
byte[] utf8data = "my string".getBytes("UTF-8");
String myString = new String(utf8data, "UTF-8");
There are also stream-oriented classes that can translate between byte streams and character streams using an encoding. See the java.io package.
Just do
String str = ...;
str = str.replace("\\x", "\\u00");
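That only works if the text literally contains the characters \x. If you instead need to turn actual non-ASCII characters into \u00f2-style escape sequences, a sketch (escapeNonAscii is a made-up helper name):

// hypothetical helper: escapes every char above 0x7F as a \uXXXX sequence
static String escapeNonAscii(String s) {
    StringBuilder sb = new StringBuilder();
    for (char c : s.toCharArray()) {
        if (c > 0x7F) {
            sb.append(String.format("\\u%04x", (int) c));
        } else {
            sb.append(c);
        }
    }
    return sb.toString();
}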
ElasticSearch is a search server which accepts data only in UTF-8.
When I try to give ElasticSearch the following text through my Java application:
"Small businesses potentially in line for a lighter reporting load include those with an annual turnover of less than £440,000, net assets of less than £220,000 and fewer than ten employees"
(basically, my Java application takes this info from a webpage and gives it to ElasticSearch), ES complains it can't understand £ and fails. After filtering it through the code below:
byte bytes[] = s.getBytes("ISO-8859-1");
s = new String(bytes, "UTF-8");
Here £ is converted to �
But when I copy the same text into a file in my home directory using bash, it goes in fine. Any pointers will help.
You have ISO-8859-1 octets in bytes, which you then tell String to decode as if it were UTF-8. When it does that, it doesn't recognize the illegal 0xA3 byte and replaces it with the substitution character.
To do this, you have to construct the string with the encoding it uses, then convert it to the encoding that you want. See How do I convert between ISO-8859-1 and UTF-8 in Java?.
UTF-8 is easier than one thinks. In a String everything is Unicode characters.
Bytes/String conversion is done as follows.
(Note: Cp1252, or Windows-1252, is the Windows Latin-1 extension of ISO-8859-1; better to use that one.)
// reading Windows Latin-1 text
BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), "Cp1252"));
// writing UTF-8 text
PrintWriter out = new PrintWriter(
        new OutputStreamWriter(new FileOutputStream(file), "UTF-8"));
// declaring the response encoding in a servlet
response.setContentType("text/html; charset=UTF-8");
response.setCharacterEncoding("UTF-8");
// escaping in source code
String s = "20 \u00A3"; // 20 £
To see why Cp1252 is more suitable than ISO-8859-1:
http://en.wikipedia.org/wiki/Windows-1252
A String s is a series of characters that are basically independent of any character encoding (OK, not exactly independent, but close enough for our needs here). Whatever encoding your data was in when you loaded it into a String has already been decoded. The decoding was done either with the system default encoding (which is practically ALWAYS AN ERROR; never use the system default encoding, trust me, I have over 10 years of experience dealing with bugs caused by wrong default encodings) or with the encoding you explicitly specified when you loaded the data.
When you call getBytes("ISO-8859-1") for a String, you request that the String is encoded into bytes according to ISO-8859-1 encoding.
When you create a String from a byte array, you need to specify the encoding in which the characters in the byte array are represented. Here you create a string from a byte array as if it had been encoded in UTF-8 (but just above you encoded it in ISO-8859-1; that is your error).
What you want to do is:
byte bytes[] = s.getBytes("UTF-8");
s = new String(bytes, "UTF-8");