Reading file with bad encoding. CP1252 vs UTF-8

I have a byte array, which I wrap in an InputStreamReader and then process.
Reader reader = new InputStreamReader(new ByteArrayInputStream(byteArr));
The JVM's default encoding is cp1252, but the file I read into the byte array is UTF-8 encoded and contains German umlauts. When I pass the byte array to InputStreamReader, Java decodes the umlauts to the wrong characters; for example, ü shows up as Ã¼. I tried passing "UTF-8" and Charset.forName("UTF-8").newDecoder() to the InputStreamReader constructor, and I tried re-encoding the strings read from the reader with new String(oldStr.getBytes("cp1252"), "UTF-8"), but neither helped. In the debugger, the reader variable has a StreamDecoder field whose "decoder" is an MS1252$Decoder. That may be the key to my problem, but I don't understand how to fix it.

Try the InputStreamReader(InputStream in, String charsetName) constructor and specify the charset yourself.
Reader reader = new InputStreamReader(new ByteArrayInputStream(byteArr), "UTF-8");
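A minimal sketch of the difference the second constructor argument makes, assuming the bytes really are UTF-8. The class and method names (ExplicitCharsetSketch, decode) are made up for illustration, and the sketch uses the Charset overload of the constructor rather than the String one:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetSketch {
    // Reads every character from the byte array, decoding with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "ü" encoded as UTF-8 is the two bytes 0xC3 0xBC.
        byte[] byteArr = {(byte) 0xC3, (byte) 0xBC};

        // Decoded as UTF-8, the two bytes become the single character ü (U+00FC).
        System.out.println(decode(byteArr, StandardCharsets.UTF_8).equals("\u00FC")); // true

        // Decoded as windows-1252, each byte becomes its own character: Ã (U+00C3) and ¼ (U+00BC).
        System.out.println(decode(byteArr, Charset.forName("windows-1252")).equals("\u00C3\u00BC")); // true
    }
}
```

If the explicit charset still seems to have no effect, it is worth checking that the bytes were not already damaged before they reached the byte array.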

I had exactly the same error and finally solved it by adding this to the JVM startup options:
-Dfile.encoding=UTF8
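Before relying on that flag, it can help to check which default the JVM is actually using. A small sketch (the class name is made up for illustration); note that from JDK 18 onward the default charset is UTF-8 unless overridden, so the flag mostly matters on older JVMs:

```java
import java.nio.charset.Charset;

class DefaultCharsetCheck {
    // Name of the charset used whenever no charset is passed to readers, writers, getBytes(), etc.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + defaultCharsetName());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```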

Related

Reading any text file having strange encoding?

I have a text file with a strange encoding, "UCS-2 Little Endian", and I want to read its contents using Java.
The file contents appear fine in Notepad++, but when I read the file with this code, only garbage is printed to the console:
String textFilePath = "c:\\strange_file_encoding.txt";
BufferedReader reader = new BufferedReader( new InputStreamReader( new FileInputStream( textFilePath ), "UTF8" ) );
String line = "";
while ( ( line = reader.readLine() ) != null ) {
    System.out.println( line ); // Prints garbage characters
}
The main point is that the user selects the file to read, so it can be in any encoding. Since I can't detect the file encoding, I decode it as "UTF8", but as the example above shows, that fails to read the file correctly.
Is there a way to read such files correctly? Or can I at least detect that my code will fail to read a file correctly?

You are passing UTF-8 as the encoding to the InputStreamReader constructor, so it tries to interpret the bytes as UTF-8 instead of UCS-2 LE. Here is the documentation: Charset
According to it, I suppose you need to use UTF-16LE.
Here is more info on the supported character sets and their Java names:
Supported Encodings

You're providing the wrong encoding to InputStreamReader. Have you tried using UTF-16LE instead of UTF8?
BufferedReader reader = new BufferedReader( new InputStreamReader( new FileInputStream( filePath ), "UTF-16LE" ) );
According to Charset:
UTF-16LE Sixteen-bit UCS Transformation Format, little-endian byte order
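One detail worth checking, sketched below under the assumption that the file starts with a byte-order mark (editors commonly write one for UTF-16 files, but verify this for your file): Java's UTF-16LE decoder keeps the BOM as a leading U+FEFF character, whereas the plain UTF-16 charset consumes the BOM and uses it to pick the byte order. The class and method names are illustrative.

```java
import java.nio.charset.StandardCharsets;

class Utf16BomSketch {
    // Decodes as UTF-16LE; a leading BOM stays in the result as U+FEFF.
    static String decodeLe(byte[] data) {
        return new String(data, StandardCharsets.UTF_16LE);
    }

    // Decodes as UTF-16; a leading BOM selects the byte order and is dropped.
    static String decodeAuto(byte[] data) {
        return new String(data, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // BOM (FF FE) followed by "hi" in little-endian UTF-16.
        byte[] data = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0};
        System.out.println(decodeLe(data).length());   // 3 (U+FEFF, 'h', 'i')
        System.out.println(decodeAuto(data).length()); // 2 ('h', 'i')
    }
}
```

Without a BOM, the plain UTF-16 charset falls back to big-endian when decoding, so UTF-16LE remains the right choice for BOM-less little-endian data.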

You cannot use UTF-8 for all files, especially if you do not know which file encoding to expect. Use a library that can detect the file encoding before you read the file, for example juniversalchardet or jChardet.
For more info, see Java : How to determine the correct charset encoding of a stream
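For the narrower question of whether decoding as UTF-8 will fail, a standard-library check is possible. The sketch below is not how the detection libraries above work; it only asks a strict UTF-8 decoder to report malformed input instead of silently replacing it. The class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8ValidationSketch {
    // Returns true only if the whole byte array is well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xC3, (byte) 0xBC};                 // "ü" in UTF-8
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0}; // BOM + "hi" in UTF-16LE
        System.out.println(isValidUtf8(utf8));    // true
        System.out.println(isValidUtf8(utf16le)); // false
    }
}
```

A passing check does not prove the file is UTF-8: pure ASCII text, and even BOM-less UTF-16 of ASCII characters, also decodes without error. It only catches input that cannot be UTF-8.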

Java Unicode to readable text conversion decoding

I am developing a Java application where I am consuming a web service. The web service is created using a SAP server, which encodes the data automatically in Unicode. I get a Unicode string from the web service.
"
倥䙄ㄭ㌮਍쿣ී㈊〠漠橢਍圯湩湁楳湅潣楤杮਍湥潤橢਍″‰扯൪㰊഼┊敄瑶灹⁥佐呓′†䘠湯⁴佃剕䕉⁒渠牯慭⁬慌杮䔠ൎ⼊祔数⼠潆瑮਍匯扵祴数⼠祔数റ⼊慂敳潆瑮⼠潃牵敩൲⼊慎敭⼠う㄰਍䔯据摯湩⁧′‰൒㸊ാ攊摮扯൪㐊〠漠橢਍㰼਍䰯湥瑧⁨‵‰൒㸊ാ猊牴慥൭ 䘯〰‱⸱2
"
The above is the response.
I want to convert it to readable text, such as a String. I am using core Java.

倥䙄ㄭ㌮਍쿣ී㈊〠漠橢਍圯湩湁楳湅潣楤杮਍湥潤橢਍″‰扯൪㰊഼┊敄瑶灹⁥佐呓′†䘠湯⁴佃剕䕉⁒渠牯慭⁬慌杮䔠ൎ⼊祔数⼠潆瑮਍匯扵祴数⼠祔数റ⼊慂敳潆瑮⼠潃牵敩൲⼊慎敭⼠う㄰਍䔯据摯湩⁧′‰൒㸊ാ攊摮扯൪㐊〠漠橢਍㰼਍䰯湥瑧⁨‵‰൒㸊ാ猊牴慥൭ 䘯〰‱⸱2
That's a PDF file that has been interpreted as UTF-16LE.
You need to look at what component is receiving the response and how it's dealing with the input to stop it being decoded as UTF-16LE, but ultimately there isn't a 'readable' version of it as such, as it's a binary file. Extracting the document text out of a PDF file is a much bigger problem!
(Note: Unicode is a character set, UTF-16LE is an encoding of that set into bytes. Microsoft call the UTF-16LE encoding "Unicode" due to a historical accident, but that's misleading.)
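The diagnosis can be reproduced in a few lines. The sketch below (class and method names are illustrative) takes the ASCII bytes of a PDF header, misreads them as UTF-16LE, and gets the same four characters the response starts with; re-encoding those characters as UTF-16LE gives the original bytes back.

```java
import java.nio.charset.StandardCharsets;

class PdfMojibakeSketch {
    // Pairs up the raw bytes into little-endian 16-bit code units, as the faulty component did.
    static String misreadAsUtf16Le(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16LE);
    }

    // Undoes the misreading by turning each code unit back into its two bytes.
    static String recoverAscii(String mojibake) {
        byte[] raw = mojibake.getBytes(StandardCharsets.UTF_16LE);
        return new String(raw, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] header = "%PDF-1.3".getBytes(StandardCharsets.US_ASCII);
        String garbled = misreadAsUtf16Le(header);
        // '%' (0x25) and 'P' (0x50) combine into U+5025, and so on for each byte pair.
        System.out.println(garbled.equals("\u5025\u4644\u312D\u332E")); // true
        System.out.println(recoverAscii(garbled)); // %PDF-1.3
    }
}
```

This round trip is only safe for byte pairs that form valid UTF-16 code units. A real PDF contains binary streams where the misreading loses data, so the reliable fix is to handle the response as bytes from the start.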

If you have a byte[] or an InputStream (both binary data), you can get a String or a Reader (both text) with:
final String encoding = "UTF-8"; // or "UTF-16LE" or "UTF-16BE"
byte[] b = ...;
String s = new String(b, encoding);
InputStream is = ...;
BufferedReader reader = new BufferedReader(new InputStreamReader(is, encoding));
String line;
while ((line = reader.readLine()) != null) {
    // process line
}
The reverse process uses:
byte[] b = s.getBytes(encoding);
OutputStream os = ...;
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(os, encoding));
writer.write(s);
writer.newLine();
Unicode is a numbering system for all characters. The UTF variants encode Unicode as bytes.
Your problem:
Normally, with a web service, you would already have received a String. You could write that string to a file using the Writer above, for instance, either to check it yourself with a full Unicode font or to pass the file on for a check.
You need (?) to check which UTF variant the text is in. For Asian scripts, UTF-16 (little endian or big endian) is usually more compact than UTF-8. In XML the encoding would already be declared.
Addition:
FileWriter writes to a file using the default encoding (taken from the operating system on your machine). Instead use:
new OutputStreamWriter(new FileOutputStream(new File("...")), "UTF-8")
If it is a binary PDF, as @bobince said, just use a FileOutputStream on the byte[] or InputStream.

This is definitely not a valid string. It looks like mangled UTF-16.
UPDATE
Indeed, @bobince is right: this is a PDF file (most probably UTF-8 or plain ASCII) displayed as UTF-16. When displayed as UTF-8, the string does show PDF source code. Good catch.

UTF8 convertion for text obtained from internet

ElasticSearch is a search server that accepts data only in UTF-8.
When I try to give ElasticSearch the following text
"Small businesses potentially in line for a lighter reporting load include those with an annual turnover of less than £440,000, net assets of less than £220,000 and fewer than ten employees"
through my Java application (which takes this text from a web page and passes it to ElasticSearch), ES complains that it can't understand £ and fails. After filtering the text through the code below,
byte bytes[] = s.getBytes("ISO-8859-1");
s = new String(bytes, "UTF-8");
£ is converted to �.
But when I copy the text to a file in my home directory using bash, it goes in fine. Any pointers would help.

You have ISO-8859-1 octets in bytes, which you then tell String to decode as if they were UTF-8. When it does that, it finds that the lone 0xA3 byte is not a legal UTF-8 sequence and replaces it with the substitution character.
To convert correctly, you have to construct the string with the encoding the data actually uses, then encode it to the encoding that you want. See How do I convert between ISO-8859-1 and UTF-8 in Java?.
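A minimal sketch of what happens to the pound sign, with illustrative class and method names. It reproduces the failing round trip from the question and then shows the bytes a correct UTF-8 encoding produces:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PoundSignSketch {
    // The question's filter: encode as ISO-8859-1, then decode those bytes as UTF-8.
    static String brokenRoundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // What a UTF-8 consumer expects: the String encoded directly as UTF-8.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String pound = "\u00A3"; // the pound sign

        // ISO-8859-1 yields the single byte 0xA3, which is malformed as UTF-8,
        // so the decoder substitutes U+FFFD.
        System.out.println(brokenRoundTrip(pound).equals("\uFFFD")); // true

        // UTF-8 encodes the pound sign as the two bytes 0xC2 0xA3.
        System.out.println(Arrays.toString(toUtf8(pound))); // [-62, -93]
    }
}
```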

UTF-8 is easier than one might think. Inside a String, everything is Unicode characters.
Bytes/string conversion is done as follows.
(Note: Cp1252, or Windows-1252, is the Windows Latin-1 extension of ISO-8859-1; better to use that one.)
BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), "Cp1252"));
PrintWriter out = new PrintWriter(
        new OutputStreamWriter(new FileOutputStream(file), "UTF-8"));
response.setContentType("text/html; charset=UTF-8");
response.setCharacterEncoding("UTF-8");
String s = "20 \u00A3"; // Escaping
To see why Cp1252 is more suitable than ISO-8859-1:
http://en.wikipedia.org/wiki/Windows-1252

String s is a series of characters that are basically independent of any character encoding (not exactly independent, but close enough for our needs here). Whatever encoding your data was in when you loaded it into a String, it has already been decoded. The decoding was done either with the system default encoding (which is practically ALWAYS AN ERROR; do not ever use the system default encoding, trust me, I have over 10 years of experience dealing with bugs caused by wrong default encodings) or with the encoding you explicitly specified when you loaded the data.
When you call getBytes("ISO-8859-1") on a String, you request that the String be encoded into bytes according to ISO-8859-1.
When you create a String from a byte array, you need to specify the encoding in which the characters in the byte array are represented. In your code you tell the String constructor that the byte array is UTF-8 encoded, but on the line just above you encoded it as ISO-8859-1; that mismatch is your error.
What you want to do is:
byte bytes[] = s.getBytes("UTF-8");
s = new String(bytes, "UTF-8");

How read Japanese fields from CSV file into java beans?

I've tried several popular CSV to java deserializers - OpenCSV, JSefa, and Smooks - none correctly read the file:
First Name,Last Name
エリック,山中
花子,鈴木
一郎,鈴木
裕子,田中
政治,山村
into my java object collection.
OpenCsv code:
HeaderColumnNameTranslateMappingStrategy<Contact> strat = new HeaderColumnNameTranslateMappingStrategy<Contact>();
strat.setType(Contact.class);
strat.setColumnMapping(colNameTranslateMap);
InputStreamReader fileReader=null;
CsvToBean<Contact> csv = new CsvToBean<Contact>();
fileReader = new InputStreamReader(new FileInputStream(file), "UTF-8");
contacts = csv.parse(strat, new CSVReader(fileReader));
I've tried setting the Charset to UTF-8, UTF-16 and ISO-8859-1 when I create the FileInputStream, but the collection is never populated properly. As seen in the debugger and System.out the fields contain garbage and often the number of records is wrong.

FileInputStream is for reading streams of binary data, like an mp3 or PNG. Instead of a FIS, use a FileReader for reading streams of characters.
To be blunt: who cares which charsets you tried if they didn't work? You need to figure out which encoding the CSV file actually uses, and set that encoding when reading the file. On specifying the encoding, the FileReader documentation says:
"The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream."
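A sketch of that advice, assuming for the sake of the example that the CSV file is UTF-8 (the real file's encoding still has to be determined). The class and method names are made up; the sample row is one of the rows from the question, written with Unicode escapes so the source file's own encoding does not matter:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

class CsvCharsetSketch {
    // Reads all lines of the file, decoding its bytes with the given charset.
    static List<String> readLines(File file, Charset charset) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Write a small UTF-8 CSV file to read back.
        File file = File.createTempFile("contacts", ".csv");
        file.deleteOnExit();
        String csv = "First Name,Last Name\n\u82B1\u5B50,\u9234\u6728\n";
        Files.write(file.toPath(), csv.getBytes(StandardCharsets.UTF_8));

        List<String> lines = readLines(file, StandardCharsets.UTF_8);
        System.out.println(lines.size()); // 2
        System.out.println(lines.get(1).equals("\u82B1\u5B50,\u9234\u6728")); // true
    }
}
```

The same Reader can be handed to a CSV library in place of one built on the default encoding.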

BufferedReader returns ISO-8859-15 String - how to convert to UTF16 String?

I have an FTP client class which returns an InputStream pointing to the file. I would like to read the file row by row with a BufferedReader. The issue is that the client returns the file in binary mode, and the file is encoded in ISO-8859-15.

If the file/stream/whatever really contains ISO-8859-15 encoded text, you just need to specify that when you create the InputStreamReader:
BufferedReader br = new BufferedReader(
new InputStreamReader(ftp.getInputStream(), "ISO-8859-15"));
Then readLine() will create valid Strings in Java's native encoding (which is UTF-16, not UTF-8).
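A self-contained sketch of the same idea, with a ByteArrayInputStream standing in for the FTP stream (an assumption made so the example runs without a server; the class and method names are illustrative). The byte 0xA4 is the euro sign in ISO-8859-15, which makes a wrong decoding easy to spot:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin9Sketch {
    // Reads the first line of the stream, decoding the bytes as ISO-8859-15.
    static String readFirstLine(InputStream in) {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, Charset.forName("ISO-8859-15")))) {
            return br.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "5 ", then 0xA4 (euro sign in ISO-8859-15), then a newline.
        byte[] raw = {'5', ' ', (byte) 0xA4, '\n'};
        String row = readFirstLine(new ByteArrayInputStream(raw));
        System.out.println(row.equals("5 \u20AC")); // true

        // If UTF-16 bytes are needed afterwards, encode the String explicitly.
        System.out.println(row.getBytes(StandardCharsets.UTF_16LE).length); // 6
    }
}
```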

Try this:
BufferedReader br = new BufferedReader(
new InputStreamReader(
ftp.getInputStream(),
Charset.forName("ISO-8859-15")
)
);
String row = br.readLine();

The original string is in ISO-8859-15, so the byte stream read by your InputStreamReader will be in this encoding. So read in using that encoding (specify this in the InputStreamReader constructor). That tells the InputStreamReader that the incoming byte stream is in ISO-8859-15 and to perform the appropriate byte-to-character conversions.
Now it will be in the standard Java UTF-16 format, and you can then do what you wish.
I think the current problem is that you're reading the stream using your default encoding (by not specifying an encoding in InputStreamReader) and then trying to convert it, by which time it's too late.
Relying on default behaviour with this sort of class often ends in grief. It's a good idea to specify encodings wherever you can, and/or set the default VM encoding via -Dfile.encoding.

Have you tried:
BufferedReader r = new BufferedReader(new InputStreamReader(ftp.getInputStream(), "ISO-8859-1"));
...
