Reading a file with accented characters in Java

I came across two special characters which don't seem to be covered by the ISO-8859-1 character set, i.e. they don't make it through to my program:
The German ß
and the Norwegian ø
I'm reading the files as follows:
FileInputStream inputFile = new FileInputStream(corpus[i]);
InputStreamReader ir = new InputStreamReader(inputFile, "ISO-8859-1");
Is there a way for me to read these characters without having to apply manual replacement as a workaround?
[EDIT]
This is how it looks on screen. Note that I have no problems with other accents, e.g. è and the lot...

Both characters are present in ISO-Latin-1 (check my name to see why I've looked into this).
If the characters are not read in correctly, the most likely cause is that the text in the file is not saved in that encoding, but in something else.
Depending on your operating system and the origin of the file, possible encodings could be UTF-8 or a Windows code page like 850 or 437.
The easiest way to find out is to look at the file with a hex editor and report back exactly which byte values are saved for these two characters.
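If it helps, here is a minimal sketch (the file path is a placeholder) that prints the first bytes of the file in hex so you can see exactly which values are stored: ISO-8859-1 would store ß as DF and ø as F8, while UTF-8 stores them as C3 9F and C3 B8.
import java.io.FileInputStream;
import java.io.IOException;

public class HexDump {
    public static void main(String[] args) throws IOException {
        // Print the first 64 bytes of the file as hex values.
        try (FileInputStream in = new FileInputStream("corpus.txt")) { // placeholder path
            int b, count = 0;
            while ((b = in.read()) != -1 && count++ < 64) {
                System.out.printf("%02X ", b);
            }
            System.out.println();
        }
    }
}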

Assuming your file is actually UTF-8 encoded, try this:
InputStreamReader ir = new InputStreamReader(inputFile, "UTF-8");

ISO-8859-1 covers ß and ø, so the file is probably saved in a different encoding. You should pass the file's actual encoding to new InputStreamReader().
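For example, a sketch of a hypothetical helper that takes the charset explicitly (using a Charset constant rather than a string name, so there is no UnsupportedEncodingException to handle):
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical helper: open a reader with the file's actual encoding supplied by the caller.
static BufferedReader openReader(String path, Charset charset) throws IOException {
    return new BufferedReader(new InputStreamReader(new FileInputStream(path), charset));
}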

Related

Fix mixed encoding in string

I have a file which contains the following string:
AAdοbe Dοcument Clοud
if viewed in Notepad++. In the hex view, the ο characters appear as the bytes CE BF.
If I read the file with Java the string looks like this:
AAdοbe Dοcument Clοud
How I can get the same encoding in Java as with Notepad++?
Your file is encoded as UTF-8, and the CE BF bytes are the UTF-8 encoding of the character ο ('GREEK SMALL LETTER OMICRON' (U+03BF)).
If you use the Encoding pull-down menu in Notepad++ to specify UTF-8, you should see the content as:
AAdοbe Dοcument Clοud
You might want to replace those Greek ο's with regular Latin o's ('LATIN SMALL LETTER O' (U+006F)).
If you decide to keep the Greek ο's, you need to make sure your Java program reads the file using UTF-8, which is best done using one of these:
BufferedReader reader = Files.newBufferedReader(Paths.get("file.txt")); // UTF-8 is the default
BufferedReader reader = Files.newBufferedReader(Paths.get("file.txt"), StandardCharsets.UTF_8);
If you look at the text with a debugger, you should see that it is now read correctly. If you print the text, make sure the console window you're using can handle UTF-8 characters, otherwise it might just print wrong, even though it was read correctly.
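A short sketch combining both suggestions (read the file as UTF-8, then optionally normalize the Greek omicrons to Latin o's); the file name is a placeholder:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadUtf8 {
    public static void main(String[] args) throws IOException {
        // Read the whole file, decoding the bytes as UTF-8.
        String text = new String(Files.readAllBytes(Paths.get("file.txt")), StandardCharsets.UTF_8);
        // Optionally replace GREEK SMALL LETTER OMICRON (U+03BF) with LATIN SMALL LETTER O (U+006F).
        String normalized = text.replace('\u03BF', 'o');
        System.out.println(normalized);
    }
}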
You must set the encoding in the FileReader like this (this constructor takes a Charset and requires Java 11 or later):
new FileReader(fileName, StandardCharsets.UTF_8)
You must read the file in java using the same encoding as the file has.
If you are working with non-standard encodings, even trying to detect the encoding with something like:
InputStreamReader r = new InputStreamReader(new FileInputStream(theFile));
r.getEncoding()
can report the wrong value.
There's a small library which handles encoding detection a bit better: https://code.google.com/archive/p/juniversalchardet/
It also has some gaps in detecting the proper encoding, but I've used it.
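A sketch of the usual detection loop (assuming juniversalchardet is on the classpath; the file path is a placeholder), following the usage shown on the project page:
import java.io.FileInputStream;
import java.io.IOException;
import org.mozilla.universalchardet.UniversalDetector;

public class DetectEncoding {
    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[4096];
        UniversalDetector detector = new UniversalDetector(null);
        try (FileInputStream fis = new FileInputStream("file.txt")) { // placeholder path
            int nread;
            // Feed the detector until it is confident or the file ends.
            while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
        }
        detector.dataEnd();
        String encoding = detector.getDetectedCharset(); // may be null if nothing was recognized
        System.out.println(encoding != null ? encoding : "No encoding detected");
        detector.reset();
    }
}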
And while using it I found out that most non-standard encodings can be read with UTF-16, like:
new FileReader(fileName, StandardCharsets.UTF_16)
Java has supported the UTF-16 encoding for a long time; it's defined in the standard API as StandardCharsets.UTF_16. That charset covers lots of language-specific characters and emoji.

Java's inconsistent behavior when handling a UTF-8 string with a BOM

I open Windows Notepad, enter 18, and save the file with UTF-8 encoding. I know that my file will have a BOM header, i.e. it is a UTF-8 encoded file (with a BOM header).
The problem is that when I print that string with the code below:
//str is that string read from the file using StandardCharsets.UTF_8 encoding
System.out.println(str);
In windows I got:
?18
But in linux I got:
18
So why is the behavior of Java different? How should I understand it?
A BOM is a zero-width space, so invisible in principle.
However, the Windows console does not use UTF-8 but one of the many single-byte encodings. The conversion from String to output bytes will turn the BOM, which is missing from that charset, into a question mark.
Notepad, however, will recognize the BOM and display the UTF-8 text correctly.
Linux nowadays generally uses UTF-8, so it has no problems, even in the console.
Further explanation
On Windows, System.out uses the console, and that console uses, for instance, Cp850 as its charset/encoding, a single-byte charset of some 256 characters. Characters like ĉ or the BOM char may very well be missing from it. If a Java String contains such chars, they cannot be encoded to one of the 256 available chars, and hence they will be converted to a ?.
Using a CharsetEncoder:
String s = ...
CharsetEncoder encoder = Charset.defaultCharset().newEncoder();
if (!encoder.canEncode(s)) {
System.out.println("A problem");
}
Windows itself also generally runs on a single-byte encoding, like Cp1252; again 256 chars. However, editors may deal with several encodings, and if the font can represent the character (Unicode code point), then everything works.
The behavior of Java is the same on both platforms: FileInputStream does not handle the BOM.
On Windows, your file is file1, and its hex content is EF BB BF 31 38.
On Linux, your file is file2, and its hex content is 31 38.
When you read them, you get different strings.
I recommend you convert your BOM file to a file without a BOM using Notepad++.
Or you can use BOMInputStream (from Apache Commons IO), as sketched below.
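A sketch using BOMInputStream (assuming Apache Commons IO is on the classpath; the file name is a placeholder). By default it detects and skips a leading UTF-8 BOM, so both of your files would read as "18":
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.input.BOMInputStream;

public class ReadWithoutBom {
    public static void main(String[] args) throws IOException {
        // BOMInputStream removes the EF BB BF prefix if present and passes the rest through.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new BOMInputStream(new FileInputStream("file1.txt")), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine()); // prints 18 for the file with or without a BOM
        }
    }
}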

Determining ISO-8859-1 vs US-ASCII charset

I am trying to determine whether to use
PrintWriter pw = new PrintWriter(outputFilename, "ISO-8859-1");
or
PrintWriter pw = new PrintWriter(outputFilename, "US-ASCII");
I was reading All about character sets to determine the character set of an example file which I must create in the same encoding via Java code.
When my example file contains "European" letters (Norwegian: å ø æ), then the following command tells me the file encoding is "iso-8859-1"
file -bi example.txt
However, when I take a copy of the same example file and modify it to contain different data, without any Norwegian text (let's say, I replace "Bjørn" with "Bjorn"), then the same command tells me the file encoding is "us-ascii".
file -bi example-no-european-letters.txt
What does this mean? Is ISO-8859-1 in practice the same as US-ASCII if there are no "European" characters in it?
Should I just use the charset "ISO-8859-1" and everything will be OK?
If the file contains only 7-bit US-ASCII characters, it can be read as US-ASCII. That doesn't tell you anything about which charset was intended; it may be just a coincidence that there were no characters that would require a different encoding.
ISO-8859-1 (and -15) is a common European encoding, able to encode äöåéü and other characters, with the first 128 code points (0-127) being the same as in US-ASCII (as is often the case, for convenience reasons).
However, you can't just pick an encoding and assume that "everything will be OK". The very common UTF-8 encoding also contains US-ASCII as a subset, but it will encode, for example, äöå characters as two bytes instead of ISO-8859-1's one byte.
TL;DR: Don't assume things with encodings. Find out what was intended and use that. If you can't find it out, observe the data to try to figure out what is a correct charset to use (as you noted yourself, multiple encodings may work at least temporarily).
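A small sketch illustrating the point: for ASCII-only text the two charsets produce identical bytes, but as soon as a Norwegian character appears they differ (US-ASCII cannot represent ø and substitutes '?'):
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetComparison {
    public static void main(String[] args) {
        // The first 128 code points overlap, so ASCII-only text encodes identically.
        String ascii = "Bjorn";
        System.out.println(Arrays.equals(
                ascii.getBytes(StandardCharsets.US_ASCII),
                ascii.getBytes(StandardCharsets.ISO_8859_1))); // true

        // ø is F8 in ISO-8859-1 but unmappable in US-ASCII, where it becomes '?' (3F).
        String norwegian = "Bjørn";
        System.out.println(Arrays.toString(norwegian.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println(Arrays.toString(norwegian.getBytes(StandardCharsets.US_ASCII)));
    }
}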
It depends on which characters are used in the document. ASCII is a 7-bit charset and ISO-8859-1 is an 8-bit charset which supports some additional characters. If you are going to reproduce the document from an input stream, I mostly recommend the ISO-8859-1 charset; it will work for text files from editors like Notepad and MS Word.
If you are using other international characters, you need to check for a charset which supports those particular characters, like UTF-8.

String encoding doesn't output all characters

My client uses InputStreamReader/BufferedReader to fetch text from the Internet.
However, when I save the text to a *.txt file, the text shows extra weird special symbols like 'Â'.
I've tried converting the String to ASCII, but that messes up å, ä, ö, Ø, which I use.
I've tried food = food.replace("Â", ""); and indexOf();
but the String won't find it, even though it's there in a hex editor.
So, in summary: when I use text.setText() (on Android), the output looks fine with NO weird symbols, but when I save the text to *.txt I get about 4 'Â' characters. I do not want ASCII because I use other non-ASCII characters.
The 'Â' is displayed as whitespace on my Android device and in Notepad.
Thanks!
Have a great weekend!
EDIT:
Solved it by removing all non-breaking spaces (note the reassignment; Strings are immutable):
myString = myString.replaceAll("\\u00a0", " ");
You say that you are fetching like this:
in = new BufferedReader(new InputStreamReader(url.openStream(),"UTF-8"));
There is a fair chance that the stuff you are fetching is not encoded in UTF-8.
You need to call getContentType() on the HttpURLConnection object, and if it is non-null, extract the encoding and use it when you create the InputStreamReader. Only assume "UTF-8" if the response doesn't supply a content type with a valid encoding.
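A sketch of that approach (the URL is a placeholder, and the simple parameter parsing is only illustrative; a real Content-Type header may carry additional parameters):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FetchWithDeclaredCharset {
    public static void main(String[] args) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL("http://example.com/").openConnection();
        Charset charset = StandardCharsets.UTF_8; // fallback when no charset is declared
        String contentType = conn.getContentType(); // e.g. "text/html; charset=ISO-8859-1"
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                param = param.trim();
                if (param.toLowerCase().startsWith("charset=")) {
                    charset = Charset.forName(param.substring("charset=".length()));
                }
            }
        }
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), charset))) {
            System.out.println(in.readLine());
        }
    }
}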
On reflection, while you SHOULD pay attention to the content type returned by the server, the real problem is either in the way that you are writing the *.txt file, or in the display tool that is showing strange characters.
It is not clear what encoding you are using to write the file. Perhaps you have chosen the wrong one.
It is possible that the display tool is assuming that the file has a different encoding. Maybe it detects that a file is UTF-8 or UTF-16 if there is a BOM.
It is possible that the display tool is plain broken, and doesn't understand non-breaking spaces.
When you display files using a HEX editor, it is most likely using an 8-bit character set to render bytes, and that character set is most likely Latin-1. But apparently, the file is actually encoded differently.
Anyway, the approach of replacing non-breaking spaces is (IMO) a hack, and it won't deal with other stuff that you might encounter in the future. So I recommend that you take the time to really understand the problem, and fix it properly.
Finally, I think I understand why you might be getting 'Â' characters. A Unicode NON-BREAKING-SPACE character is U+00A0. When you encode that as UTF-8, you get C2 A0. But C2 in Latin-1 is CAPITAL-A-CIRCUMFLEX ('Â'), and A0 in Latin-1 is NON-BREAKING-SPACE. So the "confusion" is most likely that your program is writing the *.txt file in UTF-8 and the tool is reading it as Latin-1.
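A tiny sketch reproducing that confusion, encoding a non-breaking space as UTF-8 and then (mis)decoding the bytes as Latin-1:
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // NON-BREAKING SPACE (U+00A0) encodes to C2 A0 in UTF-8 ...
        byte[] utf8 = "\u00A0".getBytes(StandardCharsets.UTF_8);
        System.out.printf("%02X %02X%n", utf8[0], utf8[1]); // C2 A0
        // ... and decoding those two bytes as Latin-1 yields 'Â' followed by a non-breaking space.
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println((int) misread.charAt(0)); // 194, i.e. 'Â'
        System.out.println((int) misread.charAt(1)); // 160, i.e. the non-breaking space
    }
}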

Character encoding UTF and ISO-8859-1 in CSV [duplicate]

Possible Duplicate:
How to add a UTF-8 BOM in java
My Oracle database has a character set of UTF8.
I have a Java stored procedure which fetches record from the table and creates a csv file.
BLOB retBLOB = BLOB.createTemporary(conn, true, BLOB.DURATION_SESSION);
retBLOB.open(BLOB.MODE_READWRITE);
OutputStream bOut = retBLOB.setBinaryStream(0L);
ZipOutputStream zipOut = new ZipOutputStream(bOut);
PrintStream out = new PrintStream(zipOut,false,"UTF-8");
The German characters (fetched from the table) become gibberish in the CSV if I use the above code. But if I change the encoding to ISO-8859-1, then I can see the German characters properly in the CSV file.
PrintStream out = new PrintStream(zipOut,false,"ISO-8859-1");
I have read in some posts that we should use UTF-8, as it is safe and will also encode other languages (Chinese etc.) properly, which ISO-8859-1 will fail to do.
Please suggest which encoding I should use. (There is a strong chance that we might have Chinese/Japanese words stored in the table in the future.)
You're currently only talking about one part of a process that is inherently two-sided.
Encoding something to bytes is only really relevant in the sense that some other process comes along and decodes it back into text at some later point. And of course, both processes need to use the same character set else the decode will fail.
So it sounds to me like the process that takes the BLOB out of the database and into the CSV file is assuming that the bytes are an ISO-8859-1 encoding of text. Hence if you store them as UTF-8, the decoding messes up (though the basic ASCII characters have the same byte representation in both, which is why they still decode correctly).
UTF-8 is a good character set to use in almost all circumstances, but it's not magic enough to overcome the immutable law that the same character set must be used for decoding as was used for encoding. So you can either change your CSV creator to decode with UTF-8, or you'll have to continue encoding with ISO-8859-1.
I suppose your BLOB data is ISO-8859-1 encoded. As it's stored as binary and not as text, its encoding does not depend on the database's encoding. You should check whether the BLOB was originally written in UTF-8 encoding and, if not, do so.
I think the problem is that Excel cannot figure out that the CSV is UTF-8 encoded.
utf-8 csv issue
But I'm still not able to resolve the issue even if I put a BOM on the PrintStream:
PrintStream out = new PrintStream(zipOut,false,"UTF-8");
out.write('\ufeff');
I also tried:
out.write(new byte[] { (byte)0xEF, (byte)0xBB, (byte)0xBF });
but to no avail.
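One thing worth noting about the two attempts: PrintStream.write(int) writes a single raw byte (only the low 8 bits of the argument), so out.write('\ufeff') does not emit a BOM at all, whereas print(char) runs the character through the stream's charset encoder. A sketch of writing the BOM through the encoder (the zip entry name is a placeholder; whether Excel then opens the CSV correctly still depends on the Excel version):
import java.io.IOException;
import java.io.PrintStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Sketch: zipOut is the ZipOutputStream from the question.
static PrintStream openCsvStream(ZipOutputStream zipOut) throws IOException {
    zipOut.putNextEntry(new ZipEntry("export.csv")); // placeholder entry name
    PrintStream out = new PrintStream(zipOut, false, "UTF-8");
    out.print('\uFEFF'); // print(), not write(): the BOM character is encoded as EF BB BF
    return out;
}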
