Java's UTF-8 encoding

I have this code:
BufferedWriter w = Files.newWriter(file, Charsets.UTF_8);
w.newLine();
StringBuilder sb = new StringBuilder();
sb.append("\"").append("éééé").append("\";")
w.write(sb.toString());
But it doesn't work. In the end my file isn't UTF-8 encoded. I tried this when writing:
w.write(new String(sb.toString().getBytes(Charsets.US_ASCII), "UTF8"));
It made question marks appear everywhere in the file...
I found that there was a bug regarding the recognition of the initial BOM character (http://bugs.java.com/view_bug.do?bug_id=4508058), so I tried using the BOMInputStream class. But bomIn.hasBOM() always returns false, so I guess my problem is not BOM related?
Do you know how I can make my file encoded in UTF-8? Was the problem solved in Java 8?

You're writing UTF-8 correctly in your first example (although in your second attempt you're redundantly creating a String from a String).
The problem is that the viewer or tool you're using to view the file doesn't read the file as UTF-8.
Don't mix in US-ASCII: encoding the text to ASCII first just turns every non-ASCII character into a question mark, which is exactly what you saw.
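One way to convince yourself of that is to dump the raw bytes of the file and inspect them by hand: every é should appear as the two-byte UTF-8 sequence C3 A9 rather than a single ANSI byte E9. A minimal sketch, assuming Java 7+ and the same file variable as above (close the BufferedWriter first so the buffered output actually reaches the disk; the fully qualified name avoids the clash with Guava's Files class):
w.close();
byte[] raw = java.nio.file.Files.readAllBytes(file.toPath());
StringBuilder hex = new StringBuilder();
for (byte b : raw) {
    hex.append(String.format("%02X ", b & 0xFF)); // each byte as two hex digits
}
System.out.println(hex);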

Related

Reading and Writing Text files with UTF-16LE encoding and Apache Commons IO

I have written an application in Java and duplicated it in C#. The application reads and writes text files with tab-delimited data to be used by HMI software. The HMI software requires UTF or ANSI encoding for the degree symbol to be displayed correctly, or I would just use ASCII, which seems to work fine. The C# application can open files saved by either with no problem. The Java application reads files it saved itself perfectly, but there is a small problem that crops up when reading the files saved with C#: it throws a NumberFormatException when parsing the first character in the file to an int. This character is always a "1". I have opened both files in EditPad Lite and they appear to be identical, even when viewed with the encoding shown, and the encoding is UTF-16LE. I'm racking my brain on this; any help would be appreciated.
lines = FileUtils.readLines(file, "UTF-16LE");
Integer.parseInt(line[0])
I cannot see any difference between the file saved in C# and the one saved in Java.
[Screenshot of the data in EditPad Lite]
// Workaround: if the first field is one character too long, drop the leading (BOM) character
if (lines.get(0).split("\\t")[0].length() == 2) {
    lines.set(0, lines.get(0).substring(1));
}
Your .NET code is probably writing a BOM. Compliant Unicode readers strip off any BOM, since it is metadata, not part of the text data.
Your Java code explicitly specifies the byte order:
FileUtils.readLines(file, "UTF-16LE");
It's somewhat of a Catch-22: if the source has a BOM, you can read it as "UTF-16" and the byte order is detected for you. If it doesn't, you can read it as "UTF-16LE" or "UTF-16BE", provided you know which one it is.
So, either write it with a BOM and read it without specifying the byte order, or write it without a BOM and read it specifying the byte order.
With a BOM:
[C#]
File.WriteAllLines(file, lines, Encoding.Unicode);
[Java]
FileUtils.readLines(file, "UTF-16");
Without a BOM:
[C#]
File.WriteAllLines(file, lines, new UnicodeEncoding(false, false));
[Java]
FileUtils.readLines(file, "UTF-16LE");
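If the writer might or might not emit a BOM and you still want to read with an explicit "UTF-16LE", a defensive sketch (not from the original answer) is to strip a leading U+FEFF yourself after reading:
List<String> lines = FileUtils.readLines(file, "UTF-16LE");
String first = lines.get(0);
// a BOM decoded as text shows up as the character U+FEFF at the start of the first line
if (!first.isEmpty() && first.charAt(0) == '\uFEFF') {
    lines.set(0, first.substring(1));
}
This is essentially what the length() == 2 workaround in the question does, just testing for the BOM character explicitly.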
In my Java code I read the file normally; I just specified the character encoding in the InputStreamReader:
File file = new File(fileName);
InputStreamReader fis = new InputStreamReader(new FileInputStream(file), "UTF-16LE");
BufferedReader br = new BufferedReader(fis);
String line = br.readLine();

Does Java read 0xA0 as 0xFFFD?

One of my data processing modules crashed while reading ANSI input. Looking at the string in question using a hex viewer, there was a mysterious 0xA0 byte at the end of it.
Turns out this is
Unicode Character 'NO-BREAK SPACE' (U+00A0).
I tried replacing that:
s = s.replace("\u00A0", "");
But it didn't work.
I then printed out that character's value using charAt, and Java reports
65533
or 0xFFFD
(Unicode Character 'REPLACEMENT CHARACTER' (U+FFFD))
Plugging that into the replace code, I finally got rid of it!
But why do I see an 0xA0 in the file, but Java reads it as 0xFFFD?
BufferedReader r = new BufferedReader(new InputStreamReader(new FileInputStream(path), "UTF-8"));
String line = r.readLine();
while (line != null) {
    // do stuff
    line = r.readLine();
}
U+FFFD is the "Unicode replacement character", which is generally used to represent "some binary data which couldn't be decoded correctly in the encoding you were using". (Sometimes ? is used for this instead, but U+FFFD is generally a better idea, as it's unambiguous.)
Its presence is usually a sign that you've tried to use the wrong encoding: here you're reading the file as UTF-8, but the data was written in an "ANSI" encoding, where 0xA0 is a valid byte but not a valid UTF-8 sequence on its own. Be aware that "ANSI" isn't a single encoding - there are lots of encodings known as ANSI code pages (windows-1252 is the usual one on Western Windows), and you'll need to pick the right one for your file.
How did you open the file?
If you use InputStreamReader(InputStream, Charset) you can specify the 'true' charset of the file you want to open. If you do not specify the charset yourself, Java uses the default charset of your platform. On Unix this is often UTF-8, while on Windows it is often a windows-125x/ISO-8859 variant.
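To see the mismatch concretely, here is a small sketch (not from the original answers) that decodes the single byte 0xA0 with two different charsets:
byte[] data = { (byte) 0xA0 };
// in windows-1252, a typical "ANSI" code page, 0xA0 really is NO-BREAK SPACE...
String ansi = new String(data, java.nio.charset.Charset.forName("windows-1252"));
// ...but a lone 0xA0 is not a valid UTF-8 sequence, so the decoder substitutes U+FFFD
String utf8 = new String(data, java.nio.charset.Charset.forName("UTF-8"));
System.out.printf("windows-1252 -> U+%04X, UTF-8 -> U+%04X%n", (int) ansi.charAt(0), (int) utf8.charAt(0));
This prints U+00A0 and U+FFFD respectively, which is exactly the substitution described above.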

how to read utf-8 chars in opencsv

I am trying to read from a CSV file. The file contains UTF-8 characters. So, based on "Parse CSV file containing a Unicode character using OpenCSV" and "How read Japanese fields from CSV file into java beans?", I just wrote
CSVReader reader = new CSVReader(new InputStreamReader(new FileInputStream("data.csv"), "UTF-8"), ';');
But it does not work. The >>Sí, es nuevo<< text is displayed correctly in Notepad, Excel and various other text editing tools, but when I parse the file via opencsv I'm getting >>S�, es nuevo<< (the í is a special character, if you were wondering).
What am I doing wrong?
You can use UTF-16LE as the encoding; I used it to write a file containing Japanese text.
Thanks aioobe. It turned out the file was not really UTF-8, despite most Windows programs showing it as such. Notepad++ was the only one that did not show the file as UTF-8 encoded, and after converting the data file the code above works.
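If you want to verify from Java whether a file really is valid UTF-8 instead of trusting an editor, one sketch (assuming Java 7+ and that the file fits in memory; exception handling for the read is omitted) is to decode it with a CharsetDecoder that reports malformed input instead of silently replacing it:
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Paths;

byte[] bytes = Files.readAllBytes(Paths.get("data.csv"));
try {
    Charset.forName("UTF-8").newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)        // fail instead of producing U+FFFD
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes));
    System.out.println("data.csv decodes cleanly as UTF-8");
} catch (CharacterCodingException e) {
    System.out.println("data.csv is NOT valid UTF-8: " + e);
}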
Use the code below for your issue; it might be helpful to you...
String value = URLEncoder.encode(msg[no], "UTF-8");
Use ISO-8859-1 or ISO-8859-14 or ISO-8859-15 or ISO-8859-10 or ISO-8859-13 or ISO-8859-2 instead of UTF-8.

Base64 InputStream to String

I have been trying to read a file from an input stream; the file is plain text with some images and other files embedded in Base64, and I want to write it back into a String while keeping the encoding. I mean, I want to have in the String something like:
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAoHBwgHBgoICAgLCgoLDhgQDg0NDh0VFhEYIx8lJCIf
IiEmKzcvJik0KSEiMEExNDk7Pj4+JS5ESUM8SDc9Pjv/2wBDAQoLCw4NDhwQEBw7KCIoOzs7Ozs7
I have been trying with classes like Base64InputStream from org.apache.commons.codec, but I just cannot figure it out. Any kind of help would be really appreciated. Thanks in advance!
Edit
Piece of code using a reader:
BufferedReader br = new BufferedReader(new InputStreamReader(bodyPart.getInputStream()));
StringBuilder sb = new StringBuilder();
String line;
while ((line = br.readLine()) != null) {
    sb.append(line);
}
br.close();
Getting as a result something like: .DIC;ÿÛC;("(;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;ÿÀ##"ÿÄ
Have you tried doing this:
final byte[] bytes64bytes = Base64.encodeBase64(IOUtils.toByteArray(is));
final String content = new String(bytes64bytes);
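For a MIME part (e.g. JavaMail), a fuller sketch of that approach with Commons IO and Commons Codec 1.4+, assuming bodyPart.getInputStream() already hands you the decoded binary content:
import org.apache.commons.codec.binary.Base64;
import org.apache.commons.io.IOUtils;
import java.io.IOException;
import java.io.InputStream;

// hypothetical helper: turn a part's decoded bytes back into a Base64 string
static String toBase64String(InputStream is) throws IOException {
    byte[] raw = IOUtils.toByteArray(is);   // read the decoded binary content fully
    return Base64.encodeBase64String(raw);  // re-encode it as Base64 text
}
Appending toBase64String(bodyPart.getInputStream()) to your StringBuilder gives you the /9j/4AAQ... style content shown above instead of the raw binary garbage.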
A text file containing some base64 data can be read with the charset of the rest of the file.
Base64 encoding is a means of encoding bytes using a limited set of characters that are represented identically in almost all character encodings, for example ASCII or UTF-8.
Base64 isn't a charset encoding; you don't have to specify that you have some Base64-encoded data when reading a file into a string.
So if your text file is generally UTF-8 (which is probable), you can read it without problems even if it contains a Base64-encoded stream. Simply use a basic reader, and don't use a Base64InputStream if you don't want to decode the data.
When opening a file with a reader, you have to specify the encoding. If you don't know it, I suggest you test with the probable ones, like UTF-8, US-ASCII or ISO-8859-1.
If you have a normal InputStream object, you can get a Base64-encoded stream from it directly using the Apache Commons Codec Base64InputStream constructor (pass doEncode = true to encode rather than decode).
I found the solution, inspired by this post: getting base64 content string of an image from a mimepart in Java
I think it is kind of wasteful to decode and then re-encode the Base64 content, but it is the only way I found to handle this issue. If someone can offer a better solution, it would also be really appreciated.
Thanks

File is not saved in UTF-8 encoding even when I set encoding to UTF-8

When I check my file with Notepad++ it's in ANSI encoding. What am I doing wrong here?
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file), "UTF8");
try {
    out.write(text);
    out.flush();
} finally {
    out.close();
}
UPDATE:
This is solved now; the reason JBoss didn't understand my XML wasn't the encoding, it was the naming of my XML file. Thanks all for the help, even though there really wasn't any problem...
If you're creating an XML file (as your comments imply), I would strongly recommend that you use the XML libraries to output this and write the correct XML encoding header. Otherwise your character encoding won't conform to XML standards and other tools (like your JBoss instance) will rightfully complain.
// Prepare the DOM document for writing
Source source = new DOMSource(doc);
// Prepare the output file
File file = new File(filename);
Result result = new StreamResult(file);
// Write the DOM document to the file
Transformer xformer = TransformerFactory.newInstance().newTransformer();
xformer.transform(source, result);
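If you want to be explicit about the encoding the Transformer writes and declares in the XML prolog, add this line before the transform(...) call (UTF-8 is already the default, so it mainly documents the intent):
xformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8"); // javax.xml.transform.OutputKeys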
There's no such thing as plain text. The problem is that an application is decoding character data without you telling it which encoding the data uses.
Although many Microsoft apps rely on the presence of a Byte Order Mark to indicate a Unicode file, this is by no means standard. The Unicode BOM FAQ says more.
You can add a BOM to your output by writing the character '\uFEFF' at the very start of the stream (through the UTF-8 writer, so it is encoded as the bytes EF BB BF). This should be enough for applications that rely on BOMs.
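As a sketch of that with the writer from the question:
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
out.write('\uFEFF'); // UTF-8 BOM, written before any other text
out.write(text);
out.close();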
UTF-8 is designed to be, in the common case, rather indistinguishable from ANSI. So when you write text to a file and encode the text with UTF-8, in the common case, it looks like ANSI to anyone else who opens the file.
UTF-8 is 1-byte-per-character for all ASCII characters, just like ANSI.
UTF-8 has all the same bytes for the ASCII characters as ANSI does.
UTF-8 does not have any special header characters, just as ANSI does not.
It's only when you start to get into the non-ASCII codepoints that things start looking different.
But in the common case, byte-for-byte, ANSI and UTF-8 are identical.
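A quick way to see this from Java (just a sketch; the strings are only examples):
String ascii = "hello";
String accented = "héllo";
// for pure ASCII text the UTF-8 and windows-1252 ("ANSI") bytes are identical...
System.out.println(java.util.Arrays.equals(
        ascii.getBytes(java.nio.charset.Charset.forName("UTF-8")),
        ascii.getBytes(java.nio.charset.Charset.forName("windows-1252")))); // true
// ...but as soon as a non-ASCII character appears, the byte sequences differ
System.out.println(java.util.Arrays.equals(
        accented.getBytes(java.nio.charset.Charset.forName("UTF-8")),
        accented.getBytes(java.nio.charset.Charset.forName("windows-1252")))); // false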
If there is no BOM (and Java doesn't output one for UTF8, it doesn't even recognize it), the text is identical in ANSI and UTF8 encoding as long as only characters in the ASCII range are being used. Therefore Notepad++ cannot detect any difference.
(And there seems to be an issue with UTF8 in Java anyways...)
The IANA-registered name is "UTF-8", not "UTF8"; however, Java accepts "UTF8" as a legacy alias for the same charset, so that's probably not the problem.
I suspect that Notepad++ is the problem. Examine the text using a hex dump program, and you should see it is properly encoded.
Did you try to write a BOM at the beginning of the file? BOM is the only thing that can tell the editor the file is in UTF-8. Otherwise, the UTF-8 file can just look like Latin-1 or extended ANSI.
You can do it like this:
public final static byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
...
OutputStream os = new FileOutputStream(file);
os.write(UTF8_BOM);
os.flush();
OutputStreamWriter out = new OutputStreamWriter(os, "UTF8");
try {
    out.write(text);
    out.flush();
} finally {
    out.close();
}
