In previous versions of Java, I would read a file by creating a buffered reader like this:
BufferedReader in = new BufferedReader(new FileReader("file.txt"));
In Java 7, I would like to use Files.newBufferedReader, but I need to pass in a charset as well. For example:
BufferedReader in = Files.newBufferedReader(Paths.get("file.txt"),
Charset.forName("US-ASCII"));
Previously, I did not have to worry about charsets when reading plain text files. What charset shall I use? Do you know what charset was used by default in previous versions of Java? I simply want to be able to find and replace the old statement with the new one.
Previously, I did not have to worry about charsets when reading plain text files.
Well, you should have done. If you were just using FileReader, it was using the default character encoding for the system. That was a bad idea, which is why I always used FileInputStream and an InputStreamReader. You should always be explicit about it. If you really want the default character encoding for the system, you should use Charset.defaultCharset() - but I strongly suggest that you don't.
If you're going to read a file, you should know the character encoding, and specify that. If you get to decide what character encoding to use when writing a file, UTF-8 is a good default choice.
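For the find-and-replace the question asks about, a minimal sketch might look like the following. The class name, the helper `firstLine`, and the temporary sample file are made up for illustration; only `Files.newBufferedReader`, `StandardCharsets` and `Charset.defaultCharset()` come from the JDK.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetRead {

    // Reads the first line of a file, decoding its bytes with the given charset.
    static String firstLine(Path path, Charset charset) {
        try (BufferedReader in = Files.newBufferedReader(path, charset)) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample file, created here so the example runs on its own.
        Path path = Files.createTempFile("charset-demo", ".txt");
        Files.write(path, "h\u00E9llo\n".getBytes(StandardCharsets.UTF_8));

        // Preferred: name the encoding the file was actually written in.
        String line = firstLine(path, StandardCharsets.UTF_8);
        System.out.println("Characters read: " + line.length());

        // Passing Charset.defaultCharset() instead would mirror what the old
        // FileReader did, with a result that depends on the machine.
        System.out.println("Platform default: " + Charset.defaultCharset());

        Files.delete(path);
    }
}
```

One difference to be aware of: a reader obtained from `Files.newBufferedReader` throws an exception when it meets bytes that are malformed for the chosen charset, whereas `FileReader` silently substitutes a replacement character, so the swap is not always behaviour-neutral.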
PrintWriter and PrintStream in Java use Charset.defaultCharset() by default:
java.nio.charset.Charset.defaultCharset()
Related
I have a file which contains the following string:
AAdοbe Dοcument Clοud
if viewed in Notepad++. In hex view the string looks like this:
If I read the file with Java the string looks like this:
AAdοbe Dοcument Clοud
How I can get the same encoding in Java as with Notepad++?
Your file is encoded as UTF-8, and the bytes CE BF are the UTF-8 encoding of the character ο ('GREEK SMALL LETTER OMICRON' (U+03BF)).
If you use the Encoding pull-down menu in Notepad++ to specify UTF-8, you should see the content as:
AAdοbe Dοcument Clοud
You might want to replace those Greek ο's with regular Latin o's ('LATIN SMALL LETTER O' (U+006F)).
If you decide to keep the Greek ο's, you need to make sure your Java program reads the file using UTF-8, which is best done using one of these:
BufferedReader reader = Files.newBufferedReader(Paths.get("file.txt")); // UTF-8 is the default
BufferedReader reader = Files.newBufferedReader(Paths.get("file.txt"), StandardCharsets.UTF_8);
If you look at the text with a debugger, you should see that it is now read correctly. If you print the text, make sure the console window you're using can handle UTF-8 characters, otherwise it might just print wrong, even though it was read correctly.
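To see where the "Î¿" comes from, here is a small sketch that decodes the two bytes CE BF under two charsets. ISO-8859-1 is my stand-in for whatever single-byte charset the program was using by accident (Windows-1252 maps these two bytes the same way); the class and method names are made up for the demo. It prints code points rather than the characters themselves, so the console encoding cannot confuse the result.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {

    // Decodes the bytes as UTF-8, the encoding the file was written in.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decodes the same bytes as ISO-8859-1, one character per byte.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] omicron = {(byte) 0xCE, (byte) 0xBF};
        String right = decodeUtf8(omicron);
        String wrong = decodeLatin1(omicron);

        // Prints: UTF-8: 1 char, U+03BF
        System.out.printf("UTF-8: %d char, U+%04X%n",
                right.length(), (int) right.charAt(0));
        // Prints: ISO-8859-1: 2 chars, U+00CE U+00BF
        System.out.printf("ISO-8859-1: %d chars, U+%04X U+%04X%n",
                wrong.length(), (int) wrong.charAt(0), (int) wrong.charAt(1));
    }
}
```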
You must set the encoding in the FileReader like this (this constructor requires Java 11 or later):
new FileReader(fileName, StandardCharsets.UTF_8)
You must read the file in Java using the same encoding that the file was written in.
If you are working with non-standard encodings, even trying to read the encoding with something like:
InputStreamReader r = new InputStreamReader(new FileInputStream(theFile));
r.getEncoding()
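can return the wrong value, because getEncoding() reports the charset the reader was constructed with (here the platform default), not the encoding the file was actually written in.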
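A quick way to convince yourself of this is the sketch below. It feeds UTF-8 bytes to a reader that was deliberately constructed with ISO-8859-1 and asks it for its encoding. The class name, the helper, and the in-memory byte array (standing in for the file) are my own assumptions for the demo.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GetEncodingDemo {

    // Returns the charset an InputStreamReader reports after being
    // constructed with an explicitly chosen charset.
    static Charset reportedCharset(byte[] content, Charset chosen) {
        // In-memory stream, so leaving the reader unclosed leaks nothing.
        InputStreamReader r =
                new InputStreamReader(new ByteArrayInputStream(content), chosen);
        // getEncoding() may return a historical name such as "ISO8859_1";
        // Charset.forName resolves that alias to the canonical charset.
        return Charset.forName(r.getEncoding());
    }

    public static void main(String[] args) {
        // The content is UTF-8, but the reader is told ISO-8859-1.
        byte[] utf8Bytes = "\u03BF".getBytes(StandardCharsets.UTF_8);
        Charset reported = reportedCharset(utf8Bytes, StandardCharsets.ISO_8859_1);

        // The reader echoes the constructor argument; it never inspects the bytes.
        System.out.println(reported.equals(StandardCharsets.ISO_8859_1));
    }
}
```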
There's a little library that handles encoding detection a bit better: https://code.google.com/archive/p/juniversalchardet/
It also has some holes in obtaining proper encoding, but I've used it.
While using it, I found that most of the non-standard encodings I came across could be read as UTF-16, like this:
new FileReader(fileName, StandardCharsets.UTF_16)
Java has supported the UTF-16 encoding for a long time. It is defined in the standard Java API as StandardCharsets.UTF_16. That character set covers lots of language-specific characters and emojis.
I read the Java documentation for FileWriter, and it doesn't allow us to specify an encoding. I'm using FileWriter to avoid the newline character that gets automatically appended when writing a list of strings to a file (and I think this is the only thing I can use to achieve that).
I'm now facing a problem: some Japanese characters in a few properties files are written as "???", so I need to pass in the encoding information somehow. Is there any other way to either avoid the newline being appended or pass encoding info to FileWriter?
Don't use FileWriter, but instead new OutputStreamWriter(new FileOutputStream(file), encoding).
If you're working with Path (Java 7+), you can use Files.newBufferedWriter(path, charset, options) to specify a Charset. (This also wraps your output in a BufferedWriter, which is nice.)
The classic alternative is to use OutputStreamWriter and pass a Charset that way.
Always passing a charset to writers is a good habit, even if you don't think you're ever likely to deal with non-ASCII content.
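Here is a rough sketch of the Files.newBufferedWriter route that also covers the "no appended newline" requirement: the strings are written back to back and nothing is added between or after them. The class name, the `writeAll` helper, and the sample property are invented for the example.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class WriteWithCharset {

    // Writes the strings back to back, with no separator or trailing newline,
    // encoding them with the given charset.
    static void writeAll(Path path, List<String> parts, Charset charset) {
        try (BufferedWriter out = Files.newBufferedWriter(path, charset)) {
            for (String part : parts) {
                out.write(part);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical output file, created as a temp file for the demo.
        Path path = Files.createTempFile("props-demo", ".properties");

        // The escaped characters are the three-character Japanese word for
        // "Japanese language"; each one takes three bytes in UTF-8.
        writeAll(path, Arrays.asList("greeting=", "\u65E5\u672C\u8A9E"),
                StandardCharsets.UTF_8);

        // 9 ASCII bytes + 3 * 3 bytes = 18, and no newline at the end.
        System.out.println(Files.readAllBytes(path).length + " bytes written");
        Files.delete(path);
    }
}
```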
I'm processing a Unicode text file using the Java platform on OS X. When I open the file using TextEdit or TextWrangler instead of seeing "Nattvardsgästerna" I see "Nattvardsg‰sterna" (which is incorrect). When I open the file using the Java io stream, I see the same incorrect String "Nattvardsg‰sterna".
When I open the file on my PC I see the correct String. I'm not sure where to start solving this problem... Is it an issue with my OS X set-up? Should I open the Java stream with a special flag?
Thanks.
P.S. I'm opening the file like so: fileReader = new BufferedReader(new FileReader(file));
P.S.S. Also, I should mention that I'd like to output the result as an SQL text file so it is important for the OS to distinguish ä correctly.
An InputStream reads bytes (not characters), so I assume when you say:
When I open the file using java io stream
... that you really mean "when I open the file using a Java Reader".
EDIT: Your comment says that you're doing this:
new BufferedReader(new FileReader(file));
An InputStreamReader has a constructor that allows you to set the character encoding. If you don't specify one, it will use the platform default. It's unlikely the platform default will be a Unicode encoding (on my MacBook, it's set to "US-ASCII").
In order to set the character encoding, you must create the intermediate InputStreamReader yourself rather than letting FileReader do it for you (because FileReader uses the platform default encoding).
Assuming the file is encoded using UTF-8, use:
new BufferedReader(new InputStreamReader(new FileInputStream(file),
Charset.forName("UTF-8")));
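As a rough illustration of why the explicit charset matters, the sketch below decodes the same UTF-8 bytes twice. The byte array stands in for the file, and US-ASCII stands in for a platform default that cannot represent ä; both are assumptions made for the demo, as are the class and method names.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReaderCharsetDemo {

    // Reads the first line from the bytes, decoding them with the given charset.
    static String firstLine(byte[] fileBytes, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(fileBytes), charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The a-umlaut is written as an escape so that the encoding of this
        // source file cannot interfere with the demo.
        byte[] fileBytes = "Nattvardsg\u00E4sterna".getBytes(StandardCharsets.UTF_8);

        String asUtf8 = firstLine(fileBytes, StandardCharsets.UTF_8);
        String asAscii = firstLine(fileBytes, StandardCharsets.US_ASCII);

        // 17 characters: the two UTF-8 bytes became one a-umlaut.
        System.out.println(asUtf8.length());
        // 18 characters: each of the two bytes became U+FFFD.
        System.out.println(asAscii.length());
    }
}
```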
Alternatively, you can change the platform default by supplying an argument to the JVM. You can look at this answer for the full details, but the basic idea is that you set the file.encoding Java system property. The linked answer provides a few ways to achieve this.
FURTHER EDIT:
P.S.S. Also, I should mention that I'd like to output the result as an SQL text file so it is important for the OS to distinguish ä correctly.
The OS hasn't got anything to do with this. The file system is just shuffling bytes around. How those bytes are interpreted is entirely up to the applications that are reading those files. This answer tells you how to make your Java program interpret the bytes correctly. For your database to be able to interpret the bytes correctly, you'll need to configure the database encoding.
I am writing a program that implements a file structure; the program prints out a product file based on that structure. Product names include the letters Æ, Ø and Å. These letters are not displayed correctly in the output file. I use
PrintWriter printer = new PrintWriter(new FileOutputStream(new File("products.txt")));
ISO 8859-1 or Windows ANSI (CP 1252) are the character sets that the implementation requires.
There are two possibilities:
Java is using the wrong encoding when outputting the file.
The file is actually correct, and whatever you are using to display the file is using the wrong encoding.
Assuming that the problem is the first one, the root cause is that Java has figured out that the default encoding for the platform is something other than the one you want / expect. There are three ways to solve this:
Figure out why Java has got the default locale and encoding "wrong" and remedy that. It will be something to do with your operating system's locale settings ...
Read this FAQ for details on how you can override the default locale settings at the command line.
Use a PrintWriter constructor that specifies the encoding explicitly so that your application doesn't rely on the default encoding. For example:
PrintWriter pw = new PrintWriter("filename", "ISO-8859-1");
In response to this comment:
Don’t PrintWriters all have the bug that you can’t know you had an error with them?
It is not a bug, it is a design feature.
You can find out if there was an error. You just can't find out what it was.
If you don't like it, you can use Writer instead.
They won’t raise an exception or even return failure if you try to shove a codepoint at them that can’t fit in the designated encoding.
Neither will a regular Writer I believe ... unless you specifically construct it to do this. The normal behaviour is to replace any unmappable codepoint with a specific character, though this is not specified in the javadocs (IIRC).
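For what it's worth, here is a sketch of what "specifically construct it to do this" could look like: an OutputStreamWriter built from a CharsetEncoder that reports unmappable characters, next to the ordinary lenient one. The class and helper names are mine, and the in-memory stream stands in for a file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictWriterDemo {

    // Builds a Writer that throws instead of substituting a replacement
    // when a character cannot be encoded in the target charset.
    static Writer strictWriter(OutputStream out, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return new OutputStreamWriter(out, encoder);
    }

    // Returns true if the text can be written in the charset without loss.
    static boolean writesCleanly(String text, Charset charset) {
        try (Writer w = strictWriter(new ByteArrayOutputStream(), charset)) {
            w.write(text);
            w.flush();
            return true;
        } catch (CharacterCodingException e) {
            return false; // an unmappable or malformed character was reported
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Encodes with the ordinary OutputStreamWriter, which substitutes silently.
    static byte[] lenientBytes(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, charset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(writesCleanly("g\u00E4st", StandardCharsets.US_ASCII)); // false
        System.out.println(writesCleanly("g\u00E4st", StandardCharsets.UTF_8));    // true
        // The lenient writer emits '?' (0x3F) for the a-umlaut.
        System.out.println((char) lenientBytes("\u00E4", StandardCharsets.US_ASCII)[0]);
    }
}
```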
Do they even tell you if the filesystem fills up? I seem to recall that they don’t.
That is true. However:
For the kind of file you typically write using a PrintWriter this is not a critical issue.
If it is a critical issue AND you still want to use PrintWriter, you can always call checkError() (IIRC) to find out if there was an error.
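A small sketch of the checkError() approach, in case it helps. FailingStream is a hypothetical stand-in for a full disk that I made up for the demo; the only PrintWriter calls used are println and checkError.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

class CheckErrorDemo {

    // Hypothetical stand-in for a full disk: every write fails.
    static final class FailingStream extends OutputStream {
        @Override
        public void write(int b) throws IOException {
            throw new IOException("simulated: no space left on device");
        }
    }

    // Prints one line and reports whether the PrintWriter recorded a failure.
    static boolean printAndCheck(OutputStream target, String line) {
        PrintWriter pw = new PrintWriter(
                new OutputStreamWriter(target, StandardCharsets.UTF_8));
        pw.println(line);       // never throws IOException, even on failure
        return pw.checkError(); // flushes, then reports any swallowed error
    }

    public static void main(String[] args) {
        System.out.println(printAndCheck(new FailingStream(), "product"));         // true
        System.out.println(printAndCheck(new ByteArrayOutputStream(), "product")); // false
    }
}
```

As the answer says, this tells you that something went wrong, but not what; if you need the actual exception, a plain Writer is the better fit.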
I always end up writing my own OutputStreamWriter constructor with the explicit Charset.forName("UTF-8").newEncoder() second argument. It’s kind of tedious, so perhaps there’s a better way.
I dunno.
I have a text file with Chinese words written to a line. The line is surrounded with "\r\n", and written using fileOutputStream.write(string.getBytes()).
I have no problems reading lines of English words, my buffered reader parses it with readLine() perfectly. However, it recognizes the Chinese sentence as multiple lines, thus screwing up my programme flow.
Any solutions?
Using string.getBytes() encodes the String using the platform default encoding. That is rarely what you want, especially when you're trying to write characters that are not native to your current locale.
Specify the encoding instead (using string.getBytes("UTF-8"), for example).
A cleaner and more Java-esque way would be to wrap your OutputStream in an OutputStreamWriter like this:
Writer w = new OutputStreamWriter(out, "UTF-8");
Then you can simply call w.write(string) and don't need to repeat the encoding each time you want to write a String.
And, as commented below, specify the same encoding when reading the file (using a Reader, preferably).
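Putting the two halves together, a minimal round-trip sketch could look like this. The ByteArrayOutputStream stands in for the FileOutputStream from the question, and the class name, helper, and sample lines are assumptions for the demo. The point is only that the same charset appears on the write side and on the read side.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MatchingEncodingDemo {

    // Writes each line followed by "\r\n" using the charset, then reads the
    // bytes back line by line with the same charset.
    static List<String> roundTrip(List<String> lines, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            for (String line : lines) {
                out.write((line + "\r\n").getBytes(charset));
            }
            List<String> result = new ArrayList<>();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new ByteArrayInputStream(out.toByteArray()), charset))) {
                String line;
                while ((line = in.readLine()) != null) {
                    result.add(line);
                }
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The second entry is a short Chinese phrase, written with escapes.
        List<String> lines = Arrays.asList("hello", "\u4E2D\u6587\u53E5\u5B50");
        List<String> readBack = roundTrip(lines, StandardCharsets.UTF_8);

        System.out.println(readBack.size());        // 2: one entry per written line
        System.out.println(readBack.equals(lines)); // true
    }
}
```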
If you're outputting the text via fileOutputStream.write(string.getBytes()), you're outputting with the default encoding for the platform. It's important to ensure you're then reading with the appropriate encoding, and using methods that are encoding-aware. The problem won't be in your BufferedReader instance, but whatever Reader you have under it that's converting bytes into characters.
This article may be of use: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)