Reliance on default encoding, what should I use and why? - java

FindBugs reports a bug:
Reliance on default encoding
Found a call to a method which will perform a byte to String (or String to byte) conversion, and will assume that the default platform encoding is suitable. This will cause the application behaviour to vary between platforms. Use an alternative API and specify a charset name or Charset object explicitly.
I used FileReader like this (just a piece of code):
public ArrayList<String> getValuesFromFile(File file){
    String line;
    StringTokenizer token;
    ArrayList<String> list = null;
    BufferedReader br = null;
    try {
        br = new BufferedReader(new FileReader(file));
        list = new ArrayList<String>();
        while ((line = br.readLine()) != null){
            token = new StringTokenizer(line);
            token.nextToken();
            list.add(token.nextToken());
            ...
To correct the bug I need to change
br = new BufferedReader(new FileReader(file));
to
br = new BufferedReader(new InputStreamReader(new FileInputStream(file), Charset.defaultCharset()));
And when I use PrintWriter, the same warning occurs. So now I have a question: when can (or should) I use FileReader and PrintWriter, if relying on the default encoding is bad practice?
And the second question: is it proper to use Charset.defaultCharset()? I decided to use this method to automatically pick up the charset of the user's OS.

Ideally, it should be:
try (InputStream in = new FileInputStream(file);
     Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
     BufferedReader br = new BufferedReader(reader)) {
...or:
try (BufferedReader br = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
...assuming the file is encoded as UTF-8.
Pretty much every encoding that isn't a Unicode Transformation Format is obsolete for natural language data. There are languages you cannot support without Unicode.

If the file is under the control of your application, and if you want the file to be encoded in the platform's default encoding, then you can use the default platform encoding. Specifying it explicitly makes it clearer, for you and future maintainers, that this is your intention. This would be a reasonable default for a text editor, for example, which would then write files that any other editor on the same platform would be able to read.
If, on the other hand, you want to make sure that any possible character can be written in your file, you should use a universal encoding like UTF-8.
And if the file comes from an external application, or is supposed to be compatible with an external application, then you should use the encoding that this external application expects.
What you must realize is that if you write a file the way you're doing on one machine, and read it the way you're doing on another machine that doesn't have the same default encoding, you won't necessarily be able to read what you wrote. Using a specific encoding such as UTF-8 for both writing and reading ensures the file is interpreted the same way on every platform.
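As a sketch of that guarantee (the file name and sample text are just illustrative), pinning the charset on both sides makes the round trip platform-independent:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class RoundTrip {
    // Write a line with an explicit charset and read it back with the same one
    public static String roundTrip(String text) throws IOException {
        Path file = Files.createTempFile("encoding-demo", ".txt");
        Files.write(file, List.of(text), StandardCharsets.UTF_8);   // explicit on write
        return Files.readAllLines(file, StandardCharsets.UTF_8).get(0); // explicit on read
    }

    public static void main(String[] args) throws IOException {
        // The non-ASCII characters survive regardless of the platform default
        System.out.println(roundTrip("naïve café ŻƩ"));
    }
}
```

With `FileReader`/`FileWriter` instead, the same program could produce mojibake as soon as the writing and reading machines disagree on their default charsets.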

You should use the default encoding whenever you read a file that originates outside your application and can be assumed to be in the user's local encoding, for example user-written text files. You might also want to use the default encoding when writing such files, depending on what the user is going to do with the file later.
You should not use the default encoding for any other file, especially application-relevant files.
If your application writes configuration files in text format, for example, you should always specify the encoding. In general, UTF-8 is a good choice, as it is compatible with almost everything. Not doing so might cause surprise crashes for users in other countries.
This is not limited to character encoding; it applies just as much to date/time, numeric, and other locale-specific formats. If you use the default encoding and default date/time strings on a US machine, and then try to read that file on a German server, you might be surprised that half of it is gibberish and the other half has months and days confused, or is off by one hour because of daylight saving time.
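To illustrate both points (the file name and keys here are hypothetical), a config writer can pin the encoding to UTF-8 and use ISO-8601 timestamps, which are independent of locale and of the server's timezone:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Instant;
import java.time.format.DateTimeFormatter;
import java.util.List;

public class ConfigWriter {
    // ISO-8601 in UTC: reads the same on a US machine and a German server
    public static String timestampLine(Instant now) {
        return "last_updated=" + DateTimeFormatter.ISO_INSTANT.format(now);
    }

    public static void main(String[] args) throws IOException {
        Path config = Paths.get("settings.conf");        // hypothetical file name
        List<String> lines = List.of(
                "greeting=Grüß Gott",                    // non-ASCII survives everywhere
                timestampLine(Instant.now()));
        Files.write(config, lines, StandardCharsets.UTF_8); // encoding is explicit
    }
}
```

Had the code used `FileWriter` and a locale-default `DateFormat` instead, both the bytes on disk and the parsed dates would silently change between machines.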

When you are using a PrintWriter, specify the charset on the underlying OutputStreamWriter:
File file = new File(filePath);
Writer w = new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_16);
PrintWriter pw = new PrintWriter(w);
pw.println(contentToWrite);
pw.close();

This will work (the FileReader constructor that takes a Charset requires Java 11 or later):
FileReader file = new FileReader(csvFile, StandardCharsets.UTF_8);
BufferedReader csvReader = new BufferedReader(file);

Related

BufferedWriter to write at BufferedReader position

My code reads through an XML file encoded with UTF-8 until a specified string has been found. It finds the specified string fine, but I wish to write at that point in the file.
I would much prefer to do this through a stream as only small tasks need to be done.
I cannot find a way to do this. Any alternative methods are welcome.
Code so far:
final String RESOURCE = "/path/to/file.xml";
BufferedReader in = new BufferedReader(new InputStreamReader(
        ClassLoader.class.getResourceAsStream(RESOURCE), "UTF-8"));
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(ClassLoader.class.getResource(RESOURCE).getPath()), "UTF-8"));
String fileLine = in.readLine();
while (!fileLine.contains("some string")) {
    fileLine = in.readLine();
}
// File writing code here
You can't really write into the middle of a file, except by overwriting existing bytes (using something like RandomAccessFile). That would only work, however, if what you needed to write was exactly the same byte length as what you were replacing, which I highly doubt.
Instead, you need to rewrite the file to a new file, copying the input to the output and replacing the parts you need to replace in the process. There are a variety of ways you could do this. I would recommend using a StAX event reader and writer, as the StAX API is fairly user friendly (compared to SAX) as well as fast and memory efficient.
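A minimal sketch of that copy-and-replace approach, using in-memory XML for brevity (the element content and replacement text are made up; a real version would read from and write to streams with an explicit charset):

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

public class StaxRewrite {
    // Copies the XML event stream, replacing any text node equal to `target`
    public static String rewrite(String xml, String target, String replacement) throws Exception {
        XMLEventReader reader = XMLInputFactory.newFactory()
                .createXMLEventReader(new StringReader(xml));
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newFactory().createXMLEventWriter(out);
        XMLEventFactory events = XMLEventFactory.newFactory();
        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();
            if (e.isCharacters() && e.asCharacters().getData().equals(target)) {
                writer.add(events.createCharacters(replacement)); // swap in new text
            } else {
                writer.add(e); // everything else is copied through unchanged
            }
        }
        writer.close();
        reader.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rewrite("<root><item>old</item></root>", "old", "new"));
    }
}
```

Because events are streamed one at a time, memory use stays flat no matter how large the document is, unlike a DOM-based rewrite.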

Character encoding via JDBC/ODBC/Microsoft Access

I'm connecting to Microsoft Access via JDBC/ODBC successfully. After that, I run a query to select rows from Microsoft Access, and I write the results to a TXT file. Everything is OK, except that some strings include accents, and these appear as '?' in the TXT file. I have already tried various ways of writing files in Java, such as PrintWriter, FileWriter, OutputStream, and others, including adding a character-encoding parameter (UTF-8 or ISO-8859-1) to some of these methods. I need help finding a way to write these characters correctly. Thanks.
Try the lines below:
String OUTPUTFILE = "PATH/TO/FILE/";
BufferedWriter bf = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(OUTPUTFILE), "UTF-8"));
Once you add that to your code, you should be fine using bf.write("VALUE") to write UTF-8 characters to your file. Also, make sure to set your text editor's encoding to Unicode or UTF-8; if you don't, it might seem like the whole process didn't work, which would lead to even more confusion.
Edited:
To read UTF-8 text files:
String INPUTFILE = "PATH/TO/FILE";
BufferedReader in = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(INPUTFILE), "UTF-8"));
Then, to read a line: String str = in.readLine();

how to read text file on any machine in java

I am trying to read a file, but it only works on my machine; it does not work on another machine. Here is my code:
FileInputStream fstream = new FileInputStream("/path of myfile/User.txt");
DataInputStream in = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String str;
while ((str = br.readLine()) != null) {
    System.out.println(str);
}
Please help me: how can I read the file on another machine as well, and what changes should I make?
I'm just guessing that you have already found a way to share the file (HTTP, FTP, SMB, or NFS) but are having some problems, perhaps funny characters appearing in the text. If you don't name the encoding that you want to use, the default one for the machine will be used, and if the two machines have different defaults, you'll run into problems.
Choose one encoding for both writing and reading. For the universal encoding UTF-8, for example, your source should be modified to:
BufferedReader br = new BufferedReader(new InputStreamReader(in, "UTF-8"));
When you write the file, you of course have to use the same encoding, for instance:
FileOutputStream fos = new FileOutputStream("/path of myfile/User.txt");
OutputStreamWriter out = new OutputStreamWriter(fos, "UTF-8");
If you want to read a file that resides on another machine, you have to serve that file using some kind of network server, like an http-server or an smb-server.

Displaying special characters

I am running into issues when displaying special characters on the Windows console.
I have written the following code:
public static void main(String[] args) throws IOException {
    File newFile = new File("sampleInput.txt");
    File newOutFile = new File("sampleOutput.txt");
    FileReader read = new FileReader(newFile);
    FileWriter write = new FileWriter(newOutFile);
    PushbackReader reader = new PushbackReader(read);
    int c;
    while ((c = reader.read()) != -1) {
        write.write(c);
    }
    read.close();
    write.close();
}
The output file contains exactly what the input file contains, special characters included; i.e., for the input file contents © Ø ŻƩ abcdefĦ, the output file has exactly the same contents. But when I add the line System.out.printf("%c", (char) c), the console shows: ÿþ© (followed by more characters that I am unable to copy-paste here). I did read that the issue might be the Windows console character set, but I am not able to figure out the fix for it.
Considering the output medium can be anything in future, I do not want to run into issues with Unicode character display for any type of out stream.
Can anyone please help me understand the issue and how can I fix the same ?
The Reader and Writer will use the platform default charset to transform characters to bytes. In your environment, that's apparently not a Unicode-compatible charset like UTF-8.
You need InputStreamReader and OutputStreamWriter wherein you can explicitly specify the charset.
Reader read = new InputStreamReader(new FileInputStream(newFile), "UTF-8");
Writer write = new OutputStreamWriter(new FileOutputStream(newOutFile), "UTF-8");
// ...
Also, the console needs to be configured to use UTF-8 to display the characters. In Eclipse, for example, you can do that via Window > Preferences > General > Workspace > Text File Encoding.
In the Windows command prompt it may not be possible to display those characters at all, due to the lack of a font supporting them; you would need to move to a Swing-like UI console approach.
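Since the output medium may vary, one stream-agnostic option is to wrap whatever output you have in a PrintStream with an explicit charset, so the bytes are the same whether they go to the console, a file, or a buffer (a sketch; whether the console font can actually render the glyphs is a separate issue):

```java
import java.io.ByteArrayOutputStream;
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;

public class Utf8Console {
    // Encode a string through a UTF-8 PrintStream and return the raw bytes,
    // demonstrating that the encoding no longer depends on the platform default
    public static byte[] utf8Bytes(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (PrintStream ps = new PrintStream(buf, true, "UTF-8")) {
            ps.print(s);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Route stdout through UTF-8 regardless of the default charset
        PrintStream out = new PrintStream(
                new FileOutputStream(FileDescriptor.out), true, "UTF-8");
        out.println("© Ø ŻƩ abcdefĦ");
    }
}
```

The same wrapper works for System.err or any other OutputStream, which keeps the printing code independent of the eventual output medium.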
See also:
Unicode - How to get the characters right?
Instead of FileWriter try using OutputStreamWriter and specify the encoding of the output.

Unicode in Jar resources

I have a Unicode (UTF-8 without BOM) text file within a jar, that's loaded as a resource.
URL resource = MyClass.class.getResource("datafile.csv");
InputStream stream = resource.openStream();
BufferedReader reader = new BufferedReader(
new InputStreamReader(stream, Charset.forName("UTF-8")));
This works fine on Windows, but on Linux it appears not to be reading the file correctly: accented characters come out broken. I'm aware that different machines can have different default charsets, but I'm giving it the correct charset explicitly. Why would it not be using it?
The reading part looks correct, I use that all the time on Linux.
I suspect you used default encoding somewhere when you export the text to the web page. Due to the different default encoding on Linux and Windows, you saw different result.
For example, you use default encoding if you do anything like this in servlet,
PrintWriter out = response.getWriter();
out.println(text);
You need to specifically write in UTF-8 like this,
response.setContentType("text/html; charset=UTF-8");
out = new PrintWriter(
new OutputStreamWriter(response.getOutputStream(), "UTF-8"), true);
out.println(text);
I wonder if reviewing UTF-8 on Linux would help. Could be a setup issue.
