Reading UTF-8 file and writing plain ANSI?

I have a UTF-8 file (it's a CSV).
I need to read this file line by line, do some replacements, and then write each line to another file.
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(fileFix), "ASCII"));
bw.write(""); // clean current file
BufferedReader br = new BufferedReader(new InputStreamReader(
        new FileInputStream(file), "UTF-8"));
String line;
while ((line = br.readLine()) != null) {
    line = line.replace(";", ",");
    bw.append(line + "\n");
}
Simple as that.
The problem is that the output file (fileFix) is UTF-8, and I think it has the BOM character.
How can I write the file as plain ANSI without the BOM?
This is the error I get when reading my file with another program (Weka):
The first line of this file:
Note that Notepad++ tells me the charset is UTF-8. If I convert this file to plain ASCII (with Windows Notepad), those characters disappear.
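Solution
When you are on the first line, run:
line = line.substring(1);
to remove the BOM character. Note that this drops the first character unconditionally, so it is only safe when the file is known to start with a BOM.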

It sounds like this is a BOM issue rather than an encoding issue as such.
You can just remove any BOM characters as you write the file, with:
line = line.replace("\ufeff", "");
That leaves the question of whether you're reading the data accurately in the first place... I'd strongly advise you not to use FileWriter and FileReader at all - instead, use InputStreamReader and OutputStreamWriter, specifying the encoding explicitly for both of them. Set the reader encoding to UTF-8 (assuming the input file really is UTF-8), and set the writer encoding to whatever you want... but I'd recommend sticking with UTF-8, to be honest.
Also note that you should be closing your reader/writer in finally blocks, or using the try-with-resources statement if you're using Java 7.
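The advice above can be put together as a minimal sketch. The class name CsvFixer and its two methods are made up for illustration, and the code assumes the input really is UTF-8 and that a BOM, if present, shows up as U+FEFF at the start of the first line.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original question's code.
class CsvFixer {
    /** Drops a leading U+FEFF (the decoded byte order mark), if present. */
    static String stripBom(String line) {
        return line.startsWith("\uFEFF") ? line.substring(1) : line;
    }

    /** Reads UTF-8, swaps ';' for ',', and writes with the given charset. */
    static void convert(Path in, Path out, Charset outCharset) throws IOException {
        // try-with-resources closes both files even if an exception is thrown.
        try (BufferedReader br = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter bw = Files.newBufferedWriter(out, outCharset)) {
            String line;
            boolean first = true;
            while ((line = br.readLine()) != null) {
                if (first) {
                    line = stripBom(line); // a BOM can only appear at the very start
                    first = false;
                }
                bw.write(line.replace(';', ','));
                bw.newLine();
            }
        }
    }
}
```

If "ANSI" means the Western European Windows code page, something like CsvFixer.convert(in, out, Charset.forName("windows-1252")) may be what is wanted, but that is an assumption about the target program. Unlike an OutputStreamWriter created with a charset name, the writer from Files.newBufferedWriter throws an IOException for characters the target charset cannot encode instead of silently writing '?'.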

Look at http://en.wikipedia.org/wiki/Byte_order_mark for the pattern to replace; it looks like EF BB BF rather than FE FF.
This solution is wrong; check Jon's answer instead.
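The two byte patterns are easy to confuse because they describe the same character at different levels. Once the file has been decoded, the BOM is the single char U+FEFF; EF BB BF is only how that char looks as raw UTF-8 bytes. A small sketch (the class name BomBytes and the hex helper are made up for illustration):

```java
import java.nio.charset.StandardCharsets;

class BomBytes {
    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bom = "\uFEFF"; // one char in a Java String
        // Encoded as UTF-8 it becomes three bytes.
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_8)));    // EF BB BF
        // Encoded as big-endian UTF-16 it is the two bytes FE FF.
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_16BE))); // FE FF
    }
}
```

So replacing "\ufeff" in the decoded String is the right operation when working with a Reader; the EF BB BF pattern would only matter when processing the raw bytes.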

Related

Reliance on default encoding, what should I use and why?

FindBugs reports a bug:
Reliance on default encoding
Found a call to a method which will perform a byte to String (or String to byte) conversion, and will assume that the default platform encoding is suitable. This will cause the application behaviour to vary between platforms. Use an alternative API and specify a charset name or Charset object explicitly.
I used FileReader like this (just a piece of code):
public ArrayList<String> getValuesFromFile(File file){
    String line;
    StringTokenizer token;
    ArrayList<String> list = null;
    BufferedReader br = null;
    try {
        br = new BufferedReader(new FileReader(file));
        list = new ArrayList<String>();
        while ((line = br.readLine())!=null){
            token = new StringTokenizer(line);
            token.nextToken();
            list.add(token.nextToken());
            ...
To correct the bug I need to change
br = new BufferedReader(new FileReader(file));
to
br = new BufferedReader(new InputStreamReader(new FileInputStream(file), Charset.defaultCharset()));
When I use PrintWriter, the same warning is reported. So now I have a question: when can (or should) I use FileReader and PrintWriter, if relying on the default encoding is not good practice?
My second question: is this a proper use of Charset.defaultCharset()? I decided to use this method to pick up the charset of the user's OS automatically.

Ideally, it should be:
try (InputStream in = new FileInputStream(file);
     Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
     BufferedReader br = new BufferedReader(reader)) {
...or:
try (BufferedReader br = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
...assuming the file is encoded as UTF-8.
Pretty much every encoding that isn't a Unicode Transformation Format is obsolete for natural language data. There are languages you cannot support without Unicode.

If the file is under the control of your application, and if you want the file to be encoded in the platform's default encoding, then you can use the default platform encoding. Specifying it explicitly makes it clearer, for you and future maintainers, that this is your intention. This would be a reasonable default for a text editor, for example, which would then write files that any other editor on this platform would be able to read.
If, on the other hand, you want to make sure that any possible character can be written to your file, you should use a universal encoding like UTF-8.
And if the file comes from an external application, or is supposed to be compatible with an external application, then you should use the encoding that this external application expects.
What you must realize is that if you write a file with the default encoding on one machine and read it with the default encoding on another machine whose default differs, you won't necessarily be able to read what you have written. Using a specific encoding such as UTF-8 for both writing and reading makes sure the file is interpreted the same way, whatever platform is used.
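The mismatch described above can be reproduced without two machines by encoding with one charset and decoding with another. This is only an illustrative sketch: the class and method names are made up, and ISO-8859-1 stands in for "some other platform default".

```java
import java.nio.charset.StandardCharsets;

class MismatchDemo {
    /** Encodes text as UTF-8, then decodes the same bytes as ISO-8859-1. */
    static String writeUtf8ReadLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "caf\u00E9" is "café"; the two UTF-8 bytes of the accented letter
        // are decoded as two separate ISO-8859-1 characters.
        System.out.println(writeUtf8ReadLatin1("caf\u00E9"));
    }
}
```

The four-character input comes back as five characters, which is the typical "gibberish" seen when writer and reader disagree about the encoding.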

You should use default encoding whenever you read a file that is outside your application and can be assumed to be in the user's local encoding, for example user written text files. You might want to use the default encoding when writing such files, depending on what the user is going to do with that file later.
You should not use default encoding for any other file, especially application relevant files.
If your application, for example, writes configuration files in text format, you should always specify the encoding. In general UTF-8 is a good choice, as it is compatible with almost everything. Not doing so might cause surprising crashes for users in other countries.
This is not limited to character encoding; it applies equally to date/time, numeric, and other locale-specific formats. If, for example, you use the default encoding and default date/time strings on a US machine and then try to read that file on a German server, you might be surprised that one half is gibberish and the other half has months and days swapped or is off by one hour because of daylight saving time.

When you are using a PrintWriter, wrap it around an OutputStreamWriter that is given an explicit charset:
File file = new File(file_path);
Writer w = new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_16);
PrintWriter pw = new PrintWriter(w);
pw.println(content_to_write);
pw.close();

This will work (the FileReader constructors that take a Charset were added in Java 11):
FileReader file = new FileReader(csvFile, Charset.forName("UTF-8"));
BufferedReader csvReader = new BufferedReader(file);

Character encoding via JDBC/ODBC/Microsoft Access

I connect to Microsoft Access via JDBC/ODBC successfully. I then run a query to select rows and write the results to a TXT file. Everything works, except that strings containing accented characters appear as '?' in the TXT file. I have already tried various ways of writing files in Java (PrintWriter, FileWriter, OutputStream, and others), including passing a character encoding parameter (UTF-8 or ISO-8859-1) to some of them. How can I get these characters written correctly? Thanks.

Try the lines below:
String OUTPUTFILE = "PATH/TO/FILE/";
BufferedWriter bf = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(OUTPUTFILE), "UTF8"));
Once you add that to your code, you should be fine using bf.write("VALUE") to write UTF-8 characters to your file. Also make sure to set your text editor's encoding to Unicode or UTF-8; if you don't, it might seem like the whole process didn't work, which would lead to even more confusion.
Edited:
To read UTF-8 text files:
String INPUTFILE = "PATH/TO/File";
BufferedReader in = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(INPUTFILE), "UTF8"));
Then, to read a line: String str = in.readLine();
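One common way a literal '?' ends up in a file is that a character was encoded with a charset that cannot represent it. Whether that is what happens in the asker's JDBC/ODBC setup is an assumption; the characters could also be lost earlier, inside the driver. The sketch below (class and method names are illustrative) shows the effect in isolation:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class QuestionMarks {
    /** Encodes text with the given charset and decodes it again with the same one. */
    static String roundTrip(String text, Charset cs) {
        // String.getBytes(Charset) substitutes the charset's replacement
        // byte for characters it cannot encode.
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String accented = "\u00E9"; // e with acute accent
        System.out.println(roundTrip(accented, StandardCharsets.US_ASCII)); // ?
        System.out.println(roundTrip(accented, StandardCharsets.UTF_8).equals(accented)); // true
    }
}
```

If the strings already contain '?' when they come out of the ResultSet, changing the writer's encoding will not bring the accents back; checking the values before writing helps to tell the two cases apart.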

Displaying special characters

I am running into issues when displaying special characters on the Windows console.
I have written the following code:
public static void main(String[] args) throws IOException {
    File newFile = new File("sampleInput.txt");
    File newOutFile = new File("sampleOutput.txt");
    FileReader read = new FileReader(newFile);
    FileWriter write = new FileWriter(newOutFile);
    PushbackReader reader = new PushbackReader(read);
    int c;
    while ((c = reader.read()) != -1)
    {
        write.write(c);
    }
    read.close();
    write.close();
}
The output file looks exactly like the input file, special characters included: for the input contents © Ø ŻƩ abcdefĦ, the output file contains exactly the same contents. But when I add the line System.out.printf("%c", (char) c), the console shows: ÿþ© (followed by more characters that I am not able to copy and paste here). I read that the issue might be the Windows console character set, but I have not been able to figure out the fix.
Since the output medium could be anything in the future, I do not want to run into Unicode display issues with any type of output stream.
Can anyone please help me understand the issue and how I can fix it?

The Reader and Writer will use the platform default charset for converting between bytes and characters. In your environment that is apparently not a Unicode-compatible charset like UTF-8.
You need InputStreamReader and OutputStreamWriter, where you can specify the charset explicitly.
Reader read = new InputStreamReader(new FileInputStream(newFile), "UTF-8");
Writer write = new OutputStreamWriter(new FileOutputStream(newOutFile), "UTF-8");
// ...
Also, the console needs to be configured to use UTF-8 to display the characters. In Eclipse, for example, you can do that via Window > Preferences > General > Workspace > Text File Encoding.
In the command prompt console it's not possible to display those characters, due to the lack of a font that supports them. You may want to switch to a Swing-like UI console approach.
See also:
Unicode - How to get the characters right?
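For the console side, System.out can be wrapped in a PrintStream with an explicit encoding, so the bytes sent to the console no longer depend on the platform default. A minimal sketch, with a made-up class name; whether the glyphs actually render still depends on the console's code page and font:

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8Console {
    /** Wraps a stream so that printed text is encoded as UTF-8. */
    static PrintStream utf8(OutputStream target) throws UnsupportedEncodingException {
        // 'true' enables auto-flush on println.
        return new PrintStream(target, true, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        PrintStream out = utf8(System.out);
        // Copyright sign, O with stroke, H with stroke, written as escapes
        // so the source file's own encoding does not matter.
        out.println("\u00A9 \u00D8 \u0126");
    }
}
```

As a side note, the ÿþ at the start of the asker's console output corresponds to the bytes FF FE, which is the UTF-16 little-endian byte order mark, so the input file may well be UTF-16 rather than UTF-8. If so, the reader should be given "UTF-16" instead.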

Instead of FileWriter try using OutputStreamWriter and specify the encoding of the output.

Newlines in string not writing out to file

I'm trying to write a program that manipulates unicode strings read in from a file. I thought of two approaches - one where I read the whole file containing newlines in, perform a couple regex substitutions, and write it back out to another file; the other where I read in the file line by line and match individual lines and substitute on them and write them out. I haven't been able to test the first approach because the newlines in the string are not written as newlines to the file. Here is some example code to illustrate:
String output = "Hello\nthere!";
BufferedWriter oFile = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream("test.txt"), "UTF-16"));
System.out.println(output);
oFile.write(output);
oFile.close();
The print statement outputs
Hello
there!
but the file contents are
Hellothere!
Why aren't my newlines being written to file?

You should try using
System.getProperty("line.separator")
Here is an untested example
String output = String.format("Hello%sthere!",System.getProperty("line.separator"));
BufferedWriter oFile = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream("test.txt"), "UTF-16"));
System.out.println(output);
oFile.write(output);
oFile.close();
"I haven't been able to test the first approach because the newlines in the string are not written as newlines to the file"
Are you sure about that? Could you post some code that shows that specific fact?

Use System.getProperty("line.separator") to get the platform specific newline.

Consider using PrintWriters to get the println method known from e.g. System.out
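The PrintWriter suggestion can be sketched as follows. The class and method names are made up for illustration; println appends the platform line separator, so no "\n" has to be embedded in the strings.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class NewlineDemo {
    /** Writes each string on its own line, using the platform line separator. */
    static void writeLines(Path file, Charset cs, String... lines) throws IOException {
        try (PrintWriter pw = new PrintWriter(Files.newBufferedWriter(file, cs))) {
            for (String line : lines) {
                pw.println(line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("test", ".txt");
        writeLines(file, StandardCharsets.UTF_16, "Hello", "there!");
        // Reading back with the same charset yields the two lines again.
        System.out.println(Files.readAllLines(file, StandardCharsets.UTF_16));
    }
}
```

Keep in mind that PrintWriter does not throw on write errors; pw.checkError() can be called if that matters.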

Parse CSV file containing a Unicode character using OpenCSV

I'm trying to parse a .csv file with OpenCSV in NetBeans 6.0.1. My file contains some Unicode characters. When I write it to the output, the characters appear in another form, like (HJ1'-E/;). When I open this file in Notepad, it looks OK.
The code that I used:
CSVReader reader=new CSVReader(new FileReader("d:\\a.csv"),',','\'',1);
String[] line;
while((line=reader.readNext())!=null){
    StringBuilder stb=new StringBuilder(400);
    for(int i=0;i<line.length;i++){
        stb.append(line[i]);
        stb.append(";");
    }
    System.out.println( stb);
}

First you need to know what encoding your file is in, such as UTF-8 or UTF-16. What's generating this file to start with?
After that, it's relatively straightforward - you need to create a FileInputStream wrapped in an InputStreamReader instead of just a FileReader. (FileReader always uses the default encoding for the system.) Specify the encoding to use when you create the InputStreamReader, and if you've picked the right one, everything should start working.
Note that you don't need to use OpenCSV to check this - you could just read the text of the file yourself and print it all out. I'm not sure I'd trust System.out to be able to handle non-ASCII characters though - you may want to find a different way of examining strings, such as printing out the individual values of characters as integers (preferably in hex) and then comparing them with the charts at unicode.org. On the other hand, you could try the right encoding and see what happens to start with...
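The idea of inspecting characters as hex values, rather than trusting the console, might look like the sketch below. The class name and the dump method are made up for illustration, and it assumes Java 8 or later for codePoints().

```java
class CodePoints {
    /** Renders each code point as U+XXXX so it can be compared with the Unicode charts. */
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Only ASCII reaches the console, so the output is reliable
        // even when the console cannot display the characters themselves.
        System.out.println(dump("a\u00E9")); // U+0061 U+00E9
    }
}
```

If the dumped values match what the file is supposed to contain, the file is being decoded correctly and any remaining garbling happens on output.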
EDIT: Okay, so if you're using UTF-8:
CSVReader reader = new CSVReader(
        new InputStreamReader(new FileInputStream("d:\\a.csv"), "UTF-8"),
        ',', '\'', 1);
String[] line;
while ((line = reader.readNext()) != null) {
    StringBuilder stb = new StringBuilder(400);
    for (int i = 0; i < line.length; i++) {
        stb.append(line[i]);
        stb.append(";");
    }
    System.out.println(stb);
}
(I hope you have a try/finally block to close the file in your real code.)
