Parse CSV file containing a Unicode character using OpenCSV - java

I'm trying to parse a .csv file with OpenCSV in NetBeans 6.0.1. My file contains some Unicode characters. When I write them to the output, the characters appear in another form, like (HJ1'-E/;). When I open the file in Notepad, it looks OK.
The code that I used:
CSVReader reader=new CSVReader(new FileReader("d:\\a.csv"),',','\'',1);
String[] line;
while((line=reader.readNext())!=null){
StringBuilder stb=new StringBuilder(400);
for(int i=0;i<line.length;i++){
stb.append(line[i]);
stb.append(";");
}
System.out.println( stb);
}

First you need to know what encoding your file is in, such as UTF-8 or UTF-16. What's generating this file to start with?
After that, it's relatively straightforward - you need to create a FileInputStream wrapped in an InputStreamReader instead of just a FileReader. (FileReader always uses the default encoding for the system.) Specify the encoding to use when you create the InputStreamReader, and if you've picked the right one, everything should start working.
Note that you don't need to use OpenCSV to check this - you could just read the text of the file yourself and print it all out. I'm not sure I'd trust System.out to be able to handle non-ASCII characters though - you may want to find a different way of examining strings, such as printing out the individual values of characters as integers (preferably in hex) and then comparing them with the charts at unicode.org. On the other hand, you could try the right encoding and see what happens to start with...
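A minimal sketch of the "print the characters as hex" idea, assuming a hypothetical helper named toCodePoints (it is not part of OpenCSV or of the original code). Only ASCII text reaches the console, so the console's own encoding cannot distort what you see:

```java
class CodePointDump {
    // Returns each code point of the text as U+XXXX, separated by spaces.
    static String toCodePoints(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            // Advance by 1 or 2 chars, depending on whether cp is a surrogate pair.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: U+0041 U+00E9 U+0633
        System.out.println(toCodePoints("A\u00e9\u0633"));
    }
}
```

Calling toCodePoints(line[i]) on each parsed field and comparing the values with the charts at unicode.org shows whether the file was decoded correctly, independently of how the console renders text.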
EDIT: Okay, so if you're using UTF-8:
CSVReader reader=new CSVReader(
new InputStreamReader(new FileInputStream("d:\\a.csv"), "UTF-8"),
',', '\'', 1);
String[] line;
while ((line = reader.readNext()) != null) {
StringBuilder stb = new StringBuilder(400);
for (int i = 0; i < line.length; i++) {
stb.append(line[i]);
stb.append(";");
}
System.out.println(stb);
}
(I hope you have a try/finally block to close the file in your real code.)
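The try/finally shape mentioned above might look roughly like this. It is a sketch that uses a plain BufferedReader rather than CSVReader so that it stands alone; the class and method names (ReadWithFinally, readFirstLine) are made up for illustration:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

class ReadWithFinally {
    // Reads the first line of the file using the given encoding and
    // always closes the underlying stream, even if reading fails.
    static String readFirstLine(File file, String encoding) throws IOException {
        InputStream in = new FileInputStream(file);
        try {
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(in, encoding));
            return reader.readLine();
        } finally {
            in.close();
        }
    }
}
```

The same pattern applies when the InputStreamReader is handed to CSVReader instead: open the stream, do the work inside try, close in finally.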

Related

java: text file encoding support utf8 and ansi

First I would like to explain our use case.
Users can upload text files through our website. Each file is stored in a folder and then read using Java.
Our problem: most users upload ANSI-encoded text files, but some use UTF-8 encoding.
When I read the text file in Java, it is not read correctly. For example, the word "Äpfel" is read as "?pfel".
I know I can specify the encoding in my reader:
reader = new BufferedReader(new InputStreamReader(new FileInputStream(csvFile), "UTF-8"));
But how can I determine the correct encoding?
My idea is to read the file once and check whether it contains any unknown characters, as in "?pfel". But how can I check that a character is not correct?
BufferedReader in = new BufferedReader(new FileReader( fi ));
while ( in.ready() ) {
String row = in.readLine();
...
// How can I check whether row contains unknown characters?
}
Thanks for your help!
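One possible approach, sketched under assumptions: once "Äpfel" has been decoded to "?pfel" the original information is gone, so the check has to happen on the raw bytes. A strict UTF-8 decode either succeeds or reports malformed input; if it fails, the file is assumed to be in the ANSI code page, which is passed in as a fallback (often windows-1252 on Western European Windows systems, but that is an assumption about your users). The names EncodingGuess, isValidUtf8 and decode are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EncodingGuess {
    // True if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Decodes as UTF-8 when the bytes are valid UTF-8, otherwise
    // with the caller-supplied fallback charset (an assumption).
    static String decode(byte[] bytes, Charset fallback) {
        Charset charset = isValidUtf8(bytes) ? StandardCharsets.UTF_8 : fallback;
        return new String(bytes, charset);
    }
}
```

This is a heuristic, not a guarantee: pure ASCII content is valid in both encodings (so the choice does not matter there), and an ANSI file could in rare cases happen to form valid UTF-8 sequences.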

Comments in text file being read for Java

When writing a Java program that reads a file, is there a way to let the USER make the program completely ignore certain lines or characters by marking them as a kind of 'comment', just as a programmer can use '//' or '/* */' while programming?
You can skip the lines that start with a certain pattern (e.g. '//' or '#').
There's an example on how to read a file in Java line by line here: http://www.roseindia.net/Java/beginners/java-read-file-line-by-line.shtml
You can change the while-loop like this:
while ((strLine = br.readLine()) != null) {
if (!strLine.startsWith("#"))
System.out.println (strLine);
}
In this example lines starting with '#' will not get printed.
If it is a properties file format and you are using Properties to read it, you can use the pound character. Otherwise, no, you will need to implement something yourself.
my.prop1=val1
#some comment
my.prop2=val2
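A small sketch of the Properties case, loading the three lines above from a string for brevity (the class name PropertiesComments and the parse helper are illustrative only). Lines whose first non-blank character is '#' are skipped by Properties.load:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

class PropertiesComments {
    // Parses properties-format text; comment lines are ignored by load().
    static Properties parse(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    public static void main(String[] args) throws IOException {
        Properties props = parse("my.prop1=val1\n#some comment\nmy.prop2=val2\n");
        // Only the two real entries are present; the comment is not a key.
        System.out.println(props.size());
        System.out.println(props.getProperty("my.prop2"));
    }
}
```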
Sure. If the program uses code like this to read lines from the file:
BufferedReader reader =
new BufferedReader (
new InputStreamReader (new FileInputStream (inputFile)));
String line;
while ((line = reader.readLine ()) != null)
{
if (line.startsWith ("#") || line.trim ().isEmpty ())
continue; // Ignore the line
// Process the line
}
The user can then mark lines as comments by prefixing them with the pound ('#') character.
This depends on what you want to accomplish.
If you simply want to transform the file, you can use a BufferedReader to read it line by line, check whether each line meets some criterion,
and take the necessary action,
for example if (line.startsWith ("#") || line.trim ().isEmpty ()). Otherwise it is better to use standard libraries for operations on the file, such as compressing it or converting it to a different format.

Character encoding via JDBC/ODBC/Microsoft Access

I connect to Microsoft Access via JDBC/ODBC successfully. I then run a query to select rows from Microsoft Access and write the results to a TXT file. Everything works, except that strings containing accented characters appear as '?' in the TXT file. I have already tried several ways of writing files in Java (PrintWriter, FileWriter, OutputStream, and others), including passing a character encoding parameter (UTF-8 or ISO-8859-1) to some of them. How can I get these characters to appear correctly? Thanks.
Try the below line,
String OUTPUTFILE = "PATH/TO/FILE/";
BufferedWriter bf = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream(OUTPUTFILE),"UTF8"));
Once you add that to your code you should be fine using bf.write("VALUE") to write UTF-8 characters to your file. Also make sure your text editor's encoding is set to Unicode or UTF-8; if it isn't, it might look as if the whole process didn't work, which would lead to even more confusion.
Edited:
To read UTF-8 text files:
String INPUTFILE = "PATH/TO/File";
BufferedReader in = new BufferedReader(
new InputStreamReader(
new FileInputStream(INPUTFILE), "UTF8"));
Then read a line with String str = in.readLine();
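To illustrate where a '?' can come from on the writing side (whether this is the cause in the JDBC/ODBC case above is an assumption; the data may already be damaged before it reaches Java): when a string is encoded with a charset that cannot represent a character, String.getBytes substitutes the charset's replacement byte, which for US-ASCII is '?'. The class and method names here are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class QuestionMarkDemo {
    // Encodes the text to bytes and decodes it again with the same charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9";
        // US-ASCII has no byte for U+00E9, so it is replaced by '?'.
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII).equals("caf?"));
        // UTF-8 can represent every character, so the text survives.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text));
    }
}
```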

Reading UTF-8 file and writing plain ANSI?

I have a UTF-8 file (it's a CSV).
I need to read this file line by line, do some replacements, and then write it line by line into another file.
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream(fileFix), "ASCII")
);
bw.write(""); //clean current file
BufferedReader br = new BufferedReader(new InputStreamReader(
new FileInputStream(file),"UTF-8")
);
String line;
while ((line = br.readLine()) != null) {
line = line.replace(";", ",");
bw.append(line + "\n");
}
Simple as that.
The problem is that the output file (fileFix) is UTF-8, and I think it contains the BOM character.
How can I write the file as plain ANSI without the BOM?
The error I am getting while reading my file with a software (weka)
The first line of this file:
Note that Notepad++ tells me the charset is UTF-8. If I try to convert this file to plain ASCII (with Windows Notepad), those characters disappear.
Solution
When you are on the first line run:
line = line.substring(1);
To remove any BOM char.
It sounds like this is a BOM issue rather than an encoding issue as such.
You can just remove any BOM characters as you write the file, with:
line = line.replace("\ufeff", "");
That leaves the question of whether you're reading the data accurately in the first place... I'd strongly advise you not to use FileWriter and FileReader at all - instead, use InputStreamReader and OutputStreamWriter, specifying the encoding explicitly for both of them. Set the reader encoding to UTF-8 (assuming the input file really is UTF-8), and set the writer encoding to whatever you want... but I'd recommend sticking with UTF-8, to be honest.
Also note that you should be closing your reader/writer in finally blocks, or using the try-with-resources statement if you're using Java 7.
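Putting these suggestions together, a sketch of the copy loop might look like the following. It assumes Java 7 or later for try-with-resources, writes UTF-8 as recommended above, and uses a hypothetical class and method name (BomStrippingCopy, copyWithoutBom). Java's UTF-8 decoder does not drop the BOM by itself; it arrives as the character U+FEFF at the start of the first line.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

class BomStrippingCopy {
    // Copies a UTF-8 CSV, dropping a leading BOM and replacing ';' with ','.
    static void copyWithoutBom(File in, File out) throws IOException {
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                     new FileInputStream(in), StandardCharsets.UTF_8));
             BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream(out), StandardCharsets.UTF_8))) {
            String line;
            boolean first = true;
            while ((line = br.readLine()) != null) {
                // Only the very first line can start with the BOM.
                if (first && line.startsWith("\uFEFF")) {
                    line = line.substring(1);
                }
                first = false;
                bw.write(line.replace(";", ","));
                bw.write("\n");
            }
        } // Both streams are closed here, even if an exception is thrown.
    }
}
```

Checking startsWith before calling substring(1) avoids cutting off a real first character when the input has no BOM.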
Look at http://en.wikipedia.org/wiki/Byte_order_mark for the pattern to replace, looks like EF BB BF rather than FE FF
This solution is wrong; check Jon's answer instead.

Displaying special characters

I am running into issues when displaying special characters on the Windows console.
I have written the following code:
public static void main(String[] args) throws IOException {
File newFile = new File("sampleInput.txt");
File newOutFile = new File("sampleOutput.txt");
FileReader read = new FileReader(newFile);
FileWriter write = new FileWriter(newOutFile);
PushbackReader reader = new PushbackReader(read);
int c;
while ((c = reader.read()) != -1)
{
write.write(c);
}
read.close();
write.close();
}
The output file looks exactly like the input file, special characters included; i.e. for the input contents © Ø ŻƩ abcdefĦ, the output file contains exactly the same contents. But when I add the line System.out.printf("%c", (char) c), the console shows: ÿþ© (followed by more characters that I am not able to copy and paste here). I have read that the issue might be the Windows console character set, but I have not been able to figure out the fix.
Considering the output medium can be anything in future, I do not want to run into issues with Unicode character display for any type of out stream.
Can anyone please help me understand the issue and how can I fix the same ?
The Reader and Writer will use the platform default charset for transforming characters to bytes. In your environment that's apparently not a Unicode-compatible charset like UTF-8.
You need InputStreamReader and OutputStreamWriter wherein you can explicitly specify the charset.
Reader read = new InputStreamReader(new FileInputStream(newFile), "UTF-8");
Writer write = new OutputStreamWriter(new FileOutputStream(newOutFile), "UTF-8");
// ...
Also, the console needs to be configured to use UTF-8 to display the characters. In Eclipse, for example, you can do that via Window > Preferences > General > Workspace > Text File Encoding.
In the command prompt console it's not possible to display those characters, because it lacks a font that supports them. You may want to use a Swing-based UI console instead.
See also:
Unicode - How to get the characters right?
Instead of FileWriter try using OutputStreamWriter and specify the encoding of the output.
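For the "any type of out stream" concern, one option is to wrap the target stream in a PrintStream with an explicit encoding rather than relying on the platform default. This is a sketch with hypothetical names (Utf8Console, utf8Stream); whether the characters then render correctly still depends on the console's code page and font, as noted above.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8Console {
    // Wraps any OutputStream so that text is written as UTF-8 bytes.
    static PrintStream utf8Stream(OutputStream target)
            throws UnsupportedEncodingException {
        return new PrintStream(target, true, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        PrintStream out = utf8Stream(System.out);
        // Written as UTF-8 regardless of the platform default charset.
        out.println("\u00a9 \u00d8 abcdef\u0126");
    }
}
```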
