Java: supporting both UTF-8 and ANSI encoded text files

First, I'd like to explain our use case.
Users can upload text files through our website. Each file is stored in a folder and then read with Java.
Our problem: most users upload ANSI-encoded text files, but some use UTF-8.
When I read such a file in Java, it is not decoded correctly. For example, the word "Äpfel" is read as "?pfel".
I know I can set the encoding on my reader:
reader = new BufferedReader(new InputStreamReader(new FileInputStream(csvFile), "UTF-8"));
But how can I determine the correct encoding?
My idea is to read the file once and check whether it contains any unknown characters, as in "?pfel", but how can I tell that a character is not correct?
BufferedReader in = new BufferedReader(new FileReader(fi)); // FileReader uses the platform default charset
String row;
while ((row = in.readLine()) != null) {
    // ... how can I check whether row contains unknown chars?
}
Thanks for your help!
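One way to make the "read the file once and check" idea reliable is to attempt a strict UTF-8 decode of the raw bytes first: valid UTF-8 is rarely produced by accident, so if strict decoding succeeds, treat the file as UTF-8, otherwise fall back to an ANSI code page. A minimal sketch; the windows-1252 fallback is an assumption about what "ANSI" means for your users:

```java
import java.nio.ByteBuffer;
import java.nio.charset.*;

public class EncodingGuess {
    /** Decode bytes as UTF-8 if they form valid UTF-8, otherwise fall back to
        windows-1252 (assumed here to be what "ANSI" means for the uploads). */
    public static String decode(byte[] bytes) {
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)        // fail loudly instead of emitting '?'
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return utf8.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException notUtf8) {
            return new String(bytes, Charset.forName("windows-1252"));
        }
    }
}
```

You would read the upload with Files.readAllBytes(path) and hand the array to decode(); strict decoding (CodingErrorAction.REPORT) is what turns "silently produce ?pfel" into a catchable failure.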

Related

Unable to read euro symbol(€) and German characters like (ö, ß, ü, ä, Ä) from a .CSV file in java

Unable to read the euro symbol (€) and German characters like ö, ß, ü, ä, Ä from a CSV file in Java. The point here is that this behaviour is only seen when uploading a CSV file from Windows; when we upload the same file from a Linux machine, the characters come through successfully. We are using Linux servers, which default to the UTF-8 character set. We also tried changing the character set from UTF-8 to ISO_8859_1, but some character sets are not supported in the Linux environment.
Code Overview:
The code is basically a REST service which accepts a .csv file as multipart form data. Below is the sample code used to write the uploaded contents to the file system.
// Reading the file data from the multipart/form-data request
FormDataBodyPart fileDataBodyPart = multiPart.getField("fileContent");
InputStream fileInputStream = fileDataBodyPart.getValueAs(InputStream.class);
// Writing to a TEMP location
String line = null;
BufferedReader skipLine = new BufferedReader(new InputStreamReader(fileInputStream, StandardCharsets.ISO_8859_1));
OutputStreamWriter writer = new OutputStreamWriter(outputStream, StandardCharsets.ISO_8859_1);
while ((line = skipLine.readLine()) != null) {
    writer.write(line + "\n");
}
writer.flush();
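If the temp file is only a staging copy for later processing, the decode/re-encode round trip through ISO_8859_1 can be avoided altogether by copying raw bytes; that way the upload's original encoding, whatever it is, survives untouched. A sketch using InputStream.transferTo (Java 9+):

```java
import java.io.*;

public class RawCopy {
    /** Copy the upload byte-for-byte: no charset is applied, so no character can be corrupted. */
    public static long copy(InputStream in, OutputStream out) throws IOException {
        return in.transferTo(out);   // Java 9+; on Java 8, use a manual byte[] read/write loop
    }
}
```

The charset question then only has to be answered once, at the point where the bytes are finally interpreted as text.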

How to Handle String Encoding in JAVA(linux os)

I have a CSV file which contains many records, and I noticed that some of the records contain French characters. My script reads each record, processes it, and inserts the processed record into an XML file. When we view the .csv file in the VIM editor on a Fedora system, the French characters are displayed correctly, but after processing, these characters are no longer displayed properly. The same happens when such a record is printed on the console.
For eg.
String in .csv file : Crêpe Skirt
String in XML : Cr�pe Skirt
Code snippet for reading the file:
BufferedReader file = new BufferedReader(new FileReader(fileLocation)); // FileReader uses the platform default charset
String line = file.readLine();
Kindly suggest a way to handle such issue.
You need to know what encoding the file is in (probably UTF-8) and then when you open the file in Java specify the same encoding.
Try reading the file as UTF-8, and declare the encoding of your XML output as UTF-8 too:
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(your-file-path), "UTF-8"));
String line = "";
while ((line = reader.readLine()) != null) {
    // Do your work here
}
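The XML half of that advice means the declaration and the writer's charset have to agree. A small sketch of that, assuming the output is written with plain java.io; the file layout and the `<item>` element are hypothetical placeholders:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class XmlOut {
    /** Write a tiny XML document whose declaration matches the writer's actual charset.
        The <item> wrapper is a placeholder for whatever the real document looks like. */
    public static void write(File target, String value) throws IOException {
        try (Writer w = new OutputStreamWriter(new FileOutputStream(target), StandardCharsets.UTF_8)) {
            w.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            w.write("<item>" + value + "</item>\n");
        }
    }
}
```

A real implementation would also escape markup characters (&, <, >) in value; the point here is only that the encoding named in the declaration and the one used by the Writer must be the same.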

Character encoding via JDBC/ODBC/Microsoft Access

I'm successfully connecting to Microsoft Access via JDBC/ODBC. I then run a query to select rows from Microsoft Access and write the results to a TXT file. Everything works, but some strings include accents, and these appear as '?' in the TXT file. I have already tried various ways of writing files in Java (PrintWriter, FileWriter, OutputStream, and others), including passing a character-encoding parameter (UTF-8 or ISO-8859-1) to some of these methods. Any help with showing these characters correctly would be appreciated. Thanks.
Try the lines below:
String OUTPUTFILE = "PATH/TO/FILE/";
BufferedWriter bf = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(OUTPUTFILE), "UTF8"));
Once you add that to your code, you should be fine using bf.write("VALUE") to write UTF-8 characters to your file. Also make sure to set your text editor's encoding to Unicode or UTF-8; if you don't, it might seem like the whole process didn't work, which would lead to even more confusion.
Edited:
To read UTF-8 text files:
String INPUTFILE = "PATH/TO/File";
BufferedReader in = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(INPUTFILE), "UTF8"));
Then, to read a line: String str = in.readLine();

Displaying special characters

I am running into issues when displaying special characters on the Windows console.
I have written the following code:
public static void main(String[] args) throws IOException {
    File newFile = new File("sampleInput.txt");
    File newOutFile = new File("sampleOutput.txt");
    FileReader read = new FileReader(newFile);
    FileWriter write = new FileWriter(newOutFile);
    PushbackReader reader = new PushbackReader(read);
    int c;
    while ((c = reader.read()) != -1) {
        write.write(c);
    }
    read.close();
    write.close();
}
The output file looks exactly like the input file, special characters included: for the input © Ø ŻƩ abcdefĦ, the output file contains exactly the same text. But when I add the line System.out.printf("%c", (char) c), the console shows: ÿþ© (followed by more characters that I am not able to copy-paste here). I did read that the issue might be the Windows console character set, but I have not been able to figure out the fix.
Considering the output medium can be anything in future, I do not want to run into issues with Unicode character display for any type of out stream.
Can anyone please help me understand the issue and how can I fix the same ?
The Reader and Writer will use the platform default charset when transforming characters to bytes. In your environment, that's apparently not a Unicode-compatible charset like UTF-8.
You need InputStreamReader and OutputStreamWriter wherein you can explicitly specify the charset.
Reader read = new InputStreamReader(new FileInputStream(newFile), "UTF-8");
Writer write = new OutputStreamWriter(new FileOutputStream(newOutFile), "UTF-8");
// ...
Also, the console needs to be configured to use UTF-8 to display the characters. In for example Eclipse you can do that by Window > Preferences > General > Workspace > Text File Encoding.
In the command prompt console, it may not be possible to display those characters due to the lack of a font that supports them; you may want to look at a Swing-based UI console instead.
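For the Windows command prompt specifically, a commonly suggested workaround, offered here as a sketch rather than a guaranteed fix, is to switch the console code page with `chcp 65001` and then make `System.out` encode UTF-8 explicitly instead of using the platform default:

```java
import java.io.*;

public class Utf8Console {
    /** Wrap a stream so print()/println() emit UTF-8 bytes regardless of the platform default charset. */
    public static PrintStream utf8(OutputStream out) throws UnsupportedEncodingException {
        return new PrintStream(out, true, "UTF-8");   // autoflush on, explicit charset
    }
}
```

It would be installed with System.setOut(Utf8Console.utf8(new FileOutputStream(FileDescriptor.out))); whether the glyphs then actually render still depends on the console font.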
See also:
Unicode - How to get the characters right?
Instead of FileWriter try using OutputStreamWriter and specify the encoding of the output.

Parse CSV file containing a Unicode character using OpenCSV

I'm trying to parse a .csv file with OpenCSV in NetBeans 6.0.1. My file contains some Unicode characters. When I write it to the output, the characters appear in another form, like (HJ1'-E/;). But when I open the file in Notepad, it looks OK.
The code that I used:
CSVReader reader = new CSVReader(new FileReader("d:\\a.csv"), ',', '\'', 1);
String[] line;
while ((line = reader.readNext()) != null) {
    StringBuilder stb = new StringBuilder(400);
    for (int i = 0; i < line.length; i++) {
        stb.append(line[i]);
        stb.append(";");
    }
    System.out.println(stb);
}
First you need to know what encoding your file is in, such as UTF-8 or UTF-16. What's generating this file to start with?
After that, it's relatively straightforward - you need to create a FileInputStream wrapped in an InputStreamReader instead of just a FileReader. (FileReader always uses the default encoding for the system.) Specify the encoding to use when you create the InputStreamReader, and if you've picked the right one, everything should start working.
Note that you don't need to use OpenCSV to check this - you could just read the text of the file yourself and print it all out. I'm not sure I'd trust System.out to be able to handle non-ASCII characters though - you may want to find a different way of examining strings, such as printing out the individual values of characters as integers (preferably in hex) and then comparing them with the charts at unicode.org. On the other hand, you could try the right encoding and see what happens to start with...
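The "print the individual values of characters as integers, preferably in hex" diagnostic suggested above could be as small as this sketch:

```java
public class HexDump {
    /** Render each char of a string as U+XXXX, for comparison against the charts at unicode.org. */
    public static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }
}
```

Comparing the code units it prints against what you expect the file to contain tells you whether the damage happened on read (wrong decoder) or on write (console/terminal).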
EDIT: Okay, so if you're using UTF-8:
CSVReader reader=new CSVReader(
new InputStreamReader(new FileInputStream("d:\\a.csv"), "UTF-8"),
',', '\'', 1);
String[] line;
while ((line = reader.readNext()) != null) {
StringBuilder stb = new StringBuilder(400);
for (int i = 0; i < line.length; i++) {
stb.append(line[i]);
stb.append(";");
}
System.out.println(stb);
}
(I hope you have a try/finally block to close the file in your real code.)
