Displaying special characters - java

I am running into issues when displaying special characters on the Windows console.
I have written the following code:
public static void main(String[] args) throws IOException {
    File newFile = new File("sampleInput.txt");
    File newOutFile = new File("sampleOutput.txt");
    FileReader read = new FileReader(newFile);
    FileWriter write = new FileWriter(newOutFile);
    PushbackReader reader = new PushbackReader(read);
    int c;
    while ((c = reader.read()) != -1) {
        write.write(c);
    }
    read.close();
    write.close();
}
The output file looks exactly like the input file, special characters included: for the input file contents © Ø ŻƩ abcdefĦ, the output file contains exactly the same text. But when I add the line System.out.printf("%c", (char) c), the console shows: ÿþ© (followed by more characters that I cannot copy and paste here). I did read that the issue might be the Windows console character set, but I have not been able to figure out the fix.
Since the output medium could be anything in the future, I do not want to run into Unicode display issues for any type of output stream.
Can anyone please help me understand the issue and how to fix it?

The Reader and Writer will use the platform default charset for transforming characters to bytes. In your environment that is apparently not a Unicode-compatible charset like UTF-8.
You need InputStreamReader and OutputStreamWriter, in which you can explicitly specify the charset.
Reader read = new InputStreamReader(new FileInputStream(newFile), "UTF-8");
Writer write = new OutputStreamWriter(new FileOutputStream(newOutFile), "UTF-8");
// ...
Also, the console needs to be configured to use UTF-8 to display the characters. In Eclipse, for example, you can do that via Window > Preferences > General > Workspace > Text File Encoding.
In the command prompt console it is often impossible to display those characters, because the console font does not support them. You would need to move to a Swing-like UI console approach instead.
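If you also want to control which bytes the program itself sends to the console (or any future output stream), one option, sketched below and reusing c from the question's loop, is to wrap System.out in a PrintStream with an explicit charset:
// A minimal sketch: emit UTF-8 bytes regardless of the platform default.
// Whether the console renders them correctly still depends on its code page and font.
PrintStream out = new PrintStream(System.out, true, "UTF-8");
out.printf("%c", (char) c);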
See also:
Unicode - How to get the characters right?

Instead of FileWriter try using OutputStreamWriter and specify the encoding of the output.
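A minimal sketch of that suggestion, reusing newOutFile from the question:
Writer write = new OutputStreamWriter(new FileOutputStream(newOutFile), "UTF-8");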

Related

Writing to Buffered Writer UTF-8 Characters With Accents Are Coming Out Garbled

I am reading from a UTF-8 input file with accented characters, reading the lines and writing them back to a different file (also UTF-8), but the accented characters are coming out garbled in the output. For instance, the following words:
León
Mānoa
are output as:
Le�n
Manoa
I've looked at about 100 answers to this question, which all suggest reading and writing the files as the code below indicates, but I keep getting the same result.
I've broken down the code to the elemental features below:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

public class UTF8EncoderTest
{
    public static void main(String[] args)
    {
        try
        {
            BufferedReader inputFileReader = new BufferedReader(new InputStreamReader(new FileInputStream("utf8TestInput.txt"), "UTF-8"));
            BufferedWriter outputFileWriter = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("utf8TestOutput.txt"), "UTF-8"));
            String line = inputFileReader.readLine();
            while (line != null)
            {
                outputFileWriter.write(line + "\r\n");
                line = inputFileReader.readLine();
            }
            inputFileReader.close();
            outputFileWriter.close();
            System.out.println("Finished!");
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
But this still results in garbled characters in the output file. Any help would be appreciated!
Try this:
String sText = "This león and this is Mānoa";
File oFile = new File(getExternalFilesDir("YourFolder"), "YourFile.txt"); // getExternalFilesDir is Android-specific
try {
    FileOutputStream oFileOutputStream = new FileOutputStream(oFile, true); // append
    // Note: ISO-8859-1 cannot represent the 'ā' in "Mānoa"; UTF-8 covers both examples.
    OutputStreamWriter writer = new OutputStreamWriter(oFileOutputStream, StandardCharsets.UTF_8);
    writer.append(sText);
    writer.close();
} catch (IOException e) {
    e.printStackTrace(); // don't swallow the exception silently
}
I tried your code with your examples and it works without problems (characters are not changed or lost).
A few tips when dealing with charsets in Java:
The default character encoding in Java is the character encoding used by the JVM.
By default, the JVM uses the platform encoding, i.e. the character encoding of your server's OS.
Java obtains the character encoding by calling System.getProperty("file.encoding", "UTF-8") at JVM start-up, so if no file.encoding attribute is set, it uses UTF-8. The most important point to remember is that Java caches the character encoding (the value of the system property file.encoding) in most of the core classes that need it after the JVM has started, such as InputStreamReader. So if you change the system property file.encoding programmatically while the application is running, you will not see the desired effect. That is why you should always work with a character encoding provided explicitly to your application, and if it needs to be set, set it when you start the JVM.
How to get default character encoding?
The easiest way is to call System.getProperty("file.encoding"); it returns the default character encoding, whether that came from the platform or from a -Dfile.encoding property at start-up, as long as the program has not overridden it with System.setProperty("file.encoding", someEncoding).
java.nio.Charset provides a convenient static method Charset.defaultCharset() which returns default character encoding.
By using InputStreamReader#getEncoding().
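All three together, as a small sketch (the class name DefaultCharsetDemo is just a placeholder):
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        // 1. the system property (reflects -Dfile.encoding or the platform default)
        System.out.println(System.getProperty("file.encoding"));
        // 2. the charset the JVM resolved at start-up
        System.out.println(Charset.defaultCharset());
        // 3. the encoding an InputStreamReader picks up when none is specified
        System.out.println(new InputStreamReader(System.in).getEncoding());
    }
}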
How to set default character encoding?
By providing the file.encoding system property when the JVM starts, e.g.:
java -Dfile.encoding="UTF-8" HelloWorld
If you don't have control over how the JVM starts up, you can set the environment variable JAVA_TOOL_OPTIONS to -Dfile.encoding="UTF-16" or any other character encoding, and it will be picked up when the JVM starts on your Windows machine. The JVM will also print Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-16 on the console to indicate that it has picked up JAVA_TOOL_OPTIONS.
Alternatively, you can try:
Path inputFilePath = Paths.get("utf8TestInput.txt");
BufferedReader inputFileReader = Files.newBufferedReader(inputFilePath, StandardCharsets.UTF_8);
Path outputFilePath = Paths.get("utf8TestOutput.txt");
BufferedWriter outputFileWriter = Files.newBufferedWriter(outputFilePath, StandardCharsets.UTF_8);

Character encoding via JDBC/ODBC/Microsoft Access

I'm connecting to Microsoft Access via JDBC/ODBC successfully. I then run a query to select rows from Access and write the results to a TXT file. Everything works, but some strings include accents, and these appear as '?' in the TXT file. I have already tried various ways of writing files in Java, such as PrintWriter, FileWriter, OutputStream, and others, including passing a character encoding parameter (UTF-8 or ISO-8859-1) to some of these methods. Any help on showing these characters the right way would be appreciated. Thanks.
Try the lines below:
String OUTPUTFILE = "PATH/TO/FILE/";
BufferedWriter bf = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(OUTPUTFILE), "UTF8"));
Once you add that to your code you should be fine using bf.write("VALUE") to write UTF-8 characters to your file. Also, make sure to set your text editor's encoding to Unicode or UTF-8; if you don't, it might seem like the whole process didn't work, which would lead to even more confusion.
Edited:
To read UTF-8 text files:
String INPUTFILE = "PATH/TO/File";
BufferedReader in = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(INPUTFILE), "UTF8"));
Then read a line with String str = in.readLine();

Reading UTF-8 file and writing plain ANSI?

I have a UTF-8 file (it's a CSV).
I need to read this file line by line, do some replacements, and then write it line by line into another file.
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(fileFix), "ASCII"));
bw.write(""); // clean current file
BufferedReader br = new BufferedReader(new InputStreamReader(
        new FileInputStream(file), "UTF-8"));
String line;
while ((line = br.readLine()) != null) {
    line = line.replace(";", ",");
    bw.append(line + "\n");
}
Simple as that.
The problem is that the output file (fileFix) is UTF-8 and I think it has the BOM character.
How can I write the file as plain ANSI without the BOM?
The error I get while reading my file with another program (Weka), and the first line of the file, were shown in screenshots (not reproduced here).
Consider that Notepad++ tells me the charset is UTF-8. If I try to convert this file to plain ASCII (with Windows Notepad), those characters disappear.
Solution
When you are on the first line, run:
line = line.substring(1);
to remove any BOM char.
It sounds like this is a BOM issue rather than an encoding issue as such.
You can just remove any BOM characters as you write the file, with:
line = line.replace("\ufeff", "");
That leaves the question of whether you're reading the data accurately in the first place... I'd strongly advise you not to use FileWriter and FileReader at all - instead, use InputStreamReader and OutputStreamWriter, specifying the encoding explicitly for both of them. Set the reader encoding to UTF-8 (assuming the input file really is UTF-8), and set the writer encoding to whatever you want... but I'd recommend sticking with UTF-8, to be honest.
Also note that you should be closing your reader/writer in finally blocks, or using the try-with-resources statement if you're using Java 7.
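A minimal sketch of that advice, combining try-with-resources (Java 7+), explicit charsets, and the BOM removal from above, reusing the file and fileFix variables from the question:
try (BufferedReader br = new BufferedReader(
         new InputStreamReader(new FileInputStream(file), "UTF-8"));
     BufferedWriter bw = new BufferedWriter(
         new OutputStreamWriter(new FileOutputStream(fileFix), "UTF-8"))) {
    String line;
    while ((line = br.readLine()) != null) {
        // strip any BOM, then do the replacement from the question
        line = line.replace("\ufeff", "").replace(";", ",");
        bw.write(line);
        bw.newLine(); // platform line separator; use write("\n") for a fixed one
    }
}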
Look at http://en.wikipedia.org/wiki/Byte_order_mark for the pattern to replace; for UTF-8 it looks like EF BB BF rather than FE FF.
This solution is wrong; check Jon's answer instead.

How can I read a Russian file in Java?

I tried adding UTF-8 for this but it didn't work out. What should I do to read a Russian file in Java?
FileInputStream fstream1 = new FileInputStream("russian.txt");
DataInputStream in = new DataInputStream(fstream1);
BufferedReader br = new BufferedReader(new InputStreamReader(in, "UTF-8"));
If the file is from Windows PC, try either "windows-1251" or "Cp1251" for the charset name.
If the file is somehow in the MS-DOS encoding, try using "Cp866".
Both of these are single-byte encodings; switching the charset name to UTF-8 (which is multibyte) does nothing if the file is not actually UTF-8.
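For example, a sketch assuming the file turns out to be windows-1251:
BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream("russian.txt"), "windows-1251"));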
If all else fails, use a hex editor and dump a few hex lines of the file into your question. Then we'll detect the encoding.
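A quick sketch for producing such a hex dump of the first bytes (the limit of 32 bytes is arbitrary):
try (FileInputStream in = new FileInputStream("russian.txt")) {
    int b, count = 0;
    while ((b = in.read()) != -1 && count++ < 32) {
        System.out.printf("%02X ", b); // print each byte as two hex digits
    }
}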
As others mentioned, you need to know how the file is encoded. A simple check is to (ab)use Firefox as an encoding detector: answer to similar question
If this is a display problem, it depends on what you mean by "reads": in the console, in some window? See also How can I make a String with cyrillic characters display correctly?

Parse CSV file containing a Unicode character using OpenCSV

I'm trying to parse a .csv file with OpenCSV in NetBeans 6.0.1. My file contains some Unicode characters. When I write it to output, the characters appear in another form, like (HJ1'-E/;). But when I open the file in Notepad, it looks OK.
The code that I used:
CSVReader reader = new CSVReader(new FileReader("d:\\a.csv"), ',', '\'', 1);
String[] line;
while ((line = reader.readNext()) != null) {
    StringBuilder stb = new StringBuilder(400);
    for (int i = 0; i < line.length; i++) {
        stb.append(line[i]);
        stb.append(";");
    }
    System.out.println(stb);
}
First you need to know what encoding your file is in, such as UTF-8 or UTF-16. What's generating this file to start with?
After that, it's relatively straightforward - you need to create a FileInputStream wrapped in an InputStreamReader instead of just a FileReader. (FileReader always uses the default encoding for the system.) Specify the encoding to use when you create the InputStreamReader, and if you've picked the right one, everything should start working.
Note that you don't need to use OpenCSV to check this - you could just read the text of the file yourself and print it all out. I'm not sure I'd trust System.out to be able to handle non-ASCII characters though - you may want to find a different way of examining strings, such as printing out the individual values of characters as integers (preferably in hex) and then comparing them with the charts at unicode.org. On the other hand, you could try the right encoding and see what happens to start with...
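For example, a sketch of that diagnostic, assuming the line to inspect has been read into a String s:
for (int i = 0; i < s.length(); i++) {
    // print each char as a hex code unit, for comparison with the unicode.org charts
    System.out.printf("U+%04X ", (int) s.charAt(i));
}
System.out.println();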
EDIT: Okay, so if you're using UTF-8:
CSVReader reader = new CSVReader(
        new InputStreamReader(new FileInputStream("d:\\a.csv"), "UTF-8"),
        ',', '\'', 1);
String[] line;
while ((line = reader.readNext()) != null) {
StringBuilder stb = new StringBuilder(400);
for (int i = 0; i < line.length; i++) {
stb.append(line[i]);
stb.append(";");
}
System.out.println(stb);
}
(I hope you have a try/finally block to close the file in your real code.)
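If you are on Java 7+ and your OpenCSV version implements Closeable (recent ones do), a try-with-resources sketch of the same code takes care of that:
try (CSVReader reader = new CSVReader(
        new InputStreamReader(new FileInputStream("d:\\a.csv"), "UTF-8"),
        ',', '\'', 1)) {
    String[] line;
    while ((line = reader.readNext()) != null) {
        // ... same StringBuilder loop as above ...
    }
}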
