How to convert strange character from web page? - java

On the web page, the text reads "Why don't we".
But when I parse the web page and save it to a text file, it shows up like this in Eclipse:
Why don鈥檛 we
More information about my implementation:
The web page is encoded as UTF-8.
I use jsoup to parse it, and the result is saved as a .txt file.
I use FileWriter f = new FileWriter() to write to the file.
UPDATE:
I actually solved the display problem in Eclipse by changing Eclipse's encoding to UTF-8.

FileWriter is a utility class that uses the platform's default encoding. That is non-portable and probably incorrect here. Specify the encoding explicitly instead:
BufferedWriter f = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream(file), StandardCharsets.UTF_8));
f.write("\uFEFF"); // Optional (redundant) BOM character, written to make sure
// the text is read as UTF-8
...
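To see where text like "don鈥檛" comes from, here is a minimal sketch of the mechanism: correct UTF-8 bytes are decoded with a different charset. The class and method names are made up for illustration, and the choice of GBK as the viewer's charset is an assumption based on the look of the garbled text, not something the question states.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeSketch {
    // Encode text as UTF-8, then decode those bytes with a different charset,
    // which is what a viewer does when it assumes the wrong encoding.
    public static String misread(String text, Charset wrongCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, wrongCharset);
    }

    public static void main(String[] args) {
        String original = "Why don\u2019t we"; // \u2019 is the curly apostrophe

        // Assumption: the viewer decoded the file as GBK. GBK is not guaranteed
        // to exist on every JVM, so fall back to ISO-8859-1 if it is missing.
        Charset wrong = Charset.isSupported("GBK")
                ? Charset.forName("GBK")
                : StandardCharsets.ISO_8859_1;

        System.out.println(misread(original, wrong));
    }
}
```

The bytes in the file are fine in this scenario. Only the reader's choice of charset is wrong, which is consistent with the fix reported in the question's update.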

Related

How to save text file with UTF-8 encoding in java?

I am facing a problem saving a text file in UTF-8 format using Java. When I click Save As for the generated text file, it shows ANSI as the encoding, not UTF-8. Below is the code I use to create the file:
String excelFile="";
excelFile = getFullFilePath(fileName);
File file = new File(excelFile + fileName,"UTF-8");
File output = new File(excelFile,"UTF-8");
FileUtils.writeStringToFile(file, content, "UTF-8");
While creating the text file, I am using UTF-8 encoding, but the file still shows the encoding as ANSI while saving.
Kindly help.
Instead of using File, create a FileOutputStream.
Try this.
Writer out = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream("outfilename"), "UTF-8"));
try {
out.write(aString);
} finally {
out.close();
}
I had a very similar problem, and I solved it by saving the original file with UTF-8 encoding. In this case, open the Excel file and save it again, or create a new file and copy the content into it, and make sure its encoding is UTF-8. This link has a tutorial on how to save an Excel file with UTF-8 encoding: https://help.surveygizmo.com/help/encode-an-excel-file-to-utf-8-or-utf-16. Apart from that, your code seems to be correct.

can not save utf8 file in windows server with java

I have a simple Java application that saves some strings in UTF-8 encoding.
But when I open the file in Notepad and choose Save As, it shows the encoding as ANSI. I don't know where the problem is.
The code that saves the file is:
File fileDir = new File("c:\\Sample.txt");
Writer out = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream(fileDir), "UTF8"));
out.append("kodehelp UTF-8").append("\r\n");
out.append("??? UTF-8").append("\r\n");
out.append("???? UTF-8").append("\r\n");
out.flush();
out.close();
The characters you are writing to the file, as they appear in the code snippet, are all in the basic ASCII subset of UTF-8. Notepad is likely auto-detecting the format and, seeing nothing outside the ASCII range, decides the file is ANSI.
If you want to force a different decision, include characters such as 字 or õ, which are well outside the ASCII range.
It is possible that the ??? strings in your example were intended to be non-ASCII text. If so, make sure your IDE and/or build tool treats the source files as UTF-8, and that the files are indeed UTF-8 encoded. If you provide more information about your build system, then we can help further.
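A small sketch of why an editor cannot tell the difference: for ASCII-only text, the UTF-8 bytes and the US-ASCII bytes are identical, so the file carries no evidence of being UTF-8. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiSubsetSketch {
    // True when the text encodes to the same bytes in UTF-8 and in US-ASCII,
    // i.e. when a reader has no byte-level evidence that the file is UTF-8.
    public static boolean looksTheSameAsAscii(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return Arrays.equals(utf8, ascii);
    }

    public static void main(String[] args) {
        System.out.println(looksTheSameAsAscii("kodehelp UTF-8")); // prints true
        System.out.println(looksTheSameAsAscii("\u5B57 UTF-8"));   // prints false (\u5B57 is 字)
    }
}
```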

How can I write UTF-8 chars on java application?

I want to write
ısı
to a CSV file from a Java application in NetBeans. It works fine when I debug the code. But when I clean and build the project and run the .jar, the CSV file contains
?s?
How can I solve this?
Thanks in advance.
EDIT
I use this to write:
PrintWriter csvWriter = new PrintWriter(new File("myfile.csv")) ;
csvWriter.println("ısı") ;
With this code:
PrintWriter csvWriter = new PrintWriter(new File("myfile.csv")) ;
csvWriter.println("ısı") ;
you are using the default character encoding of your system, which may or may not be UTF-8. If you want to use UTF-8, you have to specify that:
PrintWriter csvWriter = new PrintWriter(new File("myfile.csv"), "UTF-8");
Note that even if you do this, you might still see unexpected output. If that's the case, then you will need to check if whatever program you use to display the output (the Windows command prompt, or a text editor, or ...) understands that the file is in UTF-8 and displays it correctly.
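The "?s?" in the question is what Java produces when a character has no representation in the target charset. A minimal sketch, assuming the jar ran with a single-byte default encoding that lacks ı; ISO-8859-1 stands in for that default here, and the names are hypothetical:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UnmappableSketch {
    // Encode the text with the given charset and decode it again with the
    // same charset. Characters the charset cannot represent come back as '?'.
    public static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String text = "\u0131s\u0131"; // ısı, written with escapes so the
                                       // source file encoding does not matter

        // ISO-8859-1 has no dotless i, so each one is replaced on encoding.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1)); // prints ?s?

        // UTF-8 can represent every character, so the text survives.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // prints true
    }
}
```

Once the character has been replaced by '?', the information is gone from the file, so the fix has to happen on the writing side.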

Exporting CSV in French language shows junk characters

I am having a problem exporting a CSV file using au.com.bytecode.opencsv.CSVWriter. I did something like:
File file = File.createTempFile("UserDetails_", ".csv");
CSVWriter writer = new CSVWriter(new OutputStreamWriter(
new FileOutputStream(file), "UTF-8"),
',');
and when I export the .csv file, it shows junk characters in place of French letters. [The data to be saved in the .csv file contains French characters.]
But previously I was doing something like:
CSVWriter writer = new CSVWriter(new FileWriter(file));, and then it showed all French characters correctly in the Windows environment, but in the Prod environment [Linux] it showed junk. So I decided to use the UTF-8 character set for the exported file.
How can I get rid of the problem?
Please Suggest!!
Thanks in advance!
Hypothesis: you use Excel to open your CSVs under Windows.
Unfortunately for you, Excel is crap at reading UTF-8. Even though it should not be required, Excel expects to have a byte order mark at the beginning of the CSV if it uses any UTF-* encoding, otherwise it will try and read it using Windows 1252!
Solution? Errr... Don't use Excel?
Anyway, with your old way:
CSVWriter writer = new CSVWriter(new FileWriter(file));
this would use the JVM's default encoding; this is typically windows-1252 under Windows and UTF-8 under Linux.
Note that Apache's commons-io has BOM{Input,Output}Stream classes which may help you here.
Another solution would be (ewwww) to always read/write using Windows-1252.
Other note: if you use Java 7, use the Files.newBuffered{Reader,Writer}() methods -- and the try-with-resources statement.
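As a rough sketch of the last two notes combined, the following writes a UTF-8 CSV with a leading byte order mark using Files.newBufferedWriter() and try-with-resources. The class name, method names, and sample data are hypothetical, and whether a given Excel version then opens the file correctly is not something this code can verify.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomCsvSketch {
    // Write the CSV text as UTF-8, preceded by U+FEFF, which UTF-8 encodes
    // as the three bytes EF BB BF (the byte order mark).
    public static void writeCsvWithBom(Path target, String csvText) {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write('\uFEFF');
            writer.write(csvText);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Check whether a file begins with the UTF-8 byte order mark.
    public static boolean startsWithUtf8Bom(Path file) {
        try {
            byte[] bytes = Files.readAllBytes(file);
            return bytes.length >= 3
                    && (bytes[0] & 0xFF) == 0xEF
                    && (bytes[1] & 0xFF) == 0xBB
                    && (bytes[2] & 0xFF) == 0xBF;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("UserDetails_", ".csv");
        // Hypothetical sample rows with French accented letters.
        writeCsvWithBom(target, "pr\u00E9nom,ville\nH\u00E9l\u00E8ne,Orl\u00E9ans\n");
        System.out.println(startsWithUtf8Bom(target));
    }
}
```

To use this with opencsv, the BufferedWriter could be passed to the CSVWriter constructor after the BOM has been written, since that constructor accepts any Writer.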

How can i read a Russian file in Java?

I tried specifying UTF-8 for this, but it didn't work. What should I do to read a Russian file in Java?
FileInputStream fstream1 = new FileInputStream("russian.txt");
DataInputStream in = new DataInputStream(fstream1);
BufferedReader br = new BufferedReader(new InputStreamReader(in,"UTF-8"));
If the file comes from a Windows PC, try either "windows-1251" or "Cp1251" as the charset name.
If the file is somehow in the MS-DOS encoding, try "Cp866".
Both of these are single-byte encodings, and telling Java to read the file as UTF-8 (which is multibyte) will not decode them correctly.
If all else fails, open the file in a hex editor and paste a few lines of the hex dump into your question. Then we can identify the encoding.
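A minimal sketch of that diagnosis in code: a strict UTF-8 check that fails on malformed input instead of silently substituting replacement characters, plus a hex dump helper. The class name, method names, and sample bytes are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingGuessSketch {
    // Strict check: returns false if the bytes contain a sequence that is
    // not well-formed UTF-8.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Space-separated hex dump, suitable for pasting into a question.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: the word "Привет" encoded as windows-1251.
        byte[] sample = {(byte) 0xCF, (byte) 0xF0, (byte) 0xE8,
                         (byte) 0xE2, (byte) 0xE5, (byte) 0xF2};

        System.out.println(toHex(sample));       // prints CF F0 E8 E2 E5 F2
        System.out.println(isValidUtf8(sample)); // prints false

        // windows-1251 is not one of the charsets every JVM must provide,
        // so check before using it.
        if (Charset.isSupported("windows-1251")) {
            System.out.println(new String(sample, Charset.forName("windows-1251")));
        }
    }
}
```

A file that fails the strict check is certainly not UTF-8. A file that passes may still be in another encoding, so this only narrows the candidates.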
As others mentioned you need to know how the file is encoded. A simple check is to (ab)use Firefox as an encoding detector: answer to similar question
If this is a display problem, it depends what you mean by "reads": in the console, in some window? See also How can I make a String with cyrillic characters display correctly?
