How can I read a Russian file in Java?

I tried specifying UTF-8 for this, but it didn't work out. What should I do to read a Russian file in Java?
FileInputStream fstream1 = new FileInputStream("russian.txt");
DataInputStream in = new DataInputStream(fstream1);
BufferedReader br = new BufferedReader(new InputStreamReader(in,"UTF-8"));

If the file is from Windows PC, try either "windows-1251" or "Cp1251" for the charset name.
If the file is somehow in the MS-DOS encoding, try using "Cp866".
Both of these are single-byte encodings, so declaring the file as UTF-8 (which is multi-byte) does nothing for them.
If all else fails, open the file in a hex editor and add a few hex lines from it to your question. Then we can detect the encoding.
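For example, a minimal sketch that reads the file as Windows-1251 (swap in "Cp866" or "UTF-8" once the real encoding is known; "russian.txt" is the file name from the question):
BufferedReader br = new BufferedReader(new InputStreamReader(
        new FileInputStream("russian.txt"), "windows-1251")); // guess: Windows Cyrillic
String line;
while ((line = br.readLine()) != null) {
    System.out.println(line); // whether this displays correctly also depends on the console encoding
}
br.close();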

As others mentioned, you need to know how the file is encoded. A simple check is to (ab)use Firefox as an encoding detector: answer to similar question
If this is a display problem, it depends on what you mean by "reads": in the console, in some window? See also How can I make a String with cyrillic characters display correctly?

Related

Different character after export application on Eclipse

When I run my application in Eclipse, I can see the correct Latin character, like this:
But when I export it to a runnable jar file and execute it, the special character is wrong, like this:
I have no idea why this happens. On Mac it's OK both in Eclipse and as a .jar file, but on Windows it's not.
I get the data from a web server and show it in a JavaFX ListView.
It is a String turned into UTF-8 bytes shown as some Windows encoding.
My guess is you did this:
URL url = ...
BufferedReader in = new BufferedReader(
        new InputStreamReader(url.openStream()));
Whereas you should have done this:
URL url = ...
BufferedReader in = new BufferedReader(
new InputStreamReader(url.openStream(),
StandardCharsets.UTF_8));
The InputStreamReader constructor without a Charset uses the current platform default encoding, which is wrong here.
For any URL you could first do an openConnection and try to work out the delivered encoding. The strategy is a bit roundabout (a sketch follows below):
check connection.getContentEncoding() / getContentType()
the HTTP default is ISO-8859-1
when it says ISO-8859-1, take Windows-1252 instead, as browsers do that too
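A rough sketch of that strategy; the charset-extraction regex and the fallback here are my own minimal guesses, not a canonical implementation:
URLConnection connection = url.openConnection();
String contentType = connection.getContentType(); // e.g. "text/html; charset=utf-8"
String charsetName = "ISO-8859-1";                // HTTP default when nothing is declared
if (contentType != null) {
    Matcher m = Pattern.compile("charset=([\\w-]+)").matcher(contentType);
    if (m.find()) {
        charsetName = m.group(1);
    }
}
if (charsetName.equalsIgnoreCase("ISO-8859-1")) {
    charsetName = "windows-1252"; // treat Latin-1 as Windows-1252, as browsers do
}
BufferedReader in = new BufferedReader(
        new InputStreamReader(connection.getInputStream(), charsetName));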
Java keeps Unicode in String and char, so all scripts can be handled simultaneously.
Binary data (byte[], InputStream, OutputStream) needs the charset/encoding specified whenever it is converted from/to text.
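For instance, a trivial illustration of that boundary (the string is just an example value):
String text = "Olá";                                     // Unicode text lives in String/char
byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);     // charset needed when going to bytes
String back = new String(utf8, StandardCharsets.UTF_8);  // and when coming back to text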

Cannot save UTF-8 file on Windows server with Java

I have a simple Java application that saves some Strings with UTF-8 encoding.
But when I open that file with Notepad and choose Save As, it shows the encoding as ANSI. Now I don't know where the problem is.
My code that saves the file is:
File fileDir = new File("c:\\Sample.txt");
Writer out = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream(fileDir), "UTF8"));
out.append("kodehelp UTF-8").append("\r\n");
out.append("??? UTF-8").append("\r\n");
out.append("???? UTF-8").append("\r\n");
out.flush();
out.close();
The characters you are writing to the file, as they appear in the code snippet, are in the basic ASCII subset of UTF-8. Notepad is likely auto-detecting the format and, seeing nothing outside the ASCII range, decides the file is ANSI.
If you want to force a different decision, place characters such as 字 or õ which are well out of the ASCII range.
It is possible that the ??? strings in your example were intended to be UTF-8. If so, make sure your IDE and/or build tool recognizes the files as UTF-8, and that the files are indeed UTF-8 encoded. If you provide more information about your build system, we can help further.
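As a sketch of that last point (the BOM and the extra characters are only there to push the content outside ASCII; whether Notepad then reports UTF-8 still depends on its detection heuristics):
Writer out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream("c:\\Sample.txt"), "UTF-8"));
out.append('\uFEFF');                              // optional BOM; Notepad usually reports UTF-8 when it sees one
out.append("kodehelp UTF-8 字 õ").append("\r\n");  // non-ASCII characters force multi-byte output
out.close();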

How to convert strange character from web page?

On the web page it is "Why don't we", as follows:
But when I parse the web page and save it to a text file, it becomes this under Eclipse:
Why don鈥檛 we
More information about my implementation:
The web page is UTF-8.
I use jSoup to parse; the file is saved as a .txt.
I use FileWriter f = new FileWriter() to write to the file.
UPDATE:
I actually solved the display problem in Eclipse by changing Eclipse's encoding to UTF-8.
FileWriter is a utility class that uses the current platform default encoding. That is non-portable, and probably incorrect.
BufferedWriter f = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(file), StandardCharsets.UTF_8));
f.write("\uFEFF"); // redundant BOM character may be written to make sure
                   // the text is read as UTF-8
...
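On Java 7 or later the same thing can be written a bit more compactly with java.nio.file (a sketch; the file name here is just a placeholder):
try (BufferedWriter w = Files.newBufferedWriter(
        Paths.get("output.txt"), StandardCharsets.UTF_8)) {
    w.write("\uFEFF");          // same optional BOM as above
    w.write("Why don't we");
}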

Displaying special characters

I am running into issues when displaying special characters on the Windows console.
I have written the following code:
public static void main(String[] args) throws IOException {
    File newFile = new File("sampleInput.txt");
    File newOutFile = new File("sampleOutput.txt");
    FileReader read = new FileReader(newFile);
    FileWriter write = new FileWriter(newOutFile);
    PushbackReader reader = new PushbackReader(read);
    int c;
    while ((c = reader.read()) != -1) {
        write.write(c);
    }
    read.close();
    write.close();
}
The output file looks exactly like the input file, special characters included; i.e. for the input file contents © Ø ŻƩ abcdefĦ, the output file contains exactly the same contents. But when I add the line System.out.printf("%c", (char) c), the contents on the console are: ÿþ© (there are more characters, but I am not able to copy-paste them here). I did read that the issue might be with the Windows console character set, but I am not able to figure out the fix for it.
Considering the output medium could be anything in the future, I do not want to run into issues with Unicode character display for any type of output stream.
Can anyone please help me understand the issue and how I can fix it?
The FileReader and FileWriter use the platform default charset for transforming between characters and bytes. In your environment that's apparently not a Unicode-compatible charset like UTF-8.
You need InputStreamReader and OutputStreamWriter wherein you can explicitly specify the charset.
Reader read = new InputStreamReader(new FileInputStream(newFile), "UTF-8");
Writer write = new OutputStreamWriter(new FileOutputStream(newOutFile), "UTF-8");
// ...
Also, the console needs to be configured to use UTF-8 to display the characters. In Eclipse, for example, you can do that via Window > Preferences > General > Workspace > Text File Encoding.
In the command prompt console it's not possible to display those characters, due to the lack of a font supporting them; you would want to move to a Swing-based UI console instead.
See also:
Unicode - How to get the characters right?
Instead of FileWriter, try using OutputStreamWriter and specify the encoding of the output.
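For the console side specifically, one common approach (a sketch; it only helps if the console code page and font can actually display the characters) is to wrap System.out in a PrintStream with an explicit charset:
PrintStream console = new PrintStream(System.out, true, "UTF-8");
console.printf("%c", (char) c); // instead of System.out.printf("%c", (char) c) in the copy loop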

Unicode in Jar resources

I have a Unicode (UTF-8 without BOM) text file within a jar, that's loaded as a resource.
URL resource = MyClass.class.getResource("datafile.csv");
InputStream stream = resource.openStream();
BufferedReader reader = new BufferedReader(
new InputStreamReader(stream, Charset.forName("UTF-8")));
This works fine on Windows, but on Linux it appears not to be reading the file correctly: accented characters come out broken. I'm aware that different machines can have different default charsets, but I'm giving it the correct charset. Why would it not use it?
The reading part looks correct; I use that all the time on Linux.
I suspect you used the default encoding somewhere when you exported the text to the web page. Due to the different default encodings on Linux and Windows, you saw different results.
For example, you use the default encoding if you do anything like this in a servlet:
PrintWriter out = response.getWriter();
out.println(text);
You need to specifically write in UTF-8 like this,
response.setContentType("text/html; charset=UTF-8");
out = new PrintWriter(
new OutputStreamWriter(response.getOutputStream(), "UTF-8"), true);
out.println(text);
I wonder if reviewing UTF-8 on Linux would help. Could be a setup issue.
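To check whether a default-encoding difference between the two machines really is the culprit, a quick diagnostic is to print the defaults on both (not a fix, just a check):
System.out.println(Charset.defaultCharset());            // e.g. UTF-8 on the Linux box, windows-1252 on Windows
System.out.println(System.getProperty("file.encoding")); // the system property behind it on most JVMs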
