Saving Java file in UTF-8

When I run this program it gives me a '?' for the Unicode code point \u0508. This is because the default Windows character encoding, CP-1252, cannot map this code point.
But when I save this file in Eclipse with 'Text file encoding' = UTF-8 and run the program, it gives the correct output AԈC.
Why does this work? The Java file is saved as UTF-8, but the underlying Windows OS encoding is still CP-1252. My question is similar to: when I try to read a text file as UTF-16 that was originally written as UTF-8, the output is weird, with various box symbols.
public class e {
    public static void main(String[] args) {
        System.out.println(System.getProperty("file.encoding"));
        String original = new String("A" + "\u0508" + "C");
        try {
            System.out.println("original = " + original);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Saving the Java source file as either UTF-8 or Windows-1252 shouldn't make any difference, because both encodings encode all the ASCII code points the same way, and your source file uses only ASCII characters: the \u0508 escape is itself plain ASCII.
So the bug must be somewhere else. I suggest carefully redoing the steps you did and running the tests again.
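To see the point about the escape, here is a minimal illustration (hypothetical class name, not from the question):
public class EscapeDemo {
    public static void main(String[] args) {
        // Both lines produce the same string at runtime, but only the first is
        // immune to the source file's encoding: "\u0508" is six ASCII characters.
        String escaped = "A\u0508C";
        String literal = "AԈC"; // correct only if javac decodes this file as UTF-8
        System.out.println(escaped.equals(literal)); // true when the file was compiled with the right encoding
    }
}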

The issue is the setting of file.encoding when you run the program, and the destination of System.out. If System.out is an Eclipse console, it may well be a UTF-8 console. If it is just a Windows DOS box, it uses the CP1252 code page and will only display ? in this case.
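If you need predictable output regardless of the platform default, one option is to wrap System.out with an explicit charset. A minimal sketch (this only helps if the console at the other end actually decodes UTF-8, e.g. an Eclipse console set to UTF-8):
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ConsoleEncoding {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Encode the bytes sent to stdout as UTF-8 instead of the platform default.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("original = A\u0508C");
    }
}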

Related

Java: Runtime.exec() and Unicode symbols on Windows: how to make it work with non-English letters?

Intro
I am using Runtime.exec() to execute an external command, with parameters that contain non-English characters. I simply want to run something like this:
python test.py шалом
It works correctly when run directly in cmd, but is handled incorrectly via Runtime.getRuntime().exec("python test.py шалом").
On Windows, my external program fails because of the unknown symbols passed to it.
I remember a similar issue from the early 2000s (!) - JDK-4947220 - but I thought it had been fixed since Java 1.6.
Environment:
OS: Microsoft Windows 10 Pro (Version 10.0.18362, Build 18362)
Java: jdk1.8.0_221
Code
To understand the question, the best way is to use the code snippet listed below:
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class MainClass {
    private static void foo(String filename) {
        try {
            BufferedReader input = new BufferedReader(
                    new InputStreamReader(
                            Runtime.getRuntime().exec(filename).getInputStream()));
            String line;
            while ((line = input.readLine()) != null) {
                System.out.println(line);
            }
            input.close();
        } catch (Exception e) { /* ... */ }
    }

    public static void main(String[] args) {
        foo("你好.bat 你好");       // ??
        foo("привет.bat привет"); // ??????
        foo("hi.bat hi");         // hi
    }
}
where each .bat file contains only a simple @echo %1.
The output will be:
??
??????
hi
PS
System.out.println("привет") works fine and prints everything correctly.
My questions are:
1) Is this issue related to the UTF-8/UTF-16 formats?
2) How can I fix this issue? I do not like this answer, as it looks like a very dangerous and ugly workaround.
3) Does anyone know why the batch file's name is not broken (the file can be found), but the argument gets broken? Maybe it is a problem with @echo?
Yes, the issue is related to UTF encodings. In theory, setting code page 65001 for the cmd that executes the bat files should solve the issue (along with setting UTF-8 as the default charset on the Java side).
Unfortunately, there is a bug in Windows, described here: Java, Unicode, UTF-8, and Windows Command Prompt.
So there is no simple and complete solution. What you can do is set the same language-specific encoding, such as cp1251 for Cyrillic, as the default for both Java and cmd. Not all languages are well covered by the Windows code pages; Chinese, for example, is not.
If some non-technical restriction prevents changing the default encoding to the language-specific one for all cmd processes on the Windows system, the Java code becomes more involved. First, a new cmd process has to be created; then a reader using UTF-16LE (for a `cmd /U` process) must be attached to its stdout, and a writer using cp1251 to its stdin, from different threads. The first command sent to stdin from Java should be chcp 1251, and the second the name of the bat file with its parameters (sketched below).
A complete solution may still use UTF-16LE for reading cmd's output, but for passing text in, another universal encoding such as Base64 should be used, which again increases complexity.
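A minimal sketch of that approach (untested; the bat file name is taken from the question, and error handling is omitted):
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

public class CmdUnicodeSketch {
    public static void main(String[] args) throws Exception {
        // cmd /U makes the output of internal commands to pipes UTF-16LE.
        Process cmd = new ProcessBuilder("cmd", "/U").redirectErrorStream(true).start();

        // Read cmd's stdout as UTF-16LE on a separate thread so the pipe never blocks.
        BufferedReader stdout = new BufferedReader(
                new InputStreamReader(cmd.getInputStream(), Charset.forName("UTF-16LE")));
        Thread pump = new Thread(() -> stdout.lines().forEach(System.out::println));
        pump.start();

        // Write to cmd's stdin in the language-specific code page (cp1251 for Cyrillic).
        BufferedWriter stdin = new BufferedWriter(
                new OutputStreamWriter(cmd.getOutputStream(), Charset.forName("cp1251")));
        stdin.write("chcp 1251\r\n");         // first: switch the console code page
        stdin.write("привет.bat привет\r\n"); // then: run the batch file with its argument
        stdin.write("exit\r\n");
        stdin.flush();

        cmd.waitFor();
        pump.join();
    }
}
Note that the captured output will also contain cmd's own banner, prompts, and echoed commands, which a real implementation would need to filter out.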

Writing to Buffered Writer UTF-8 Characters With Accents Are Coming Out Garbled

I am reading lines from a UTF-8 input file with accented characters and writing them back to a different file (also UTF-8), but the accented characters come out garbled in the output. For instance, the following words:
León
Mānoa
are output as:
Le�n
Manoa
I've looked at about 100 answers to this question, all of which suggest reading and writing the files as the code below does, but I keep getting the same result.
I've broken the code down to its essential features below:
import java.io.*;

public class UTF8EncoderTest
{
    public static void main(String[] args)
    {
        try
        {
            BufferedReader inputFileReader = new BufferedReader(
                    new InputStreamReader(new FileInputStream("utf8TestInput.txt"), "UTF-8"));
            BufferedWriter outputFileWriter = new BufferedWriter(
                    new OutputStreamWriter(new FileOutputStream("utf8TestOutput.txt"), "UTF-8"));
            String line = inputFileReader.readLine();
            while (line != null)
            {
                outputFileWriter.write(line + "\r\n");
                line = inputFileReader.readLine();
            }
            inputFileReader.close();
            outputFileWriter.close();
            System.out.println("Finished!");
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
But this still results in garbled characters in the output file. Any help would be appreciated!
Try this:
String sText = "This león and this is Mānoa";
File oFile = new File(getExternalFilesDir("YourFolder"), "YourFile.txt"); // Android API
try {
    FileOutputStream oFileOutputStream = new FileOutputStream(oFile, true); // true = append
    OutputStreamWriter writer = new OutputStreamWriter(oFileOutputStream, StandardCharsets.ISO_8859_1);
    writer.append(sText);
    writer.close();
} catch (IOException e) {
    // ignored
}
I tried your code with your examples and it works without problems (no characters are changed or lost).
A few tips when you deal with charsets in Java:
The default character encoding in Java is the character encoding used by the JVM. By default, the JVM uses the platform encoding, i.e. the character encoding of your server's operating system.
Java obtains the character encoding by calling System.getProperty("file.encoding", "UTF-8") at JVM start-up, so if no file.encoding property is given, it uses UTF-8.
The most important point to remember is that Java caches the character encoding, i.e. the value of the system property file.encoding, in most of its core classes that need it, such as InputStreamReader, once the JVM has started. So if you change the system property file.encoding programmatically while the application is running, you will not see the desired effect. That is why you should always work with an explicit character encoding in your application, and, if it needs to be set, set it when you start the JVM.
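A small demonstration of that caching behaviour (the printed values assume a Windows machine whose default is CP1252):
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        System.out.println(Charset.defaultCharset()); // e.g. windows-1252
        System.setProperty("file.encoding", "UTF-8"); // too late: the default is already cached
        System.out.println(Charset.defaultCharset()); // still windows-1252
    }
}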
How to get the default character encoding?
The easiest way is to call System.getProperty("file.encoding"), which returns the encoding set at JVM start-up via -Dfile.encoding, unless the program has since called System.setProperty("file.encoding", someEncoding).
java.nio.charset.Charset provides a convenient static method, Charset.defaultCharset(), which returns the default character encoding.
By using InputStreamReader#getEncoding(). All three ways are shown in the sketch below.
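A minimal sketch showing all three (hypothetical class name):
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class ShowEncoding {
    public static void main(String[] args) {
        System.out.println(System.getProperty("file.encoding"));            // e.g. Cp1252
        System.out.println(Charset.defaultCharset().name());                // e.g. windows-1252
        System.out.println(new InputStreamReader(System.in).getEncoding()); // e.g. Cp1252
    }
}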
How to set the default character encoding?
By providing the file.encoding system property when the JVM starts, e.g.:
java -Dfile.encoding="UTF-8" HelloWorld
If you don't control how the JVM starts up, you can set the environment variable JAVA_TOOL_OPTIONS to -Dfile.encoding="UTF-16" or any other character encoding, and it will be picked up when the JVM starts on your Windows machine. The JVM will also print Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-16 on the console to indicate that it has picked up JAVA_TOOL_OPTIONS.
Alternatively, you can try the java.nio.file API:
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

Path inputFilePath = Paths.get("utf8TestInput.txt");
BufferedReader inputFileReader = Files.newBufferedReader(inputFilePath, StandardCharsets.UTF_8);
Path outputFilePath = Paths.get("utf8TestOutput.txt");
BufferedWriter outputFileWriter = Files.newBufferedWriter(outputFilePath, StandardCharsets.UTF_8);

Eclipse Java imports weird non-Hebrew characters instead of hebrew in file - Encoding issue?

I'm trying to import a .dat text file containing both Hebrew and English characters into a Java program, using Eclipse Neon 4.6.0:
String[] getFile(String path) throws IOException {
    BufferedReader in = new BufferedReader(new InputStreamReader(
            this.getClass().getResource("../../../t3utf.dat").openStream()));
    String l;
    String[] dataFile = new String[23213]; // Does Java have push and pop or auto-expanding lists?
    int c = 0;
    while ((l = in.readLine()) != null) {
        dataFile[c] = l;
        c++;
    }
    return dataFile;
}
For some reason, the Hebrew characters are being replaced with random gibberish:
Original: gen|1|1|בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃
After the program runs once, all the Hebrew characters are replaced with gibberish:
New: gen|1|1|בְּרֵ×ש×ִ֖ית ×‘Ö¼Ö¸×¨Ö¸Ö£× ×Ö±×œÖ¹×”Ö´Ö‘×™× ×ֵ֥תהַש×Ö¼Ö¸×žÖ·Ö–×™Ö´× ×•Ö°×ֵ֥ת ×”Ö¸×ָֽרֶץ׃
In fact, the file itself changes to gibberish: when viewed in Notepad after running the program, the characters have changed somehow.
I had a version of this program running on Android in AIDE that worked and did not have this problem. Is Eclipse unnecessarily forcing a particular useless encoding?
According to this answer, you need to set the project encoding to UTF-8. The best way to do that, if you'll be working on other projects involving Hebrew characters, is to change the encoding for your whole workspace:
1. Go to Window -> Preferences -> General -> Workspace.
2. Set 'Text file encoding' to 'Other: UTF-8'.
This will allow your program to load Hebrew characters, as the UTF-8 encoding includes the Hebrew characters.
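Independently of the workspace setting, you can also pin the charset in code, so the read no longer depends on Eclipse's defaults. A sketch based on the question's method (using an ArrayList, which also answers the comment in the code: Java lists grow automatically):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

List<String> getFile() throws IOException {
    List<String> dataFile = new ArrayList<>();
    // Name the charset explicitly instead of relying on the platform default.
    try (BufferedReader in = new BufferedReader(new InputStreamReader(
            this.getClass().getResource("../../../t3utf.dat").openStream(),
            StandardCharsets.UTF_8))) {
        String l;
        while ((l = in.readLine()) != null) {
            dataFile.add(l);
        }
    }
    return dataFile;
}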

Save to file with special characters

My program works fine on my Mac, but when I try it on Windows all the special characters turn into %$&-style garbage. I am Norwegian, so the special characters are mostly æøå.
This is the code I use to write to file:
File file = new File("Notes.txt");
if (file.exists() && !file.isDirectory()) {
    try (PrintWriter pw = new PrintWriter(new BufferedWriter(new FileWriter("Notes.txt", true)))) {
        pw.println("");
        pw.println("*****");
        pw.println(notat.getId());
        pw.println(notat.getTitle());
        pw.println(notat.getNote());
        pw.println(notat.getDate());
    } catch (Exception e) {
        // Did not find file
    }
} else {
    // Did not find file
}
How can I make sure the special characters are written correctly on both operating systems?
NOTE: I use IntelliJ, and my program runs as a .jar file.
Make sure that you use the same encoding on Windows as you do on the Mac.
IDEA displays the encoding in the lower right corner. Furthermore, you can configure the encoding under Settings -> Editor -> File Encodings.
It is possible to configure the encoding project-wide or per file.
Also, read about the Java default file encoding to make sure reading and writing files always use the same charset; a sketch with an explicit charset follows.
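For example, the question's writer can pin the charset explicitly. A sketch (notat comes from the question's surrounding code; FileWriter on Java 8 cannot take a charset, so OutputStreamWriter with java.nio.charset.StandardCharsets is used instead):
try (PrintWriter pw = new PrintWriter(new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream("Notes.txt", true), StandardCharsets.UTF_8)))) { // true = append
    pw.println("");
    pw.println("*****");
    pw.println(notat.getId());    // notat is defined elsewhere in the question's program
    pw.println(notat.getTitle());
    pw.println(notat.getNote());
    pw.println(notat.getDate());
} catch (Exception e) {
    // Did not find file
}
With the charset named in code, Mac and Windows produce byte-identical files regardless of their platform defaults.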

can not save utf8 file in windows server with java

I have a simple Java application that saves some strings with UTF-8 encoding. But when I open that file in Notepad and choose Save As, it shows the encoding as ANSI. Now I don't know where the problem is.
My code that saves the file is:
File fileDir = new File("c:\\Sample.txt");
Writer out = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream(fileDir), "UTF8"));
out.append("kodehelp UTF-8").append("\r\n");
out.append("??? UTF-8").append("\r\n");
out.append("???? UTF-8").append("\r\n");
out.flush();
out.close();
The characters you are writing to the file, as they appear in the code snippet, are in the basic ASCII subset of UTF-8. Notepad is likely auto-detecting the format and, seeing nothing outside the ASCII range, decides the file is ANSI.
If you want to force a different decision, include characters such as 字 or õ, which are well outside the ASCII range.
It is possible that the ??? strings in your example were intended to be non-ASCII UTF-8 text. If so, make sure your IDE and/or build tool recognizes the files as UTF-8 and that the files are indeed UTF-8 encoded. If you provide more information about your build system, we can help further.
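If the file must be detected as UTF-8 even when its content is pure ASCII, one common workaround is to write a byte order mark first. A sketch (note that the BOM is optional in UTF-8 and some tools dislike it):
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class BomWriter {
    public static void main(String[] args) throws IOException {
        try (Writer out = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream("c:\\Sample.txt"), "UTF-8"))) {
            out.write('\uFEFF'); // BOM, encoded as bytes EF BB BF: Notepad then detects UTF-8
            out.append("kodehelp UTF-8").append("\r\n");
        }
    }
}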
