How to get charset of Process in Java?

Charset.defaultCharset() and file.encoding give the JVM's charset, not the charset of the OS, console, or terminal.
Now I create a Process to run a program and use process.getInputStream() to read its output. How do I find the correct charset for the process? (sun.jnu.encoding gives the right value, but it does not seem to be a general solution.)

From Java 17 onwards, there is a method in the Process class named inputReader.
The source code taken from Process.java:
public final BufferedReader inputReader() {
return inputReader(CharsetHolder.nativeCharset());
}
The nativeCharset is obtained from System property "native.encoding".
This property was introduced in Java 17. This is probably what you want.
Reference: https://bugs.openjdk.java.net/browse/JDK-8266075
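If you need the charset itself rather than a ready-made reader, something like the following may work. It is only a sketch: the class and method names (NativeCharsetReader, nativeCharset, readerFor) are made up for illustration, and falling back to Charset.defaultCharset() on JVMs that lack the property is my own assumption, not something the JDK does for you.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class NativeCharsetReader {

    /**
     * Returns the charset named by the "native.encoding" system property
     * (present on Java 17+). Falls back to the JVM default charset when the
     * property is missing or names a charset this JVM does not support.
     */
    static Charset nativeCharset() {
        String name = System.getProperty("native.encoding");
        if (name != null) {
            try {
                return Charset.forName(name);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback below.
            }
        }
        return Charset.defaultCharset();
    }

    /** Wraps a process output stream in a reader that decodes with nativeCharset(). */
    static BufferedReader readerFor(InputStream in) {
        return new BufferedReader(new InputStreamReader(in, nativeCharset()));
    }
}
```

Usage would be along the lines of readerFor(process.getInputStream()). Whether the child process really writes in the native encoding still depends on the program you start.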

Related

Java change system new-line character

On Windows, using System.out.println() prints out \r\n while on a Unix system you would get \n.
Is there any way to tell java what new-line characters you want to use?
As already stated by others, the system property line.separator contains the actual line separator. Strangely, the other answers missed the simple conclusion: you can override that separator by changing that system property at startup time.
E.g. if you run your program with the option -Dline.separator=X at the command line you will get the funny behavior of System.out.println(…); ending the line with an X.
The tricky part is how to specify characters like \n or \r at the command line. But that’s system/environment specific and not a Java question anymore.
Yes, there is a way and I've just tried it.
There is a system property line.separator. You can set it using System.setProperty("line.separator", whatever)
To be sure that it indeed causes JVM to use other separator I implemented the following exercise:
PrintWriter writer = new PrintWriter(new FileWriter("c:/temp/mytest.txt"));
writer.println("hello");
writer.println("world");
writer.close();
I am running on Windows now, so the result was a 14-byte file:
03/27/2014 10:13 AM 14 mytest.txt
1 File(s) 14 bytes
0 Dir(s) 409,157,980,160 bytes free
However when I added the following line to the beginning of my code:
System.setProperty("line.separator", "\n");
I got a 12-byte file:
03/27/2014 10:13 AM 12 mytest.txt
1 File(s) 12 bytes
0 Dir(s) 409,157,980,160 bytes free
I opened this file with notepad that does not recognize single \n as a new line and saw one-line text helloworld instead of 2 separate lines. So, this works.
Because the accepted answer simply does not work, as others pointed out before me, and the JDK only initialises the value once and then never reads the property anymore, only an internal static field, it became clear that the clean way to change the property is to set it on the command line when starting the JVM. So far, so good.
The reason I am writing yet another answer is that I want to present a reflective way to change the field, which really works with streams and writers relying on System.lineSeparator(). It does not hurt to update the system property, too, but the field is more important.
I know that reflection is ugly, as of Java 16+ needs an extra JVM command line parameter in order to allow it, and only works as long as the internals of System do not change in OpenJDK. But FWIW, here is my solution - don't do this at home, kids:
import java.io.*;
import java.lang.reflect.Field;
import java.nio.file.Files;
/**
* 'assert' requires VM parameter '-ea' (enable assert)
* 'Field.setAccessible' on System requires '--add-opens java.base/java.lang=ALL-UNNAMED' on Java 16+
*/
public class ChangeLineSeparator {
public static void main(String[] args) throws IOException, NoSuchFieldException, IllegalAccessException {
assert System.lineSeparator().equals("\r\n") : "default Windows line separator should be CRLF";
Field lineSeparator = System.class.getDeclaredField("lineSeparator");
lineSeparator.setAccessible(true);
lineSeparator.set(null, "\n");
assert System.lineSeparator().equals("\n") : "modified separator should be LF";
File tempFile = Files.createTempFile(null, null).toFile();
tempFile.deleteOnExit();
try (PrintWriter out = new PrintWriter(new FileWriter(tempFile))) {
out.println("foo");
out.println("bar");
}
assert tempFile.length() == "foo\nbar\n".length() : "unexpected file size";
}
}
You may try with:
String str = "\r\n";
System.out.print("yourString"+str);
but you can instead use this:
System.getProperty("line.separator");
to get the line separator.
Returns the system-dependent line separator string. It always returns
the same value - the initial value of the system property
line.separator.
On UNIX systems, it returns "\n"; on Microsoft Windows systems it
returns "\r\n".
As stated in the Java SE tutorial:
To modify the existing set of system properties, use
System.setProperties. This method takes a Properties object that has
been initialized to contain the properties to be set. This method
replaces the entire set of system properties with the new set
represented by the Properties object.
Warning: Changing system
properties is potentially dangerous and should be done with
discretion. Many system properties are not reread after start-up and
are there for informational purposes. Changing some properties may
have unexpected side-effects.
In the case of System.out.println(), the line separator that existed on system startup will be used. This is probably because System.lineSeparator() is used to terminate the line. From the documentation:
Returns the system-dependent line separator string. It always returns
the same value - the initial value of the system property
line.separator.
On UNIX systems, it returns "\n"; on Microsoft Windows systems it
returns "\r\n".
As Holger pointed out, you need to overwrite this property at startup of the JVM.
Windows cmd: Credit jeb at https://superuser.com/a/1519790 for a technique to specify a line-feed character in a parameter using a cmd variable. This technique can be used to specify the java line.separator.
Here's a sample javalf.cmd file
@echo off
REM define variable %\n% to be the linefeed character
(set \n=^^^
^
)
REM Start java using the value of %\n% as the line.separator System property
java -Dline.separator=%\n% %1 %2 %3 %4 %5 %6 %7 %8 %9
Here's a short test program.
public class ReadLineSeparator {
public static void main(String... ignore) {
System.out.println(System.lineSeparator().chars()
.mapToObj(c -> "{"+Character.getName(c)+"}")
.collect(java.util.stream.Collectors.joining()));
}
}
On Windows,
java ReadLineSeparator produces
{CARRIAGE RETURN (CR)}{LINE FEED (LF)}
.\javalf.cmd ReadLineSeparator produces
{LINE FEED (LF)}
The method System.lineSeparator() returns the line separator used by the system. From the documentation it specifies that it uses the system property line.separator.
Checked Java 11 implementation:
private static void initPhase1() {
lineSeparator = props.getProperty("line.separator");
}
public static String lineSeparator() {
return lineSeparator;
}
So altering system property at runtime doesn't change System.lineSeparator().
For this reason some projects re-read system property directly, see my answer: How to avoid CRLF (Carriage Return and Line Feed) in Logback - CWE 117
The only viable option is to set system property during app startup.
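To see the caching for yourself, a small check like the one below might help. It is a sketch that assumes Java 7 or later (where System.lineSeparator() exists); the class and method names are invented for this example. It sets the property at runtime, compares System.lineSeparator() before and after, and then restores the original property value.

```java
public class LineSeparatorDemo {

    /**
     * Changes the "line.separator" property at runtime and reports whether
     * System.lineSeparator() still returns the value it had before the change.
     */
    static boolean cachedSeparatorIgnoresRuntimeChange() {
        String originalProperty = System.getProperty("line.separator");
        String before = System.lineSeparator();
        try {
            System.setProperty("line.separator", "X");
            return System.lineSeparator().equals(before);
        } finally {
            // Restore the property so later code is not affected.
            System.setProperty("line.separator", originalProperty);
        }
    }

    public static void main(String[] args) {
        System.out.println(cachedSeparatorIgnoresRuntimeChange());
    }
}
```

Note that this only checks System.lineSeparator(). Older writer classes that read the property themselves when they are constructed may behave differently.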
For Bash it is as simple as: java -Dline.separator=$'\n' -jar my.jar.
For POSIX shell it is better to save that character in some variable first:
LF='
'
java -Dline.separator="$LF" -jar my.jar
If you are not sure debug it:
printf %s "$LF" | wc -c
1
printf %s "$LF" | od -x
0000000 000a
For Gradle I use:
tasks.withType(Test).configureEach {
systemProperty 'line.separator', '\n'
}
bootRun {
systemProperty 'line.separator', '\n'
}

Why does this Java program give incorrect results in Eclipse and correct results when run from the terminal?

Consider the following program.
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
public class HelloWorld {
public static void main(String[] args) {
System.out.println(Charset.defaultCharset());
char[] array = new char[3];
array[0] = '\u0905';
array[1] = '\u0905';
array[2] = '\u0905';
CharBuffer charBuffer = CharBuffer.wrap(array);
Charset utf8 = Charset.forName("UTF-8");
ByteBuffer encoded = utf8.encode(charBuffer);
System.out.println(new String(encoded.array()));
}
}
When I execute this using terminal,
java HelloWorld
I get properly encoded, shaped text. Default encoding was MacRoman.
Now when I execute the same code from Eclipse, I see incorrect text getting printed to the console.
When I change the file encoding option of Eclipse to UTF-8, it prints correct results in Eclipse.
I am wondering why this happens? Ideally, file encoding options should not have affected this code because here I am using UTF-8 explicitly.
Any idea why this is happening?
I am using Java 1.6 (Sun JDK), Mac OS X 10.7.
You need to specify what encoding you want to use when creating the string:
new String(encoded.array(), charset)
otherwise it will use the default charset.
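A minimal sketch of the round trip with the charset named on both sides is below. It assumes Java 7+ for StandardCharsets (on Java 1.6 you would pass the name "UTF-8" instead), and the class and method names are only illustrative. It also avoids encoded.array(), because the backing array of the buffer returned by encode() can be larger than the encoded data.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetRoundTrip {

    /** Encodes text as UTF-8 and decodes it again, naming the charset both times. */
    static String roundTrip(String text) {
        ByteBuffer encoded = StandardCharsets.UTF_8.encode(text);
        // decode() only reads between position and limit, so any unused
        // capacity at the end of the backing array is ignored.
        return StandardCharsets.UTF_8.decode(encoded).toString();
    }

    /** Same idea for a plain byte array. */
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u0905\u0905\u0905";
        System.out.println(roundTrip(text).equals(text));
    }
}
```

With this the decoding no longer depends on file.encoding. What the console then shows is a separate matter that depends on the console's own encoding.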
Make sure the console you use to display the output is also encoded in UTF-8. In Eclipse for example, you need to go to Run Configuration > Common to do this.
System.out.println("\u0905\u0905\u0905");
would be the straight-forward usage.
And encoding is missing for the String constructor, defaulting to the set default encoding.
new String(encoded.array(), "UTF-8")
This happens because Eclipse uses the default ANSI encoding, not UTF-8. If you're using a different encoding than the one your IDE is using, you will get unreadable results.
You need to change your console run configuration:
Click on "Run".
Click on "Run Configurations" and then click on the "Common" tab.
Change Encoding to UTF-8.

Get list of processes on Windows in a charset-safe way

This post gives a solution to retrieve the list of running processes under Windows. In essence it does:
String cmd = System.getenv("windir") + "\\system32\\" + "tasklist.exe";
Process p = Runtime.getRuntime().exec(cmd);
InputStreamReader isr = new InputStreamReader(p.getInputStream());
BufferedReader input = new BufferedReader(isr);
then reads the input.
It looks and works great but I was wondering if there is a possibility that the charset used by tasklist might not be the default charset and that this call could fail?
For example this other question about a different executable shows that it could cause some issues.
If that is the case, is there a way to determine what the appropriate charset would be?
This can be broken into two parts:
The windows part
From java you're executing a Windows command - externally to the jvm in "Windows land". When java Runtime class executes a windows command, it uses the DLL for consoles & so appears to windows as if the command is running in a console
Q: When I run C:\windows\system32\tasklist.exe in a console, what is the character encoding ("code page" in windows terminology) of the result?
windows "chcp" command with no argument gives the active code page number for the console (e.g. 850 for Multilingual-Latin-1, 1252 for Latin-1). See Windows Microsoft Code Pages, Windows OEM Code Pages, Windows ISO Code Pages
The default system code page is originally setup according to your system locale (type systeminfo to see this or Control Panel-> Region and Language).
The Windows API function GetACP() also gives this info.
The java part:
How do I decode a java byte stream from the windows code page of "x" (e.g. 850 or 1252)?
the full mapping between windows code page numbers and equivalent java charset names can be derived from here - Code Page Identifiers (Windows)
However, in practice one of the following prefixes can be added to achieve the mapping:
"" (none) for ISO, "IBM" or "x-IBM" for OEM, "windows-" OR "x-windows-" for Microsoft/Windows.
E.g. ISO-8859-1 or IBM850 or windows-1252
Full Solution:
String cmd = System.getenv("windir") + "\\system32\\" + "chcp.com";
Process p = Runtime.getRuntime().exec(cmd);
// Use default charset here - only want digits which are "core UTF8/UTF16";
// ignore text preceding ":"
String windowsCodePage = new Scanner(
new InputStreamReader(p.getInputStream())).skip(".*:").next();
Charset charset = null;
String[] charsetPrefixes =
new String[] {"","windows-","x-windows-","IBM","x-IBM"};
for (String charsetPrefix : charsetPrefixes) {
try {
charset = Charset.forName(charsetPrefix+windowsCodePage);
break;
} catch (Throwable t) {
}
}
// If no match found, use default charset
if (charset == null) charset = Charset.defaultCharset();
cmd = System.getenv("windir") + "\\system32\\" + "tasklist.exe";
p = Runtime.getRuntime().exec(cmd);
InputStreamReader isr = new InputStreamReader(p.getInputStream(), charset);
BufferedReader input = new BufferedReader(isr);
// Debugging output
System.out.println("matched codepage "+windowsCodePage+" to charset name:"+
charset.name()+" displayName:"+charset.displayName());
String line;
while ((line = input.readLine()) != null) {
System.out.println(line);
}
Thanks for the Q! - was fun.
Actually, the charset used by tasklist is always different from the system default.
On the other hand, it's quite safe to use the default as long as the output is limited to ASCII. Usually executable modules have only ASCII characters in their names.
So to get the correct Strings, you have to convert (ANSI) Windows code page to OEM code page, and pass the latter as charset to InputStreamReader.
It seems there's no comprehensive mapping between these encodings. The following mapping can be used:
Map<String, String> ansi2oem = new HashMap<String, String>();
ansi2oem.put("windows-1250", "IBM852");
ansi2oem.put("windows-1251", "IBM866");
ansi2oem.put("windows-1252", "IBM850");
ansi2oem.put("windows-1253", "IBM869");
Charset charset = Charset.defaultCharset();
String streamCharset = ansi2oem.get(charset.name());
// No OEM mapping known for this ANSI code page: fall back to the default charset
if (streamCharset == null) {
streamCharset = charset.name();
}
InputStreamReader isr = new InputStreamReader(p.getInputStream(),
streamCharset);
This approach worked for me with windows-1251 and IBM866 pair.
To get the current OEM encoding used by Windows, you can use GetOEMCP function. The return value depends on Language for non-Unicode programs setting on Administrative tab in Region and Language control panel. Reboot is required to apply the change.
There are two kinds of encodings on Windows: ANSI and OEM.
The former is used by non-Unicode applications running in GUI mode.
The latter is used by Console applications. Console applications cannot display characters that cannot be represented in the current OEM encoding.
Since tasklist is console mode application, its output is always in the current OEM encoding.
For English systems, the pair is usually Windows-1252 and CP850.
As I am in Russia, my system has the following encodings: Windows-1251 and CP866.
If I capture output of tasklist into a file, the file can't display Cyrillic characters correctly:
I get ЏаЁўҐв instead of Привет (Hi!) when viewed in Notepad.
And µTorrent is displayed as зTorrent.
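The garbling above can be reproduced in plain Java by writing text in the OEM code page and reading it back in the ANSI one. This is just a sketch: it assumes your JVM ships the IBM866 and windows-1251 charsets, the class and method names are made up, and the string literals use \u escapes so that the source file encoding does not matter.

```java
import java.nio.charset.Charset;

public class OemAnsiMismatch {

    /** Encodes text with one charset and decodes the same bytes with another. */
    static String misread(String text, String writtenAs, String readAs) {
        byte[] bytes = text.getBytes(Charset.forName(writtenAs));
        return new String(bytes, Charset.forName(readAs));
    }

    public static void main(String[] args) {
        // "\u041F\u0440\u0438\u0432\u0435\u0442" is the Russian word for "Hi" used above.
        String privet = "\u041F\u0440\u0438\u0432\u0435\u0442";
        // Written by a console program (OEM, IBM866), viewed as ANSI (windows-1251).
        System.out.println(misread(privet, "IBM866", "windows-1251"));
    }
}
```

Decoding the bytes with the charset that actually produced them (IBM866 here) gives the original text back.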
You cannot change the encoding used by tasklist.
However it's possible to change the output encoding of cmd. If you pass /u switch to it, it will output everything in UTF-16 encoding.
cmd /c echo Hi>echo.txt
The size of echo.txt is 4 bytes: two bytes for Hi and two bytes for new line (\r and \n).
cmd /u /c echo Hi>echo.txt
Now the size of echo.txt is 8 bytes: each character is represented with two bytes.
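The byte counts can be checked from Java, and the same charset can be used to decode what cmd /u writes. This sketch assumes the output is UTF-16LE without a byte order mark; the class and method names are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CmdUnicodeOutput {

    /** Number of bytes the text occupies in the given charset. */
    static int sizeInBytes(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    /** Decodes raw output captured from "cmd /u", assumed to be UTF-16LE without a BOM. */
    static String decodeCmdUnicode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        String line = "Hi\r\n";
        System.out.println(sizeInBytes(line, StandardCharsets.US_ASCII)); // single-byte encoding
        System.out.println(sizeInBytes(line, StandardCharsets.UTF_16LE)); // two bytes per character
    }
}
```

When reading from a Process you would pass StandardCharsets.UTF_16LE to the InputStreamReader instead of decoding a byte array afterwards.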
Why not use the Windows API via JNA, instead of spawning processes? Like this:
import com.sun.jna.platform.win32.Kernel32;
import com.sun.jna.platform.win32.Tlhelp32;
import com.sun.jna.platform.win32.WinDef;
import com.sun.jna.platform.win32.WinNT;
import com.sun.jna.win32.W32APIOptions;
import com.sun.jna.Native;
public class ListProcesses {
public static void main(String[] args) {
Kernel32 kernel32 = (Kernel32) Native.loadLibrary(Kernel32.class, W32APIOptions.UNICODE_OPTIONS);
Tlhelp32.PROCESSENTRY32.ByReference processEntry = new Tlhelp32.PROCESSENTRY32.ByReference();
WinNT.HANDLE snapshot = kernel32.CreateToolhelp32Snapshot(Tlhelp32.TH32CS_SNAPPROCESS, new WinDef.DWORD(0));
try {
while (kernel32.Process32Next(snapshot, processEntry)) {
System.out.println(processEntry.th32ProcessID + "\t" + Native.toString(processEntry.szExeFile));
}
}
finally {
kernel32.CloseHandle(snapshot);
}
}
}
I posted a similar answer elsewhere.
There is a much better way to check the running processes, or even to run OS command through java: Process and ProcessBuilder.
As for the Charset, you can always inquire the OS about the supported charsets, and obtain an Encoder or Decoder according to your needs.
[Edit]
Let's break it down; there's no way of knowing in which encoding the bytes of a given String are, so your only choice is to get those bytes, shift the ordering as necessary (if you're ever in such an environment where a process can give you an array of bytes in different ordering, use ByteBuffer to deal with that), and use the multiple CharsetDecoders supported to decode the bytes to reasonable output.
It is overkill and requires you to estimate that a given output could be in UTF-8, UTF-16 or any other encoding. But at least you can decode the given output using one of the possible Charsets, and then try to use the processed output for your needs.
Since we're talking about a process run by the same OS in which the JVM itself is running, it is quite possible that your output will be in one of the Charset encodings returned by the availableCharsets() method.
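A rough sketch of the "try several decoders" idea is below. The names are hypothetical, and it assumes Java 8+ for Optional. The key point is to make the decoder report errors instead of silently replacing bad bytes, so that a failed guess can be detected.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.Optional;

public class StrictDecode {

    /**
     * Tries to decode the bytes with the given charset. Returns an empty
     * Optional if the bytes are not valid in that charset.
     */
    static Optional<String> tryDecode(byte[] bytes, Charset charset) {
        try {
            return Optional.of(charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString());
        } catch (CharacterCodingException e) {
            // The bytes cannot have been produced by this charset.
            return Optional.empty();
        }
    }
}
```

Keep in mind that single-byte charsets such as ISO-8859-1 accept every byte sequence, so a successful decode there proves nothing. This check is mainly useful for ruling out UTF-8 or UTF-16.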

How to print "rājshāhi" to the Eclipse output console?

I have tried the following:
System.out.println("rājshāhi");
new PrintWriter(new OutputStreamWriter(System.out), true).println("rājshāhi");
new PrintWriter(new OutputStreamWriter(System.out, "UTF-8"), true).println("rājshāhi");
new PrintWriter(new OutputStreamWriter(System.out, "ISO-8859-1"), true).println("rājshāhi");
Which yields the following output:
r?jsh?hi
r?jsh?hi
rÄ?jshÄ?hi
r?jsh?hi
So, what am I doing wrong?
Thanks.
P.S.
I am using Eclipse Indigo on Windows 7. The output goes to the Eclipse output console.
The java file must be encoded correctly. Look in the properties for that file, and set the encoding correctly.
What you did should work, even the simple System.out.println if you have a recent version of eclipse.
Look at the following:
The version of eclipse you are using
Whether the file is encoded correctly. See @Matthew's answer. I assume this would be the case because otherwise Eclipse wouldn't allow you to save the file (it would warn "unsupported characters")
The font for the console (Windows -> Preferences -> Fonts -> Default Console Font)
When you save the text to a file whether you get the characters correctly
Actually, copying your code and running it on my computer gave me the following output:
rājshāhi
rājshāhi
rājshāhi
r?jsh?hi
It looks like all lines work except the last one. Get your System default character set (see this answer). Mine is UTF-8. See if changing your default character set makes a difference.
Either of the following lines will get your default character set:
System.out.println(System.getProperty("file.encoding"));
System.out.println(Charset.defaultCharset());
To change the default encoding, see this answer.
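The question marks can be reproduced without any console at all, which may help to tell an encoding problem from a font problem. The sketch below (illustrative names, Java 7+ for StandardCharsets) pushes the text through a charset and back. A charset that has no mapping for ā substitutes its replacement byte, which is '?' for ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ConsoleEncodingDemo {

    /** Encodes text with the charset and decodes the bytes with the same charset. */
    static String throughCharset(String text, Charset charset) {
        // getBytes(Charset) replaces unmappable characters with the charset's
        // replacement bytes instead of throwing.
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String word = "r\u0101jsh\u0101hi"; // the word from the question, written with escapes
        System.out.println(throughCharset(word, StandardCharsets.ISO_8859_1));
        System.out.println(throughCharset(word, StandardCharsets.UTF_8).equals(word));
    }
}
```

If the default charset reported by the two lines above cannot represent ā, a plain System.out.println will lose the character in the same way.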
Make sure that when you create your class you assign the Text File Encoding value UTF-8.
Once a class has been created with any other text file encoding, you can't change the encoding style later on: even though Eclipse will allow you to, the change won't be reflected.
So create a new class with Text File Encoding UTF-8. It will definitely work.
EDIT: In your case, although you are trying to assign the text file encoding programmatically, it has no effect; the container's inherited encoding (Cp1252) is used.
Using the latest Eclipse version helped me to achieve UTF-8 encoding on the console.
I used the Luna version of Eclipse and set Properties->Info->Others->UTF-8

Reading UTF-8 .properties files in Java 1.5?

I have a project where everything is in UTF-8. I was using the Properties.load(Reader) method to read properties files in this encoding. But now, I need to make the project compatible with Java 1.5, and the mentioned method doesn't exist in Java 1.5. There is only a load method that takes an InputStream as a parameter, which is assumed to be in ISO-8859-1.
Is there any simple way to make my project 1.5-compatible without having to change all the .properties files to ISO-8859-1? I don't really want to have a mix of encodings in my project (encodings are already a time sink one at a time, let alone when you mix them) or change all my project to ISO-8859-1.
With "a simple way" I mean "without creating a custom Properties class from scratch".
Could you use xml-properties instead? As I understand by the spec .properties files should be in ISO-8859-1, if you want other characters, they should be quoted, using the native2ascii tool.
One strategy that might work for this situation is as follows:
Read the bytes of the Reader into a ByteArrayOutputStream.
Once that is completed, call toByteArray() See below.
With the byte[] construct a ByteArrayInputStream
Use the ByteArrayInputStream in Properties.load(InputStream)
As pointed out, the above failed to actually convert the character set from UTF-8 to ISO-8859-1. To fix that, a tweak.
After the BAOS has been filled, instead of calling toByteArray()..
Call toString("ISO-8859-1") to get an ISO-8859-1 encoded String. Then look to..
Call String.getBytes() to get the byte[]
What you can do is open a thread that would read data using a BufferedReader then write out the data to a PipedOutputStream which is then linked by a PipedInputStream that load uses.
PipedOutputStream pos = new PipedOutputStream();
PipedInputStream pis = new PipedInputStream(pos);
ReaderRunnable reader = new ReaderRunnable(pos, new File("utfproperty.properties"));
Thread t = new Thread(reader);
t.start();
properties.load(pis);
t.join();
The BufferedReader will read the data one character at a time and if it detects it to be a character data not to be within the US-ASCII (i.e. low 7-bit) range then it writes "\u" + the character code into the PipedOutputStream.
ReaderRunnable would be a class that looks like:
public class ReaderRunnable implements Runnable {
public ReaderRunnable(OutputStream os, File f) {
this.os = os;
this.f = f;
}
private final OutputStream os;
private final File f;
public void run() {
// open file
// read file, escape any non US-ASCII characters
}
}
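If the files are small enough to hold in memory, the escaping step can also be done without a thread and pipes. The following is only a sketch with made-up names: it decodes the UTF-8 bytes, replaces every non US-ASCII character with a \uXXXX escape, and hands the result to Properties.load(InputStream). It uses StandardCharsets and UncheckedIOException for brevity, which need Java 7 and 8; on Java 1.5 you would pass the charset names "UTF-8" and "US-ASCII" and handle the IOException yourself.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class Utf8PropertiesLoader {

    /** Replaces every character outside US-ASCII with its backslash-u escape. */
    static String escapeNonAscii(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    /** Loads properties from UTF-8 bytes via the InputStream-based load method. */
    static Properties loadUtf8(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        byte[] ascii = escapeNonAscii(text).getBytes(StandardCharsets.US_ASCII);
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(ascii));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }
}
```

Since the escaped text is pure ASCII, it is also valid ISO-8859-1, which is what load(InputStream) expects.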
Now after writing all that I was thinking that someone should've had this problem before and solved it, and the best place to look for these things is in Apache Commons. Fortunately, they have an implementation there.
https://commons.apache.org/io/apidocs/org/apache/commons/io/input/ReaderInputStream.html
The implementation from Apache is not without flaws though. Your input file even if it is UTF-8 must only contain the characters from the ISO-8859-1 character set. The design I had provided above can handle that situation.
Depending on your build engine you can \uXXXX-escape the properties into the build target directory. Maven can filter them via the native2ascii-maven-plugin.
What I personally do in my projects is I keep my properties in UTF-8 files with an extension .uproperties and I convert them to ISO at the build time to .properties files using native2ascii.exe. This allows me to maintain my properties in UTF-8 and the Ant script does everything else for me.
What I just experienced: make all .java files UTF-8 encoded as well (not only the properties files where you store UTF-8 characters). This way there is no need to use an InputStreamReader either. Also, make sure to compile with UTF-8 encoding.
This has worked for me without any added UTF-8 parameter.
To test this, write a simple stub program in Eclipse and change the encoding of that .java file: go to the file's properties and, in the Resource section, set the encoding to UTF-8.
