Setting the default Java character encoding - java

How do I properly set the default character encoding used by the JVM (1.5.x) programmatically?
I have read that -Dfile.encoding=whatever used to be the way to go for older JVMs. I don't have that luxury, for reasons I won't get into.
I have tried:
System.setProperty("file.encoding", "UTF-8");
And the property gets set, but it doesn't seem to cause the final getBytes call below to use UTF8:
System.setProperty("file.encoding", "UTF-8");
byte[] inbytes = new byte[1024];
FileInputStream fis = new FileInputStream("response.txt");
int len = fis.read(inbytes);
fis.close();
FileOutputStream fos = new FileOutputStream("response-2.txt");
String in = new String(inbytes, 0, len, "UTF-8");
fos.write(in.getBytes()); // still uses the cached default encoding
fos.close();

Unfortunately, the file.encoding property has to be specified as the JVM starts up; by the time your main method is entered, the character encoding used by String.getBytes() and the default constructors of InputStreamReader and OutputStreamWriter has been permanently cached.
As Edward Grech points out, in a special case like this, the environment variable JAVA_TOOL_OPTIONS can be used to specify this property, but it's normally done like this:
java -Dfile.encoding=UTF-8 … com.x.Main
Charset.defaultCharset() will reflect changes to the file.encoding property, but most of the code in the core Java libraries that needs to determine the default character encoding does not use this mechanism.
When you are encoding or decoding, you can query the file.encoding property or Charset.defaultCharset() to find the current default encoding, and use the appropriate method or constructor overload to specify it.
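To illustrate that approach, here is a minimal sketch of querying the defaults and then passing an explicit charset to the conversion anyway (the sample string is mine, not from the question):

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) throws Exception {
        // Query the current defaults instead of relying on them implicitly.
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset());

        // Then name the encoding explicitly in every conversion.
        String s = "héllo";
        byte[] utf8 = s.getBytes("UTF-8");        // 'é' encodes as two bytes -> length 6
        byte[] latin1 = s.getBytes("ISO-8859-1"); // one byte per char       -> length 5
        System.out.println(utf8.length + " vs " + latin1.length);
    }
}
```

This way the byte counts are the same on every platform, regardless of what the JVM default happens to be.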

From the JVM™ Tool Interface documentation…
Since the command-line cannot always be accessed or modified, for example in embedded VMs or simply VMs launched deep within scripts, a JAVA_TOOL_OPTIONS variable is provided so that agents may be launched in these cases.
By setting the (Windows) environment variable JAVA_TOOL_OPTIONS to -Dfile.encoding=UTF8, the (Java) System property will be set automatically every time a JVM is started. You will know that the parameter has been picked up because the following message will be posted to System.err:
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8

I have a hacky way that definitely works!
import java.lang.reflect.Field;
import java.nio.charset.Charset;

System.setProperty("file.encoding", "UTF-8");
Field charset = Charset.class.getDeclaredField("defaultCharset");
charset.setAccessible(true);
charset.set(null, null);
This way you trick the JVM into thinking no charset has been cached yet, so the next call to Charset.defaultCharset() re-reads file.encoding and returns UTF-8 at runtime! (Note that this only resets Charset.defaultCharset(); anything that already cached the default elsewhere, such as System.out, is unaffected.)

Given that you seem to have restrictions on affecting the application deployment, let alone the platform, I think a better approach than setting the platform's default character set is to call the much safer String.getBytes("charsetName"). That way your application is not dependent on things beyond its control.
I personally feel that String.getBytes() should be deprecated, as it has caused serious problems in a number of cases I have seen, where the developer did not account for the default charset possibly changing.

I can't answer your original question but I would like to offer you some advice -- don't depend on the JVM's default encoding. It's always best to explicitly specify the desired encoding (i.e. "UTF-8") in your code. That way, you know it will work even across different systems and JVM configurations.
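As a sketch of that advice, the question's copy snippet can be made independent of the JVM default by naming the charset on both the reader and the writer (try-with-resources here assumes Java 7+; the file names are the ones from the question):

```java
import java.io.*;

public class ExplicitEncodingCopy {
    public static void main(String[] args) throws IOException {
        // Name the encoding explicitly on both sides, so the copy does not
        // depend on the platform's default charset.
        try (Reader in = new InputStreamReader(new FileInputStream("response.txt"), "UTF-8");
             Writer out = new OutputStreamWriter(new FileOutputStream("response-2.txt"), "UTF-8")) {
            char[] buf = new char[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}
```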

Try this:
new OutputStreamWriter(new FileOutputStream("Your_file_fullpath"), Charset.forName("UTF-8"))

I have tried a lot of things, but the sample code here works perfectly.
Link
The crux of the code is:
String s = "एक गाव में एक किसान";
String out = new String(s.getBytes("UTF-8"), "ISO-8859-1");

We were having the same issues. We methodically tried several suggestions from this article (and others) to no avail. We also tried adding -Dfile.encoding=UTF8, and nothing seemed to work.
For people having this issue, the following article finally helped us track it down; it describes how the locale setting can break Unicode/UTF-8 in Java/Tomcat:
http://www.jvmhost.com/articles/locale-breaks-unicode-utf-8-java-tomcat
Setting the locale correctly in the ~/.bashrc file worked for us.
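A sketch of what that looks like; the specific en_US value is an assumption, so pick a UTF-8 locale that `locale -a` lists on your machine:

```shell
# Append to ~/.bashrc, then log in again or run `source ~/.bashrc`:
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
```

The JVM derives its default charset from the locale environment, so any Java/Tomcat process started from that shell will then default to UTF-8.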

In case you are using Spring Boot and want to pass the file.encoding argument to the JVM, you have to run it like this:
mvn spring-boot:run -Drun.jvmArguments="-Dfile.encoding=UTF-8"
We needed this because we were using JTwig templates and the operating system had ANSI_X3.4-1968, which we found out through System.out.println(System.getProperty("file.encoding"));
Hope this helps someone!

My team encountered the same issue on Windows machines, and managed to resolve it in two ways:
a) Set an environment variable (even in Windows system preferences):
JAVA_TOOL_OPTIONS
-Dfile.encoding=UTF8
b) Introduce the following snippet into your pom.xml, adding -Dfile.encoding=UTF-8 within <jvmArguments>:
<jvmArguments>
    -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8001
    -Dfile.encoding=UTF-8
</jvmArguments>
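For context, here is a minimal sketch of where that element usually lives in a pom.xml; the surrounding spring-boot-maven-plugin coordinates are my assumption, not something stated in the original answer:

```xml
<!-- Hypothetical surrounding plugin configuration -->
<plugin>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-maven-plugin</artifactId>
    <configuration>
        <jvmArguments>
            -Dfile.encoding=UTF-8
        </jvmArguments>
    </configuration>
</plugin>
```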

I'm using Amazon (AWS) Elastic Beanstalk and successfully changed it to UTF-8.
In Elastic Beanstalk, go to Configuration > Software, "Environment properties".
Add (name) JAVA_TOOL_OPTIONS with (value) -Dfile.encoding=UTF8
After saving, the environment will restart with the UTF-8 encoding.

Not clear on what you do and don't have control over at this point. If you can interpose a different OutputStream class on the destination file, you could use a subtype of OutputStream which converts Strings to bytes under a charset you define, say UTF-8 by default. If modified UTF-8 is sufficient for your needs, you can use DataOutputStream.writeUTF(String):
byte[] inbytes = new byte[1024];
FileInputStream fis = new FileInputStream("response.txt");
int len = fis.read(inbytes);
fis.close();
String in = new String(inbytes, 0, len, "UTF-8");
DataOutputStream out = new DataOutputStream(new FileOutputStream("response-2.txt"));
out.writeUTF(in); // no getBytes() here
out.close();
If this approach is not feasible, it may help if you clarify here exactly what you can and can't control in terms of data flow and execution environment (though I know that's sometimes easier said than determined). Good luck.

The command
mvn clean install -Dfile.encoding=UTF-8 -Dmaven.repo.local=/path-to-m2
worked with exec-maven-plugin to resolve the following error while configuring a Jenkins task:
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
Error occurred during initialization of VM
java.nio.charset.IllegalCharsetNameException: "UTF-8"
at java.nio.charset.Charset.checkName(Charset.java:315)
at java.nio.charset.Charset.lookup2(Charset.java:484)
at java.nio.charset.Charset.lookup(Charset.java:464)
at java.nio.charset.Charset.defaultCharset(Charset.java:609)
at sun.nio.cs.StreamEncoder.forOutputStreamWriter(StreamEncoder.java:56)
at java.io.OutputStreamWriter.<init>(OutputStreamWriter.java:111)
at java.io.PrintStream.<init>(PrintStream.java:104)
at java.io.PrintStream.<init>(PrintStream.java:151)
at java.lang.System.newPrintStream(System.java:1148)
at java.lang.System.initializeSystemClass(System.java:1192)

I solved this problem in my project; hope it helps someone.
I use the LIBGDX Java framework and also had this issue in my Android Studio project.
On Mac OS the encoding is correct, but on Windows 10 special characters, symbols, and Russian characters show up as question marks: ????? and other incorrect symbols.
Change this in the Android Studio project settings:
File -> Settings... -> Editor -> File Encodings: set UTF-8 in all three fields (Global Encoding, Project Encoding, and Default below).
In any java file set:
System.setProperty("file.encoding","UTF-8");
And for test print debug log:
System.out.println("My project encoding is : "+ Charset.defaultCharset());

Setting these JVM arguments when starting the application resolved the issue for me: java -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8
file.encoding=UTF-8 - This ensures Unicode characters in file contents are handled correctly.
sun.jnu.encoding=UTF-8 - This ensures Unicode characters in file names on the file system are handled correctly.

We set these two system properties together, and it makes the system handle everything as UTF-8:
file.encoding=UTF8
client.encoding.override=UTF-8

Following @Caspar's comment on the accepted answer, the preferred way to fix this according to Sun is to:
"change the locale of the underlying platform before starting your Java program."
http://bugs.java.com/view_bug.do?bug_id=4163515
For docker see:
http://jaredmarkell.com/docker-and-locales/

Recently I bumped into a local company's Notes 6.5 system and found that its webmail would show unidentifiable characters on a Windows installation with a non-Chinese (Zhongwen) locale. After digging online for several weeks, I figured it out just a few minutes ago:
In the Java properties, add the following string to the Runtime Parameters:
-Dfile.encoding=MS950 -Duser.language=zh -Duser.country=TW -Dsun.jnu.encoding=MS950
UTF-8 setting would not work in this case.

Related

How to configure EC2 to process emoji in my spring boot app

I'm currently trying to run a Telegram bot on an EC2 instance.
But the problem is: all non-English symbols are replaced.
On the screenshot you can see how emojis are replaced (e.g. before the 'Settings' word), and how a Russian word is totally messed up.
What I have tried so far:
Run java with arguments:
java -Dfile.encoding=UTF-8 -Duser.language=en -Duser.country=US -jar
Set locale in application.properties
spring.mandatory-file-encoding=UTF-8
spring.http.encoding.charset=UTF-8
spring.http.encoding.enabled=true
Set /etc/environment locale
JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk"
LANG=en_US.utf-8
LC_ALL=en_US.utf-8
Please note: my messages / button text values are stored in appropriate locale bundles in the app resources. And considering that the bot still works (it recognizes the value of the message it receives even if it's messed up), I assume this has something to do with the Java app.
P.S. When I run it locally - it works perfectly.
Any help will be highly appreciated!
After tons of answers and attempts, here's my solution:
My text values are stored in .properties bundles, and the standard Java API is designed to use ISO 8859-1 encoding for properties files.
https://www.jetbrains.com/help/idea/properties-files.html#
So before running 'mvn clean package', I manually converted my properties files with:
native2ascii -encoding UTF-8 src/main/resources/messages_en.properties src/main/resources/messages_en.properties
Hope it helps.
If anyone has better solution - please let me know.

What type of line delimiters does wsimport use?

I am working on a client application using Eclipse, and we are having all kinds of problems committing, merging, comparing, etc. with CVS (I know it's CVS, but this is bad). I'm thinking that since I am on Windows, we are running into issues with line delimiters. Our CVS server runs on Windows (again, I know; comments are welcome to show our current dev environment is broken and I'm not just blowing smoke here).
Currently all of our projects are using Cp1252 with Windows-style delimiters. I would like to change the default text file encoding to UTF-8 and UNIX-style delimiters, and I was wondering if someone who has gone through this transition could comment.
Also, since we use WSImport to create our client web services I am trying to figure out what type of line delimiters it uses. Does any one know?
Thanks,
JD
Since wsimport uses an instance of java.io.PrintWriter to write the Java source code, and each line is written with the method java.io.PrintWriter.println, the line delimiter depends on the environment (the line.separator system property). You can override it before invoking wsimport:
import com.sun.tools.internal.ws.WsImport;
public class Main {
static {
System.getProperties().put("line.separator", "\n"); // or "\r\n"
}
public static void main(String[] args) throws Throwable {
WsImport.main("service.wsdl -Xnocompile".split("\\s+"));
}
}
As for the charset:
The default charset is determined during virtual-machine startup and typically depends upon the locale and charset of the underlying operating system.
You can specify it with the following JVM argument:
-Dfile.encoding=UTF-8

Who resets JVM file.encoding back to original?

System.setProperty("file.encoding", "utf-8");
The comment below implies that file.encoding would be changed for all apps running on the same JVM, however, I don't observe this kind of behaviour.
Setting a system property programmatically will affect all code running within the same JVM, which is hazardous, especially when discussing such a low-level system property.
I have read this question and understand that there are many issues with caching and Java 1.5
Setting the default Java character encoding?
Please, now consider the following code:
public class FileEncodingTest {
public static void main (String[] args) {
System.out.println(System.getProperty("file.encoding"));
System.setProperty("file.encoding", "UTF-8");
System.out.println(System.getProperty("file.encoding"));
}
}
Then I create a jar file using Eclipse, with Java 1.6 set in the project configuration.
Then I run the jar file with Java 1.7; all of this happens under Windows 7.
java -jar FileEncodingTest.jar
Cp1251
UTF-8
java -jar FileEncodingTest.jar
Cp1251
UTF-8
So who and why resets the value of file.encoding back to Cp1251?
UPD:
Can anyone explain, or provide a link which explains, step by step, what happens in terms of JVM processes when I type java -jar MyClass.jar?
You started 2 VMs, one with each "java -jar" command.
You can change the encoding your project uses by editing the project properties in Eclipse.
But note that when you hardcode things that rely on the file format, and another project uses your implementation, there will be problems. That's what the comment means.
It's just like opening an IE browser: it goes to the homepage at first. If you visit another website and then open another IE window, it will still show the homepage.
JVMs are quite similar: two different processes of a Java program use two different JVMs. This means that when one program ends, the file.encoding property will be back at its default again.

How to set system properties through a file with Oracle's JVM

According to Oracle, the only way to set system properties is through command line -D parameters, like this:
java -Dmy.prop=value com.package.MyClass
Is it really the only way ? Isn't it possible to create some system.properties file that will contain all these properties, and that would be automagically read when the JVM starts ?
To be precise, I cannot use the System.setProperty(String,String) method.[1]
Setting this file through a command line parameter would be fine as well :
java -Fsystem.properties com.package.MyClass
I have searched where I know (and found there is a way with IBM's JVM), but I'm still empty-handed...
[1] : The goal is to set the default Charset, and this is primarily done through the file.encoding property, but only at the VM startup phase. Setting this property in runtime doesn't change the default Charset, and there is also no way to change it 'programmatically'.
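A small sketch of the behavior described in that footnote: setting file.encoding at runtime changes the property string but not the cached default Charset (this is the observed behavior on JVMs where defaultCharset is fixed at startup):

```java
import java.nio.charset.Charset;

public class SetPropertyDemo {
    public static void main(String[] args) {
        Charset before = Charset.defaultCharset();
        // Change the property string at runtime...
        System.setProperty("file.encoding", "UTF-16");
        // ...but the charset cached at VM startup is unaffected.
        Charset after = Charset.defaultCharset();
        System.out.println(before.equals(after)); // prints true: the default did not change
    }
}
```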

International JRE6 or JDK6 or reading a file in "cp037" encoding scheme

I have been trying to read a file in the "cp037" encoding scheme using Java. I am able to read files in basic encoding schemes like UTF-8, UTF-16, etc. After a bit of research on the internet, I came to know that we need charsets.jar or an international version of the JRE installed to support extended encoding schemes. Can anyone send me a link for an international version of JRE6 or JDK6? Or is there a better way to read a file in the cp037 encoding scheme?
P.S.: cp037 is a character encoding scheme supported by IBM Mainframes. All I need is to display, in Windows, a file generated on an IBM Mainframe machine, using a Java program.
Thanks in advance for your help... :-)
I found this webpage for Java 5 (note, it might be different for Java 6). There is no special, separate "international" version of the JRE or JDK; however, the lib\charsets.jar may or may not be installed on your system depending on what your operating system supports.
Are you sure there is no charsets.jar in your JRE installation directory? On my system, it's under %JDK_HOME%\jre\lib. (Note: NOT under %JDK_HOME%\lib).
Search your system to see if you already have charsets.jar somewhere. (Note that it's called charsets.jar with an s, not charset.jar).
After a bit of research on the internet i came to know that we need charset.jar or international version of JRE be installed to support extended encoding schemes.
Are you sure that this charset isn't included in the standard distribution though?
This code works fine for me on JDK 1.6.0_17 64-bit (Windows):
Charset charset = Charset.forName("cp037");
BufferedReader br = new BufferedReader(new InputStreamReader(
        new FileInputStream(f), charset));
String line = null;
while ((line = br.readLine()) != null) {
    System.out.println("read line: " + line);
}
As Reddy says, sometimes charsets.jar is there; however, the issue still exists.
The only solution I found was to add the jar to the JRE system libraries using the build path functionality of Eclipse.
The change was made to the default Windows Java version.
A different option would be to install the JDK instead of the JRE.
Some info links about the java encode schemes:
http://www.oracle.com/technetwork/java/javase/readme-142177.html
http://download.oracle.com/javase/6/docs/technotes/guides/intl/encoding.doc.html
Here is slightly different code, with Java 1.6 on WinXP, getting text from a Mainframe over a web upload:
String text = new String(data, 0, data.length, "Cp037");
text = text.replace('', 'a');
text = text.replace('Ý', '[');
text = text.replace('¨', ']');
Note, that there are several characters, which need special attention.
