JDK 1.5 Properties load with unicode characters - java

The JDK 1.5 Properties.load method only accepts an InputStream, while JDK 1.6+ also offers an overload that accepts a Reader. When a string with Unicode characters is loaded into a Properties object on JDK 1.6+ via load(Reader), there is no problem. But JDK 1.5 has only the load(InputStream) method, and with it the Unicode characters are not loaded correctly.
Properties props = new Properties();
ByteArrayInputStream bis = null;
Reader reader = null;
try {
    bis = new ByteArrayInputStream(someStringWithUnicodeChars.getBytes("UTF-8"));
    reader = new InputStreamReader(bis, "UTF-8");
} catch (UnsupportedEncodingException ex) {
    ex.printStackTrace();
}
props.load(reader); // This reads unicode characters correctly on JDK 1.6+
// There is no props.load(reader) method on JDK 1.5, so the method below is used
props.load(bis);
// but Unicode characters are not loaded correctly.
How can I load the following example string with Unicode characters into a Properties object?
key1=test İ Ş Ğ
key2=ÇÇÇÇ

From the 1.5 javadoc: "The stream is assumed to be using the ISO 8859-1 character encoding."
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/Properties.html#load(java.io.InputStream)
Try this:
InputStream in = new ByteArrayInputStream(someStringWithUnicodeChars.getBytes("ISO-8859-1"));
Properties props = new Properties();
props.load(in);
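Note that getBytes("ISO-8859-1") silently replaces characters outside Latin-1 (such as İ, Ş, Ğ from the question) with '?', so this alone will not preserve them. A JDK 1.5-friendly workaround is to escape such characters as \uXXXX first, which Properties.load(InputStream) decodes; a minimal sketch (escapeUnicode is a made-up helper name):
// Hypothetical helper: escape chars outside ISO-8859-1 as \uXXXX,
// the form Properties.load(InputStream) decodes on JDK 1.5.
static String escapeUnicode(String s) {
    StringBuffer sb = new StringBuffer(s.length());
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c > 0xFF) {
            sb.append(String.format("\\u%04X", (int) c));
        } else {
            sb.append(c);
        }
    }
    return sb.toString();
}

// Usage:
// props.load(new ByteArrayInputStream(
//         escapeUnicode(someStringWithUnicodeChars).getBytes("ISO-8859-1")));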

For exactly this purpose, the JDK ships with the tool native2ascii[.exe].
1) create the properties file as UTF-8, named for example sample.native
2) convert the native properties file to Unicode escape sequences: native2ascii sample.native > sample.properties
3) load and process the properties file
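For reference, step 2 would turn the question's sample keys into pure-ASCII escape sequences, roughly like this (the exact case of the hex digits may vary):
key1=test \u0130 \u015e \u011e
key2=\u00c7\u00c7\u00c7\u00c7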
// example: you will see the right UTF-8 characters only if your console supports UTF-8
// note: this demo uses Java 6/7 APIs (stringPropertyNames, try-with-resources);
// on 1.5, iterate propertyNames() and close the stream in a finally block instead
import java.io.FileInputStream;
import java.util.Properties;

class PropsFile {
    public static void main(String[] args) throws Exception {
        try (FileInputStream fis = new FileInputStream("sample.properties")) {
            Properties props = new Properties();
            props.load(fis);
            for (String name : props.stringPropertyNames()) {
                System.out.println(name + "=" + props.getProperty(name));
            }
        }
    }
}

Android Resources.openRawResource() encoding issue [duplicate]

I am reading a property file which consists of a message in the UTF-8 character set.
Problem
The output is not in the appropriate format. I am using an InputStream.
The property file looks like
username=LBSUSER
password=Lbs#123
url=http://localhost:1010/soapfe/services/MessagingWS
timeout=20000
message=Spanish character are = {á é í, ó,ú ,ü, ñ, ç, å, Á, É, Í, Ó, Ú, Ü, Ñ, Ç, ¿, °, 4° año = cuarto año, €, ¢, £, ¥}
And I am reading the file like this,
Properties props = new Properties();
props.load(new FileInputStream("uinsoaptest.properties"));
String username = props.getProperty("username", "test");
String password = props.getProperty("password", "12345");
String url = props.getProperty("url", "12345");
int timeout = Integer.parseInt(props.getProperty("timeout", "8000"));
String messagetext = props.getProperty("message");
System.out.println("This is soap msg : " + messagetext);
The output of the above message is garbled in the console. [Screenshot of the console output, shown after the line {************************ SOAP MESSAGE TEST***********************}, omitted.]
I would be obliged for any help reading this file properly. I can read the file with another approach, but I am looking for a solution with fewer code modifications.
Use an InputStreamReader with Properties.load(Reader reader):
FileInputStream input = new FileInputStream(new File("uinsoaptest.properties"));
props.load(new InputStreamReader(input, Charset.forName("UTF-8")));
As a method, this may resemble the following:
private Properties read( final Path file ) throws IOException {
    final var properties = new Properties();

    try( final var in = new InputStreamReader(
            new FileInputStream( file.toFile() ), StandardCharsets.UTF_8 ) ) {
        properties.load( in );
    }

    return properties;
}
Don't forget to close your streams. Java 7 introduced StandardCharsets.UTF_8.
Use props.load(new FileReader("uinsoaptest.properties")) only if you can control the default encoding: FileReader uses Charset.forName(System.getProperty("file.encoding")), which can be set with the command-line parameter -Dfile.encoding=UTF-8. Note that calling System.setProperty("file.encoding", "UTF-8") at runtime does not help, because the default charset is cached when the JVM starts.
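A quick way to check which default encoding your JVM actually picked up (class name is just for illustration):
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    public static void main(String[] args) {
        // FileReader uses this default, fixed at JVM startup
        System.out.println("file.encoding = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset());
    }
}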
If somebody uses the @Value annotation, they could try StringUtils (from Apache Commons Lang):
@Value("${title}")
private String pageTitle;

public String getPageTitle() {
    return StringUtils.toEncodedString(pageTitle.getBytes(Charset.forName("ISO-8859-1")), Charset.forName("UTF-8"));
}
You should specify the UTF-8 encoding when you construct the Reader around your FileInputStream; note that FileInputStream itself has no encoding parameter, so wrap it like this:
new InputStreamReader(new FileInputStream("uinsoaptest.properties"), "UTF-8");
If you want your JVM to read UTF-8 files by default, set the JAVA_TOOL_OPTIONS environment variable in your JVM options to something like this:
-Dfile.encoding=UTF-8
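For example, on a Unix-like shell:
export JAVA_TOOL_OPTIONS="-Dfile.encoding=UTF-8"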
If anybody comes across this problem in Kotlin, like me: the accepted solution of @Würgspaß works here as well. The corresponding Kotlin syntax:
Instead of the usual
val properties = Properties()
filePath.toFile().inputStream().use { stream -> properties.load(stream) }
I had to use
val properties = Properties()
InputStreamReader(FileInputStream(filePath.toFile()), StandardCharsets.UTF_8).use { stream -> properties.load(stream) }
With this, special UTF-8 characters are loaded correctly from the properties file given in filePath.

Read and Write file in java whilst keeping the special characters

After reading and writing the file, the bullet points get replaced with the unreadable replacement character "�". Here is the code:
String str = FileUtils.readFileToString(new File(sourcePath), "UTF-8");
nextTextFile.append(redactStrings(str, redactedStrings));
FileUtils.writeStringToFile(new File(targetPath), nextTextFile.toString());
Link to sample file
generated file with funny characters
I checked it out on Windows and if the source file is encoded in UTF-8, the following code will produce the desired output to the console and to a file, which is then encoded in UTF-8 as well, making use of java.nio:
public static void main(String[] args) {
    Path inPath = Paths.get(sourcePath);
    Path outPath = Paths.get(targetPath);
    try {
        List<String> lines = Files.readAllLines(inPath, StandardCharsets.UTF_8);
        lines.forEach(line -> {
            System.out.println(line);
        });
        Files.write(outPath, lines, StandardCharsets.UTF_8);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Please note that the source file has to be encoded in UTF-8, otherwise this may throw an IOException stating something like Input length = 1. Play around with the StandardCharsets if it does not meet your requirements or make sure the encoding of the source file is UTF-8.
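If the encoding of the source file cannot be guaranteed, a crude fallback is to retry with a single-byte charset, since ISO-8859-1 can decode any byte sequence. A sketch (readLinesLenient is a made-up name):
import java.io.IOException;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: try UTF-8 first, then fall back to ISO-8859-1.
static List<String> readLinesLenient(Path path) throws IOException {
    try {
        return Files.readAllLines(path, StandardCharsets.UTF_8);
    } catch (MalformedInputException e) {
        // Not valid UTF-8; ISO-8859-1 maps every byte to a char, so it cannot fail.
        return Files.readAllLines(path, StandardCharsets.ISO_8859_1);
    }
}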

BeanIO - UnidentifiedRecordException when parsing UTF8 file

I have a problem when parsing a file that is encoded with UTF8.
I have two files which are completely identical, except for their encoding. (I simply copied the file and saved it with UTF-8, so the contents are identical.) One is encoded using ANSI, the other with UTF-8. The file encoded with ANSI is successfully parsed, while the other file causes BeanIO to throw an UnidentifiedRecordException when calling the BeanReader.read() method:
org.beanio.UnidentifiedRecordException: Unidentified record at line 1
I have tried to solve this by explicitly setting the encoding to UTF8 using this code:
public static BeanReader getBeanReader(File file, StreamBuilder builder) {
    StreamFactory factory = StreamFactory.newInstance();
    factory.define(builder);
    InputStream iStream;
    try {
        iStream = new FileInputStream(file);
    } catch (FileNotFoundException e) {
        throw new CustomException("Could not create BeanReader, file not found", e);
    }
    Reader reader = new InputStreamReader(iStream, StandardCharsets.UTF_8);
    return factory.createReader("reader", reader);
}
which doesn't solve the issue.
What could be the reason for this error?
As the first line is claimed erroneous, did you save the UTF-8 file without a BOM (that infamous zero-width no-break space at the start of the file)?
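If re-saving the file without a BOM is not an option, you can strip a leading BOM yourself before handing the Reader to BeanIO. A minimal sketch (skipBom is a made-up helper name):
import java.io.*;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: returns a Reader positioned past a leading BOM, if present.
static Reader skipBom(InputStream in) throws IOException {
    PushbackReader reader = new PushbackReader(
            new InputStreamReader(in, StandardCharsets.UTF_8), 1);
    int first = reader.read();
    if (first != -1 && first != '\uFEFF') {
        reader.unread(first); // not a BOM, push it back
    }
    return reader;
}

// In the method above, the last line would then become:
// return factory.createReader("reader", skipBom(iStream));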

Special characters are not converted correctly from pdf to text

I have a set of PDF files that contain Central European characters such as č, Ď, Š, and so on. I want to convert them to text, and I have tried pdftotext and PDFBox through Apache Tika, but some characters are always converted incorrectly.
The strange thing is that the same character in the same text is correctly converted at some places and incorrectly at some others! An example is this pdf.
In the case of pdftotext I am using these options:
pdftotext -nopgbrk -eol dos -enc UTF-8 070612.pdf
My Tika code looks like this:
String newname = f.getCanonicalPath().replace(".pdf", ".txt");
OutputStreamWriter print = new OutputStreamWriter(new FileOutputStream(newname), Charset.forName("UTF-16"));
String fileString = "path\to\myfiles\"
try {
    is = new FileInputStream(f);
    ContentHandler contenthandler = new BodyContentHandler(10*1024*1024);
    Metadata metadata = new Metadata();
    PDFParser pdfparser = new PDFParser();
    pdfparser.parse(is, contenthandler, metadata, new ParseContext());
    String outputString = contenthandler.toString();
    outputString = outputString.replace("\n", "\r\n");
    System.err.println("Writing now file " + newname);
    print.write(outputString);
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (is != null) is.close();
    print.close();
}
Edit: Forgot to mention that I am facing the same issue when converting to text from Acrobat Reader XI, as well.
Well aside from anything else, this code will use the platform default encoding:
PrintWriter print = new PrintWriter(newname);
print.print(outputString);
print.close();
I suggest you use an OutputStreamWriter wrapping a FileOutputStream instead, and specify UTF-8 as the encoding (as it can encode all of Unicode, and is generally well supported).
You should also close the writer in a finally block, and I'd probably separate the "reading" part from the "writing" part. (I'd avoid catching Exception too, but going into the details of exception handling is a bit beyond the point of this answer.)
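A minimal sketch of that suggestion, reusing newname and outputString from the question (assuming java.nio.charset.StandardCharsets is imported):
Writer print = null;
try {
    print = new OutputStreamWriter(new FileOutputStream(newname),
            StandardCharsets.UTF_8);
    print.write(outputString);
} finally {
    if (print != null) {
        print.close();
    }
}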

How to save Chinese Characters to file with java?

I use the following code to save Chinese characters into a .txt file, but when I opened it with Wordpad, I couldn't read it.
StringBuffer Shanghai_StrBuf = new StringBuffer("\u4E0A\u6D77");
boolean Append = true;
FileOutputStream fos;
fos = new FileOutputStream(FileName, Append);
for (int i = 0; i < Shanghai_StrBuf.length(); i++) {
    fos.write(Shanghai_StrBuf.charAt(i));
}
fos.close();
What can I do ? I know if I cut and paste Chinese characters into Wordpad, I can save it into a .txt file. How do I do that in Java ?
There are several factors at work here:
Text files have no intrinsic metadata for describing their encoding (for all the talk of angle-bracket taxes, there are reasons XML is popular)
The default encoding for Windows is still an 8-bit (or double-byte) "ANSI" character set with a limited range of values; text files written in this format are not portable
To tell a Unicode file from an ANSI file, Windows apps rely on the presence of a byte order mark at the start of the file (not strictly true - Raymond Chen explains). In theory, the BOM is there to tell you the endianess (byte order) of the data. For UTF-8, even though there is only one byte order, Windows apps rely on the marker bytes to automatically figure out that it is Unicode (though you'll note that Notepad has an encoding option on its open/save dialogs).
It is wrong to say that Java is broken because it does not write a UTF-8 BOM automatically. On Unix systems, it would be an error to write a BOM to a script file, for example, and many Unix systems use UTF-8 as their default encoding. There are times when you don't want it on Windows, either, like when you're appending data to an existing file: fos = new FileOutputStream(FileName,Append);
Here is a method of reliably appending UTF-8 data to a file:
private static void writeUtf8ToFile(File file, boolean append, String data)
        throws IOException {
    boolean skipBOM = append && file.isFile() && (file.length() > 0);
    Closer res = new Closer();
    try {
        OutputStream out = res.using(new FileOutputStream(file, append));
        Writer writer = res.using(new OutputStreamWriter(out, Charset.forName("UTF-8")));
        if (!skipBOM) {
            writer.write('\uFEFF');
        }
        writer.write(data);
    } finally {
        res.close();
    }
}
Usage:
public static void main(String[] args) throws IOException {
    String chinese = "\u4E0A\u6D77";
    boolean append = true;
    writeUtf8ToFile(new File("chinese.txt"), append, chinese);
}
Note: if the file already exists and you choose to append, and the existing data isn't UTF-8 encoded, the only thing that code will create is a mess.
Here is the Closer type used in this code:
public class Closer implements Closeable {
    private Closeable closeable;

    public <T extends Closeable> T using(T t) {
        closeable = t;
        return t;
    }

    @Override public void close() throws IOException {
        if (closeable != null) {
            closeable.close();
        }
    }
}
This code makes a Windows-style best guess about how to read the file based on byte order marks:
private static final Charset[] UTF_ENCODINGS = { Charset.forName("UTF-8"),
        Charset.forName("UTF-16LE"), Charset.forName("UTF-16BE") };

private static Charset getEncoding(InputStream in) throws IOException {
    charsetLoop: for (Charset encoding : UTF_ENCODINGS) {
        byte[] bom = "\uFEFF".getBytes(encoding);
        in.mark(bom.length);
        for (byte b : bom) {
            if ((0xFF & b) != in.read()) {
                in.reset();
                continue charsetLoop;
            }
        }
        return encoding;
    }
    return Charset.defaultCharset();
}

private static String readText(File file) throws IOException {
    Closer res = new Closer();
    try {
        InputStream in = res.using(new FileInputStream(file));
        InputStream bin = res.using(new BufferedInputStream(in));
        Reader reader = res.using(new InputStreamReader(bin, getEncoding(bin)));
        StringBuilder out = new StringBuilder();
        for (int ch = reader.read(); ch != -1; ch = reader.read()) {
            out.append((char) ch);
        }
        return out.toString();
    } finally {
        res.close();
    }
}
Usage:
public static void main(String[] args) throws IOException {
    System.out.println(readText(new File("chinese.txt")));
}
(System.out uses the default encoding, so whether it prints anything sensible depends on your platform and configuration.)
If you can rely on the default character encoding being UTF-8 (or some other Unicode encoding), you may use the following:
Writer w = new FileWriter("test.txt");
w.append("上海");
w.close();
The safest way is to always explicitly specify the encoding:
Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");
w.append("上海");
w.close();
P.S. You may use any Unicode characters in Java source code, even as method and variable names, if the -encoding parameter for javac is configured right. That makes the source code more readable than the escaped \uXXXX form.
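For example:
javac -encoding UTF-8 Main.java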
Be very careful with the approaches proposed. Even specifying the encoding for the file as follows:
Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");
will not work if you're running under an operating system like Windows. Even setting the system property for file.encoding to UTF-8 does not fix the issue. This is because Java fails to write a byte order mark (BOM) for the file. Even if you specify the encoding when writing out to a file, opening the same file in an application like Wordpad will display the text as garbage because it doesn't detect the BOM. I tried running the examples here in Windows (with a platform/container encoding of CP1252).
The following bug exists to describe the issue in Java:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
The solution for the time being is to write the byte order mark yourself to ensure the file opens correctly in other applications. See this for more details on the BOM:
http://mindprod.com/jgloss/bom.html
and for a more correct solution see the following link:
http://tripoverit.blogspot.com/2007/04/javas-utf-8-and-unicode-writing-is.html
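The workaround boils down to writing U+FEFF yourself as the first character of a new file; a minimal sketch:
Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");
w.write('\uFEFF'); // BOM, so Windows editors detect the file as UTF-8
w.write("上海");
w.close();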
Here's one way among many. Basically, we're just specifying that the conversion be done to UTF-8 before outputting bytes to the FileOutputStream:
String FileName = "output.txt";
StringBuffer Shanghai_StrBuf=new StringBuffer("\u4E0A\u6D77");
boolean Append=true;
Writer writer = new OutputStreamWriter(new FileOutputStream(FileName,Append), "UTF-8");
writer.write(Shanghai_StrBuf.toString(), 0, Shanghai_StrBuf.length());
writer.close();
I manually verified this against the images at http://www.fileformat.info/info/unicode/char/ . In the future, please follow Java coding standards, including lower-case variable names. It improves readability.
Try this:
StringBuffer Shanghai_StrBuf = new StringBuffer("\u4E0A\u6D77");
boolean Append = true;
Writer out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(FileName, Append), "UTF8"));
for (int i = 0; i < Shanghai_StrBuf.length(); i++) {
    out.write(Shanghai_StrBuf.charAt(i));
}
out.close();
