How to read Unicode characters in Java

I am trying to read Unicode characters from a text file saved in UTF-8, using Java.
My text file is as follows:
अ, अदेबानि ,अन, अनसुला, अनसुलि, अनफावरि, अनजालु, अनद्ला, अमा, अर, अरगा, अरगे, अरन, अराय, अलखद, असे, अहा, अहिंसा, अग्रं, अन्थाइ, अफ्रि,
बियन, खियन, फियन, बन, गन, थन, हर, हम, जम, गल, गथ, दरसे, दरनै, थनै, थथाम, सथाम,
खफ, गल, गथ, मिख, जथ, जाथ, थाथ, दद, देख, न, नेथ, बर, बुंथ, बिथ, बिख, बेल, मम,
आ, आइ, आउ, आगदा, आगसिर
I have tried the following code:
import java.io.*;

public class UcharRead {
    public static void main(String[] args) {
        try {
            String str;
            BufferedReader bufReader = new BufferedReader(
                    new InputStreamReader(new FileInputStream("research_words.txt"), "UTF-8"));
            while ((str = bufReader.readLine()) != null) {
                System.out.println(str);
            }
        } catch (Exception e) {
            e.printStackTrace(); // don't swallow exceptions silently
        }
    }
}
I am getting output as:
????????????????????????
Can anyone help me?

You are (most likely) reading the text correctly, but when you write it out, you also need to enable UTF-8. Otherwise every character that cannot be printed in your default encoding will be turned into question marks.
Try writing it to a File instead of System.out (and specify the proper encoding):
Writer w = new OutputStreamWriter(new FileOutputStream("x.txt"), "UTF-8");
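Putting the two halves together, a minimal sketch (hypothetical CopyUtf8 class; it reuses the research_words.txt input from the question, and the x.txt output name is just an example) might look like this:
import java.io.*;

public class CopyUtf8 {
    public static void main(String[] args) throws IOException {
        // Read the UTF-8 input and write it back out, using UTF-8 on both sides.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                     new FileInputStream("research_words.txt"), "UTF-8"));
             Writer writer = new OutputStreamWriter(
                     new FileOutputStream("x.txt"), "UTF-8")) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.write(System.lineSeparator());
            }
        }
    }
}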

If you are reading the text properly using UTF-8 encoding, then make sure that your console also supports UTF-8. If you are using Eclipse, you can enable UTF-8 encoding for your console via:
Run Configuration -> Common -> Encoding -> Select UTF-8
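Outside Eclipse, a similar effect can be had programmatically by replacing System.out with a UTF-8 PrintStream. A small sketch (hypothetical Utf8Console class), assuming your terminal itself can render the glyphs:
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Wrap stdout in a PrintStream that encodes output as UTF-8.
        System.setOut(new PrintStream(System.out, true, "UTF-8"));
        System.out.println("अ, अदेबानि, अन");
    }
}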

You're reading it correctly - the problem is almost certainly just that your console can't handle the text. The simplest way to verify this is to print out each char within the string. For example:
public static void dumpString(String text) {
    for (int i = 0; i < text.length(); i++) {
        char c = text.charAt(i);
        System.out.printf("%c - %04x%n", c, (int) c);
    }
}
You can then verify that each character is correct using the Unicode code charts.
Once you've verified that you're reading the file correctly, you can then work on the output side of things - but it's important to try to focus on one side of it at a time. Trying to diagnose potential failures in both input and output encodings at the same time is very hard.

Related

“UTF-8” encoding is not working in java build [closed]

I saved my Java source file in Eclipse, specifying its encoding as UTF-8. It works fine in Eclipse.
When I create a build with Maven and execute it on my system, the Unicode characters do not work.
This is my code:
byte[] bytes = new byte[dataLength];
buffer.readBytes(bytes);
String s = new String(bytes, Charset.forName("UTF-8"));
System.out.println(s);
Screenshots of the Eclipse console and the Windows console were attached.
I expect the Eclipse output to appear the same on other systems (Windows command prompt, PowerShell window, a Linux machine, etc.).
You could use the Console class for that. The following code could give you some inspiration:
import java.io.Console;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

public class Foo {
    public static void main(String[] args) throws IOException {
        String s = "öäü";
        write(s);
    }

    private static void write(String s) throws IOException {
        String encoding = new OutputStreamWriter(System.out).getEncoding();
        Console console = System.console();
        if (console != null) {
            // If there is a console attached to the JVM, use it.
            System.out.println("Using encoding " + encoding + " (Console)");
            try (PrintWriter writer = console.writer()) {
                writer.write(s);
                writer.flush();
            }
        } else {
            // Fall back to "normal" System.out.
            System.out.println("Using encoding " + encoding + " (System out)");
            System.out.print(s);
        }
    }
}
Tested on Windows 10 (PowerShell) and Ubuntu 16.04 (bash) with default settings. Also works from within IntelliJ (Windows and Linux).
From what I can tell, you either have the wrong character, which I don't think is the case, or you are trying to display it on a terminal that doesn't handle the character. I have written a short test to separate the issues.
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class CharTest {
    public static void main(String[] args) {
        String testA = "ֆޘᜅᾮ";
        String testB = "\u0586\u0798\u1705\u1FAE";
        System.out.println(testA.equals(testB));
        System.out.println(testA);
        System.out.println(testB);
        try (BufferedWriter check = Files.newBufferedWriter(
                Paths.get("uni-test.txt"),
                StandardCharsets.UTF_8,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            check.write(testA);
            check.write("\n");
            check.write(testB);
        } catch (IOException ioc) {
            ioc.printStackTrace(); // don't swallow I/O errors silently
        }
    }
}
You could replace the values with the characters you want.
The first line should print true if the string is the actual string you want. After that, it is a matter of displaying the characters. For example, if I open the text file with less, half of them are broken; if I open it with Firefox, I see all four characters, but some are wonky. You'll need a font that has glyphs for the corresponding Unicode values.
One thing you can do is open the file in a word processor and select a font that displays the characters you want correctly.
As suggested by the OP, including the -Dfile.encoding=UTF8 flag causes the characters to display correctly when using System.out.println. This is similar to this question, which changes the encoding of System.out.
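For reference, the flag goes on the java command line. A tiny sketch (hypothetical EncodingCheck class) to confirm which default encoding the JVM actually picked up:
public class EncodingCheck {
    public static void main(String[] args) {
        // Run as: java -Dfile.encoding=UTF8 EncodingCheck
        // Prints the default charset the JVM uses for byte/char conversions.
        System.out.println(System.getProperty("file.encoding"));
        System.out.println(java.nio.charset.Charset.defaultCharset());
    }
}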

How to get encoding type of a .txt or .sql file

Is there a possibility to get the encoding of an existing .txt file? For example: you know a customer needs a specific encoding, and you want to automate the process of .sql data delivery. You read the encoding from a client config and compare it to the current encoding of the file to be delivered; if they differ, you change the encoding. I could not find a solution till now. Any help would be appreciated.
There is no explicit declaration of text encoding in files, but you can guess the encoding by analyzing specific byte sequences that are characteristic of a certain encoding.
Chardet does exactly that and tries to guess. If it can't say for sure what the encoding is, it will give you a list with confidence values (e.g. "90% this is utf8"). The project includes both a Python module and a command line tool. For a Java version, see JChardet.
My 2 cents: if you just need a quick way to detect the encoding, the command line chardet tool is the way to go.
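As a minimal illustration of the byte-sequence idea mentioned above: UTF-8 files written by many Windows tools start with the byte order mark EF BB BF, which you can check by hand. This sketch (hypothetical BomSniffer class) is only a heuristic; plenty of valid UTF-8 files carry no BOM at all:
import java.io.FileInputStream;
import java.io.IOException;

public class BomSniffer {
    // Returns true if the file starts with the UTF-8 BOM bytes EF BB BF.
    public static boolean hasUtf8Bom(String fileName) throws IOException {
        try (FileInputStream in = new FileInputStream(fileName)) {
            return in.read() == 0xEF && in.read() == 0xBB && in.read() == 0xBF;
        }
    }
}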
juniversalchardet is one of the best available APIs for detecting the encoding type; its project page lists the encodings it supports.
Here is a working example from the site:
import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {
    public static void main(String[] args) throws java.io.IOException {
        byte[] buf = new byte[4096];
        String fileName = args[0];
        java.io.FileInputStream fis = new java.io.FileInputStream(fileName);

        // (1) Create an instance of UniversalDetector.
        UniversalDetector detector = new UniversalDetector(null);

        // (2) Feed data to the detector until it is done or the stream ends.
        int nread;
        while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, nread);
        }

        // (3) Notify the detector that there is no more data.
        detector.dataEnd();

        // (4) Query the detected charset name (null if detection failed).
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
            System.out.println("Detected encoding = " + encoding);
        } else {
            System.out.println("No encoding detected.");
        }

        // (5) Reset the detector before reusing it for another stream.
        detector.reset();
        fis.close();
    }
}
Hope this helps!

How to save an HTML page with special chars (UTF-8) to a txt file

I need to write Java code that saves an HTML page to a .txt file.
The problem is that the special UTF-8 characters are broken.
Words like "Hamamélis" are saved as "Hamam�lis".
The code that I wrote is listed below:
URLConnection conn = site.openConnection();
conn.setReadTimeout(10000);
Charset charset = Charset.forName("UTF8");
BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"));
buff = in.readLine();
And after:
out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(Nome), "UTF-8"));
out.write(buff);
out.close();
Can anyone suggest a solution?
One possible error is omitting the hyphen from "UTF-8" in the Charset.forName("UTF8") call in your first piece of code. See the Charset documentation.
Otherwise, the code seems correct, but of course we cannot test it directly, as we do not have your data.
For comparison, here is a little class I wrote. In a manner similar to your code, this class correctly writes your "Hamamélis" example's accented 'e' as the two octets expected in UTF-8 for a single (non-normalized) character: in hex 'C3' & 'A9'.
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.BufferedWriter;
import java.io.IOException;

public class ReaderWriter {
    public static void main(String[] args) {
        try {
            String content = "Hamamélis. Written: " + new java.util.Date();
            File file = new File("some_text.txt");
            // Create file if not already existent.
            if (!file.exists()) {
                file.createNewFile();
            }
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            OutputStreamWriter outputStreamWriter = new OutputStreamWriter(fileOutputStream, "UTF-8");
            BufferedWriter bufferedWriter = new BufferedWriter(outputStreamWriter);
            bufferedWriter.write(content);
            bufferedWriter.close();
            System.out.println("ReaderWriter 'main' method is done. " + new java.util.Date());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
As icktoofay commented, you should dig deeper to discover exactly what octets are involved. Use a hex editor like this "File Viewer" app I found today on the Mac App Store to see the exact octets in your saved file.
If the octets are C3 & A9, then the problem is simply that the text editor you used to look at the file as text used the wrong character encoding. For example, you can open that text file in a web browser, and use its menu commands to re-interpret the file as UTF-8.
If the octets are not C3 & A9, I would go further back to examine the input's octets.
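If you prefer to stay in Java rather than reach for a hex editor, a short sketch like this (hypothetical HexDump class, reusing the some_text.txt name from the example above) dumps the file's first bytes so you can look for C3 A9 directly:
import java.io.FileInputStream;
import java.io.IOException;

public class HexDump {
    public static void main(String[] args) throws IOException {
        // Print the first 32 bytes of the file as hexadecimal octets.
        try (FileInputStream in = new FileInputStream("some_text.txt")) {
            int b;
            int count = 0;
            while ((b = in.read()) != -1 && count++ < 32) {
                System.out.printf("%02X ", b);
            }
            System.out.println();
        }
    }
}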
If you do not understand that text files in computers actually contain numbers (not text in the human sense), then take a break from coding to read this entertaining article:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

Why is Java BufferedReader() not reading Arabic and Chinese characters correctly?

I'm trying to read a file which contains English and Arabic characters on each line, and another file which contains English and Chinese characters on each line. However, the Arabic and Chinese characters fail to show correctly - they just appear as question marks. Any idea how I can solve this problem?
Here is the code I use for reading:
try {
    String sCurrentLine;
    BufferedReader br = new BufferedReader(new FileReader(directionOfTargetFile));
    int counter = 0;
    while ((sCurrentLine = br.readLine()) != null) {
        String lineFixedHolder = converter.fixParsedParagraph(sCurrentLine);
        System.out.println("The line number " + counter
                + " contain : " + sCurrentLine);
        counter++;
    }
}
Edition 01
After reading a line and getting the Arabic or Chinese word, I use a function to translate it by searching for the given Arabic text in an ArrayList which contains all the expected words (using the indexOf() method). When the word's index is found, it is used to fetch the English word at the same index in another ArrayList. However, this search always fails, because it is searching for question marks instead of the Arabic and Chinese characters. So my System.out.println shows nulls, one for each failed translation.
I'm using the NetBeans 6.8 IDE (Mac version).
Edition 02
Here is the code which search for translation:
int testColor = dbColorArb.indexOf(wordToTranslate);
int testBrand = -1;
if (testColor != -1) {
    String result = (String) dbColorEng.get(testColor);
    return result;
} else {
    testBrand = dbBrandArb.indexOf(wordToTranslate);
}
//System.out.println("The testBrand is : " + testBrand);
if (testBrand != -1) {
    String result = (String) dbBrandEng.get(testBrand);
    return result;
} else {
    //System.out.println("The first null");
    return null;
}
I'm actually searching two ArrayLists which might contain the desired word to translate. If it is found in neither ArrayList, null is returned.
Edition 03
When I debug I found that lines being read are stored in my String variable as the following:
"3;0000000000;0000001001;1996-06-22;;2010-01-27;����;;01989;������;"
Edition 04
The file I'm reading was given to me after it was modified by another program (about which I know nothing, except that it was made in VB); that program made the Arabic letters that were not appearing correctly appear. When I checked the encoding of the file in Notepad++, it showed ANSI. However, when I convert the file to UTF-8 (which replaces the Arabic letters with other, Latin ones) and then convert it back to ANSI, the Arabic becomes question marks!
FileReader javadoc:
Convenience class for reading character files. The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
So:
Reader reader = new InputStreamReader(new FileInputStream(fileName), "utf-8");
BufferedReader br = new BufferedReader(reader);
If this still doesn't work, then perhaps your console is not set to properly display UTF-8 characters. Configuration depends on the IDE used and is rather simple.
Update: in the above code, replace utf-8 with cp1256. This works fine for me (WinXP, JDK6).
But I'd recommend that you insist on the file being generated using UTF-8. Because cp1256 won't work for Chinese and you'll have similar problems again.
It is most likely reading the information in correctly; however, your output stream is probably not UTF-8, so any character that cannot be shown in your output character set is being replaced with '?'.
You can confirm this by getting each character out and printing the character ordinal.
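A minimal sketch of that check (hypothetical printOrdinals helper, along the same lines as the dumpString method shown earlier on this page):
// Print each char of the string together with its ordinal (UTF-16 code unit) value.
public static void printOrdinals(String text) {
    for (char c : text.toCharArray()) {
        System.out.printf("%c = U+%04X%n", c, (int) c);
    }
}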
Alternatively, you can write the output explicitly in the windows-1256 encoding:
public void writeToFile(String fileName, String str) {
    try (FileOutputStream out = new FileOutputStream(fileName)) {
        out.write(str.getBytes("windows-1256"));
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}

How to save Chinese Characters to file with java?

I use the following code to save Chinese characters into a .txt file, but when I opened it with Wordpad, I couldn't read it.
StringBuffer Shanghai_StrBuf = new StringBuffer("\u4E0A\u6D77");
boolean Append = true;
FileOutputStream fos = new FileOutputStream(FileName, Append);
for (int i = 0; i < Shanghai_StrBuf.length(); i++) {
    fos.write(Shanghai_StrBuf.charAt(i));
}
fos.close();
What can I do ? I know if I cut and paste Chinese characters into Wordpad, I can save it into a .txt file. How do I do that in Java ?
There are several factors at work here:
Text files have no intrinsic metadata for describing their encoding (for all the talk of angle-bracket taxes, there are reasons XML is popular)
The default encoding for Windows is still an 8-bit (or double-byte) "ANSI" character set with a limited range of values - text files written in this format are not portable
To tell a Unicode file from an ANSI file, Windows apps rely on the presence of a byte order mark at the start of the file (not strictly true - Raymond Chen explains). In theory, the BOM is there to tell you the endianess (byte order) of the data. For UTF-8, even though there is only one byte order, Windows apps rely on the marker bytes to automatically figure out that it is Unicode (though you'll note that Notepad has an encoding option on its open/save dialogs).
It is wrong to say that Java is broken because it does not write a UTF-8 BOM automatically. On Unix systems, it would be an error to write a BOM to a script file, for example, and many Unix systems use UTF-8 as their default encoding. There are times when you don't want it on Windows, either, like when you're appending data to an existing file: fos = new FileOutputStream(FileName,Append);
Here is a method of reliably appending UTF-8 data to a file:
private static void writeUtf8ToFile(File file, boolean append, String data)
        throws IOException {
    boolean skipBOM = append && file.isFile() && (file.length() > 0);
    Closer res = new Closer();
    try {
        OutputStream out = res.using(new FileOutputStream(file, append));
        Writer writer = res.using(new OutputStreamWriter(out, Charset.forName("UTF-8")));
        if (!skipBOM) {
            writer.write('\uFEFF');
        }
        writer.write(data);
    } finally {
        res.close();
    }
}
Usage:
public static void main(String[] args) throws IOException {
    String chinese = "\u4E0A\u6D77";
    boolean append = true;
    writeUtf8ToFile(new File("chinese.txt"), append, chinese);
}
Note: if the file already existed and you chose to append and existing data wasn't UTF-8 encoded, the only thing that code will create is a mess.
Here is the Closer type used in this code:
public class Closer implements Closeable {
    private Closeable closeable;

    public <T extends Closeable> T using(T t) {
        closeable = t;
        return t;
    }

    @Override
    public void close() throws IOException {
        if (closeable != null) {
            closeable.close();
        }
    }
}
This code makes a Windows-style best guess about how to read the file based on byte order marks:
private static final Charset[] UTF_ENCODINGS = { Charset.forName("UTF-8"),
        Charset.forName("UTF-16LE"), Charset.forName("UTF-16BE") };

private static Charset getEncoding(InputStream in) throws IOException {
    charsetLoop: for (Charset encoding : UTF_ENCODINGS) {
        byte[] bom = "\uFEFF".getBytes(encoding);
        in.mark(bom.length);
        for (byte b : bom) {
            if ((0xFF & b) != in.read()) {
                in.reset();
                continue charsetLoop;
            }
        }
        return encoding;
    }
    return Charset.defaultCharset();
}
private static String readText(File file) throws IOException {
    Closer res = new Closer();
    try {
        InputStream in = res.using(new FileInputStream(file));
        InputStream bin = res.using(new BufferedInputStream(in));
        Reader reader = res.using(new InputStreamReader(bin, getEncoding(bin)));
        StringBuilder out = new StringBuilder();
        for (int ch = reader.read(); ch != -1; ch = reader.read()) {
            out.append((char) ch);
        }
        return out.toString();
    } finally {
        res.close();
    }
}
Usage:
public static void main(String[] args) throws IOException {
    System.out.println(readText(new File("chinese.txt")));
}
(System.out uses the default encoding, so whether it prints anything sensible depends on your platform and configuration.)
If you can rely that the default character encoding is UTF-8 (or some other Unicode encoding), you may use the following:
Writer w = new FileWriter("test.txt");
w.append("上海");
w.close();
The safest way is to always explicitly specify the encoding:
Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");
w.append("上海");
w.close();
P.S. You may use any Unicode characters in Java source code, even as method and variable names, if the -encoding parameter for javac is configured right. That makes the source code more readable than the escaped \uXXXX form.
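A small illustration of that P.S. (hypothetical Demo class; compile with javac -encoding UTF-8 Demo.java):
public class Demo {
    public static void main(String[] args) {
        // With the source saved and compiled as UTF-8, even identifiers
        // may use Unicode letters.
        String 上海 = "\u4E0A\u6D77";
        System.out.println(上海);
    }
}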
Be very careful with the approaches proposed. Even specifying the encoding for the file as follows:
Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");
will not work if you're running under an operating system like Windows. Even setting the system property for file.encoding to UTF-8 does not fix the issue. This is because Java fails to write a byte order mark (BOM) for the file. Even if you specify the encoding when writing out to a file, opening the same file in an application like Wordpad will display the text as garbage because it doesn't detect the BOM. I tried running the examples here in Windows (with a platform/container encoding of CP1252).
The following bug exists to describe the issue in Java:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
The solution for the time being is to write the byte order mark yourself to ensure the file opens correctly in other applications. See this for more details on the BOM:
http://mindprod.com/jgloss/bom.html
and for a more correct solution see the following link:
http://tripoverit.blogspot.com/2007/04/javas-utf-8-and-unicode-writing-is.html
Here's one way among many. Basically, we're just specifying that the conversion be done to UTF-8 before outputting bytes to the FileOutputStream:
String FileName = "output.txt";
StringBuffer Shanghai_StrBuf = new StringBuffer("\u4E0A\u6D77");
boolean Append = true;
Writer writer = new OutputStreamWriter(new FileOutputStream(FileName, Append), "UTF-8");
writer.write(Shanghai_StrBuf.toString(), 0, Shanghai_StrBuf.length());
writer.close();
I manually verified this against the images at http://www.fileformat.info/info/unicode/char/ . In the future, please follow Java coding standards, including lower-case variable names. It improves readability.
Try this:
StringBuffer Shanghai_StrBuf = new StringBuffer("\u4E0A\u6D77");
boolean Append = true;
Writer out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(FileName, Append), "UTF8"));
for (int i = 0; i < Shanghai_StrBuf.length(); i++) {
    out.write(Shanghai_StrBuf.charAt(i));
}
out.close();
