Byte streams in Java

Can we write Unicode Data in a File with ByteStreams?
My code is:
public static void main(String[] args) throws Exception {
    String str = "Русский язык ";
    FileOutputStream fos = new FileOutputStream("file path");
    fos.write(str.getBytes());
    fos.flush();
    fos.close();
}
Here I am using a byte stream to write Unicode data, and it is written correctly. I am new to Java, but I have read that byte streams do not support Unicode characters. So why does it work in this case?

i have read that byte streams do not support unicode characters.
Either you have used a bad source of information or you have misunderstood something. Byte streams support bytes, and therefore they support anything that can be represented as bytes: video, text, pictures, music... If a byte stream couldn't carry it, it couldn't be used in a digital computer at all.
The trick to representing those things in what is simply a sequence of 1s and 0s is to use agreed-upon rules. You encode your text according to certain rules, and the receiver can then decode it back using the same rules.
"Русский язык" can be represented as bytes in any encoding that supports cyrillic characters. In any of the encodings of unicode: UTF-8, UTF-16, UTF-32; Windows-1251, KOI8-R, KOI8-U, ISO-8859-5...
That doesn't mean these encodings are compatible with each other. They are all incompatible when it comes to encoding Cyrillic script, so text encoded in one the encodings, must strictly be decoded in that encoding.
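For example, here is what that incompatibility looks like in practice (a minimal sketch, assuming the JDK provides the windows-1251 charset; the exact mojibake depends on the charsets involved):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "Русский язык";

        // Encode with UTF-8, then (incorrectly) decode the very same bytes as Windows-1251
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8Bytes, Charset.forName("windows-1251"));

        System.out.println(original); // Русский язык
        System.out.println(misread);  // garbled text (mojibake), not the original string
    }
}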
.getBytes() uses the platform default encoding, which happened to be one that supports Cyrillic script. You might believe it's UTF-8, but if you are on Windows it's far more likely to be Cp1251. Don't fall into the trap of thinking that just because you used "unicode characters", your files are physically encoded in a UTF encoding. That will lead to encoding problems.
So always be explicit about the encoding, so that your program behaves the same on every platform and you always know what encoding the files your program creates are in. With your code, you could have done this:
String str = "Русский язык ";
FileOutputStream fos = new FileOutputStream("file path");
fos.write(str.getBytes("UTF-8"));
fos.flush();
fos.close();
Or as suggested by the other answer:
String str = "Русский язык ";
OutputStreamWriter osw = new OutputStreamWriter(
new FileOutputStream("file path"), "UTF-8"
);
osw.write(str);
osw.flush();
osw.close();
These are technically equivalent: in both cases the text is converted to bytes according to the UTF-8 rules.
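If you are on Java 7 or later, a slightly tidier variant (a sketch, not part of the original answers) avoids spelling the charset as a string and handles flushing and closing automatically via StandardCharsets and try-with-resources:

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriteUtf8 {
    public static void main(String[] args) throws Exception {
        String str = "Русский язык";
        // try-with-resources flushes and closes the writer automatically
        try (Writer osw = new OutputStreamWriter(
                new FileOutputStream("file path"), StandardCharsets.UTF_8)) {
            osw.write(str);
        }
    }
}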

Related

Why is my String returning "\ufffd\ufffdN a m e"

This is my method
public void readFile3() throws IOException
{
    try
    {
        FileReader fr = new FileReader(Path3);
        BufferedReader br = new BufferedReader(fr);
        String s = br.readLine();
        int a = 1;
        while (a != 2)
        {
            s = br.readLine();
            a++;
        }
        Storage.add(s);
        br.close();
    }
    catch (IOException e)
    {
        System.out.println(e.getMessage());
    }
}
For some reason I am unable to read the file which only contains this "
Name
Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz "
When I debug the code, the String s is being returned as "\ufffd\ufffdN a m e", and I have no clue where those extra characters are coming from. This is preventing me from properly reading the file.
\ufffd is the Unicode replacement character; it is used when you try to decode a byte sequence that has no representation in the target character set. I suppose you are on a Windows platform (or at least the file you read was created on Windows). Windows supports many formats for text files; the most common is ANSI, where each character is represented by its ANSI code.
But Windows can also use UTF-16 directly, where each character is represented by its Unicode code point as a 16-bit integer, so two bytes per character. Those files use special markers (a Byte Order Mark, in Windows parlance) to say:
that the file is encoded with 2 (or even 4) bytes per character
the encoding is little or big endian
(Reference : Using Byte Order Marks on MSDN)
Since you get, after the first two replacement characters, N a m e and not Name, I suppose you have a UTF-16 encoded text file. Notepad can transparently edit such files (without even telling you the actual format), but other tools do have problems with them...
The excellent vim can read files with different encodings and convert between them.
If you want to use this kind of file directly in Java, you have to use the UTF-16 charset. From the Java SE 7 javadoc on Charset: UTF-16, "Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark".
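If you are unsure what kind of file you were given, peeking at the first few bytes tells you whether a BOM is present. A rough sketch (the helper name is made up for illustration, and the absence of a BOM does not prove the file is ANSI):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BomSniffer {
    // Returns a guess of the encoding based on the BOM, or null if no BOM is found
    static String guessEncodingFromBom(String path) throws IOException {
        try (InputStream in = new FileInputStream(path)) {
            int b0 = in.read(), b1 = in.read(), b2 = in.read();
            if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";
            if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
            if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
            return null; // no BOM: could be ANSI, BOM-less UTF-8, or anything else
        }
    }
}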
You must specify the encoding when reading the file; in your case it is probably UTF-16.
Reader reader = new InputStreamReader(new FileInputStream(fileName), "UTF-16");
BufferedReader br = new BufferedReader(reader);
Check the documentation for more details: InputStreamReader class.
Check whether the file is .odt, .rtf, or something other than .txt; that may be what's causing the extra UTF-16 characters to appear. Also make sure that (even if it is a .txt file) your file is encoded as UTF-8.
Perhaps you have UTF-16 characters such as '®' in your document.

Java Unicode to readable text conversion decoding

I am developing a Java application where I am consuming a web service. The web service is created using a SAP server, which encodes the data automatically in Unicode. I get a Unicode string from the web service.
"
倥䙄ㄭ㌮਍쿣ී㈊〠漠橢਍圯湩湁楳湅潣楤杮਍湥潤橢਍″‰扯൪㰊഼┊敄瑶灹⁥佐呓′†䘠湯⁴佃剕䕉⁒渠牯慭慌杮䔠ൎ⼊祔数⼠潆瑮਍匯扵祴数⼠祔数റ⼊慂敳潆瑮⼠潃牵敩൲⼊慎敭⼠う㄰਍䔯据摯湩⁧′‰൒㸊ാ攊摮扯൪㐊〠漠橢਍㰼਍䰯湥瑧⁨‵‰൒㸊ാ猊牴慥൭ 䘯〰‱⸱2
"
above is the response.
I want to convert it to readable text format like String. I am using core Java.
倥䙄ㄭ㌮਍쿣ී㈊〠漠橢਍圯湩湁楳湅潣楤杮਍湥潤橢਍″‰扯൪㰊഼┊敄瑶灹⁥佐呓′†䘠湯⁴佃剕䕉⁒渠牯慭慌杮䔠ൎ⼊祔数⼠潆瑮਍匯扵祴数⼠祔数റ⼊慂敳潆瑮⼠潃牵敩൲⼊慎敭⼠う㄰਍䔯据摯湩⁧′‰൒㸊ാ攊摮扯൪㐊〠漠橢਍㰼਍䰯湥瑧⁨‵‰൒㸊ാ猊牴慥൭ 䘯〰‱⸱2
That's a PDF file that has been interpreted as UTF-16LE.
You need to look at what component is receiving the response and how it's dealing with the input to stop it being decoded as UTF-16LE, but ultimately there isn't a 'readable' version of it as such, as it's a binary file. Extracting the document text out of a PDF file is a much bigger problem!
(Note: Unicode is a character set, UTF-16LE is an encoding of that set into bytes. Microsoft call the UTF-16LE encoding "Unicode" due to a historical accident, but that's misleading.)
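If the mangled string is all you still have, you may be able to get the original bytes back by encoding it with the same (wrong) charset it was decoded with. A hedged sketch: this is lossy wherever invalid sequences were already replaced with U+FFFD during decoding, and the output file name is only illustrative:

import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;

public class RecoverBinary {
    public static void main(String[] args) throws Exception {
        String mangled = "..."; // the response that was wrongly decoded as UTF-16LE

        // Re-encoding with the same charset reverses the wrong decoding step,
        // except where bytes were already replaced with U+FFFD
        byte[] recovered = mangled.getBytes(StandardCharsets.UTF_16LE);

        try (FileOutputStream fos = new FileOutputStream("recovered.pdf")) {
            fos.write(recovered);
        }
    }
}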
If you have byte[] or an InputStream (both binary data) you can get a String or a Reader (both text) with:
final String encoding = "UTF-8"; // or "UTF-16LE", "UTF-16BE", ...

byte[] b = ...;
String s = new String(b, encoding);

InputStream is = ...;
BufferedReader reader = new BufferedReader(new InputStreamReader(is, encoding));
String line;
while ((line = reader.readLine()) != null) {
    // process line
}
The reverse process uses:
byte[] b = s.getBytes(encoding);

OutputStream os = ...;
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(os, encoding));
writer.write(s);
writer.newLine();
writer.flush();
Unicode is a numbering system for all characters. The UTF variants implement Unicode as bytes.
Your problem:
Normally (e.g. with a web service) you would already have received a String. You could write that string to a file using the Writer shown above, either to check it yourself with a full Unicode font or to pass the file on for a check.
You need to check which UTF variant the text is in. For Asian scripts, UTF-16 (little endian or big endian) is the more compact choice. In XML the encoding would already be declared.
Addition:
FileWriter writes to a file using the default encoding (taken from the operating system of your machine). Instead use:
new OutputStreamWriter(new FileOutputStream(new File("...")), "UTF-8")
If it is a binary PDF, as @bobince said, just use a FileOutputStream on a byte[] or InputStream.
This is definitely not a valid string. This looks like mangled UTF-16.
UPDATE
Indeed, @bobince is right: this is a PDF file (most probably in UTF-8 or plain ASCII) that was displayed as UTF-16. When displayed as UTF-8, this string indeed shows PDF source code. Good catch.

UTF-8 conversion for text obtained from the internet

ElasticSearch is a search server which accepts data only in UTF-8.
When I try to give ElasticSearch the following text
Small businesses potentially in line for a lighter reporting load include those with an annual turnover of less than £440,000, net assets of less than £220,000 and fewer than ten employees"
Through my Java application. Basically, my Java application takes this info from a webpage and gives it to ElasticSearch. ES complains that it can't understand £ and it fails. After filtering through the code below:
byte bytes[] = s.getBytes("ISO-8859-1");
s = new String(bytes, "UTF-8");
Here £ is converted to �
But when I copy it to a file in my home directory using bash, it goes in fine. Any pointers will help.
You have ISO-8859-1 octets in bytes, which you then tell String to decode as if they were UTF-8. When it does that, it doesn't recognize the illegal 0xA3 sequence and replaces it with the replacement character.
To fix this, you have to construct the string with the encoding the data actually uses, and then convert it to the encoding you want. See How do I convert between ISO-8859-1 and UTF-8 in Java?.
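As a minimal sketch of that two-step conversion (the byte values are just an example containing the '£' character):

import java.nio.charset.StandardCharsets;

public class Recode {
    public static void main(String[] args) {
        // Raw bytes as received, actually encoded in ISO-8859-1 (0xA3 is '£')
        byte[] latin1Bytes = { 0x20, (byte) 0xA3 };

        // 1. Decode with the encoding the data really uses
        String text = new String(latin1Bytes, StandardCharsets.ISO_8859_1);

        // 2. Encode with the encoding the target system expects
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(text);             // " £"
        System.out.println(utf8Bytes.length); // 3: the '£' becomes two bytes in UTF-8
    }
}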
UTF-8 is easier than one thinks: within a String everything is Unicode characters, and encodings only come into play when converting between bytes and strings.
Byte/string conversion is done as follows. (Note: Cp1252, or Windows-1252, is the Windows Latin-1 extension of ISO-8859-1; you are usually better off using that one.)
BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), "Cp1252"));
PrintWriter out = new PrintWriter(
        new OutputStreamWriter(new FileOutputStream(file), "UTF-8"));

response.setContentType("text/html; charset=UTF-8");
response.setCharacterEncoding("UTF-8");

String s = "20 \u00A3"; // the '£' written as a Unicode escape
To see why Cp1252 is more suitable than ISO-8859-1:
http://en.wikipedia.org/wiki/Windows-1252
A String is a sequence of characters that is basically independent of any character encoding (OK, not exactly independent, but close enough for our needs here). Whatever encoding your data was in when you loaded it into a String has already been decoded. The decoding was done either using the system default encoding (which is practically ALWAYS an error; never rely on the system default encoding, trust me, I have over 10 years of experience dealing with bugs related to wrong default encodings) or the encoding you explicitly specified when you loaded the data.
When you call getBytes("ISO-8859-1") on a String, you ask for the String to be encoded into bytes according to the ISO-8859-1 encoding.
When you create a String from a byte array, you need to specify the encoding in which the characters in the byte array are represented. You created the string as if the byte array were encoded in UTF-8, yet just above you encoded it in ISO-8859-1; that is your error.
What you want to do is:
byte bytes[] = s.getBytes("UTF-8");
s = new String(bytes, "UTF-8");

File is not saved in UTF-8 encoding even when I set encoding to UTF-8

When I check my file with Notepad++ it's in ANSI encoding. What I am doing wrong here?
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file), "UTF8");
try
{
    out.write(text);
    out.flush();
} finally
{
    out.close();
}
UPDATE:
This is solved now. The reason JBoss wasn't understanding my XML was not the encoding; it was the naming of my XML file. Thanks all for the help, even though there really wasn't any problem...
If you're creating an XML file (as your comments imply), I would strongly recommend that you use the XML libraries to output this and write the correct XML encoding header. Otherwise your character encoding won't conform to XML standards and other tools (like your JBoss instance) will rightfully complain.
// Prepare the DOM document for writing
Source source = new DOMSource(doc);
// Prepare the output file
File file = new File(filename);
Result result = new StreamResult(file);
// Write the DOM document to the file
Transformer xformer = TransformerFactory.newInstance().newTransformer();
xformer.transform(source, result);
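If you want to be explicit about the output encoding rather than rely on the default, the Transformer lets you state it via OutputKeys.ENCODING (a small addition to the snippet above, not part of the original answer); set it before calling transform():

Transformer xformer = TransformerFactory.newInstance().newTransformer();
// Declare the encoding explicitly so the XML declaration matches the bytes written
xformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
xformer.transform(source, result);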
There's no such thing as plain text. The problem is that an application is decoding character data without you telling it which encoding the data uses.
Although many Microsoft apps rely on the presence of a Byte Order Mark to indicate a Unicode file, this is by no means standard. The Unicode BOM FAQ says more.
You can add a BOM to your output by writing the character '\uFEFF' at the start of the stream. More info here. This should be enough for applications that rely on BOMs.
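For example, a minimal sketch that reuses the file and text variables from the question's snippet; the writer encodes '\uFEFF' as the bytes EF BB BF:

OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
try
{
    out.write('\uFEFF'); // the BOM, written before the actual text
    out.write(text);
    out.flush();
} finally
{
    out.close();
}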
UTF-8 is designed to be, in the common case, rather indistinguishable from ANSI. So when you write text to a file and encode the text with UTF-8, in the common case, it looks like ANSI to anyone else who opens the file.
UTF-8 is 1-byte-per-character for all ASCII characters, just like ANSI.
UTF-8 has all the same bytes for the ASCII characters as ANSI does.
UTF-8 does not have any special header characters, just as ANSI does not.
It's only when you start to get into the non-ASCII codepoints that things start looking different.
But in the common case, byte-for-byte, ANSI and UTF-8 are identical.
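A quick way to convince yourself of this (a minimal sketch; windows-1252 stands in here for "ANSI" on a typical Western Windows installation):

import java.util.Arrays;

public class AsciiBytes {
    public static void main(String[] args) throws Exception {
        String ascii = "plain ASCII text";
        // Identical byte sequences for ASCII-only content
        System.out.println(Arrays.equals(
                ascii.getBytes("UTF-8"), ascii.getBytes("windows-1252")));           // true

        String withUmlaut = "Grüße";
        // The encodings diverge as soon as non-ASCII characters appear
        System.out.println(Arrays.equals(
                withUmlaut.getBytes("UTF-8"), withUmlaut.getBytes("windows-1252"))); // false
    }
}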
If there is no BOM (and Java doesn't write one for UTF-8; it doesn't even recognize one when reading), the text is identical in ANSI and UTF-8 encoding as long as only characters in the ASCII range are used. Therefore Notepad++ cannot detect any difference.
(And there seems to be an issue with UTF8 in Java anyways...)
The IANA registered type is "UTF-8", not "UTF8". However, Java should throw an exception for invalid encodings, so that's probably not the problem.
I suspect that Notepad is the problem. Examine the text using a hexdump program, and you should see it properly encoded.
Did you try to write a BOM at the beginning of the file? BOM is the only thing that can tell the editor the file is in UTF-8. Otherwise, the UTF-8 file can just look like Latin-1 or extended ANSI.
You can do it like this,
public final static byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
...
OutputStream os = new FileOutputStream(file);
os.write(UTF8_BOM);
os.flush();

OutputStreamWriter out = new OutputStreamWriter(os, "UTF8");
try
{
    out.write(text);
    out.flush();
} finally
{
    out.close();
}

Java App: Unable to read ISO-8859-1 encoded file correctly

I have a file which is encoded as ISO-8859-1 and contains characters such as ô.
I am reading this file with java code, something like:
File in = new File("myfile.csv");
InputStream fr = new FileInputStream(in);
byte[] buffer = new byte[4096];
while (true) {
    int byteCount = fr.read(buffer, 0, buffer.length);
    if (byteCount <= 0) {
        break;
    }
    String s = new String(buffer, 0, byteCount, "ISO-8859-1");
    System.out.println(s);
}
However the ô character is always garbled, usually printing as a ? .
I have read around the subject (and learnt a little on the way) e.g.
http://www.joelonsoftware.com/articles/Unicode.html
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
http://www.ingrid.org/java/i18n/utf-16/
but still can not get this working
Interestingly, this works on my local PC (XP) but not on my Linux box.
I have checked that my JDK supports the required charsets (they are standard, so this is no surprise) using:
System.out.println(java.nio.charset.Charset.availableCharsets());
I suspect that either your file isn't actually encoded as ISO-8859-1, or System.out doesn't know how to print the character.
I recommend that to check for the first, you examine the relevant byte in the file. To check for the second, examine the relevant character in the string, printing it out with
System.out.println((int) s.charAt(index));
In both cases the result should be 244 decimal; 0xf4 hex.
See my article on Unicode debugging for general advice (the code presented is in C#, but it's easy to convert to Java, and the principles are the same).
In general, by the way, I'd wrap the stream with an InputStreamReader with the right encoding - it's easier than creating new strings "by hand". I realise this may just be demo code though.
EDIT: Here's a really easy way to prove whether or not the console will work:
System.out.println("Here's the character: \u00f4");
Parsing the file as fixed-size blocks of bytes is not good --- what if some character has a byte representation that straddles across two blocks? Use an InputStreamReader with the appropriate character encoding instead:
BufferedReader br = new BufferedReader(
        new InputStreamReader(
                new FileInputStream("myfile.csv"), "ISO-8859-1"));

char[] buffer = new char[4096]; // character (not byte) buffer

while (true)
{
    int charCount = br.read(buffer, 0, buffer.length);
    if (charCount == -1) break; // reached end-of-stream
    String s = String.valueOf(buffer, 0, charCount);
    // alternatively, we can append to a StringBuilder
    System.out.println(s);
}
Btw, remember to check that the unicode character can indeed be displayed correctly. You could also redirect the program output to a file and then compare it with the original file.
As Jon Skeet suggests, the problem may also be console-related. Try System.console().printf(s) to see if there is a difference.
@Joel - your own answer confirms that the problem is a difference between the default encoding on your operating system (UTF-8, the one Java has picked up) and the encoding your terminal is using (ISO-8859-1).
Consider this code:
public static void main(String[] args) throws IOException {
    byte[] data = { (byte) 0xF4 };
    String decoded = new String(data, "ISO-8859-1");
    if (!"\u00f4".equals(decoded)) {
        throw new IllegalStateException();
    }

    // write default charset
    System.out.println(Charset.defaultCharset());

    // dump bytes to stdout
    System.out.write(data);

    // will encode to default charset when converting to bytes
    System.out.println(decoded);
}
By default, my Ubuntu (8.04) terminal uses the UTF-8 encoding. With this encoding, this is printed:
UTF-8
?ô
If I switch the terminal's encoding to ISO 8859-1, this is printed:
UTF-8
ôô
In both cases, the same bytes are being emitted by the Java program:
5554 462d 380a f4c3 b40a
The only difference is in how the terminal is interpreting the bytes it receives. In ISO 8859-1, ô is encoded as 0xF4. In UTF-8, ô is encoded as 0xC3B4. The other characters are common to both encodings.
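You can confirm those byte values yourself with a couple of lines (a minimal sketch):

import java.nio.charset.StandardCharsets;

public class ShowBytes {
    public static void main(String[] args) {
        for (byte b : "ô".getBytes(StandardCharsets.ISO_8859_1)) {
            System.out.printf("%02x ", b & 0xff);   // prints: f4
        }
        System.out.println();
        for (byte b : "ô".getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02x ", b & 0xff);   // prints: c3 b4
        }
        System.out.println();
    }
}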
If you can, try to run your program in a debugger to see what's inside your 's' string after it is created. It is possible that it has the correct content but the output is garbled by the System.out.println(s) call. In that case, there is probably a mismatch between what Java thinks is the encoding of your output and the character encoding of your terminal/console on Linux.
Basically, if it works on your local XP PC but not on Linux, and you are parsing the exact same file (i.e. you transferred it in a binary fashion between the boxes), then it probably has something to do with the System.out.println call. I don't know how you verify the output, but if you do it by connecting with a remote shell from the XP box, then there is the character set of the shell (and the client) to consider.
Additionally, what Zach Scrivena suggests is also true - you cannot assume that you can create strings from chunks of data in that way - either use an InputStreamReader or read the complete data into an array first (obviously not going to work for a large file). However, since it does seem to work on XP, then I would venture that this is probably not your problem in this specific case.
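If the mismatch does turn out to be between Java's default encoding and the terminal, one option is to stop relying on the default for output altogether. A minimal sketch (the terminal must of course still be configured for UTF-8 for the character to display correctly):

import java.io.PrintStream;

public class ExplicitStdout {
    public static void main(String[] args) throws Exception {
        // Wrap stdout in a PrintStream with a fixed encoding instead of the platform default
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("ô");
    }
}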
