I'm making a chat client that uses special encryption. It has a problem reading letters like «, ƒ, ̕ from the input buffer.
I'm reading them into a byte array, and I tried using
Connection.getInputStream().read();
And also using
BufferedReader myInput = new BufferedReader(
new InputStreamReader(Connection.getInputStream()));
But there appears to be a problem as it displays them as square boxes.
You have to make sure that your InputStreamReader uses the same charset to decode the bytes into chars as the one the sender used to encode chars into bytes. Look at the other constructors of InputStreamReader.
You must also make sure that the font you're using to display the chars supports your special characters.
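For example, a minimal sketch of a charset-aware reader (the host, port, and UTF-8 here are assumptions; use whatever charset the sender actually encodes with):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class CharsetAwareClient {
    public static void main(String[] args) throws IOException {
        // Hypothetical host and port; replace with your server's.
        try (Socket connection = new Socket("localhost", 9000);
             BufferedReader myInput = new BufferedReader(
                     new InputStreamReader(connection.getInputStream(),
                             StandardCharsets.UTF_8))) {
            // Decode with the same charset the sender used to encode.
            System.out.println(myInput.readLine());
        }
    }
}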
Set the correct encoding on the stream through new InputStreamReader(.., "UTF-8") or whatever encoding your input is in.
Convert the byte array to a String, specifying the character set:
String data = new String(bytes, "UTF-8");
Also make sure the font used for display supports UTF-8, or whichever charset you specified.
You can try using a DataInputStream and the readChar() method.
DataInputStream in = new DataInputStream(myInput);
// where myInput is your BufferedInputStream
char c = in.readChar();
should do what you want.
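Note that readChar() consumes two bytes as one big-endian UTF-16 code unit, so it only round-trips cleanly if the sender wrote the chars with DataOutputStream.writeChar(). A self-contained sketch of that pairing (the sample characters are just the ones from the question):
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ReadCharDemo {
    public static void main(String[] args) throws IOException {
        // Sender side: writeChar() emits each char as two big-endian bytes.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeChar('«');
        out.writeChar('ƒ');

        // Receiver side: readChar() reassembles the same chars.
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(in.readChar()); // «
        System.out.println(in.readChar()); // ƒ
    }
}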
Could anyone please help me out here? I want to know the difference between the two string conversions below. I am trying to encode the string as UTF-8. Which one is the correct method?
String string2 = new String(string1.getBytes("UTF-8"), "UTF-8");
OR
String string3 = new String(string1.getBytes(), "UTF-8");
ALSO, if I use the two lines above together, i.e.
line 1: string1 = new String(string1.getBytes("UTF-8"), "UTF-8");
line 2: string1 = new String(string1.getBytes(), "UTF-8");
Will the value of string1 be the same in both lines?
PS: The purpose of doing all this is to send Japanese text in a web service call.
So I want to send it with UTF-8 encoding.
According to the javadoc of String#getBytes(String charsetName):
Encodes this String into a sequence of bytes using the named charset,
storing the result into a new byte array.
And the documentation of String(byte[] bytes, Charset charset)
Constructs a new String by decoding the specified array of bytes using
the specified charset.
Thus getBytes() is the opposite operation of String(byte[]). getBytes() encodes the string to bytes, and String(byte[]) decodes the byte array back into a string. You have to use the same charset for both operations to preserve the actual string value, i.e. your second example is wrong:
// This is wrong because you are calling getBytes() with default charset
// But converting those bytes to string using UTF-8 encoding. This will
// mostly work because default encoding is usually UTF-8, but it can fail
// so it is wrong.
new String(string1.getBytes(), "UTF-8");
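For completeness, a small sketch of the correct round trip (the Japanese sample string is just an illustration):
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) {
        String string1 = "こんにちは";
        // Encode and decode with the SAME charset; the round trip is lossless.
        byte[] utf8 = string1.getBytes(StandardCharsets.UTF_8);
        String string2 = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(string1.equals(string2)); // true
    }
}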
String and char (two-byte UTF-16) in Java are for (Unicode) text.
When converting to and from a byte[], one needs the Charset (encoding) of those bytes.
Both String.getBytes() and new String(byte[]) are shortcuts that use the default operating-system encoding. That is almost always wrong for cross-platform use.
So use
byte[] b = s.getBytes("UTF-8");
s = new String(b, "UTF-8");
Or better, not throwing a checked UnsupportedEncodingException:
byte[] b = s.getBytes(StandardCharsets.UTF_8);
s = new String(b, StandardCharsets.UTF_8);
(Older Android versions, below API level 19, do not know StandardCharsets, however.)
The same holds for InputStreamReader, OutputStreamWriter that bridge binary data (InputStream/OutputStream) and text (Reader, Writer).
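For instance, a minimal sketch of those bridge classes with an explicit charset (the file name is a placeholder):
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class BridgeDemo {
    public static void main(String[] args) throws IOException {
        // Text out: chars -> UTF-8 bytes.
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream("data.txt"), StandardCharsets.UTF_8)) {
            w.write("héllo");
        }
        // Text in: UTF-8 bytes -> chars.
        try (Reader r = new InputStreamReader(
                new FileInputStream("data.txt"), StandardCharsets.UTF_8)) {
            char[] chars = new char[5];
            int n = r.read(chars);
            System.out.println(new String(chars, 0, n)); // héllo
        }
    }
}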
Please don't confuse yourself. "String" is usually used to refer to values in a datatype that stores text. In this case, java.lang.String.
Serialized text is a sequence of bytes created by applying a character encoding to a string. In this case, byte[].
There are no UTF-8-encoded strings in Java.
If your web service client library takes a string, pass it the string. If it lets you specify an encoding to use for serialization, pass it StandardCharsets.UTF_8 or equivalent.
If it doesn't take a string, then pass it string1.getBytes(StandardCharsets.UTF_8) and use whatever other mechanism it provides to let you tell the recipient that the bytes are UTF-8-encoded text. Or, get a different client library.
I'm downloading a website in Java, using all this:
myUrl = new URL("here is my URL");
in = new BufferedReader(new InputStreamReader(myUrl.openStream()));
In this file, however, there are some special characters like ä, ö and ü. I need to be able to print these out properly.
I try to encode the Strings using:
String encodedString = new String(toEncode.getBytes("Windows-1252"), "UTF-8");
But all it does is replace these special characters with a ?.
When I open the .html file downloaded from Chrome in Notepad++, it says (in the bottom right corner) UNIX and Windows-1252. That's all I know about the file's encoding.
What more steps can I take to figure out what is wrong?
--AND--
How can I convert this file so that I can properly read and print it in Java?
Sorry if this question is kind of stupid... I simply don't know any better and couldn't find anything on the internet.
OK, so you are mixing a lot of things here.
First of all, you do:
new InputStreamReader(myUrl.openStream())
this will open a reader, yes; however, it will do so using your default JRE/OS charset. Maybe not what you want.
Try and specify that you want UTF_8 (note, Java 7+ code):
try (
    final InputStream in = myUrl.openStream();
    final Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
) {
    // read from the reader here
}
Now, what you are mixing...
You read from an InputStream; an InputStream only knows how to read bytes.
But you want text; and in Java, text means a sequence of chars.
Let us forget for a moment that you want chars and focus on the fact that you want text; let us substitute a carrier pigeon for each char.
Now, what you need to do is to transform this stream of bytes into a stream of carrier pigeons. For this, you need a particular process. And in this case, the process is called decoding.
Back to Java, now. There also exists a process which does the reverse: encoding a stream of carrier pigeons (or chars) into a stream of bytes.
The trick... There exist several ways to do that; Unicode refers to them as character encodings; and in Java, the base class which provides both encoders and decoders is a Charset.
Now, an InputStreamReader accepts a Charset as a second argument... Which you should ALWAYS specify. If you DO NOT, this:
new InputStreamReader(in);
will be equivalent to:
new InputStreamReader(in, Charset.defaultCharset());
and Charset.defaultCharset() is Not. Guaranteed. To. Be. The. Same. Amongst. Implementations. Of. JREs.
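Putting it together for the page in the question, a sketch (the URL is a placeholder; the Notepad++ hint suggests windows-1252, but the server's Content-Type header is the authoritative source):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.Charset;

public class FetchPage {
    public static void main(String[] args) throws IOException {
        URL myUrl = new URL("https://example.com/");
        // Decode with the charset the page was actually written in.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                myUrl.openStream(), Charset.forName("windows-1252")))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}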
I see that you can specify UTF-16 as the charset via Charset.forName("UTF-16"), and that you can create a new UTF-16 decoder via Charset.forName("UTF-16").newDecoder(), but I only see the ability to specify a CharsetDecoder on InputStreamReader's constructor.
So how do you specify the use of UTF-16 while reading any stream in Java?
Input streams deal with raw bytes. When you read directly from an input stream, all you get is raw bytes where character sets are irrelevant.
The interpretation of raw bytes into characters, by definition, requires some sort of translation: how do I translate from raw bytes into a readable string? That "translation" comes in the form of a character set.
This "added" layer is implemented by Readers. Therefore, to read characters (rather than bytes) from a stream, you need to construct a Reader of some sort (depending on your needs) on top of the stream. For example:
InputStream is = ...;
Reader reader = new InputStreamReader(is, Charset.forName("UTF-16"));
This will cause reader.read() to read characters using the character set you specified. If you would like to read entire lines, use BufferedReader on top:
BufferedReader reader = new BufferedReader(new InputStreamReader(is, Charset.forName("UTF-16")));
String line = reader.readLine();
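On Java 7+ you can also skip the name lookup and use StandardCharsets.UTF_16 directly; a minimal sketch, assuming a hypothetical UTF-16 text file named data.txt:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class Utf16Read {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream("data.txt"), StandardCharsets.UTF_16))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}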
I am trying to encode bytes from an InputStream as plain text characters. So I made the string out of ints separated by spaces, like this:
InputStream in;
// etc.
int b;
String finalString = "";
while ((b = in.read()) != -1) finalString += "" + b + " ";
in.close();
But the problem is, this makes the string 3-4 times larger than the original bytes. Is there any other way of encoding bytes to plain text?
If I understand correctly, you want to transform binary data into plain text. You should use Base64 for that. The size overhead will only be a factor of 4/3.
Apache commons-codec has a free implementation of a Base64 encoder (and decoder).
Another possibility is hex encoding (which commons-codec also supports), but it needs 2 characters of text for each byte of binary data.
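If you are on Java 8 or newer, the JDK's built-in java.util.Base64 avoids the extra dependency; a small sketch (the sample bytes are arbitrary):
import java.util.Base64;

public class Base64Demo {
    public static void main(String[] args) {
        byte[] data = {0x00, (byte) 0xFF, 0x10, 0x20};
        // Encode bytes to ASCII-safe text, then decode back losslessly.
        String text = Base64.getEncoder().encodeToString(data);
        byte[] back = Base64.getDecoder().decode(text);
        System.out.println(text);             // AP8QIA==
        System.out.println(back.length == 4); // true
    }
}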
You can get all the bytes and put them into a byte array, and then create the string using the byte array.
i.e.
String newString = new String(byteArray);
Your current solution produces strings that are 3-4 times longer than what's in the file because it concatenates decimal character codes into a string.
Java provides a way of reading strings from streams without the need for writing loops, like this:
InputStream in;
BufferedReader r = new BufferedReader(new InputStreamReader(in, "UTF8"));
String s = r.readLine();
Follow the documentation here
For example, if your string is UTF-8:
byte[] bytes = // you got that from somewhere...
String x = new String(bytes, "UTF8");
Commons-codec has methods to encode bytes to Base64 encoding.
encodedText = new String(
org.apache.commons.codec.binary.Base64.encodeBase64(byteArray));
If you can get it all into a single byte[], then this should just be
new String(byteArray, StandardCharsets.UTF_16LE);
or whatever character encoding you expect the input to use.
I'm trying to write strings in different languages to an RTF file. I have tried a few different things.
I use Japanese here as an example, but it's the same for the other languages I have tried.
public void writeToFile() {
    String strJapanese = "日本語";
    DataOutputStream outStream;
    File file = new File("C:\\file.rtf");
    try {
        outStream = new DataOutputStream(new FileOutputStream(file));
        outStream.writeBytes(strJapanese);
        outStream.close();
    } catch (Exception e) {
        System.out.println(e.toString());
    }
}
I also have tried:
byte[] b = strJapanese.getBytes("UTF-8");
String output = new String(b);
Or more specific:
byte[] b = strJapanese.getBytes("Shift-JIS");
String output = new String(b);
The output stream also has the writeUTF method:
outStream.writeUTF(strJapanese);
You can use the byte[] directly in the output stream with the write method. All of the above gives me garbled characters for everything except West European languages. To see if it works, I have tried opening the resulting document in Notepad++ and setting the appropriate encoding. I have also used OpenOffice, where you get to choose the encoding and font when opening the document.
If it does work but my computer can't open it properly, is there a way to check that?
Strings in Java are Unicode (UTF-16 internally), but when you want to write them out you need to specify an encoding:
try {
    FileOutputStream fos = new FileOutputStream("test.txt");
    Writer out = new OutputStreamWriter(fos, "UTF8");
    out.write(str);
    out.close();
} catch (IOException e) {
    e.printStackTrace();
}
ref: http://download.oracle.com/javase/tutorial/i18n/text/stream.html
DataOutputStream outStream;
You probably don't want a DataOutputStream for writing an RTF file. DataOutputStream is for writing binary structures to a file, but RTF is text-based. Typically an OutputStreamWriter, setting the appropriate charset in the constructor would be the way to write to text files.
outStream.writeBytes(strJapanese);
In particular this fails because writeBytes really does write bytes, even though you pass it a String. A much more appropriate datatype would have been byte[], but that's just one of the places where Java's handling of bytes vs chars is confusing. The way it converts your string to bytes is simply by taking the lower eight bits of each UTF-16 code unit, and throwing the rest away. This results in ISO-8859-1 encoding with garbled nonsense for all the characters that don't exist in ISO-8859-1.
byte[] b = strJapanese.getBytes("UTF-8");
String output = new String(b);
This doesn't really do anything useful. You encode to UTF-8 bytes and then decode them back to a String using the default charset. It's almost always a mistake to touch the default charset, as it is unpredictable across different machines.
outStream.writeUTF(strJapanese);
This would be a better stab at writing UTF-8, but it's still not quite right as it uses Java's bogus “modified UTF-8” encoding, and more importantly RTF files don't actually support UTF-8, and shouldn't really directly include any non-ASCII characters at all.
Traditionally non-ASCII characters from 128 upwards should be written as hex bytes escapes like \'80, and the encoding for them is specified, if it is at all, in font \fcharset and \cpg escapes that are very, very annoying to deal with, and don't offer UTF-8 as one of the options.
In more modern RTF, you get \u1234x escapes as in Dabbler's answer (+1). Each escape encodes one UTF-16 code unit, which corresponds to a Java char, so it's not too difficult to regex-replace all non-ASCII characters with their escaped variants.
This is supported by Word 97 and later but some other tools may ignore the Unicode and fall back to the x replacement character.
RTF is not a very nice format.
You can write any Unicode character expressed as its decimal number by using the \u control word. E.g. \u1234? would represent the character whose Unicode code point is 1234, and ? is the replacement character for cases where the character cannot be adequately represented (e.g. because the font doesn't contain it).
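A sketch of that escaping (a plain loop instead of a regex for clarity; escapeRtf is a hypothetical helper, and '?' is used as the fallback character):
public class RtfEscape {
    static String escapeRtf(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 128) {
                sb.append(c); // plain ASCII passes through unchanged
            } else {
                // \u takes a SIGNED 16-bit decimal, hence the (short) cast;
                // the trailing '?' is the fallback for non-Unicode readers.
                sb.append("\\u").append((short) c).append('?');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeRtf("日本語")); // \u26085?\u26412?\u-30050?
    }
}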