It's the String conversion again: UNIX Windows-1252 to String

I'm downloading a website in Java using the following:
myUrl = new URL("here is my URL");
in = new BufferedReader(new InputStreamReader(myUrl.openStream()));
This file, however, contains some special characters like ä, ö, and ü, and I need to be able to print these out properly.
I try to encode the Strings using:
String encodedString = new String(toEncode.getBytes("Windows-1252"), "UTF-8");
But all it does is replace these special characters with a ?.
When I open the .html file I downloaded from Chrome in Notepad++, the status bar (bottom right corner) says UNIX and Windows-1252. That's all I know about the file's encoding.
What more steps can I take to figure out what is wrong?
--AND--
How can I convert this file so that I can properly read and print it in Java?
Sorry if this question is kind of stupid... I simply don't know any better and couldn't find anything on the internet.

OK, so you are mixing a lot of things here.
First of all, you do:
new InputStreamReader(myUrl.openStream())
this will open a reader, yes; however, it will do so using your default JRE/OS charset. Maybe not what you want.
Try and specify that you want UTF-8 (note: this is Java 7+ code):
try (
    final InputStream in = myUrl.openStream();
    final Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)
) {
    // read from the reader here
}
Now, what you are mixing...
You read from an InputStream; an InputStream only knows how to read bytes.
But you want text; and in Java, text means a sequence of chars.
Let us forget for a moment that you want chars and focus on the fact that you want text; let us substitute a carrier pigeon for a char.
Now, what you need to do is to transform this stream of bytes into a stream of carrier pigeons. For this, you need a particular process. And in this case, the process is called decoding.
Back to Java, now. There also exists a process which does the reverse: encoding a stream of carrier pigeons (or chars) into a stream of bytes.
The trick... There exist several ways to do that; Unicode refers to them as character encodings; and in Java, the base class which provides both encoders and decoders is a Charset.
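In code, the two directions look like this (a tiny sketch, using the umlauts from the question):
byte[] bytes = "äöü".getBytes(StandardCharsets.UTF_8);   // encoding: chars -> bytes
String text = new String(bytes, StandardCharsets.UTF_8); // decoding: bytes -> chars, same Charset both ways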
Now, an InputStreamReader accepts a Charset as a second argument... Which you should ALWAYS specify. If you DO NOT, this:
new InputStreamReader(in);
will be equivalent to:
new InputStreamReader(in, Charset.defaultCharset());
and Charset.defaultCharset() is Not. Guaranteed. To. Be. The. Same. Amongst. Implementations. Of. JREs.
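Putting it all together, here is a minimal sketch of the download loop with an explicit charset (this assumes the site actually serves UTF-8; check the Content-Type response header if in doubt):
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(myUrl.openStream(), StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line); // correct display also depends on the console's charset
    }
}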

Related

Fix mixed encoding in string

I have a file which contains the following string:
AAdÎ¿be DÎ¿cument ClÎ¿ud
if viewed in Notepad++. In hex view, the odd characters show up as the byte pair CE BF.
If I read the file with Java the string looks like this:
AAdοbe Dοcument Clοud
How can I get the same encoding in Java as with Notepad++?
Your file is encoded as UTF-8, and the byte pair CE BF is the UTF-8 encoding of the character ο ('GREEK SMALL LETTER OMICRON', U+03BF).
If you use the Encoding pull-down menu in Notepad++ to specify UTF-8, you should see the content as:
AAdοbe Dοcument Clοud
You might want to replace those Greek ο's with regular Latin o's ('LATIN SMALL LETTER O' (U+006F)).
If you decide to keep the Greek ο's, you need to make sure your Java program reads the file using UTF-8, which is best done using one of these:
BufferedReader reader = Files.newBufferedReader(Paths.get("file.txt")); // UTF-8 is the default
BufferedReader reader = Files.newBufferedReader(Paths.get("file.txt"), StandardCharsets.UTF_8);
If you look at the text with a debugger, you should see that it is now read correctly. If you print the text, make sure the console window you're using can handle UTF-8 characters, otherwise it might just print wrong, even though it was read correctly.
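If the console is the weak link, one option (a sketch, Java 10+) is to wrap System.out in a PrintStream with an explicit charset; whether the glyphs actually render still depends on the terminal and its font:
PrintStream out = new PrintStream(System.out, true, StandardCharsets.UTF_8); // autoflush, UTF-8
out.println("AAdοbe Dοcument Clοud");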
You must set the encoding in the file reader, like this (this FileReader constructor exists since Java 11):
new FileReader(fileName, StandardCharsets.UTF_8)
You must read the file in Java using the same encoding the file was written with.
If you are working with non-standard encodings, even asking the reader for its encoding, like this:
InputStreamReader r = new InputStreamReader(new FileInputStream(theFile));
r.getEncoding()
can report the wrong value: getEncoding() only returns the charset the reader was constructed with (here, the platform default), not the actual encoding of the file.
There's little library which handles recognition of encoding a bit better: https://code.google.com/archive/p/juniversalchardet/
It also has some gaps in detecting the proper encoding, but I've used it.
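Typical usage looks roughly like this (a sketch following the library's documented example; fileName is a placeholder, and the detector class is org.mozilla.universalchardet.UniversalDetector):
byte[] buf = new byte[4096];
try (FileInputStream fis = new FileInputStream(fileName)) {
    UniversalDetector detector = new UniversalDetector(null);
    int nread;
    while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread); // feed bytes until the detector is confident
    }
    detector.dataEnd();
    String encoding = detector.getDetectedCharset(); // null if nothing was detected
    detector.reset();
}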
While using it I found that most of the non-standard encodings I ran into could also be read as UTF-16, like:
new FileReader(fileName, StandardCharsets.UTF_16)
Java has supported the UTF-16 encoding for a long time; it is defined in the standard API as StandardCharsets.UTF_16. That character set covers lots of language-specific characters and emoji.

Display chinese characters as it is from velocity file [duplicate]

Hi, I am using the Java language. I have to use some Chinese and Japanese characters in strings and print them using System.out.println().
How can I do that?
Thanks
Java Strings support Unicode, so Chinese and Japanese is no problem. Other tools (such as text editors) and your OS shell probably need to be told about it, though.
When reading or printing Unicode data, you have to make sure that the console or stream also supports Unicode (otherwise it will likely be replaced with question marks).
Writer unicodeFileWriter = new OutputStreamWriter(
        new FileOutputStream("a.txt"), "UTF-8");
unicodeFileWriter.write("漢字");
You can embed Unicode literals directly in Java source code files, but you need to tell the compiler that the file is in UTF-8 (javac -encoding UTF-8):
String x = "漢字";
If you want to go wild, you can even use Chinese characters in method, variable, or class names. But that is against the naming conventions, and I would strongly discourage it at least for class names (because they need to be mapped to file names, and Unicode can cause problems there):
結果 漢字 = new 物().処理();
Just use it; Java Strings are fully Unicode, so there should be nothing harder than saying:
System.out.println("世界您好!");
One more thing to remember: the Reader should be a BufferedReader, by which I mean:
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(f), "UTF-8"));
You need it because you want to read the file line by line; read each line into a variable so that no line is skipped:
String line;
while ((line = br.readLine()) != null) {
    System.out.println(line);
}
A BufferedReader is required here because the base Reader class does not provide a readLine() method.

Find out encoding directly from an input stream [duplicate]

I'm facing a problem.
A file can be written in some encoding such as UTF-8, UTF-16, UTF-32, etc.
When I read a UTF-16 file, I use the code below:
BufferedReader in = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(file), "UTF-16"));
How can I determine which encoding the file is in before I read it?
When I read a UTF-8 encoded file as UTF-16, I can't read the characters correctly.
There is no good way to do that. The question you're asking is like determining the radix of a number by looking at it. For example, what is the radix of 101?
Best solution would be to read the data into a byte array. Then you can use String(byte[] bytes, Charset charset) to test it with multiple encodings, most likely to least likely.
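A minimal sketch of that trial-decoding idea (guessCharset and the candidate list are mine; CodingErrorAction.REPORT makes the decoder throw on bytes that are invalid for the charset; needs java.nio.ByteBuffer, java.nio.charset.*, java.util.List):
static Charset guessCharset(byte[] bytes, List<Charset> candidates) {
    for (Charset cs : candidates) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(bytes));
            return cs; // decoded cleanly, so this charset is plausible
        } catch (CharacterCodingException e) {
            // invalid under this charset; try the next one
        }
    }
    return null; // nothing decoded cleanly
}
Order the candidates from most to least likely: permissive single-byte charsets such as ISO-8859-1 decode almost any byte sequence without error, so they belong at the end of the list.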
You cannot, in general. Which transformation format applies is usually signalled by the first two to four bytes of the file (assuming a BOM is present), and you cannot see those just from the outside.
You can read the first few bytes and try to guess the encoding.
If all else fails, try reading with different encodings until one works (no exception when decoding and it 'looks' OK).

Readline() in Java does not handle Chinese characters properly

I have a text file with Chinese words written to a line. The line is surrounded with "\r\n", and written using fileOutputStream.write(string.getBytes()).
I have no problems reading lines of English words, my buffered reader parses it with readLine() perfectly. However, it recognizes the Chinese sentence as multiple lines, thus screwing up my programme flow.
Any solutions?
Using string.getBytes() encodes the String using the platform default encoding. That is rarely what you want, especially when you're trying to write characters that are not native to your current locale.
Specify the encoding instead (using string.getBytes("UTF-8"), for example).
A cleaner and more Java-esque way would be to wrap your OutputStream in an OutputStreamWriter like this:
Writer w = new OutputStreamWriter(out, "UTF-8");
Then you can simply call writer.write(string) and don't need to repeat the encoding each time you want to write a String.
And, as commented below, specify the same encoding when reading the file (using a Reader, preferably).
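For example, a matched write/read pair might look like this (a sketch; "words.txt" is a placeholder):
try (Writer w = new OutputStreamWriter(new FileOutputStream("words.txt"), StandardCharsets.UTF_8)) {
    w.write("你好世界\r\n"); // written as UTF-8 bytes
}
try (BufferedReader r = new BufferedReader(
        new InputStreamReader(new FileInputStream("words.txt"), StandardCharsets.UTF_8))) {
    System.out.println(r.readLine()); // read back with the same encoding
}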
If you're outputting the text via fileOutputStream.write(string.getBytes()), you're writing with the default encoding for the platform. It's important to ensure you then read with the same encoding, using methods that are encoding-aware. The problem won't be in your BufferedReader instance, but in whatever Reader you have under it that's converting bytes into characters.
This article may be of use: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

How to detect illegal UTF-8 byte sequences to replace them in java inputstream?

The file in question is not under my control. Most byte sequences are valid UTF-8, it is not ISO-8859-1 (or an other encoding).
I want to do my best do extract as much information as possible.
The file contains a few illegal byte sequences; those should be replaced with the replacement character.
It's not an easy task; I think it requires some knowledge of the UTF-8 state machine.
Oracle has a wrapper which does what I need:
UTF8ValidationFilter javadoc
Is there something like that available (commercially or as free software)?
Thanks
-stephan
Solution:
final BufferedInputStream in = new BufferedInputStream(istream);
final CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
charsetDecoder.onMalformedInput(CodingErrorAction.REPLACE);
charsetDecoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
final Reader inputReader = new InputStreamReader(in, charsetDecoder);
java.nio.charset.CharsetDecoder does what you need. This class provides charset decoding with user-definable actions for the different kinds of errors (see onMalformedInput() and onUnmappableCharacter()).
A configured decoder can be handed directly to an InputStreamReader, as in the solution above, effectively giving you a filtered reader.
One way would be to read the first few bytes and check for a Byte Order Mark, if one exists. More information on BOMs: http://en.wikipedia.org/wiki/Byte_order_mark. At that URL you will find a table of the BOM bytes. One problem, however, is that UTF-8 does not require a BOM at the start of a file. Another approach is pattern recognition (reading a few bytes at a time), but that is the complicated solution.
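A rough sketch of that BOM check, with the byte values from the table linked above (remember that the absence of a BOM proves nothing; needs java.nio.charset.Charset and StandardCharsets):
// Returns the charset implied by a BOM, or null if no BOM is recognized.
static Charset sniffBom(byte[] b) {
    if (b.length >= 4 && b[0] == 0 && b[1] == 0 && b[2] == (byte) 0xFE && b[3] == (byte) 0xFF)
        return Charset.forName("UTF-32BE");
    if (b.length >= 4 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE && b[2] == 0 && b[3] == 0)
        return Charset.forName("UTF-32LE"); // must be tested before UTF-16LE
    if (b.length >= 3 && b[0] == (byte) 0xEF && b[1] == (byte) 0xBB && b[2] == (byte) 0xBF)
        return StandardCharsets.UTF_8;
    if (b.length >= 2 && b[0] == (byte) 0xFE && b[1] == (byte) 0xFF)
        return StandardCharsets.UTF_16BE;
    if (b.length >= 2 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE)
        return StandardCharsets.UTF_16LE;
    return null; // no BOM; could still be UTF-8 without one
}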
The behaviour you want is already the default for InputStreamReader. So there is no need to specify it yourself. This suffices:
final BufferedInputStream in = new BufferedInputStream(istream);
final Reader inputReader = new InputStreamReader(in, StandardCharsets.UTF_8);
