ISO 8859-1 Encoding of files printed in Java program

I am writing a program that implements a file structure; the program prints out a product file based on that structure. Product names include the letters Æ, Ø and Å, and these letters are not displayed correctly in the output file. I use
PrintWriter printer = new PrintWriter(new FileOutputStream(new File("products.txt")));
ISO 8859-1 or Windows ANSI (CP 1252) are the character sets that the implementation requires.

There are two possibilities:
Java is using the wrong encoding when outputting the file.
The file is actually correct, and whatever you are using to display the file is using the wrong encoding.
Assuming that the problem is the first one, the root cause is that Java has figured out that the default encoding for the platform is something other than the one you want / expect. There are three ways to solve this:
Figure out why Java has got the default locale and encoding "wrong" and remedy that. It will be something to do with your operating system's locale settings ...
Read this FAQ for details on how you can override the default locale settings at the command line.
Use a PrintWriter constructor that specifies the encoding explicitly so that your application doesn't rely on the default encoding. For example:
PrintWriter pw = new PrintWriter("filename", "ISO-8859-1");
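For example, a minimal sketch of the question's products.txt case with the charset made explicit (the product name is made up, and StandardCharsets requires Java 7 or later):

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

public class ProductFileWriter {
    public static void main(String[] args) throws IOException {
        // Wrap the FileOutputStream in an OutputStreamWriter with an explicit
        // charset so the output no longer depends on the platform default.
        try (PrintWriter printer = new PrintWriter(
                new OutputStreamWriter(
                        new FileOutputStream(new File("products.txt")),
                        StandardCharsets.ISO_8859_1))) {
            // Compile with a matching -encoding so these literals survive too.
            printer.println("BLÅBÆRSYLTETØY");
        }
    }
}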
In response to this comment:
Don’t PrintWriters all have the bug that you can’t know you had an error with them?
It is not a bug, it is a design feature.
You can find out if there was an error. You just can't find out what it was.
If you don't like it, you can use Writer instead.
They won’t raise an exception or even return failure if you try to shove a codepoint at them that can’t fit in the designated encoding.
Neither will a regular Writer I believe ... unless you specifically construct it to do this. The normal behaviour is to replace any unmappable codepoint with a specific character, though this is not specified in the javadocs (IIRC).
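If you do want a Writer that fails loudly instead of substituting, one option (a sketch; the file name is made up) is to hand OutputStreamWriter a CharsetEncoder configured to report errors:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictWriterDemo {
    public static void main(String[] args) throws IOException {
        // REPORT makes encoding problems throw instead of writing '?'.
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        try (Writer w = new OutputStreamWriter(new FileOutputStream("strict.txt"), encoder)) {
            // U+20AC (euro sign) is not in ISO-8859-1, so this write throws
            // an UnmappableCharacterException rather than silently degrading.
            w.write("\u20ac");
        }
    }
}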
Do they even tell you if the filesystem fills up; I seem to recall that they don’t.
That is true. However:
For the kind of file you typically write using a PrintWriter this is not a critical issue.
If it is a critical issue AND you still want to use PrintWriter, you can always call checkError() (IIRC) to find out if there was an error.
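For example (a small sketch; the file name is made up):

import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;

public class CheckErrorDemo {
    public static void main(String[] args)
            throws FileNotFoundException, UnsupportedEncodingException {
        PrintWriter pw = new PrintWriter("report.txt", "ISO-8859-1");
        pw.println("some output");
        // checkError() flushes the stream and reports whether any I/O error
        // (a full filesystem, for instance) has occurred; PrintWriter itself
        // never throws it at you.
        if (pw.checkError()) {
            System.err.println("Writing report.txt failed");
        }
        pw.close();
    }
}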
I always end up writing out my OutputStreamWriter constructor with an explicit Charset.forName("UTF-8").newEncoder() second argument. It’s kind of tedious, so perhaps there’s a better way.
I dunno.

Related

How to fix a hyphen/minus sign being changed to a question mark in a Java program?

I'm using these lines of code:
String string = "Some usefull information − don't know what happens with my output";
System.out.println(string);
String str2verify = driver.findElement(By.xpath("//someWellFormXpath")).getText();
Assert.assertEquals(str2verify , "Some usefull information − don't know what happens with my output");
And I'm getting this in my console, so the assertEquals comparison fails.
Output
Some usefull information ? don't know what happens with my output
expected [Some usefull information ? don't know what happens with my outputS] but found [Some usefull information − don't know what happens with my output]
java.lang.AssertionError: expected [Some usefull information ? don't know what happens with my outputS] but found [Some usefull information − don't know what happens with my output]
This is the process:
You write some text. In an editor. That is showing strings to you.
You save your file. Files are bytes, not characters, so your editor applies a charset encoding to do this. Which one? Your editor will know; you didn't mention which one you use, so I can't tell you.
Javac reads your file. Files are bytes, but javac needs characters, so javac applies a charset encoding to do this. Which one? The platform default, unless you pass the -encoding parameter, or the tool you are using to call javac has a way to tell it which -encoding parameter to use.
Javac emits class files. These are byte based so this doesn't require encoding.
Your java JVM runs your class file. As part of running, a string is printed to standard out.
System.out refers to 'standard out'. These things are, on pretty much every OS, a stream of bytes. Meaning, when you send strings there, the JVM first encodes your string using some charset encoding, then it goes to standard out.
Something is connected to the other end of standard out and sees these bytes; it converts them back to a string, also using some charset encoding.
The characters are sent to the font rendering engine of your OS. Even if the character 'survived' all those conversions back and forth, it is possible your font doesn't have a glyph for it. The intent is clearly for that character to be a 'real' dash or minus sign; the '−' in your string is not the plain ASCII hyphen-minus '-', but a separate, wider Unicode character.
Count em up - that's like 6 conversions. They all need to be using the same charset encoding. So, check that your editor and javac agree on what charset encoding your source file is in. Then, check that the console thing that is showing the string is in agreement with standard out (which should be 'platform default', whatever that might be). Then, check that the font you use has a glyph for that dash.
PrintStream ps = new PrintStream(System.out, true, "UTF-8");
Then write to ps, not System.out - that's how you can explicitly force some charset to be used when writing to output.
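If you would rather keep calling System.out.println everywhere, one option (a sketch, assuming you control the startup code and that your console actually decodes UTF-8) is to install the UTF-8 stream as the new standard out:

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Replace the default stdout with one that encodes in UTF-8,
        // so later System.out.println calls use it transparently.
        PrintStream ps = new PrintStream(System.out, true, "UTF-8");
        System.setOut(ps);
        // The dash is written as a Unicode escape so the source file
        // encoding doesn't matter for this particular line.
        System.out.println("dash test: \u2212");
    }
}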
It turns out that this dash character doesn't have a representation in the cp-1252 charset encoding, so in the end I had to change all the files in my project to UTF-8 to be able to save this character.
This encoding issue was a pain in the brain.
Thanks for all the suggestions, friends.

Java 7: What charset shall I use when calling Files.newBufferedReader?

In previous versions of Java, I would read a file by creating a buffered reader like this:
BufferedReader in = new BufferedReader(new FileReader("file.txt"));
In Java 7, I would like to use Files.newBufferedReader, but I need to pass in a charset as well. For example:
BufferedReader in = Files.newBufferedReader(Paths.get("file.txt"),
Charset.forName("US-ASCII"));
Previously, I did not have to worry about charsets when reading plain text files. What charset shall I use? Do you know what charset was used by default in previous versions of Java? I simply want to be able to find and replace the old statement with the new one.
Previously, I did not have to worry about charsets when reading plain text files.
Well, you should have done. If you were just using FileReader, it was using the default character encoding for the system. That was a bad idea, which is why I always used FileInputStream and an InputStreamReader. You should always be explicit about it. If you really want the default character encoding for the system, you should use Charset.defaultCharset() - but I strongly suggest that you don't.
If you're going to read a file, you should know the character encoding, and specify that. If you get to decide what character encoding to use when writing a file, UTF-8 is a good default choice.
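For example, a minimal sketch using the question's file.txt and the StandardCharsets constants (Java 7+), under the assumption that the file really was written as UTF-8:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadWithExplicitCharset {
    public static void main(String[] args) throws IOException {
        // StandardCharsets.UTF_8 avoids the Charset.forName lookup and can
        // never throw UnsupportedCharsetException.
        try (BufferedReader in = Files.newBufferedReader(
                Paths.get("file.txt"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}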
By default, PrintWriter and PrintStream in Java use the platform default charset, java.nio.charset.Charset.defaultCharset().
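To see what that default actually is on your machine, a quick check:

import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        // This is what FileReader, PrintWriter and friends fall back to
        // when you don't specify a charset explicitly.
        System.out.println(Charset.defaultCharset());   // e.g. windows-1252 or UTF-8
    }
}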

What could be the possible consequences of default encoding to UTF-8 for a String to Stream conversion?

I need to convert Strings obtained from some APIs to InputStreams consumed by other APIs. The only way is to convert the String to a Stream without knowing the exact encoding. So I assume it to be UTF-8, and it works fine for now. However, I would like to know what a better solution could be, given that I have no way of identifying the encoding of the source of the string.
There is no good solution to the problem of not knowing the encoding.
Because of this, you must demand that the encoding be explicitly specified, or else use one single agreed-upon encoding that is strictly adhered to.
Also, make sure you use the rare form of the constructor to InputStreamReader that condescends to raise an exception on an encoding error. That is InputStreamReader(InputStream in, CharsetDecoder dec); see the sketch below. The other three are either broken or else infelicitously designed, depending on your point of view or purposes, because they suppress encoding errors and render your program unreliable and nonportable.
Be very careful about missing errors, especially when you do not know for sure what you are getting — and even if you think you do :).
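A sketch of that strict form of the constructor (the file name and the UTF-8 choice are placeholder assumptions):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeDemo {
    public static void main(String[] args) throws IOException {
        // REPORT makes bad byte sequences throw; the name/Charset based
        // constructors silently substitute U+FFFD instead.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream("input.txt"), decoder))) {
            System.out.println(in.readLine());
        }
    }
}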
The possible consequences of applying the incorrect encoding is getting the wrong data out the other end.
The specific consequences will depend on the specific encodings. For example, if you receive a stream of ISO-8859-1 characters, and try to decode using UTF-8, you'll probably get errors due to incorrect sequences. If you start with UTF-16 and assume that it's ISO-8859-1, you'll get twice as many characters as you expect, and every other one will be garbage.
Encodings are not a property of Strings in Java, they're only relevant when you convert between Strings and bytes. If those APIs give you Strings, there is only one point where your program needs to use an encoding, which is when you convert the String back to bytes to be returned by the InputStream. And those "other APIs" of course need to know which encoding to use if they're going to interpret the contents as text data.
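In code, that single conversion point might look like this (a sketch assuming UTF-8 is the agreed-upon encoding):

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StringToStream {
    // Encoding only happens here, at the String-to-bytes boundary; whatever
    // consumes the InputStream must decode with the same charset.
    static InputStream toStream(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8));
    }
}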
To add to the other answers, your deployed application will no longer be portable between Windows and Linux, since these usually have different default encodings.

readLine() in Java does not handle Chinese characters properly

I have a text file with Chinese words written to a line. The line is surrounded with "\r\n", and written using fileOutputStream.write(string.getBytes()).
I have no problems reading lines of English words, my buffered reader parses it with readLine() perfectly. However, it recognizes the Chinese sentence as multiple lines, thus screwing up my programme flow.
Any solutions?
Using string.getBytes() encodes the String using the platform default encoding. That is rarely what you want, especially when you're trying to write characters that are not native to your current locale.
Specify the encoding instead (using string.getBytes("UTF-8"), for example).
A cleaner and more Java-esque way would be to wrap your OutputStream in an OutputStreamWriter like this:
Writer w = new OutputStreamWriter(out, "UTF-8");
Then you can simply call w.write(string) and don't need to repeat the encoding each time you want to write a String.
And, as commented below, specify the same encoding when reading the file (using a Reader, preferably).
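A round-trip sketch with the write and the read agreeing on the charset (the file name and sample text are made up; compile with -encoding UTF-8 so the literal survives):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ChineseRoundTrip {
    public static void main(String[] args) throws IOException {
        // Write and read with the same explicit charset so readLine() sees
        // real line terminators, not multi-byte characters misread as \r or \n.
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream("words.txt"), StandardCharsets.UTF_8)) {
            w.write("你好世界\r\n");
        }
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new FileInputStream("words.txt"), StandardCharsets.UTF_8))) {
            System.out.println(r.readLine());
        }
    }
}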
If you're outputting the text via fileOutputStream.write(string.getBytes()), you're outputting with the default encoding for the platform. It's important to ensure you're then reading with the appropriate encoding, and using methods that are encoding-aware. The problem won't be in your BufferedReader instance, but whatever Reader you have under it that's converting bytes into characters.
This article may be of use: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

java.net.URLConnection.guessContentTypeFromStream and text/plain

All,
I am trying to identify plain text files with Mac line endings and, inside an InputStream, silently convert them to Windows or Linux line endings (the important part is the LF character, really). Specifically, I'm working with several APIs that take InputStreams and are hard-locked to looking for \n as newlines.
Sometimes, I get binary files. Obviously, a file that isn't text-like shouldn't have this substitution done, because the value that happens to correspond to \r obviously can't silently be followed by a \n without mangling things badly.
I am attempting to use java.net.URLConnection.guessContentTypeFromStream and only performing endline conversions if the type is text/plain. Unfortunately, "text/plain" doesn't seem to be in its gamut of return values; all I get is null for my flat text files, and it's possibly not safe to assume all unidentifiable files can be modified.
What better library (preferably in a public Maven repository and open-source) can I use to do this? Alternatively, how can I make guessContentTypeFromStream work for me? I know I'm describing an inherently hazardous application and no solution can be perfect, but should I just treat "null" as likely to be "text/plain" and I simply need to write more code myself to look for evidence that it isn't?
It seems to me that what you're asking is to determine if a file is textual or not. Given that, there is a solution here that seems right:
Granted, he is talking about unix, bash and perl but the concept is the same:
Unless you inspect every byte of the file, you are not going to get this 100%. And there is a big performance hit with inspecting every byte. But after some experiments, I settled on an algorithm that works for me. I examine the first line and declare the file to be binary if I encounter even one non-text byte. It seems a little slack, I know, but I seem to get away with it.
EDIT #1:
Expanding on this type of solution, it seems like a reasonable approach would be to ensure the file contains no non-ASCII characters (unless you're dealing with files that are non-English... that's another solution). This could be done by checking whether the file contents, read as a String, consist only of ASCII characters:
// -- uses commons-io
String fileAsString = FileUtils.readFileToString( new File( "file-name-here" ) );
boolean isTextualFile = fileAsString.matches( "\\p{ASCII}*" ); // matches() checks the whole string, so this is true only if every character is ASCII
EDIT #2
If you need to allow more than plain ASCII, you may want to try this as your regex, or something close to it (again used with matches(), so it only succeeds when every character is printable or whitespace). Though, I'll admit it could likely use some refining.
"(?:\\p{Print}|\\p{Space})*"

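Putting the pieces together, a rough sketch of the whole idea: sample a few bytes, guess whether the file is textual, then normalize the line endings. The file names, the 512-byte sample size and the ISO-8859-1 assumption are all illustrative rather than part of the original answers.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class MacLineEndingFixer {
    // Crude heuristic: call the file text if a small sample contains no NUL
    // bytes. It will misjudge some files, as the quoted answer admits.
    static boolean looksLikeText(byte[] sample) {
        for (byte b : sample) {
            if (b == 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get("input.txt"));
        byte[] sample = Arrays.copyOf(bytes, Math.min(bytes.length, 512));
        if (looksLikeText(sample)) {
            String text = new String(bytes, StandardCharsets.ISO_8859_1);
            // Normalize lone CR (old Mac) and CRLF to LF, since the
            // downstream APIs only care about seeing \n.
            String fixed = text.replaceAll("\r\n?", "\n");
            Files.write(Paths.get("output.txt"), fixed.getBytes(StandardCharsets.ISO_8859_1));
        }
    }
}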