Curly quotes causing Java Scanner hasNextLine() to be false -- why?

I've been having an issue getting the java.util.Scanner to read a text file I saved in Notepad, even though it works fine with others. Basically, when it tries to read the problem file, it comes up completely empty handed -- hasNextLine() is false, buffer is empty, etc. I narrowed it down to the fact that it won't even read the first line if there is a curly quote anywhere in the file. No exceptions are thrown. Note that a BufferedReader on the same file doesn't have a problem.
try {
    int count = 0;
    Scanner scanner = new Scanner(new File("C:/myfile.txt"));
    while (scanner.hasNextLine()) {
        count++;
        scanner.nextLine();
    }
    scanner.close();
    System.out.print(count);
    count = 0;
    BufferedReader reader = new BufferedReader(new FileReader("C:/myfile.txt"));
    while (reader.readLine() != null) {
        count++;
    }
    reader.close();
    System.out.print(count);
}
catch(IOException e) {
    e.printStackTrace();
}
The above code, reading a file that contains nothing but a single curly quote, prints out "01". Searches on Google led me to try this:
Scanner scanner = new Scanner(new File("C:/myfile.txt"), "ISO-8859-1");
This makes it work (i.e., it prints out "11"). I also noticed that if I go into Notepad and do a Save As... the default encoding at the bottom is "ANSI". If I change this to "UTF-8" and save the file, then the scanner (without an encoding) also works. If I tell the scanner "UTF-8", then understandably it only works if I save as UTF-8, but "ISO-8859-1" seems to make it work even if I save it as "ANSI".
So, I know it has something to do with file encoding, but the problem is I don't understand anything about file encoding. My knowledge of what "ISO-8859-1" means is extremely vague; why does that make it work no matter how I save the file? Why does BufferedReader work regardless?
EDIT:
The links/comments below really helped point me in the right direction! I think I've got it figured out.
First of all, in Notepad:
"ANSI" is CP1252
"Unicode" is UTF-16LE
"UTF-8" is... well, UTF-8
In hexadecimal, a curly apostrophe is represented as:
CP1252: 92
UTF-16LE: 1920
UTF-8: E2 80 99
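The byte values listed above can be reproduced from Java itself. The following is only a small sketch: the class name QuoteBytes and the helper hex() are made up for illustration, and it assumes the JDK provides the windows-1252 charset (standard JDKs do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuoteBytes {
    /** Encodes the text with the given charset and returns the bytes as space-separated hex. */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+2019 RIGHT SINGLE QUOTATION MARK, written as an escape so the
        // encoding of this source file does not matter.
        String quote = "\u2019";
        System.out.println("CP1252:   " + hex(quote, Charset.forName("windows-1252")));
        System.out.println("UTF-16LE: " + hex(quote, StandardCharsets.UTF_16LE));
        System.out.println("UTF-8:    " + hex(quote, StandardCharsets.UTF_8));
    }
}
```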
The default encoding Java uses on my system, according to Charset.defaultCharset(), is UTF-8. So when I saved the file in UTF-8, the scanner knew what to expect. When I saved the file in CP1252, however, it choked once it hit that "92", because a lone 0x92 byte is not a valid way to represent a character in UTF-8. It works fine as long as there aren't any such characters in the file -- the bytes for "hello world" are the same in both CP1252 and UTF-8, so they don't cause a problem.
UTF-8 doesn't work with a UTF-16 file, because it doesn't know what to do with the byte order mark ("FFFE"), regardless of what characters are in the file.
On the other hand, when I set the scanner to CP1252 or ISO-8859-1, it's much more tolerant. It doesn't necessarily interpret the characters correctly, mind you, but there's nothing that prevents it from recognizing lines in the file and looping through.
As far as why Scanner has a problem but the FileReader/BufferedReader does not, I am going to guess that it's because the scanner needs to tokenize the file, i.e. interpret the characters so it can identify whitespace and other patterns, so it chokes when there's something unrecognizable. The reader doesn't need to do that. All it needs to identify are the line breaks.
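One way to check that guess: the Scanner Javadoc says that an IOException thrown by the underlying source is treated as end of input and can be retrieved afterwards with ioException(). The sketch below writes a small CP1252-style file and scans it with two charsets. The class name ScannerSwallowDemo and its helpers are hypothetical, and it assumes the Scanner(File, String) constructor decodes strictly, which matches the behaviour described in the question.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;

public class ScannerSwallowDemo {
    /** Writes "it?s" where ? is the single byte 0x92 (a curly apostrophe in CP1252). */
    static File sampleFile() {
        try {
            Path path = Files.createTempFile("curly-quote", ".txt");
            Files.write(path, new byte[] {'i', 't', (byte) 0x92, 's', '\n'});
            path.toFile().deleteOnExit();
            return path.toFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Counts the lines the Scanner sees and names any exception it kept to itself. */
    static String scan(File file, String charsetName) {
        try (Scanner scanner = new Scanner(file, charsetName)) {
            int lines = 0;
            while (scanner.hasNextLine()) {
                scanner.nextLine();
                lines++;
            }
            IOException swallowed = scanner.ioException();
            return lines + " line(s), swallowed exception: "
                    + (swallowed == null ? "none" : swallowed.getClass().getSimpleName());
        } catch (FileNotFoundException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File file = sampleFile();
        // 0x92 cannot start a UTF-8 sequence, so decoding fails here.
        System.out.println("UTF-8:      " + scan(file, "UTF-8"));
        // ISO-8859-1 maps every byte to some character, so nothing fails.
        System.out.println("ISO-8859-1: " + scan(file, "ISO-8859-1"));
    }
}
```

Checking scanner.ioException() after the loop is a cheap way to tell "the file really was empty" apart from "decoding failed".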

If you don't specify an encoding when you create the scanner it will try to divine the encoding based on a byte order mark (BOM), which is the first few bytes of a file. If it doesn't have one, it will default to whatever default the OS uses. Since you're using Windows, the default is cp-1252. It seems that Notepad is saving your text file using ISO-8859-1, which is similar to, but not the same as, cp-1252. See this link for more details:
http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html
When you save it as UTF-8, it probably places the UTF-8 BOM at the beginning of the file and the scanner can pick up on it.
If you want to look more into BOM, look it up on Wikipedia -- the article is quite good. You can also download PSPad and open the text file in hex mode to see the individual bytes. Hope that helps :)

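Scanner's hasNextLine method simply returns false if it encounters an encoding error in the input file, without throwing any exception. This is frustrating and easy to miss in the documentation, even in JDK 8: the Scanner Javadoc only says that an IOException from the underlying source is treated as end of input and can be retrieved afterwards with ioException().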
If you just want to read a file line-by-line, use this instead:
final BufferedReader input = new BufferedReader(new InputStreamReader(new FileInputStream("inputfile.txt"), "inputencoding"));
while (true) {
    String line = input.readLine();
    if (line == null) break;
    // process line
}
input.close();
Make sure the inputencoding above is replaced with the correct encoding of the file. Most likely it is utf-8 or ascii. Even if the encoding mismatches, it won't prematurely terminate like Scanner.
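To illustrate the last point, here is a minimal sketch of what a mismatched encoding does with this kind of reader: the undecodable byte is replaced with U+FFFD (the replacement character) and reading carries on. The class name LenientReaderDemo and the method firstLine() are invented for this example, and the bytes come from memory rather than a file to keep it self-contained.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;

public class LenientReaderDemo {
    /** Decodes the bytes with the named charset and returns the first line. */
    static String firstLine(byte[] data, String charsetName) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charsetName))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "it's" as Notepad's "ANSI" (CP1252) would save it: 0x92 is the curly apostrophe.
        byte[] ansiBytes = {'i', 't', (byte) 0x92, 's'};

        String asUtf8 = firstLine(ansiBytes, "UTF-8");
        String asCp1252 = firstLine(ansiBytes, "windows-1252");

        // The wrong charset still yields a line, but with U+FFFD in place of the quote.
        System.out.println("UTF-8 gives U+FFFD:       " + (asUtf8.charAt(2) == '\uFFFD'));
        // The right charset recovers U+2019.
        System.out.println("windows-1252 gives U+2019: " + (asCp1252.charAt(2) == '\u2019'));
    }
}
```

So the reader does not stop early, but the text is silently damaged, which is why passing the correct encoding still matters.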

Some time ago I had a similar problem with a configuration file that was edited by the user. Because I never know what kind of editor the user will use, I tried this:
org.mozilla.universalchardet.UniversalDetector
available from here:
https://code.google.com/p/juniversalchardet/
Detecting a character encoding is not a simple thing, so I can't be sure this library works under all conditions, but for me it was sufficient. Have a look; maybe it will help you detect your encoding and then pass it to the Scanner.

Related

Fix mixed encoding in string

I have a file which contains the following string:
AAdοbe Dοcument Clοud
if viewed in Notepad++. In hex view the string looks like this:
If I read the file with Java the string looks like this:
AAdÎ¿be DÎ¿cument ClÎ¿ud
How I can get the same encoding in Java as with Notepad++?

Your file is encoded as UTF-8, and the CE BF bytes are the UTF-8 encoding of the character ο ('GREEK SMALL LETTER OMICRON' (U+03BF)).
If you use the Encoding pull-down menu in Notepad++ to specify UTF-8, you should see the content as:
AAdοbe Dοcument Clοud
You might want to replace those Greek ο's with regular Latin o's ('LATIN SMALL LETTER O' (U+006F)).
If you decide to keep the Greek ο's, you need to make sure your Java program reads the file using UTF-8, which is best done using one of these:
BufferedReader reader = Files.newBufferedReader(Paths.get("file.txt")); // UTF-8 is the default
BufferedReader reader = Files.newBufferedReader(Paths.get("file.txt"), StandardCharsets.UTF_8);
If you look at the text with a debugger, you should see that it is now read correctly. If you print the text, make sure the console window you're using can handle UTF-8 characters, otherwise it might just print wrong, even though it was read correctly.
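The garbled output in the question can be reproduced in a few lines. This is a sketch only: the class name MojibakeDemo is made up, and it uses ISO-8859-1 as a stand-in for whatever single-byte default charset the original program ended up with (windows-1252 maps the bytes CE and BF to the same two characters).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    /** Decodes raw bytes with the given charset. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "Adobe" with a Greek omicron (U+03BF) instead of the Latin o, as in the file.
        String original = "Ad\u03BFbe";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8); // omicron becomes CE BF

        // Decoding with a single-byte charset turns CE BF into two characters (U+00CE, U+00BF).
        String garbled = decode(utf8Bytes, StandardCharsets.ISO_8859_1);
        System.out.println("garbled length:  " + garbled.length());   // one char longer than the original
        System.out.println("original length: " + original.length());

        // Decoding with UTF-8 restores the original text.
        System.out.println("round trip ok:   " + decode(utf8Bytes, StandardCharsets.UTF_8).equals(original));
    }
}
```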

You must set the encoding in the file reader like this:
new FileReader(fileName, StandardCharsets.UTF_8)

You must read the file in Java using the same encoding the file was written in.
If you are working with non-standard encodings, even trying to read the encoding with something like:
InputStreamReader r = new InputStreamReader(new FileInputStream(theFile));
r.getEncoding()
can give the wrong value, because getEncoding() reports the charset the reader is using (here the platform default), not the encoding the file was actually saved in.
There's a little library which handles recognition of encodings a bit better: https://code.google.com/archive/p/juniversalchardet/
It also has some holes in detecting the proper encoding, but I've used it.
And while using it I found that most of the non-standard encodings I came across could be read with UTF-16, like:
new FileReader(fileName, StandardCharsets.UTF_16)
Java has supported UTF-16 for a while. It's defined in the Java standard API as StandardCharsets.UTF_16. That character set covers lots of language-specific characters and emojis.

Use Get-Content in powershell as java input get extra character

I am practicing using the command line to run a Java program on Windows 10. The program uses Scanner(System.in) to read input from a file and print each string it reads. The PowerShell command is as follows:
Get-Content source.txt | java test.TestPrint
The content of the source.txt file is as follows:
:
a
2
!
And the TestPrint.java file is as follows:
package test;
import java.util.Scanner;
public class TestPrint {
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        Scanner in = new Scanner(System.in);
        while (in.hasNextLine())
        {
            String str = in.nextLine();
            if (str.equals("q")) break;
            System.out.println(str);
        }
    }
}
Then a weird thing happened. The result is
?:
a
2
!
You see, it adds a question mark at the beginning of the first line. Then when I change the character in the first line of the source.txt file from ":" to "a", the result is
a
a
2
!
It adds a space at the beginning of the first line.
I tested other characters and found a pattern: if the character is larger than "?" in ASCII (63), such as "A" (65) or "[" (91), a space is added. If the character is smaller than "?", or is "?" itself, a question mark is added.

Could this be a Unicode issue (See: Java Unicode problems)? i.e. try specifying the type you want to read in:
Scanner in = new Scanner(System.in, "UTF-8");
EDIT:
Upon further research: in PowerShell 5.1 and earlier, the default code page is Windows-1252. PowerShell 6+ and the cross-platform versions have switched to UTF-8. So (from the comments) you may have to specify Windows-1252 encoding:
Scanner in = new Scanner(System.in, "Windows-1252");
To find out what encoding is being used, execute the following in PowerShell:
[System.Text.Encoding]::Default
And you should be able to see what encoding is being used (for me in PowerShell v 5.1 it was Windows-1252, for PowerShell 6 it was UTF-8).

There is no text but encoded text.
Every program reading a text file or stream must know and use the same character encoding that the writer used.
An adaptive default character encoding is a 90s solution to a 70s and 80s problem (approx). Today, it's usually best to avoid constructors and methods that use a default, and in PowerShell, add an encoding argument where needed to control input or output.
To prevent data loss, you can use the Unicode character set throughout. UTF-8 is the most common for files and streams. (PowerShell and Java use UTF-16 for text datatypes.)
But you need to start from what you know the character encoding of the text file is. If you don't know this metadata, that's data loss right there.
Unicode provides that if a file or stream is known to be Unicode, it can start with metadata called a BOM. The BOM indicates which specific Unicode character encoding is being used and what the byte order is (for character encodings with code units longer than a byte). [This provision doesn't solve any problem that I've seen and causes problems of its own.]
(A character encoding, at the abstract level, is a map between codepoints and code units and is therefore independent of byte order. In practice, a character encoding takes the additional step of serializing/deserializing code units to/from byte sequences. So, sometimes using or not using a BOM is included in the encoding's name or description. A BOM might also be referred to as a signature. Ergo, "UTF-8 with signature.")
As metadata, a BOM, if present, should be used if needed and always discarded when putting text into text datatypes. Unfortunately, Java's standard libraries don't discard the BOM. You can use popular libraries or a dozen or so lines of your own code to do this.
Again, start with knowing the character encoding of the text file and inserting that metadata into the processing as an argument.
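As a rough idea of what those "dozen or so lines" might look like, here is a sketch that decodes a stream with a known charset and drops a leading U+FEFF from the first line. The names BomStripDemo, stripBom() and readLines() are invented for illustration, and whether the stray character in the question really is a BOM is an assumption, not something the output alone proves.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BomStripDemo {
    private static final char BOM = '\uFEFF';

    /** Removes a single leading BOM character, if there is one. */
    static String stripBom(String text) {
        return (!text.isEmpty() && text.charAt(0) == BOM) ? text.substring(1) : text;
    }

    /** Reads all lines with the given charset, discarding a BOM at the very start. */
    static List<String> readLines(InputStream in, Charset charset) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            boolean first = true;
            while ((line = reader.readLine()) != null) {
                lines.add(first ? stripBom(line) : line);
                first = false;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // EF BB BF is the UTF-8 BOM, followed by ":" and "a" on separate lines.
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, ':', '\n', 'a', '\n'};
        List<String> lines = readLines(new ByteArrayInputStream(data), StandardCharsets.UTF_8);
        System.out.println(lines); // the first entry no longer starts with U+FEFF
    }
}
```

For input piped from the shell, the same helper could be given System.in together with whatever charset the shell is known to emit.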

Display chinese characters as it is from velocity file [duplicate]

Hi, I am using Java. I have to use some Chinese and Japanese characters in a string and print them using System.out.println().
How can I do that?
Thanks

Java Strings support Unicode, so Chinese and Japanese is no problem. Other tools (such as text editors) and your OS shell probably need to be told about it, though.
When reading or printing Unicode data, you have to make sure that the console or stream also supports Unicode (otherwise it will likely be replaced with question marks).
Writer unicodeFileWriter = new OutputStreamWriter(
        new FileOutputStream("a.txt"), "UTF-8");
unicodeFileWriter.write("漢字");
unicodeFileWriter.close(); // flush the buffered characters and release the file
You can embed Unicode literals directly in Java source code files, but you need to tell the compiler that the file is in UTF-8 (javac -encoding UTF-8)
String x = "漢字";
If you want to go wild, you can even use Chinese characters in method, variable, or class names. But that is against the naming conventions, and I would strongly discourage it at least for class names (because they need to be mapped to file names, and Unicode can cause problems there):
結果 漢字 = new 物().処理();

Just use it. Java Strings are fully Unicode, so there should be nothing hard about just saying
System.out.println("世界您好!");
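If the console shows question marks instead, the string is usually fine and the output stream's encoding is the problem. The sketch below captures what a PrintStream writes for two charsets. The class name ConsoleEncodingDemo and the method printed() are hypothetical, and whether a real terminal then renders the UTF-8 bytes correctly depends on the terminal, which this code cannot check.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ConsoleEncodingDemo {
    /** Returns the bytes a PrintStream with the named charset emits for the text. */
    static byte[] printed(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, charsetName)) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String kanji = "\u6F22\u5B57"; // the two characters used in the answer above

        // A charset that cannot represent the characters substitutes '?' for each one.
        System.out.println("US-ASCII bytes: " + printed(kanji, "US-ASCII").length);
        // UTF-8 keeps them, using three bytes per character here.
        System.out.println("UTF-8 bytes:    " + printed(kanji, "UTF-8").length);

        // The same idea applied to the real console: wrap System.out explicitly.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(kanji);
    }
}
```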

One more thing to remember: the Reader should be a BufferedReader. What I mean is:
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(f), "UTF-8"));
This is needed so that readLine() can be called when you read the file. Call it once per iteration and keep the result, otherwise every other line is skipped:
String line;
while ((line = br.readLine()) != null)
{
    System.out.println(line);
}
This is the only approach I found that works normally, because a plain Reader has no readLine() method; BufferedReader adds it (it takes no arguments and returns the next line, or null at the end of the input).

Unable to read any of file that contains specific character(s)

TL;DR
Why does reading in a file with – not find any data on Notepad?
Problem:
Up to this point, I have been using just plain ol' Notepad (Version 6.1) to read/write text for testing/answering questions here.
Simple bit of code to read in the text files contents, and print them to the console:
Scanner sc = new Scanner(new File("myfile.txt"));
while (sc.hasNextLine()) {
    String text = sc.nextLine();
    System.out.println(text);
}
All is well, the lines print as expected.
Then, if I put in this exact character: –, anywhere in the text file, it will not read any of the file, and print nothing to the console.
I can of course use Notepad++ or other (better) text editors, and there is no issue, the text, including the dash character, will print as expected.
I can also specify UTF-8, using Notepad, and it will work fine:
File fileDir = new File("myfile.txt");
BufferedReader in = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(fileDir), "UTF8"));
String str;
while ((str = in.readLine()) != null) {
    System.out.println(str);
}
On my original Notepad file, if I copy and paste the text (including the –) into Notepad++ and compare the two files with WinMerge, it tells me that the dash on Notepad is –, but on Notepad++, it is â€“.
Question:
Why, when this – is used in a text file in Notepad, does it read nothing, basically telling me that hasNextLine() is false? Should it not at least read the input up to the line that contains this specific character?
Steps to reproduce:
On Windows 7, right-click and create new Text Document.
Put any text in the file (without any special characters, as such)
Put in this character anywhere in the file: –
Run the first block of code above
Output: BUILD SUCCESSFUL (total time: 1 second), i.e. doesn't print any of the text.
PS:
I know I asked a similar (well, it ended up being the same) question yesterday, but unfortunately, it seems I may not have explained myself well, or some of the viewers didn't fully read the question. Either way, I think I've explained it better here.

The issue seems to be a difference in encoding. You have to read the file using the same encoding it was written in.
Your system Notepad probably uses the Windows-1252 (or Cp1252) encoding. This encoding is known to cause problems with the characters in the range 128 - 159, and the dash lies in this range. In the otherwise equivalent ISO 8859-1 this range holds no printable characters (only control codes); they are present only in Cp1252.
Eclipse, when reading the Notepad file, assumes the file has the encoding ISO-8859-1 (as it is equivalent). But this character is not present in ISO-8859-1, hence the problem. If you want to read the file from Java, you will have to specify Cp1252, and you should get your output.
This is also the reason why your code with UTF-8 works correctly, when the file in notepad is written in UTF-8.
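The difference between the three encodings for this one character can be seen directly. This is a sketch with an invented class name (DashDecodingDemo); it assumes the en dash was saved by Notepad as the single CP1252 byte 0x96, which follows from the 128 - 159 range mentioned above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DashDecodingDemo {
    /** Decodes raw bytes with the named charset. */
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] ansiDash = {(byte) 0x96}; // en dash as stored by CP1252

        // Windows-1252 maps 0x96 to U+2013 EN DASH.
        System.out.println(decode(ansiDash, "windows-1252").equals("\u2013"));
        // ISO-8859-1 maps it to the control character U+0096.
        System.out.println(decode(ansiDash, "ISO-8859-1").equals("\u0096"));
        // 0x96 on its own is malformed UTF-8; new String(...) substitutes U+FFFD.
        System.out.println(decode(ansiDash, "UTF-8").equals("\uFFFD"));

        // The reverse mix-up: the UTF-8 bytes of the dash (E2 80 93) read as
        // Windows-1252 give the three characters WinMerge displayed.
        byte[] utf8Dash = "\u2013".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(utf8Dash, "windows-1252").length());
    }
}
```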

A buffered reader reads more than the current line, maybe the text up to the problematic bytes. CharsetDecoder.onMalformedInput then comes into play, and something restrictive happens there, which I would normally not have expected.
Do you use a special JDK? Do you sweep exceptions under the carpet, for example with a lambda wrapping the above code? (Add a catch for Throwable.)
Is your platform encoding -Dfile.encoding=ISO-8859-1 instead of Cp1252?

Readline() in Java does not handle Chinese characters properly

I have a text file with Chinese words written to a line. The line is surrounded with "\r\n", and written using fileOutputStream.write(string.getBytes()).
I have no problems reading lines of English words, my buffered reader parses it with readLine() perfectly. However, it recognizes the Chinese sentence as multiple lines, thus screwing up my programme flow.
Any solutions?

Using string.getBytes() encodes the String using the platform default encoding. That is rarely what you want, especially when you're trying to write characters that are not native to your current locale.
Specify the encoding instead (using string.getBytes("UTF-8"), for example).
A cleaner and more Java-esque way would be to wrap your OutputStream in an OutputStreamWriter like this:
Writer w = new OutputStreamWriter(out, "UTF-8");
Then you can simply call w.write(string) and don't need to repeat the encoding each time you want to write a String.
And, as commented below, specify the same encoding when reading the file (using a Reader, preferably).
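Putting both halves together, a minimal round trip might look like the sketch below: write with an explicit UTF-8 Writer, read back with an explicit UTF-8 Reader. The class name Utf8RoundTrip and the temp-file handling are assumptions made to keep the example self-contained; they are not taken from the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class Utf8RoundTrip {
    /** Writes the lines as UTF-8 with "\r\n" endings, then reads them back as UTF-8. */
    static List<String> roundTrip(List<String> lines) {
        try {
            Path file = Files.createTempFile("utf8-demo", ".txt");
            try {
                try (Writer w = new OutputStreamWriter(
                        Files.newOutputStream(file), StandardCharsets.UTF_8)) {
                    for (String line : lines) {
                        w.write(line);
                        w.write("\r\n");
                    }
                }
                List<String> result = new ArrayList<>();
                try (BufferedReader r = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = r.readLine()) != null) {
                        result.add(line);
                    }
                }
                return result;
            } finally {
                Files.deleteIfExists(file); // clean up the temporary file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "\u4F60\u597D" is a two-character Chinese greeting, escaped to keep this source ASCII.
        List<String> input = List.of("hello", "\u4F60\u597D");
        System.out.println(roundTrip(input).equals(input));
    }
}
```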

If you're outputting the text via fileOutputStream.write(string.getBytes()), you're outputting with the default encoding for the platform. It's important to ensure you're then reading with the appropriate encoding, and using methods that are encoding-aware. The problem won't be in your BufferedReader instance, but whatever Reader you have under it that's converting bytes into characters.
This article may be of use: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
