Java bug? Can't read GB2312 file with Scanner directly - java

I have a file in GB2312 encoding (Chinese). It was downloaded as is from http://lingua.mtsu.edu/chinese-computing/statistics/char/list.php?Which=MO with wget under Windows and saved as ModernChineseCharacterFrequencyList.html.
The code below shows that Java fails to read the file to the end one way but succeeds another way.
Namely, if the Scanner is created with scanner = new Scanner(src, "GB2312"), the code does not work. If it is created with scanner = new Scanner(new FileInputStream(src), "GB2312"), it DOES work.
The commented-out delimiter pattern lines just show another option with which the glitch remains.
public static void main(String[] args) throws FileNotFoundException {
File src = new File("ModernChineseCharacterFrequencyList.html");
//Pattern frequencyDelimitingPattern = Pattern.compile("<br>|<pre>|</pre>");
Scanner scanner;
String line;
//scanner = new Scanner(src, "GB2312"); // does NOT work
scanner = new Scanner(new FileInputStream(src), "GB2312"); // does work
//scanner.useDelimiter(frequencyDelimitingPattern);
while(scanner.hasNext()) {
line = scanner.next();
System.out.println(line);
}
}
Is this a glitch or by-design behavior?
UPDATE
When the code DOES work it just reads all tokens up to end. When it does NOT work it cancels reading approximately in the middle with no exception or error message.
No singularity at the break place was found. Nor did any "magic" numbers like 2^32 manifest.
UPDATE 2
Originally the behavior was found on Windows with Sun's Java SE 1.6.
The same behavior now also occurs on Ubuntu with OpenJDK 1.6.0_23.

I cannot test my answer right now, but the JDK 6 documentation lists different canonical names for encodings depending on the API you use: io or nio.
JDK 6 Supported Encodings
Maybe, instead of using "GB2312", you should use "EUC_CN", which is the canonical name listed for Java I/O.
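The silent stop described in the updates can at least be made visible. Scanner does not throw the IOException raised by its underlying source; it treats it as end of input and keeps it available through ioException(). The sketch below is a diagnostic only, under assumptions that are not confirmed anywhere in this thread: that the file contains bytes the decoder rejects, and that the two constructors handle such bytes differently (as far as I can tell from the JDK sources, the File constructor uses a decoder that reports malformed input, while the InputStream constructor goes through InputStreamReader, which substitutes a replacement character). The class name ScannerDecodeCheck, the helper countTokens, and the sample bytes are hypothetical, and UTF-8 stands in for GB2312 only to keep the sample tiny.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Scanner;

public class ScannerDecodeCheck {

    /** Counts whitespace-separated tokens and prints any I/O error the scanner swallowed. */
    static int countTokens(Scanner scanner, String label) {
        int tokens = 0;
        try {
            while (scanner.hasNext()) {
                scanner.next();
                tokens++;
            }
            // Scanner stores the IOException of its source instead of throwing it.
            IOException swallowed = scanner.ioException();
            System.out.println(label + ": " + tokens + " tokens, ioException = " + swallowed);
        } finally {
            scanner.close();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample: plain text plus one byte (0xFF) that is never legal in UTF-8.
        File sample = File.createTempFile("scanner-check", ".txt");
        sample.deleteOnExit();
        Files.write(sample.toPath(), new byte[] {'a', 'b', ' ', (byte) 0xFF, ' ', 'c', 'd'});

        countTokens(new Scanner(sample, "UTF-8"), "Scanner(File, charset)");
        countTokens(new Scanner(new FileInputStream(sample), "UTF-8"), "Scanner(InputStream, charset)");
    }
}
```

Running the same check against the real HTML file with "GB2312" would show whether a decoding error is behind the early stop; if ioException() stays null, the cause lies elsewhere.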

Related

Java Scanner can't read from only File Name

I'm making a Java program to read some scores from a .csv file and calculate the average of those scores. To read from the file, I'm using the Scanner Class.
First, I create a scanner to read from my file:
Scanner scanner = new Scanner(new File("TempFile.csv"));
I expected this to work, but it returns a FileNotFoundException. So, I replaced TempFile.csv with the file's absolute file name.
Scanner scanner = new Scanner(new File("C:\\Users\...."));
This gave me the result I wanted, and I was able to parse the file. I'm new to Java, but I know that it's bad practice to use the absolute file name.
How can I use only the short file name?
Scanner scanner = new Scanner(new File (new File("TempFile.csv").getAbsolutePath()));
Use the line above.
"TempFile.csv" is a relative path. It's relative to the working directory of your java program. This directory is the value of the System property "user.dir". The following line of code gives you that value...
String workingDirectory = System.getProperty("user.dir");
Hence if you are getting FileNotFoundException, it probably means file "TempFile.csv" is not located in the working directory of your java program.
By the way, class java.nio.file.Files (added in Java 7) contains the method readAllLines; the overload that takes only a Path and assumes UTF-8 was added in Java 8. So if the file "TempFile.csv" is not too big, readAllLines may be a simpler alternative to class Scanner. Note though that you still need to provide the correct path to the file when calling that method.
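To make the working-directory point concrete, here is a minimal sketch that prints where the relative name is being looked up and, if the file is there, averages it with readAllLines. The class name ScoreAverage, the helper average, and the assumption that the CSV holds nothing but comma-separated numbers are all hypothetical; the question does not show the file's layout.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class ScoreAverage {

    /** Averages every comma-separated number in the file; blank lines are skipped. */
    static double average(Path csv) throws IOException {
        List<String> lines = Files.readAllLines(csv, StandardCharsets.UTF_8);
        double sum = 0;
        int count = 0;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            for (String field : line.split(",")) {
                sum += Double.parseDouble(field.trim());
                count++;
            }
        }
        return count == 0 ? 0 : sum / count;
    }

    public static void main(String[] args) throws IOException {
        // Relative paths are resolved against this directory.
        System.out.println("Working directory: " + System.getProperty("user.dir"));

        Path csv = Paths.get("TempFile.csv");
        System.out.println("Looking for: " + csv.toAbsolutePath());
        if (!Files.exists(csv)) {
            System.out.println("Not found; move the file there or start the program from the file's directory.");
            return;
        }
        System.out.println("Average: " + average(csv));
    }
}
```

When the program is started from an IDE, the working directory is typically the project root rather than the source folder, which is a common reason for the short name not being found.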

What is the difference between Java File+Scanner object for reading files and a FileReader object?

I was trying to find the difference between FileReader and the method I'm used to. I saw a similar question here, but it didn't really answer mine. So here goes:
The method I'm used to goes like this:
import java.io.File;
import java.util.ArrayList;
import java.util.Scanner;
...
public static ArrayList<String> read_file(String filename)
{
File temp = new File(filename);
Scanner input_file;
ArrayList<String> result = new ArrayList<String>();
try
{
input_file = new Scanner(temp);
}
catch (Exception e)
{
System.out.printf("Error: failed to open file %s\n", filename);
return result;
}
while (input_file.hasNextLine())
{
String line = input_file.nextLine();
result.add(line);
}
input_file.close();
return result;
...
I get that the File object allows us to work with the file that exists at that String path/filename.
But what is the difference between what the File+Scanner combination here does and what the FileReader(File file) or FileReader(String filename) object does? (I'm NOT asking about the different versions of FileReader; I get the idea of overloaded methods/constructors.)
It would help to explain what FileReader does and how its use would differ from a Scanner.
Thanks guys in advance.
In simple words:
Scanner: A simple text scanner which can parse primitive types and strings using regular expressions. The advantage is that programmers need not worry about implementing the parsing and conversion of input data to the various primitives. This speeds up development and is reliable, because the class is used by everybody and is well tested.
FileReader: Convenience class for reading character files. Its functionality is limited to reading chars from the underlying file. The rest of the work has to be done by the programmer.
Conclusion: Scanner provides a reliable, easy-to-use implementation for reading and parsing streams (files), and saves a lot of development time.
Basically a Scanner and a FileReader differ in their API, meaning that you have different methods available for reading a file depending on which one you use. A Scanner attempts to tokenize your file, while a Reader gives you access to more fine-grained details. Also, a Scanner is not specific to files: it can read from many different input sources, such as the command line, while a FileReader is specific to reading a file.
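The API difference is easiest to see side by side. In this small sketch (the class name ReaderVsScanner, both helper methods, and the sample contents are made up for illustration), the same file is read once as raw characters with FileReader and once as typed tokens with Scanner:

```java
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ReaderVsScanner {

    /** FileReader: you get raw characters and do all the splitting and parsing yourself. */
    static String readChars(File file) throws IOException {
        StringBuilder text = new StringBuilder();
        try (FileReader reader = new FileReader(file)) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        }
        return text.toString();
    }

    /** Scanner: the input is split into tokens that can be read as typed values. */
    static List<Integer> readInts(File file) throws IOException {
        List<Integer> numbers = new ArrayList<>();
        try (Scanner scanner = new Scanner(file)) {
            while (scanner.hasNextInt()) {
                numbers.add(scanner.nextInt());
            }
        }
        return numbers;
    }

    public static void main(String[] args) throws IOException {
        File sample = File.createTempFile("numbers", ".txt");
        sample.deleteOnExit();
        Files.write(sample.toPath(), "10 20\n30\n".getBytes(StandardCharsets.US_ASCII));

        System.out.println("FileReader: " + readChars(sample).replace("\n", "\\n"));
        System.out.println("Scanner:    " + readInts(sample));
    }
}
```

The two can also be combined: a Scanner accepts any Readable, so new Scanner(new FileReader(file)) tokenizes whatever the reader delivers.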

Using files as arguments for a Java program in the terminal

java someJavaProgram fsa.fsa <test.txt
That, apparently, is a legitimate command for passing two files to a Java program in the terminal: one to read in, and then the other (and I think the idea is that it prints the output directly to the terminal). someJavaProgram, fsa.fsa and test.txt are all files in the same directory (someProject/src, with someJavaProgram in the default package).
However, the response I am given in the terminal just says:
FSA file not found - please scan in the appropriate file.
Testing file not found, please scan in the new relevant file.
My question is two-fold:
What is this command and what is it for?
Does it need refining or modifying or is it the program that needs improvement?
I should note that I wrote the code in Eclipse, where I simply hardcoded filepaths into the program. I'm not sure if that affects anything but it's related.
EDIT: The filepaths and related code are as follows:
private static final String FILE_PATH = "src/test.txt";
private static final String FSA_PATH = "src/fsa1.fsa";
...
public static void main(String[] args) throws FileNotFoundException {
interpretAutomaton();
testAutomaton();
}
...
interpretAutomaton() {
...
Scanner fsaScanner = new Scanner(new BufferedReader(new FileReader(FSA_PATH)));
...
testAutomaton() {
...
Scanner fileScanner = new Scanner(new BufferedReader(new FileReader(FILE_PATH)));
*Both are surrounded by try/catch blocks - which clearly work!
Thanks to anyone who can help clarify on the matter!
Based on the comments so far, to answer your actual questions:
1) The command has four elements:
java - execute the java program
someJavaProgram - the name of the Java class to execute
fsa.fsa - the first argument to the java program, accessible via the args[] parameter of main
<test.txt - standard input redirection, the contents of the file will be available on the program's standard input, ie. System.in
The net effect is to run your Java program with one argument and one file's contents on the standard input.
2) Both the command line and the program look like they need to change:
change the command line to:
java someJavaProgram fsa.fsa test.txt
That is, remove the <. You will also need to check that the paths to the files are correct. This command line assumes you are in the same directory as the files when you execute it.
Change your code to use the filenames on the command line rather than the hard-coded names.
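A rough sketch of what that change could look like, showing both ways a second file can reach the program. The class name ArgsVersusStdin and the line-counting helper are hypothetical stand-ins for the real automaton code, which is elided in the question.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class ArgsVersusStdin {

    /** Counts the lines remaining in whatever source the scanner wraps. */
    static int countLines(Scanner scanner) {
        int lines = 0;
        while (scanner.hasNextLine()) {
            scanner.nextLine();
            lines++;
        }
        return lines;
    }

    public static void main(String[] args) throws FileNotFoundException {
        if (args.length < 1) {
            System.err.println("usage: java ArgsVersusStdin file1 [file2]   (or redirect file2 to standard input)");
            return;
        }
        // args[0] is the first word after the class name, e.g. fsa.fsa.
        try (Scanner first = new Scanner(new File(args[0]))) {
            System.out.println(args[0] + ": " + countLines(first) + " lines");
        }
        if (args.length >= 2) {
            // java ArgsVersusStdin fsa.fsa test.txt   -> test.txt arrives as args[1]
            try (Scanner second = new Scanner(new File(args[1]))) {
                System.out.println(args[1] + ": " + countLines(second) + " lines");
            }
        } else {
            // java ArgsVersusStdin fsa.fsa <test.txt  -> the shell feeds test.txt to System.in
            Scanner stdin = new Scanner(System.in);
            System.out.println("standard input: " + countLines(stdin) + " lines");
        }
    }
}
```

With the redirection form the program never sees the name test.txt at all; the shell opens the file and the program only sees its contents on System.in. If only one argument is given and nothing is redirected, the last branch waits for input typed in the terminal.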

Java Scanner hasNextLine returns false

I have several files (actually they are also java source files saved in Eclipse on Ubuntu) which I need to read and process line by line. I've noticed that I cannot read one of the files. The code I am using is as below
try (Scanner scanner = new Scanner(file)) {
while (scanner.hasNextLine() ) {
builder.append(scanner.nextLine()).append("\n");
}
} catch (FileNotFoundException ex) {
System.out.println("Error");
}
I checked beforehand that the file exists, and it does. I can even rename it. But I cannot read a single line: hasNextLine simply returns false. (I even tried hasNext.)
In the end I looked at the content of the file and found an unusual-looking character (in a comment in the Java file). It is the following character.
¸
When I delete this character, I can read the file normally. However this is not acceptable. What can I do to read the files even with that character in it?
This is most probably a character set issue: the platform running your Java code uses a different default character set than the one the file was saved in. It is always good practice to specify the expected character set when parsing. With the Scanner class that is just a matter of calling the constructor as:
Scanner scanner = new Scanner(file, "UTF-8");
where the second parameter is the character set name, or even better (this overload taking a Charset requires Java 10 or later):
Scanner scanner = new Scanner(file, StandardCharsets.UTF_8);
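One caveat worth illustrating: the charset passed to Scanner has to be the one the file was actually saved in, which is not necessarily UTF-8. The sketch below assumes, purely for illustration, a source file saved as ISO-8859-1 that contains U+00B8 in a comment; the class name CharsetAwareRead, the helper readAll, and the sample contents are hypothetical. It passes the charset by name so that it also compiles on Java versions before 10.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Scanner;

public class CharsetAwareRead {

    /** Reads every line of the file, decoding it with the named charset. */
    static String readAll(File file, String charsetName) throws IOException {
        StringBuilder builder = new StringBuilder();
        try (Scanner scanner = new Scanner(file, charsetName)) {
            while (scanner.hasNextLine()) {
                builder.append(scanner.nextLine()).append("\n");
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample: a comment containing U+00B8, saved as ISO-8859-1 (single byte 0xB8).
        File sample = File.createTempFile("charset", ".java");
        sample.deleteOnExit();
        Files.write(sample.toPath(), "// marker \u00B8\nint x;\n".getBytes(StandardCharsets.ISO_8859_1));

        String asLatin1 = readAll(sample, StandardCharsets.ISO_8859_1.name());
        String asUtf8 = readAll(sample, StandardCharsets.UTF_8.name());

        System.out.println("Decoded as ISO-8859-1: " + asLatin1.length() + " chars");
        System.out.println("Decoded as UTF-8:      " + asUtf8.length() + " chars");
    }
}
```

If the two lengths differ, the file is not valid in the charset that produced the shorter result, so that charset is the wrong choice for it.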

Scanner.hasNext() returns false

I have a directory with many files in it, each with over 800 lines. However, when I try to read one using Scanner, it seems empty.
File f1 = new File("data/cityDistances/a.txt"),
f2 = new File("data/cityDistances/b.txt");
System.out.println(f1.exists() && f2.exists()); // prints true
System.out.println(f1.getTotalSpace() > 0 && f2.getTotalSpace() > 0); // prints true
Scanner in = new Scanner(f1);
System.out.println(in.hasNext()); // prints false
System.out.println(in.hasNextLine()); // prints false
Why can it behave like that?
I've managed to do it using BufferedReader. Nonetheless, it seems even stranger that BufferedReader works and Scanner doesn't.
As the default delimiter for Scanner is whitespace, that would imply your a.txt contains only whitespace - does it have 800 lines of whitespace? ;)
Have you tried the following?
new Scanner(new BufferedReader(new FileReader("a.txt")));
I had a similar problem today reading a file with Scanner.
I specified the encoding type of the file and it solved the problem.
scan = new Scanner(selectedFile,"ISO-8859-1");
This also happened to me today. I'm reading a plain text file on a Linux system that was written by some application on a Windows box, and Scanner.hasNextLine() is always false even though there are lines with the Windows line separator and all. As said by Hound Dog, by using
Scanner scanner = new Scanner(new BufferedReader(new FileReader(file)));
it worked like a charm. FileReader or BufferedReader seem to properly identify and use some file characteristics.
Checking the scanner's exception may show that the file can't be read.
...
System.out.println(in.hasNext()); // return false;
IOException ex = in.ioException();
if (ex != null)
ex.printStackTrace(System.out);
...
The function File.getTotalSpace() is not behaving how you're expecting. It is returning the size of the partition where those particular files are located.
You want to use File.length().
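A quick way to see the difference is to compare an empty and a non-empty file on the same partition. In this sketch the class name FileSizeCheck, the helper hasContent, and the sample contents are illustrative only:

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class FileSizeCheck {

    /** True only if the path is a regular file that actually contains bytes. */
    static boolean hasContent(File file) {
        return file.isFile() && file.length() > 0;
    }

    public static void main(String[] args) throws IOException {
        File empty = File.createTempFile("empty", ".txt");
        File filled = File.createTempFile("filled", ".txt");
        empty.deleteOnExit();
        filled.deleteOnExit();
        Files.write(filled.toPath(), "Warsaw 300\n".getBytes(StandardCharsets.UTF_8));

        // getTotalSpace() describes the partition, so it is the same for both files.
        System.out.println("empty:  totalSpace=" + empty.getTotalSpace() + " length=" + empty.length());
        System.out.println("filled: totalSpace=" + filled.getTotalSpace() + " length=" + filled.length());
        System.out.println("hasContent(empty)=" + hasContent(empty) + ", hasContent(filled)=" + hasContent(filled));
    }
}
```

So the getTotalSpace() > 0 check in the question passes even for an empty file and says nothing about whether there is anything to read.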
Judging by the file path, you are using Linux, and judging by your name, the file may contain some Polish or Czech characters. See the link below: Scanner doesn't like non-UTF-8 characters on Linux :)
http://karussell.wordpress.com/2008/09/04/encoding-issues-solutions-for-linux-and-within-java-apps/
This probably doesn't answer the OP's question, but for anyone else who is experiencing "my iterator's hasNext() is always false!"--if you have a bunch of watches with .next() in them, your IDE or whatever is actually advancing the cursor position per each one. This has caused me much trouble. Be careful with iterators and watches.
