Difference between Python's readline() and Java's Scanner class nextLine()

What is the difference between Python's readline() and Java's Scanner class method nextLine()?
nextLine() looks for the next line separator character which could be something other than "\n" as written here:
http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html#nextLine()
Does the Python readline() method do the same? This is important because my file could have other line separators, but I need to look for specifically the new line character.
Any ideas?

You should test it yourself.
I've tested it on the console using f.readline() and it reads until \n, even if I have a \r in the line.
>>> f.readline()
'This is a test\n'
>>> f.readline()
'Second line\rwith char\n'
>>> f.readline()
'Third line'
NOTE: Some weird things can happen if you simply print the line you read in a Python script. But if you use repr(str) you'll see all the \n and \r characters.

First of all, you are comparing apples to oranges. Scanner is not the Java equivalent of a Python file object. BufferedReader is the equivalent, and in fact if you look at the documentation of BufferedReader's readLine method:
Reads a line of text. A line is considered to be terminated by any one
of a line feed ('\n'), a carriage return ('\r'), or a carriage return
followed immediately by a linefeed.
Python does this too in universal newlines mode (the default for text files in Python 3):
A manner of interpreting text streams in which all of the following
are recognized as ending a line: the Unix end-of-line convention '\n',
the Windows convention '\r\n', and the old Macintosh convention '\r'.
See PEP 278 and PEP 3116, as well as str.splitlines() for an
additional use.
AFAIK Python does not provide a public equivalent of Java's Scanner. But there is an (undocumented) re.Scanner which could be used to achieve what you want.
You simply provide a "lexicon" when creating an instance and then call the scan method.
Probably the easiest way of achieving what you want is to read the file in chunks and split it using re.split.
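On the Java side of the comparison, here is a minimal sketch of splitting on \n only, so that a stray \r stays inside the line just as in the readline() session above. The class and method names are made up for illustration, and it assumes the whole text is already in a String.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class NewlineOnly {
    // Hypothetical helper: splits text on a line feed only,
    // leaving any carriage return inside the returned tokens.
    static List<String> linesSplitOnLf(String text) {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter("\n"); // delimiter regex: a single line feed
            while (scanner.hasNext()) {
                lines.add(scanner.next());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = "This is a test\nSecond line\rwith char\nThird line";
        for (String line : linesSplitOnLf(text)) {
            // Replace the carriage return so it is visible when printed.
            System.out.println(line.replace("\r", "\\r"));
        }
    }
}
```

Unlike nextLine(), this does not treat \r, \u2028, \u2029 or \u0085 as the end of a line.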

Related

Java Scanner.nextLine() mistakenly parses unicode (emoji) as a new line

Easiest demonstrated with an example:
String test = "salut ð\u009F\u0098\u0085 test";
Scanner scan = new Scanner(test);
System.out.println("1:" + scan.nextLine());
System.out.println("2:" + scan.nextLine());
This was a string in user input so unfortunately I'm not 100% sure what that unicode is, but if I recall correctly, it was an emoji (I saw the message when it was sent).
The output is:
1:salut ð
2: test
My expected output is just 1 line (i.e. the example code should give a NoSuchElementException because the second nextLine() should fail.). Why is it parsing as two lines? What is a potential workaround?
When I open the file in a text editor it correctly does not treat that unicode as a new line.
Why is it parsing as two lines?
Although this is an uncommon code point, the Unicode name of U+0085 is NEXT LINE (NEL), so it can reasonably be considered a newline character.
But is there a reason BufferedReader and text editors like Sublime Text don't parse it as an actual new line, while Scanner does?
If you look at the respective documentations of Scanner and BufferedReader:
Scanner.nextLine:
Advances this scanner past the current line and returns the input that was skipped. This method returns the rest of the current line, excluding any line separator at the end. The position is set to the beginning of the next line.
Since this method continues to search through the input looking for a line separator...
BufferedReader.readLine:
Reads a line of text. A line is considered to be terminated by any one of a line feed ('\n'), a carriage return ('\r'), or a carriage return followed immediately by a linefeed.
Scanner.nextLine just says "line separator", a very vague term (it certainly doesn't refer to the Unicode category "Line Separator", which only has one code point), whereas the BufferedReader.readLine documentation states exactly what terminates a line.
Considering how Scanner also handles localised number formats and stuff, my guess is that it is designed to be a "smarter" class than BufferedReader.
Looking at the source code of my version of the JDK, Scanner considers the following strings "line separators":
\r\n
\n
\r
\u2028
\u2029
\u0085
The reason why \u0085 is considered a newline character is apparently related to XML parsing (XML 1.1 treats U+0085 as a line end).
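To see the difference side by side, here is a small sketch that feeds the same string to Scanner.nextLine() and BufferedReader.readLine(). The class and helper names are invented for this example; reading user input through BufferedReader is one possible workaround if U+0085 should not end a line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class NelDemo {
    // Collects lines the way Scanner.nextLine() sees them.
    static List<String> scannerLines(String text) {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNextLine()) {
                lines.add(scanner.nextLine());
            }
        }
        return lines;
    }

    // Collects lines the way BufferedReader.readLine() sees them.
    static List<String> readerLines(String text) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = "salut \u0085 test"; // contains U+0085 (NEL)
        System.out.println("Scanner lines: " + scannerLines(text).size());
        System.out.println("BufferedReader lines: " + readerLines(text).size());
    }
}
```

Scanner splits the string at U+0085, while BufferedReader returns it as a single line.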

Using the split method to split a text file of music notes that also has line breaks

I am wondering how you can use the split method with this example, considering that there is a line break in the file.
g3,g3,g3,c4-,a3-,g4-,r,r,r,g3,g3,g3,c4-,a3-,a4,g4-,r,r,r,c4,c4,c4,e4,r
g4,r,a4,r,r,b4b,r,a4,f4,r,g4,r,r,g4#,r,g4,d4#,r,g4
I read the Pattern api and tutorials and think it should be like so.
line.split("(,\n)");
I also tried
line.split([,\n]);
and
line.split("[,\n]");
Lines may be separated by \r, \n, both of them together, or even some other characters. Since Java 8 you can use \\R in a regex to match any line separator. So you could try using
String[] arr = yourText.split(",|\\R");
As Pshemo notes, the 3rd option str.split("[,\n]") should work assuming the file ends each line with \n and not \r\n.
Additionally, how you read the file may affect your split argument.
If you are reading the file in with a BufferedReader, then going line by line with the readLine() method will automatically exclude any line-termination characters.
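A short sketch of the ",|\\R" approach, assuming Java 8 or later and that the whole file content is already in one String. The class and method names are placeholders.

```java
import java.util.Arrays;

class NoteSplitter {
    // Splits on commas and on any line break (\R covers \n, \r\n, \r and others).
    static String[] splitNotes(String text) {
        return text.split(",|\\R");
    }

    public static void main(String[] args) {
        String text = "g3,g3,g3,c4-,r\r\ng4,r,a4,r";
        System.out.println(Arrays.toString(splitNotes(text)));
    }
}
```

Because \R matches \r\n as a single separator, no empty string appears between the last note of one line and the first note of the next.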

printf: Difference between \n and %n [duplicate]

I'm reading Effective Java and it uses %n for the newline character everywhere. I have used \n rather successfully for newline in Java programs.
Which is the 'correct' one? What's wrong with \n ? Why did Java change this C convention?
From a quick google:
There is also one specifier that doesn't correspond to an argument. It is "%n" which outputs a line break. A "\n" can also be used in some cases, but since "%n" always outputs the correct platform-specific line separator, it is portable across platforms whereas"\n" is not.
Please refer to
https://docs.oracle.com/javase/tutorial/java/data/numberformat.html
%n is portable across platforms
\n is not.
See the formatting string syntax in the reference documentation:
'n' (line separator): The result is the platform-specific line separator
While \n is the correct newline character for Unix-based systems, other systems may use different characters to represent the end of a line. In particular, Windows systems use \r\n, and early Mac OS systems used \r.
By using %n in your format string, you tell Java to use the value returned by System.getProperty("line.separator"), which is the line separator for the current system.
Warning:
If you're doing NETWORKING code, you might prefer the certainty of an explicit \n (or \r\n, if the protocol requires it), as opposed to %n, which may send different characters across the network depending on the platform it's running on.
"correct" depends on what exactly it is you are trying to do.
\n will always give you a "unix style" line ending.
\r\n will always give you a "dos style" line ending.
%n will give you the line ending for the platform you are running on
C handles this differently. You can choose to open a file in either "text" or "binary" mode. If you open the file in binary mode \n will give you a "unix style" line ending and "\r\n" will give you a "dos style" line ending. If you open the file in "text" mode on a dos/windows system then when you write \n the file handling code converts it to \r\n. So by opening a file in text mode and using \n you get the platform specific line ending.
I can see why the designers of java didn't want to replicate C's hacky ideas regarding "text" and "binary" file modes.
Notice these answers are only true when using System.out.printf() or System.out.format() or the Formatter object. If you use %n in System.out.println(), it will simply produce a %n, not a newline.
In Java, \n always generates the \u000A line feed character. To get the correct line separator for a particular platform, use %n.
So use \n when you are sure that you need the \u000A line feed character, for example in networking.
In all other situations use %n.
The %n format specifier is a line separator that's portable across operating systems. However, it only works in format strings: passed to System.out.print or System.out.println, it is printed literally.
It is generally recommended to use %n rather than \n in format strings.
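A quick way to check this on your own machine is the sketch below. The class and method names are made up for the example; it uses String.format so the results can be compared instead of just printed.

```java
class LineSeparatorDemo {
    // %n is expanded by the formatter to the platform line separator.
    static String withPercentN() {
        return String.format("first%nsecond");
    }

    // \n is a plain line feed on every platform.
    static String withBackslashN() {
        return String.format("first\nsecond");
    }

    public static void main(String[] args) {
        String expected = "first" + System.lineSeparator() + "second";
        System.out.println("%n matches line.separator: " + withPercentN().equals(expected));
        System.out.println("\\n matches line.separator: " + withBackslashN().equals(expected));
        System.out.println("%n"); // println does no formatting, so this prints the two characters %n
    }
}
```

On a Unix-like system both comparisons should print true; on Windows only the first should, since the separator there is \r\n.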

java Scanner reads only first 2048 bytes

I'm using java.util.Scanner to read file contents from classpath with this code:
String path1 = getClass().getResource("/myfile.html").getFile();
System.out.println(new File(path1).length()); // 22244 (correct)
String file1 = new Scanner(new File(path1)).useDelimiter("\\Z").next();
System.out.println(file1.length()); // 2048 (first 2k only)
Code runs from idea with command (maven test)
/Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home/bin/java -Dmaven.home=/usr/share/java/maven-3.0.4 -Dclassworlds.conf=/usr/share/java/maven-3.0.4/bin/m2.conf -Didea.launcher.port=7533 "-Didea.launcher.bin.path=/Applications/IntelliJ IDEA 12 CE.app/bin" -Dfile.encoding=UTF-8 -classpath "/usr/share/java/maven-3.0.4/boot/plexus-classworlds-2.4.jar:/Applications/IntelliJ IDEA 12 CE.app/lib/idea_rt.jar" com.intellij.rt.execution.application.AppMain org.codehaus.classworlds.Launcher --fail-fast --strict-checksums test
It was running perfectly on my win7 machine. But after I moved to mac same tests fail.
I tried to google but didn't find much =(
Why does Scanner with delimiter \Z read my whole file into a string on win7 but not on mac?
I know there're more ways to read a file, but I like this one-liner and want to understand why it's not working.
Thanks.
Here is some info from java about it
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
\Z The end of the input but for the final terminator, if any
\z The end of the input
Line terminators
A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:
A newline (line feed) character ('\n'),
A carriage-return character followed immediately by a newline character ("\r\n"),
A standalone carriage-return character ('\r'),
A next-line character ('\u0085'),
A line-separator character ('\u2028'), or
A paragraph-separator character ('\u2029').
So use \z instead of \Z
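The difference between the two anchors is easy to check with a plain regex match. This is only an illustrative sketch; the class and method names are invented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo {
    // Returns the index where the regex first matches in the input, or -1 if it never does.
    static int firstMatchIndex(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.start() : -1;
    }

    public static void main(String[] args) {
        String input = "abc\n";
        // \Z can match just before the final line terminator.
        System.out.println(firstMatchIndex("\\Z", input)); // 3
        // \z only matches at the very end of the input.
        System.out.println(firstMatchIndex("\\z", input)); // 4
    }
}
```

So with \Z as the delimiter, a trailing line terminator is not part of the first token, whereas \z only matches at the very end of the input.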
There is a good article about this method of entirely reading file with Scanner:
http://closingbraces.net/2011/12/17/scanner-with-z-regex/
In brief:
Because a single read with “/z” as the delimiter should read
everything until “end of input”, it’s tempting to just do a single
read and leave it at that, as the examples listed above all do.
In most cases that’s OK, but I’ve found at least one situation where
reading to “end of input” doesn’t read the entire input – when the
input is a SequenceInputStream, each of the constituent InputStreams
appears to give a separate “end of input” of its own. As a result, if
you do a single read with a delimiter of “/z” it returns the content
of the first of the SequenceInputStream’s constituent streams, but
doesn’t read into the rest of the constituent streams.
Be careful with this approach. It is better to read the file line by line, or to keep calling next() while hasNext() returns true.
UPD: In other words, try this code:
StringBuilder file1 = new StringBuilder();
Scanner scanner = new Scanner(new File(path1)).useDelimiter("\\Z");
while (scanner.hasNext()) {
file1.append(scanner.next());
}
I encountered this as well when using nextLine() on Mac, Java 7 update 45. Worse, after a line longer than 2048 bytes, the rest of the file is ignored and the Scanner thinks it has already reached the end of the file.
I changed it to explicitly tell Scanner to use a larger buffer, and it works.
Scanner sc = new Scanner(new BufferedInputStream(new FileInputStream(nf), 20*1024*1024), "utf-8");
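Putting the suggestions above together, here is a rough sketch of a read-everything helper that uses \z and keeps calling next() until hasNext() is false, so a token that comes back in pieces is still collected in full. The class name, method name and file name are placeholders, and it takes a Readable so it can be tried with a StringReader as well as a file.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Scanner;

class WholeInput {
    // Reads everything from the source, looping in case the
    // Scanner hands the content back in more than one token.
    static String readAll(Readable source) {
        StringBuilder content = new StringBuilder();
        try (Scanner scanner = new Scanner(source)) {
            scanner.useDelimiter("\\z"); // zero-width match at the end of input
            while (scanner.hasNext()) {
                content.append(scanner.next());
            }
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        // "myfile.html" is a placeholder path for this example.
        try (Reader reader = Files.newBufferedReader(Paths.get("myfile.html"), StandardCharsets.UTF_8)) {
            System.out.println(readAll(reader).length());
        }
    }
}
```

If the Scanner one-liner is not essential, new String(Files.readAllBytes(path), StandardCharsets.UTF_8) reads the whole file without going through Scanner at all.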

