Split text file by line, platform-independently

I want to split a text file by line. On Windows that would be text = new String(Files.readAllBytes(path), charset); text.split("\r\n", -1), on UNIX it's text.split("\n", -1), and text.split(System.lineSeparator(), -1) works on whichever platform the code runs on. But what if a file is created on UNIX and copied to Windows or vice versa: how do I best handle those cases? And what would that mean for the file itself: would it look broken if you tried to view it in a text editor like Notepad?

Try Files.readAllLines. Alternatively, use Files.lines, which returns a Stream of lines.
From the javadoc of readAllLines:
This method recognizes the following as line terminators:
\u000D followed by \u000A, CARRIAGE RETURN followed by LINE FEED
\u000A, LINE FEED
\u000D, CARRIAGE RETURN
Copying from one file system to another doesn't change the content of the file (unless you are doing some "special" copying ;-) ).
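As a minimal sketch of the same idea for text that is already in memory: the hypothetical helper splitLines below splits on the three terminators the javadoc lists. The class and method names are illustrative only and not part of any library.

```java
import java.util.Arrays;
import java.util.List;

class LineSplitSketch {
    // Hypothetical helper: split on \r\n, \n, or \r.
    // \r\n comes first in the alternation so a Windows ending counts as one break.
    static List<String> splitLines(String text) {
        return Arrays.asList(text.split("\\r\\n|\\n|\\r", -1));
    }

    public static void main(String[] args) {
        // Mixed endings stand in for a file that moved between platforms.
        System.out.println(splitLines("one\r\ntwo\nthree\rfour"));
    }
}
```

Because of the -1 limit, text that ends with a terminator yields a trailing empty string here, which Files.readAllLines would not produce.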

If you create a file, it will use whatever line separator is native to the platform.
If you then open the file on another platform, the file does not change. If you open a unix file on windows, it doesn't gain the extra \r character.
How the file looks really depends on the editor; some editors handle foreign line endings better than others.
As for Java, just use System.lineSeparator() if you need to specify the end-of-line character sequence.
As @Andreas mentioned, you can use BufferedReader.readLine() to read a file a line at a time, and it will handle the end-of-line character sequence in a platform-independent manner.
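A short sketch of the readLine() approach, assuming the input arrives as a Reader; readLines is a hypothetical helper name, and a StringReader stands in for a real file here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class ReadLineSketch {
    // Hypothetical helper: collect lines from any Reader.
    // readLine() strips the terminator, whichever of \n, \r, or \r\n it was.
    static List<String> readLines(Reader source) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(readLines(new StringReader("unix\nwindows\r\nold-mac\rlast")));
    }
}
```

For a real file, the Reader could come from Files.newBufferedReader(path, charset) instead.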

Related

Java: PrintWriter and newlines in a string

My question is pretty straightforward: if I have a single long string with a lot of "\n" newlines within it, i.e.:
strings = "Hey\nThere\nFriend\n"
And use a PrintWriter in Java to do the following:
PrintWriter save = new PrintWriter("test.txt");
save.println(strings);
save.close();
Will the file I end up with be formatted according to the \n characters? I.e., will the file have:
Hey
There
Friend
Or will it have:
Hey\nThere\nFriend
If it's the latter, can someone guide me on how I might change my code (and understanding of how PrintWriter works) to create the former output?

In fact, \n will work, but only on Unix-based operating systems. Windows-based operating systems use \r\n as the separator.
You should avoid hard-coding an OS-specific line separator if you want to write portable code.
Prefer System.lineSeparator() so that the code is not OS-dependent.
Note also that PrintWriter provides println() to write a line break that is not OS-dependent (even if it is not necessarily useful for your use case).
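One way to apply this to the string from the question is to write each piece with println(), sketched below. The helper name withPlatformNewlines is hypothetical, and a StringWriter stands in for the file "test.txt" so the result can be inspected.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

class PrintlnSketch {
    // Hypothetical helper: write each "\n"-separated piece with println(),
    // so every line ends with the platform's own separator.
    static String withPlatformNewlines(String text) {
        StringWriter buffer = new StringWriter(); // stands in for the output file
        try (PrintWriter out = new PrintWriter(buffer)) {
            for (String line : text.split("\n")) {
                out.println(line);
            }
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.print(withPlatformNewlines("Hey\nThere\nFriend\n"));
    }
}
```

On Windows the returned text contains \r\n after each word, on Linux just \n.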

You will get a text file containing a single text line Hey\nThere\nFriend\n followed by your operating system new-line sequence (inserted by println()).
The meaning of \n depends on the operating system and possibly the text editor. On Linux \n usually will be interpreted as new-line sequence but on Windows the new-line sequence is \r\n so most text editors (e.g. native Notepad) will display a single HeyThereFriend line.

On the Windows platform a line break is char(13) followed by char(10), i.e. \r\n, not \n alone. You can use:
String nl = Character.toString((char) 13) + Character.toString((char) 10);
String strings = "Hey"+nl+"There"+nl+"Friend"+nl;
System.out.print(strings);

Unable to read any of file that contains specific character(s)

TL;DR
Why does reading in a file with – not find any data on Notepad?
Problem:
Up to this point, I have been using just plain ol' Notepad (Version 6.1) to read/write text for testing/answering questions here.
Simple bit of code to read in the text files contents, and print them to the console:
Scanner sc = new Scanner(new File("myfile.txt"));
while (sc.hasNextLine()) {
    String text = sc.nextLine();
    System.out.println(text);
}
All is well, the lines print as expected.
Then, if I put in this exact character: –, anywhere in the text file, it will not read any of the file, and print nothing to the console.
I can of course use Notepad++ or other (better) text editors, and there is no issue, the text, including the dash character, will print as expected.
I can also specify UTF-8, using Notepad, and it will work fine:
File fileDir = new File("myfile.txt");
BufferedReader in = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(fileDir), "UTF8"));
String str;
while ((str = in.readLine()) != null) {
    System.out.println(str);
}
On my original Notepad file, if I copy and paste the text (including the –) into Notepad++ and compare the two files with WinMerge, it tells me that the dash on Notepad is –, but on Notepad++, it is â€“.
Question:
Why, when this – is used in a text file in Notepad, it reads nothing, basically telling me that hasNextLine() is false? Should it not at least read the input until the line that contains this specific character?
Steps to reproduce:
On Windows 7, right-click and create new Text Document.
Put any text in the file (without any special characters, as such)
Put in this character anywhere in the file: –
Run the first block of code above
Output: BUILD SUCCESSFUL (total time: 1 second), i.e. doesn't print any of the text.
PS:
I know I asked a similar (well, it ended up being the same) question yesterday, but unfortunately, it seems I may not have explained myself well, or some of the viewers didn't fully read the question. Either way, I think I've explained it better here.

The issue seems to be a difference of encoding. You have to read the file in the same encoding it was written in.
Your system Notepad probably uses the Windows-1252 (Cp1252) encoding. This encoding is known to cause problems for characters in the range 128-159, and the dash lies in this range. These characters are not present in the otherwise equivalent ISO 8859-1; they exist only in Cp1252.
Eclipse, when reading the Notepad file, assumes the file is encoded in ISO-8859-1 (as it is nearly equivalent). But this character is not present in ISO-8859-1, hence the problem. If you want to read the file from Java, you will have to specify Cp1252, and you should get your output.
This is also the reason why your code with UTF-8 works correctly, when the file in notepad is written in UTF-8.
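To illustrate the point about the dash, here is a small sketch that decodes the single byte Windows-1252 uses for the en dash (0x96) with two different charsets. The helper name decode is hypothetical.

```java
import java.nio.charset.Charset;

class DashDecodingSketch {
    // Hypothetical helper: decode raw bytes with a named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] dash = { (byte) 0x96 }; // the en dash as Windows-1252 stores it
        // 8211 is U+2013, the en dash.
        System.out.println((int) decode(dash, "windows-1252").charAt(0));
        // 150 is U+0096, a control character rather than a dash.
        System.out.println((int) decode(dash, "ISO-8859-1").charAt(0));
    }
}
```

For a whole file, the same charset can be passed to a reader, for example Files.readAllLines(path, Charset.forName("windows-1252")).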

A buffered reader reads more than the current line, maybe the text up to the problematic bytes. CharsetDecoder.onMalformedInput then comes into play, and something restrictive happens there, which I would not normally have expected.
Do you use a special JDK? Do you sweep exceptions under the carpet, for example with a lambda wrapping the above code? (Add catch Throwable.)
Is your platform encoding -Dfile.encoding=ISO-8859-1 instead of Cp1252?
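One way to make such a decoding failure visible instead of silent is to decode with a CharsetDecoder set to REPORT, sketched below. The helper decodesCleanly is a hypothetical name, and the three sample bytes are an assumption about what Notepad wrote.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeSketch {
    // Hypothetical helper: true if the bytes decode without error in the given charset.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false; // malformed or unmappable input for this charset
        }
    }

    public static void main(String[] args) {
        byte[] sample = { 'a', (byte) 0x96, 'b' }; // assumed Notepad bytes around the dash
        System.out.println(decodesCleanly(sample, StandardCharsets.UTF_8));
        System.out.println(decodesCleanly(sample, Charset.forName("windows-1252")));
    }
}
```

If the reader is a Scanner, its ioException() method returns the last IOException it swallowed, which may also be worth checking after the loop.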

Carriage Return and Line Feed windows and Linux java application

I am working on an integration test application. This is what I am doing in the test case:
I read a test input file, which is stored in CVS, and write it to a file in the file system. The application polls the directory for the file, processes it and creates the output file. I poll the directory for the output file, and the test case is successful if both files' contents are equal (I read both the input and output files into strings and compare them).
The problem is that this test case fails when it runs on a Linux system. The reason is that the file stored in CVS was checked in from a Windows system and contains CRLF line terminators, whereas the generated output file has LF line terminators, so when I read these files and compare them character by character, they mismatch.
Could anyone help here?

You can check the line separator for the host operating system using System.getProperty("line.separator")
Since you're using text files, you can also compare the file contents line by line. Check LineNumberReader.readLine() for that.
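A line-by-line comparison could look roughly like the sketch below. It uses BufferedReader.readLine(), which LineNumberReader inherits; the helper name sameLines is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class LineCompareSketch {
    // Hypothetical helper: true if both sources contain the same lines,
    // regardless of which line terminator each one uses.
    static boolean sameLines(Reader first, Reader second) {
        try (BufferedReader a = new BufferedReader(first);
             BufferedReader b = new BufferedReader(second)) {
            String lineA = a.readLine();
            String lineB = b.readLine();
            while (lineA != null && lineB != null) {
                if (!lineA.equals(lineB)) {
                    return false;
                }
                lineA = a.readLine();
                lineB = b.readLine();
            }
            return lineA == null && lineB == null; // both must end together
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sameLines(new StringReader("a\r\nb\r\n"), new StringReader("a\nb\n")));
    }
}
```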

You can try to compare them line by line, e.g. using FileUtils (from Apache Commons IO):
List<String> file1 = FileUtils.readLines(...);
List<String> file2 = FileUtils.readLines(...);
return file1.equals(file2);

You could remove all the '\r' characters from the downloaded file, or replace the "\r\n" Windows sequence with the "\n" Linux one. Beware of the old Mac OS case too: there the end of line could be identified by "\r" alone.
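A minimal sketch of that normalization, applied to both strings before comparing them; normalizeNewlines is a hypothetical helper name.

```java
class NormalizeSketch {
    // Hypothetical helper: turn \r\n and lone \r into \n.
    // \r\n is replaced first so it does not become two line breaks.
    static String normalizeNewlines(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String fromWindows = "row1\r\nrow2\r\n";
        String fromLinux = "row1\nrow2\n";
        System.out.println(normalizeNewlines(fromWindows).equals(normalizeNewlines(fromLinux)));
    }
}
```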

When you check in the file, you can tell CVS it's a binary file (cvs add -kb), and then CVS will not convert line endings along the way.
This has other drawbacks too, e.g. no proper diff, but if you really test character by character, I guess you don't need that.
Please note that you must specify -kb when adding the file, you can't change it later.

getting a "^M" after each line in unix from a java created file

I have a Java program that creates a file and prints a bunch of data using this statement:
out.write(data+"|"+data2+"\r\n");
When I view this file in vim in Unix I see a ^M after each line. What is it? What is causing this? How can I get rid of it?

You need to use the platform-specific line separator string instead of \r\n when constructing your output string. That can be obtained by System.getProperty("line.separator");.

^M is character 13 (decimal) which is the carriage return (in your code it's \r). Notice that M is the 13th letter of the alphabet.
You can get rid of it by not including \r in your code. This will work fine if you're on a unix platform. On windows, the file will look funny unless you're viewing it in something like Wordpad.
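The ^M naming follows caret notation, where a control character is shown as ^ plus the letter 64 code points above it. A small sketch of that arithmetic, with a hypothetical helper name:

```java
class CaretSketch {
    // Hypothetical helper: caret notation for ASCII control characters (codes 0..31).
    static String caret(char control) {
        if (control > 31) {
            throw new IllegalArgumentException("not an ASCII control character");
        }
        return "^" + (char) (control + 64); // 13 + 64 = 77, which is 'M'
    }

    public static void main(String[] args) {
        System.out.println(caret('\r')); // carriage return
        System.out.println(caret('\n')); // line feed
    }
}
```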

*nix uses \n for newline, Windows uses \r\n and produces that ^M character in vi and the like.

You may want to try running the file through dos2unix utility in *nix, it will get rid of ^M

You'll generally only see those if the first line was a unix line ending (lf) but it also includes DOS line endings. To remove them (and correct the file), load it again using :e ++ff=dos, then :set ff=unix, then write it.
Within the Java code, if you're writing text data instead of binary, use a PrintStream and its print() and println() methods; println() adds the correct line ending for your system.

"\n" delimiters issue

I have a StringBuilder object that a line of data gets added to.
After each line gets added, I append a "\n" on the end to indicate a new line.
This StringBuilder object, once finalised, gets written to a flat file.
When I open the flat file in Notepad I get a small rectangle after every line and the column formatting is ruined.
When I open the flat file in WordPad, the new line is taken into consideration and the column formatting is perfect.
I have tried all the ways I know of removing the new line entry before it gets written, but this removes the formatting when written to the flat file. I need the new line for the formatting of the columns.
How can I output the file with new lines but without using \n?

The Windows way of terminating a line is to use "\r\n", not just "\n".
You can find the "line separator for the current operating system" using the line.separator system property:
String lineSeparator = System.getProperty("line.separator");
...
StringBuilder builder = new StringBuilder();
...
builder.append(lineSeparator);
...

You can get the value for the system your Java program is running on from the system properties
public static String newline = System.getProperty("line.separator");

You should append System.getProperty("line.separator") instead of \n. Notepad expects \r\n, which is what the property returns on MS Windows.
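As an alternative sketch, not taken from the answers here: the %n conversion in String.format expands to the same platform line separator, which can be convenient when building column-formatted rows. The helper name row and the "|" column separator are illustrative assumptions.

```java
class PercentNSketch {
    // Hypothetical helper: format one two-column row ending in the platform line separator.
    static String row(String left, String right) {
        return String.format("%s|%s%n", left, right);
    }

    public static void main(String[] args) {
        StringBuilder builder = new StringBuilder();
        builder.append(row("id", "name"));
        builder.append(row("1", "Alice"));
        System.out.print(builder);
    }
}
```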

On Windows you should use \r\n. On *NIX (Linux/UNIX/Mac) you should use \n.

If you're using Windows, you should be writing \r\n to get the file to load properly in Notepad. The \n terminator is a Unix line ending, and Notepad won't parse it properly. WordPad will convert them for you.
Also I suggest not using Notepad, and looking towards something like Vim.
