Newline character different than system newline - java

My Question: How do I force an input stream to process line separators as the system standard line separator?
I read a file into a string and the newlines get converted to \n, but System.getProperty("line.separator") is \r\n. I want this to be portable, so I want my file reader to read the newlines as the system's standard newline sequence (whatever that may be). How can I force it? Here are my methods from the Java Helper Library for reading a file into a string.
/**
 * Takes the file and returns it in a string. Uses UTF-8 encoding
 *
 * @param fileLocation
 * @return the file in String form
 * @throws IOException when trying to read from the file
 */
public static String fileToString(String fileLocation) throws IOException {
    InputStreamReader streamReader = new InputStreamReader(new FileInputStream(fileLocation), "UTF-8");
    return readerToString(streamReader);
}
/**
 * Returns all the lines in the Reader's stream as a String
 *
 * @param reader
 * @return
 * @throws IOException when trying to read from the file
 */
public static String readerToString(Reader reader) throws IOException {
    StringWriter stringWriter = new StringWriter();
    char[] buffer = new char[1024];
    int length;
    while ((length = reader.read(buffer)) > 0) {
        stringWriter.write(buffer, 0, length);
    }
    reader.close();
    stringWriter.close();
    return stringWriter.toString();
}

Your readerToString method doesn't do anything to line endings. It simply copies character data - that's all. It's entirely unclear how you're diagnosing the problem, but that code really doesn't change \r\n to \n. The line breaks must be \n in the file itself - which you should check in a hex editor. What created the file in the first place? You should look there for how any line breaks are represented.
If you want to read lines, use BufferedReader.readLine() which will cope with \r, \n or \r\n.
Note that Guava has a lot of helpful methods for reading all the data from readers, as well as splitting a reader into lines etc.
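For instance, a minimal sketch using Guava's CharStreams (assuming Guava is on the classpath; the file name is a placeholder):
import com.google.common.io.CharStreams;
import java.io.*;
import java.nio.charset.StandardCharsets;

public class GuavaReadExample {
    public static void main(String[] args) throws IOException {
        try (Reader reader = new InputStreamReader(
                new FileInputStream("input.txt"), StandardCharsets.UTF_8)) {
            // Reads everything the reader produces into a single String.
            String contents = CharStreams.toString(reader);
            System.out.println(contents);
        }
    }
}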

It's advisable to use a BufferedReader to read the file line by line in a portable way; you can then write each line to the required output using the line separator of your choice.
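A minimal sketch of that idea (file names are placeholders; the separator written out here is the system default, but any separator could be substituted):
import java.io.*;
import java.nio.charset.StandardCharsets;

public class RewriteLineSeparators {
    public static void main(String[] args) throws IOException {
        String separator = System.getProperty("line.separator");
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                     new FileInputStream("input.txt"), StandardCharsets.UTF_8));
             Writer writer = new OutputStreamWriter(
                     new FileOutputStream("output.txt"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() strips \n, \r, or \r\n; append the separator we want instead.
                writer.write(line);
                writer.write(separator);
            }
        }
    }
}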

With the Scanner#useDelimiter method you can specify which delimiter to use when reading from a File, an InputStream, or any other source.
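For example, a sketch that treats \r\n, \r, or \n as the delimiter (the file name is a placeholder):
import java.io.*;
import java.util.Scanner;

public class ScannerDelimiterExample {
    public static void main(String[] args) throws IOException {
        try (Scanner scanner = new Scanner(new File("input.txt"), "UTF-8")) {
            // Split the input on any of the three common line-break sequences.
            scanner.useDelimiter("\\r\\n|[\\r\\n]");
            while (scanner.hasNext()) {
                System.out.println(scanner.next());
            }
        }
    }
}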

You can use a BufferedReader to read the file line by line and convert the line separators, e.g.:
public static String readerToString(Reader reader) throws IOException {
    BufferedReader bufReader = new BufferedReader(reader);
    StringBuffer stringBuf = new StringBuffer();
    String separator = System.getProperty("line.separator");
    String line = null;
    while ((line = bufReader.readLine()) != null) {
        stringBuf.append(line).append(separator);
    }
    bufReader.close();
    return stringBuf.toString();
}

Related

Character missing when using the InputStreamReader class in Java

I wrote some code to read a text file character by character and print it to the screen, but the result confused me. Here is the code I wrote:
import java.io.*;
import java.nio.charset.StandardCharsets;

public class learnIO
{
    public static void main(String[] args) throws IOException {
        var in = new InputStreamReader(new FileInputStream("test1.txt"), StandardCharsets.UTF_8);
        while(in.read() != -1){
            System.out.println((char)in.read());
        }
    }
}
The content and encoding of the file:
$ file test1.txt
test1.txt: ASCII text
$ cat test1.txt
hello, world!
the result is:
e
l
,
w
r
d
Some characters are missing. Why did this happen?
The loop calls read() twice per iteration: once in the while condition and once inside the println. The character read in the condition is only checked against -1 and then discarded, so every other character is skipped. (The read method returns an int rather than a char so that it can signal end of stream with -1.)
Refer to https://docs.oracle.com/javase/7/docs/api/java/io/InputStreamReader.html
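A minimal corrected sketch of the loop (the key change is calling read() once per iteration and storing the result before testing it):
import java.io.*;
import java.nio.charset.StandardCharsets;

public class learnIOFixed {
    public static void main(String[] args) throws IOException {
        try (Reader in = new InputStreamReader(
                new FileInputStream("test1.txt"), StandardCharsets.UTF_8)) {
            int c;
            // Read once, keep the value, and test that same value for end of stream.
            while ((c = in.read()) != -1) {
                System.out.println((char) c);
            }
        }
    }
}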
You should wrap the InputStreamReader in a BufferedReader. The official Oracle documentation says:
An InputStreamReader is a bridge from byte streams to character streams: It reads bytes and decodes them into characters using a specified charset. The charset that it uses may be specified by name or may be given explicitly, or the platform's default charset may be accepted.
Each invocation of one of an InputStreamReader's read() methods may cause one or more bytes to be read from the underlying byte-input stream.
To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
For top efficiency, consider wrapping an InputStreamReader within a BufferedReader. For example:
BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
So your problem can be solved with code like the following:
try {
    // Open the file and decode it as UTF-8
    FileInputStream fstream = new FileInputStream("hello.txt");
    BufferedReader br = new BufferedReader(
            new InputStreamReader(fstream, StandardCharsets.UTF_8));
    // Read the file character by character
    int c;
    while ((c = br.read()) != -1) {
        // Print the character on the console
        System.out.println(Character.toString((char) c));
    }
    // Close the input stream
    br.close();
} catch (Exception e) { // Catch exception if any
    System.err.println("Error: " + e.getMessage());
}

java Regex for matching in file

I want to find warnings defined by a regex pattern in a log file (yes, a TeX log file), and also find a pattern in a TeX file which signifies that it is a main file. To that end, I read the file line by line and match the pattern. This works fine as long as the pattern spans a single line.
// may throw FileNotFoundException < IOException
FileReader fileReader = new FileReader(file);
// BufferedReader for performance
BufferedReader bufferedReader = new BufferedReader(fileReader);
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
// readLine may throw IOException
for (String line = bufferedReader.readLine();
     line != null;
     // readLine may throw IOException
     line = bufferedReader.readLine()) {
    if (pattern.matcher(line).find()) {
        return true;
    }
}
return false;
If the pattern spreads over several lines, this approach becomes difficult. I tried
CharBuffer chars = CharBuffer.allocate(1000);
// may throw IOException
int numRead = bufferedReader.read(chars);
System.out.println("file: " + file);
System.out.println("numRead: " + numRead);
System.out.println("chars: '" + chars + "'");
return pattern.matcher(chars).find();
but this did not work: no matching at all!!
numRead yields 1000 whereas chars seems to be ''!!!!
Example: pattern:
\A(\RequirePackage\s*([(\s|\w|,)])?\s{\w+}\s*([(\d|.)+])?|
\PassOptionsToPackage\s*{\w+}\s*{\w+}|
%.$|
\input{[^{}]}|
\s)*
\(documentstyle|documentclass)
is my pattern for the latex main file.
One such file is attached in part:
\RequirePackage[l2tabu, orthodox]{nag}
\documentclass[10pt, a4paper]{article}
\usepackage[T1]{fontenc}
\usepackage{fancyvrb}
\title{The dvi-format and the program dvitype}
\author{Ernst Reissner (rei3ner@arcor.de)}
\begin{document}
\maketitle
\tableofcontents
\section{Introduction}
This document describes the dvi file format
traditionally used by \LaTeX{}
and still in use with \texttt{htlatex} and that like.
How to resolve that problem?
If you need multi-line matching and the log file is not too large, you can read the whole file into one string:
String content = new Scanner(file).useDelimiter("\\Z").next();
and then run the regex against content.
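A sketch of that approach, with a deliberately simplified pattern and a placeholder file name standing in for the real ones:
import java.io.*;
import java.util.Scanner;
import java.util.regex.Pattern;

public class MultiLineMatch {
    public static void main(String[] args) throws IOException {
        // Simplified placeholder for the real main-file pattern.
        Pattern pattern = Pattern.compile("\\\\documentclass|\\\\documentstyle",
                Pattern.MULTILINE);
        try (Scanner scanner = new Scanner(new File("main.tex"), "UTF-8")) {
            // \Z only matches at the end of the input, so next() returns the whole file.
            String content = scanner.useDelimiter("\\Z").next();
            System.out.println(pattern.matcher(content).find());
        }
    }
}
As an aside, the CharBuffer attempt above most likely fails because the buffer is never flipped after the read: matching and toString() operate on the region between the buffer's position (which the read left at the end of the data) and its limit, so the buffer appears empty. Calling chars.flip() before matching should make that approach work as well.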

Finding CRLF in string

I have a file and I want to read in the string on each line. If the line does not end in CRLF (\r\n), I want to print something. I made this file by redirecting output from print commands similar to the following.
System.out.println("Test\r\n");
But when I read this line in from the file using buffered reader, it doesn't seem like it catches the CRLF.
I use the following to detect the CRLF (where inputline is the line that has been read in).
if(inputline.indexOf("\r\n")<0)
It never detects the \r\n. How can I remedy this? Is this an issue with buffered reader?
readLine
public String readLine() throws IOException
Read a line of text. A line is considered to be terminated by any one of a line feed ('\n'), a carriage return ('\r'), or a carriage return followed immediately by a linefeed.
Returns:
A String containing the contents of the line, not including any line-termination characters, or null if the end of the stream has been reached
from http://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html
Thus you may need to write some of your own code (or take this, borrowed from http://www.coderanch.com/t/276442//java/Reading-file-byte-array)
private byte[] toByteArray(File file) throws FileNotFoundException, IOException {
    int length = (int) file.length();
    byte[] array = new byte[length];
    InputStream in = new FileInputStream(file);
    int offset = 0;
    while (offset < length) {
        offset += in.read(array, offset, (length - offset));
    }
    in.close();
    return array;
}
This will give you all the bytes - nothing stripped. Knock yourself out looking for \r\n...
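For example, a sketch that walks the raw bytes returned by the toByteArray helper above and reports any line that ends in a bare \n rather than \r\n (the printed offsets are just for illustration):
private void reportBareLineFeeds(File file) throws IOException {
    byte[] bytes = toByteArray(file); // the helper defined above
    int lineStart = 0;
    for (int i = 0; i < bytes.length; i++) {
        if (bytes[i] == '\n') {
            // A proper CRLF ending has a \r immediately before the \n.
            if (i == 0 || bytes[i - 1] != '\r') {
                System.out.println("Line starting at byte offset " + lineStart
                        + " ends in \\n only");
            }
            lineStart = i + 1;
        }
    }
}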
You can use java.util.Scanner which knows how to find lines in a file (or text)
Scanner sc = new Scanner(new File("filename"));
while (sc.hasNextLine()) {
    System.out.println(sc.nextLine());
}

Remove escape characters from String loaded from file

I am using the method below to load a string from a file into a variable.
private static String readFile(String path) throws IOException {
    FileInputStream stream = new FileInputStream(new File(path));
    try {
        FileChannel fc = stream.getChannel();
        MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        /* Instead of using default, pass in a decoder. */
        return Charset.defaultCharset().decode(bb).toString();
    } finally {
        stream.close();
    }
}
The problem is that my variable has escape characters in it. I want my variable to contain:
some string
but instead it looks like:
some string&#xd
How can I improve my method to avoid that?
You can use a Reader instead, and a BufferedReader in particular, to read lines from a text file:
BufferedReader br = new BufferedReader(new FileReader(path));
String line = br.readLine(); // this strips line termination characters for you
If you want to read the whole file, there are lots of utility classes that provide this functionality (like Google Guava):
String contents = Files.toString(new File(path), charset);
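A sketch along those lines, written as a variant of the readFile method above and assuming the stray character is a carriage return (&#xd is the XML character reference for \r): read the file line by line, which strips the terminators, and rejoin the lines with \n only (the usual java.io and java.nio.charset imports are assumed):
private static String readFile(String path) throws IOException {
    StringBuilder sb = new StringBuilder();
    try (BufferedReader br = new BufferedReader(new InputStreamReader(
            new FileInputStream(path), StandardCharsets.UTF_8))) {
        String line;
        while ((line = br.readLine()) != null) {
            if (sb.length() > 0) {
                sb.append('\n'); // rejoin with \n only, so no \r survives
            }
            sb.append(line);
        }
    }
    return sb.toString();
}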
I would think that there are some hidden characters in your .txt file.
You could try:
return Charset.defaultCharset()
        .newDecoder()
        .onMalformedInput(CodingErrorAction.IGNORE)
        .onUnmappableCharacter(CodingErrorAction.IGNORE)
        .decode(bb)
        .toString();

Check line for unprintable characters while reading text file

My program must read text files line by line. The files are in UTF-8. I am not sure that the files are correct - they may contain unprintable characters. Is it possible to check for this without going down to the byte level? Thanks.
Open the file with a FileInputStream, then use an InputStreamReader with the UTF-8 Charset to read characters from the stream, and use a BufferedReader to read lines, e.g. via BufferedReader#readLine, which will give you a string. Once you have the string, you can check for characters that aren't what you consider to be printable.
E.g. (without error checking), using try-with-resources (available in Java 7 and later):
String line;
try (
    InputStream fis = new FileInputStream("the_file_name");
    InputStreamReader isr = new InputStreamReader(fis, Charset.forName("UTF-8"));
    BufferedReader br = new BufferedReader(isr);
) {
    while ((line = br.readLine()) != null) {
        // Deal with the line
    }
}
While it's not hard to do this manually using BufferedReader and InputStreamReader, I'd use Guava:
List<String> lines = Files.readLines(file, Charsets.UTF_8);
You can then do whatever you like with those lines.
EDIT: Note that this will read the whole file into memory in one go. In most cases that's actually fine - and it's certainly simpler than reading it line by line, processing each line as you read it. If it's an enormous file, you may need to do it that way as per T.J. Crowder's answer.
Just found out that with the Java NIO (java.nio.file.*) you can easily write:
List<String> lines = Files.readAllLines(Paths.get("/tmp/test.csv"), StandardCharsets.UTF_8);
for (String line : lines) {
    System.out.println(line);
}
instead of dealing with FileInputStreams and BufferedReaders...
If you want to check a string has unprintable characters you can use a regular expression
[^\p{Print}]
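For instance, a sketch of that check. Note that \p{Print} is an ASCII-only POSIX class by default; the Pattern.UNICODE_CHARACTER_CLASS flag used here is an assumption about what should count as printable and extends the class to non-ASCII text:
import java.util.regex.Pattern;

public class PrintableCheck {
    // Matches any character outside the printable class.
    private static final Pattern UNPRINTABLE =
            Pattern.compile("[^\\p{Print}]", Pattern.UNICODE_CHARACTER_CLASS);

    public static boolean hasUnprintable(String line) {
        return UNPRINTABLE.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(hasUnprintable("hello, world!"));    // false
        System.out.println(hasUnprintable("hello\u0000world")); // true
    }
}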
How about the following:
FileReader fileReader = new FileReader(new File("test.txt"));
BufferedReader br = new BufferedReader(fileReader);
String line = null;
// readLine() returns null when there are no more lines
while ((line = br.readLine()) != null) {
    // reading lines until the end of the file
}
Source: http://devmain.blogspot.co.uk/2013/10/java-quick-way-to-read-or-write-to-file.html
I can find the following ways to do it.
private static final String fileName = "C:/Input.txt";

public static void main(String[] args) throws IOException {
    Stream<String> lines = Files.lines(Paths.get(fileName));
    lines.toArray(String[]::new);

    List<String> readAllLines = Files.readAllLines(Paths.get(fileName));
    readAllLines.forEach(s -> System.out.println(s));

    File file = new File(fileName);
    Scanner scanner = new Scanner(file);
    while (scanner.hasNext()) {
        System.out.println(scanner.next());
    }
}
The answer by @T.J. Crowder is Java 6 - in Java 7 the valid answer is the one by @McIntosh - though its use of Charset.forName for UTF-8 (rather than StandardCharsets.UTF_8) is discouraged:
List<String> lines = Files.readAllLines(Paths.get("/tmp/test.csv"), StandardCharsets.UTF_8);
for (String line : lines) { /* DO */ }
Reminds a lot of the Guava way posted by Skeet above - and of course the same caveats apply. That is, for big files (Java 7):
BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
for (String line = reader.readLine(); line != null; line = reader.readLine()) {}
If every char in the file is properly encoded in UTF-8, you won't have any problem reading it using a reader with the UTF-8 encoding. Up to you to check every char of the file and see if you consider it printable or not.
