Check line for unprintable characters while reading text file

Check line for unprintable characters while reading text file - java

My program must read text files - line by line.
Files in UTF-8.
I am not sure that files are correct - can contain unprintable characters.
Is possible check for it without going to byte level?
Thanks.

Open the file with a FileInputStream, then use an InputStreamReader with the UTF-8 Charset to read characters from the stream, and use a BufferedReader to read lines, e.g. via BufferedReader#readLine, which will give you a string. Once you have the string, you can check for characters that aren't what you consider to be printable.
E.g. (without error checking), using try-with-resources (which is in vaguely modern Java version):
String line;
try (
InputStream fis = new FileInputStream("the_file_name");
InputStreamReader isr = new InputStreamReader(fis, Charset.forName("UTF-8"));
BufferedReader br = new BufferedReader(isr);
) {
while ((line = br.readLine()) != null) {
// Deal with the line
}
}

While it's not hard to do this manually using BufferedReader and InputStreamReader, I'd use Guava:
List<String> lines = Files.readLines(file, Charsets.UTF_8);
You can then do whatever you like with those lines.
EDIT: Note that this will read the whole file into memory in one go. In most cases that's actually fine - and it's certainly simpler than reading it line by line, processing each line as you read it. If it's an enormous file, you may need to do it that way as per T.J. Crowder's answer.

Just found out that with the Java NIO (java.nio.file.*) you can easily write:
List<String> lines=Files.readAllLines(Paths.get("/tmp/test.csv"), StandardCharsets.UTF_8);
for(String line:lines){
System.out.println(line);
}
instead of dealing with FileInputStreams and BufferedReaders...

If you want to check a string has unprintable characters you can use a regular expression
[^\p{Print}]

How about below:
FileReader fileReader = new FileReader(new File("test.txt"));
BufferedReader br = new BufferedReader(fileReader);
String line = null;
// if no more lines the readLine() returns null
while ((line = br.readLine()) != null) {
// reading lines until the end of the file
}
Source: http://devmain.blogspot.co.uk/2013/10/java-quick-way-to-read-or-write-to-file.html

I can find following ways to do.
private static final String fileName = "C:/Input.txt";
public static void main(String[] args) throws IOException {
Stream<String> lines = Files.lines(Paths.get(fileName));
lines.toArray(String[]::new);
List<String> readAllLines = Files.readAllLines(Paths.get(fileName));
readAllLines.forEach(s -> System.out.println(s));
File file = new File(fileName);
Scanner scanner = new Scanner(file);
while (scanner.hasNext()) {
System.out.println(scanner.next());
}

The answer by #T.J.Crowder is Java 6 - in java 7 the valid answer is the one by #McIntosh - though its use of Charset for name for UTF -8 is discouraged:
List<String> lines = Files.readAllLines(Paths.get("/tmp/test.csv"),
StandardCharsets.UTF_8);
for(String line: lines){ /* DO */ }
Reminds a lot of the Guava way posted by Skeet above - and of course same caveats apply. That is, for big files (Java 7):
BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
for (String line = reader.readLine(); line != null; line = reader.readLine()) {}

If every char in the file is properly encoded in UTF-8, you won't have any problem reading it using a reader with the UTF-8 encoding. Up to you to check every char of the file and see if you consider it printable or not.

Related

How to split a byte array that contains multiple "lines" in Java?

Say we have a file like so:
one
two
three
(but this file got encrypted)
My crypto method returns the whole file in memory, as a byte[] type.
I know byte arrays don't have a concept of "lines", that's something a Scanner (for example) could have.
I would like to traverse each line, convert it to string and perform my operation on it but I don't know
how to:
Find lines in a byte array
Slice the original byte array to "lines" (I would convert those slices to String, to send to my other methods)
Correctly traverse a byte array, where each iteration is a new "line"
Also: do I need to consider the different OS the file might have been composed in? I know that there is some difference between new lines in Windows and Linux and I don't want my method to work only with one format.
Edit: Following some tips from answers here, I was able to write some code that gets the job done. I still wonder if this code is worthy of keeping or I am doing something that can fail in the future:
byte[] decryptedBytes = doMyCrypto(fileName, accessKey);
ByteArrayInputStream byteArrInStrm = new ByteArrayInputStream(decryptedBytes);
InputStreamReader inStrmReader = new InputStreamReader(byteArrInStrm);
BufferedReader buffReader = new BufferedReader(inStrmReader);
String delimRegex = ",";
String line;
String[] values = null;
while ((line = buffReader.readLine()) != null) {
values = line.split(delimRegex);
if (Objects.equals(values[0], tableKey)) {
return values;
}
}
System.out.println(String.format("No entry with key %s in %s", tableKey, fileName));
return values;
In particular, I was advised to explicitly set the encoding but I was unable to see exactly where?

If you want to stream this, I'd suggest:
Create a ByteArrayInputStream to wrap your array
Wrap that in an InputStreamReader to convert binary data to text - I suggest you explicitly specify the text encoding being used
Create a BufferedReader around that to read a line at a time
Then you can just use:
String line;
while ((line = bufferedReader.readLine()) != null)
{
// Do something with the line
}
BufferedReader handles line breaks from all operating systems.
So something like this:
byte[] data = ...;
ByteArrayInputStream stream = new ByteArrayInputStream(data);
InputStreamReader streamReader = new InputStreamReader(stream, StandardCharsets.UTF_8);
BufferedReader bufferedReader = new BufferedReader(streamReader);
String line;
while ((line = bufferedReader.readLine()) != null)
{
System.out.println(line);
}
Note that in general you'd want to use try-with-resources blocks for the streams and readers - but it doesn't matter in this case, because it's just in memory.

As Scott states i would like to see what you came up with so we can help you alter it to fit your needs.
Regarding your last comment about the OS; if you want to support multiple file types you should consider making several functions that support those different file extensions. As far as i know you do need to specify which file and what type of file you are reading with your code.

java Regex for matching in file

I want to find warnings defined by a regex-pattern in a log file
(yes tex-log file)
and also find pattern in a tex file which signifies
that it is a main file.
To that end, I read the file linewise and match the pattern.
This works fine as long as the pattern is one line only.
// may throw FileNotFoundException < IOExcption
FileReader fileReader = new FileReader(file);
// BufferedReader for perfromance
BufferedReader bufferedReader = new BufferedReader(fileReader);
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);//
// readLine may throw IOException
for (String line = bufferedReader.readLine();
line != null;
// readLine may thr. IOException
line = bufferedReader.readLine()) {
if (pattern.matcher(line).find()) {
return true;
}
}
return false;
If it spreads over lines, this approach becomes difficult.
I tried
CharBuffer chars = CharBuffer.allocate(1000);
// may throw IOException
int numRead = bufferedReader.read(chars);
System.out.println("file: "+file);
System.out.println("numRead: "+numRead);
System.out.println("chars: '"+chars+"'");
return pattern.matcher(chars).find();
but this did not work: no matching at all!!
numRead yields 1000 whereas chars seems to be ''!!!!
Example: pattern:
\A(\RequirePackage\s*([(\s|\w|,)])?\s{\w+}\s*([(\d|.)+])?|
\PassOptionsToPackage\s*{\w+}\s*{\w+}|
%.$|
\input{[^{}]}|
\s)*
\(documentstyle|documentclass)
is my pattern for the latex main file.
One such file is attached in part:
\RequirePackage[l2tabu, orthodox]{nag}
\documentclass[10pt, a4paper]{article}
\usepackage[T1]{fontenc}
\usepackage{fancyvrb}
\title{The dvi-format and the program dvitype}
\author{Ernst Reissner (rei3ner#arcor.de)}
\begin{document}
\maketitle
\tableofcontents
\section{Introduction}
This document describes the dvi file format
traditionally used by \LaTeX{}
and still in use with \texttt{htlatex} and that like.
How to resolve that problem?

If you need multi-line matching and the log file is not too large, you can read the whole file in one string:
String content = new Scanner(file).useDelimiter("\\Z").next();
and then run the regex against content.

Regarding word search error (Encoding error Java)

I have a list of french words where I am trying to search in my database. The words are "thé Mariage frères", "thé Lipton" etc.
While I am reading my file in java, it shows the words as "thÃ© Lipton", "thÃ© Mariage frÃ¨res". It fails to get the correct words.
I don't know how to correct my errors.
Help me, please!!!

You file is in one encoding (maybe latin1/iso-8859-1) and you're reading your file in another encoding.
See if this port helps How to read a file in Java with specific character encoding?

Try this.
try (FileInputStream fis = new FileInputStream("input.txt");
InputStreamReader isr = new InputStreamReader(fis, StandardCharsets.UTF_8);
BufferedReader reader = new BufferedReader(isr)) {
String line;
while ((line = reader.readLine()) != null)
System.out.println(line);
}

Try creating Scanner object like this
Scanner s = new Scanner(new File("French_Tea_keywords/filter_keywords.txt"), "UTF8");

java read properties and xml file using stringbuilder

I need to read a set of xml and property files and parse the data. Currently I am using inputstream ans string builder to do this. But this does not create the file in the same way as input file is. I donot want to remove the white spaces and new lines. How do i achieve this.
is = test.getInputStream();
br = new BufferedReader(new InputStreamReader(is));
String line5;
StringBuilder sb5 = new StringBuilder();
while ((line5 = br.readLine()) != null) {
sb5.append(line5);
}
String s = sb5.toString();
My output is:
#test 123 #test2 345
Expected output is:
#test
123
#test2
345
Any thoughts ? Thanks

br.readLine() consumes the line breaks, you need to add them to your StringBuilder after appending the line.
is = test.getInputStream();
br = new BufferedReader(new InputStreamReader(is));
String line5;
StringBuilder sb5 = new StringBuilder();
while ((line5 = br.readLine()) != null) {
sb5.append(line5);
sb5.append("\n");
}
If you want an extremely simple solution for reading a file to a String, Apache Commons-IO has a method for performing such a task (org.apache.commons.io.FileUtils).
FileUtils.readFileToString(File file, String encoding);

readLine() method doesn't add the EOL character (\n). So while appending the string to the builder, you need to add the EOL char, like sb5.append(line5+"\n");

The various readLine methods discard the newline from the input.
From the BufferedReader docs:
Returns: A String containing the contents of the line, not including any line-termination characters, or null if the end of the stream has been reached
A solution may be as simple as adding back a newline to your StringBuilder for every readLine: sb5.append(line5 + "\n");.
A better alternative is to read into an intermediate buffer first, using the read method, supplying your own char[]. You can still use StringBuilder.append, and get a String will match the file contents.

Java: Read a txt file and save it in an array of strings but without backslashes (spaces)?

I am using this code to read a txt file, line by line.
// Open the file that is the first command line parameter
FileInputStream fstream = new FileInputStream("/Users/dimitramicha/Desktop/SweetHome3D1.txt");
// Get the object of DataInputStream
DataInputStream in = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String strLine;
// Read File Line By Line
int i = 0;
while ((strLine = br.readLine()) != null) {
str[i] = strLine;
i++;
}
// Close the input stream
in.close();
and I save it in an array.
Afterwards, I would like to make an if statement about the Strings that I saved in the array. But when I do that it doesn't work, because (as I've thought) it saves also the spaces (backslashes). Do you have any idea how I can save the data in the array but without spaces?

I would do:
strLineWithoutSpaces = strLine.replace(' ', '');
str[i] = strLineWithoutSpaces;
You can also do more replaces if you find other characters that you don't want.

Have a look at the replace method in String and call it on strLine before putting it in the array.

You can use a Scanner which by default uses white space to separate tokens. Have a look at this tutorial.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Check line for unprintable characters while reading text file - java

My program must read text files - line by line. Files in UTF-8. I am not sure that files are correct - can contain unprintable characters. Is possible check for it without going to byte level? Thanks.

Just found out that with the Java NIO (java.nio.file.*) you can easily write: List<String> lines=Files.readAllLines(Paths.get("/tmp/test.csv"), StandardCharsets.UTF_8); for(String line:lines){ System.out.println(line); } instead of dealing with FileInputStreams and BufferedReaders...

If you want to check a string has unprintable characters you can use a regular expression [^\p{Print}]

If every char in the file is properly encoded in UTF-8, you won't have any problem reading it using a reader with the UTF-8 encoding. Up to you to check every char of the file and see if you consider it printable or not.

Related

How to split a byte array that contains multiple "lines" in Java?

java Regex for matching in file

Regarding word search error (Encoding error Java)

java read properties and xml file using stringbuilder

Java: Read a txt file and save it in an array of strings but without backslashes (spaces)?

Categories

Resources