java Regex for matching in file

I want to find warnings defined by a regex pattern in a log file
(yes, a TeX log file)
and also find a pattern in a TeX file which signifies
that it is a main file.
To that end, I read the file line by line and match the pattern against each line.
This works fine as long as the pattern spans one line only.
// may throw FileNotFoundException (a subclass of IOException)
FileReader fileReader = new FileReader(file);
// BufferedReader for performance
BufferedReader bufferedReader = new BufferedReader(fileReader);
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
// readLine may throw IOException
for (String line = bufferedReader.readLine();
line != null;
// readLine may throw IOException
line = bufferedReader.readLine()) {
if (pattern.matcher(line).find()) {
return true;
}
}
return false;
If the pattern spans several lines, this approach becomes difficult.
I tried
CharBuffer chars = CharBuffer.allocate(1000);
// may throw IOException
int numRead = bufferedReader.read(chars);
System.out.println("file: "+file);
System.out.println("numRead: "+numRead);
System.out.println("chars: '"+chars+"'");
return pattern.matcher(chars).find();
but this did not work: nothing matches at all.
numRead yields 1000, whereas chars seems to be empty ('').
Example: pattern:
\A(\RequirePackage\s*([(\s|\w|,)])?\s{\w+}\s*([(\d|.)+])?|
\PassOptionsToPackage\s*{\w+}\s*{\w+}|
%.$|
\input{[^{}]}|
\s)*
\(documentstyle|documentclass)
is my pattern for the latex main file.
One such file is attached in part:
\RequirePackage[l2tabu, orthodox]{nag}
\documentclass[10pt, a4paper]{article}
\usepackage[T1]{fontenc}
\usepackage{fancyvrb}
\title{The dvi-format and the program dvitype}
\author{Ernst Reissner (rei3ner#arcor.de)}
\begin{document}
\maketitle
\tableofcontents
\section{Introduction}
This document describes the dvi file format
traditionally used by \LaTeX{}
and still in use with \texttt{htlatex} and that like.
How to resolve that problem?

If you need multi-line matching and the log file is not too large, you can read the whole file in one string:
String content = new Scanner(file).useDelimiter("\\Z").next();
and then run the regex against content.
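A minimal sketch of both routes, assuming the file fits in memory and is UTF-8 encoded; the class and method names (MultiLineMatch, contains, fileContains, prefixContains) are hypothetical. The second method revisits the CharBuffer attempt from the question: after read(chars) the buffer's position sits at the end of the data, so the buffer looks empty until flip() is called.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

class MultiLineMatch {
    // Runs the regex against the whole text, so a match may span line breaks.
    static boolean contains(CharSequence content, String regex) {
        return Pattern.compile(regex, Pattern.MULTILINE).matcher(content).find();
    }

    // Reads the complete file into one string (assumption: UTF-8, moderate size).
    static boolean fileContains(Path file, String regex) throws IOException {
        String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return contains(content, regex);
    }

    // Reads at most maxChars characters into a CharBuffer and matches on them.
    static boolean prefixContains(Reader reader, String regex, int maxChars)
            throws IOException {
        CharBuffer chars = CharBuffer.allocate(maxChars);
        // keep reading until the buffer is full or the stream ends
        while (chars.hasRemaining() && reader.read(chars) != -1) {
            // nothing to do: read() advances the buffer position
        }
        // flip() sets limit = position and position = 0, exposing what was read
        chars.flip();
        return contains(chars, regex);
    }
}
```

prefixContains only sees the first maxChars characters, which may be enough for a pattern anchored at the start of a TeX main file; for a log file, fileContains is the simpler choice.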

Related

What am I missing? NumberFormatException error

I want to read from a txt file which contains just numbers. The file is in UTF-8, and the numbers are separated only by newlines (no spaces or anything else). Whenever I call Integer.valueOf(myString), I get the exception.
This exception is really strange, because if I create a predefined string, such as "56\n", and use .trim(), it works perfectly. In my code, however, that is not the case, and the exception text says that what it couldn't convert was "54856". I have tried to introduce a new line there, and then the error text says it couldn't convert "54856
"
With that out of the question, what am I missing?
File ficheroEntrada = new File("C:\\in.txt");
FileReader entrada =new FileReader(ficheroEntrada);
BufferedReader input = new BufferedReader(entrada);
String s = input.readLine();
System.out.println(s);
Integer in;
in = Integer.valueOf(s.trim());
System.out.println(in);
The exception text reads as follows:
Exception in thread "main" java.lang.NumberFormatException: For input string: "54856"
at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:68)
at java.base/java.lang.Integer.parseInt(Integer.java:658)
at java.base/java.lang.Integer.valueOf(Integer.java:989)
at Quicksort.main(Quicksort.java:170)
The file in.txt consists of:
54856
896
54
53
2
5634

Well, apparently it had to do with Windows and the \r that it uses... I just tried executing it on a Linux VM and it worked. Thanks to everyone who answered!

Try reading the file with the Scanner class and use its hasNextInt() method to check whether the token you are reading is an integer. This will help you find out which string or character is causing the issue:
public static void main(String[] args) throws Exception {
File ficheroEntrada = new File(
"C:\\in.txt");
Scanner scan = new Scanner(ficheroEntrada);
while (scan.hasNext()) {
if (scan.hasNextInt()) {
System.out.println("found integer" + scan.nextInt());
} else {
System.out.println("not integer" + scan.next());
}
}
}
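To see which character is actually in the offending string, it can help to print its code points. The sketch below is only a diagnostic aid; the class and method names (InvisibleChars, describe) are hypothetical. One possible culprit, which is an assumption and not confirmed by the question, is a UTF-8 byte order mark (U+FEFF) at the start of the file: it is invisible when printed and trim() does not remove it, because trim() only strips characters up to U+0020.

```java
class InvisibleChars {
    // Renders every code point of s as U+XXXX so invisible characters show up.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }
}
```

Calling describe(s) on the line returned by readLine() before passing it to Integer.valueOf would show whether anything other than the digits U+0030 to U+0039 is present.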

If you want to ensure that a string can be parsed, you could use a Pattern and match it with a regex first.
Pattern intPattern = Pattern.compile("\\-?\\d+");
Matcher matcher = intPattern.matcher(input);
if (matcher.find()) {
int value = Integer.parseInt(matcher.group(0));
// ... do something with the result.
} else {
// ... handle unparsable line.
}
This pattern accepts any run of digits, optionally preceded by a minus sign (with no whitespace in between). The matched text should definitely parse unless it is too long: Integer.parseInt throws a NumberFormatException for values outside the int range. Your example seems to contain mostly short integers, so this should not matter.

Most probably you have leading or trailing whitespace in your input, something like:
String s = " 5436";
System.out.println(s);
Integer in;
in = Integer.valueOf(s.trim());
System.out.println(in);
Use trim() on string to get rid of it.
UPDATE 2:
If your file contains something like:
54856\n
896
54\n
53
2\n
5634
then use following code for it:
....your code
FileReader enter = new FileReader(file);
BufferedReader input = new BufferedReader(enter);
String currentLine;
while ((currentLine = input.readLine()) != null) {
Integer in;
// strip every non-digit character (note: this also removes a minus sign)
in = Integer.valueOf(currentLine.replaceAll("\\D+",""));
System.out.println(in);
}
...your code

Replace 2nd Occurrence of Word in Text File

I have a sentence in my text file,
Moreover, human serum could significantly enhance the LPS-induced DV suppression in a CD14-dependent manner, indicating that the "binding" of LPS to CD14 was critical for the induction of virus inhibition.
How do I replace the 2nd occurrence of CD14 to AB45 and write back to the text file?

For the algorithm itself,
file.indexOf("CD14", file.indexOf("CD14")+4)
can be used to locate the occurrence (given that "file" is a string with all of the contents of your file). The second argument of "indexOf" is the position at which the search starts. By calling indexOf twice, you find the first instance of the string and then look for another instance past the first one (+4 because indexOf returns the start of the match, so adding the length of the string skips over it). To replace the string,
int i = file.indexOf("CD14", file.indexOf("CD14")+4);
String s = file;
if(i != -1) s = file.substring(0,i) + "AB45" + file.substring(Math.min(i+4,file.length()), file.length());
If you're asking how to read/write a text file, a web search will turn up plenty of examples, such as "Java: How to read a text file" or http://www.javapractices.com/topic/TopicAction.do?Id=42
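The same idea can be wrapped in a small helper that does not hard-code the length 4. This is a sketch with hypothetical names (SecondOccurrence, replaceSecond), assuming the target string is non-empty and the whole file content is already in a String.

```java
class SecondOccurrence {
    // Replaces only the second occurrence of target in text.
    // Returns text unchanged if target occurs fewer than two times.
    static String replaceSecond(String text, String target, String replacement) {
        int first = text.indexOf(target);
        if (first < 0) {
            return text;
        }
        // start the second search just past the end of the first match
        int second = text.indexOf(target, first + target.length());
        if (second < 0) {
            return text;
        }
        return text.substring(0, second)
                + replacement
                + text.substring(second + target.length());
    }
}
```

For the sentence in the question, replaceSecond(sentence, "CD14", "AB45") would leave "CD14-dependent" alone and change "LPS to CD14" into "LPS to AB45".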

There are several approaches to take in solving this one. A very simple but verbose approach would be:
public static void replaceSecondOccurence(String originalText, String replacementText) throws IOException {
File file = new File("file.txt");
InputStreamReader reader = new InputStreamReader(new FileInputStream(file));
StringBuilder fileContent = new StringBuilder();
int content;
while ((content = reader.read()) != -1) {
fileContent.append((char) content);
}
reader.close();
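// locate the second occurrence by searching again just past the first one
// (lastIndexOf would only be correct if there were exactly two occurrences)
int first = fileContent.indexOf(originalText);
int index = first < 0 ? -1 : fileContent.indexOf(originalText, first + originalText.length());
if (index >= 0) {
fileContent.replace(index, index + originalText.length(), replacementText);
}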
OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(file));
writer.write(fileContent.toString());
writer.close();
}

Newline character different than system newline

My Question: How do I force an input stream to process line separators as the system standard line separator?
I read a file to a string and the newlines get converted to \n but my System.getProperty("line.separator"); is \r\n. I want this to be portable, so I want my file reader to read the newlines as the system standard newline character (whatever that may be). How can I force it? Here are my methods from the Java Helper Library to read the file in as a string.
/**
* Takes the file and returns it in a string. Uses UTF-8 encoding
*
* @param fileLocation
* @return the file in String form
* @throws IOException when trying to read from the file
*/
public static String fileToString(String fileLocation) throws IOException {
InputStreamReader streamReader = new InputStreamReader(new FileInputStream(fileLocation), "UTF-8");
return readerToString(streamReader);
}
/**
* Returns all the lines in the Reader's stream as a String
*
* @param reader
* @return
* @throws IOException when trying to read from the file
*/
public static String readerToString(Reader reader) throws IOException {
StringWriter stringWriter = new StringWriter();
char[] buffer = new char[1024];
int length;
while ((length = reader.read(buffer)) > 0) {
stringWriter.write(buffer, 0, length);
}
reader.close();
stringWriter.close();
return stringWriter.toString();
}

Your readerToString method doesn't do anything to line endings. It simply copies character data - that's all. It's entirely unclear how you're diagnosing the problem, but that code really doesn't change \n to \r\n. It must be \r\n in the file - which you should look at in a hex editor. What created the file in the first place? You should look there for how any line breaks are represented.
If you want to read lines, use BufferedReader.readLine() which will cope with \r, \n or \r\n.
Note that Guava has a lot of helpful methods for reading all the data from readers, as well as splitting a reader into lines etc.

It's advisable to use BufferedReader for reading a file line by line in a portable way; you can then write each line to the required output using the line separator of your choice.

With the Scanner#useDelimiter method you can specify what delimiter to use when reading from a File or InputStream or whatever.

You can use a BufferedReader to read the file line by line and convert the line separators, e.g.:
public static String readerToString(Reader reader) throws IOException {
BufferedReader bufReader = new BufferedReader(reader);
StringBuffer stringBuf = new StringBuffer();
String separator = System.getProperty("line.separator");
String line = null;
while ((line = bufReader.readLine()) != null) {
stringBuf.append(line).append(separator);
}
bufReader.close();
return stringBuf.toString();
}
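If the text is already in memory as a String, the conversion can also be done with a regex instead of re-reading line by line. The sketch below assumes Java 8 or later, where the \R construct matches any line break (\r\n as one unit, as well as a lone \n or \r); the class and method names (LineEndings, normalize) are hypothetical.

```java
import java.util.regex.Matcher;

class LineEndings {
    // Replaces every line break in text with the given separator.
    static String normalize(String text, String separator) {
        // quoteReplacement guards against '\' or '$' in the separator
        return text.replaceAll("\\R", Matcher.quoteReplacement(separator));
    }

    // Convenience variant using the platform's own separator.
    static String toSystemLineEndings(String text) {
        return normalize(text, System.lineSeparator());
    }
}
```

Unlike the readLine loop above, this does not append a separator after the last line when the input has none.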

Check line for unprintable characters while reading text file

My program must read text files line by line.
The files are in UTF-8.
I am not sure that the files are correct: they may contain unprintable characters.
Is it possible to check for this without going down to byte level?
Thanks.

Open the file with a FileInputStream, then use an InputStreamReader with the UTF-8 Charset to read characters from the stream, and use a BufferedReader to read lines, e.g. via BufferedReader#readLine, which will give you a string. Once you have the string, you can check for characters that aren't what you consider to be printable.
E.g. (without error checking), using try-with-resources (which is available in vaguely modern Java versions):
String line;
try (
InputStream fis = new FileInputStream("the_file_name");
InputStreamReader isr = new InputStreamReader(fis, Charset.forName("UTF-8"));
BufferedReader br = new BufferedReader(isr);
) {
while ((line = br.readLine()) != null) {
// Deal with the line
}
}

While it's not hard to do this manually using BufferedReader and InputStreamReader, I'd use Guava:
List<String> lines = Files.readLines(file, Charsets.UTF_8);
You can then do whatever you like with those lines.
EDIT: Note that this will read the whole file into memory in one go. In most cases that's actually fine - and it's certainly simpler than reading it line by line, processing each line as you read it. If it's an enormous file, you may need to do it that way as per T.J. Crowder's answer.

Just found out that with the Java NIO (java.nio.file.*) you can easily write:
List<String> lines=Files.readAllLines(Paths.get("/tmp/test.csv"), StandardCharsets.UTF_8);
for(String line:lines){
System.out.println(line);
}
instead of dealing with FileInputStreams and BufferedReaders...

If you want to check a string has unprintable characters you can use a regular expression
[^\p{Print}]
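A short sketch of how that expression could be applied to each line; the class, field and method names (PrintableCheck, NON_PRINTABLE_ASCII, NON_PRINTABLE_UNICODE, hasUnprintable) are hypothetical. Note that in java.util.regex the POSIX class \p{Print} covers US-ASCII only by default, so accented letters would be reported as unprintable unless the UNICODE_CHARACTER_CLASS flag is set.

```java
import java.util.regex.Pattern;

class PrintableCheck {
    // Default behaviour: \p{Print} means printable US-ASCII (graphic chars and space).
    static final Pattern NON_PRINTABLE_ASCII = Pattern.compile("[^\\p{Print}]");

    // With this flag, \p{Print} follows the Unicode definition instead.
    static final Pattern NON_PRINTABLE_UNICODE =
            Pattern.compile("[^\\p{Print}]", Pattern.UNICODE_CHARACTER_CLASS);

    // True if the line contains at least one character the pattern rejects.
    static boolean hasUnprintable(String line, Pattern nonPrintable) {
        return nonPrintable.matcher(line).find();
    }
}
```

With the ASCII pattern a tab also counts as unprintable, so what to accept depends on what you consider printable for your files.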

How about below:
FileReader fileReader = new FileReader(new File("test.txt"));
BufferedReader br = new BufferedReader(fileReader);
String line = null;
// if no more lines the readLine() returns null
while ((line = br.readLine()) != null) {
// reading lines until the end of the file
}
Source: http://devmain.blogspot.co.uk/2013/10/java-quick-way-to-read-or-write-to-file.html

I can find the following ways to do it.
private static final String fileName = "C:/Input.txt";
public static void main(String[] args) throws IOException {
Stream<String> lines = Files.lines(Paths.get(fileName));
lines.toArray(String[]::new);
List<String> readAllLines = Files.readAllLines(Paths.get(fileName));
readAllLines.forEach(s -> System.out.println(s));
File file = new File(fileName);
Scanner scanner = new Scanner(file);
while (scanner.hasNext()) {
System.out.println(scanner.next());
}
}

The answer by @T.J.Crowder is Java 6. In Java 7 the valid answer is the one by @McIntosh, though its use of Charset.forName for UTF-8 is discouraged:
List<String> lines = Files.readAllLines(Paths.get("/tmp/test.csv"),
StandardCharsets.UTF_8);
for(String line: lines){ /* DO */ }
Reminds a lot of the Guava way posted by Skeet above - and of course same caveats apply. That is, for big files (Java 7):
BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
for (String line = reader.readLine(); line != null; line = reader.readLine()) {}

If every char in the file is properly encoded in UTF-8, you won't have any problem reading it using a reader with the UTF-8 encoding. Up to you to check every char of the file and see if you consider it printable or not.

Read XML, Replace Text and Write to same XML file via Java

Currently I am trying something very simple. I am looking through an XML document for a certain phrase, which I then try to replace. The problem I am having is that when I read the lines I store each line in a StringBuffer. When I write it to a document, everything is written on a single line.
Here my code:
File xmlFile = new File("abc.xml");
BufferedReader br = new BufferedReader(new FileReader(xmlFile));
String line = null;
while((line = br.readLine())!= null)
{
if(line.indexOf("abc") != -1)
{
line = line.replaceAll("abc","xyz");
}
sb.append(line);
}
br.close();
BufferedWriter bw = new BufferedWriter(new FileWriter(xmlFile));
bw.write(sb.toString());
bw.close();
I am assuming I need a newline character when I perform sb.append, but unfortunately I don't know which character to use, as "\n" does not work.
Thanks in advance!
P.S. I figured there must be a way to use Xalan to format the XML file after I write to it or something. Not sure how to do that though.

readLine() returns everything between the newline characters, so when you write the lines back out, the newline characters are missing. These characters depend on the OS: Windows uses two characters for a newline, Unix uses one, for example. To be OS agnostic, retrieve the system property "line.separator":
String newline = System.getProperty("line.separator");
and append it to your stringbuffer:
sb.append(line).append(newline);

Modified as suggested by Brel, your text-substituting approach should work, and it will work well enough for simple applications.
If things start to get a little hairier, and you end up wanting to select elements based on their position in the XML structure, and if you need to be sure to change element text but not tag text (think <abc>abc</abc>), then you'll want to call in the cavalry and process the XML with an XML parser.
Essentially you read in a Document using a DocumentBuilder, you hop around the document's nodes doing whatever you need to, and then ask the Document to write itself back to file. Or do you ask the parser? Anyway, most XML parsers have a handful of options that let you format the XML output: you can specify indentation (or not) and maybe newlines for every opening tag, that kind of thing, to make your XML look pretty.
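As a rough sketch of that route using only the JDK's built-in javax.xml classes: the document is parsed with a DocumentBuilder, text nodes are rewritten, and a Transformer serializes the result. The class and method names (XmlTextReplace, replaceText, replaceInXml) are hypothetical, and the example works on strings rather than files to keep it short.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XmlTextReplace {
    // Recursively replaces oldText in text nodes only; tag names are untouched.
    static void replaceText(Node node, String oldText, String newText) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            node.setNodeValue(node.getNodeValue().replace(oldText, newText));
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            replaceText(children.item(i), oldText, newText);
        }
    }

    // Parses xml, rewrites its text nodes and serializes the document again.
    static String replaceInXml(String xml, String oldText, String newText)
            throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        replaceText(doc.getDocumentElement(), oldText, newText);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // leave out the <?xml ...?> header to keep the output compact
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

Setting OutputKeys.INDENT to "yes" on the Transformer asks it to pretty-print the output; to work on files, a File can be passed to builder.parse and to the StreamResult constructor instead of the string reader and writer.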

sb would be the StringBuffer object, which has not been instantiated in this example. This can be added before the while loop:
StringBuffer sb = new StringBuffer();

Scanner scan = new Scanner(System.in);
String filePath = scan.next();
String oldString = "old_string";
String newString = "new_string";
String oldContent = "";
BufferedReader br = null;
FileWriter writer = null;
File xmlFile = new File(filePath);
try {
br = new BufferedReader(new FileReader(xmlFile));
String line = br.readLine();
while (line != null) {
oldContent = oldContent + line + System.lineSeparator();
line = br.readLine();
}
String newContent = oldContent.replaceAll(oldString, newString);
writer = new FileWriter(xmlFile);
writer.write(newContent);
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
scan.close();
// br or writer is still null if opening the file failed
if (br != null) br.close();
if (writer != null) writer.close();
} catch (IOException e) {
e.printStackTrace();
}
}
