Result separation with regex - java

I wrote a parser that reads a file line by line and parses it with a regex statement. (Regex below)
case "countries":
pattern = "\\\"(.+?)\\\"(\\s+)?(\\((.+?)\\))?(\\s+)?(\\{(.+?)\\(\\#(.+?)\\)\\})?(\\s+)?(.+)";
substitution = "$1, $4, $7, $8, $10";
break;
This outputs a list with all the groups I want, each separated by a comma (via result.split(",")).
Now let's say I don't want to use a comma but a | or a * instead. Changing the comma to any other string doesn't seem to change anything. What am I missing?
try (CSVWriter csvWriter = new CSVWriter(new FileWriter(myLocalPath + "CSV/" + choice.toLowerCase() + ".csv")))
{
Pattern r = Pattern.compile(pattern);
while (br.readLine() != null)
{
String nextLine = br.readLine();
Matcher matcher = r.matcher(nextLine);
String result = matcher.replaceAll(substitution);
String[] line = result.split("lorem");
csvWriter.writeNext(line, false);
}
}catch(Exception e){
System.out.println(e);
System.out.println("Parsing done!");
}

It seems what you're missing is Pattern.quote. The argument of split is a regex, and characters such as | and * are regex metacharacters, so the separator has to be quoted if it must be read literally:
String[] line = result.split(Pattern.quote("..."));
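To illustrate the difference, here is a minimal sketch. The class name, the helper splitLiteral and the sample string are made up for the example; only String.split and Pattern.quote come from the answer above. Note that the separator you split on presumably also has to appear in the substitution string, otherwise there is nothing to split at.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitLiteralDemo {
    // Splits on the separator taken literally, not as a regex.
    static String[] splitLiteral(String text, String separator) {
        return text.split(Pattern.quote(separator));
    }

    public static void main(String[] args) {
        // Hypothetical result of a substitution such as "$1|$4|$7".
        String result = "Netherlands|1999|Amsterdam";
        // Unquoted, "|" is an empty alternation that matches between characters.
        System.out.println(Arrays.toString(result.split("|")));
        // Quoted, it matches only the literal pipe character.
        System.out.println(Arrays.toString(splitLiteral(result, "|")));
    }
}
```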

Related

Parsing Windows tasklist output in Java

I am trying to build an array of processes running on my machine; to do so I have been trying to use the following two commands:
tasklist /fo csv /nh # For a CSV output
tasklist /nh # For a non-CSV output
The issue that I am having is that I can not properly parse the output.
First Scenario
I have a line like:
"wininit.exe","584","Services","0","5,248 K"
Which I have attempted to parse using "".split(","), however this fails when it comes to the process memory usage: the comma in the number field will result in an extra field.
Second Scenario
With the non-CSV output, I have a line like:
wininit.exe 584 Services 0 5,248 K
Which I am attempting to parse using "".split("\\s+"); however, this one fails on a process like System Idle Process, or any other process with a space in the executable name.
How can I parse either of these outputs such that the same split index will always contain the correct data column?
To parse a string, always prefer the most strict formatting. In this case, CSV. In this way, you could process each line with a regular expression containing FIVE groups:
private final Pattern pattern = Pattern
.compile("\\\"([^\\\"]*)\\\",\\\"([^\\\"]*)\\\",\\\"([^\\\"]*)\\\",\\\"([^\\\"]*)\\\",\\\"([^\\\"]*)\\\"");
private void parseLine(String line) {
Matcher matcher = pattern.matcher(line);
if (!matcher.find()) {
throw new IllegalArgumentException("invalid format");
}
String name = matcher.group(1);
int pid = Integer.parseInt(matcher.group(2));
String sessionName = matcher.group(3);
String sessionId = matcher.group(4);
String memUsage = matcher.group(5);
System.out.println(name + ":" + pid + ":" + memUsage);
}
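For reference, a self-contained sketch of the same idea that can be run as is. The class name and the choice to return the columns as an array are my own; the sample line is the one from the question. It uses matches() rather than find(), on the assumption that a line with extra leading or trailing text should be rejected.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TasklistCsvDemo {
    // One quoted field: a double quote, anything but a quote, a double quote.
    private static final String FIELD = "\"([^\"]*)\"";
    // Five quoted fields separated by commas.
    private static final Pattern LINE = Pattern.compile(
            FIELD + "," + FIELD + "," + FIELD + "," + FIELD + "," + FIELD);

    // Returns the five column values, or throws if the line has another shape.
    static String[] parseLine(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("invalid format: " + line);
        }
        String[] columns = new String[5];
        for (int i = 0; i < 5; i++) {
            columns[i] = m.group(i + 1);
        }
        return columns;
    }

    public static void main(String[] args) {
        String[] c = parseLine("\"wininit.exe\",\"584\",\"Services\",\"0\",\"5,248 K\"");
        System.out.println(c[0] + ":" + Integer.parseInt(c[1]) + ":" + c[4]);
    }
}
```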
You should use a StringTokenizer class instead of split. You use the " delimiter and expect the delimiter to be returned. You can then use that delimiter to provide field separation. For instance,
StringTokenizer st = new StringTokenizer(input, "\"", true);
State state = NONE;
while (st.hasMoreTokens()) {
String t = st.nextToken();
switch(state) {
case NONE:
if ("\"".equals(t)) {
state = BEGIN;
}
// skip the ,
break;
case BEGIN:
// Store t in which entry it correspond to.
state = END;
break;
case END:
state = NONE;
break;
}
}
Each token will be stored within its respective data set and you can then process that information for each Process.
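A runnable sketch of that state machine might look like the following. The State enum, the class name and the parseFields helper are assumptions filled in for illustration, and the handling of an empty field ("") is an addition that the outline above does not cover.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class QuoteTokenizerDemo {
    // NONE: between fields; BEGIN: opening quote seen; END: field read, closing quote expected.
    private enum State { NONE, BEGIN, END }

    static List<String> parseFields(String input) {
        List<String> fields = new ArrayList<>();
        // The third argument makes the tokenizer return the quote characters as tokens too.
        StringTokenizer st = new StringTokenizer(input, "\"", true);
        State state = State.NONE;
        while (st.hasMoreTokens()) {
            String t = st.nextToken();
            switch (state) {
                case NONE:
                    if ("\"".equals(t)) {
                        state = State.BEGIN;
                    }
                    // Anything else here is the comma between fields; skip it.
                    break;
                case BEGIN:
                    if ("\"".equals(t)) {
                        // Closing quote right after the opening one: empty field.
                        fields.add("");
                        state = State.NONE;
                    } else {
                        fields.add(t);
                        state = State.END;
                    }
                    break;
                case END:
                    // This token is the closing quote.
                    state = State.NONE;
                    break;
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parseFields("\"wininit.exe\",\"584\",\"Services\",\"0\",\"5,248 K\""));
    }
}
```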
Tried this and seems to work.
public void parse(){
try {
Runtime runtime = Runtime.getRuntime();
Process proc = runtime.exec("tasklist -fo csv /nh");
BufferedReader stdInput = new BufferedReader(new
InputStreamReader(proc.getInputStream()));
String line = "";
while ((line = stdInput.readLine()) != null) {
System.out.println();
for (String column: line.split("\"")){
if (!column.equals(",")&& !column.equals("")){
System.out.print("["+column+"]");
}
}
}
}catch (Exception e){
e.printStackTrace();
}
}

Splitting a text file into multiple files by specific character sequence

I have a file with the following format.
.I 1
.T
experimental investigation of the aerodynamics of a
wing in a slipstream . 1989
.A
brenckman,m.
.B
experimental investigation of the aerodynamics of a
wing in a slipstream .
.I 2
.T
simple shear flow past a flat plate in an incompressible fluid of small
viscosity .
.A
ting-yili
.B
some texts...
some more text....
.I 3
...
".I 1" indicate the beginning of chunk of text corresponding to doc ID1 and ".I 2" indicates the beginning of chunk of text corresponding to doc ID2.
What I need is to read the text between ".I 1" and ".I 2" and save it as a separate file like "DOC_ID_1.txt", then read the text between ".I 2" and ".I 3"
and save it as a separate file like "DOC_ID_2.txt", and so on. Let's assume that the number of .I # lines is not known.
I have tried this but cannot finish it. Any help will be appreciated.
String inputDocFile="C:\\Dropbox\\Data\\cran.all.1400";
try {
File inputFile = new File(inputDocFile);
FileReader fileReader = new FileReader(inputFile);
BufferedReader bufferedReader = new BufferedReader(fileReader);
String line=null;
String outputDocFileSeperatedByID="DOC_ID_";
//Pattern docHeaderPattern = Pattern.compile(".I ", Pattern.MULTILINE | Pattern.COMMENTS);
ArrayList<ArrayList<String>> result = new ArrayList<> ();
int docID =0;
try {
StringBuilder sb = new StringBuilder();
line = bufferedReader.readLine();
while (line != null) {
if (line.startsWith(".I"))
{
result.add(new ArrayList<String>());
result.get(docID).add(".I");
line = bufferedReader.readLine();
while(line != null && !line.startsWith(".I")){
line = bufferedReader.readLine();
}
++docID;
}
else line = bufferedReader.readLine();
}
} finally {
bufferedReader.close();
}
} catch (IOException ex) {
Logger.getLogger(ReadFile.class.getName()).log(Level.SEVERE, null, ex);
}
You want to find the lines which match ".I n".
The regex you need is: ^\.I \d$
^ indicates the beginning of the line. Hence, if there is any whitespace or text before .I, the line will not match the regex.
\. matches a literal dot. An unescaped . would match any character.
\d indicates any digit. For the sake of simplicity, I allow only one digit in this regex; use \d+ to accept IDs with several digits.
$ indicates the end of the line. Hence, if there are any characters after the digit, the line will not match the expression.
Now, you need to read the file line by line and keep a reference to the file to which you write the current line.
Reading a file line by line is much easier in Java 8 with Files.lines():
private String currentFile = "root.txt";
public static final String REGEX = "^\\.I \\d$";
public void foo() throws Exception{
Path path = Paths.get("path/to/your/input/file.txt");
Files.lines(path).forEach(line -> {
if(line.matches(REGEX)) {
//Extract the digit and update currentFile
currentFile = "File DOC_ID_"+line.substring(3, line.length())+".txt";
System.out.println("Current file is now : " + currentFile);
} else {
System.out.println("Writing this line to "+currentFile + " :" + line);
//Files.write(...);
}
});
}
Note: In order to extract the digit, I use a raw "".substring(), which I consider evil, but it is easier to understand. You can do it in a better way with a Pattern and a Matcher:
With this regex: "\\.I (\\d)" (the same as before, but with parentheses which indicate what you want to capture). Then:
Pattern pattern = Pattern.compile("\\.I (\\d)");
Matcher matcher = pattern.matcher(".I 3");
if(matcher.find()) {
System.out.println(matcher.group(1));//display "3"
}
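Putting the pieces together, a complete version could look roughly like this. It is only a sketch: the class and method names, the default paths and the decision to leave the ".I n" header line out of the output files are my assumptions, and it accepts IDs with several digits (\d+).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DocSplitterDemo {
    // A header line: a literal dot, "I", a space, then the document ID.
    private static final Pattern HEADER = Pattern.compile("^\\.I (\\d+)$");

    // Groups the lines by the document ID of the most recent header line.
    static Map<String, List<String>> splitByHeader(List<String> lines) {
        Map<String, List<String>> docs = new LinkedHashMap<>();
        List<String> current = null;
        for (String line : lines) {
            Matcher m = HEADER.matcher(line);
            if (m.matches()) {
                current = new ArrayList<>();
                docs.put(m.group(1), current);
            } else if (current != null) {
                current.add(line);
            }
            // Lines before the first header are ignored.
        }
        return docs;
    }

    // Writes each document to DOC_ID_<id>.txt inside outputDir.
    static void writeDocs(Map<String, List<String>> docs, Path outputDir) throws IOException {
        Files.createDirectories(outputDir);
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            Path target = outputDir.resolve("DOC_ID_" + doc.getKey() + ".txt");
            Files.write(target, doc.getValue(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default paths; pass your own as arguments.
        Path input = Paths.get(args.length > 0 ? args[0] : "cran.all.1400");
        Path outputDir = Paths.get(args.length > 1 ? args[1] : "docs");
        List<String> lines = Files.readAllLines(input, StandardCharsets.UTF_8);
        writeDocs(splitByHeader(lines), outputDir);
    }
}
```

Keeping the grouping separate from the file writing makes it easy to check the split on a small list of lines before touching the disk.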
Look up regex; Java has built-in libraries for this.
https://docs.oracle.com/javase/tutorial/essential/regex/
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
These links will give you a starting point. Effectively, you can use a counter to perform a pattern match against the string and store anything between the first pattern match and the second pattern match. This information can be output to a separate file using the Formatter class.
Found here:
http://docs.oracle.com/javase/7/docs/api/java/util/Formatter.html
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
public class Test {
/**
* @param args
* @throws IOException
*/
public static void main(String[] args) throws IOException {
// TODO Auto-generated method stub
String inputFile="C:\\logs\\test.txt";
BufferedReader br = new BufferedReader(new FileReader(new File(inputFile)));
String line=null;
StringBuilder sb = new StringBuilder();
int count=1;
try {
while((line = br.readLine()) != null){
if(line.startsWith(".I")){
if(sb.length()!=0){
File file = new File("C:\\logs\\DOC_ID_"+count+".txt");
PrintWriter writer = new PrintWriter(file, "UTF-8");
writer.println(sb.toString());
writer.close();
sb.delete(0, sb.length());
count++;
}
continue;
}
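// readLine() strips the line terminator, so add it back
sb.append(line).append(System.lineSeparator());
}
// Write the final document, which is not followed by another ".I" line
if(sb.length()!=0){
File file = new File("C:\\logs\\DOC_ID_"+count+".txt");
PrintWriter writer = new PrintWriter(file, "UTF-8");
writer.println(sb.toString());
writer.close();
}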
} catch (Exception ex) {
ex.printStackTrace();
}
finally {
br.close();
}
}
}

How to apply regex to entire file, not just line after line?

I want to apply my regular expression not line by line, but to all lines of the text file together.
Currently it matches only when the entire match is on one line. If the match continues on the next line, it doesn't match at all.
class Parser {
public static void main(String[] args) throws IOException {
Pattern patt = Pattern.compile("(include|"
+ "integrate|"
+ "driven based on|"
+ "facilitate through|"
+ "contain|"
+ "using|"
+ "equipped"
+ "integrate|"
+ "implement|"
+ "utilized to facilitate|"
+ "comprise){1}"
+ "[\\s\\w\\,\\(\\)\\;\\:]*\\."); //Regex
BufferedReader r = new BufferedReader(new FileReader("E:/test/test.txt")); // read the file
String line;
PrintWriter pWriter = null;
while ((line = r.readLine()) != null) {
Matcher matcher = patt.matcher(line);
while (matcher.find()) {
try{
pWriter = new PrintWriter(new BufferedWriter(new FileWriter("E:/test/test1.txt", true)));//append any given input
pWriter.println(matcher.group()); //write the result of matcher to the new file
} catch (IOException ioe) {
ioe.printStackTrace();
} finally {
if (pWriter != null){
pWriter.flush();
pWriter.close();
}
}
System.out.println(matcher.group());
}
}
}
}
Change while ((line = r.readLine()) != null) to this:
StringBuilder file = new StringBuilder(); // All of the lines in the file, joined together
while ((line = r.readLine()) != null) {
file.append(line).append('\n'); // readLine() strips the line break, so add it back
}
Matcher matcher = patt.matcher(file);
while (matcher.find()) {
/* Blah blah blah, your outputting goes here. */
}
The reason this happens is that you're matching each line individually. For what you want, you need to apply the regex to the whole file at once.
Currently the matcher is applied per line; it needs to be applied to the whole file to work as intended.
Regex quantifiers are greedy: you will match the whole String on the first match unless you have a . (or another character outside the character class) in your String:
...
+ "comprise){1}"
+ "[\\s\\w\\,\\(\\)\\;\\:]*\\."); //Regex
On the last line you match any whitespace and word character, so pretty much anything but a dot. Also, the {1} and most of the backslashes are superfluous (these characters need no escaping inside []):
...
+ "comprise)"
+ "[\\s\\w,();:]*\\."); //Regex
If you don't care about the newline characters just remove them first and it should work (I see no way around it if you have something like "com\nprise" and want to match that):
s = s.replaceAll("\\n+", "");
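As an alternative to stripping the newlines, the whole file can be read into one String and matched directly, since \s in the character class also matches line breaks. A rough sketch, with a shortened keyword list and made-up class and method names (the default path is the one from the question):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeFileRegexDemo {
    // Shortened keyword list for illustration; \s in the class also matches line breaks.
    private static final Pattern PHRASE =
            Pattern.compile("(include|contain|using|comprise)[\\s\\w,();:]*\\.");

    // Returns every match in the text, including matches that span several lines.
    static List<String> findAll(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = PHRASE.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        // Default path taken from the question; pass another one as an argument.
        String path = args.length > 0 ? args[0] : "E:/test/test.txt";
        String text = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
        for (String match : findAll(text)) {
            System.out.println(match);
        }
    }
}
```

This keeps the line breaks inside the matched text, so a keyword that is itself broken across two lines (the "com\nprise" case) would still need the newline removal shown above.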

Java String Matching in a Sorted File and grouping similar data

I have a sorted file and I need to do the following pattern match. I read a row and compare it with the row just after it; if it matches, I insert the string I used for the match after a comma in that row and move on to the next row. I am new to Java and overwhelmed with options, from OpenCSV to BufferedReader. I intend to iterate through the file until it reaches the end. The rows may contain blanks and may have a date in quotes. The file size would be around 100 MB.
My file has data like
ABCD
ABCD123
ABCD456, 123
XYZ
XYZ890
XYZ123, 890
and output is expected as
ABCD, ABCD
ABCD123, ABCD
ABCD456, 123, ABCD
XYZ, XYZ
XYZ890, XYZ
XYZ123, 890, XYZ
Not sure about the best method. Can you please help me.
To open a file, you can use File and FileReader classes:
File csvFile = new File("file.csv");
FileReader fileReader = null;
try {
fileReader = new FileReader(csvFile);
} catch (FileNotFoundException e) {
e.printStackTrace();
}
You can read the file line by line using Scanner:
Scanner reader = new Scanner(fileReader);
while(reader.hasNextLine()){
String line = reader.nextLine();
parseLine(line);
}
You want to parse this line. For that, you need to learn regex and the Pattern and Matcher classes:
private void parseLine(String line) {
Matcher matcher = Pattern.compile("(ABCD)").matcher(line);
if(matcher.find()){
System.out.println("find: " + matcher.group());
}
}
To find the next occurrence of the pattern in the same row, you can call matcher.find() again. If another result is found, it returns true and you can get that result with matcher.group().
Read line by line and use a regex to rewrite each line as needed with String.replaceAll():
^([A-Z]+)([0-9]*)(, [0-9]+)?$
Replacement : $1$2$3, $1
Here is Online demo
Read more about Java Pattern
Sample code:
String regex = "^([A-Z]+)([0-9]*)(, [0-9]+)?$";
String replacement = "$1$2$3, $1";
String newLine = line.replaceAll(regex,replacement);
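To check the replacement against the sample data from the question, something along these lines can be run directly. The class and method names are made up for the example, and the pattern is compiled once instead of on every call to String.replaceAll.

```java
import java.util.regex.Pattern;

class AppendPrefixDemo {
    // Group 1: leading letters; group 2: optional digits; group 3: optional ", digits" tail.
    private static final Pattern ROW = Pattern.compile("^([A-Z]+)([0-9]*)(, [0-9]+)?$");

    // Appends ", <leading letters>" to a row; rows of another shape are returned unchanged.
    static String appendPrefix(String line) {
        return ROW.matcher(line).replaceAll("$1$2$3, $1");
    }

    public static void main(String[] args) {
        String[] rows = {"ABCD", "ABCD123", "ABCD456, 123", "XYZ", "XYZ890", "XYZ123, 890"};
        for (String row : rows) {
            System.out.println(appendPrefix(row));
        }
    }
}
```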
For better performance, read 100 or more lines at a time into a buffer and call String#replaceAll() once per batch instead of once per line.
sample code:
String regex = "([A-Z]+)([0-9]*)(, [0-9]+)?(\r?\n|$)";
String replacement = "$1$2$3, $1$4";
StringBuilder builder = new StringBuilder();
int counter = 0;
String line = null;
try (BufferedReader reader = new BufferedReader(new FileReader("abc.csv"))) {
while ((line = reader.readLine()) != null) {
builder.append(line).append(System.lineSeparator());
if (++counter % 100 == 0) { // every 100 lines
String newLine = builder.toString().replaceAll(regex, replacement);
System.out.print(newLine);
builder.setLength(0); // reset the buffer
}
}
}
if (builder.length() > 0) {
String newLine = builder.toString().replaceAll(regex, replacement);
System.out.print(newLine);
}
Read more about Java 7 - The try-with-resources Statement

Regarding Java String Manipulation

The string "MO""RET" gets stored in items[1] after the split command. After it is stored, I do a replaceAll on this string and it removes all the double quotes.
But I want it to be stored as MO"RET. How do I do it? In the CSV file that I process using the split command, double quotes within the contents of a text field are repeated (example: "This account is a ""large"" one"). So I want to retain one of the two quotes in the middle of the string if it is repeated, and drop the enclosing quotes if present. How can I do it?
String items[] = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
items[1] has "MO""RET"
String recordType = items[1].replaceAll("\"","");
After this, recordType is MORET; I want it to be MO"RET.
Don't use regex to split a CSV line. This is asking for trouble ;) Just parse it character-by-character. Here's an example:
public static List<List<String>> parseCsv(InputStream input, char separator) throws IOException {
BufferedReader reader = null;
List<List<String>> csv = new ArrayList<List<String>>();
try {
reader = new BufferedReader(new InputStreamReader(input, "UTF-8"));
for (String record; (record = reader.readLine()) != null;) {
boolean quoted = false;
StringBuilder fieldBuilder = new StringBuilder();
List<String> fields = new ArrayList<String>();
for (int i = 0; i < record.length(); i++) {
char c = record.charAt(i);
fieldBuilder.append(c);
if (c == '"') {
quoted = !quoted;
}
if ((!quoted && c == separator) || i + 1 == record.length()) {
fields.add(fieldBuilder.toString().replaceAll(separator + "$", "")
.replaceAll("^\"|\"$", "").replace("\"\"", "\"").trim());
fieldBuilder = new StringBuilder();
}
if (c == separator && i + 1 == record.length()) {
fields.add("");
}
}
csv.add(fields);
}
} finally {
if (reader != null) try { reader.close(); } catch (IOException logOrIgnore) {}
}
return csv;
}
Yes, there's a little regex involved, but it only trims off the trailing separator and the surrounding quotes of a single field.
You can however also grab any 3rd party Java CSV API.
How about:
String recordType = items[1].replaceAll( "\"\"", "\"" );
I would recommend using replace instead of replaceAll.
replaceAll treats its first argument as a regex.
The requirement is to replace two consecutive quotes with one quote:
String recordType = items[1].replace( "\"\"", "\"" );
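On its own, that line leaves the enclosing quotes in place, so for the value in the question you would probably strip them first. A small sketch under that assumption (unquote and the class name are hypothetical, not part of any library):

```java
class UnquoteDemo {
    // Strips one pair of surrounding quotes, then collapses doubled quotes into single ones.
    static String unquote(String field) {
        String s = field;
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            s = s.substring(1, s.length() - 1);
        }
        // replace() takes literal text, so no regex escaping is needed.
        return s.replace("\"\"", "\"");
    }

    public static void main(String[] args) {
        System.out.println(unquote("\"MO\"\"RET\""));
    }
}
```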
To see the difference between replace and replaceAll, execute the code below:
recordType = items[1].replace( "$$", "$" );
recordType = items[1].replaceAll( "$$", "$" );
Here you can use a regular expression:
recordType = items[1].replaceAll( "\\B\"", "" );
recordType = recordType.replaceAll( "\"\\B", "" );
The first statement removes the quotes at the beginning of a word.
The second statement removes the quotes at the end of a word.
