Parsing a Tab Separated File - java

I'm attempting to parse a TSV file from IMDB:
$hutter Battle of the Sexes (2017) (as $hutter Boy) [Bobby Riggs Fan] <10>
NVTION: The Star Nation Rapumentary (2016) (as $hutter Boy) [Himself] <1>
Secret in Their Eyes (2015) (uncredited) [2002 Dodger Fan]
Steve Jobs (2015) (uncredited) [1988 Opera House Patron]
Straight Outta Compton (2015) (uncredited) [Club Patron/Dopeman]
$lim, Bee Moe Fatherhood 101 (2013) (as Brandon Moore) [Himself - President, Passages]
For Thy Love 2 (2009) [Thug 1]
Night of the Jackals (2009) (V) [Trooth]
"Idle Talk" (2013) (as Brandon Moore) [Himself]
"Idle Times" (2012) {(#1.1)} (as Brandon Moore) [Detective Ryan Turner]
As you can see, some lines start with a tab and some do not. I want a map with the actor's name as the key and a list of movies as the value. The actor's name is followed by one or more tabs before the movie listing.
My code:
while ((line = reader.readLine()) != null) {
    Matcher matcher = headerPattern.matcher(line);
    boolean headerMatchFound = matcher.matches();
    if (headerMatchFound) {
        Logger.getLogger(ActorListParser.class.getName()).log(Level.INFO, "Header for actor list found");
        String newline;
        reader.readLine();
        while ((newline = reader.readLine()) != null) {
            String[] fullLine = null;
            String actor;
            String title;
            Pattern startsWithTab = Pattern.compile("^\t.*");
            Matcher tab = startsWithTab.matcher(newline);
            boolean tabStartMatcher = tab.matches();
            if (!tabStartMatcher) {
                fullLine = newline.split("\t.*");
                System.out.println("Actor: " + fullLine[0] +
                        "Movie: " + fullLine[1]);
            }//this line will have code to match lines that start with tabs.
        }
    }
}
The way I've done this only works for a few lines before I get an ArrayIndexOutOfBoundsException. How can I parse the lines and split them into at most 2 strings if they have one or more tabs?

There are subtleties in parsing tab/comma-delimited data files having to do with quoting and escaping.
To save yourself a lot of work, frustration and headaches you really should consider using one of the existing CSV parsing libraries such as OpenCSV or Apache Commons CSV.
Posted as an answer instead of a comment because the OP has not stated a reason for reinventing the wheel and there are some tasks that really have been "solved" once and for all.
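If a library is not an option for this particular file, here is a minimal sketch of the split the question asks for. Note that split("\t.*") throws away everything after the first tab, because the pattern itself consumes it; split("\t+", 2) treats a run of tabs as one separator and returns at most two parts. The class name ActorLineSplitter and the exact tab positions in the sample lines are assumptions, since the listing above has lost its original whitespace.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ActorLineSplitter {

    /** Groups each title under the most recent actor name seen. */
    public static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> moviesByActor = new LinkedHashMap<>();
        String currentActor = null;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // blank separator lines carry no data
            }
            String title;
            if (line.startsWith("\t")) {
                // Continuation line: the title belongs to the previous actor.
                title = line.trim();
            } else {
                // "\t+" matches one or more tabs; the limit of 2 caps the result at two parts.
                String[] parts = line.split("\t+", 2);
                currentActor = parts[0].trim();
                title = parts.length > 1 ? parts[1].trim() : "";
            }
            if (currentActor == null || title.isEmpty()) {
                continue; // nothing to record for this line
            }
            moviesByActor.computeIfAbsent(currentActor, k -> new ArrayList<>()).add(title);
        }
        return moviesByActor;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines modelled on the question's data.
        List<String> sample = List.of(
                "$hutter\t\tBattle of the Sexes (2017)  (as $hutter Boy)  [Bobby Riggs Fan]  <10>",
                "\t\t\tSteve Jobs (2015)  (uncredited)  [1988 Opera House Patron]",
                "",
                "$lim, Bee Moe\tFatherhood 101 (2013)  (as Brandon Moore)  [Himself - President, Passages]");
        parse(sample).forEach((actor, titles) -> System.out.println(actor + " -> " + titles));
    }
}
```

Checking parts.length before reading parts[1] is what avoids the exception on lines that contain no tab at all.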

Related

Only print lines from bottom up if there is data

How would you go about solving the following logic:
I have pdf file with cells:
addressLine1
addressLine2
addressLine3
addressLine4
addressLine5
cityStateZip
All of them have getters.
Sometimes, all fields have data and sometimes they don't.
To make it pretty, I want them grouped together, ie:
1261 Graeber St (address4)
Bldg 2313 Rm 24 (address5)
Pensacola FL 32508 (cityStateZip)
Some of these address lines may be blank. For example, if addressLine1 is the only one with data:
1261 Graeber St (address5)
Pensacola FL 32508 (cityStateZip)
Here, since address2 through address5 are blank, address1 is moved to the PDF cell address5.
My code right now prints:
1261 Graeber St (address1)
(address2)
(address3)
(address4)
(address5)
Pensacola FL 32508 (cityStateZip)
And here is the code:
FdfInput.SetValue("addressLine1", getAddressLine1() );
FdfInput.SetValue("addressLine2", getAddressLine2() );
FdfInput.SetValue("addressLine3", getAddressLine3() );
FdfInput.SetValue("addressLine4", getAddressLine4() );
FdfInput.SetValue("addressLine5", getAddressLine5() );
FdfInput.SetValue("addressLine6", getCityStateZip() );
The picture on the left shows how it looks right now; I want it to look like the picture on the right.
Is this a good candidate for LinkedList.insertLast() ?
This:
if(!getAddressLine1().isEmpty())
    FdfInput.SetValue("addressLine1", getAddressLine1());
if(!getAddressLine2().isEmpty())
    FdfInput.SetValue("addressLine2", getAddressLine2());
if(!getAddressLine3().isEmpty())
    FdfInput.SetValue("addressLine3", getAddressLine3());
if(!getAddressLine4().isEmpty())
    FdfInput.SetValue("addressLine4", getAddressLine4());
if(!getAddressLine5().isEmpty())
    FdfInput.SetValue("addressLine5", getAddressLine5());
if(!getCityStateZip().isEmpty())
    FdfInput.SetValue("cityStateZip", getCityStateZip());
In other words, if there is data to add to the line, do so, otherwise, skip it entirely. For example, let's say all of the fields are empty besides address3, address5, and cityStateZip.
// The output will not look like this:
(blank)
(blank)
addressLine3
(blank)
addressLine5
cityStateZip
Instead, it will look like:
addressLine3
addressLine5
cityStateZip
I solved it by storing the strings in an ArrayList and decrementing the counter in the field name:
List<String> addrLines = new ArrayList<String>();
// Collect the non-empty lines from the bottom up.
if(!getCityStateZip().isEmpty())
    addrLines.add(getCityStateZip());
if(!getAddressLine5().isEmpty())
    addrLines.add(getAddressLine5());
if(!getAddressLine4().isEmpty())
    addrLines.add(getAddressLine4());
if(!getAddressLine3().isEmpty())
    addrLines.add(getAddressLine3());
if(!getAddressLine2().isEmpty())
    addrLines.add(getAddressLine2());
if(!getAddressLine1().isEmpty())
    addrLines.add(getAddressLine1());
// Fill addressLine6 first, then addressLine5, and so on.
for (int i = addrLines.size(); i > 0; --i)
{
    int line = addrLines.size() - i;
    String field = String.format("addressLine%d", 6 - line);
    FdfInput.SetValue(field, addrLines.get(line));
}
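The same bottom-up idea can be written independently of the PDF library. The sketch below is only an illustration: the class name BottomAlignedAddress and the align method are hypothetical, and it returns a map of field name to value instead of calling FdfInput.SetValue directly. It assumes the fields are named addressLine1 through addressLineN, as in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BottomAlignedAddress {

    /**
     * Assigns the non-empty lines to the last field names, preserving their order.
     * With N input lines, the fields are named addressLine1 .. addressLineN.
     */
    public static Map<String, String> align(List<String> lines) {
        List<String> present = new ArrayList<>();
        for (String line : lines) {
            if (line != null && !line.trim().isEmpty()) {
                present.add(line);
            }
        }
        // 1-based number of the first field that receives a value.
        int firstUsed = lines.size() - present.size() + 1;
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < present.size(); i++) {
            fields.put("addressLine" + (firstUsed + i), present.get(i));
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = align(Arrays.asList(
                "1261 Graeber St", "", "", "", "", "Pensacola FL 32508"));
        // Each entry would then be handed to the PDF library,
        // for example FdfInput.SetValue(name, value).
        fields.forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Fields that receive no entry are simply never set, so the empty cells end up at the top of the block.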

Need to filter, parse and sort multiple log files

I need to collect a subset of info from log files that reside on one or more log file servers. I have the following Java code that does the initial data collection/filtering:
public String getLogServerInfo(String userName, String password, String hostNames, String id) throws Exception{
    int timeout = 5;
    String results = "";
    String[] hostNameArray = hostNames.split("\\s*,\\s*");
    for (String hostName : hostNameArray) {
        SSHClient ssh = new SSHClient();
        ssh.addHostKeyVerifier(new PromiscuousVerifier());
        try {
            Utils.writeStdOut("Parsing server: " + hostName);
            ssh.connect(hostName);
            ssh.authPassword(userName, password);
            Session s = ssh.startSession();
            try {
                String sh1 = "cat /logs/en/event/event*.log | grep \"" + id + "\" | grep TYPE=ERROR";
                Command cmd = s.exec(sh1);
                results += IOUtils.readFully(cmd.getInputStream()).toString();
                cmd.join(timeout, TimeUnit.SECONDS);
                Utils.writeStdOut("\n** exit status: " + cmd.getExitStatus());
            } finally {
                s.close();
            }
        } finally {
            ssh.disconnect();
            ssh.close();
        }
    }
    return results;
}
The results string variable looks something like this:
TYPE=ERROR, TIMESTAMP=10/03/2015 07:14:31 253 AM, HOST=server1, APPLICATION=app1, FUNCTION=function1, STATUS=null, GUID=null, etc. etc.
TYPE=ERROR, TIMESTAMP=10/03/2015 07:14:59 123 AM, HOST=server1, APPLICATION=app1, FUNCTION=function1, STATUS=null, GUID=null, etc. etc.
TYPE=ERROR, TIMESTAMP=10/03/2015 07:14:28 956 AM, HOST=server2, APPLICATION=app1, FUNCTION=function2, STATUS=null, GUID=null, etc. etc.
I need to accomplish the following:
What do I need to do to be able to sort results by TIMESTAMP? It is unsorted right now, because I am enumerating one to many files and appending results to the end of a string.
I only want a subset of "columns" returned, such as TYPE, TIMESTAMP, FUNCTION. I thought I could use a regex in the grep, but maybe arrays would be better?
Results are simply printed to the console/report; this is only printed for failed tests and is there for troubleshooting purposes only.
I took the list of output that you provided and put it in a file, named test.txt, making sure that each "TYPE=ERROR etc. etc" was in a new line (I guess it's the same in your output, but it isn't clear).
Then I used cat test.txt | cut -d',' -f1,2,5 | sort -k2 to do what you want.
cut -d',' -f1,2,5 basically splits by comma and only reports tokens number 1,2,5 (TYPE,TIMESTAMP,FUNCTION). If you want more, you can add more numbers depending on what token you want
sort -k2 sorts according to the 2nd column (TIMESTAMP)
The output I get is:
TYPE=ERROR, TIMESTAMP=10/03/2015 07:14:28 956 AM, FUNCTION=function2
TYPE=ERROR, TIMESTAMP=10/03/2015 07:14:31 253 AM, FUNCTION=function1
TYPE=ERROR, TIMESTAMP=10/03/2015 07:14:59 123 AM, FUNCTION=function1
So what you should try is to pipe your command further through | cut -d',' -f1,2,5 | sort -k2
I hope it helps.
After working on this some more, I found that one of the key/value pairs allows commas in the values, so cut will not work. Here is the finished product:
My grep command stays the same, collecting data from all servers:
String sh1 = "cat /logs/en/event/event*.log | grep \"" + id + "\" | grep TYPE=ERROR";
Command cmd = s.exec(sh1);
results += IOUtils.readFully(cmd.getInputStream()).toString();
Put the string into an array, so I can process it line by line:
String lines[] = results.split("\r?\n");
I then used a regex to get the data I needed, repeating the code below for each line in the array and for as many columns as needed. It's a bit of a hack; I probably could have done it better by simply replacing the comma in the offending key/value pair, then using split() with a comma as the delimiter, then looping over the fields I want.
lines2[i] = "";
Pattern p = Pattern.compile("TYPE=(.*?), APPLICATION=.*");
Matcher m = p.matcher(lines[i]);
if (m.find()) {
lines2[i] += ("TYPE=" + m.group(1));
}
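Finally, sort the array. Arrays.sort compares the strings lexicographically, and since every line starts with the same TYPE=ERROR prefix, this effectively orders them by the TIMESTAMP text in the 2nd column: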
Arrays.sort(lines2);
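One caveat with sorting the raw text: with a 12-hour clock, "01:00:00 000 PM" sorts before "11:00:00 000 AM" as a string. A minimal sketch of sorting on the parsed timestamp follows. The class name LogLineSorter and its methods are hypothetical, the format is assumed to be month/day/year (the sample data does not show whether 10/03 is October 3 or March 10), and the field regex assumes a value never contains ", " followed by an upper-case key and "=".

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineSorter {

    // Assumed layout: month/day/year, 12-hour clock, milliseconds, AM/PM marker.
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("MM/dd/yyyy hh:mm:ss SSS a", Locale.ENGLISH);

    /** Returns the value of key, reading up to the next ", KEY=" or the end of the line. */
    static String field(String line, String key) {
        Matcher m = Pattern
                .compile("(?:^|, )" + Pattern.quote(key) + "=(.*?)(?=, [A-Z]+=|$)")
                .matcher(line);
        return m.find() ? m.group(1) : null;
    }

    /** Sorts the lines by parsed TIMESTAMP and keeps only the requested keys. */
    public static List<String> selectAndSort(String results, String... keys) {
        List<String> kept = new ArrayList<>();
        for (String line : results.split("\r?\n")) {
            if (field(line, "TIMESTAMP") != null) {
                kept.add(line); // lines without a timestamp are dropped
            }
        }
        kept.sort(Comparator.comparing(
                (String l) -> LocalDateTime.parse(field(l, "TIMESTAMP"), TS)));

        List<String> out = new ArrayList<>();
        for (String line : kept) {
            List<String> cols = new ArrayList<>();
            for (String key : keys) {
                String value = field(line, key);
                if (value != null) {
                    cols.add(key + "=" + value);
                }
            }
            out.add(String.join(", ", cols));
        }
        return out;
    }

    public static void main(String[] args) {
        String results =
                "TYPE=ERROR, TIMESTAMP=10/03/2015 07:14:31 253 AM, HOST=server1, APPLICATION=app1, FUNCTION=function1, STATUS=null\n"
              + "TYPE=ERROR, TIMESTAMP=10/03/2015 07:14:28 956 AM, HOST=server2, APPLICATION=app1, FUNCTION=function2, STATUS=null";
        selectAndSort(results, "TYPE", "TIMESTAMP", "FUNCTION").forEach(System.out::println);
    }
}
```

A timestamp that does not match the assumed pattern makes LocalDateTime.parse throw a DateTimeParseException, so real code would need to decide how to handle such lines.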

Weird BufferedReader behavior for a huge file

I am getting a very weird error. My program reads a CSV file.
Whenever it comes to this line:
"275081";"cernusco astreet, milan, italy";NULL
I get an error:
In the debug screen, I see that the BufferedReader read only
"275081";"cernusco as
That is only part of the line, but it should read the whole line.
What bugs me the most is that when I simply remove that line from the CSV file, the bug disappears! The program runs without any problem. I could remove the line, since maybe it is bad input or whatever, but I want to understand why I am having this problem.
For better understanding, I will include a part of my code here:
reader = new BufferedReader(new FileReader(userFile));
reader.readLine(); // skip first line
while ((line = reader.readLine()) != null) {
    String[] fields = line.split("\";\"");
    int id = Integer.parseInt(stripPunctionMark(fields[0]));
    String location = fields[1];
    if (location.contains("\";")) { // When there is no age. The data is represented as "location";NULL. We cannot split for ";" here. So check for "; and split.
        location = location.split("\";")[0];
        System.out.printf("Added %d at %s\n", id, location);
        people.put(id, new Person(id, location));
        numberOfPeople++;
    }
    else {
        int age = Integer.parseInt(stripPunctionMark(fields[2]));
        people.put(id, new Person(id, location, age));
        System.out.printf("Added %d at: %s age: %d \n", id, location, age);
        numberOfPeople++;
    }
}
The full CSV file is linked; here is a short excerpt around the line where I encountered the error:
"275078";"el paso, texas, usa";"62"
"275079";"istanbul, eurasia, turkey";"26"
"275080";"madrid, n/a, spain";"29"
"275081";"cernusco astreet, milan, italy";NULL
"275082";"hacienda heights, california, usa";"16"
"275083";"cedar rapids, iowa, usa";"22"
This has nothing whatsoever to do with BufferedReader. It doesn't even appear in the stack trace.
It has to do with your failure to check the result and length of the array returned by String.split(). Instead you are just assuming the input is well-formed, with at least three columns in each row, and you have no defences if it isn't.
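A minimal sketch of what such a check could look like follows. It is not the asker's code: the class PersonLineParser and its nested Person are hypothetical stand-ins, and it assumes that no quoted value contains a semicolon. Rows that do not have exactly three columns, or whose numbers do not parse, are reported as empty instead of throwing.

```java
import java.util.Optional;

public class PersonLineParser {

    /** Stand-in for the question's Person class; age is null when the file says NULL. */
    public static final class Person {
        public final int id;
        public final String location;
        public final Integer age;

        Person(int id, String location, Integer age) {
            this.id = id;
            this.location = location;
            this.age = age;
        }
    }

    /** Removes one pair of surrounding double quotes, if present. */
    static String unquote(String s) {
        s = s.trim();
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            return s.substring(1, s.length() - 1);
        }
        return s;
    }

    /** Parses one row, or returns empty if the row is malformed. */
    public static Optional<Person> parse(String line) {
        // A limit of -1 keeps trailing empty fields so the column count stays honest.
        String[] fields = line.split(";", -1);
        if (fields.length != 3) {
            return Optional.empty(); // truncated or over-long row
        }
        try {
            int id = Integer.parseInt(unquote(fields[0]));
            String location = unquote(fields[1]);
            String rawAge = unquote(fields[2]);
            Integer age = rawAge.equals("NULL") ? null : Integer.valueOf(rawAge);
            return Optional.of(new Person(id, location, age));
        } catch (NumberFormatException e) {
            return Optional.empty(); // id or age was not a number
        }
    }

    public static void main(String[] args) {
        String[] rows = {
            "\"275080\";\"madrid, n/a, spain\";\"29\"",
            "\"275081\";\"cernusco astreet, milan, italy\";NULL",
            "\"275081\";\"cernusco as"
        };
        for (String row : rows) {
            System.out.println(parse(row).isPresent() ? "ok: " + row : "skipped: " + row);
        }
    }
}
```

Logging the skipped rows, as main does here, makes it much easier to see which input actually triggers the problem.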

Regular Expression to get Information from Whatsapp Text file

I have no idea how to create regular expressions for extracting different pieces of text from a text file. I am working on a text file containing the message details of a WhatsApp chat.
Consider the following data from a text file of a WhatsApp chat:
25/12/2012 9:15 am: User1: Faith makes all things possible,
Hope makes all things work,
Love makes all things beautiful,
May you have all the three for this Christmas.
MERRY CHRISTMAS
01/01/2013 12:03 am: User1: <message>.
04/08/2013 10:54 am: User2: Happy Friendship day
13/10/2013 11:57 am: User1:<message>
<message continues>
<message continues>
30/12/2013 10:07 pm: User3:<message>
30/12/2013 11:12 pm: User4: Same to you
This is a sample chat text from which I need to extract the date, time, username and message. I am working in Java for this.
The Java code I have worked out so far is as follows, but I haven't found a correct REGEX for my requirement.
BufferedReader br = new BufferedReader(new FileReader("text filepath"));
String sCurrentLine;
Pattern r = Pattern.compile(REGEX); //REGEX required for extracting data
while ((sCurrentLine = br.readLine()) != null) {
    System.out.println(sCurrentLine);
    Matcher m = r.matcher(sCurrentLine);
    if (m.find()) {
        System.out.println("Date: " + m.group(1) );
        System.out.println("Time: " + m.group(2) );
        System.out.println("User: " + m.group(3) );
        System.out.println("Message: " + m.group(4) );
    } else {
        System.out.println("NO MATCH");
    }
}
Thanks in advance for any help!
I think you're looking for this regex,
(\d{2}\/\d{2}\/\d{4})\s(\d(?:\d)?:\d{2} [ap]m):\s([^:]*):(.*?)(?=\s*\d{2}\/|$)
Java regex would be,
"(?s)(\\d{2}/\\d{2}/\\d{4})\\s(\\d(?:\\d)?:\\d{2} [ap]m):\\s([^:]*):(.*?)(?=\\s*\\d{2}/|$)"
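Because the question's loop reads one line at a time, a message that spans several lines never reaches the matcher as a single string. One possible way around that, sketched below, is to match only the header part of each line and append any non-matching line to the previous message. The class ChatLogParser, its Message type and the simplified HEADER pattern are assumptions for illustration, not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChatLogParser {

    /** One chat message; text grows when continuation lines follow. */
    public static final class Message {
        public final String date;
        public final String time;
        public final String user;
        public String text;

        Message(String date, String time, String user, String text) {
            this.date = date;
            this.time = time;
            this.user = user;
            this.text = text;
        }
    }

    // dd/MM/yyyy, h:mm or hh:mm with am/pm, user name without a colon, then the message.
    private static final Pattern HEADER = Pattern.compile(
            "^(\\d{2}/\\d{2}/\\d{4}) (\\d{1,2}:\\d{2} [ap]m): ([^:]+):\\s?(.*)$");

    public static List<Message> parse(List<String> lines) {
        List<Message> messages = new ArrayList<>();
        for (String line : lines) {
            Matcher m = HEADER.matcher(line);
            if (m.matches()) {
                // A new message starts on this line.
                messages.add(new Message(m.group(1), m.group(2), m.group(3), m.group(4)));
            } else if (!messages.isEmpty()) {
                // No header: treat the line as a continuation of the previous message.
                Message last = messages.get(messages.size() - 1);
                last.text = last.text + "\n" + line;
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "25/12/2012 9:15 am: User1: Faith makes all things possible,",
                "Hope makes all things work,",
                "04/08/2013 10:54 am: User2: Happy Friendship day");
        for (Message msg : parse(lines)) {
            System.out.println(msg.date + " | " + msg.time + " | " + msg.user + " | " + msg.text);
        }
    }
}
```

A continuation line that happens to look like a header would be misread as a new message; this sketch does not try to detect that case.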

Parse a task list

A file contains the following:
HPWAMain.exe 3876 Console 1 8,112 K
hpqwmiex.exe 3900 Services 0 6,256 K
WmiPrvSE.exe 3924 Services 0 8,576 K
jusched.exe 3960 Console 1 5,128 K
DivXUpdate.exe 3044 Console 1 16,160 K
WiFiMsg.exe 3984 Console 1 6,404 K
HpqToaster.exe 2236 Console 1 7,188 K
wmpnscfg.exe 3784 Console 1 6,536 K
wmpnetwk.exe 3732 Services 0 11,196 K
skypePM.exe 2040 Console 1 25,960 K
I want to get the process ID of the skypePM.exe. How is this possible in Java?
Any help is appreciated.
Algorithm
Open the file.
In a loop, read a line of text.
If the line of text starts with skypePM.exe then extract the number.
Repeat looping until all lines have been read from the file.
Close the file.
Implementation
import java.io.*;
public class T {
    public static void main( String args[] ) throws Exception {
        BufferedReader br = new BufferedReader(
            new InputStreamReader(
                new FileInputStream( "tasklist.txt" ) ) );
        String line;
        while( (line = br.readLine()) != null ) {
            if( line.startsWith( "skypePM.exe" ) ) {
                line = line.substring( "skypePM.exe".length() );
                int taskId = Integer.parseInt( (line.trim().split( " " ))[0] );
                System.out.println( "Task Id: " + taskId );
            }
        }
        br.close();
    }
}
Alternate Implementation
If you have Cygwin and related tools installed, you could use:
cat tasklist.txt | grep skypePM.exe | awk '{ print $2; }'
To find the process ID of the application skypePM.exe:
Open the file.
Read the lines one by one.
Find the line that begins with skypePM.exe.
Parse that line to read the number that follows the process name, skipping the spaces.
That number is the process ID.
It is all string operations.
Remember that this only works as long as the format of the file does not change after you write the code.
If you really want to parse the output, you may need a different strategy. If your output file really is the result of a tasklist execution, then it should have some column headers at the top of it like:
Image Name PID Session Name Session# Mem Usage
========================= ======== ================ =========== ============
I would use these, in particular the set of equal signs with spaces, to break any subsequent strings using a fixed-width column strategy. This way, you could have more flexibility in parsing the output if needed (i.e. maybe someone is looking for java.exe or wjava.exe). Do keep in mind the last column may not be padded with spaces all the way to the end.
I will say, in the strictest sense, the existing answers should work for just getting the PID.
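A rough sketch of the fixed-width idea follows, assuming the file really does contain the separator line of equal signs. The class FixedWidthTaskList and its methods are hypothetical names. The column boundaries are taken from the runs of '=' and every later row is cut at those positions, with short rows tolerated because the last column may not be padded.

```java
import java.util.ArrayList;
import java.util.List;

public class FixedWidthTaskList {

    /** Finds the [start, end) span of each run of '=' in the separator line. */
    static List<int[]> columnSpans(String separator) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < separator.length()) {
            if (separator.charAt(i) == '=') {
                int start = i;
                while (i < separator.length() && separator.charAt(i) == '=') {
                    i++;
                }
                spans.add(new int[] {start, i});
            } else {
                i++;
            }
        }
        return spans;
    }

    /** Cuts a row at the given spans; rows shorter than a span yield shorter or empty cells. */
    static String[] cut(String row, List<int[]> spans) {
        String[] cells = new String[spans.size()];
        for (int c = 0; c < spans.size(); c++) {
            int start = Math.min(spans.get(c)[0], row.length());
            int end = Math.min(spans.get(c)[1], row.length());
            cells[c] = row.substring(start, end).trim();
        }
        return cells;
    }

    /** Returns the PID of the first row whose image name matches, or -1 if none does. */
    public static int findPid(List<String> lines, String imageName) {
        List<int[]> spans = null;
        for (String line : lines) {
            if (spans == null) {
                // Everything up to and including the separator line is header.
                if (line.startsWith("=")) {
                    spans = columnSpans(line);
                }
                continue;
            }
            String[] cells = cut(line, spans);
            if (cells.length >= 2 && cells[0].equalsIgnoreCase(imageName)) {
                return Integer.parseInt(cells[1]);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Sample rows are generated with the same widths as the separator line.
        String fmt = "%-25s %8s %-16s %11s %12s";
        String separator = "=".repeat(25) + " " + "=".repeat(8) + " " + "=".repeat(16)
                + " " + "=".repeat(11) + " " + "=".repeat(12);
        List<String> lines = List.of(
                String.format(fmt, "Image Name", "PID", "Session Name", "Session#", "Mem Usage"),
                separator,
                String.format(fmt, "wmpnetwk.exe", "3732", "Services", "0", "11,196 K"),
                String.format(fmt, "skypePM.exe", "2040", "Console", "1", "25,960 K"));
        System.out.println("PID: " + findPid(lines, "skypePM.exe"));
    }
}
```

Since cut returns every column, the same code can pull out the session name or memory usage without further changes.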
Implementing this in Java is not the best approach; a shell or another scripting language would serve you much better. That said, JAWK is an implementation of awk in Java, and I think it may help you.
