Regex to parse multiline data - java

I have the following data from a file and would like to know whether I can parse it with a regex:
Name (First Name) City Zip
John (retired) 10007
Mark Baltimore 21268
....
....
Avg Salary
70000 100%
It's not a big file, and the entire content is available in a String object with newline characters (\n) (String data = "data from the file").
I am trying to get the name, city and zip, and then the salary and percentage details.
Data inside () is considered part of the Name field.
Spaces are valid in the Name field; the other fields contain no spaces.
'Avg Salary' appears only at the end of the file.
Would it be easy to do this via regex parsing in Java?

If the text file is space-aligned, you can (and probably should) extract the fields by character position: take the first n characters of each line as the name, the next m characters as the city, and so on.
Here is one way to do that. It derives the field widths automatically from the header line, assuming we know the header.
String data = "data from the file";
String[] lines = data.split("\n");
// One row per line is an upper bound: the header and salary lines leave unused rows
String[][] result = new String[lines.length][3];
int avgSalary = 0;
int secondFieldStart = lines[0].indexOf("City");
int thirdFieldStart = lines[0].indexOf("Zip");
for (int i = 1; i < lines.length; i++) {
    // Do not trim before slicing: trimming would shift the column offsets
    String line = lines[i];
    if (line.trim().equals("Avg Salary")) {
        // The next line looks like "70000 100%"; the salary is its first token
        avgSalary = Integer.parseInt(lines[i + 1].trim().split("\\s+")[0]);
        break;
    }
    result[i - 1][0] = line.substring(0, secondFieldStart).trim(); // Name
    result[i - 1][1] = line.substring(secondFieldStart, thirdFieldStart).trim(); // City
    result[i - 1][2] = line.substring(thirdFieldStart).trim(); // Zip
}
Using a regex is possible, but it will be more complicated. And a regex cannot tell a person's name from a city's name anyway.
Consider this case:
John Long-name Joe New York 21003
How would you know that the name is John Long-name Joe rather than John Long-name Joe New, unless you know that the first field is at most 20 characters long? (Note that John Long-name Joe is 18 characters long, so it fits in that field with padding before New York.)
Of course, if your fields are separated by another character (such as the tab character \t), you can split each line on that instead, and it's easy to modify the code above to accommodate that =)
Since the solution I proposed above is simpler, I guess you might want to try it instead =)
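For comparison, here is a rough sketch of what the regex route could look like. It assumes the asker's rule that only the Name field may contain spaces and that every row has all three fields; the class name RegexRowSketch and the method parseRow are made up for illustration. It also reproduces the ambiguity described above: a two-word city gets split.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexRowSketch {
    // Greedy name group; city and zip are the last two whitespace-free tokens.
    // Assumption: every row really has all three fields.
    private static final Pattern ROW = Pattern.compile("^(.*\\S)\\s+(\\S+)\\s+(\\S+)$");

    /** Returns {name, city, zip}, or null if the line has fewer than three tokens. */
    static String[] parseRow(String line) {
        Matcher m = ROW.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // prints: Mark | Baltimore | 21268
        System.out.println(String.join(" | ", parseRow("Mark Baltimore 21268")));
        // prints: John Long-name Joe New | York | 21003  (the city is split wrongly)
        System.out.println(String.join(" | ", parseRow("John Long-name Joe New York 21003")));
    }
}
```

The second call shows why fixed column positions are the safer choice whenever the file is space-aligned.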

Related

Splitting a String with different inputs Java

Input = computer science: harvard university, cambridge (EXAMPLE)
Prompt: Use a String method twice, to find the locations of the colon and the comma.
Use a String method to extract the major and store it into a new String.
Use a String method to extract the university and store it into a new String.
Use a String method to extract the city and store it into a new String.
Display the major, university, and city in reverse, as shown below, followed by a newline.
I was thinking I could just use substring(); but the input entered from the user varies and so the indexes are all different. I am still learning and am stumped on how to do this one. Does substring let you somehow use it without knowing the specific index? Or do I have to use a whole different method? Any help would be awesome. BTW this is HW.
Assuming the input string has format: <major>: <university>, <city>
String has a set of indexOf() methods to find the position of a character/substring (returns -1 if the character/substring is not found) and substring to retrieve a subpart of the input from index or between a pair of indexes.
Also, String::trim may be used to get rid of leading/trailing whitespaces.
String input = "computer science: harvard university, cambridge";
int colonAt = input.indexOf(':');
int commaAt = input.indexOf(',');
// or int commaAt = input.indexOf(',', colonAt + 1); to make sure comma occurs after the colon
String major = null;
String university = null;
String city = null;
if (colonAt > -1 && commaAt > -1) {
    major = input.substring(0, colonAt);
    university = input.substring(colonAt + 1, commaAt).trim();
    city = input.substring(commaAt + 1).trim();
}
System.out.printf("Major: '%s', Univ: '%s', City: '%s'%n", major, university, city);
Output:
Major: 'computer science', Univ: 'harvard university', City: 'cambridge'

In Java is there any method to read a data of a line from file?

Is there a method to read the first and the last data from a line which are separated by space from a file in java.
Example:
the file contains the following information
100 20-11-2020 08:25:42 IN
101 21-09-2020 09:01:20 IN
Here I just want 100 and IN to extract and print
One approach is to read the whole line as a string and use the split method. Store the pieces in an array and simply access the first and last elements, something like this:
String line = "100 20-11-2020 08:25:42 IN";
String arr[] = line.split(" ");
String var1 = arr[0];//stores 100
String var2 = arr[arr.length - 1];//stores IN
Hope that helps! Happy coding!
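Applied to every line of the file, the same idea might look like the sketch below. The class FirstLastSketch and the method firstAndLast are hypothetical names; the lines are passed in as a list so the example runs without a file, but they could just as well come from Files.readAllLines. Splitting on "\\s+" instead of a single space is a small change that tolerates repeated spaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FirstLastSketch {
    /** Returns "first last" for each non-blank line, splitting on runs of whitespace. */
    static List<String> firstAndLast(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = trimmed.split("\\s+");
            out.add(parts[0] + " " + parts[parts.length - 1]);
        }
        return out;
    }

    public static void main(String[] args) {
        // In a real program the lines could come from Files.readAllLines(path)
        List<String> lines = Arrays.asList(
                "100 20-11-2020 08:25:42 IN",
                "101 21-09-2020 09:01:20 IN");
        // prints "100 IN" and then "101 IN"
        for (String s : firstAndLast(lines)) {
            System.out.println(s);
        }
    }
}
```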

jython convert text file to list of strings

I must convert a text file into a list of strings separated by commas (with no whitespace and no first line). After printing that, I need to print, for each state: its name, how many lines contain it, the sum of all Cen2010 values (the 1st number in each line), the sum of Est2013 values (the last number in each line), and the total change from the Cen2010 population to the Est2013 population.
Text File Example:
NAME,STNAME,Cen2010,Base2010,Est2010,Est2011,Est2012,Est2013
"Abingdon city",Illinois,3319,3286,3286,3270,3242,3227
"Addieville village",Illinois,252,252,252,250,250,247
"Addison village",Illinois,36942,36964,37007,37181,37267,37385
"Adeline village",Illinois,85,85,85,84,84,83
Current Code:
def readPopest():
    censusfile=pickAFile()
    cf=open(censusfile,"rt")
    cflines=cf.readlines()
    for i in range(len(cflines)-1):
        lines=cflines[i+1]
        estimate=lines.strip().split(',')
        print estimate
Returning:
['"Abingdon city"', 'Illinois', '3319', '3286', '3286', '3270', '3242', '3227']
['"Addieville village"', 'Illinois', '252', '252', '252', '250', '250', '247']
['"Addison village"', 'Illinois', '36942', '36964', '37007', '37181', '37267', '37385']
['"Adeline village"', 'Illinois', '85', '85', '85', '84', '84', '83']
I think you could import this data into an SQL database, where it is very easy to sum, filter, etc.
But in Python we have dictionaries. You can read the data and fill a dictionary whose keys are state names. Then, for each line, you add the town to that state's list of towns and add its numbers to the totals already saved. Of course, for the 1st town in a state you must create the structure with two arrays: one for towns and one for numbers. In code it looks like:
def add_items(main_dict, state, town, numbers):
    try:
        towns_arr, numbers_arr = main_dict[state]
        towns_arr.append(town)
        for i in range(len(numbers)):
            numbers_arr[i] += numbers[i]
    except KeyError:
        town_arr = [town, ]
        main_dict[state] = [town_arr, numbers]
Now you must use it in your main code that reads file:
state_dict = {}
cf = open(censusfile, "rt")
lines = cf.readlines()
for line in lines[1:]: # we skip 1st line
    arr = line.strip().split(',')
    town = arr[0]
    state = arr[1]
    numbers = [int(x) for x in arr[2:]]
    add_items(state_dict, state, town, numbers)
print(state_dict)
As homework, try printing this dictionary in the desired format.

parsing values from text file in java

I've got some text files I need to extract data from. The file itself contains around a hundred lines and the interesting part for me is:
AA====== test==== ====================================================/
AA normal low max max2 max3 /
AD .45000E+01 .22490E+01 .77550E+01 .90000E+01 .47330E+00 /
Say I need to extract the double values under "normal", "low" and "max". Is there any efficient and not-too-error-prone solution other than regexing the hell out of the text file?
If you really want to avoid regexes, and assuming you'll always have this same basic format, you could do something like:
Map<String, Double> map = new HashMap<>();
// filePath must be a File or Path here; a String would be scanned as literal text
Scanner scan = new Scanner(filePath); // or your preferred input mechanism
// Read the top line outside the assert: asserts are disabled by default,
// so a nextLine() call inside one might never run
String top = scan.nextLine();
assert top.startsWith("AA===="); // ensure it is the top line
while (scan.hasNextLine()) {
    String[] headings = scan.nextLine().trim().split("\\s+"); // ("\t") can be used if you're sure the delimiters will always be tabs
    String[] vals = scan.nextLine().trim().split("\\s+");
    assert headings[0].equals("AA"); // heading row
    assert vals[0].equals("AD"); // value row
    // start at 1 to skip the row tag, stop before the trailing "/"
    for (int i = 1; i < headings.length - 1; i++) {
        map.put(headings[i], Double.parseDouble(vals[i]));
    }
}
// to make sure a certain value is contained in the map:
assert map.containsKey("normal");
// use it:
double normalValue = map.get("normal");
Code is untested as I don't have access to an IDE at the moment. Also, I obviously don't know what's variable and what will remain constant here (read: the "AD", "AA", etc.), but hopefully you get the gist and can modify as needed.
If each line will always have this exact form you can use String.split()
String line; // Fill with one line from the file
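String[] cols = line.split("\\."); // split takes a regex, so the dot must be escaped
// cols[0] is the "AD " row tag; the values start at index 1
String normal = "." + cols[1].trim();
String low = "." + cols[2].trim();
String max = "." + cols[3].trim();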
If you know at what index each value starts, you can just take substrings of the row. (The split method technically uses a regex.)
i.e.
String normal = line.substring(x, y).trim();
String low = line.substring(z, w).trim();
etc.

closest thing to NSScanner in Java

I'm moving some code from objective-c to java. The project is an XML/HTML Parser. In objective c I pretty much only use the scanUpToString("mystring"); method.
I looked at the Java Scanner class, but it breaks everything into tokens. I don't want that. I just want to be able to scan up to occurrences of substrings and keep track of the scanners current location in the overall string.
Any help would be great thanks!
EDIT
to be more specific. I don't want Scanner to tokenize.
String test = "<title balh> blah <title> blah>";
Scanner feedScanner = new Scanner(test);
String title = "<title";
String a = feedScanner.next(title);
String b = feedScanner.next(title);
In the above code I'd like feedScanner.next(title); to scan up to the end of the next occurrence of "<title"
What actually happens is that the first call to feedScanner.next works, since the default delimiter is whitespace; however, the second call fails (for my purposes).
You can achieve this with the String class (java.lang.String).
First, get the index of the first occurrence of your substring.
int firstOccurrence = string.indexOf(substring);
Then walk through the rest of the string, fetching each following occurrence by passing a start index:
int nextIndex = string.indexOf(substring, fromIndex);
If you want to save the values, wrap them as Integer objects and add them to an ArrayList.
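Put together, those steps could look roughly like this. It is only a sketch: the class OccurrenceSketch and the method indexesOf are illustrative names, and it assumes non-overlapping matches are wanted.

```java
import java.util.ArrayList;
import java.util.List;

class OccurrenceSketch {
    /** Collects the start index of every non-overlapping occurrence of target in text. */
    static List<Integer> indexesOf(String text, String target) {
        List<Integer> found = new ArrayList<>();
        if (target.isEmpty()) {
            return found; // an empty target would never advance the loop
        }
        int from = text.indexOf(target);
        while (from >= 0) {
            found.add(from); // autoboxed into the Integer wrapper class
            from = text.indexOf(target, from + target.length());
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [0, 18]
        System.out.println(indexesOf("<title balh> blah <title> blah>", "<title"));
    }
}
```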
This really is easier if you just use String's methods directly:
String test = "<title balh> blah <title> blah>";
String target = "<title";
int index = 0;
index = test.indexOf( target, index ) + target.length();
// index is now 6 (the space between "<title" and "balh")
index = test.indexOf( target, index ) + target.length();
// Index is now at the ">" in "<title> blah"
Depending on what you want to actually do besides walk through the string, different approaches might be better/worse. E.g. if you want to get the blah> blah string between the <title's, a Scanner is convenient:
String test = "<title balh> blah <title> blah>";
Scanner scan = new Scanner(test);
scan.useDelimiter("<title");
String stuff = scan.next(); // gets " balh> blah "
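To keep track of the current location in the overall string, as the question asks, the indexOf calls can be wrapped in a small helper. This is a minimal sketch, not a port of NSScanner: the class SubstringCursor and its method scanPast are invented for this example, and unlike scanUpToString it stops after the target rather than before it, which is what the question describes.

```java
class SubstringCursor {
    private final String text;
    private int position = 0; // index of the next unread character

    SubstringCursor(String text) {
        this.text = text;
    }

    /**
     * Returns the text between the current position and the next occurrence
     * of target, then moves the position just past that occurrence.
     * Returns null, without moving, if target does not occur again.
     */
    String scanPast(String target) {
        int at = text.indexOf(target, position);
        if (at < 0) {
            return null;
        }
        String skipped = text.substring(position, at);
        position = at + target.length();
        return skipped;
    }

    int position() {
        return position;
    }

    public static void main(String[] args) {
        SubstringCursor cursor = new SubstringCursor("<title balh> blah <title> blah>");
        cursor.scanPast("<title");
        System.out.println(cursor.position());              // prints 6
        System.out.println("[" + cursor.scanPast("<title") + "]"); // prints [ balh> blah ]
        System.out.println(cursor.position());              // prints 24
    }
}
```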
Maybe String.split is something for you?
s = "The almighty String is mystring is your String is our mystring-object - isn't it?";
parts = s.split ("mystring");
Result:
Array("The almighty String is ", " is your String is our ", "-object - isn't it?")
You know that "mystring" must have stood between the parts. I'm not sure about the start and end, so maybe you need s.startsWith ("mystring") / s.endsWith.
