Regex for extraction of a key value pair - java

I have a text file. Sample content of that file looks like this:
root(ROOT-0, good-4)nn(management-2, company-1)nsubj(good-4, management-2)
Now I need to separate this and store it in an ArrayList. For that I wrote the following code:
public class subject {
public void getsub(String f){
ArrayList <String>ar=new ArrayList<String>();
String a="[a-z]([a-z]-[0-9],[a-z]-[0-9])";
Pattern pattern=Pattern.compile(a);
Matcher matcher=pattern.matcher(f);
while(matcher.find()){
if(matcher.find()){
ar.add(matcher.group(0));
}
}
System.out.println(ar.size());
for(int i=0;i<ar.size();i++){
System.out.println(ar.get(i));
}
}
}
But the ArrayList is not getting populated. Why is that?

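You are using unescaped parentheses in your Pattern.
Unescaped parentheses define a capturing group within your Pattern, for later back-references.
However, here you are trying to match literal parentheses, so they need to be escaped: \\( and \\).
Note also that [a-z] and [0-9] each match only a single character, and that calling matcher.find() again inside the while loop skips every other match.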
For a rough solution, try this:
String text = "root(ROOT-0, good-4)nn(management-2, company-1)nsubj(good-4, management-2)";
List<String> myPairs = new ArrayList<String>();
Pattern p = Pattern.compile(".+?\\(.+?,.+?\\)");
Matcher m = p.matcher(text);
while (m.find()) {
myPairs.add(m.group());
}
System.out.println(myPairs);
Output:
[root(ROOT-0, good-4), nn(management-2, company-1), nsubj(good-4, management-2)]
Final note: for an improved solution, I would try and use groups to distinguish between the first part of your Pattern and the actual pair in the parenthesis, so to build a Map<String, ArrayList<String>> as a data object in this case - but this is out of scope for this answer.
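As a rough sketch of the group-based idea mentioned in the final note (the class name DependencyParser, the method name parse, and the exact pattern are illustrative assumptions, not part of the original answer), the relation name and the two arguments can each get their own capturing group:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DependencyParser {
    // Group 1: relation name; groups 2 and 3: the two comma-separated arguments.
    private static final Pattern RELATION =
            Pattern.compile("(\\w+)\\(([^,()]+),\\s*([^,()]+)\\)");

    // Hypothetical helper: maps each relation name to its pair of arguments.
    // If the same relation name occurs twice, the later pair replaces the earlier one.
    static Map<String, List<String>> parse(String text) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        Matcher m = RELATION.matcher(text);
        while (m.find()) {
            List<String> pair = new ArrayList<>();
            pair.add(m.group(2));
            pair.add(m.group(3));
            result.put(m.group(1), pair);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "root(ROOT-0, good-4)nn(management-2, company-1)nsubj(good-4, management-2)";
        System.out.println(parse(text));
    }
}
```

If relation names can repeat in your data, a Map<String, List<List<String>>> or a plain list of small objects is probably a better fit than this map.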

Related

Replace the opening and closing string, while keeping the enclosed value?

String value="==Hello==";
For the above string, I have to replace the "==" tags as <Heading>Hello</Heading>. I have tried doing it like this:
value = value.replaceAll("(?s)\\=\\=.","<heading>");
value = value.replaceAll(".\\=\\=(?s)","</heading>");
However, my original dataset is huge, with lots of strings like this to be replaced. Can the above be performed in a single statement, giving preference to performance?
The regex should not affect strings of form, ===<value>===, where value is any string of characters[a-z,A-Z].
To avoid iterating over the string many times to first replace ===abc=== and then ==def==, we can iterate over it once and, thanks to Matcher#appendReplacement and Matcher#appendTail, dynamically decide how to replace each match (based on the number of = characters).
A regex which can find both described cases can look like (={2,3})([a-z]+)\1, but to make it more usable let's use named groups (?<name>subregex) and, instead of [a-z]+, the more general [^=]*.
This will give us
Pattern p = Pattern.compile("(?<eqAmount>={2,3})(?<value>[^=]*)\\k<eqAmount>");
The group named eqAmount will hold == or ===. \\k<eqAmount> is a backreference to that group, which means the regex expects to find == or === again at that point, depending on what eqAmount already holds.
Now we need some mapping between == or === and replacements. To hold such mapping we can use
Map<String,String> replacements = new HashMap<>();
replacements.put("===", "<subheading>${value}</subheading>");
replacements.put("==", "<heading>${value}</heading>");
${value} is a reference to the capturing group named value, here (?<value>[^=]*), so it will hold the text between the two == or === markers.
Now let's see how it works:
String input = "===foo=== ==bar== ==baz== ===bam===";
Map<String, String> replacements = new HashMap<>();
replacements.put("===", "<subheading>${value}</subheading>");
replacements.put("==", "<heading>${value}</heading>");
Pattern p = Pattern.compile("(?<eqAmount>={2,3})(?<value>[^=]*)\\k<eqAmount>");
StringBuffer sb = new StringBuffer();
Matcher m = p.matcher(input);
while (m.find()) {
m.appendReplacement(sb, replacements.get(m.group("eqAmount")));
}
m.appendTail(sb);
String result = sb.toString();
System.out.println(result);
Output: <subheading>foo</subheading> <heading>bar</heading> <heading>baz</heading> <subheading>bam</subheading>
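On Java 9 or later, the same single-pass idea can probably be written more compactly with Matcher.replaceAll(Function). This is only a sketch under that version assumption; the class name HeadingReplacer and the tag map are made up for illustration, and numbered groups are used instead of the named ones above:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadingReplacer {
    // Group 1: the run of '=' (two or three); group 2: the enclosed text.
    private static final Pattern MARKUP = Pattern.compile("(={2,3})([^=]*)\\1");

    // Maps the delimiter that was found to the tag name to emit.
    private static final Map<String, String> TAGS = new HashMap<>();
    static {
        TAGS.put("==", "heading");
        TAGS.put("===", "subheading");
    }

    static String replace(String input) {
        // Matcher.replaceAll(Function) requires Java 9 or later.
        return MARKUP.matcher(input).replaceAll(mr -> {
            String tag = TAGS.get(mr.group(1));
            // quoteReplacement keeps any '$' or '\' in the enclosed text literal.
            return Matcher.quoteReplacement("<" + tag + ">" + mr.group(2) + "</" + tag + ">");
        });
    }

    public static void main(String[] args) {
        System.out.println(replace("===foo=== ==bar== ==baz== ===bam==="));
    }
}
```

The string is still scanned only once; the lambda just decides the replacement per match.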
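Try this, which uses lookarounds so that ===<value>=== is left untouched:
public static void main(final String[] args) {
String value = "==Hello==";
// (?<!=) and (?!=) reject a match that is preceded or followed by another '='
value = value.replaceAll("(?<!=)==([^=]+)==(?!=)", "<Heading>$1</Heading>");
System.out.println(value);
}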

Get all matches within a string using compile and regex

I'm trying to get all matches which start with _ and end with = from a URL which looks like
?_field1=param1,param2,paramX&_field2=param1,param2,paramX
In that case I'm looking for any instance of _fieldX=
A method which I use to get it looks like
public static List<String> getAllMatches(String url, String regex) {
List<String> matches = new ArrayList<String>();
Matcher m = Pattern.compile("(?=(" + regex + "))").matcher(url);
while(m.find()) {
matches.add(m.group(1));
}
return matches;
}
called as
List<String> fieldsList = getAllMatches(url, "_.=");
but somehow it is not finding what I expected.
Any suggestions what I have missed?
A regex like (?=(_.=)) finds all (possibly overlapping) matches that start with _, then have exactly 1 char (other than a line break char), and then =. Since your field names are longer than one char, it finds nothing.
You need no overlapping matches in the context of the string you provided.
You may just use a lazy dot matching pattern, _(.*?)=. Alternatively, you may use a negated character class based regex: _([^=]+)= (it will capture into Group 1 any one or more chars other than = symbol).
Since you are passing a regex to the method, it seems you want a generic function.
If so, you may use this method:
public static List<String> getAllMatches(String url, String start, String end) {
List<String> matches = new ArrayList<String>();
Matcher m = Pattern.compile(start + "(.*?)" + end).matcher(url);
while(m.find()) {
matches.add(m.group(1));
}
return matches;
}
and call it as:
List<String> fieldsList = getAllMatches(url, "_", "=");
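One caveat with the generic method above: start and end are inserted into the regex as-is, so delimiters such as ? or . would be treated as regex syntax. A possible variant, assuming the delimiters are always meant literally (the class name DelimitedMatches is just for this sketch), wraps them in Pattern.quote:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimitedMatches {
    // Returns the text between each start/end delimiter pair, treating both delimiters literally.
    static List<String> getAllMatches(String text, String start, String end) {
        List<String> matches = new ArrayList<>();
        Pattern p = Pattern.compile(Pattern.quote(start) + "(.*?)" + Pattern.quote(end));
        Matcher m = p.matcher(text);
        while (m.find()) {
            matches.add(m.group(1));
        }
        return matches;
    }

    public static void main(String[] args) {
        String url = "?_field1=param1,param2,paramX&_field2=param1,param2,paramX";
        System.out.println(getAllMatches(url, "_", "=")); // [field1, field2]
    }
}
```

Note that group 1 holds only the text between the delimiters; use m.group() instead if you want _field1= including them.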

Java how to check multiple regex patterns against an input?

(If I'm taking the complete wrong direction let me know if there is a better way I should be approaching this)
I have a Java program that will have multiple patterns that I want to compare against an input. If one of the patterns matches then I want to save that value in a String. I can get it to work with a single pattern but I'd like to be able to check against many.
Right now I have this to check if an input matches one pattern:
Pattern pattern = Pattern.compile("TST\\w{1,}");
Matcher match = pattern.matcher(input);
String ID = match.find()?match.group():null;
So, if the input was TST1234 or abcTST1234 then ID = "TST1234"
I want to have multiple patterns like:
Pattern pattern = Pattern.compile("TST\\w{1,}");
Pattern pattern2 = Pattern.compile("TWT\\w{1,}");
...
and then add them to a collection and check each one against the input:
List<Pattern> rxs = new ArrayList<Pattern>();
rxs.add(pattern);
rxs.add(pattern2);
String ID = null;
for (Pattern rx : rxs) {
if (rx.matcher(requestEnt).matches()){
ID = //???
}
}
I'm not sure how to set ID to what I want. I've tried
ID = rx.matcher(requestEnt).group();
and
ID = rx.matcher(requestEnt).find()?rx.matcher(requestEnt).group():null;
Not really sure how to make this work or where to go from here though. Any help or suggestions are appreciated. Thanks.
EDIT: Yes, the patterns will change over time, so the pattern list will grow.
I just need to get the string of the match...ie if the input is abcTWT123 it will first check against "TST\w{1,}", then move on to "TWT\w{1,}" and since that matches the ID String will be set to "TWT123".
To collect the matched string in the result you may need to create a group in your regexp if you are matching less than the entire string:
List<Pattern> patterns = new ArrayList<>();
patterns.add(Pattern.compile("(TST\\w+)"));
...
Optional<String> result = Optional.empty();
for (Pattern pattern: patterns) {
Matcher matcher = pattern.matcher(input);
if (matcher.matches()) {
result = Optional.of(matcher.group(1));
break;
}
}
Or, if you are familiar with streams:
Optional<String> result = patterns.stream()
.map(p -> p.matcher(input)).filter(Matcher::matches)
.map(m -> m.group(1)).findFirst();
The alternative is to use find (as in @Raffaele's answer) that implicitly creates a group.
Another alternative you may want to consider is to put all your matches into a single pattern.
Pattern pattern = Pattern.compile("(TST\\w+|TWT\\w+|...)");
Then you can match and group in a single operation. However, this might make it harder to change the matches over time.
Group 1 is the first matched group (i.e. the match inside the first set of parentheses). Group 0 is the entire match. So if you want the entire match (I wasn't sure from your question) then you could perhaps use group 0.
Use an alternation | (a regex OR):
Pattern pattern = Pattern.compile("TST\\w+|TWT\\w+|etc");
Then just check the pattern once.
Note also that {1,} can be replaced with +.
Maybe you just need to end the loop when the first pattern matches:
// TST\\w{1,}
// TWT\\w{1,}
private List<Pattern> patterns;
public String findIdOrNull(String input) {
for (Pattern p : patterns) {
Matcher m = p.matcher(input);
// First match. If the whole string must match use .matches()
if (m.find()) {
return m.group(0);
}
}
return null; // Or throw an Exception if this should never happen
}
If your patterns are all going to be simple prefixes like your examples TST and TWT, you can define all of those at once and use regex alternation | so you won't need to loop over the patterns.
An example:
String prefixes = "TWT|TST|WHW";
String regex = "(" + prefixes + ")\\w+";
Pattern pattern = Pattern.compile(regex);
String input = "abcTST123";
Matcher match = pattern.matcher(input);
String ID = match.find() ? match.group() : null;
// given this, ID will come out as "TST123"
Now prefixes could be read in from a java .properties file, or a simple text file; or passed as a parameter to the method that does this.
You could also define the prefixes as a comma-separated list or one-per-line in a file then process that to turn them into one|two|three|etc before passing it on.
You may be looping over several inputs, and then you would want to create the regex and pattern variables only once, creating only the Matcher for each separate input.
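To illustrate that last point, here is a minimal sketch that builds the alternation once from a list of prefixes and then reuses the compiled Pattern for every input. The names IdExtractor, compilePrefixes and findIdOrNull are made up for this example, and it assumes the prefixes are literal strings (hence Pattern.quote):

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class IdExtractor {
    // Builds "(?:prefix1|prefix2|...)\w+" once; each prefix is quoted so it is matched literally.
    static Pattern compilePrefixes(List<String> prefixes) {
        String alternation = prefixes.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("(?:" + alternation + ")\\w+");
    }

    // Returns the first match anywhere in the input, or null if none of the prefixes is found.
    static String findIdOrNull(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The Pattern is compiled once, outside the loop.
        Pattern pattern = compilePrefixes(Arrays.asList("TST", "TWT"));
        for (String input : Arrays.asList("abcTST123", "abcTWT123", "nothing")) {
            // Only a new Matcher is created per input.
            System.out.println(input + " -> " + findIdOrNull(pattern, input));
        }
    }
}
```

When the prefix list grows, only the list passed to compilePrefixes changes; the matching code stays the same.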

Splitting line based on comma, strange line

I have the following comma-separated line:
LanguageID=0,LastKnownPeriod="Active",c_MultiPartyCall={Counter=1,TimeStamp=1394539271448},LTH={Data=["1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||","1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||"}
Using the split method, I can get the comma-separated values, but the actual problem comes with text like c_MultiPartyCall={Counter=1,TimeStamp=1394539271448}, since it contains a comma itself.
So the parts after splitting should be:
LanguageID=0
LastKnownPeriod="Active"
c_MultiPartyCall={Counter=1,TimeStamp=1394539271448} (comma is again found within the word)
LTH={Data=["1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||","1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||"} (comma is again found within the word in curly brackets)
I tried with following code but didn't work:
String arr[]=input_line.split("(.*!{),(.*!})");
for (int i=0;i<arr.length;i++)
System.out.println(arr[i]);
Please advise.
Use regular expressions instead:
([\w_]+=(?:\{[\w=_,\[\]"\|:\.\s-]*\}))|([^,]+)
This will group the line into 4 sections:
LanguageID=0
LastKnownPeriod="Active"
c_MultiPartyCall={Counter=1,TimeStamp=1394539271448}
LTH={Data=["1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||","1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||"}
Code:
import java.util.regex.*;
public class JavaRegEx {
public static void main(String[] args) {
String line = "LanguageID=0,LastKnownPeriod=\"Active\",c_MultiPartyCall={Counter=1,TimeStamp=1394539271448},LTH={Data=[\"1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||\",\"1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||\"}";
Pattern pattern = Pattern.compile("([\\w_]+=(?:\\{[\\w=_,\\[\\]\"\\|:\\.\\s-]*\\}))|([^,]+)");
Matcher matcher = pattern.matcher(line);
while(matcher.find())
System.out.println(matcher.group(0));
}
}
First, just splitting on a comma isn't how CSV works:
a,b,"c,d"
has only three values, a, b, and c,d. I recommend using a CSV parser, like opencsv. CSV is not terribly complicated, but it isn't as simple as split by comma.
Second, your CSV data is invalid because you have a quote and a comma in a field that isn't quoted.
In other words, if you want the values a, b","c, then the CSV is
a,"b"",""c"
(Note that quotes inside a quoted field are escaped by doubling them.)
Otherwise, it is impossible to tell what fields you actually wanted. A CSV parser would choke on your data.
While it might be possible to do this by split(), it's much easier to match the actual tokens (where split() matches the delimiters between the tokens). Your tokens all consist of one or more of any characters other than comma or brace, optionally followed by a pair of braces enclosing some non-brace characters (which can include commas):
[^,{}]+(?:\{[^{}]+\})?
The Java code for that would be:
List<String> matchList = new ArrayList<String>();
Pattern p = Pattern.compile("[^,{}]+(?:\\{[^{}]+\\})?");
Matcher m = p.matcher(s);
while (m.find()) {
matchList.add(m.group());
}
But it looks like you can break it down further:
Pattern p = Pattern.compile("(\\w+)=([^,{}]+|\\{[^{}]+\\})");
Matcher m = p.matcher(TEST_STR);
while (m.find()) {
System.out.printf("%nname = %s%nvalue = %s%n",
m.group(1), m.group(2));
}
output:
name = LanguageID
value = 0
name = LastKnownPeriod
value = "Active"
name = c_MultiPartyCall
value = {Counter=1,TimeStamp=1394539271448}
name = LTH
value = {Data=["1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||","1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||"}
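For completeness, the split() route mentioned above can be sketched with a negative lookahead: split on a comma only if the next brace after it is not a closing one. This assumes braces are never nested and always balanced; the class name BraceAwareSplit and the shortened sample line are made up for this example:

```java
import java.util.Arrays;

class BraceAwareSplit {
    // A comma is a delimiter unless it is followed by some non-brace characters and then '}',
    // which would mean the comma sits inside a {...} block.
    static String[] split(String line) {
        return line.split(",(?![^{}]*\\})");
    }

    public static void main(String[] args) {
        // Shortened, hypothetical version of the line from the question.
        String line = "LanguageID=0,LastKnownPeriod=\"Active\","
                + "c_MultiPartyCall={Counter=1,TimeStamp=1394539271448},"
                + "LTH={Data=[\"a|b\",\"c|d\"]}";
        for (String part : split(line)) {
            System.out.println(part);
        }
    }
}
```

The lookahead rescans the rest of the line for each comma, so on very long lines the token-matching approach is likely the better choice.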

Retrieve code between two tags

Hi guys, I'm trying to retrieve the link between these two tags, e.g. <Link>text here</Link>, and then store it in a list. How do I retrieve that text with Pattern and Matcher?
public void getlinks() {
Pattern Start = Pattern.compile(this.PatternStart); //<Link>
Pattern End = Pattern.compile(this.PatternEnd); //</Link>
Matcher mStart = Start.matcher(this.Source); // matches Start
Matcher mEnd = End.matcher(this.Source); // matches end
????????????
}
I'm trying to find the link between <Link> and </Link> inside an HTML source; this is just an example.
In general you do it like this:
public static List<String> getLinks(String text) {
Matcher matcher = Pattern.compile("<tagstart>(.*?)<tagend>").matcher(text);
List<String> linkList = new ArrayList<String>();
while (matcher.find()) {
linkList.add(matcher.group(1));
}
return linkList;
}
where <tagstart> and <tagend> are your starting and ending tags. The matcher.group(1) gives you everything between the tags, where matcher.group() or matcher.group(0) would give you the tags too.
Note that it is important to use the (.*?) if you have a text with multiple tag pairs, otherwise this will match the first <tagstart> and the last <tagend> and return everything in between.
An example usage would be:
System.out.println(getLinks("<tagstart>beer<tagend><tagstart>juice<tagend>"));
which prints
[beer, juice]
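To see why the lazy (.*?) matters, here is a small sketch that runs the same text through a lazy and a greedy version of the pattern. The class name GreedyVsLazy and the helper extract are made up for this comparison:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Collects group 1 of every match of the given regex in the text.
    static List<String> extract(String text, String regex) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "<tagstart>beer<tagend><tagstart>juice<tagend>";
        // Lazy: stops at the first <tagend> after each <tagstart>.
        System.out.println(extract(text, "<tagstart>(.*?)<tagend>")); // [beer, juice]
        // Greedy: runs to the last <tagend>, swallowing the tags in between.
        System.out.println(extract(text, "<tagstart>(.*)<tagend>"));  // [beer<tagend><tagstart>juice]
    }
}
```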
