Using regex to select 3 groups from a string

String s = #Section250342,Main,First/HS/12345/Jack/M,200010 10.00 200011 -2.00,
#Section250322,Main,First/HS/12345/Aaron/N,200010 17.00,
#Section250399,Main,First/HS/12345/Jimmy/N,200010 12.00,
#Section251234,Main,First/HS/12345/Jack/M,200011 11.00
Wherever the string contains /Jack/M, I want to pull out the section numbers (250342, 251234), the dates (200010, 200011) and the values (10.00, 11.00, -2.00) associated with it using regex. Sometimes a single line contains one value and sometimes two, which is what makes the regex confusing. So in the end there are 3 different groups we want to extract.
I tried
#Section(\d+)(?:(?!#Section\d).)*\bJack/M,(\d+)\h+(\d+(?:\.\d+)?)\s(\d+)\h+([-+]?\d+(?:\.\d+)?)\b
See it in action here: https://regex101.com/r/JaKeGg/1. It yields 5 groups instead of 3, and it does not seem to match when a line has only one value, so I need help with this.

You can use a pattern with 2 capture groups, and then post-process the group 2 value to separate the numbers that belong together.
As the dates and the values in the example strings seem to come in pairs, you can split the group 2 value from the regex on whitespace and build 2 lists, using the modulo operator to separate the even and odd positions.
#Section(\d+)\b(?:(?!#Section\d).)*\bJack/M,(\d+\h+[-+]?\d+(?:\.\d+)?(?:\s+\d+\h+[-+]?\d+(?:\.\d+)?)*)
String regex = "#Section(\\d+)\\b(?:(?!#Section\\d).)*\\bJack/M,(\\d+\\h+[-+]?\\d+(?:\\.\\d+)?(?:\\s+\\d+\\h+[-+]?\\d+(?:\\.\\d+)?)*)";
String string = "#Section250342,Main,First/HS/12345/Jack/M,200010 10.00 200011 -2.00,\n"
+ "#Section250322,Main,First/HS/12345/Aaron/N,200010 17.00,\n"
+ "#Section250399,Main,First/HS/12345/Jimmy/N,200010 12.00,\n"
+ "#Section251234,Main,First/HS/12345/Jack/M,200011 11.00";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
    // Fresh lists per match, so each section prints its own dates and values
    List<String> group2 = new ArrayList<>();
    List<String> group3 = new ArrayList<>();
    System.out.println("Group 1: " + matcher.group(1));
    String[] parts = matcher.group(2).split("\\s+");
    for (int i = 0; i < parts.length; i++) {
        if (i % 2 == 0) {
            group2.add(parts[i]); // even positions are dates
        } else {
            group3.add(parts[i]); // odd positions are values
        }
    }
    System.out.println("Group 2: " + Arrays.toString(group2.toArray()));
    System.out.println("Group 3: " + Arrays.toString(group3.toArray()));
}
Output
Group 1: 250342
Group 2: [200010, 200011]
Group 3: [10.00, -2.00]
Group 1: 251234
Group 2: [200011]
Group 3: [11.00]
If you want to group all values, you can create 3 lists before the loop and print all 3 lists after the loop.
List<String> group1 = new ArrayList<>();
List<String> group2 = new ArrayList<>();
List<String> group3 = new ArrayList<>();
while (matcher.find()) {
group1.add(matcher.group(1));
String[] parts = matcher.group(2).split("\\s+");
for (int i = 0; i < parts.length; i++) {
if (i % 2 == 0) {
group2.add(parts[i]);
} else {
group3.add(parts[i]);
}
}
}
System.out.println("Group 1: " + Arrays.toString(group1.toArray()));
System.out.println("Group 2: " + Arrays.toString(group2.toArray()));
System.out.println("Group 3: " + Arrays.toString(group3.toArray()));
Output
Group 1: [250342, 251234]
Group 2: [200010, 200011, 200011]
Group 3: [10.00, -2.00, 11.00]

I think it is quite difficult to accomplish what you want using solely regex. According to another SO question you can't have multiple matches for the same capturing group in your regex. Instead only the last matching pattern will actually be captured.
My suggestion is to split your string by line in java, iterate through the lines, check if a line contains the substring you search for "Jack/M", and then use regex to extract the different bits by searching for simpler regex pattern instead of trying to match one long regex to the whole string.
A good walk through on how to find matches for a regex in a string: https://www.tutorialspoint.com/getting-the-list-of-all-the-matches-java-regular-expressions
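The line-by-line approach described above could be sketched roughly as follows. This is only an illustration under assumptions: the class name JackSectionExtractor, the two helper patterns, and the choice to scan only the text after the marker are not from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
class JackSectionExtractor {
    // Section number directly after "#Section".
    private static final Pattern SECTION = Pattern.compile("#Section(\\d+)");
    // One "date value" pair, e.g. "200011 -2.00".
    private static final Pattern PAIR = Pattern.compile("(\\d+)\\h+([-+]?\\d+(?:\\.\\d+)?)");

    final List<String> sections = new ArrayList<>();
    final List<String> dates = new ArrayList<>();
    final List<String> values = new ArrayList<>();

    static JackSectionExtractor extract(String input, String marker) {
        JackSectionExtractor result = new JackSectionExtractor();
        for (String line : input.split("\\R")) {       // split on any line break
            int markerPos = line.indexOf(marker);
            if (markerPos < 0) {
                continue;                               // line is not about the marker
            }
            Matcher section = SECTION.matcher(line);
            if (section.find()) {
                result.sections.add(section.group(1));
            }
            // Scan only the text after the marker, so earlier digits are ignored.
            Matcher pair = PAIR.matcher(line.substring(markerPos + marker.length()));
            while (pair.find()) {
                result.dates.add(pair.group(1));
                result.values.add(pair.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "#Section250342,Main,First/HS/12345/Jack/M,200010 10.00 200011 -2.00,\n"
                + "#Section250322,Main,First/HS/12345/Aaron/N,200010 17.00,\n"
                + "#Section250399,Main,First/HS/12345/Jimmy/N,200010 12.00,\n"
                + "#Section251234,Main,First/HS/12345/Jack/M,200011 11.00";
        JackSectionExtractor r = extract(s, "/Jack/M");
        System.out.println(r.sections + " " + r.dates + " " + r.values);
    }
}
```

Each small pattern only has to describe one thing, which tends to be easier to debug than one long expression.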

Related

How to know which part of regex matched?

regex= (i.*d.*n.*t.*)|(p.*r.*o.*f.*)|(u.*s.*r.*)
string to be matched= profile
Now the regex will match with the string. But I want to know which part matched.
Meaning, I want (p.*r.*o.*f.*) as the output.
How can I do this in Java?

You can check which group matched:
Pattern p = Pattern.compile("(i.*d.*n.*t.*)|(p.*r.*o.*f.*)|(u.*s.*r.*)");
Matcher m = p.matcher("profile");
m.find();
for (int i = 1; i <= m.groupCount(); i++) {
System.out.println(i + ": " + m.group(i));
}
Will output:
1: null
2: profile
3: null
Because the second line is not null, it's (p.*r.*o.*f.*) that matched the string.
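If this check is needed in more than one place, it could be wrapped in a small helper. The following is a minimal sketch; the class and method names are made up for illustration, and it assumes each alternative is wrapped in its own top-level capturing group, as in the pattern above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of java.util.regex.
class MatchedAlternative {
    /** Returns the 1-based index of the first group that took part in the match, or -1. */
    static int matchedGroupIndex(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        if (!m.find()) {
            return -1;                      // nothing matched at all
        }
        for (int i = 1; i <= m.groupCount(); i++) {
            if (m.group(i) != null) {       // null means this group did not participate
                return i;
            }
        }
        return -1;                          // pattern has no capturing groups
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(i.*d.*n.*t.*)|(p.*r.*o.*f.*)|(u.*s.*r.*)");
        System.out.println(matchedGroupIndex(p, "profile"));
    }
}
```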

In your case, it seems you can distinguish those subpatterns by their first letter. If the first letter of the match is 'p', then it is your desired pattern. Maybe you can write a simple function to distinguish these.

Search substring in a string using regex

I'm trying to search for a set of words, contained in an ArrayList (terms_1pers), inside a string. Since the precondition is that there should be no letters directly before and after the search word, I thought of using a regular expression.
I just don't know what I'm doing wrong with the matches method. In the code below, if no match is found, it writes to an external file.
String url = csvRecord.get("url");
String text = csvRecord.get("review");
String var = null;
for(String term : terms_1pers)
{
if(!text.matches("[^a-z]"+term+"[^a-z]"))
{
var="true";
}
}
if(!var.equals("true"))
{
bw.write(url+";"+text+"\n");
}

In order to find regex matches, you should use the regex classes Pattern and Matcher.
String term = "term";
ArrayList<String> a = new ArrayList<String>();
a.add("123term456"); //true
a.add("A123Term5"); //false
a.add("term456"); //true
a.add("123term"); //true
Pattern p = Pattern.compile("^[^A-Za-z]*(" + term + ")[^A-Za-z]*$");
for (String text : a) {
    Matcher m = p.matcher(text);
    if (m.find()) {
        System.out.println("Found: " + m.group(1));
        // the term is wrapped in the only capturing group, so it is group(1)
    }
    else System.out.println("No match for: " + term);
}
In the example above, we create an instance of Pattern (https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) to find matches in the text you are matching against.
Note that I adjusted the regex a bit. The character class in this code excludes all letters A-Z, upper and lower case, from the parts before and after the term. It also allows for situations where there are no characters at all before or after the term. If you need to have something there, use + instead of *. I also anchored the regex with ^ and $ so that the whole text must consist of only these three parts (prefix, term, suffix). If this doesn't fit your use case, you may need to adjust.
To demonstrate using this with a variety of different terms:
ArrayList<String> terms = new ArrayList<String>();
terms.add("term");
terms.add("the book is on the table");
terms.add("1981 was the best year ever!");
ArrayList<String> a = new ArrayList<String>();
a.add("123term456");
a.add("A123Term5");
a.add("the book is on the table456");
a.add("1##!231981 was the best year ever!9#");
for (String term : terms) {
    Pattern p = Pattern.compile("^[^A-Za-z]*(" + term + ")[^A-Za-z]*$");
    for (String text : a) {
        Matcher m = p.matcher(text);
        if (m.find()) {
            System.out.println("Found: " + m.group(1) + " in " + text);
            // the term is wrapped in the only capturing group, so it is group(1)
        }
        else System.out.println("No match for: " + term + " in " + text);
    }
}
Output for this is:
Found: term in 123term456
No match for: term in A123Term5
No match for: term in the book is on the table456....
In response to the question about making String term case insensitive, here's a way to build the regex string by using java.lang.Character to produce alternatives for the upper and lower case form of each letter.
String term = "This iS the teRm.";
String matchText = "123This is the term.";
StringBuilder str = new StringBuilder();
str.append("^[^A-Za-z]*(");
for (int i = 0; i < term.length(); i++) {
char c = term.charAt(i);
if (Character.isLetter(c))
str.append("(" + Character.toLowerCase(c) + "|" + Character.toUpperCase(c) + ")");
else str.append(c);
}
str.append(")[^A-Za-z]*$");
System.out.println(str.toString());
Pattern p = Pattern.compile(str.toString());
Matcher m = p.matcher(matchText);
if (m.find()) System.out.println("Found!");
else System.out.println("Not Found!");
This code outputs two lines. The first line is the regex string that's being compiled in the Pattern: "^[^A-Za-z]*((t|T)(h|H)(i|I)(s|S) (i|I)(s|S) (t|T)(h|H)(e|E) (t|T)(e|E)(r|R)(m|M).)[^A-Za-z]*$". This adjusted regex allows the letters in the term to be matched regardless of case. The second output line is "Found!" because the mixed case term is found within matchText.
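As a possible alternative to building the alternation by hand, the standard Pattern.CASE_INSENSITIVE flag can be combined with Pattern.quote. The sketch below is not the code from the answer above; the class and method names are illustrative. Quoting also makes the trailing "." in the term match a literal dot instead of any character.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class CaseInsensitiveTerm {
    static boolean containsTerm(String term, String text) {
        // Pattern.quote treats the term as a literal, so "." or "(" in it are not regex syntax.
        Pattern p = Pattern.compile(
                "^[^A-Za-z]*(" + Pattern.quote(term) + ")[^A-Za-z]*$",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsTerm("This iS the teRm.", "123This is the term."));
        System.out.println(containsTerm("This iS the teRm.", "123This is the termX"));
    }
}
```

Note that Pattern.CASE_INSENSITIVE on its own only covers US-ASCII letters; for other alphabets it would need to be combined with Pattern.UNICODE_CASE.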

There are several things to note:
matches requires a full string match, so [^a-z]term[^a-z] will only match a string like :term. (exactly one non-letter on each side). You need to use .find() to find partial matches.
If you pass a literal string to a regex, you need to Pattern.quote it; otherwise, if it contains special chars, it may not be matched as intended.
To check if a word has some pattern before or after or at the start/end, you should either use alternations with anchors (like (?:^|[^a-z]) or (?:$|[^a-z])) or lookarounds, (?<![a-z]) and (?![a-z]).
To match any ASCII letter just use \p{Alpha} or, if you plan to match any Unicode letter, \p{L}.
It is more logical to make the var variable a Boolean.
Fixed code:
String url = csvRecord.get("url");
String text = csvRecord.get("review");
Boolean var = false;
for(String term : terms_1pers)
{
Matcher m = Pattern.compile("(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})").matcher(text);
// If the search must be case insensitive use
// Matcher m = Pattern.compile("(?i)(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})").matcher(text);
if(!m.find())
{
var = true;
}
}
if (!var) {
bw.write(url+";"+text+"\n");
}
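To see the difference between matches() and find() with lookarounds in isolation, here is a small self-contained sketch. The class name WholeWordSearch and the sample strings are assumptions made for the illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample strings are illustrative only.
class WholeWordSearch {
    /** True if term occurs in text with no letter directly before or after it. */
    static boolean containsWord(String text, String term) {
        return Pattern.compile("(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})")
                .matcher(text)
                .find();                     // find() looks for a partial match
    }

    public static void main(String[] args) {
        // matches() must consume the whole string, so this is false
        System.out.println("I love it".matches("[^a-z]love[^a-z]"));
        // find() with lookarounds locates the word inside the text
        System.out.println(containsWord("I love it", "love"));
        // "love" inside "gloves" is surrounded by letters, so it is rejected
        System.out.println(containsWord("gloves", "love"));
    }
}
```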

You did not consider the case where the text before and after the term may contain letters,
so adding .* at the front and end should solve your problem.
for(String term : terms_1pers)
{
    if( text.matches(".*[^a-zA-Z]+" + term + "[^a-zA-Z]+.*") )
    {
        var="true";
        break; //exit the loop
    }
}
if(!var.equals("true"))
{
    bw.write(url+";"+text+"\n");
}

Need to get parameters alone from query as arraylist return parameter

I have written a program that takes a query as input and should return only the parameters as an ArrayList (using Pattern.split).
However, the output also contains the surrounding ( and ) and other parts of the query.
Input
SELECT * FROM some_table
WHERE some_column1 = ‘%(some_parameter_1)%’ and
some_column2=’%(some_parameter_2)%’ and
some_column3=’%(some_parameter_3)%’;
Output
An array list containing the following elements in it:
some_parameter_1
some_parameter_2
some_parameter_3
String patternString = "%";
Pattern pattern = Pattern.compile(patternString);
String[] split = pattern.split(query);
System.out.println("split.length = " + split.length);
for(String element : split){
System.out.println("element = " + element);
a1.add(element);
}
int n = a1.size();
for(int j =1;j <= n; j=j+2){
params.add(a1.get(j));
}
System.out.println("\n List of Parameters "+params);
/* for(int j =1;j <= 7; j=+1)
System.out.println(a1.get(j)); */
return params;
}
How to use match.result in regex? It does not seem to have any effect. Is there any other way to solve it?
I need only the parameters enclosed in %(some-param1)%, returned as an ArrayList.
Thanks in advance.

Your pattern, as well as your approach, will not lead you to a solution: split cuts the input at every occurrence of the pattern, whereas you want to match the pattern and extract the matched text.
For this, we can use the Pattern and Matcher classes provided in Java. The pattern I have used is (%\\()(.*?)(\\)%). Note that \\ has been added for escaping.
It has three sections:
1) (%\\() matches the literal %(
2) (.*?) lazily matches the characters that follow, as few as possible
3) (\\)%) matches the closing )%
Below is the sample working code for your example. Because we want to extract what is between %( and )%, which is the second part, I have used group(2).
So matcher.group(1) will match %(, matcher.group(2) will match some_parameter_1 and matcher.group(3) will match )%. Calling find() in a loop parses the whole string and collects all three parameters in an ArrayList.
public static void main(String[] args) {
String input = "SELECT * FROM some_table WHERE some_column1 = ‘%(some_parameter_1)%’ and some_column2=’%(some_parameter_2)%’ "
+ "and some_column3=’%(some_parameter_3)%’;";
Pattern pattern = Pattern.compile("(%\\()(.*?)(\\)%)");
Matcher matcher = pattern.matcher(input);
List<String> parameterList = new ArrayList<>();
while (matcher.find()) {
parameterList.add(matcher.group(2));
}
System.out.println(parameterList);
}
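Since the question asks for a method that returns the list, the same idea could be packaged as below. This is a sketch, not the answer's exact code: the class and method names are hypothetical, and it uses a single capturing group because only the text between %( and )% is needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class QueryParameters {
    // Only the text between "%(" and ")%" is captured, as group(1).
    private static final Pattern PARAM = Pattern.compile("%\\((.*?)\\)%");

    static List<String> extract(String query) {
        List<String> params = new ArrayList<>();
        Matcher matcher = PARAM.matcher(query);
        while (matcher.find()) {
            params.add(matcher.group(1));
        }
        return params;
    }

    public static void main(String[] args) {
        String query = "SELECT * FROM some_table WHERE some_column1 = '%(some_parameter_1)%' "
                + "and some_column2='%(some_parameter_2)%' and some_column3='%(some_parameter_3)%';";
        System.out.println(extract(query));
    }
}
```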

Matcher.group() not returning correct value when more than one pattern is combined

I have the following code.
public class Test {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("Group1 (.*), Group2=(\\[(.*)\\]|null) ,Group3=\\[(.*)\\] ,Group4=\\[(.*)\\]");
String string = "Group1 12345, Group2=null ,Group3=[group3] ,Group4=[group4]";
Matcher matcher = pattern.matcher(string);
matcher.find();
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println(i + ": " +matcher.group(i));
}
System.out.println();
string = "Group1 12345, Group2=[group2] ,Group3=[group3] ,Group4=[group4]";
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println(i + ": " +matcher.group(i));
}
}
}
Output given by the above code:
1: 12345
2: null
3: null
4: group3
5: group4
1: 12345
2: null
3: null
4: group3
5: group4
Question 1: Why am I getting the groupCount as 5? Is it due to multiple regex patterns combined (at Group2)?
Question 2: I expect the output be
12345
null
group3
group4
12345
group2
group3
group4
What should I do to print the output in my expected way.
Please help me understand the program correctly. Thanks

Why 5 groups?
Group1 (.*), Group2=(\\[(.*)\\]|null) ,Group3=\\[(.*)\\] ,Group4=\\[(.*)\\]
       ^            ^   ^                        ^                  ^
       1            2   3                        4                  5
Basically, you just need to count the number of opening parentheses.
So that should explain your first output.
As for the second output, your matcher is still pointing to the first string. So you need to include:
string = "Group1 12345, Group2=[group2] ,Group3=[group3] ,Group4=[group4]";
matcher = pattern.matcher(string);
matcher.find();
before the last loop.
Finally, to get the expected output, I would simply use this:
Pattern.compile("Group1 (.*), Group2=\\[?(.*?)\\]? ,Group3=\\[(.*)\\] ,Group4=\\[(.*)\\]");
which is reasonably simple but loses the fact that Group2 needs brackets for non-null values. If you want to keep that condition, you will need to introduce a check like if (matcher.group(3).isEmpty()) { ... }.
Pattern explanation for group 2:
\\[? There may be an opening bracket or not, don't capture it
(.*?) Capture what's after "Group2=", excluding the brackets
\\]? There may be a closing bracket or not, don't capture it
Note, the ? in (.*?) is a lazy operator and is there to avoid capturing the closing bracket when there is one.
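Putting both fixes together (a fresh matcher for each input and the simplified Group2 part), a runnable version might look like this. The class name GroupDemo and the helper method are assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class GroupDemo {
    private static final Pattern PATTERN = Pattern.compile(
            "Group1 (.*), Group2=\\[?(.*?)\\]? ,Group3=\\[(.*)\\] ,Group4=\\[(.*)\\]");

    static List<String> groups(String input) {
        // A Matcher is bound to one input, so create a new one for every string.
        Matcher matcher = PATTERN.matcher(input);
        List<String> result = new ArrayList<>();
        if (matcher.find()) {
            for (int i = 1; i <= matcher.groupCount(); i++) {
                result.add(matcher.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(groups("Group1 12345, Group2=null ,Group3=[group3] ,Group4=[group4]"));
        System.out.println(groups("Group1 12345, Group2=[group2] ,Group3=[group3] ,Group4=[group4]"));
    }
}
```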

Two capturing groups correspond to your Group2 label:
(\\[(.*)\\]|null)
^---------------^
    ^--^
You could use a non-capturing group for the inner one:
(\\[(?:.*)\\]|null)
Or in this specific case, since the group seems useless (not used for later reference nor for applying a modifier to a group of tokens), you should just remove it:
(\\[.*\\]|null)
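The effect on the group count can be checked directly with Matcher.groupCount(). The following is a small sketch with an illustrative class name; it only counts groups and does not match any input.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class GroupCountDemo {
    /** Number of capturing groups declared in the given regex. */
    static int count(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        String tail = " ,Group3=\\[(.*)\\] ,Group4=\\[(.*)\\]";
        // original: inner (.*) is a capturing group of its own
        System.out.println(count("Group1 (.*), Group2=(\\[(.*)\\]|null)" + tail));
        // inner group made non-capturing
        System.out.println(count("Group1 (.*), Group2=(\\[(?:.*)\\]|null)" + tail));
        // inner group removed
        System.out.println(count("Group1 (.*), Group2=(\\[.*\\]|null)" + tail));
    }
}
```

With either change, group(2) still includes the brackets for non-null values, for example [group2].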

I need to get a substring from a java string Tokenizer

I need to get substrings from the tokens of a Java StringTokenizer.
My input string is: Pizza-1*Nutella-20*Chicken-65*
StringTokenizer productsTokenizer = new StringTokenizer("Pizza-1*Nutella-20*Chicken-65*", "*");
do
{
try
{
int pos = productsTokenizer .nextToken().indexOf("-");
String product = productsTokenizer .nextToken().substring(0, pos+1);
String count= productsTokenizer .nextToken().substring(pos, pos+1);
System.out.println(product + " " + count);
}
catch(Exception e)
{
}
}
while(productsTokenizer .hasMoreTokens());
My output must be:
Pizza 1
Nutella 20
Chicken 65
I need the product value and the count value in separate variables so that I can insert those values into the database.
I hope you can help me.

You could use String.split() as
String[] products = "Pizza-1*Nutella-20*Chicken-65*".split("\\*");
for (String product : products) {
String[] prodNameCount = product.split("\\-");
System.out.println(prodNameCount[0] + " " + prodNameCount[1]);
}
Output
Pizza 1
Nutella 20
Chicken 65

You invoke the nextToken() method 3 times. That will get you 3 different tokens:
int pos = productsTokenizer .nextToken().indexOf("-");
String product = productsTokenizer .nextToken().substring(0, pos+1);
String count= productsTokenizer .nextToken().substring(pos, pos+1);
Instead you should do something like:
String token = productsTokenizer .nextToken();
int pos = token.indexOf("-");
String product = token.substring(...);
String count= token.substring(...);
I'll let you figure out the proper indexes for the substring() method.
Also instead of using a do/while structure it is better to just use a while loop:
while(productsTokenizer .hasMoreTokens())
{
// add your code here
}
That is, don't assume there is a token.
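For reference, one way the pieces above could fit together is sketched below. The class and method names are hypothetical, and skipping tokens without a hyphen is an assumption about the desired behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper; names are illustrative only.
class ProductParser {
    /** Returns one "product count" string per token, e.g. "Pizza 1". */
    static List<String> parse(String input) {
        List<String> result = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(input, "*");
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();       // read each token exactly once
            int pos = token.indexOf('-');
            if (pos < 0) {
                continue;                               // assumption: skip malformed tokens
            }
            String product = token.substring(0, pos);   // text before the hyphen
            String count = token.substring(pos + 1);    // text after the hyphen
            result.add(product + " " + count);
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : parse("Pizza-1*Nutella-20*Chicken-65*")) {
            System.out.println(line);
        }
    }
}
```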

An alternative answer you may want to use if your input grows:
// find all strings that match START or '*' followed by the name (matched),
// a hyphen and then a positive number (not starting with 0)
Pattern p = Pattern.compile("(?:^|[*])(\\w+)-([1-9]\\d*)");
// products is assumed to hold the input string, e.g. "Pizza-1*Nutella-20*Chicken-65*"
Matcher finder = p.matcher(products);
while (finder.find()) {
    // possibly check if the new match directly follows the previous one
    String product = finder.group(1);
    int count = Integer.parseInt(finder.group(2));
    System.out.printf("Product: %s , count %d%n", product, count);
}

Some people dislike regex, but this is a good application for them. All you need to use is "(\\w+)-(\\d{1,})\\*" as your pattern. Here's a toy example:
String template = "Pizza-1*Nutella-20*Chicken-65*";
String pattern = "(\\w+)-(\\d+)\\*";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(template);
while(m.find())
{
System.out.println(m.group(1) + " " + m.group(2));
}
To explain this a bit more, "(\\w+)-(\\d+)\\*" looks for a (\\w+), which is any run of at least 1 character from [A-Za-z0-9_], followed by a -, followed by a number \\d+, where the + means at least one digit, followed by a *, which must be escaped. The parentheses capture what's inside of them. There are two sets of capturing parentheses in this regex, so we reference them by group(1) and group(2) as seen in the while loop, which prints:
Pizza 1
Nutella 20
Chicken 65
