How to get several regex groups from Matcher in Java? - java

I have a Java program that does some String matching. I'm looking for anything that matches \d+x\d+ in a String. This works, using the Pattern and Matcher classes. However, to parse the String parts I have found, I have to manually parse the String I get from the Matcher.find() and Matcher.group(). How can I tell the Pattern I'm looking for something in the form of (\d+)x(\d+) and get the Matcher to return those groups separately?
So instead of the string "1x23" I want to get two strings, "1" and "23".

Use Matcher.group(int), not Matcher.group().
With the given regex and input, group(1) should be "1" and group(2) should be "23".
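As a minimal sketch of the answer above: the class and method names here (DimensionGroups, extract) and the sample input are made up for illustration; only Pattern, Matcher, find() and group(int) come from the standard library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question.
class DimensionGroups {
    // Two capturing groups: the digits before and after the 'x'.
    private static final Pattern DIMENSION = Pattern.compile("(\\d+)x(\\d+)");

    /** Returns one {left, right} pair for every occurrence found in the input. */
    static List<String[]> extract(String input) {
        List<String[]> pairs = new ArrayList<>();
        Matcher m = DIMENSION.matcher(input);
        while (m.find()) {
            // group(0) would be the whole match, e.g. "1x23";
            // group(1) and group(2) are the two parenthesized parts.
            pairs.add(new String[] { m.group(1), m.group(2) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : extract("sizes: 1x23 and 640x480")) {
            System.out.println(pair[0] + " and " + pair[1]);
        }
    }
}
```

Note that group(int) may only be called after a successful find() or matches(); otherwise it throws IllegalStateException.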

Related

String split method returning first element as empty using regex

I'm trying to get the digits from the expression [1..1] using Java's split method. I'm using the regex ^\\[|\\.{2}|\\]$ inside split, but split returns a String array whose first element is empty, with "1" at indices 1 and 2. Could anyone tell me what I'm doing wrong in this regex, so that the returned array contains only the digits?
You should use matching. Change your expression to:
`^\[(.*?)\.\.(.*)\]$`
And get your results from the two captured groups.
As for why split acts this way, it's simple: you asked it to split on the [ character, but there's still an "empty string" between the start of the string and the first [ character.
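The behaviour described here can be reproduced with a small sketch. The class and method names (SplitLeadingEmpty, splitRange, splitStripped) are hypothetical; the second method shows one possible workaround, stripping the brackets before splitting, and is an assumption about what the asker wants rather than the only fix.

```java
import java.util.Arrays;

// Hypothetical demo class for the split behaviour discussed above.
class SplitLeadingEmpty {
    /** Splits with the asker's original delimiters; "[" at index 0 yields a leading "". */
    static String[] splitRange(String input) {
        return input.split("^\\[|\\.{2}|\\]$");
    }

    /** Removes the surrounding brackets first, then splits on ".." only. */
    static String[] splitStripped(String input) {
        return input.replaceAll("^\\[|\\]$", "").split("\\.{2}");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitRange("[1..1]")));    // [, 1, 1]
        System.out.println(Arrays.toString(splitStripped("[1..1]"))); // [1, 1]
    }
}
```

The trailing empty string after "]" does not show up because split discards trailing empty strings by default; only the leading one is kept.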
Your regex matches [, .. and ], so the string is split at each of these occurrences.
You should not use split here; instead, match each number in your string using a regex.
You've set it up such that [, ] and .. are delimiters. Split will return an empty first index because the first character in your string [1..1] is a delimiter. I would strip delimiters from the front and end of your string, as suggested here.
So, something like
input.replaceFirst("^\\[", "").split("^\\[|\\.{2}|\\]$");
Or, use regex and regex groups (such as the other answers in this question) more directly rather than through split.
Why not use a regex to capture the numbers? This will be more effective and less error-prone. In that case the regex looks like:
^\[(\d+)\.{2}(\d+)\]$
And you can capture them with:
Pattern pattern = Pattern.compile("^\\[(\\d+)\\.{2}(\\d+)\\]$");
Matcher matcher = pattern.matcher(text);
if (matcher.find()) { // we've found a match
    int range_from = Integer.parseInt(matcher.group(1)); // digits before ".."
    int range_to = Integer.parseInt(matcher.group(2));   // digits after ".."
}
with range_from and range_to being the integers you can now work with.
The advantage is that the pattern will fail on strings that do not make much sense, like ..3[4, etc.
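A self-contained version of this approach might look like the following sketch. RangeParser and parse are hypothetical names, and returning null for malformed input is an assumption made for the example, not something the answer prescribes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the capture-group approach from the answer above.
class RangeParser {
    private static final Pattern RANGE = Pattern.compile("^\\[(\\d+)\\.{2}(\\d+)\\]$");

    /**
     * Returns {from, to} for input of the form [a..b], or null otherwise.
     * Integer.parseInt can still throw NumberFormatException if the digits
     * do not fit into an int.
     */
    static int[] parse(String text) {
        Matcher m = RANGE.matcher(text);
        if (!m.find()) {
            return null; // malformed input such as "..3[4"
        }
        return new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
    }

    public static void main(String[] args) {
        int[] range = parse("[1..15]");
        System.out.println(range[0] + " to " + range[1]); // 1 to 15
        System.out.println(parse("..3[4") == null);       // true
    }
}
```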

Use regex in Java to extract specific parts of a String

In the following string I want to extract the ids that come after the {\"company_id\": part. The first in this case is 4100, and there are two more further along, 4045 and 2979. All of these ids are 4 digits. Sorry for including such a long string. The reason I want to use regex and not some sort of JSON parser is that the JSON string is malformed.
String company = "[{\"company_id\":4100,\"data\":{\"drm_user_id\":572901936637129135,\"direct_status_id\":0,\"direct_optin_date\":0,\"direct_first_optin_date\":0,\"direct_last_optin_date\":0,\"direct_optout_date\":0,\"direct_last_form_date\":0,\"direct_last_form_id\":0,\"direct_last_promo_id\":0,\"anon_status_id\":600,\"anon_optin_date\":1446132360498,\"anon_first_optin_date\":1446132360498,\"anon_last_optin_date\":1446132360498,\"anon_optout_date\":0,\"anon_last_form_date\":1446132360498,\"anon_last_form_id\":101,\"anon_last_promo_id\":1002003,\"last_registration_date\":1446132360498,\"mp_status_id\":600,\"mp_control_state\":-1,\"mp_match_date\":0,\"mp_vs_version\":0,\"mp_initial_value_segment\":0,\"mp_id\":0,\"conversion_last_form_date\":0,\"conversion_last_form_id\":0,\"conversion_last_promo_id\":-1,\"last_message_date\":1446132368928,\"cg_version\":0,\"cg_version_date\":0,\"num_anon_messages_global\":0,\"num_anon_messages_global_date\":0,\"reg_creator_id\":576,\"reg_form_id\":101,\"reg_method_id\":1,\"reg_creator_type_id\":1},\"personal_data\":{\"version\":0,\"personal_data\":\"{}\",\"mdc_data\":{\"version\":0},\"custom_data\":\"{}\"},\"category_data\":{},\"campaignImpressions\":{},\"journeyStartDate\":0},{\"company_id\":4045,\"data\":{\"drm_user_id\":572901936637129135,\"direct_status_id\":0,\"direct_optin_date\":0,\"direct_first_optin_date\":0,\"direct_last_optin_date\":0,\"direct_optout_date\":0,\"direct_last_form_date\":0,\"direct_last_form_id\":0,\"direct_last_promo_id\":0,\"anon_status_id\":600,\"anon_optin_date\":1446132360498,\"anon_first_optin_date\":1446132360498,\"anon_last_optin_date\":1446132360498,\"anon_optout_date\":0,\"anon_last_form_date\":1446132360498,\"anon_last_form_id\":101,\"anon_last_promo_id\":1002003,\"last_registration_date\":1446132360498,\"mp_status_id\":600,\"mp_control_state\":-1,\"mp_match_date\":0,\"mp_vs_version\":0,\"mp_initial_value_segment\":0,\"mp_id\":0,\"conversion_last_form_date\":0,\"conversion_last_form_id\":0,\"conver
sion_last_promo_id\":-1,\"last_message_date\":1446132368928,\"cg_version\":0,\"cg_version_date\":0,\"num_anon_messages_global\":0,\"num_anon_messages_global_date\":0,\"reg_creator_id\":576,\"reg_form_id\":101,\"reg_method_id\":1,\"reg_creator_type_id\":1},\"personal_data\":{\"version\":0,\"personal_data\":\"{}\",\"mdc_data\":{\"version\":0},\"custom_data\":\"{}\"},\"category_data\":{},\"campaignImpressions\":{},\"journeyStartDate\":0},{\"company_id\":2979,\"data\":{\"drm_user_id\":572901936637129135,\"direct_status_id\":0,\"direct_optin_date\":0,\"direct_first_optin_date\":0,\"direct_last_optin_date\":0,\"direct_optout_date\":0,\"direct_last_form_date\":0,\"direct_last_form_id\":0,\"direct_last_promo_id\":0,\"anon_status_id\":600,\"anon_optin_date\":1446132360498,\"anon_first_optin_date\":1446132360498,\"anon_last_optin_date\":1446132360498,\"anon_optout_date\":0,\"anon_last_form_date\":1446132360498,\"anon_last_form_id\":101,\"anon_last_promo_id\":1002003,\"last_registration_date\":1446132360498,\"mp_status_id\":600,\"mp_control_state\":-1,\"mp_match_date\":0,\"mp_vs_version\":0,\"mp_initial_value_segment\":0,\"mp_id\":0,\"conversion_last_form_date\":0,\"conversion_last_form_id\":0,\"conversion_last_promo_id\":-1,\"last_message_date\":1446132368928,\"cg_version\":0,\"cg_version_date\":0,\"num_anon_messages_global\":0,\"num_anon_messages_global_date\":0,\"reg_creator_id\":576,\"reg_form_id\":101,\"reg_method_id\":1,\"reg_creator_type_id\":1},\"personal_data\":{\"version\":0,\"personal_data\":\"{}\",\"mdc_data\":{\"version\":0},\"custom_data\":\"{}\"},\"category_data\":{},\"campaignImpressions\":{},\"journeyStartDate\":0}]";
This is what I have so far:
Pattern pattern = Pattern.compile("company_id\\\\\":(\\d{4})");
Matcher matcher = pattern.matcher(company);
while(matcher.find()){
System.out.println(matcher.group(1)+"\n");
}
However, this does not work, and I am not sure how to actually check that the number comes after this specific {\"company_id\": part.
A single backslash is enough. In a Java string literal, \" is an escaped double quote, so the pattern below matches the literal text "company_id": followed by four digits.
Pattern pattern = Pattern.compile("\"company_id\":(\\d{4})");
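To illustrate, here is a sketch that applies this pattern to a drastically shortened stand-in for the asker's string. The sample value and the names CompanyIds and extractIds are invented for the example; only the three ids are taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the input below is a shortened stand-in for the real string.
class CompanyIds {
    // In Java source, \" is one double-quote character and \\d is the regex \d.
    private static final Pattern COMPANY_ID = Pattern.compile("\"company_id\":(\\d{4})");

    /** Collects every 4-digit id that directly follows "company_id": in the input. */
    static List<String> extractIds(String json) {
        List<String> ids = new ArrayList<>();
        Matcher m = COMPANY_ID.matcher(json);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        String company = "[{\"company_id\":4100,\"data\":{}},"
                + "{\"company_id\":4045,\"data\":{}},"
                + "{\"company_id\":2979,\"data\":{}}]";
        System.out.println(extractIds(company)); // [4100, 4045, 2979]
    }
}
```

The original attempt failed because "company_id\\\\\":" asks the regex engine for a literal backslash before the quote, and the runtime string contains no backslashes: they exist only in the Java source as escapes.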

how to get "something" from <em>something</em> use java Regular expressions

In the following, I need to get:
String regex = "Item#: <em>.*</em>";
String content = "xxx Item#: <em>something</em> yyy";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(content);
if( matcher.find() ) {
System.out.println(matcher.group());
}
It will print:
Item#: <em>something</em>
but I just need the value "something".
I know I can use .substring(begin, end) to get the value,
but is there another way which would be more elegant?
It prints the whole string because that is what you asked it to print: matcher.group() returns the complete match. To get a specific part of the matched string, change your regex to capture the content between the tags in a group:
String regex = "Item#: <em>(.*?)</em>";
Also, use a reluctant quantifier (.*?) to match the fewest characters possible before a </em> is encountered.
Then, inside the if, print group(1) instead of group():
if( matcher.find() ) {
System.out.println(matcher.group(1));
}
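The difference between the greedy and the reluctant quantifier only shows up when the input contains more than one closing tag. The sketch below uses a made-up input with two <em> elements; EmExtract and firstGroup are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo comparing (.*) with (.*?) inside a capturing group.
class EmExtract {
    /** Returns group(1) of the first match of regex in content, or null if none. */
    static String firstGroup(String regex, String content) {
        Matcher m = Pattern.compile(regex).matcher(content);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String content = "Item#: <em>first</em> and <em>second</em>";

        // Greedy: .* runs to the LAST </em> in the input.
        System.out.println(firstGroup("Item#: <em>(.*)</em>", content));
        // prints: first</em> and <em>second

        // Reluctant: .*? stops at the FIRST </em>.
        System.out.println(firstGroup("Item#: <em>(.*?)</em>", content));
        // prints: first
    }
}
```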
That said, you should not use regex to parse HTML; regular expressions are not powerful enough for that task. You should probably use an HTML parser such as HTML Cleaner. Also see the link that is provided in one of the comments on the original post. That post is a very nice explanation of the problems you can face.

Print out the string that matched my regular expression in java?

Possible duplicate: Print regex matches in java
I am using the Matcher class in Java to match a string against a particular regular expression, which I converted into a Pattern using the Pattern class. I know my regex works because when I call Matcher.find(), I am getting true values where I am supposed to. But I want to print out the strings that are producing those true values (meaning print out the strings that match my regex) and I don't see a method in the Matcher class to achieve that. Please do let me know if anyone has encountered such a problem before. I apologize as this question is fairly rudimentary, but I am fairly new to regex and hence am still finding my way around the regex world.
Assuming m is your matcher:
m.group() will return the matched string.
[EDIT] Added info regarding matched groups
Also, if your regex has portions inside parentheses, m.group(n) will return the string that matches the nth parenthesized group:
Pattern p = Pattern.compile("mary (.*) bob");
Matcher m = p.matcher("since that day mary loves bob");
m.find(); // a match must be attempted before group() can be called
m.group() returns "mary loves bob".
m.group(1) returns "loves".
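To print every matching substring rather than only the first, find() can be called in a loop. This is a small sketch with invented names (PrintMatches, allMatches) and an invented second sample input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that collects what each successful find() matched.
class PrintMatches {
    /** Returns the text of every non-overlapping match of regex in input. */
    static List<String> allMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group()); // same as m.group(0): the whole match
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(allMatches("mary (.*) bob", "since that day mary loves bob"));
        // [mary loves bob]
        System.out.println(allMatches("\\d+", "a1 b22 c333"));
        // [1, 22, 333]
    }
}
```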

Whitespace in Java's regular expression

I'm trying to write a regular expression to match an IRC PRIVMSG string. It is something like:
:nick!name#some.host.com PRIVMSG #channel :message body
So I wrote the following code:
Pattern pattern = Pattern.compile("^:.*\\sPRIVMSG\\s#.*\\s:");
Matcher matcher = pattern.matcher(msg);
if(matcher.matches()) {
System.out.println(msg);
}
It does not work: I get no matches. When I test the regular expression using online JavaScript testers, I do get matches.
I tried to find the reason why it doesn't work, and I found that there's something wrong with the whitespace symbol. The following pattern gives me some matches:
Pattern.compile("^:.*");
But the pattern with \s will not:
Pattern.compile("^:.*\\s");
It's confusing.
The Java matches method strikes again! That method only returns true if the entire input string matches the pattern. You didn't include anything that consumes the message body after the second colon, so the entire string is not a match. It works in the testers because 'normal' regex reports a match if any part of the input matches.
Pattern pattern = Pattern.compile("^:.*?\\sPRIVMSG\\s#.*?\\s:.*$");
This should match.
If you look at the documentation for matches(), you will notice that it tries to match the entire string. You need to fix your regexp or use find() to iterate through the substring matches.
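The difference between matches() and find() can be checked directly with both patterns from this thread. The class and helper names below (PrivmsgMatch, matchesWhole, findsAnywhere) are hypothetical; the message and the two patterns are the ones quoted above.

```java
import java.util.regex.Pattern;

// Hypothetical demo contrasting Matcher.matches() with Matcher.find().
class PrivmsgMatch {
    static final String MSG = ":nick!name#some.host.com PRIVMSG #channel :message body";

    // The asker's pattern: stops at the colon before the message body.
    static final Pattern PREFIX_ONLY = Pattern.compile("^:.*\\sPRIVMSG\\s#.*\\s:");
    // The answer's pattern: the trailing .*$ also consumes the message body.
    static final Pattern WHOLE_LINE = Pattern.compile("^:.*?\\sPRIVMSG\\s#.*?\\s:.*$");

    /** True only if the pattern covers the entire string. */
    static boolean matchesWhole(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    /** True if the pattern matches some substring. */
    static boolean findsAnywhere(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole(PREFIX_ONLY, MSG));  // false
        System.out.println(findsAnywhere(PREFIX_ONLY, MSG)); // true
        System.out.println(matchesWhole(WHOLE_LINE, MSG));   // true
    }
}
```

The same reasoning explains the asker's smaller experiment: "^:.*" can stretch over the whole line, so matches() succeeds, while "^:.*\\s" must end on a whitespace character and the line ends in a letter.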
