Capture Regex repeating string between slashes in URL - java

I have the following partial URL, which can look like this:
/it/xyz/test/param+1/param-2/1234/gfd4
Basically, it is two letters at the beginning, a slash, another unknown string, and then a series of repeatable strings between slashes.
I need to capture every string (I know a split on the / delimiter would work, but I am interested in how to extract them with a regex). I first came up with this:
^\/([a-zA-Z]{2})\/([a-zA-Z]{1,10})(\/[a-zA-Z1-9\+\-]+)
but it only captures
group1: it
group2: xyz
group3: /test
and of course it ignores the rest of the string.
If I add a * sign at the end, it only captures the last segment:
^\/([a-zA-Z]{2})\/([a-zA-Z]{1,10})(\/[a-zA-Z1-9\+\-]+)*
group1: it
group2: xyz
group3: /gfd4
So I am obviously missing some fundamentals; in addition to the proper regex, I would like an explanation.
I tagged this as Java because the engine that parses the regex is JDK 7. As far as I know, each engine may behave differently.

As mentioned here, this is expected:
With one group in the pattern, you can only get one exact result in that group.
If your capture group gets repeated by the pattern (you used the + quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.
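The behavior described in the quote can be reproduced with a small sketch. It reuses the second pattern from the question (rewritten as a Java string literal); the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepeatedGroupDemo {
    // The question's second pattern: group 3 is repeated by the trailing '*'.
    private static final Pattern REPEATED = Pattern.compile(
            "^/([a-zA-Z]{2})/([a-zA-Z]{1,10})(/[a-zA-Z1-9+\\-]+)*");

    // Returns whatever group 3 holds after the match, or null if nothing matched
    // or the group never participated in the match.
    static String lastRepetition(String path) {
        Matcher m = REPEATED.matcher(path);
        return m.find() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        // Group 3 matched five times, but only the final iteration is kept.
        System.out.println(lastRepetition("/it/xyz/test/param+1/param-2/1234/gfd4"));
    }
}
```

Each iteration of the quantified group overwrites the previous capture, which is why only the last segment is visible afterwards.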
I would rather capture the rest of the string in group 3 with (\/.*$), as in this demo, and then split it around '/'. Or apply that pattern to the rest of the string:
Pattern p = Pattern.compile("(\\/[a-zA-Z1-9\\+\\-]+)");
Matcher m = p.matcher(str);
while (m.find()) {
String place = m.group(1);
...
}
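Putting both suggestions together, a minimal sketch might match the fixed prefix once, capture the remainder in group 3, and then call find() repeatedly on that remainder. The class name, method name, and the choice to return a list are assumptions for illustration; the character class is the one from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlSegments {
    // Two letters, one short word, then everything else captured as group 3.
    private static final Pattern PREFIX =
            Pattern.compile("^/([a-zA-Z]{2})/([a-zA-Z]{1,10})(/.*)$");
    // One repeatable segment; the capture group excludes the leading slash.
    private static final Pattern SEGMENT =
            Pattern.compile("/([a-zA-Z1-9+\\-]+)");

    static List<String> segments(String path) {
        List<String> out = new ArrayList<>();
        Matcher prefix = PREFIX.matcher(path);
        if (!prefix.matches()) {
            return out; // no match: return an empty list
        }
        out.add(prefix.group(1));
        out.add(prefix.group(2));
        // Each find() call yields the next segment in the remainder.
        Matcher seg = SEGMENT.matcher(prefix.group(3));
        while (seg.find()) {
            out.add(seg.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(segments("/it/xyz/test/param+1/param-2/1234/gfd4"));
    }
}
```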

Related

Regex to match the nearest character backwards

I have this string:
P.1 P.2 P.3 P.4
ASTON VETERINARY HOSPITAL
Page 1/2
00 PennelJ Road
Media, PA 19063-5983
(610) 474-5670
Client :
I want to get the text in between Client and P.\d. Here is the demo: Regex
(P.\d)[\s\S]*(?=^.+Client :?)
The problem is that it matches from the first marker, P.1. I need the nearest P.\d before Client.
How can I change the regex so that it matches from P.4?
I tried this with non-greedy operators but that's not going to work. I'd try to move away from having the entire regex match precisely what you want, and use groups instead. Then you can just write a matcher to match any number of those P.1 constructs, and it makes your scan for the Client string at the end a lot simpler because you don't have to try to do it as a lookahead. Thus:
String x = "P.1 P.2 P.3 P.4 foobar Client :";
Pattern p = Pattern.compile("((P\\.\\d)(.*(P\\.\\d))*)+(?<result>.*)Client");
Matcher m = p.matcher(x);
System.out.println(m.find());
System.out.println(m.group("result"));
Seems to produce precisely what you want. The syntax (?<whatever>REGEX HERE) is regular-expression-ese for: Let me grab just this bit later by asking for the group 'whatever'.
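As an alternative to the named-group approach, and not part of the answer above, one could sketch a pattern that refuses to step over another P.\d marker while scanning towards Client. This technique is sometimes called a tempered greedy token. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NearestMarker {
    // After a P.<digit> marker, consume characters one at a time, but only
    // while the next position does not start another P.<digit> marker.
    private static final Pattern NEAREST =
            Pattern.compile("P\\.\\d((?:(?!P\\.\\d)[\\s\\S])*?)Client");

    // Returns the trimmed text between the last marker and "Client",
    // or null if the pattern does not match.
    static String between(String text) {
        Matcher m = NEAREST.matcher(text);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(between("P.1 P.2 P.3 P.4 foobar Client :"));
    }
}
```

Because every attempt that starts at P.1, P.2, or P.3 runs into the next marker before reaching Client, only the attempt starting at P.4 can succeed.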

Regex to find time durations

I have read many of the regex questions on Stack Overflow, but they didn't help me develop my own code.
What I need is the following. I am parsing texts that have already been tagged with the Stanford Tagger. Now I am trying to remove time durations from some parts of the texts: 1) when the phrase starts with a date (e.g. 1999_CARD Tom_NN was_VP); 2) when the time duration follows this format: 2/1999_CARD -_- 01/01/2000_CARD (or similar).
I have written some code, but it wrongly removes other parts of the text, and I don't know why. My regex is as follows:
String regex = "(\\s|\\b.*?_(CARD|CD)\\s([^A-Za-z0-9])+_([^A-Za-z0-9])+(.*?)+_(CARD|CD))|(\\b.*?_(CARD|CD))";
Pattern pattern2 = Pattern.compile(regex);
Matcher m2 = pattern2.matcher(chunkPhrase);
if (m2.find()) {
chunkPhrase = chunkPhrase.replace(m2.group(0), "");
}
For example, in the following phrase, it finds something (but it shouldn't)
·_NNP Research_NNP of_IN Symbian_NNP OS_NNP 7.0_CD s_NNS
After removing the time duration in the above phrase, I'm left with · s_NNS which is not what I want.
To make it clearer what I expect from the code, here are some examples:
1/1/2002_CD -_- 1/2/2003_CD Test_NN Company_NN
after applying the code, I expect:
Test_NN Company_NN
For this one:
1/1/2002_CARD -_- 1/2/2003_CARD Test_NN Company_NN
after applying the code, I expect:
Test_NN Company_NN
For this one:
2000_CARD I_NN was_VP working_NP here_ADV
after applying the code, I expect:
I_NN was_VP working_NP here_ADV
For this one:
I_NN have_VP worked_VP in_PP 3_CARD companies_NP
after applying the code, I expect:
I_NN have_VP worked_VP in_PP 3_CARD companies_NP
I am using Java.
Update: To clarify: if a number occurs AT THE BEGINNING, it must be removed. Otherwise, it must remain. If it follows the second format (e.g. 1999_CD -_- 2000_CARD), it must be removed, regardless of whether it occurs at the beginning, middle, or end of the phrase.
Can anyone help me find what is wrong with my code?
You can use this regex:
final String regex = "\\b(?:\\d{1,2}/*\\d{1,2}/)?\\d{4}_(?:CARD|CD)(?:\\h*[-_]+)?\\h*";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(input);
// The substituted value will be contained in the result variable
final String result = matcher.replaceAll("");
System.out.println("Substitution result: " + result);
RegEx Demo
RegEx Breakup:
\b - Word boundary
(?: - Start non-capturing group
\d{1,2}/*\d{1,2}/ - Match mm/dd part of a date
)? - End non-capturing group (optional)
\d{4} - Match 4 digits of year
_ - Match a literal _
(?:CARD|CD) - Match CARD or CD
(?: - Start non-capturing group
\h*[-_]+ - Match horizontal whitespace followed by 1 or more - or _
)? - End non-capturing group (optional)
\h* - Match 0 or more horizontal whitespaces
Based on the examples you have provided, the following regex will capture the required time durations
((?:\d{2,}|\d{1,2}\/\d{1,2}\/\d{2,4})_(?:CARD|CD) (?:-_- )?)
Details
(?:\d{2,}|\d{1,2}\/\d{1,2}\/\d{2,4}) // match minimum of 2 digits or a date in xx/xx/xx[xx] format
_(?:CARD|CD) // match _CARD or _CD
(?:-_- )? // match -_- , if it exists
The ?: at the beginning of a group means it is a non-capturing group. The parentheses around the whole thing form the capturing group.
See demo here
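To check this second regex against the examples from the question, a small Java sketch could look like the following. The class and method names are invented for illustration. Note that the pattern removes any matching token wherever it occurs, so it does not by itself enforce the "only at the beginning" rule from the update.

```java
import java.util.regex.Pattern;

final class DurationStripper {
    // The answer's pattern as a Java string literal (outer capturing group
    // omitted, because replaceAll does not need it).
    private static final Pattern DURATION = Pattern.compile(
            "(?:\\d{2,}|\\d{1,2}/\\d{1,2}/\\d{2,4})_(?:CARD|CD) (?:-_- )?");

    // Removes every match of the pattern from the phrase.
    static String strip(String phrase) {
        return DURATION.matcher(phrase).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("1/1/2002_CD -_- 1/2/2003_CD Test_NN Company_NN"));
        System.out.println(strip("2000_CARD I_NN was_VP working_NP here_ADV"));
        System.out.println(strip("I_NN have_VP worked_VP in_PP 3_CARD companies_NP"));
    }
}
```

The single-digit 3_CARD survives because the first alternative requires at least two digits.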

Java Regex capture multiple groups with groups containing others

I'm trying to build a regular expression that captures multiple groups, with some of them contained in others. For instance, let's say I want to capture every 4-gram that follows a 'to' prefix:
input = "I want to run to get back on shape"
expectedOutput = ["run to get back", "get back on shape"]
In that case I would use this regex:
"to((?:[ ][a-zA-Z]+){4})"
But it only captures the first item in expectedOutput (with a space prefix but that's not the point).
This is quite easy to solve without regex, but I'd like to know if it is possible only using regex.
You can make use of overlapping regex matches by wrapping the pattern in a lookahead:
String s = "I want to run to get back on shape";
Pattern pattern = Pattern.compile("(?=\\bto\\b((?:\\s*[\\p{L}\\p{M}]+){4}))");
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
System.out.println(matcher.group(1).trim());
}
See IDEONE demo
The regex (?=\bto\b((?:\s*[\p{L}\p{M}]+){4})) checks each location in the string (since it is a zero width assertion) and looks for:
\bto\b - a whole word to
((?:\s*[\p{L}\p{M}]+){4}) - Group 1 capturing 4 occurrences of
\s* zero or more whitespace(s)
[\p{L}\p{M}]+ - one or more letters or diacritics
If you want to allow capturing n-grams of fewer than 4 words, use a {0,4} (or {1,4} to require at least one) greedy limiting quantifier instead of {4}.
Groups in a regex are numbered in the order of their opening parentheses:
1 ((A)(B(C))) // first group (encloses the other three)
2 (A) // second group
3 (B(C)) // third group (encloses one other group)
4 (C) // fourth group
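The numbering above can be confirmed with a short Java sketch; the class and method names are placeholders chosen for illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupOrder {
    private static final Pattern NESTED = Pattern.compile("((A)(B(C)))");

    // Returns group 0 (the whole match) followed by groups 1..groupCount(),
    // or an empty array if the input does not match.
    static String[] groups(String input) {
        Matcher m = NESTED.matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(groups("ABC")));
    }
}
```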

Regex to match only one character sequence within string

I have a string in a JList that I want to split with a regex (for future simplicity if requirements change).
The string looks a lot like this:
ID: GF68464, Name: productname
The ID could be any combination of letters and numbers and could be any length.
I only want the ID to be matched, i.e. excluding "ID: " and anything after the comma following the ID.
Here is what I have so far, but it doesn't seem to do what I want:
[^ID: ][a-zA-Z1-9][^,^.]
FURTHER INFO (EDIT)
I plan on extracting the ID to match against an array. (hence the need for a regex). Could this be done a different way?
You can try this:
ID:\s*(\w+),
and extract the 1st capturing group. You can also use lookarounds (+1 to @p.s.w.g).
String str = "ID: GF68464, Name: productname";
Matcher m = Pattern.compile("ID:\\s*(\\w+),").matcher(str);
if (m.find()) {
System.out.println(m.group(1));
}
GF68464
You could try using lookarounds:
(?<=ID:\s*)\w+(?=,)
This will match any sequence of one or more word characters preceded by "ID:" and any number of white space characters, and followed by a comma.
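In Java, a lookbehind containing an unbounded quantifier such as \s* is rejected by older JDKs, which require the lookbehind to have an obvious maximum length. A minimal sketch that sidesteps this assumes exactly one space after "ID:", as in the sample string; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IdLookaround {
    // Fixed-width lookbehind ("ID: " with a single space, an assumption
    // based on the sample input) and a lookahead for the comma.
    private static final Pattern ID = Pattern.compile("(?<=ID: )\\w+(?=,)");

    // Returns the ID, or null if none is found.
    static String extractId(String line) {
        Matcher m = ID.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(extractId("ID: GF68464, Name: productname"));
    }
}
```

Since the lookarounds consume no characters, the whole match is the ID itself and no capturing group is needed.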
What you want is called a non-capturing group. There are already some fairly high-quality examples of doing this in Java on SO - for example, this question: What is a non-capturing group? What does a question mark followed by a colon (?:) mean?
Create a regex like /^[a-z A-Z 0-9]*,/, then use the match function and take the value match[0], like this:
var regex = /^[a-z A-Z 0-9]*\,/;
var matches = your_string.match(regex);
var required_value = matches[0];
hope this helps

Java - Extract strings with Regex

I have this string:
String myString ="A~BC~FGH~~zuzy|XX~ 1234~ ~~ABC~01/01/2010 06:30~BCD~01/01/2011 07:45";
and I need to extract these 3 substrings
1234
06:30
07:45
If I use this regex \\d{2}\:\\d{2} I'm only able to extract the first hour 06:30
Pattern depArrHours = Pattern.compile("\\d{2}\\:\\d{2}");
Matcher matcher = depArrHours.matcher(myString);
String firstHour = matcher.group(0);
String secondHour = matcher.group(1); // IndexOutOfBoundsException: no group 1
matcher.group(1) throws an exception.
Also I don't know how to extract 1234. This string can change but it always comes after 'XX~ '
Do you have any idea on how to match these strings with regex expressions?
UPDATE
Thanks to Adam's suggestion, I now have this regex that matches my string:
Pattern p = Pattern.compile(".*XX~ (\\d{3,4}).*(\\d{1,2}:\\d{2}).*(\\d{1,2}:\\d{2})");
I match the number, and the 2 hours with matcher.group(1); matcher.group(2); matcher.group(3);
The matcher.group() method takes a single integer argument: the capturing group index, starting from 1. Index 0 is special and means "the entire match". A capturing group is created with a pair of parentheses "(...)". Anything within the parentheses is captured. Groups are numbered from left to right (again, starting from 1) by their opening parenthesis (which means that groups can be nested). Since there are no parentheses in your regular expression, there can be no group 1.
The javadoc on the Pattern class covers the regular expression syntax.
If you are looking for a pattern that might recur some number of times, you can use Matcher.find() repeatedly until it returns false. Matcher.group(0) once on each iteration will then return what matched that time.
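For the string in the question, the repeated find() approach might be sketched as follows. The class and method names are made up for illustration; the pattern is the one the asker already tried.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TimeFinder {
    private static final Pattern HOUR = Pattern.compile("\\d{2}:\\d{2}");

    // Collects every hh:mm occurrence, one per successful find() call.
    static List<String> findTimes(String input) {
        List<String> times = new ArrayList<>();
        Matcher m = HOUR.matcher(input);
        while (m.find()) {
            times.add(m.group(0)); // group 0 is the text matched this time
        }
        return times;
    }

    public static void main(String[] args) {
        String myString = "A~BC~FGH~~zuzy|XX~ 1234~ ~~ABC~01/01/2010 06:30~BCD~01/01/2011 07:45";
        System.out.println(findTimes(myString));
    }
}
```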
If you want to build one big regular expression that matches everything at once (which I believe is what you want), then put a set of capturing parentheses around each of the three things that you want to capture, use Matcher.matches(), and then call Matcher.group(n), where n is 1, 2 and 3 respectively. Of course, Matcher.matches() might also return false, in which case the pattern did not match and you can't retrieve any of the groups.
In your example, what you probably want to do is have it match some preceding text, then start a capturing group, match for digits, end the capturing group, etc...I don't know enough about your exact input format, but here is an example.
Let's say I had strings of the form:
Eat 12 carrots at 12:30
Take 3 pills at 01:15
And I wanted to extract the quantities and times. My regular expression would look something like:
"\w+ (\d+) [\w ]+ (\d{2}:\d{2})"
The code would look something like:
Pattern p = Pattern.compile("\\w+ (\\d+) [\\w ]+ (\\d{2}:\\d{2})");
Matcher m = p.matcher(oneline);
if(m.matches()) {
System.out.println("The quantity is " + m.group(1));
System.out.println("The time is " + m.group(2));
}
The regular expression means "a string containing a word, a space, one or more digits (which are captured in group 1), a space, a set of words and spaces ending with a space, followed by a time (captured in group 2; the time assumes that the hour is always 0-padded to 2 digits)". I would give an example closer to what you are looking for, but the description of the possible input is a little vague.
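Applied to the string from the question, the single-regex approach could be sketched like this. It assumes the number always follows "XX~ " and that both times are zero-padded hh:mm values, which is inferred from the one sample given; the class and method names are hypothetical. Reluctant quantifiers (.*?) are used so that each group picks up the first candidate it encounters.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FlightFields {
    private static final Pattern FIELDS = Pattern.compile(
            ".*?XX~ (\\d+).*?(\\d{2}:\\d{2}).*?(\\d{2}:\\d{2})");

    // Returns {number, firstTime, secondTime}, or null if the whole
    // input does not match the pattern.
    static String[] parse(String input) {
        Matcher m = FIELDS.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3)};
    }

    public static void main(String[] args) {
        String myString = "A~BC~FGH~~zuzy|XX~ 1234~ ~~ABC~01/01/2010 06:30~BCD~01/01/2011 07:45";
        System.out.println(Arrays.toString(parse(myString)));
    }
}
```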
