I have read many of the regex questions on Stack Overflow, but they didn't help me fix my own code.
Here is what I need. I am processing text that has already been tagged with the Stanford Tagger. Now I am trying to remove time durations from parts of the text in two cases: 1) the phrase starts with a date (e.g. 1999_CARD Tom_NN was_VP); 2) the time duration has this format: 2/1999_CARD -_- 01/01/2000_CARD (or similar).
I have written some code, but it wrongly removes other parts of the text and I don't know why. My regex is the following:
String regex = "(\\s|\\b.*?_(CARD|CD)\\s([^A-Za-z0-9])+_([^A-Za-z0-9])+(.*?)+_(CARD|CD))|(\\b.*?_(CARD|CD))";
Pattern pattern2 = Pattern.compile(regex);
Matcher m2 = pattern2.matcher(chunkPhrase);
if (m2.find()) {
    chunkPhrase = chunkPhrase.replace(m2.group(0), "");
}
For example, in the following phrase, it finds something (but it shouldn't)
·_NNP Research_NNP of_IN Symbian_NNP OS_NNP 7.0_CD s_NNS
After removing the time duration in the above phrase, I'm left with · s_NNS which is not what I want.
To make it clearer what I expect from the code, here are some examples:
1/1/2002_CD -_- 1/2/2003_CD Test_NN Company_NN
after applying the code, I expect:
Test_NN Company_NN
For this one:
1/1/2002_CARD -_- 1/2/2003_CARD Test_NN Company_NN
after applying the code, I expect:
Test_NN Company_NN
For this one:
2000_CARD I_NN was_VP working_NP here_ADV
after applying the code, I expect:
I_NN was_VP working_NP here_ADV
For this one:
I_NN have_VP worked_VP in_PP 3_CARD companies_NP
after applying the code, I expect:
I_NN have_VP worked_VP in_PP 3_CARD companies_NP
I am using Java.
Update: To clarify: if a number occurs AT THE BEGINNING of the phrase, it must be removed. Otherwise, it must be kept. If it follows the second format (e.g. 1999_CD -_- 2000_CARD), it must be removed regardless of whether it occurs at the beginning, middle, or end of the phrase.
Can anyone tell me what is wrong with my code?
You can use this regex:
final String regex = "\\b(?:\\d{1,2}/*\\d{1,2}/)?\\d{4}_(?:CARD|CD)(?:\\h*[-_]+)?\\h*";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(input);
// The substituted value will be contained in the result variable
final String result = matcher.replaceAll("");
System.out.println("Substitution result: " + result);
RegEx Demo
RegEx Breakup:
\b - Word boundary
(?: - Start non-capturing group
\d{1,2}/*\d{1,2}/ - Match mm/dd part of a date
)? - End non-capturing group (optional)
\d{4} - Match 4 digits of year
_ - Match a literal _
(?:CARD|CD) - Match CARD or CD
(?: - Start non-capturing group
\h*[-_]+ - Match horizontal whitespace followed by 1 or more - or _
)? - End non-capturing group (optional)
\h* - Match 0 or more horizontal whitespaces
Based on the examples you have provided, the following regex will capture the required time durations
((?:\d{2,}|\d{1,2}\/\d{1,2}\/\d{2,4})_(?:CARD|CD) (?:-_- )?)
Details
(?:\d{2,}|\d{1,2}\/\d{1,2}\/\d{2,4}) // match minimum of 2 digits or a date in xx/xx/xx[xx] format
_(?:CARD|CD) // match _CARD or _CD
(?:-_- )? // match -_- , if it exists
The ?: at the beginning of a group means it is a non-capturing group. The parentheses around the whole expression form the capturing group.
See demo here
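Going by the update in the question (a leading number is always removed; a range such as 1999_CD -_- 2000_CARD is removed wherever it occurs), the logic can also be sketched as two separate passes instead of one alternation. This is only a sketch: the names DurationStripper and strip are made up for illustration, and the assumed token shapes are based only on the examples in the question.

```java
import java.util.regex.Pattern;

class DurationStripper {
    // Assumed shape of a tagged number or date: 1999_CD, 2/1999_CARD, 01/01/2000_CARD.
    private static final String NUM = "\\d+(?:/\\d+){0,2}_(?:CARD|CD)";

    // A range anywhere in the phrase: number, a punctuation token such as -_-, number.
    // The negative lookbehind makes sure the first number starts a token,
    // so the tail of a token like 7.0_CD is left alone.
    private static final Pattern RANGE = Pattern.compile(
            "(?<!\\S)" + NUM + "\\s+[^\\sA-Za-z0-9]+_[^\\sA-Za-z0-9]+\\s+" + NUM + "\\s*");

    // A single number or date, but only at the very start of the phrase.
    private static final Pattern LEADING = Pattern.compile("^\\s*" + NUM + "\\s*");

    static String strip(String phrase) {
        // Pass 1: drop ranges wherever they occur.
        String withoutRanges = RANGE.matcher(phrase).replaceAll("");
        // Pass 2: drop one number/date token if it is the first token.
        return LEADING.matcher(withoutRanges).replaceFirst("").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("1/1/2002_CD -_- 1/2/2003_CARD Test_NN Company_NN"));
        System.out.println(strip("2000_CARD I_NN was_VP working_NP here_ADV"));
        System.out.println(strip("I_NN have_VP worked_VP in_PP 3_CARD companies_NP"));
    }
}
```

Note that the leading pass runs on the result of the range pass, so a number that only becomes the first token after a range has been removed is stripped as well. If that is not wanted, the leading check would have to be made against the original phrase.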
I need to replace all characters in a string which come before an open parenthesis but come after an asterisk:
Input:
1.2.3 (1.234*xY)
Needed Output:
1.234
I tried the following:
string.replaceAll(".*\\(|\\*.*", "");
but I ran into an issue here where Matcher.matches() is false even though there are two matches... What is the most elegant way to solve this?
You could try matching the whole string, and replace with capture group 1
^[^(]*\(([^*]+)\*.*
The pattern matches:
^ start of string
[^(]*\( Match any char except ( and then match (
([^*]+) Capture in group 1 matching any char except *
\*.* Match an asterisk and the rest of the line
Regex demo | Java demo
String string = "1.2.3 (1.234*xY)";
System.out.println(string.replaceFirst("^[^(]*\\(([^*]+)\\*.*", "$1"));
Output
1.234
You may use this regex to match:
[^(]*\(|\*.*
and replace with an empty string.
RegEx Demo
RegEx Details:
[^(]*\(: Match 0 or more characters that are not ( followed by a (
|: OR
\*.*: Match * and everything after that
Java Code:
String s = "1.2.3 (1.234*xY)";
String r = s.replaceAll("[^(]*\\(|\\*.*", "");
//=> "1.234"
With your shown samples and attempts please try following regex:
^.*?\(([^*]*)\*\S+\)$
Here is the Regex Online Demo and here is the Java code Demo for used regex.
Explanation: a detailed breakdown of the regex used.
^ ##Matching starting of the value here.
.*?\( ##Using lazy match here to match till ( here.
( ##Creating one and only capturing group of this regex here.
[^*]* ##Matching everything till * here.
) ##Closing capturing group here.
\* ##Matching * here.
\S+ ##Matching non-spaces 1 or more occurrences here.
\)$ ##Matching literal ) here at the end of the value.
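On the Matcher.matches() remark in the question: matches() only returns true when the pattern covers the entire input, whereas find() looks for the pattern anywhere inside it. A minimal sketch of the find() route with a capture group follows; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BetweenParenAndStar {
    // Captures whatever sits between "(" and the next "*".
    private static final Pattern INNER = Pattern.compile("\\(([^*]+)\\*");

    static String extract(String input) {
        Matcher m = INNER.matcher(input);
        // find() searches for the pattern anywhere in the input;
        // matches() would require the pattern to cover the whole input.
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "1.2.3 (1.234*xY)";
        System.out.println(extract(input));                 // 1.234
        System.out.println(INNER.matcher(input).matches()); // false
    }
}
```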
I have following partial URL that can be
/it/xyz/test/param+1/param-2/1234/gfd4
Basically: two letters at the beginning, a slash, another unknown string, and then a series of repeatable strings separated by slashes.
I need to capture every string (I know a split with / delimiter would be fine but I am interested to know how can I extract with regex). I came out first with this:
^\/([a-zA-Z]{2})\/([a-zA-Z]{1,10})(\/[a-zA-Z1-9\+\-]+)
but it only captures
group1: it
group2: xyz
group3: /test
and of course it ignores the rest of the string.
If I add a * sign at the end, it only captures the last segment:
^\/([a-zA-Z]{2})\/([a-zA-Z]{1,10})(\/[a-zA-Z1-9\+\-]+)*
group1: it
group2: xyz
group3: /gfd4
So, I am obviously missing some fundamentals, so in addition to the proper regex I would like to have an explanation.
I tagged as Java because the engine which parses the regex is the JDK 7. It is my knowledge that each engine may have differences.
As mentioned here, this is expected:
With one group in the pattern, you can only get one exact result in that group.
If your capture group gets repeated by the pattern (you used the + quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.
I would rather capture the rest of the string in group 3 with (\/.*$), as in this demo, and then split it around '/'. Or apply that pattern to the rest of the string:
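// In a Java string literal "\/" and "\+" are invalid escapes; "/" and "+" need no escaping here.
// 0-9 is used instead of 1-9 so that segments containing a 0 also match.
Pattern p = Pattern.compile("(/[a-zA-Z0-9+\\-]+)");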
Matcher m = p.matcher(str);
while (m.find()) {
String place = m.group(1);
...
}
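Put together, the two-step approach above might look like the following sketch: one pattern validates the whole path and captures the repeated tail as a single group, and a second pattern walks over that tail with find(). The class and method names are hypothetical, and the use of 0-9 rather than the question's 1-9 is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlSegments {
    // Group 3 captures the whole repeated tail; the repetition itself is non-capturing.
    private static final Pattern WHOLE = Pattern.compile(
            "^/([a-zA-Z]{2})/([a-zA-Z]{1,10})((?:/[a-zA-Z0-9+\\-]+)*)$");
    // Applied to the tail to pull out one segment per find() call.
    private static final Pattern SEGMENT = Pattern.compile("/([a-zA-Z0-9+\\-]+)");

    static List<String> parse(String url) {
        List<String> parts = new ArrayList<>();
        Matcher whole = WHOLE.matcher(url);
        if (!whole.matches()) {
            return parts; // empty list when the path does not have the expected shape
        }
        parts.add(whole.group(1));
        parts.add(whole.group(2));
        Matcher seg = SEGMENT.matcher(whole.group(3));
        while (seg.find()) {
            parts.add(seg.group(1));
        }
        return parts;
    }

    public static void main(String[] args) {
        // prints [it, xyz, test, param+1, param-2, 1234, gfd4]
        System.out.println(parse("/it/xyz/test/param+1/param-2/1234/gfd4"));
    }
}
```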
I am working on a regular expression where the pattern is:
1.0.0[ - optional description]/1.0.0.0[ - optional description].txt
The [ - optional description] part is of course, optional. So some possible VALID values are
1.0.0/1.0.0.0.txt
1.0.0/1.0.0.0 - xyz.txt
1.0.0 - abc/1.0.0.0 - xyz.txt
1.0.0 - abc/1.0.0.0.txt
To be a little more robust in the pattern matching, I'd like to match zero or more spaces before and after the "-" character. So all these would be valid too.
1.0.0 - abc/1.0.0.0 - xyz.txt
1.0.0-abc/1.0.0.0-xyz.txt
1.0.0 -abc/1.0.0.0- xyz.txt
To do this matching, I have the following regular expression (Java code):
String part1 = "((\\d+.{1}\\d+.{1}\\d+)(\\s*-\\s*(.+))?)";
String part2 = "((\\d+.{1}\\d+.{1}\\d+.{1}\\d+)(\\s*-\\s*(.+))?\\.sql)";
pattern = Pattern.compile(part1+ "/" + part2);
So far this regular expression is working well. But while unit testing I found a case I can't quite figure out yet: the string contains a "-" character surrounded by one or more spaces, but there is no description after it. This would look like:
1.0.0 - /1.0.0.0.txt
1.0.0- /1.0.0.0-xyz.txt
In these cases, I want the pattern match to FAIL. But with my current regular expression the match succeeds. I think what I want is if there is a "-" character surrounded by any number of spaces like " - " then there must also be at least 1 non-space character following it. But I can't quite figure out the regex for this.
Thanks!
Something like,
^\d+\.\d+\.\d+(?:\s*-\s*\w+)?\/\d+\.\d+\.\d+\.\d+(?:\s*-\s*\w+)?\.txt$
Or you can combine the \.\d+ repetitions as
^\d+(?:\.\d+){2}(?:\s*-\s*\w+)?\/\d+(?:\.\d+){3}(?:\s*-\s*\w+)?\.txt$
Regex Demo
Changes
.{1} When you want to match something exactly once, there is no need for {1}; it's implicit.
(?:\s*-\s*\w+) Matches zero or more spaces (\s*), then -, then zero or more spaces again, and then \w+, a description of at least one word character.
The ? at the end of this patterns makes this optional.
This same pattern is repeated again at the end to match the second part.
^ Anchors the regex at the start of the string.
$ Anchors the regex at the end of the string. These two anchors are necessary so that nothing else is allowed in the string.
Don't group patterns using () unless you need to capture them, since every capturing group has to be stored. Use (?:..) if you want to group patterns without capturing them.
In the group that matches the optional part, you need to replace .+ with \\S+ where \S means any non-whitespace character. This enforces the optional part to include non-whitespace character in order to match the pattern:
String part1
= "((\\d+\\.\\d+\\.\\d+)(\\s*-\\s*(\\S+))?)";
String part2
= "((\\d+\\.\\d+\\.\\d+\\.\\d+)(\\s*-\\s*(\\S+))?\\.txt)";
Also note that .{1} (which is the same as just .) matches any character. From the examples, you want to match a dot, so it should be replaced with \.
Something like
^\d+\.\d+\.\d+(?:\s*-\s*[^\/\s]+)?\/\d+\.\d+\.\d+\.\d+?(?:\s*-\s*[^.\s]+)?\.\w+$
Check it out here at regex101.
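To check the valid and invalid samples from the question against one of these patterns, a small harness along the following lines can help. It is a sketch only: the class and method names are hypothetical, and using \w+ for the description assumes a description is a single run of word characters, as in the samples.

```java
import java.util.regex.Pattern;

class VersionPathCheck {
    // Optional " - description" part: the dash may be padded with spaces,
    // but at least one word character must follow it.
    private static final String DESC = "(?:\\s*-\\s*\\w+)?";

    private static final Pattern PATH = Pattern.compile(
            "\\d+(?:\\.\\d+){2}" + DESC + "/\\d+(?:\\.\\d+){3}" + DESC + "\\.txt");

    static boolean isValid(String s) {
        // matches() anchors at both ends, so ^ and $ are not needed in the pattern.
        return PATH.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("1.0.0 -abc/1.0.0.0- xyz.txt")); // true
        System.out.println(isValid("1.0.0 - /1.0.0.0.txt"));        // false
        System.out.println(isValid("1.0.0- /1.0.0.0-xyz.txt"));     // false
    }
}
```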
I would like to find all possible occurrences of text enclosed between two ~s.
For example: For the text ~*_abc~xyz~ ~123~, I want the following expressions as matching patterns:
~*_abc~
~xyz~
~123~
Note it can be an alphabet or a digit.
I tried with the regex ~[\w]+?~ but it is not giving me ~xyz~. I want ~ to be reconsidered. But I don't want just ~~ as a possible match.
Use capturing inside a positive lookahead with the following regex:
Sometimes, you need several matches within the same word. For instance, suppose that from a string such as ABCD you want to extract ABCD, BCD, CD and D. You can do it with this single regex:
(?=(\w+))
At the first position in the string (before the A), the engine starts the first match attempt. The lookahead asserts that what immediately follows the current position is one or more word characters, and captures these characters to Group 1. The lookahead succeeds, and so does the match attempt. Since the pattern didn't match any actual characters (the lookahead only looks), the engine returns a zero-width match (the empty string). It also returns what was captured by Group 1: ABCD
The engine then moves to the next position in the string and starts the next match attempt. Again, the lookahead asserts that what immediately follows that position is word characters, and captures these characters to Group 1. The match succeeds, and Group 1 contains BCD.
The engine moves to the next position in the string, and the process repeats itself for CD then D.
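The ABCD walk-through above can be reproduced in Java with a short sketch like this one (the class and method names are hypothetical). Each find() call yields a zero-width match, and group 1 holds what the lookahead captured at that position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingSuffixes {
    static List<String> suffixes(String word) {
        // The lookahead consumes nothing, so the engine advances one
        // position at a time and the captured substrings overlap.
        Matcher m = Pattern.compile("(?=(\\w+))").matcher(word);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(suffixes("ABCD")); // [ABCD, BCD, CD, D]
    }
}
```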
So, use
(?=(~[^\s~]+~))
See the regex demo
The pattern (?=(~[^\s~]+~)) checks each position inside a string and searches for ~ followed with 1+ characters other than whitespace and ~ and then followed with another ~. Since the index is moved only after a position is checked, and not when the value is captured, overlapping substrings get extracted.
Java demo:
String text = " ~*_abc~xyz~ ~123~";
Pattern p = Pattern.compile("(?=(~[^\\s~]+~))");
Matcher m = p.matcher(text);
List<String> res = new ArrayList<>();
while(m.find()) {
res.add(m.group(1));
}
System.out.println(res); // => [~*_abc~, ~xyz~, ~123~]
Just in case someone needs a Python demo:
import re
p = re.compile(r'(?=(~[^\s~]+~))')
test_str = " ~*_abc~xyz~ ~123~"
print(p.findall(test_str))
# => ['~*_abc~', '~xyz~', '~123~']
Try this [^~\s]*
This pattern excludes the characters ~ and whitespace (referred to as \s).
I have tested it, it works on your string, here's the demo.
I have this string:
String myString ="A~BC~FGH~~zuzy|XX~ 1234~ ~~ABC~01/01/2010 06:30~BCD~01/01/2011 07:45";
and I need to extract these 3 substrings
1234
06:30
07:45
If I use this regex \\d{2}\:\\d{2} I'm only able to extract the first hour 06:30
Pattern depArrHours = Pattern.compile("\\d{2}\\:\\d{2}");
Matcher matcher = depArrHours.matcher(myString);
String firstHour = matcher.group(0);
String secondHour = matcher.group(1); // IndexOutOfBoundsException: No group 1
matcher.group(1) throws an exception.
Also I don't know how to extract 1234. This string can change but it always comes after 'XX~ '
Do you have any idea on how to match these strings with regex expressions?
UPDATE
Thanks to Adam's suggestion, I now have this regex that matches my string:
Pattern p = Pattern.compile(".*XX~ (\\d{3,4}).*(\\d{1,2}:\\d{2}).*(\\d{1,2}:\\d{2})");
I get the number and the two hours with matcher.group(1), matcher.group(2) and matcher.group(3).
The matcher.group() function expects a single integer argument: the capturing group index, starting from 1. The index 0 is special and means "the entire match". A capturing group is created using a pair of parentheses "(...)". Anything within the parentheses is captured. Groups are numbered from left to right (again, starting from 1) by their opening parenthesis (which means that groups can be nested). Since there are no parentheses in your regular expression, there can be no group 1.
The javadoc on the Pattern class covers the regular expression syntax.
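As a small illustration of the numbering rule, the following sketch uses a made-up pattern with one outer group and two nested ones. The class name, method name, and sample input are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumbering {
    static List<String> groups(String input) {
        // Opening parentheses, left to right: group 1 is the outer pair,
        // group 2 the first number, group 3 the second number.
        Matcher m = Pattern.compile("((\\d+)-(\\d+))").matcher(input);
        List<String> out = new ArrayList<>();
        if (m.find()) {
            // Index 0 is the entire match; groupCount() does not include it.
            for (int i = 0; i <= m.groupCount(); i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(groups("range 10-20 here")); // [10-20, 10-20, 10, 20]
    }
}
```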
If you are looking for a pattern that might recur some number of times, you can use Matcher.find() repeatedly until it returns false. Matcher.group(0) once on each iteration will then return what matched that time.
If you want to build one big regular expression that matches everything all at once (which I believe is what you want), then put a set of capturing parentheses around each of the three things that you want to capture, call Matcher.matches(), and then call Matcher.group(n) where n is 1, 2 and 3 respectively. Of course Matcher.matches() might also return false, in which case the pattern did not match and you can't retrieve any of the groups.
In your example, what you probably want to do is have it match some preceding text, then start a capturing group, match the digits, end the capturing group, and so on. I don't know enough about your exact input format, but here is an example.
Let's say I had strings of the form:
Eat 12 carrots at 12:30
Take 3 pills at 01:15
And I wanted to extract the quantity and times. My regular expression would look something like:
"\w+ (\d+) [\w ]+ (\d{2}:\d{2})"
The code would look something like:
Pattern p = Pattern.compile("\\w+ (\\d+) [\\w ]+ (\\d{2}:\\d{2})");
Matcher m = p.matcher(oneline);
if(m.matches()) {
System.out.println("The quantity is " + m.group(1));
System.out.println("The time is " + m.group(2));
}
The regular expression means "a string containing a word, a space, one or more digits (which are captured in group 1), a space, a set of words and spaces ending with a space, followed by a time (captured in group 2)". The time pattern assumes that the hour is always 0-padded to 2 digits. I would give an example closer to what you are looking for, but the description of the possible input is a little vague.
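For the string in the question, the repeated-find() approach described above could be sketched as follows. The class and method names are hypothetical, and the sketch assumes that the number always follows the literal text "XX~ " and that times are always written as two digits, a colon, and two digits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldExtractor {
    // The number is captured in group 1, after the fixed "XX~ " prefix.
    private static final Pattern NUMBER = Pattern.compile("XX~ (\\d+)");
    // No capturing group is needed here: group() returns the whole match.
    private static final Pattern TIME = Pattern.compile("\\b\\d{2}:\\d{2}\\b");

    static List<String> extract(String s) {
        List<String> out = new ArrayList<>();
        Matcher number = NUMBER.matcher(s);
        if (number.find()) {
            out.add(number.group(1));
        }
        Matcher time = TIME.matcher(s);
        // Each successful find() moves on to the next occurrence.
        while (time.find()) {
            out.add(time.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String myString = "A~BC~FGH~~zuzy|XX~ 1234~ ~~ABC~01/01/2010 06:30~BCD~01/01/2011 07:45";
        System.out.println(extract(myString)); // [1234, 06:30, 07:45]
    }
}
```

Searching for each time separately also sidesteps a subtlety of the single big pattern from the update: a greedy .* in front of \d{1,2} leaves only one digit for the hour, so the groups there would contain 6:30 and 7:45 rather than 06:30 and 07:45.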