How to know which part of regex matched? - java

regex= (i.*d.*n.*t.*)|(p.*r.*o.*f.*)|(u.*s.*r.*)
string to be matched= profile
The regex matches the string, but I want to know which part matched.
That is, I want (p.*r.*o.*f.*) as the output.
How can I do this in Java?

You can check which group matched:
Pattern p = Pattern.compile("(i.*d.*n.*t.*)|(p.*r.*o.*f.*)|(u.*s.*r.*)");
Matcher m = p.matcher("profile");
m.find();
for (int i = 1; i <= m.groupCount(); i++) {
System.out.println(i + ": " + m.group(i));
}
Will output:
1: null
2: profile
3: null
Because the second line is not null, it's (p.*r.*o.*f.*) that matched the string.
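As a variation on the loop above, named groups let you read off the matching alternative by name rather than by index. This is only a sketch: the group names (ident, profile, user), the class name and the helper method are made up for illustration and are not part of the original question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhichAlternative {
    // Same three alternatives as in the question; the group names are hypothetical.
    private static final Pattern P = Pattern.compile(
            "(?<ident>i.*d.*n.*t.*)|(?<profile>p.*r.*o.*f.*)|(?<user>u.*s.*r.*)");
    private static final String[] NAMES = {"ident", "profile", "user"};

    /** Returns the name of the alternative that matched, or null if none did. */
    static String matchedAlternative(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return null; // no alternative matched anywhere in the input
        }
        for (String name : NAMES) {
            // A group that did not take part in the match returns null.
            if (m.group(name) != null) {
                return name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(matchedAlternative("profile")); // profile
        System.out.println(matchedAlternative("xyz"));     // null
    }
}
```

Named groups need Java 7 or later; on older versions the index-based loop is the way to go.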

In your case, it seems you can distinguish the subpatterns by their first letter: if the match starts with 'p', it came from your desired pattern. You could write a simple function that tells them apart this way.

Related

Search substring in a string using regex

I'm trying to search a string for a set of words contained in an ArrayList (terms_1pers). Since the precondition is that there must be no letters directly before or after the search word, I thought of using a regular expression.
I just don't know what I'm doing wrong with the matches method. In the code below, if no match is found, the record is written to an external file.
String url = csvRecord.get("url");
String text = csvRecord.get("review");
String var = null;
for(String term : terms_1pers)
{
if(!text.matches("[^a-z]"+term+"[^a-z]"))
{
var="true";
}
}
if(!var.equals("true"))
{
bw.write(url+";"+text+"\n");
}
In order to find regex matches, you should use the regex classes, Pattern and Matcher.
String term = "term";
ArrayList<String> a = new ArrayList<String>();
a.add("123term456"); //true
a.add("A123Term5"); //false
a.add("term456"); //true
a.add("123term"); //true
Pattern p = Pattern.compile("^[^A-Za-z]*(" + term + ")[^A-Za-z]*$");
for(String text : a) {
Matcher m = p.matcher(text);
if (m.find()) {
System.out.println("Found: " + m.group(1) );
//the term is wrapped in the first (and only) capturing group, so you're looking for group(1)
}
else System.out.println("No match for: " + term);
}
In the example above, we create an instance of Pattern (https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) and use a Matcher to find matches in the text you are matching against.
Note that I adjusted the regex a bit. The character class in this code excludes all letters A-Z and their lowercase versions from the parts before and after the term. It also allows for situations where there are no characters at all before or after the term. If you need to have something there, use + instead of *. I also used ^ and $ to anchor the pattern to the start and end of the text, so the whole text must consist of these three parts (non-letters, the term, non-letters). If this doesn't fit your use case, you may need to adjust.
To demonstrate using this with a variety of different terms:
ArrayList<String> terms = new ArrayList<String>();
terms.add("term");
terms.add("the book is on the table");
terms.add("1981 was the best year ever!");
ArrayList<String> a = new ArrayList<String>();
a.add("123term456");
a.add("A123Term5");
a.add("the book is on the table456");
a.add("1##!231981 was the best year ever!9#");
for (String term: terms) {
Pattern p = Pattern.compile("^[^A-Za-z]*(" + term + ")[^A-Za-z]*$");
for(String text : a) {
Matcher m = p.matcher(text);
if (m.find()) {
System.out.println("Found: " + m.group(1) + " in " + text);
//the term is wrapped in the first (and only) capturing group, so you're looking for group(1)
}
else System.out.println("No match for: " + term + " in " + text);
}
}
Output for this is:
Found: term in 123term456
No match for: term in A123Term5
No match for: term in the book is on the table456....
In response to the question about making String term case insensitive, here's a way to build the regex string by using java.lang.Character to add alternatives for the upper and lower case form of each letter.
String term = "This iS the teRm.";
String matchText = "123This is the term.";
StringBuilder str = new StringBuilder();
str.append("^[^A-Za-z]*(");
for (int i = 0; i < term.length(); i++) {
char c = term.charAt(i);
if (Character.isLetter(c))
str.append("(" + Character.toLowerCase(c) + "|" + Character.toUpperCase(c) + ")");
else str.append(c);
}
str.append(")[^A-Za-z]*$");
System.out.println(str.toString());
Pattern p = Pattern.compile(str.toString());
Matcher m = p.matcher(matchText);
if (m.find()) System.out.println("Found!");
else System.out.println("Not Found!");
This code prints two lines. The first is the regex string that is compiled into the Pattern: "^[^A-Za-z]*((t|T)(h|H)(i|I)(s|S) (i|I)(s|S) (t|T)(h|H)(e|E) (t|T)(e|E)(r|R)(m|M).)[^A-Za-z]*$". This adjusted regex lets the letters in the term match regardless of case. Note that the trailing . of the term is copied unescaped, so it matches any character. The second line is "Found!" because the mixed-case term is found within matchText.
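A shorter route to the same result is to let the regex engine handle case, via the Pattern.CASE_INSENSITIVE flag, instead of expanding every letter by hand. The sketch below assumes the same "non-letters, term, non-letters" shape as above; the class and method names are invented for illustration, and Pattern.quote is added so that characters such as . in the term are taken literally.

```java
import java.util.regex.Pattern;

public class CaseInsensitiveTerm {
    /**
     * True if text consists of optional non-letters, then the term
     * (compared case-insensitively), then optional non-letters.
     */
    static boolean containsTerm(String term, String text) {
        Pattern p = Pattern.compile(
                "^[^A-Za-z]*(" + Pattern.quote(term) + ")[^A-Za-z]*$",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsTerm("This iS the teRm.", "123This is the term.")); // true
        System.out.println(containsTerm("term", "A123Term5"));                         // false
    }
}
```

By default CASE_INSENSITIVE only covers ASCII letters; combining it with Pattern.UNICODE_CASE extends it to other alphabets.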
There are several things to note:
matches() requires the whole string to match, so [^a-z]term[^a-z] will only match a string like ":term.". You need to use find() to find partial matches.
If you pass a literal string into a regex, you need to Pattern.quote it; otherwise, if it contains special characters, it will not be matched as written.
To check what comes before or after a word, including at the start/end of the string, you should either use alternations with anchors (like (?:^|[^a-z]) or (?:$|[^a-z])) or lookarounds, (?<![a-z]) and (?![a-z]).
To match any ASCII letter just use \p{Alpha} or, if you plan to match any Unicode letter, \p{L}.
It is more logical to make the var variable a Boolean.
Fixed code:
String url = csvRecord.get("url");
String text = csvRecord.get("review");
Boolean var = false;
for(String term : terms_1pers)
{
Matcher m = Pattern.compile("(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})").matcher(text);
// If the search must be case insensitive use
// Matcher m = Pattern.compile("(?i)(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})").matcher(text);
if(!m.find())
{
var = true;
}
}
if (!var) {
bw.write(url+";"+text+"\n");
}
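To make the difference between matches() and find() concrete, here is a small self-contained sketch of the lookaround approach. The class name, the helper method and the sample strings are assumptions chosen for illustration; only the pattern itself comes from the fixed code above.

```java
import java.util.regex.Pattern;

public class WholeWordSearch {
    /** True if term occurs in text with no Unicode letter directly before or after it. */
    static boolean containsWord(String text, String term) {
        Pattern p = Pattern.compile("(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        // matches() must cover the whole string, so this is false:
        System.out.println("I love it".matches("[^a-z]love[^a-z]")); // false
        // find() looks for the pattern anywhere in the string:
        System.out.println(containsWord("I love it", "love"));       // true
        System.out.println(containsWord("gloves", "love"));          // false
        // Lookarounds also succeed at the very start and end of the text:
        System.out.println(containsWord("love", "love"));            // true
    }
}
```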
You did not consider the case where the text before and after the term may itself contain letters, so adding .* at the front and end should solve your problem. Comparing with "true".equals(var) also avoids a NullPointerException when no term matched and var is still null.
for(String term : terms_1pers)
{
if( text.matches(".*[^a-zA-Z]+" + term + "[^a-zA-Z]+.*") )
{
var="true";
break; //exit the loop
}
}
if(!"true".equals(var))
{
bw.write(url+";"+text+"\n");
}

Regex not capturing matching in expected groups

I have been working on a requirement for which I need to create a regex for the following string:
startDate:[2016-10-12T12:23:23Z:2016-10-12T12:23:23Z]
There can be many variations of this string as follows:
startDate:[*;2016-10-12T12:23:23Z]
startDate:[2016-10-12T12:23:23Z;*]
startDate:[*;*]
startDate in the above expression is a key name, which can be anything (endDate, updateDate, etc.), so we can't hardcode it in the expression. The key name can be accepted as any word, though: [a-zA-Z_0-9]*
I am using the following compiled pattern
Pattern.compile("([[a-zA-Z_0-9]*):(\\[[[\\*]|[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}[Z]];[[\\*]|[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}[Z]]\\]])");
The pattern matches but the groups created are not what I expect. I want the group surrounded by parenthesis below:
(startDate):([*:2016-10-12T12:23:23Z])
group1 = "startDate"
group2 = "[*;2016-10-12T12:23:23Z]"
Could you please help me with correct expression in Java and groups?
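You are using [ rather than ( to wrap your alternatives (the parts separated by |). Square brackets define a character class, which matches a single character, not a group.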
For example, the following code works for me:
Pattern pattern = Pattern.compile("(\\w+):(\\[(\\*|\\d{4}):\\*\\])");
Matcher matcher = pattern.matcher(text);
if (matcher.matches()) {
for (int i = 0; i < matcher.groupCount() + 1; i++) {
System.out.println(i + ":" + matcher.group(i));
}
} else {
System.out.println("no match");
}
To simplify things I just use the year but I'm sure it'll work with the full timestamp string.
This expression captures more than you need in groups but you can make them 'non-capturing' using the (?: ) construct.
Notice in this that I simplified some of your regexp using the predefined character classes. See http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html for more details.
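The point about non-capturing groups can be sketched with the full timestamp. This is an illustration under assumptions: it uses ; as the separator, as in the variations listed in the question, and the class name, constant names and parse helper are made up here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateRangeGroups {
    // One timestamp such as 2016-10-12T12:23:23Z
    private static final String TS = "\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}Z";

    // (?: ... ) groups the alternatives without creating a capturing group,
    // so only the key (group 1) and the bracketed range (group 2) are captured.
    private static final Pattern RANGE = Pattern.compile(
            "(\\w+):(\\[(?:\\*|" + TS + ");(?:\\*|" + TS + ")\\])");

    /** Returns {key, range}, or null if the input does not match. */
    static String[] parse(String input) {
        Matcher m = RANGE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse("startDate:[*;2016-10-12T12:23:23Z]");
        System.out.println("group1 = " + parts[0]); // group1 = startDate
        System.out.println("group2 = " + parts[1]); // group2 = [*;2016-10-12T12:23:23Z]
    }
}
```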
Here is a solution which uses your original regex, modified so that it actually returns the groups you want:
String content = "startDate:[2016-10-12T12:23:23Z:2016-10-12T12:23:23Z]";
Pattern pattern = Pattern.compile("([a-zA-Z_0-9]*):(\\[(?:\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}Z|\\*):(?:\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}Z|\\*)\\])");
Matcher matcher = pattern.matcher(content);
// remember to call find() at least once before trying to access groups
matcher.find();
System.out.println("group1 = " + matcher.group(1));
System.out.println("group2 = " + matcher.group(2));
Output:
group1 = startDate
group2 = [2016-10-12T12:23:23Z:2016-10-12T12:23:23Z]
This code has been tested on IntelliJ and appears to be working correctly.

Java pattern for [j-*]

Please help me with pattern matching. I want to build a pattern that matches the bracketed words starting with j- or c- in a string such as the following:
[j-test] is a [c-test]'s name with [foo] and [bar]
The pattern needs to find [j-test] and [c-test] (brackets inclusive).
What I have tried so far?
String template = "[j-test] is a [c-test]'s name with [foo] and [bar]";
Pattern patt = Pattern.compile("\\[[*[j|c]\\-\\w\\-\\+\\d]+\\]");
Matcher m = patt.matcher(template);
while (m.find()) {
System.out.println(m.group());
}
And it's giving output like
[j-test]
[c-test]
[foo]
[bar]
which is wrong. Please help me, thanks for your time on this thread.
Inside a character class, you don't need alternation to match j or c. A character class by itself means "match any single character from the ones listed", so [jc] matches either j or c.
Also, you don't need to spell out what follows j- or c-, since you only care that the word starts with j- or c-.
Simply use this pattern:
Pattern patt = Pattern.compile("\\[[jc]-[^\\]]*\\]");
To explain:
Pattern patt = Pattern.compile("(?x) " // Embedded flag for Pattern.COMMENTS
+ "\\[ " // Match starting `[`
+ " [jc] " // Match j or c
+ " - " // then a hyphen
+ " [^ " // A negated character class
+ " \\]" // Match any character except ]
+ " ]* " // 0 or more times
+ "\\] "); // till the closing ]
With the (?x) flag, whitespace in the regex is ignored. This is often helpful for writing readable regexes.
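For completeness, here is the suggested pattern run against the template from the question. It is a minimal sketch; the class name and the findTags helper are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BracketTags {
    // "[" then j or c, a hyphen, any run of characters other than "]", then "]"
    private static final Pattern TAG = Pattern.compile("\\[[jc]-[^\\]]*\\]");

    /** Collects every [j-...] or [c-...] token, brackets included. */
    static List<String> findTags(String template) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(template);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        String template = "[j-test] is a [c-test]'s name with [foo] and [bar]";
        System.out.println(findTags(template)); // [[j-test], [c-test]]
    }
}
```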

regular expression for file name

I have files in the format C:\Temp\myfile_124.txt
I need a regular expression which will give me just the number "124" that is whatever is there after the underscore and before the extension.
I tried a number of ways, latest is
(.+[0-9]{18,})(_[0-9]+)?\\.txt$
I am not getting the desired output. Can someone tell me what is wrong?
Matcher matcher = FILE_NAME_PATTERN.matcher(filename);
if (matcher.matches() && matcher.groupCount() == 2) {
try {
String index = matcher.group(2);
if (index != null) {
return Integer.parseInt(index.substring(1));
}
}
catch (NumberFormatException e) {
}
The first part, [0-9]{18,}, requires at least 18 digits, which your file names don't have.
With regex it's usually a good idea to make the expression as simple as possible. I suggest trying
_([0-9]+)?\\.txt$
Note: you have to call find() to make the matcher perform the lookup; otherwise accessing a group throws an IllegalStateException with the message "No match found".
This example
String s = "C:\\Temp\\myfile_124.txt";
Pattern p = Pattern.compile("_(\\d+)\\.txt$");
Matcher matcher = p.matcher(s);
if (matcher.find())
for (int i = 0; i <= matcher.groupCount(); i++)
System.out.println(i + ": " + matcher.group(i));
prints
0: _124.txt
1: 124
This may work for you: (?:.*_)(\d+)\.txt
The result is in match group 1.
This one uses positive lookahead and will only match the number: \d+(?=\.txt)
.*_([0-9]+)\.[a-zA-Z0-9]+
Group 1 will contain the desired output.
You can do this (sed-style syntax; in Java, write the capturing parentheses unescaped):
^.*_\([^\.]*\)\..*$
.*_([0-9]+)\.txt
This should work too. Of course you should double the backslashes in a Java string literal.
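Putting the pieces of this thread together, a helper for the asker's use case might look like the sketch below. The class name, method name and the choice of OptionalInt as return type are assumptions made for illustration; the pattern is the one from the example that prints "1: 124".

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FileIndex {
    // Digits between the last underscore and the .txt extension
    private static final Pattern INDEX = Pattern.compile("_(\\d+)\\.txt$");

    /** Returns the number after the underscore, or empty if there is none. */
    static OptionalInt indexOf(String filename) {
        Matcher m = INDEX.matcher(filename);
        if (!m.find()) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(m.group(1)));
        } catch (NumberFormatException e) {
            // Too many digits to fit in an int
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(indexOf("C:\\Temp\\myfile_124.txt").getAsInt()); // 124
        System.out.println(indexOf("C:\\Temp\\myfile.txt").isPresent());    // false
    }
}
```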

Java regex skipping matches

I have some text; I want to extract pairs of words that are not separated by punctuation. This is the code:
//n-grams
Pattern p = Pattern.compile("[a-z]+");
if (n == 2) {
p = Pattern.compile("[a-z]+ [a-z]+");
}
if (n == 3) {
p = Pattern.compile("[a-z]+ [a-z]+ [a-z]+");
}
Matcher m = p.matcher(text.toLowerCase());
ArrayList<String> result = new ArrayList<String>();
while (m.find()) {
String temporary = m.group();
System.out.println(temporary);
result.add(temporary);
}
The problem is that it skips some matches. For example
"My name is James"
, for n = 3, must match
"my name is" and "name is james"
, but instead it matches just the first. Is there a way to solve this?
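You can capture overlapping matches with a capturing group inside a lookahead:
(?=(\b[a-z]+\b \b[a-z]+\b \b[a-z]+\b))
The lookahead consumes no characters, so find() succeeds twice and group 1 holds a different trigram each time. In your case:
Match 1, group 1 -> my name is
Match 2, group 1 -> name is james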
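The lookahead idea generalises to any n. The following is a sketch rather than a drop-in replacement for the asker's code: the class name and nGrams method are invented here, and the pattern is built so that words are separated by exactly one space, as in the original patterns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NGrams {
    /** Returns all runs of n lowercase words separated by single spaces, overlaps included. */
    static List<String> nGrams(String text, int n) {
        // One word followed by (n - 1) more " word" parts
        String words = "[a-z]+(?: [a-z]+){" + (n - 1) + "}";
        // The capturing group sits inside a lookahead, so nothing is consumed
        // and the next search can start inside the previous n-gram.
        Pattern p = Pattern.compile("(?=(\\b" + words + "\\b))");
        Matcher m = p.matcher(text.toLowerCase());
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(nGrams("My name is James", 3)); // [my name is, name is james]
        System.out.println(nGrams("My name is James", 2)); // [my name, name is, is james]
    }
}
```

The leading \b matters: without it, the lookahead would also succeed in the middle of a word.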
A regular expression is applied to the string from left to right, and once a source character has been consumed by a match, it can't be reused by a later match.
For example, the regex "121" matches "31212142121" only twice, as "_121____121".
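The "121" example can be checked directly. In this small sketch (the class and the count helper are made up for illustration), the plain pattern finds two matches because each match consumes its characters, while a zero-width lookahead also sees the overlapping occurrence.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OverlapDemo {
    /** Counts how many times find() succeeds for regex on input. */
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "31212142121";
        System.out.println(count("121", input));     // 2 (the overlapping occurrence is skipped)
        System.out.println(count("(?=121)", input)); // 3 (lookahead consumes nothing)
    }
}
```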
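I tend to use the argument to the find() method of Matcher. Note that the pattern then needs a leading \b; otherwise restarting inside a word produces partial-word matches such as "y name is":
Pattern p = Pattern.compile("\\b[a-z]+ [a-z]+ [a-z]+\\b");
Matcher m = p.matcher(text.toLowerCase());
int position = 0;
while (m.find(position)) {
String temporary = m.group();
position = m.start();
System.out.println(position + ":" + temporary);
position++; // restart one character after the start of the last match
}
So after each iteration, it searches again starting one character after the last start index.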
Hope that helped!
