Pattern java Finding out what part of OR matched


I have the following pattern:
Pattern TAG = Pattern.compile("(<[\\w]+>)|(</[\\w]+>)");
Please note the | in the pattern.
And I have a method that does some processing with this pattern
private String format(String s){
Matcher m = TAG.matcher(s);
StringBuffer sb = new StringBuffer();
while(m.find()){
//This is where I need to find out what part
//of | (or) matched in the pattern
// to perform additional processing
}
return sb.toString();
}
I would like to perform different processing depending on which part of the OR matched in the regex. I know that I can break the pattern up into two separate patterns and match on each, but that is not the solution I am looking for: my actual regex is much more complex, and what I am trying to accomplish would work best in a single loop with a single regex. So my question is:
Is there a way in Java to find out which part of the OR matched in the regex?
EDIT
I am also aware of the m.group() functionality. It does not work for my case. The example below
prints out <TAG> and </TAG>. So on the first iteration of the loop it matches <[\\w]+>
and on the second iteration it matches </[\\w]+>. However, I need to know which part matched on each iteration.
static Pattern u = Pattern.compile("<[\\w]+>|</[\\w]+>");
public static void main(String[] args) {
String xml = "<TAG>044453</TAG>";
Matcher m = u.matcher(xml);
while (m.find()) {
System.out.println(m.group(0));
}
}

Take a look at the group() method on Matcher, you can do something like this:
if (m.group(1) != null) {
// The first grouped parenthesized section matched
}
else if (m.group(2) != null) {
// The second grouped parenthesized section matched
}
EDIT: reverted to original group numbers - the extra parens were not needed. This should work with a pattern like:
static Pattern TAG = Pattern.compile("(<[\\w]+>)|(</[\\w]+>)");

You should rewrite your patterns by factoring out common parts:
xy|xz => x(y|z)
xy|x => xy?
yx|x => y?x
Then, by putting interesting parts like y? in parentheses you can check whether it is set or not with group().
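As a minimal sketch of the factoring idea applied to the tag pattern from the question: "<\\w+>|</\\w+>" can be rewritten as "<(/)?\\w+>", so that the optional slash becomes group 1. The class and method names below are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FactoredTagDemo {
    // "<\\w+>|</\\w+>" factored as "<(/)?\\w+>": the optional slash is group 1.
    private static final Pattern TAG = Pattern.compile("<(/)?\\w+>");

    // Returns "open" or "close" for the first tag in s, or null if s has no tag.
    static String kindOfFirstTag(String s) {
        Matcher m = TAG.matcher(s);
        if (!m.find()) {
            return null;
        }
        // An optional group that did not take part in the match is null.
        return m.group(1) == null ? "open" : "close";
    }

    public static void main(String[] args) {
        System.out.println(kindOfFirstTag("<TAG>044453")); // open
        System.out.println(kindOfFirstTag("</TAG>"));      // close
    }
}
```

Note that the quantifier sits outside the parentheses here, so the group is null when the slash is absent. With (/?) the group would be an empty string instead.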

You don't have to use [] with \\w since it is already a character class. You can also surround each alternative of the OR in parentheses to be able to use them as groups (a group that did not match returns null). So your code can look like this:
static Pattern u = Pattern.compile("(<\\w+>)|(</\\w+>)");
public static void main(String[] args) {
String xml = "<TAG>044453</TAG>";
Matcher m = u.matcher(xml);
while (m.find()) {
if (m.group(1)!=null){// <- group 1 (<\\w+>)
System.out.println("I found <...> tag: "+m.group(0));
}else{ // if it wasn't (<\\w+>) then it had to be (</\\w+>) that was matched
System.out.println("I found </...> tag: "+m.group(0));
}
}
}
You can also change the pattern slightly to <(/?)\\w+>, making the / part optional and placing it in parentheses (which in this case makes it group 1). This way, if the tag has no /, group 1 will contain only the empty string "", so you can change the logic to something like
if ("".equals(m.group(1))) {//
System.out.println("I found <...> tag: " + m.group(0));
} else {
System.out.println("I found </...> tag: " + m.group(0));
}
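If the real pattern has many alternatives, counting group numbers gets error-prone. One possible variation, shown here only as a sketch, is to give each alternative a named group (supported since Java 7) and check those by name. The group names open and close and the class NamedGroupDemo are my own choices, not something from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    // Each alternative of the OR gets its own named group.
    private static final Pattern TAG =
            Pattern.compile("(?<open><\\w+>)|(?<close></\\w+>)");

    // Emits one word per tag found, depending on which alternative matched.
    static String label(String s) {
        Matcher m = TAG.matcher(s);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            if (m.group("open") != null) {
                sb.append("OPEN ");
            } else {
                // Only two alternatives exist, so this must be "close".
                sb.append("CLOSE ");
            }
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(label("<TAG>044453</TAG>")); // OPEN CLOSE
    }
}
```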

Related

Check if a String satisfies a regex

I have a List of String and I want to filter out the String that doesn't match a regex pattern
Input List = Orthopedic,Orthopedic/Ortho,Length(in.)
My code
for(String s : keyList){
Pattern p = Pattern.compile("[a-zA-Z0-9-_]");
Matcher m = p.matcher(s);
if (!m.find()){
System.out.println(s);
}
}
I expect the 2nd and 3rd string to be printed as they do not match the regex. But it is not printing anything

Explanation
You are not matching the entire input. Instead, you are trying to find the next matching part in the input. From the documentation of Matcher#find:
Attempts to find the next subsequence of the input sequence that matches the pattern.
So your code will match an input if at least one character is one of a-zA-Z0-9-_.
Solution
If you want to match the whole region you should use Matcher#matches (documentation):
Attempts to match the entire region against the pattern.
And you probably want to adjust your pattern to allow multiple characters, for example by a pattern like
[a-zA-Z0-9-_]+
The + allows one or more repetitions of the pattern (? is 0 to 1 and * is 0 or more).
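To make the difference concrete, here is a small sketch that runs both approaches on the strings from the question. The class and method names are hypothetical, and I put the - last in the character class so that it is unambiguously a literal.

```java
import java.util.regex.Pattern;

class FindVsMatchesDemo {
    // Single character from the allowed set.
    private static final Pattern ONE_CHAR = Pattern.compile("[a-zA-Z0-9_-]");
    // One or more characters from the allowed set.
    private static final Pattern WHOLE = Pattern.compile("[a-zA-Z0-9_-]+");

    // find(): true as soon as any part of s matches.
    static boolean containsAllowedChar(String s) {
        return ONE_CHAR.matcher(s).find();
    }

    // matches(): true only if the entire string matches.
    static boolean isEntirelyAllowed(String s) {
        return WHOLE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(containsAllowedChar("Length(in.)"));    // true
        System.out.println(isEntirelyAllowed("Length(in.)"));      // false
        System.out.println(isEntirelyAllowed("Orthopedic/Ortho")); // false
        System.out.println(isEntirelyAllowed("Orthopedic"));       // true
    }
}
```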
Notes
You have an extra - at the end of your pattern. You probably want to remove that. Or, if you intended to match the character literally, you need to escape it:
[a-zA-Z0-9\\-_]+
You can test your regex on sites like regex101.com, here's your pattern: regex101.com/r/xvT8V0/1.
Note that there is also String#matches (documentation). So you could write more compact code by just using s.matches("[a-zA-Z0-9_]+").
Also note that you can shortcut character sets like [a-zA-Z0-9_] by using predefined sets. The set \w (word character) matches exactly your desired pattern.
Since the pattern doesn't change, you might want to compile it outside of the loop to slightly increase performance; only the matcher has to be created for each string.
Code
All in all your code might then look like:
Pattern p = Pattern.compile("[a-zA-Z0-9_]+");
for (String s : keyList) {
// the matcher depends on s, so it is created inside the loop
Matcher m = p.matcher(s);
if (!m.matches()) {
System.out.println(s);
}
}
Or compact:
for (String s : keyList) {
if (!s.matches("\\w+")) {
System.out.println(s);
}
}
Using streams:
keyList.stream()
.filter(s -> !s.matches("\\w+"))
.forEach(System.out::println);

You shouldn't construct a Pattern in a loop, you currently only match a single character, and you can use !String.matches(String) and a filter() operation. Like,
List<String> keyList = Arrays.asList("Orthopedic", "Orthopedic/Ortho", "Length(in.)");
keyList.stream().filter(x -> !x.matches("[a-zA-Z0-9-_]+"))
.forEachOrdered(System.out::println);
Outputs (as requested)
Orthopedic/Ortho
Length(in.)
Or, using the Pattern, like
List<String> keyList = Arrays.asList("Orthopedic", "Orthopedic/Ortho", "Length(in.)");
Pattern p = Pattern.compile("[a-zA-Z0-9-_]+");
keyList.stream().filter(x -> !p.matcher(x).matches()).forEachOrdered(System.out::println);

There are two problems:
1) the regular expression is wrong, it matches just one character.
2) you need to use m.matches() instead of m.find().

You can use matches instead of find:
//Added the + at the end and removed the extra -
Pattern p = Pattern.compile("[a-zA-Z0-9_]+");
for(String s : keyList){
Matcher m = p.matcher(s);
if (!m.matches()){
System.out.println(s);
}
}
Also note that the point of compiling a pattern is to reuse it, so put it outside the loop. Otherwise you may as well use:
for(String s : keyList){
if (!s.matches("[a-zA-Z0-9_]+")){
System.out.println(s);
}
}

Regular expression extracting a string from url

What I am trying is to extract my account id from a url for other validations.
see my URL samples.
http://localhost:8024/accounts/u8m21ercgelj/
http://localhost:8024/accounts/u8m21ercgelj
http://localhost:8024/accounts/u8m21ercgelj/users?
What I required is to extract u8m21ercgelj from the url. I tried it with below code but it fails for the cases like http://localhost:8024/accounts/u8m21ercgelj
i.e with out a / at the end.
public String extractAccountIdFromURL(String url) {
String accountId = null;
if ( url.contains("accounts")) {
Pattern pattern = Pattern.compile("[accounts]/(.*?)/");
Matcher matcher = pattern.matcher(url);
while (matcher.find()) {
accountId = matcher.group(1);
}
}
return accountId;
}
Can any one help me?

[accounts] doesn't look for the word accounts, but for one character which is either a, c (repeating a character doesn't change anything), o, u, n, t or s, because [...] is a character class. So get rid of those [ and ] and replace them with / since you most likely don't want to accept cases like /specialaccounts/ but only /accounts/.
It looks like you just want to find next non-/ section after /accounts/. In that case you can just use /accounts/([^/]+)
If you are sure that there will be only one /accounts/ section in URL you can (and for more readable code should) change your while to if or even conditional operator. Also there is no need for contains("/accounts/") since it just adds additional traversing over entire string which can be done in find().
It doesn't look like your method is using any data held by your class (any fields) so it could be static.
Demo:
//we should reuse the compiled regex; there is no point in compiling it many times
private static final Pattern pattern = Pattern.compile("/accounts/([^/]+)");
public static String extractAccountIdFromURL(String url) {
Matcher matcher = pattern.matcher(url);
return matcher.find() ? matcher.group(1) : null;
}
public static void main(java.lang.String[] args) throws Exception {
String examples =
"http://localhost:8024/accounts/u8m21ercgelj/\r\n" +
"http://localhost:8024/accounts/u8m21ercgelj\r\n" +
"http://localhost:8024/accounts/u8m21ercgelj/users?";
for (String url : examples.split("\\R")){// split on line separator like `\r\n`
System.out.println(extractAccountIdFromURL(url));
}
}
Output:
u8m21ercgelj
u8m21ercgelj
u8m21ercgelj

Your regex is written such that it expects a trailing slash; that is what the slash after the (.*?) means.
You should change this so that it accepts either the trailing slash or the end of the string. (/|$) should work in this case, meaning your regex would be [accounts]/(.*?)(/|$). Note that [accounts] is a character class that matches a single character, so /accounts/(.*?)(/|$) is most likely what you want.
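A rough sketch of that idea, assuming the literal /accounts/ prefix is what is intended and using a non-capturing group for the "slash or end of string" part. The names AccountIdDemo and extract are just for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AccountIdDemo {
    // Lazy (.*?) stops at the first following slash, or at the end of the string.
    private static final Pattern ACCOUNT =
            Pattern.compile("/accounts/(.*?)(?:/|$)");

    // Returns the path segment after /accounts/, or null if there is none.
    static String extract(String url) {
        Matcher m = ACCOUNT.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("http://localhost:8024/accounts/u8m21ercgelj/"));       // u8m21ercgelj
        System.out.println(extract("http://localhost:8024/accounts/u8m21ercgelj"));        // u8m21ercgelj
        System.out.println(extract("http://localhost:8024/accounts/u8m21ercgelj/users?")); // u8m21ercgelj
    }
}
```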

Java how to check multiple regex patterns against an input?

(If I'm taking the complete wrong direction let me know if there is a better way I should be approaching this)
I have a Java program that will have multiple patterns that I want to compare against an input. If one of the patterns matches then I want to save that value in a String. I can get it to work with a single pattern but I'd like to be able to check against many.
Right now I have this to check if an input matches one pattern:
Pattern pattern = Pattern.compile("TST\\w{1,}");
Matcher match = pattern.matcher(input);
String ID = match.find()?match.group():null;
So, if the input was TST1234 or abcTST1234 then ID = "TST1234"
I want to have multiple patterns like:
Pattern pattern = Pattern.compile("TST\\w{1,}");
Pattern pattern2 = Pattern.compile("TWT\\w{1,}");
...
and then add them to a collection and check each one against the input:
List<Pattern> rxs = new ArrayList<Pattern>();
rxs.add(pattern);
rxs.add(pattern2);
String ID = null;
for (Pattern rx : rxs) {
if (rx.matcher(requestEnt).matches()){
ID = //???
}
}
I'm not sure how to set ID to what I want. I've tried
ID = rx.matcher(requestEnt).group();
and
ID = rx.matcher(requestEnt).find()?rx.matcher(requestEnt).group():null;
Not really sure how to make this work or where to go from here though. Any help or suggestions are appreciated. Thanks.
EDIT: Yes the patterns will change over time. So The patten list will grow.
I just need to get the string of the match...ie if the input is abcTWT123 it will first check against "TST\w{1,}", then move on to "TWT\w{1,}" and since that matches the ID String will be set to "TWT123".

To collect the matched string in the result you may need to create a group in your regexp if you are matching less than the entire string:
List<Pattern> patterns = new ArrayList<>();
patterns.add(Pattern.compile("(TST\\w+)"));
...
Optional<String> result = Optional.empty();
for (Pattern pattern: patterns) {
Matcher matcher = pattern.matcher(input);
if (matcher.matches()) {
result = Optional.of(matcher.group(1));
break;
}
}
Or, if you are familiar with streams:
Optional<String> result = patterns.stream()
.map(p -> p.matcher(input)).filter(Matcher::matches)
.map(m -> m.group(1)).findFirst();
The alternative is to use find (as in @Raffaele's answer), which implicitly gives you the match as group 0.
Another alternative you may want to consider is to put all your matches into a single pattern.
Pattern pattern = Pattern.compile("(TST\\w+|TWT\\w+|...)");
Then you can match and group in a single operation. However, this might make it harder to change the patterns over time.
Group 1 is the first matched group (i.e. the match inside the first set of parentheses). Group 0 is the entire match. So if you want the entire match (I wasn't sure from your question) then you could perhaps use group 0.

Use an alternation | (a regex OR):
Pattern pattern = Pattern.compile("TST\\w+|TWT\\w+|etc");
Then just check the pattern once.
Note also that {1,} can be replaced with +.

Maybe you just need to end the loop when the first pattern matches:
// TST\\w{1,}
// TWT\\w{1,}
private List<Pattern> patterns;
public String findIdOrNull(String input) {
for (Pattern p : patterns) {
Matcher m = p.matcher(input);
// First match. If the whole string must match use .matches()
if (m.find()) {
return m.group(0);
}
}
return null; // Or throw an Exception if this should never happen
}

If your patterns are all going to be simple prefixes like your examples TST and TWT, you can define all of those at once and use regex alternation | so you won't need to loop over the patterns.
An example:
String prefixes = "TWT|TST|WHW";
String regex = "(" + prefixes + ")\\w+";
Pattern pattern = Pattern.compile(regex);
String input = "abcTST123";
Matcher match = pattern.matcher(input);
String ID = match.find() ? match.group() : null;
// given this, ID will come out as "TST123"
Now prefixes could be read in from a java .properties file, or a simple text file; or passed as a parameter to the method that does this.
You could also define the prefixes as a comma-separated list or one-per-line in a file then process that to turn them into one|two|three|etc before passing it on.
You may be looping over several inputs, and then you would want to create the regex and pattern variables only once, creating only the Matcher for each separate input.
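Here is one way this could look when the prefixes arrive as a list (for example after splitting a comma-separated string). It is only a sketch: the names PrefixPatternDemo, fromPrefixes and findId are invented, and I added Pattern.quote on the assumption that prefixes read from a file should be treated as literal text rather than as regex syntax.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class PrefixPatternDemo {
    // Builds (?:P1|P2|...)\w+ once from the given literal prefixes.
    static Pattern fromPrefixes(List<String> prefixes) {
        String alternation = prefixes.stream()
                .map(Pattern::quote) // treat each prefix as literal text
                .collect(Collectors.joining("|"));
        return Pattern.compile("(?:" + alternation + ")\\w+");
    }

    // Returns the first match in input, or null if nothing matches.
    static String findId(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Compile once, then reuse the pattern for every input.
        Pattern p = fromPrefixes(Arrays.asList("TWT,TST,WHW".split(",")));
        System.out.println(findId(p, "abcTST123")); // TST123
        System.out.println(findId(p, "abcTWT123")); // TWT123
    }
}
```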

Multiple matches with delimiter

this is my regex:
([+-]*)(\\d+)\\s*([a-zA-Z]+)
group no.1 = sign
group no.2 = multiplier
group no.3 = time unit
The thing is, I would like to match given input but it can be "chained". So my input should be valid if and only if the whole pattern is repeating without anything between those occurrences (except of whitespaces). (Only one match or multiple matches next to each other with possible whitespaces between them).
valid examples:
1day
+1day
-1 day
+1day-1month
+1day +1month
+1day +1month
invalid examples:
###+1day+1month
+1day###+1month
+1day+1month###
###+1day+1month###
###+1day+1month###
In my case I can use the matcher.find() method; this would do the trick, but it will accept input like +1day###+1month, which is not valid for me.
Any ideas? This can be solved with multiple IF conditions and multiple checks for start and end indexes but I'm searching for elegant solution.
EDIT
The suggested regex in comments below ^\s*(([+-]*)(\d+)\s*([a-zA-Z]+)\s*)+$ will partially do the trick but if I use it in the code below it returns different result than the result I'm looking for.
The problem is that I cannot use (*my regex*)+ because it will match the whole thing.
The solution could be to match the whole input with ^\s*(([+-]*)(\d+)\s*([a-zA-Z]+)\s*)+$ and then use ([+-]*)(\\d+)\\s*([a-zA-Z]+) with matcher.find() and matcher.group(i) to extract each match and its groups. But I was looking for a more elegant solution.

This should work for you:
^\s*(([+-]*)(\d+)\s*([a-zA-Z]+)\s*)+$
First, by adding the beginning and ending anchors (^ and $), the pattern will not allow invalid characters to occur anywhere before or after the match.
Next, I included optional whitespace before and after the repeated pattern (\s*).
Finally, the entire pattern is enclosed in a repeater so that it can occur multiple times in a row ((...)+).
On a side note, I'd also recommend changing [+-]* to [+-]? so that it can only occur once.
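Since a repeated capturing group only keeps its last iteration, one way to still get every sign/multiplier/unit triple is the two-step approach mentioned in the question: validate the whole input first, then walk over the individual terms with find(). The following is a sketch under that assumption, using [+-]? as recommended above. The class and method names are hypothetical, and returning null for invalid input is just one possible convention.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChainedDurationDemo {
    // Whole-input check: one or more terms with only whitespace in between.
    private static final Pattern WHOLE =
            Pattern.compile("\\s*(?:[+-]?\\d+\\s*[a-zA-Z]+\\s*)+");
    // Single term: group 1 = sign, group 2 = multiplier, group 3 = time unit.
    private static final Pattern TERM =
            Pattern.compile("([+-]?)(\\d+)\\s*([a-zA-Z]+)");

    // Returns one {sign, multiplier, unit} array per term, or null if invalid.
    static List<String[]> parse(String input) {
        if (!WHOLE.matcher(input).matches()) {
            return null;
        }
        List<String[]> terms = new ArrayList<>();
        Matcher m = TERM.matcher(input);
        while (m.find()) {
            terms.add(new String[] {m.group(1), m.group(2), m.group(3)});
        }
        return terms;
    }

    public static void main(String[] args) {
        for (String[] t : parse("+1day -1month")) {
            System.out.println(String.join(" ", t));
        }
        System.out.println(parse("+1day###+1month")); // null
    }
}
```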

You could use ^$ for that, to match the start/end of string
^\s*(?:([+-]?)(\d+)\s*([a-z]+)\s*)+$
https://regex101.com/r/lM7dZ9/2
See the Unit Tests for your examples. Basically, you just need to allow the pattern to repeat and force that nothing besides whitespace occurs in between the matches.
Combined with line start/end matching and you're done.

You can use String.matches or Matcher.matches in Java to match the entire region.
Java Example:
public class RegTest {
public static final Pattern PATTERN = Pattern.compile(
"(\\s*([+-]?)(\\d+)\\s*([a-zA-Z]+)\\s*)+");
@Test
public void testDays() throws Exception {
assertTrue(valid("1 day"));
assertTrue(valid("-1 day"));
assertTrue(valid("+1day-1month"));
assertTrue(valid("+1day -1month"));
assertTrue(valid(" +1day +1month "));
assertFalse(valid("+1day###+1month"));
assertFalse(valid(""));
assertFalse(valid("++1day-1month"));
}
private static boolean valid(String s) {
return PATTERN.matcher(s).matches();
}
}

You can proceed like this:
String p = "\\G\\s*(?:([-+]?)(\\d+)\\s*([a-z]+)|\\z)";
Pattern RegexCompile = Pattern.compile(p, Pattern.CASE_INSENSITIVE);
String s = "+1day 1month";
ArrayList<HashMap<String, String>> results = new ArrayList<HashMap<String, String>>();
Matcher m = RegexCompile.matcher(s);
boolean validFormat = false;
while( m.find() ) {
if (m.group(1) == null) {
// if the capture group 1 (or 2 or 3) is null, it means that the second
// branch of the pattern has succeeded (the \z branch) and that the end
// of the string has been reached.
validFormat = true;
} else {
// otherwise, this is not the end of the string and the match result is
// "temporary" stored in the ArrayList 'results'
HashMap<String, String> result = new HashMap<String, String>();
result.put("sign", m.group(1));
result.put("multiplier", m.group(2));
result.put("time_unit", m.group(3));
results.add(result);
}
}
if (validFormat) {
for (HashMap item : results) {
System.out.println("sign: " + item.get("sign")
+ "\nmultiplier: " + item.get("multiplier")
+ "\ntime_unit: " + item.get("time_unit") + "\n");
}
} else {
results.clear();
System.out.println("Invalid Format");
}
The \G anchor matches the start of the string or the position after the previous match. In this pattern, it ensures that all matches are contiguous. If the end of the string is reached, it proves that the string is valid from start to end.

Regex to detect end of line(\n) that has double slash(//)

I need a regex for this example:
//This is a comment and I need this \n position
String notwanted ="//I do not need this end of line position";

Try this regex:
(?<!")\/\/[^\n]+(\n)
You can use the Matcher method matcher.start(1) to get the index of the \n character, but it will not match a line where // is preceded by ". Example in Java:
public class Main {
public static void main(String[] args){
String example = "//This is a comment and I need this \\n position\n" +
"String notwanted =\"//I do not need this end of line position\";";
Pattern regex = Pattern.compile("(?<!\")//[^\\n]+(\\n)");
Matcher matcher = regex.matcher(example);
while (matcher.find()) {
System.out.println(matcher.start(1));
}
}
}
however it would be enough to use:
(?<!")\/\/[^\n]+
and just use matcher.end(), to get start position of new line.
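A short sketch of the matcher.end() variant (the class and method names are made up). Keep in mind that if a comment is the last thing in the input, with no trailing newline, end() is simply the length of the string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentEndDemo {
    // "//" not directly preceded by a double quote, up to (not including) the newline.
    private static final Pattern COMMENT = Pattern.compile("(?<!\")//[^\\n]+");

    // Collects the index just past each comment, i.e. where its newline would be.
    static List<Integer> commentEnds(String source) {
        List<Integer> ends = new ArrayList<>();
        Matcher m = COMMENT.matcher(source);
        while (m.find()) {
            ends.add(m.end());
        }
        return ends;
    }

    public static void main(String[] args) {
        String source = "//abc\nString s =\"//no\";";
        System.out.println(commentEnds(source));  // [5]
        System.out.println(source.indexOf('\n')); // 5
    }
}
```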
Another case, if you would like to split a string using this position, you can also use this one:
example.split("(?<=^//[^\n]{0,1000})\n");
The (?<=^//[^\n]{0,1000}) means:
?<= - lookbehind,
^// - beginning of a line, followed by the // comment sign
[^\n]{0,1000} - multiple characters other than newlines. This is the tricky part: because a lookbehind needs a bounded length, you cannot use quantifiers like * or +, which is why an interval is needed, in this case from 0 to 1000 characters. Be aware that if your comment is longer than 1000 characters (unlikely but possible), it will not work, so choose this number (1000 in this example) carefully
\n - new line you are looking for
but if you would like to split the whole string in multiple places, you will need to add the modifier (?m) (multiline mode) at the beginning of the regex:
(?m)(?<=^//[^\n]{0,1000})\n
but I'm not entirely sure
>>EDIT<< response to questions from comments
Try this code:
public class Main {
public static void main(String[] args){
String example =
"//This is a comment and I need this \\n position\n" +
"String notwanted =\"//I do not need this end of line position\";\n" +
"String a = aaa; //comment here";
Pattern regex = Pattern.compile("(?m)(?<=(^|;\\s{0,1000})//[^\n]{0,1000})(\n|$)");
Matcher matcher = regex.matcher(example);
while(matcher.find()){
System.out.println(matcher.start());
}
System.out.println(example.replaceAll("(?<=(^|;\\s{0,1000})//[^\n]{0,1000})(\n|$)", " (X)\n"));
}
}
Maybe this regex will fulfill your expectations. If not, please redefine and ask another question with more details like: input, expected output, your current code, your goal.

This should work for you. It's really really awful. Couldn't really think of a much better, versatile solution. I'm assuming you also wanted comments like this:
String myStr = "asasdasd"; //some comment here
^[^"\n]*?(?:[^"\n]*?"(?>\\"|[^"\n])*?"[^"\n]*?)*?[^"\n]*?\/\/.*?(\n)
