Java Pattern: what is the issue with these pattern matches? - java

I would like a Java pattern that matches a series of non-whitespace characters, optionally followed by a series of whitespace characters, then followed by a pair of parentheses containing anything. I tried this code:
Pattern p1 = Pattern.compile("[^\\s+][\\s*]\\({1}[.*]\\){1}");
However, when I try to match it against "a (a)", false is returned.
Possibly similar problems:
Two websites separated by whitespace:
Pattern p4 = Pattern.compile("([^\\s+]([\\.]{1}[^\\s+])+)[\\s+]([^\\s+]([\\.]{1}[^\\s+])+)");
Two strings of non-whitespace characters separated by one of a list of punctuation marks or words present in the code below (e.g. and, or, aka...); the string could also start with one of those words.
Pattern p2 = Pattern.compile(
"([^\\s+][\\s+])?([and|or|aka|&|Related to|moved from|now|formerly|and by the same host|and any address starting with]{1}[\\s+][^\\s+])+");
Pattern p3 = Pattern.compile("[^\\s]+[\\s*][,|&|;|\\s+/|/\\s+]{1}[\\s*][^\\s+]");

I think reading the docs on patterns in Java might be helpful.
The particular issue is that you put + and * in the wrong place, but I think the underlying reason is that you don't understand what [something] means. The following code
Pattern p1 = Pattern.compile("[^\\s]+[\\s]*\\({1}.*\\){1}");
//Pattern p1 = Pattern.compile("[^\\s]+[\\s]*\\(.*\\)"); //simplified same pattern
String t = "a (a)";
Matcher matcher = p1.matcher(t);
System.out.println(matcher.matches());
prints true.

[^\\s]+[\\s]*\\(.*?\\)
Will do what you want. Move the asterisk and plus sign outside the character class brackets. Both instances of {1} do nothing: with no other quantifier, a token is matched exactly once. Finally, [.*] is a character class, so it literally means "permit one of these two characters", . or *.
[test] means one of t, e, or s. The second t is irrelevant. Most characters inside character classes stand for their literal counterpart, but the exceptions involve a lot more explaining than should be done in an S/O answer.
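To make the character-class point concrete, here is a minimal sketch. The class name CharClassDemo and the helper fullMatch are made up for illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Hypothetical helper: true when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // [.*] is a class holding the two literal characters '.' and '*'.
        System.out.println(fullMatch("[.*]", "*"));     // true
        System.out.println(fullMatch("[.*]", "a"));     // false
        System.out.println(fullMatch("[.*]", "**"));    // false: exactly one character
        // [^\s+] matches ONE character that is neither whitespace nor '+'.
        System.out.println(fullMatch("[^\\s+]", "ab")); // false
        // With the quantifier outside the brackets, the class is repeated.
        System.out.println(fullMatch("[^\\s]+", "ab")); // true
        // The question's pattern fails on "a (a)" because [.*] cannot match 'a'.
        System.out.println(fullMatch("[^\\s+][\\s*]\\({1}[.*]\\){1}", "a (a)")); // false
        System.out.println(fullMatch("[^\\s]+[\\s]*\\(.*\\)", "a (a)"));         // true
    }
}
```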
Note that while this expression will succeed for, say, a (b), it will give unexpected results if you have two occurrences to match in the same sentence, and it is generally just a messy expression.
For a realistic expression, you need to provide realistic sample data.
An exceptional resource, after learning the basics, is the real-time testing environment provided by sites like http://regex101.com. With syntax highlighting, match highlighting, match breakdown, and tooltips on mouseover of tokens, it's a great way to take the second step. While it only supports a few (commonly used) flavors, most mature programming/scripting languages share the same basic/intermediate capabilities in regex.

Related

java 8 regular expression for meta characters [duplicate]

I am trying to write a regular expression to check whether a sentence has metacharacters, for example "I need to make payment of $50 for the purchase, should i use CASH|CC". In this sentence I need to identify whether metacharacters are present.
\\\\$ or ^(\\\\$)\\$. What is the right syntax for Pattern.matches("^([\\\\$]$)", text); to identify the special characters. I don't need to replace just identify if the sentence contains these characters.
If you want to know whether a string contains meta characters, you can use something like this:
boolean hasIt = sentence.chars().anyMatch(c -> "\\.[]{}()*+?^$|".indexOf(c) >= 0);
By not using the Regex engine, you don’t need to quote the characters which have a special meaning to it.
Using Pattern.matches creates three unnecessary obstacles to the task. First, you have to quote all characters correctly. Second, you need a regex construct to turn the characters into alternatives, e.g. [abc] or a|b|c. Third, matches checks whether the entire string matches the pattern, rather than whether it contains an occurrence, so you’d need something like .*pattern.* to make matches behave like find, if you insist on it.
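The difference between matches and find can be seen in a small sketch. The class name, the constant, and the choice of checking only $ and | are assumptions made for illustration, not part of the question.

```java
import java.util.regex.Pattern;

class MatchesVsFind {
    // A character class with two alternatives; inside [...] neither '$' nor '|' needs quoting.
    static final Pattern DOLLAR_OR_PIPE = Pattern.compile("[$|]");

    // matches(): the ENTIRE string has to match the pattern.
    static boolean wholeStringMatches(String text) {
        return DOLLAR_OR_PIPE.matcher(text).matches();
    }

    // find(): an occurrence anywhere in the string is enough.
    static boolean containsOccurrence(String text) {
        return DOLLAR_OR_PIPE.matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "I need to make payment of $50 for the purchase, should i use CASH|CC";
        System.out.println(wholeStringMatches(text));          // false
        System.out.println(containsOccurrence(text));          // true
        // Wrapping the pattern in .* makes matches behave like find.
        System.out.println(Pattern.matches(".*[$|].*", text)); // true
    }
}
```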
This leads to the XY problem of this task: it’s not clear which metacharacters you actually want to check and why you need this information in the first place.
If you want to search for this sentence within another text, just use Pattern.compile(sentence, Pattern.LITERAL) to disable interpretation of meta characters. Or Pattern.quote(sentence) when you want to assemble a pattern containing the sentence.
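A short sketch of both options follows. The class and method names and the sample strings are invented for illustration; Pattern.LITERAL and Pattern.quote are the JDK features named above.

```java
import java.util.regex.Pattern;

class LiteralSearch {
    // Pattern.LITERAL: the whole needle is treated as plain text.
    static boolean containsLiteral(String haystack, String needle) {
        return Pattern.compile(needle, Pattern.LITERAL).matcher(haystack).find();
    }

    // Pattern.quote: embed plain text inside a larger regex.
    static boolean followsUse(String haystack, String needle) {
        return Pattern.compile("use\\s+" + Pattern.quote(needle)).matcher(haystack).find();
    }

    public static void main(String[] args) {
        // Unquoted, '|' is alternation, so "CC" alone is enough to match.
        System.out.println(Pattern.compile("CASH|CC").matcher("pay by CC").find()); // true
        System.out.println(containsLiteral("pay by CC", "CASH|CC"));                // false
        System.out.println(containsLiteral("should i use CASH|CC", "CASH|CC"));     // true
        System.out.println(followsUse("should i use CASH|CC", "CASH|CC"));          // true
        System.out.println(followsUse("should i use CASH", "CASH|CC"));             // false
    }
}
```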
But if you don’t want to search for it, this information has no relevance. Note that “Is this a meta character?” may lead to a different answer than “Does it need quoting?”. Even this tutorial combines these questions in a misleading way. At two close places it names the metacharacters and describes the quoting syntax, leading to the wrong impression that all of these characters need quoting.
For example, - only has a special meaning within a character class, so if there is no character class, which you detect by the presence of [, the - does not imply the presence of metacharacters. But while - truly needs quoting within the character class, the characters = and ! are metacharacters only in a certain context, which requires a metacharacter, so they never require quoting.
But if you are trying to check for a metacharacter to decide whether to use the Regex engine or to perform a plain text search, e.g. via String.indexOf, you are performing premature optimization. This is not only a waste of development effort; optimizing before you even have actual code you could measure often leads to the opposite result. Performing a pattern matching using the Regex engine with a string containing no metacharacters can lead to a more efficient search than a plain indexOf on the String. In the reference implementation, the Regex engine uses the Boyer Moore algorithm while the plaintext search methods on String use a naive search.
Edit: As mentioned by commenters Andreas and Holger, whether a character is a meta character in a regular expression sometimes depends on a syntactic sub-context, like character classes or specific sequences (lookahead, lookbehind, ...), so such characters are not intrinsically metacharacters. Some are only meta characters in a specific context. However, the answer provided here will include all possible meta characters, with the exception of the operators that only become meta characters when prefixed by \. This means that characters will sometimes be matched in locations where they are not actually meta characters.
This question has half the answer: List of all special characters that need to be escaped in a regex
You can look at the javadoc of the Pattern class: http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
The Java regular expression system regrettably exposes no character class for its own special characters.
Special constructs (named-capturing and non-capturing)
(?<name>X) X, as a named-capturing group
(?:X) X, as a non-capturing group
(?idmsuxU-idmsuxU) Nothing, but turns match flags i d m s u x U on - off
(?idmsux-idmsux:X) X, as a non-capturing group with the given flags i d m s u x on - off
(?=X) X, via zero-width positive lookahead
(?!X) X, via zero-width negative lookahead
This block alone contains a lot (though not all) of the meta characters. I had to leave out the last rows of the citation, because the character sequences confused the parser of this page.
I would suggest the following:
public static final Pattern META_CHARS = Pattern.compile("[\\\\\\]\\[(){}\\-!$?*+<>\\:\\.\\=\\,\\|^]");
But be aware, that this list might very well be incomplete, and that this contains typical characters such as , and . which are part of the regex syntax. So you probably got a lot of escaping to do...
From there you can:
Matcher metaDetector = META_CHARS.matcher(stringToTest);
if (metaDetector.find()) {
// this is the found meta character...
String metaCharacter = metaDetector.group(0);
System.out.print(metaCharacter);
}
And if you want to find all meta characters, then make a while out of if in the above code snippet. If you do, for the line "I need to make \\payment{[ of $50 for !!the purc\"hase, sh###ould i use CASH|CC." you receive \{[$!!,|., which is correct, as # and " are not meta characters in regex.
As Andreas correctly mentions, the exact pattern can be reduced to "[\\\\\\]\\[(){}^$?*+.|]", because this will tell you, whether or not at least one meta character is present. However this might miss some meta characters, if multiple are present. If this is not important, then the shorter chain is sufficient.
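Putting the pieces of this answer together, the "while instead of if" variant could look like the sketch below. The class name MetaCharScan and the method collect are made up; the pattern is the META_CHARS suggestion from above, with the same caveat that the list may be incomplete.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharScan {
    // The pattern suggested in this answer; the list may be incomplete.
    static final Pattern META_CHARS =
            Pattern.compile("[\\\\\\]\\[(){}\\-!$?*+<>\\:\\.\\=\\,\\|^]");

    // Collects every character of the input that the pattern considers "meta".
    static String collect(String stringToTest) {
        StringBuilder found = new StringBuilder();
        Matcher metaDetector = META_CHARS.matcher(stringToTest);
        while (metaDetector.find()) {          // 'while' instead of 'if'
            found.append(metaDetector.group(0));
        }
        return found.toString();
    }

    public static void main(String[] args) {
        String line = "I need to make \\payment{[ of $50 for !!the purc\"hase, sh###ould i use CASH|CC.";
        System.out.println(collect(line)); // \{[$!!,|.
    }
}
```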

Algorithm: Regex character intersection

I have two regular expressions in Java and I would like to know whether the last character of a string (successfully) matched by the first regex can be the same as the first character of a string (successfully) matched by the second one.
These expression are complex, not just character restrictions, but length or form restricted too.
I was looking into https://code.google.com/archive/p/xeger/ but that is just half of the way.
(I am solving a problem whether there is a separator needed in between two consecutive strings restricted by these regexes or whether a parser would be able to tell them apart without a separator)
Examples:
Regex1 = <
Regex2 = [:a-zA-Z]([:a-zA-Z]|-|_|\.|[0-9])*
Regex3 = Regex2
[Regex1][Regex2] would need no separator, because a parser would safely parse the string <xml into 2 tokens (< and xml).
[Regex2][Regex3] share a lot of characters, and a parser would have several possibilities for how to parse, let's say, the string table.
I know the theory behind regex evaluation (automata...), however I would like to avoid implementing DFA generation on my own.
I have an open source library on github that can build DFAs for you: http://mtimmerm.github.io/dfalex/
Note that your question seems to be formulated incorrectly. If you want to know whether or not a delimiter is required between strings that match two regexes, you probably need to know whether any character that can "extend" a successful match of the first regex can also start a match of the second regex. In a DFA, the characters that can extend a match are the ones on transitions out of accepting states.
I should add that you don't necessarily need to build DFAs to answer these questions. First + last characters, extending characters, and whether or not it matches the empty string, are questions that can be answered with simple recursive operations on the regex AST.
For example (using | and & for both Boolean and set operations):
Let NULLABLE(X) be true iff a regex matches the empty string. Then:
NULLABLE(AB) = NULLABLE(A) & NULLABLE(B)
NULLABLE(A|B) = NULLABLE(A) | NULLABLE(B)
NULLABLE(A+) = NULLABLE(A)
NULLABLE(A?) = true
Let FIRST(X) be the set of characters that can start a regex:
FIRST(AB) = NULLABLE(A) ? FIRST(A)|FIRST(B) : FIRST(A)
FIRST(A|B) = FIRST(A)|FIRST(B)
FIRST(A+) = FIRST(A?) = FIRST(A)
Let EXT(X) be the set of characters that can extend a regex:
EXT(AB) = NULLABLE(B) ? EXT(A)|EXT(B) : EXT(B)
EXT(A|B) = EXT(A) | EXT(B)
EXT(A+) = EXT(A?) = EXT(A)|FIRST(A)
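The rules above translate fairly directly into a recursive computation. The following is only a sketch under assumptions: the tiny AST (Lit, Cat, Alt, Plus, Opt) is hypothetical, the base case for a single literal (not nullable, FIRST = {c}, EXT = {}) is my own addition, A* is written as (A+)?, and needsSeparator simply tests whether EXT of the left regex intersects FIRST of the right one.

```java
import java.util.HashSet;
import java.util.Set;

class RegexSets {
    // Hypothetical minimal regex AST; every node answers the three questions.
    abstract static class Node {
        abstract boolean nullable();      // NULLABLE(X)
        abstract Set<Character> first();  // FIRST(X)
        abstract Set<Character> ext();    // EXT(X)
    }

    static Set<Character> union(Set<Character> a, Set<Character> b) {
        Set<Character> r = new HashSet<>(a);
        r.addAll(b);
        return r;
    }

    // Single literal character: assumed base case, not listed in the rules above.
    static final class Lit extends Node {
        final char c;
        Lit(char c) { this.c = c; }
        boolean nullable() { return false; }
        Set<Character> first() { Set<Character> s = new HashSet<>(); s.add(c); return s; }
        Set<Character> ext() { return new HashSet<>(); }
    }

    static final class Cat extends Node {   // AB
        final Node a, b;
        Cat(Node a, Node b) { this.a = a; this.b = b; }
        boolean nullable() { return a.nullable() && b.nullable(); }
        Set<Character> first() { return a.nullable() ? union(a.first(), b.first()) : a.first(); }
        Set<Character> ext() { return b.nullable() ? union(a.ext(), b.ext()) : b.ext(); }
    }

    static final class Alt extends Node {   // A|B
        final Node a, b;
        Alt(Node a, Node b) { this.a = a; this.b = b; }
        boolean nullable() { return a.nullable() || b.nullable(); }
        Set<Character> first() { return union(a.first(), b.first()); }
        Set<Character> ext() { return union(a.ext(), b.ext()); }
    }

    static final class Plus extends Node {  // A+
        final Node a;
        Plus(Node a) { this.a = a; }
        boolean nullable() { return a.nullable(); }
        Set<Character> first() { return a.first(); }
        Set<Character> ext() { return union(a.ext(), a.first()); }
    }

    static final class Opt extends Node {   // A?
        final Node a;
        Opt(Node a) { this.a = a; }
        boolean nullable() { return true; }
        Set<Character> first() { return a.first(); }
        Set<Character> ext() { return union(a.ext(), a.first()); }
    }

    // A separator is needed if a character extending 'left' can also start 'right'.
    static boolean needsSeparator(Node left, Node right) {
        Set<Character> common = new HashSet<>(left.ext());
        common.retainAll(right.first());
        return !common.isEmpty();
    }

    public static void main(String[] args) {
        Node lt = new Lit('<');
        // x(x|-)* written as x((x|-)+)?, a cut-down stand-in for Regex2
        Node name = new Cat(new Lit('x'),
                new Opt(new Plus(new Alt(new Lit('x'), new Lit('-')))));
        System.out.println(needsSeparator(lt, name));   // false
        System.out.println(needsSeparator(name, name)); // true
    }
}
```

In this cut-down example, nothing can extend a match of <, so no separator is needed before the name, whereas a name can be extended by x, which also starts a name.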
Rather an abstract answer, but maybe you are able to turn this into code:
Regular expressions can always be turned into NFAs (Thompson) and therefore into DFAs (Subset Construction). See this interactive website for examples. And this particular problem might be easier to analyze in a DFA:
If and only if there is an edge that leads to a final state in the first DFA that is marked with a character, for which there is an outgoing edge of the initial state in the second DFA, then your condition is met.
See this example for the expressions aa+b* and ba+. The edges circled in blue are the edges reaching the final state in the first expression, so the final character can be a or b. The edges circled in red are the outgoing edges in the second DFA, so the first character can be only b. In this case, there can be an overlap.

When both halves of an OR regex group match, is it defined which will be chosen?

I've run the following code:
public static void main(String[] args) {
Pattern pattern = Pattern.compile("(asd|asdf).*");
Pattern pattern2 = Pattern.compile("(asdf|asd).*");
Matcher m = pattern.matcher("asdf");
Matcher m2 = pattern2.matcher("asdf");
if (m.matches()) {
System.out.println(m.group(1));
}
if (m2.matches()) {
System.out.println(m2.group(1));
}
}
And I get the following output:
asd
asdf
It seems as though the left hand side of the OR group is chosen in cases when both match. However, I haven't been able to find this behaviour documented. Does anyone know if the behaviour is defined?
In non-POSIX regex flavors, the first alternative that matches is chosen. This includes Java: the Pattern documentation states that the engine performs traditional NFA-based matching with ordered alternation, as occurs in Perl 5. In POSIX, the longest alternative is matched.
See what Perl help says about alternation:
To match dog or cat, we form the regexp dog|cat. As before, Perl will try to match the regexp at the earliest possible point in the string. At each character position, Perl will first try to match the first alternative, dog. If dog doesn't match, Perl will then try the next alternative, cat. If cat doesn't match either, then the match fails and Perl moves to the next position in the string.
See the Alternation with The Vertical Bar or Pipe Symbol at regular-expressions.info that describes NFA-compliant alternation behavior:
The order of the alternatives matters. Suppose you want to use a regex to match a list of function names in a programming language: Get, GetValue, Set or SetValue. The obvious solution is Get|GetValue|Set|SetValue.
The regex engine starts at the first token in the regex, G, and at the first character in the string, S. The match fails. However, the regex engine studied the entire regular expression before starting. So it knows that this regular expression uses alternation, and that the entire regex has not failed yet. So it continues with the second option, being the second G in the regex. The match fails again. The next token is the first S in the regex. The match succeeds, and the engine continues with the next character in the string, as well as the next token in the regex. The next token in the regex is the e after the S that just successfully matched. e matches e. The next token, t matches t.
At this point, the third option in the alternation has been successfully matched. Because the regex engine is eager, it considers the entire alternation to have been successfully matched as soon as one of the options has. In this example, there are no other tokens in the regex outside the alternation, so the entire regex has successfully matched Set in SetValue.
And then:
But the POSIX standard does mandate that the longest match be returned, even when a regex-directed engine is used. Such an engine cannot be eager. It has to continue trying all alternatives even after a match is found, in order to find the longest one.
However, the order of alternatives may be irrelevant if the context on either side is strictly defined. If you use anchors, ^(asd|asdf)$, to match the full string, you will only get the alternative that lets the whole string match, whatever its position in the group.
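A small sketch of both behaviours, using find() so that nothing outside the group forces a particular alternative. The class name and the helper group1 are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrder {
    // Hypothetical helper: group 1 of the first match found, or null if none.
    static String group1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Ordered alternation: the leftmost alternative that matches wins.
        System.out.println(group1("(asd|asdf)", "asdf"));   // asd
        System.out.println(group1("(asdf|asd)", "asdf"));   // asdf
        // With anchors, "asd" cannot reach the end, so the engine backtracks to "asdf".
        System.out.println(group1("^(asd|asdf)$", "asdf")); // asdf
        System.out.println(group1("^(asdf|asd)$", "asdf")); // asdf
    }
}
```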

Regex which matches a string containing at least the specified characters

I have a huge dictionary which I'm trying to look through using a regex. What I would like to do is find all the words in the dictionary which contain at least one occurrence of each character I provide, in no particular order.
Right now I can find words which only contain the specified characters but like I said that is not exactly what I want.
Example:
I want at least one occurrence of each of the following characters {b, a, d}
astring.matches(regex)
I would expect words like:
badder,
baddest,
baffled
Notice they all contain at least one occurrence of each character, in no particular order, and other characters are present in the strings.
Anyone know how to do this? Other suggestions are also welcome!
You need a series of look-aheads:
^(?=.*b)(?=.*a)(?=.*d).*
which is a pain to construct. However, you can ease the pain by using regex to build it:
String regex = "^" + "bad".replaceAll(".", "(?=.*$0)") + ".*";
If you use it repeatedly with String.matches(), you would be better off using the following code, because every call to String.matches() compiles the regex again (there is no caching):
// do this once
Pattern pattern = Pattern.compile(regex);
// reuse the pattern many times
if (pattern.matcher(input).matches())
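As a fuller sketch of the same idea: build the lookahead pattern once and reuse it over a word list. The class name, the method names, and the sample words are assumptions; the approach also assumes the required characters are plain letters, since regex metacharacters would need quoting first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class ContainsAllLetters {
    // Builds ^(?=.*b)(?=.*a)(?=.*d).* from "bad"; assumes plain letters only.
    static Pattern build(String letters) {
        return Pattern.compile("^" + letters.replaceAll(".", "(?=.*$0)") + ".*");
    }

    // Reuses the compiled pattern for every word.
    static List<String> filter(Pattern pattern, List<String> words) {
        List<String> hits = new ArrayList<>();
        for (String word : words) {
            if (pattern.matcher(word).matches()) {
                hits.add(word);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Pattern pattern = build("bad"); // compile once
        List<String> words = Arrays.asList("badder", "baddest", "baffled", "bat", "add");
        System.out.println(filter(pattern, words)); // [badder, baddest, baffled]
    }
}
```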
You can use a lookahead to do this if it's available
(?=.*b)(?=.*a)(?=.*d)
However this is quite inefficient. Any reason you can't use multiple String.indexOf checks?
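For comparison, the indexOf alternative mentioned here might look like the sketch below. The class and method names are made up, and no claim is made about how its speed compares to the regex on a real dictionary; that would need measuring.

```java
class ContainsAllChars {
    // True when every character of 'required' occurs at least once in 'word'.
    static boolean containsAll(String word, String required) {
        for (int i = 0; i < required.length(); i++) {
            if (word.indexOf(required.charAt(i)) < 0) {
                return false; // this required character is missing
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(containsAll("baffled", "bad")); // true
        System.out.println(containsAll("bat", "bad"));     // false
    }
}
```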

Regular expression not extracting the exact pattern

I am working in Java to read a string of over 100000 characters.
I have a list of keywords, that I search the string for, and if the string is present I call a function which does some internal processing.
The kind of keyword I have is "face", for example. I wish to get all the places where "faces" matches, but not "facebook". I can accept a space character next to the keyword, so a match like " face" or " faces" or "face " or " faces" is acceptable too. However, I cannot accept "duckface" or "duckface " etc.
I have written the regex
Pattern p = Pattern.compile("\\s+"+keyword+"s\\s+|\\s+");
where keyword is my list of keywords, but I am not getting the desired results. Can you read my description and please suggest what might be issue and how I can fix it?
Also if a pointer to a really good regex for Java page is shared I would appreciate that as well.
Thank you, contributors.
Edit
The reason I know it is not working is I have used the following code:
Pattern p = Pattern.compile("\\s+"+keyword+"s\\s+|\\s+");
Matcher m = p.matcher(myInputDataSting);
if(m.find())
{
System.out.println("Its a Match: "+m.group());
}
This returns a blank string...
If keyword is "face", then your current regex is
\s+faces\s+|\s+
which matches either one or more whitespace characters, followed by faces, followed by one or more whitespace characters, or one or more whitespace characters. (The pipe | has very low precedence.)
What you really want is
\bfaces?\b
which matches a word boundary, followed by face, optionally followed by s, followed by a word boundary.
So, you can write:
Pattern p = Pattern.compile("\\b"+keyword+"s?\\b");
(though obviously this will only work for words like face that form their plurals by simply adding s).
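If the keywords may contain characters with a special meaning in regex, one option is to wrap them in Pattern.quote before concatenating. This is a sketch under that assumption; the class and method names and the sample strings are invented for illustration.

```java
import java.util.regex.Pattern;

class KeywordMatch {
    // \b<keyword>s?\b with the keyword quoted so it is always taken literally.
    static Pattern forKeyword(String keyword) {
        return Pattern.compile("\\b" + Pattern.quote(keyword) + "s?\\b");
    }

    static boolean found(String keyword, String text) {
        return forKeyword(keyword).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found("face", "a face in the crowd")); // true
        System.out.println(found("face", "two faces"));           // true
        System.out.println(found("face", "my facebook page"));    // false
        System.out.println(found("face", "a duckface selfie"));   // false
    }
}
```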
You can find a comprehensive listing of Java's regular-expression support at http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html, but it's not much of a tutorial. For that, I'd recommend just Googling "regular expression tutorial", and finding one that suits you. (It doesn't have to be Java-specific: most of the tutorials you'll find are for flavors of regular-expression that are very similar to Java's.)
You should use
Pattern p = Pattern.compile("\\b"+keyword+"s?\\b");
, where keyword is not plural. \\b means that the keyword must appear as a complete word in the searched string. s? means that the keyword may be followed by an s.
If you are not familiar enough with regular expressions, I recommend reading http://docs.oracle.com/javase/tutorial/essential/regex/index.html, because it contains examples and explanations.
