How to write regex pattern in lucene? - java

I want to match a string with a regexp query in Lucene.
Test String:
program-id. acinstal.
Regex pattern in java:
^[a-z0-9 ]{6}[^*]\s*(program-id)\.
How would I write this regex for a Lucene regexp query so that it matches the string?

There are two problems with your regex (assuming, based on your previous questions, that your test string is indexed without any tokenization, as a StringField, for instance):
The regex must match a whole term. Without any analysis, as we're assuming, that means it must match the whole field. In this case, you need to add a .* to match the rest of the field.
Anchors are not supported. Since the regex has to match the whole field anyway, they are unnecessary, so get rid of the ^ at the beginning.
So the regex that should work is:
[a-z0-9 ]{6}[^*]\s*(program-id)\..*
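The whole-term requirement can be illustrated without a Lucene index. The sketch below is only an analogy: it uses java.util.regex and Matcher.matches(), which also requires the pattern to cover the entire input, to show why the trailing .* is needed. The class name, method name and sample field value are made up for illustration, and Lucene's regexp syntax is not identical to Java's, so treat this as a sketch rather than a Lucene test.

```java
import java.util.regex.Pattern;

final class WholeTermDemo {
    // Hypothetical field value: six leading characters, one non-'*' character, then the statement.
    static final String FIELD = "000100 program-id. acinstal.";

    // matches() must consume the entire input, similar to a regexp query
    // evaluated against a single untokenized term.
    static boolean matchesWholeField(String regex) {
        return Pattern.compile(regex).matcher(FIELD).matches();
    }

    public static void main(String[] args) {
        // Without the trailing .* the rest of the field is left unmatched.
        System.out.println(matchesWholeField("[a-z0-9 ]{6}[^*]\\s*(program-id)\\."));
        // With the trailing .* the whole field is covered.
        System.out.println(matchesWholeField("[a-z0-9 ]{6}[^*]\\s*(program-id)\\..*"));
    }
}
```

In an actual index the second pattern would be passed to the regexp query as-is, without the doubled backslashes that Java string literals require only in source code.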

Related

Regex pattern for multi equal operation

My goal is to compare one string against multiple other strings for equality, using only a regex, in Java 8.
I used the syntax below:
"^UK (Main Land)|German|Japan|Swiss|French|Italian$"
This works for German, Japan, Swiss and French, but validation fails for UK (Main Land) and Italian.
What do I have to change to make it work?
There are a couple of issues here.
The parentheses, as literal chars, must be escaped.
If you use Matcher.find(), you need the ^ and $ anchors to make sure the pattern matches the entire string (although \A and \z would be better), but you need to group the alternatives with either (...) or (?:...).
You do not need the group and anchors if you use String.matches() or Pattern.matches(), both of which require the pattern to match the entire string.
I'd rather use
Boolean result = text.matches("UK \\(Main Land\\)|German|Japan|Swiss|French|Italian");
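To make the points above concrete, here is a small sketch comparing the original pattern with a grouped, anchored variant and with String.matches(). The class and method names are illustrative only.

```java
import java.util.regex.Pattern;

final class CountryMatchDemo {
    // Original pattern: ^ applies only to the first alternative, $ only to the last,
    // and the unescaped parentheses form a capturing group instead of matching ( and ).
    static final Pattern ORIGINAL =
        Pattern.compile("^UK (Main Land)|German|Japan|Swiss|French|Italian$");

    // Parentheses escaped; alternatives grouped so the anchors apply to all of them.
    static final Pattern GROUPED =
        Pattern.compile("\\A(?:UK \\(Main Land\\)|German|Japan|Swiss|French|Italian)\\z");

    static boolean findOriginal(String s) { return ORIGINAL.matcher(s).find(); }

    static boolean findGrouped(String s) { return GROUPED.matcher(s).find(); }

    // matches() already requires a whole-string match, so no anchors or group are needed.
    static boolean matchesPlain(String s) {
        return s.matches("UK \\(Main Land\\)|German|Japan|Swiss|French|Italian");
    }

    public static void main(String[] args) {
        System.out.println(findOriginal("UK (Main Land)"));  // false
        System.out.println(findOriginal("xxGermanxx"));      // true
        System.out.println(findGrouped("UK (Main Land)"));   // true
        System.out.println(findGrouped("xxGermanxx"));       // false
        System.out.println(matchesPlain("Italian"));         // true
    }
}
```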

Name validation with special conditions using regex

I want to validate a name in Java, allowing each of the following special characters only a single time: {,-.'}. I have an expression that restricts the input to letters and these special characters, but I can't figure out how to prevent users from entering any of them more than once. I tried to achieve it with quantifiers but was unsuccessful. This is the code I have so far:
Pattern validator = Pattern.compile("^[a-zA-Z+\\.+\\-+\\'+\\,]+$");
You can use negative lookahead assertions in your regex:
Pattern validator = Pattern.compile(
"^(?!(?:.*?\\.){2})(?!(?:.*?'){2})(?!(?:.*?,){2})(?!(?:.*?-){2})[a-zA-Z .',-]+$");
Each (?!(?:.*?X){2}) is a negative lookahead that rejects the input if the character X occurs two or more times, so each of the characters . ' , - may appear at most once.
I think that you can just take into account names where such characters would only happen once. Names like "Jonathan's", "Thoms-Damm", "Thoms,Jon", "jonathan.thoms". In practice for names, I don't think that such special characters would occur at the edges of the string. As such, you can probably get away with a regex like:
Pattern validator = Pattern.compile("^[a-zA-Z]+(?:[-',.][a-zA-Z]+)?$");
This regex should match a regular ASCII name followed optionally by a single "special" character with another name after it.
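The two approaches accept different sets of names, which a short sketch makes visible. The class name, constant names and sample inputs below are illustrative assumptions, not part of either answer.

```java
import java.util.regex.Pattern;

final class NameValidationDemo {
    // Lookahead approach: each special character may occur at most once, anywhere.
    static final Pattern AT_MOST_ONCE_EACH = Pattern.compile(
        "^(?!(?:.*?\\.){2})(?!(?:.*?'){2})(?!(?:.*?,){2})(?!(?:.*?-){2})[a-zA-Z .',-]+$");

    // Stricter approach: letters, optionally one special character followed by more letters.
    static final Pattern SINGLE_INNER_SEPARATOR =
        Pattern.compile("^[a-zA-Z]+(?:[-',.][a-zA-Z]+)?$");

    static boolean lenient(String name) { return AT_MOST_ONCE_EACH.matcher(name).matches(); }

    static boolean strict(String name) { return SINGLE_INNER_SEPARATOR.matcher(name).matches(); }

    public static void main(String[] args) {
        String[] samples = {"Jonathan's", "O''Neil", "Thoms-Damm, Jon", "A-B-C", "-Jon"};
        for (String s : samples) {
            System.out.println(s + " lenient=" + lenient(s) + " strict=" + strict(s));
        }
    }
}
```

The lookahead version allows one of each special character in any position (including the edges and spaces), while the second version allows only a single special character in total, and only between two runs of letters.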

Word that matches ^.*(?=.*\\d)(?=.*[a-zA-Z])(?=.*[!##$%^&]).*$

I am totally confused right now.
What is a word that matches: ^.*(?=.*\\d)(?=.*[a-zA-Z])(?=.*[!##$%^&]).*$
I tried 1Test#! at Regex 101, but it does not match.
I really appreciate your input!
Your regex seems to be written as a Java string literal (note the \\d),
so you have to convert it before testing it on regex101, which does not support the Java flavor (only PHP, Python and JavaScript).
Here is the converted regex:
^.*(?=.*\d)(?=.*[a-zA-Z])(?=.*[!##$%^&]).*$
which will match your string 1Test#!. Demo here: http://regex101.com/r/gE3iQ9
You just want something that matches that regex?
Here:
a1a!
This pattern matches
\dTest#!
If you want a pattern which matches 1Test#!, try this pattern:
^.*(?=.*\d)(?=.*[a-zA-Z])(?=.*[!##$%^&]).*$
Your Java string ^.*(?=.*\\d)(?=.*[a-zA-Z])(?=.*[!##$%^&]).*$ encodes the regular expression ^.*(?=.*\d)(?=.*[a-zA-Z])(?=.*[!##$%^&]).*$.
This is because \\ in a Java string literal is an escape sequence for a single backslash.
The latter matches the string you specified.
If your original string was a regexp, rather than a java string, it would match strings such as \dTest#!
Also, you should consider removing the first .*, which would make the regexp more efficient. The reason is that quantifiers are greedy by default. The engine starts by matching the whole string with the initial .*, and the lookaheads then fail. It backtracks, matching the first .* to all but the last character, and again fails all but one of the lookaheads. This continues until it reaches a position where all the lookaheads succeed. Dropping the first .* and putting the lookaheads immediately after the start-of-string anchor avoids this problem, and in this case the set of strings matched stays the same.
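A quick way to check these claims is to compile both forms in Java and compare them on a few inputs. This is a minimal sketch; the class name and the sample strings other than 1Test#! and a1a! are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

final class LookaheadDemo {
    // The pattern from the question, written as a Java string literal.
    static final Pattern ORIGINAL =
        Pattern.compile("^.*(?=.*\\d)(?=.*[a-zA-Z])(?=.*[!##$%^&]).*$");

    // Same requirements with the leading .* dropped, so the lookaheads run once at the start.
    static final Pattern SIMPLIFIED =
        Pattern.compile("^(?=.*\\d)(?=.*[a-zA-Z])(?=.*[!#$%^&]).*$");

    static boolean original(String s) { return ORIGINAL.matcher(s).matches(); }

    static boolean simplified(String s) { return SIMPLIFIED.matcher(s).matches(); }

    public static void main(String[] args) {
        String[] samples = {"1Test#!", "a1a!", "Test#!", "1234!"};
        for (String s : samples) {
            System.out.println(s + " original=" + original(s) + " simplified=" + simplified(s));
        }
    }
}
```

Both patterns require at least one digit, one ASCII letter and one of the listed special characters, so they should agree on every sample.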

How can I obtain what .* matched in a regular expression?

I have thousands of different regular expressions and they look like this:
^Mozilla.*Android.*AppleWebKit.*Chrome.*OPR\/([0-9\.]+)
How do I obtain the substrings that match each .* in the regex? For example, for the above regex, I would get four substrings for the four different .*s. In addition, I don't know in advance how many .*s there are. I could possibly find out by doing some simple operation on the given regex string, but that would add complexity to the program. I process a fairly large amount of data, so efficiency really matters here.
Replace the .*s with (.*)s and use matcher.group(n). For instance:
Pattern p = Pattern.compile("1(.*)2(.*)3");
Matcher m = p.matcher("1abc2xyz3");
m.find();
System.out.println(m.group(2));
xyz
Notice how the match of the second (.*) was returned (since m.group(2) was used).
Also, since you mentioned you won't know how many .*s your regex will contain, there is a matcher.groupCount() method you can use, if the only capturing groups in your regex will indeed be (.*)s.
For your own enlightenment, try reading about capturing groups.
How do I get those substrings that match the .* in the regex? For example, for the above regex, I would get four substrings for four different DOT STAR.
Use groups: (.*)
In addition, I don't know in advance how many DOT STARs there are
Build your regex string, then replace .* with (.*):
String myRegex = "your regex here";
myRegex = myRegex.replace(".*","(.*)");
even though I can possibly find out by doing some simple operation on the given regex string, but that would impose more complexity on the program
If you don't know how the regex is made and the regex is not built by your application, the only way is to process it after you have it. If you are building the regex, then append (.*) to the regex string instead of appending .*
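Putting the two suggestions together (rewrite .* as (.*), then loop up to groupCount()), a minimal sketch might look like the following. The class name, method name and sample input are hypothetical. Note two assumptions: the plain string replacement would also rewrite a .* that appears inside a character class or after an escaped backslash, and any capturing group already present in the regex (such as the version group here) is returned along with the others.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DotStarCaptureDemo {
    // Turns every .* into a capturing group and returns all captured groups in order.
    static List<String> captureAll(String regex, String input) {
        Pattern p = Pattern.compile(regex.replace(".*", "(.*)"));
        Matcher m = p.matcher(input);
        List<String> groups = new ArrayList<>();
        if (m.find()) {
            // groupCount() excludes group 0, so valid indices run from 1 to groupCount().
            for (int i = 1; i <= m.groupCount(); i++) {
                groups.add(m.group(i));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        String regex = "^Mozilla.*Android.*AppleWebKit.*Chrome.*OPR\\/([0-9\\.]+)";
        String input = "Mozilla A Android B AppleWebKit C Chrome D OPR/1.2";
        // Four groups come from the rewritten .*s, the fifth from the original ([0-9\.]+).
        System.out.println(captureAll(regex, input));
    }
}
```

With thousands of expressions, compiling each rewritten Pattern once and reusing it is likely to matter more for throughput than the cost of the extra groups.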

Anyone know how to test an entire String for a match using Java regex?

I would like to search a String for an entire match. In other words, if String s = "I am coding" and I search for "am", nothing should be returned. I would have to type in "I am coding" exactly in order for a match to be returned.
I need the regex pattern for this, since I am using RowFilter.regexFilter(...).
Have you tried this: ^I am coding$?
If the text contains no characters that need escaping, the regex is simply the text itself: each character matches itself, and adjacent characters mean concatenation. So, in this case, \AI am coding\z is your answer (written as "\\AI am coding\\z" in a Java string literal).
On the regex side of things:
Use the start-of-string anchor ^ and the end-of-string anchor $ at the beginning and the end of your search pattern (respectively) to ensure that the searched string doesn't contain anything else, i.e. it equals the pattern you're trying to match. Regex:
^I am coding$
Ref: http://www.autohotkey.com/docs/misc/RegEx-QuickRef.htm
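As a sketch of how the anchored pattern could be built from arbitrary user input: RowFilter.regexFilter is documented to test with Matcher.find(), so the anchors are what force a whole-string match. The helper names below are made up for illustration, and Pattern.quote is used on the assumption that the typed text should be taken literally rather than as a regex.

```java
import java.util.regex.Pattern;

final class ExactMatchDemo {
    // Builds a regex that matches only strings equal to the given text.
    // Pattern.quote escapes any regex metacharacters the user may have typed.
    static String exactRegex(String text) {
        return "^" + Pattern.quote(text) + "$";
    }

    // Mimics a find()-based filter: true if the regex is found anywhere in the candidate.
    static boolean found(String regex, String candidate) {
        return Pattern.compile(regex).matcher(candidate).find();
    }

    public static void main(String[] args) {
        String cell = "I am coding";
        System.out.println(found("am", cell));                        // true: plain substring search
        System.out.println(found(exactRegex("am"), cell));            // false: anchors forbid extra text
        System.out.println(found(exactRegex("I am coding"), cell));   // true: exact match
    }
}
```

The string returned by exactRegex could then be passed to RowFilter.regexFilter(...) in place of the raw search text.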
