Java Regex to match repeated keywords - java

I need to filter a document if the caption contains the same surname on both sides (e.g., Smith Vs Smith or John Vs John).
I am converting the entire document into a string and validating that string against a regular expression.
Could anyone help me write a regular expression for the above case?

Backreferences.
Example: (\w+) Vs \1
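A minimal Java sketch of the backreference approach, assuming the caption has the form "Name Vs Name". The class and method names are illustrative only, and the word boundaries (\b) are an addition so that a caption such as "Smith Vs Smithson" is not reported as a match.

```java
import java.util.regex.Pattern;

class SameNameCaption {
    // \1 refers back to whatever group 1 captured; \b keeps the match on whole words.
    private static final Pattern SAME_NAME = Pattern.compile("\\b(\\w+) Vs \\1\\b");

    // Returns true if the text contains a caption such as "Smith Vs Smith".
    static boolean hasSameNameCaption(String document) {
        return SAME_NAME.matcher(document).find();
    }

    public static void main(String[] args) {
        System.out.println(hasSameNameCaption("In the matter of Smith Vs Smith, 2010")); // true
        System.out.println(hasSameNameCaption("Smith Vs Jones"));                        // false
        System.out.println(hasSameNameCaption("Smith Vs Smithson"));                     // false
    }
}
```

Because find() is used rather than matches(), the caption can appear anywhere in the document string.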

If I have understood your question correctly, you have a string like "X Vs Y" (where X and Y are two names) and you want to know whether X equals Y.
In this case, a simple (\w+) regex can do it:
String input = "Smith Vs Smith";
// Build the regex
Pattern p = Pattern.compile("(\\w+)");
Matcher m = p.matcher(input);
// Store the matches in a list, skipping the "Vs" separator
List<String> str = new ArrayList<String>();
while (m.find()) {
    if (!m.group().equals("Vs")) {
        str.add(m.group());
    }
}
// Compare the two names
if (str.size() > 1 && str.get(0).equals(str.get(1))) {
    System.out.println(" The Same ");
} else {
    System.out.println(" Not the Same ");
}

(\w+).*\1
This means: a word of one or more characters, captured as group 1, followed by anything, followed by whatever group 1 matched.
In more detail: parentheses group part of a regex, and \1 refers back to what group 1 captured.
Example:
String s = "Stewie is a good guy. Stewie does no bad things";
Matcher m = Pattern.compile("(\\w+).*\\1").matcher(s);
boolean found = m.find(); // true, and m.group(1) is the duplicated word (note the extra backslashes needed in a Java string literal)

Related

Remove everything from String which is not on a allowlist using regex

Following regular expression removes each word from a string:
String regex = "\\b(operation|for the|am i|regex|mountain)\\b";
String sentence = "I am looking for the inverse operation by using regex";
String s = Pattern.compile(regex).matcher(sentence.toLowerCase()).replaceAll("");
System.out.println(s); // output: "i am looking inverse by using "
I am looking for the inverse operation by using regex. So following example should work.
The words "am i" and "mountain" just indicate that there can be much more words in the list. And also words with spaces can occur in the list.
String regex = "<yet to find>"; // contains words operation,for the,am i,regex,mountain
String sentence = "I am looking for the inverse operation by using regex";
String s = Pattern.compile(regex).matcher(sentence.toLowerCase()).replaceAll("");
System.out.println(s); // output: " for the operation regex"
Regards, Harris
Try the regex:
(?:(?!for the|operation|am i|mountain|regex).)*(for the|operation|am i|mountain|regex|$)
Replace the matches by contents of group 1 \1 or $1
Explanation:
(?:(?!for the|operation|am i|mountain|regex).)* - matches 0+ characters, each of which is NOT the start of for the, operation, am i, mountain or regex
(for the|operation|am i|mountain|regex|$) - matches either for the or operation or am i or mountain or regex or end of the string and captures it in group 1
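As a rough sketch of how the pattern above behaves in Java (the class and method names are made up for illustration), replacing every match with group 1 keeps only the listed terms:

```java
import java.util.regex.Pattern;

class KeepListedWords {
    // Skip characters until a listed term (or the end of the input) is reached,
    // and capture that term in group 1.
    private static final Pattern KEEP = Pattern.compile(
            "(?:(?!for the|operation|am i|mountain|regex).)*(for the|operation|am i|mountain|regex|$)");

    // Replaces each match with group 1, so only the listed terms survive.
    static String keepListed(String sentence) {
        return KEEP.matcher(sentence.toLowerCase()).replaceAll("$1");
    }

    public static void main(String[] args) {
        String sentence = "I am looking for the inverse operation by using regex";
        System.out.println(keepListed(sentence)); // prints "for theoperationregex"
    }
}
```

Note that the kept terms are glued together because the text between them is dropped. Replacing with "$1 " and trimming the result is one way to get "for the operation regex" instead.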
To expand on Singh's answer in the comments, I'd add that hard-coding the regex for a set of words is not very portable. What if the words change? Are they just words or are they patterns? Can you isolate the part of code that will do this work and test it?
Assuming they're just words:
Define a whitelist
String[] whitelist = {
"operation",
"for",
"the",
"am i",
"regex",
"mountain"
};
Write a method for filtering the words so that only the whitelisted ones are allowed.
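String sanitized(String raw, String[] whitelist) {
    StringBuilder termsInOr = new StringBuilder();
    for (String word : whitelist) {
        // Prefix every term with "|" (the leading one is dropped below) and quote it as a literal
        termsInOr.append("|").append(Pattern.quote(word));
    }
    // Lazily skip text up to the next whitelisted term, or up to the end of the input;
    // DOTALL lets . skip over line breaks as well
    String regex = ".*?(?:\\b(" + termsInOr.substring(1) + ")\\b|$)";
    return Pattern.compile(regex, Pattern.DOTALL)
            .matcher(raw)
            .replaceAll("$1 ") // keep only the captured term, followed by a space
            .trim();
}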
This way the logic is isolated: you have two inputs, a whitelist and the raw string, and one sanitized output. It can be tested with assertions based on your expected output (test cases). If you have a different whitelist or raw string somewhere else in the code, you can call the method with that whitelist or raw string to sanitize it.

Regular expression match a-alphanumeric&b-digits&c-digits

I have a query about Java regular expressions. Actually, I am new to regular expressions.
So I need help to form a regex for the statement below:
Statement: a-alphanumeric&b-digits&c-digits
Possible matching Examples: 1) a-90485jlkerj&b-34534534&c-643546
2) A-RT7456ffgt&B-86763454&C-684241
Use case: First of all I have to validate input string against the regular expression. If the input string matches then I have to extract a value, b value and c value like
90485jlkerj, 34534534 and 643546 respectively.
Could someone please share how I can achieve this in the best possible way?
I really appreciate your help on this.
You can use this pattern:
^(?i)a-([0-9a-z]++)&b-([0-9]++)&c-([0-9]++)$
If what you are trying to match is not the whole string, just remove the anchors:
(?i)a-([0-9a-z]++)&b-([0-9]++)&c-([0-9]++)
explanations:
(?i) make the pattern case-insensitive
[0-9]++ digit one or more times (possessive)
[0-9a-z]++ the same with letters
^ anchor for the string start
$ anchor for the string end
The parentheses in the two patterns are capture groups (to catch what you want).
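A small sketch of how the anchored pattern could be used to validate the input and then read the three values from the capture groups. The class name, the method name and the choice to return null for non-matching input are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AbcParser {
    // Anchored, case-insensitive pattern with one capture group per value.
    private static final Pattern ABC =
            Pattern.compile("^(?i)a-([0-9a-z]++)&b-([0-9]++)&c-([0-9]++)$");

    // Returns {a, b, c} when the input matches, or null when it does not.
    static String[] parse(String input) {
        Matcher m = ABC.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] values = parse("A-RT7456ffgt&B-86763454&C-684241");
        System.out.println(String.join(", ", values));        // RT7456ffgt, 86763454, 684241
        System.out.println(parse("a-abc&b-12x&c-3") == null); // true: b must be digits only
    }
}
```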
Given a string with the format a-XXX&b-XXX&c-XXX, you can extract all XXX parts in one simple line:
String[] parts = str.replaceAll("(?i)[abc]-", "").split("&"); // (?i) so that A-, B- and C- are stripped too
parts will be an array with 3 elements, being the target strings you want.
The simplest regex that matches your string is:
^(?i)a-([\\da-z]+)&b-(\\d+)&c-(\\d+)
With your target strings in groups 1, 2 and 3, but you need a lot of code around that to get the strings, which as shown above is not necessary.
The following code will help you:
String[] texts = new String[]{"a-90485jlkerj&b-34534534&c-643546", "A-RT7456ffgt&B-86763454&C-684241"};
Pattern full = Pattern.compile("^(?i)a-([\\da-z]+)&b-(\\d+)&c-(\\d+)");
Pattern patternA = Pattern.compile("(?i)([\\da-z]+)&[bc]");
Pattern patternB = Pattern.compile("(\\d+)");
for (String text : texts) {
    // Validate the whole string first, then pick the values out of the "-"-separated parts
    if (full.matcher(text).matches()) {
        for (String part : text.split("-")) {
            Matcher m = patternA.matcher(part);
            if (m.matches()) {
                System.out.println(part.substring(m.start(), m.end()).split("&")[0]);
            }
            m = patternB.matcher(part);
            if (m.matches()) {
                System.out.println(part.substring(m.start(), m.end()));
            }
        }
    }
}

Find words in string surrounded by "[" and "]":

I need help with a simple task in java. I have the following sentence:
Where Are You [Employee Name]?
your have a [Shift] shift..
I need to extract the strings that are surrounded by [ and ] signs.
I was thinking of using the split method with " " as the delimiter and then finding the single words, but that breaks when the phrase I'm looking for itself contains " ". Using indexOf might be an option as well, but I don't know how to tell that I have reached the end of the String.
What is the best way to perform this task?
Any help would be appreciated.
Try the regex \[(.*?)\] to match the words.
\[ : an escaped [ for a literal match, since [ is a metacharacter.
(.*?) : matches everything in a non-greedy way and captures it in group 1.
Sample code:
Pattern p = Pattern.compile("\\[(.*?)\\]");
Matcher m = p.matcher("Where Are You [Employee Name]? your have a [Shift] shift.");
while (m.find()) {
    System.out.println(m.group(1)); // group 1 is the text between the brackets
}
Here is a Java regular expression that extracts the text between two brackets, including white space:
import java.util.regex.*;

class Main {
    public static void main(String[] args) {
        String txt = "[ Employee Name ]";
        String re1 = ".*?";                // Anything before the first space
        String re2 = "( )";                // White space 1
        String re3 = "((?:[a-z][a-z]+))";  // Word 1
        String re4 = "( )";                // White space 2
        String re5 = "((?:[a-z][a-z]+))";  // Word 2
        String re6 = "( )";                // White space 3
        Pattern p = Pattern.compile(re1 + re2 + re3 + re4 + re5 + re6, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(txt);
        if (m.find()) {
            String ws1 = m.group(1);
            String word1 = m.group(2);
            String ws2 = m.group(3);
            String word2 = m.group(4);
            String ws3 = m.group(5);
            System.out.print("(" + ws1 + ")" + "(" + word1 + ")" + "(" + ws2 + ")" + "(" + word2 + ")" + "(" + ws3 + ")" + "\n");
        }
    }
}
If you want to ignore white space, remove the "( )" parts.
This is a Scanner-based solution:
Scanner sc = new Scanner("Where Are You [Employee Name]? your have a [Shift] shift..");
for (String s; (s = sc.findWithinHorizon("(?<=\\[).*?(?=\\])", 0)) != null;) {
System.out.println(s);
}
output
Employee Name
Shift
Use a StringBuilder (I assume you don't need synchronization).
As you suggested, indexOf() with your square-bracket delimiters will give you a starting index and an ending index. Use substring(startIndex + 1, endIndex) to get exactly the string you want; the end index of substring is exclusive, so the closing bracket is already left out.
I'm not sure what you meant by the end of the String, but indexOf("[") is the start and indexOf("]") is the end.
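A minimal sketch of the indexOf approach, assuming brackets are not nested. The class and method names are illustrative. Searching stops when indexOf returns -1, which is the signal that the end of the String has been reached without finding another delimiter.

```java
import java.util.ArrayList;
import java.util.List;

class BracketScanner {
    // Collects every substring enclosed in [ and ] using indexOf only.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        int start = text.indexOf('[');
        while (start != -1) {
            int end = text.indexOf(']', start + 1);
            if (end == -1) {
                break; // unmatched '[': nothing more to extract
            }
            // substring's end index is exclusive, so both brackets are left out
            found.add(text.substring(start + 1, end));
            start = text.indexOf('[', end + 1);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("Where Are You [Employee Name]? your have a [Shift] shift.."));
        // prints [Employee Name, Shift]
    }
}
```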
That's pretty much the use case for a regular expression.
Try "(\\[[\\w ]*\\])" as your expression.
Pattern p = Pattern.compile("(\\[[\\w ]*\\])");
Matcher m = p.matcher("Where Are You [Employee Name]? your have a [Shift] shift..");
if (m.find()) {
String found = m.group();
}
What does this expression do?
First it defines a group (...)
Then it defines the starting point of that group: \[ matches a literal [. Since [ is a metacharacter in regular expressions it has to be escaped with \, and since \ is itself reserved in Java strings it has to be escaped with another \.
Then it defines the body of the group, [\w ]*. Here the character class [] contains \w (any letter, digit or underscore) and a blank (a literal space). The * means zero or more of the preceding character class.
Then it defines the end point of the group, \]
and closes the group )

Regex pattern to match <c:if> conditional variable names?

I need to get the condition variable names for all cases in a particular JSP.
I am reading the JSP line by line and searching for a particular pattern. For example, a line can contain either of these two kinds of condition:
<c:if condition="Event ='Confirmation'">
<c:if condition="Event1 = 'Confirmation' or Event2 = 'Action'or Event3 = 'Check'" .....>
The desired result is the names of all condition variables: Event, Event1, Event2, Event3. I have written a parser that handles only the first case; it cannot find the variable names in the second case. I need a pattern that handles both.
String stringSearch = "<c:if";
while ((line = bf.readLine()) != null) {
    // Increment the line count and find the index of the search string
    lineCount++;
    int indexfound = line.indexOf(stringSearch);
    if (indexfound > -1) {
        Pattern pattern = Pattern
                .compile("test=\"([\\!\\(]*)(.*?)([\\=\\)\\s\\.\\>\\[\\(]+?)");
        Matcher matcher = pattern.matcher(line);
        if (matcher.find()) {
            str = matcher.group(1);
            hset.add(str);
            counter++;
        }
    }
}
If I understood your requirement well, this may work :
("|\s+)!?(\w+?)\s*=\s*'.*?'
$2 will give each condition variable name.
What it does is:
("|\s+) a " or one or more spaces
!? an optional !
(\w+?) one or more word characters (letter, digit or underscore); ([A-Za-z]\w*) would be more correct
\s*=\s* an = preceded and followed by zero or more spaces
'.*?' zero or more characters inside ' and '
Second capture group is (\w+?) retrieving the variable name
Add the escaping for \ (and for ") that a Java string literal requires.
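As an illustration, here is the pattern above escaped for a Java string literal and applied to the two sample lines from the question. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConditionVariables {
    // The pattern from above with backslashes and the double quote escaped for Java.
    private static final Pattern VARIABLE =
            Pattern.compile("(\"|\\s+)!?(\\w+?)\\s*=\\s*'.*?'");

    // Returns the variable name (capture group 2) of every condition found in the line.
    static List<String> variableNames(String line) {
        List<String> names = new ArrayList<>();
        Matcher m = VARIABLE.matcher(line);
        while (m.find()) {
            names.add(m.group(2));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(variableNames("<c:if condition=\"Event ='Confirmation'\">"));
        // prints [Event]
        System.out.println(variableNames(
                "<c:if condition=\"Event1 = 'Confirmation' or Event2 = 'Action'or Event3 = 'Check'\" .....>"));
        // prints [Event1, Event2, Event3]
    }
}
```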
Edit: For the additional conditions you specified, the following may suffice:
("|or\s+|and\s+)!?(\w+?)(\[\d+\]|\..*?)?\s*(!?=|>=?|<=?)\s*.*?
("|or\s+|and\s+) A " or an or followed by one or more spaces or an and followed by one or more spaces. (Here, it is assumed that each expression part or variable name is preceded by a " or an or followed by one or more spaces or an and followed by one or more spaces)
!?(\w+?) An optional ! followed by one or more word character
(\[\d+\]|\..*?)? An optional part constituting a number enclosed in square brackets or a dot followed by zero or more characters
(!?=|>=?|<=?) Any of the following relational operators : =,!=,>,<,>=,<=
$2 will give the variable name.
Here second capture group is (\w+?) retrieving variable name and third capture group retrieves any suffix if present (eg:[2] in Event[2]).
For input containing a condition Event.indexOf(2)=something, $2 gives Event only. If you want it to be Event.indexOf(2) use $2$3.
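A sketch of the extended pattern in Java, showing the difference between using $2 alone and $2$3. The class name, method name and the sample JSP lines are made up for illustration. Group 3 is null when a variable has no suffix, so it is checked before being appended.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtendedConditionVariables {
    // The extended pattern from above, escaped for a Java string literal.
    private static final Pattern VARIABLE = Pattern.compile(
            "(\"|or\\s+|and\\s+)!?(\\w+?)(\\[\\d+\\]|\\..*?)?\\s*(!?=|>=?|<=?)\\s*.*?");

    // Returns group 2 for each condition, optionally followed by the group 3 suffix.
    static List<String> variableNames(String line, boolean withSuffix) {
        List<String> names = new ArrayList<>();
        Matcher m = VARIABLE.matcher(line);
        while (m.find()) {
            String name = m.group(2);
            String suffix = m.group(3); // null when the variable has no suffix
            names.add(withSuffix && suffix != null ? name + suffix : name);
        }
        return names;
    }

    public static void main(String[] args) {
        String line = "<c:if condition=\"Event[2] != 'x' and Count >= 3\">";
        System.out.println(variableNames(line, false)); // prints [Event, Count]
        System.out.println(variableNames(line, true));  // prints [Event[2], Count]
    }
}
```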
This could suit your needs:
"(\\w+)\\s*=\\s*(?!\")"
Which means:
Every word followed by a = that isn't followed by a "
For example:
String s = "<c:if condition=\"Event ='Confirmation'\"><c:if condition=\"Event1 = 'Confirmation' or Event2 = 'Action'or Event3 = 'Check'\" .....>";
Pattern p = Pattern.compile("(\\w+)\\s*=\\s*(?!\")");
Matcher m = p.matcher(s);
while (m.find()) {
System.out.println(m.group(1));
}
Prints:
Event
Event1
Event2
Event3

Splitting a string using Regex in Java

Would anyone be able to assist me with some regex?
I want to split the following string into a number, a string and a number:
"810LN15"
One method requires 810 to be returned, another requires LN and another should return 15.
The only real solution to this is using regex, as the numbers will grow in length.
What regex can I use to accommodate this?
String.split won't give you the desired result, which I guess would be "810", "LN", "15", since it would have to look for a token to split at and would strip that token.
Try Pattern and Matcher instead, using this regex: (\d+)|([a-zA-Z]+), which would match any sequence of numbers and letters and get distinct number/text groups (i.e. "AA810LN15QQ12345" would result in the groups "AA", "810", "LN", "15", "QQ" and "12345").
Example:
Pattern p = Pattern.compile("(\\d+)|([a-zA-Z]+)");
Matcher m = p.matcher("810LN15");
List<String> tokens = new LinkedList<String>();
while (m.find()) {
    // group() (group 0) is always the entire match; group 1 would be null for letter tokens
    String token = m.group();
    tokens.add(token);
}
//now iterate through 'tokens' and check whether you have a number or text
In Java, as in most regex flavors (Python before version 3.7 being a notable exception), the split() regex isn't required to consume any characters when it finds a match. Here I've used lookaheads and lookbehinds to match any position that has a digit on one side of it and a non-digit on the other:
String source = "810LN15";
String[] parts = source.split("(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)");
System.out.println(Arrays.toString(parts));
output:
[810, LN, 15]
(\\d+)([a-zA-Z]+)(\\d+) should do the trick. The first capture group will be the first number, the second capture group will be the letters in between and the third capture group will be the second number. The double backslashes are for java.
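A short sketch of the three-group pattern wrapped in one method per part, as the question describes. The class and method names are assumptions, and the values are returned as strings so that numbers of any length are handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CodeParts {
    // Group 1: leading digits, group 2: letters, group 3: trailing digits.
    private static final Pattern PARTS = Pattern.compile("(\\d+)([a-zA-Z]+)(\\d+)");

    // Matches the whole code and returns the requested capture group.
    private static String group(String code, int index) {
        Matcher m = PARTS.matcher(code);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected format: " + code);
        }
        return m.group(index);
    }

    static String leadingNumber(String code)  { return group(code, 1); }
    static String letters(String code)        { return group(code, 2); }
    static String trailingNumber(String code) { return group(code, 3); }

    public static void main(String[] args) {
        System.out.println(leadingNumber("810LN15"));  // 810
        System.out.println(letters("810LN15"));        // LN
        System.out.println(trailingNumber("810LN15")); // 15
    }
}
```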
This gives you the exact thing you guys are looking for
Pattern p = Pattern.compile("(([a-zA-Z]+)|(\\d+))|((\\d+)|([a-zA-Z]+))");
Matcher m = p.matcher("810LN15");
List<Object> tokens = new LinkedList<Object>();
while(m.find())
{
String token = m.group( 1 );
tokens.add(token);
}
System.out.println(tokens);
