I am pretty new to regular expressions and need to create a pattern that can match different text values (cases). The pattern I wrote works, but only for a single case. I would like to maximize the search pattern so that it can be used to different search texts.
By the way, I am using Java 8.
Objective:
Display matcher.find() by group.
Sample Search Texts and Expected output (Group):
Search Text: "employeeName:*borgy*";
Expected Output:
-
(employeeName) (:) (*) (borgy) (*)
-
Search Text: "employeeName:Borgy Manotoy*";
Expected Output:
-
(employeeName) (:) () (Borgy Manotoy) (*)
-
Search Text: "employeeName:*Borgy Manotoy*";
Expected Output:
-
(employeeName) (:) (*) (Borgy Manotoy) (*)
-
Search Text: "employeeEmail:*borgymanotoy#iyotbihagay.com*";
Expected Output:
-
(employeeEmail) (:) (*) (borgymanotoy#iyotbihagay.com) (*)
-
Search Text: "employeeEmail:borgymanotoy#iyotbihagay.com";
Expected Output:
-
(employeeEmail) (:) () (borgymanotoy#iyotbihagay.com) ()
-
Search Text: "employeeName:*Manotoy*, employeeEmail:*#iyotbihagay.*";
Expected Output:
-
(employeeName) (:) (*) (Manotoy) (*)
(employeeEmail) (:) (*) (#iyotbihagay.) (*)
-
Search Text: "employeeName:*Manotoy*, employeeEmail:*#iyotbihagay.*, employeeRole:*bouncer*";
Expected Output:
-
(employeeName) (:) (*) (Manotoy) (*)
(employeeEmail) (:) (*) (#iyotbihagay.) (*)
(employeeRole) (:) (*) (bouncer) (*)
-
Search pattern:
String searchPattern = "(\\w+?)(:|!)(\\p{Punct}?)(\\w+?) (.+?)?(\\p{Punct}?),";
Sample search texts:
String text1 = "employeeName:borgy";
String text2 = "employeeName:Borgy*";
String text3 = "employeeName:*borgy*";
String text4 = "employeeName:*Borgy*";
String text5 = "employeeName:*Borgy Manotoy*";
String text6 = "employeeEmail:*borgymanotoy#iyotbihagay.com*";
String text7 = "employeeEmail:borgymanotoy#iyotbihagay.com";
String text8 = "employeeEmail:borgymanotoy#iyotbihagay.*";
String text9 = "employeeEmail:*#iyotbihagay.*";
String text10 = "employeeName:*Manotoy*, employeeEmail:*#iyotbihagay.*";
Search texts using the given pattern:
processUserSearch(text1, searchPattern);
processUserSearch(text2, searchPattern);
processUserSearch(text3, searchPattern);
...
processUserSearch(text10, searchPattern);
Display found
private void processUserSearch(String searchText, String searchPattern) {
if (!Util.isEmptyOrNull(searchText) && !Util.isEmptyOrNull(searchPattern)) {
Pattern pattern = Pattern.compile(searchPattern);
Matcher matcher = pattern.matcher(searchText + ",");
while(matcher.find()) {
System.out.println("[matcher-count]: " + matcher.groupCount());
System.out.print("found: ");
for (int x = 1; x <= matcher.groupCount(); x++) {
System.out.print("(" + matcher.group(x) + ") ");
}
System.out.println("\n");
}
}
}
I suggest using
private static final Pattern pattern = Pattern.compile("(\\w+)([:!])(\\p{Punct}?)(.*?)(\\p{Punct}?)(?=$|,)");
private static void processUserSearch(String searchText) {
if (searchText != null && !searchText.isEmpty()) {
//if (!Util.isEmptyOrNull(searchText) && !Util.isEmptyOrNull(searchPattern)) {
Matcher matcher = pattern.matcher(searchText);
while(matcher.find()) {
System.out.println(searchText + "\n[matcher-count]: " + matcher.groupCount());
System.out.print("found: ");
for (int x = 1; x <= matcher.groupCount(); x++) {
System.out.print("(" + matcher.group(x) + ") ");
}
System.out.println("\n");
}
}
}
Note you can compile it once outside of the matching method for better efficiency.
Use as
String[] texts = new String[] { "employeeName:*borgy*","employeeName:Borgy Manotoy*","employeeName:*Borgy Manotoy*",
"employeeEmail:*borgymanotoy#iyotbihagay.com*","employeeEmail:borgymanotoy#iyotbihagay.com",
"employeeName:*Manotoy*, employeeEmail:*#iyotbihagay.*",
"employeeName:*Manotoy*, employeeEmail:*#iyotbihagay.*, employeeRole:*bouncer*"};
for (String s: texts) {
processUserSearch(s);
}
Here is the regex:
(\w+)([:!])(\p{Punct}?)(.*?)(\p{Punct}?)(?=$|,)
Details
(\w+) - Group 1: one or more word chars
([:!]) - Group 2: a : or !
(\p{Punct}?) - Group 3: an optional punctuation char
(.*?) - Group 4: any 0+ chars other than line break chars
(\p{Punct}?) - Group 5: an optional punctuation char
(?=$|,) - an end of string or , should come immediately to the right of the current location (but they do not get added to the match value since it is a positive lookahead).
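As a rough sketch of how the five groups could be collected instead of printed, the same pattern can be wrapped in a helper that returns one array per term. The class name SearchTermParser and the method parse are hypothetical names used only for illustration; the pattern itself is the one described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SearchTermParser {
    // Same pattern as described above, compiled once.
    private static final Pattern TERM =
        Pattern.compile("(\\w+)([:!])(\\p{Punct}?)(.*?)(\\p{Punct}?)(?=$|,)");

    // Hypothetical helper: returns one String[5] per field:value term found.
    static List<String[]> parse(String searchText) {
        List<String[]> terms = new ArrayList<>();
        if (searchText == null || searchText.isEmpty()) {
            return terms;
        }
        Matcher m = TERM.matcher(searchText);
        while (m.find()) {
            String[] groups = new String[m.groupCount()];
            for (int i = 1; i <= m.groupCount(); i++) {
                groups[i - 1] = m.group(i); // groups are 1-based
            }
            terms.add(groups);
        }
        return terms;
    }

    public static void main(String[] args) {
        for (String[] g : parse("employeeName:*Manotoy*, employeeEmail:*#iyotbihagay.*")) {
            System.out.println(String.join(" | ", g));
        }
    }
}
```

Note that for the input employeeEmail:*#iyotbihagay.* the fourth group ends with the dot, because the trailing * is taken by the fifth group.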
I would like to maximize the search pattern so that it can be used to different search texts.
And what are "different search texts"? Be specific!
Your problem doesn't seem specific to Java. Your current pattern contains (:|!), but none of the examples suggest how !s may occur in the input. You use \p{Punct} to match the * surrounding the names and emails, but you have no examples of enclosures other than *. You don't say what the purpose of the *s is; are they enclosures, wildcard patterns, or something else?
The following pattern seems to work for some purposes:
(?:employee(Name|Email)):([\w*#. ]+)
Let's imagine I have the following strings:
String one = "123|abc|123abc";
String two = "123|ab12c|abc|456|abc|def";
String three = "123|1abc|1abc1|456|abc|wer";
String four = "123|abc|def|456|ghi|jkl|789|mno|pqr";
If I do a split on them I expect the following output:
one = ["123|abc|123abc"];
two = ["123|ab12c|abc", "456|abc|def"];
three = ["123|1abc|1abc1", "456|abc|wer"];
four = ["123|abc|def", "456|ghi|jkl", "789|mno|pqr"];
The string has the following structure:
It starts with 1 or more digits, followed by any number of groups, each consisting of a | followed by any number of characters.
A field after a | that contains only digits starts a new value.
More examples:
In - 123456|xxxxxx|zzzzzzz|xa2314|xzxczxc|1234|qwerty
Out - ["123456|xxxxxx|zzzzzzz|xa2314|xzxczxc", "1234|qwerty"]
I tried multiple variations of the following, but none of them work:
value.split( "\\|\\d+|\\d+" )
You may split on \|(?=\d+(?:\||$)):
List<String> nums = Arrays.asList(new String[] {
"123|abc|123abc",
"123|ab12c|abc|456|abc|def",
"123|1abc|1abc1|456|abc|wer",
"123|abc|def|456|ghi|jkl|789|mno|pqr"
});
for (String num : nums) {
String[] parts = num.split("\\|(?=\\d+(?:\\||$))");
System.out.println(num + " => " + Arrays.toString(parts));
}
This prints:
123|abc|123abc => [123|abc|123abc]
123|ab12c|abc|456|abc|def => [123|ab12c|abc, 456|abc|def]
123|1abc|1abc1|456|abc|wer => [123|1abc|1abc1, 456|abc|wer]
123|abc|def|456|ghi|jkl|789|mno|pqr => [123|abc|def, 456|ghi|jkl, 789|mno|pqr]
Instead of splitting, you can match the parts in the string:
\b\d+(?:\|(?!\d+(?:$|\|))[^|\r\n]+)*
\b A word boundary
\d+ Match 1+ digits
(?: Non capture group
\|(?!\d+(?:$|\|)) Match | and assert not only digits till either the next pipe or the end of the string
[^|\r\n]+ Match 1+ chars other than a pipe or a newline
)* Close the non capture group and optionally repeat (use + to repeat one or more times to match at least one pipe char)
String regex = "\\b\\d+(?:\\|(?!\\d+(?:$|\\|))[^|\\r\\n]+)+";
String string = "123|abc|def|456|ghi|jkl|789|mno|pqr";
Pattern pattern = Pattern.compile(regex);
Matcher m = pattern.matcher(string);
List<String> matches = new ArrayList<String>();
while (m.find())
matches.add(m.group());
for (String s : matches)
System.out.println(s);
Output
123|abc|def
456|ghi|jkl
789|mno|pqr
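To compare the two approaches side by side, here is a small sketch that wraps the split pattern and the match pattern in two helpers. The class and method names (PipeRecords, bySplit, byMatch) are made up for this illustration. The match variant uses the * quantifier from the explanation, so a record made of digits alone is also accepted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PipeRecords {
    private static final Pattern RECORD =
        Pattern.compile("\\b\\d+(?:\\|(?!\\d+(?:$|\\|))[^|\\r\\n]+)*");

    // Split at each pipe that is followed by a digits-only field.
    static List<String> bySplit(String input) {
        return Arrays.asList(input.split("\\|(?=\\d+(?:\\||$))"));
    }

    // Match each record: leading digits, then fields that are not digits-only.
    static List<String> byMatch(String input) {
        List<String> records = new ArrayList<>();
        Matcher m = RECORD.matcher(input);
        while (m.find()) {
            records.add(m.group());
        }
        return records;
    }

    public static void main(String[] args) {
        String input = "123456|xxxxxx|zzzzzzz|xa2314|xzxczxc|1234|qwerty";
        System.out.println(bySplit(input));
        System.out.println(byMatch(input));
    }
}
```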
I have an arbitrary string, e.g.
String multiline=`
This is my "test" case
with lines
\section{new section}
Another incorrect test"
\section{next section}
With some more "text"
\subsection{next section}
With some more "text1"
`
I use LaTeX and I want to replace the quotes with those used in books, similar to ,, and ´´. For this I need to replace each opening quote with \glqq and each closing quote with \grqq, within each group that starts with \.?section.
If I try the following
String pattern1 = "(^\\\\.?section\\{.+\\})[\\s\\S]*(\\\"(.+)\\\")";
Pattern p = Pattern.compile(pattern1, Pattern.MULTILINE);
Matcher m = p.matcher(testString);
System.out.println(p.matcher(testString).find()); //true
while (m.find()) {
for (int i = 0; i < 4; i++) {
System.out.println("Index: " + i);
System.out.println(m.group(i).replaceAll("\"([\\w]+)\"", "\u00AB$1\u00BB"));
}
}
I get as a result on the console
true
Index: 0
\section{new section}
Another incorrect test"
\section{next section}
With some more «text1»
Index: 1
\section{new section}
Index: 2
«text1»
Index: 3
text1
Some problems with the current approach:
The first valid match ("text") isn't found. I guess it has to do with the multiline flag and incorrect grouping of \section{. The search for quotes should be restricted to a group which starts with \.?section and ends at the next \.?section. How can I do this correctly?
Even when the text is found properly, how do I get a complete string with the replacements?
You may match all text between section and the next section (or the end of the string), and replace all "..." strings inside it with «...».
Here is the Java snippet:
String s = "This is my \"test\" case\nwith lines\n\\section{new section}\nAnother incorrect test\"\n\\section{next section}\nWith some more \"text\"\n\\subsection{next section}\nWith some more \"text1\"";
StringBuffer result = new StringBuffer();
Matcher m = Pattern.compile("(?s)section.*?(?=section|$)").matcher(s);
while (m.find()) {
String out = m.group(0).replaceAll("\"([^\"]*)\"", "«$1»");
m.appendReplacement(result, Matcher.quoteReplacement(out));
}
m.appendTail(result);
System.out.println(result.toString());
Output:
This is my "test" case
with lines
\section{new section}
Another incorrect test"
\section{next section}
With some more «text»
\subsection{next section}
With some more «text1»
The pattern means:
(?s) - Pattern.DOTALL embedded flag option
section - a section substring
.*? - any 0+ chars, as few as possible
(?=section|$) - a positive lookahead that requires a section substring or end of string to appear immediately to the right of the current location.
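The question asked for the LaTeX macros \glqq and \grqq rather than « and ». Below is a minimal sketch of that variation, reusing the same outer pattern. The class name LatexQuotes and the method replaceQuotes are hypothetical, and the sketch assumes that the word "section" does not appear in ordinary body text, since that would cut a chunk short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LatexQuotes {
    // Same outer pattern as above: from "section" up to the next "section" or the end.
    private static final Pattern SECTION_CHUNK =
        Pattern.compile("(?s)section.*?(?=section|$)");

    // Hypothetical helper: rewrites "..." as \glqq{}...\grqq{} inside section chunks only.
    static String replaceQuotes(String latex) {
        StringBuffer result = new StringBuffer();
        Matcher m = SECTION_CHUNK.matcher(latex);
        while (m.find()) {
            // In a replacement string, \\\\ produces one literal backslash
            // and $1 is the text between the quotes.
            String chunk = m.group().replaceAll("\"([^\"]*)\"", "\\\\glqq{}$1\\\\grqq{}");
            // quoteReplacement keeps the backslashes in chunk from being re-interpreted.
            m.appendReplacement(result, Matcher.quoteReplacement(chunk));
        }
        m.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceQuotes("Intro \"a\"\n\\section{one}\nSome \"text\" here"));
    }
}
```

Text before the first section is left untouched, which matches the behavior of the snippet above.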
Does anyone see something wrong with this regex? All I want is for it to find any occurrence of "the" and replace it with whatever word the user chooses. This expression only changes some occurrences, and when it does, it removes the whitespace before the word and, I guess, concatenates the replacement with the preceding word.
Also, it should not replace then, there, their, they, etc.
private final String MY_REGEX = (" the | THE | thE | The | tHe | ThE ");
userInput = JTxtInput.getText();
String usersChoice = JTxtUserChoice.getText();
String usersChoiceOut = (usersChoice + " ");
Pattern pattern = Pattern.compile(MY_REGEX, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(userInput);
while (matcher.find())
{
userInput = userInput.replaceAll(MY_REGEX, usersChoiceOut);
JTxtOutput.setText(userInput);
System.out.println(userInput);
}
Ok this new code seems to replace all desired words and nothing else, also doing it without the spacing issues.
private final String MY_REGEX = ("the |THE |thE |The |tHe |ThE |THe ");
String usersChoiceOut = (usersChoice + " ");
The problem is because of the spaces in MY_REGEX. Check the following demo:
public class Main {
public static void main(String[] args) {
String str="This is the eighth wonder of THE world! How about a new style of writing The as tHe";
// Correct way
String MY_REGEX = ("the|THE|thE|The|tHe|ThE");
System.out.println(str.replaceAll(MY_REGEX, "###"));
}
}
Outputs:
This is ### eighth wonder of ### world! How about a new style of writing ### as ###
whereas
public class Main {
public static void main(String[] args) {
String str="This is the eighth wonder of THE world! How about a new style of writing The as tHe";
// Incorrect way
String MY_REGEX = ("the | THE | thE | The | tHe | ThE");
System.out.println(str.replaceAll(MY_REGEX, "###"));
}
}
Outputs:
This is ###eighth wonder of###world! How about a new style of writing###as tHe
The spaces in the alternation are significant: the regex tries to match them literally on both sides of the word.
As you are already using Pattern.CASE_INSENSITIVE, you could also match "the" followed by a single space, as you mention in your update, and use the inline modifier (?i) to make the pattern case insensitive.
userInput = userInput.replaceAll("(?i)the ", usersChoiceOut);
If "the" should not be part of a larger word, add a word boundary \b before it.
(?i)\bthe
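A possible variation, shown here only as a sketch, puts a word boundary on both sides of the word instead of relying on the trailing space. That way "then" or "there" stay untouched, and the surrounding spaces are not consumed. The class name WordReplacer and the method replaceWord are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordReplacer {
    // Hypothetical helper: replaces whole-word, case-insensitive occurrences of word.
    static String replaceWord(String input, String word, String replacement) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                                    Pattern.CASE_INSENSITIVE);
        // quoteReplacement keeps $ and \ in the user's text literal.
        return p.matcher(input).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceWord("The cat and then the dog, THE end, tHe", "the", "a"));
    }
}
```

Matcher.quoteReplacement matters here because the replacement comes from user input and may contain $ or \ characters.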
I'm trying to search inside a string for a set of words contained in an ArrayList (terms_1pers). Since the precondition is that there should be no letters directly before or after the search word, I thought of using a regular expression.
I just don't know what I'm doing wrong with the matches method. In the code below, if no match is found, the record is written to an external file.
String url = csvRecord.get("url");
String text = csvRecord.get("review");
String var = null;
for(String term : terms_1pers)
{
if(!text.matches("[^a-z]"+term+"[^a-z]"))
{
var="true";
}
}
if(!var.equals("true"))
{
bw.write(url+";"+text+"\n");
}
In order to find regex matches, you should use the regex classes Pattern and Matcher.
String term = "term";
ArrayList<String> a = new ArrayList<String>();
a.add("123term456"); //true
a.add("A123Term5"); //false
a.add("term456"); //true
a.add("123term"); //true
Pattern p = Pattern.compile("^[^A-Za-z]*(" + term + ")[^A-Za-z]*$");
for(String text : a) {
Matcher m = p.matcher(text);
if (m.find()) {
System.out.println("Found: " + m.group(1) );
// group(0) is the whole match; the term is the first capturing group, so use group(1)
}
else System.out.println("No match for: " + term);
}
In the example above, we create an instance of Pattern (https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) to find matches in the text you are matching against.
Note that I adjusted the regex a bit. This version excludes all letters, A-Z and their lowercase versions, from the parts before and after the term. It also allows for situations where there are no characters at all before or after the term. If you need to have something there, use + instead of *. I also anchored the regex with ^ and $ so that the whole text must consist of only these three parts. If this doesn't fit your use case, you may need to adjust.
To demonstrate using this with a variety of different terms:
ArrayList<String> terms = new ArrayList<String>();
terms.add("term");
terms.add("the book is on the table");
terms.add("1981 was the best year ever!");
ArrayList<String> a = new ArrayList<String>();
a.add("123term456");
a.add("A123Term5");
a.add("the book is on the table456");
a.add("1##!231981 was the best year ever!9#");
for (String term: terms) {
Pattern p = Pattern.compile("^[^A-Za-z]*(" + term + ")[^A-Za-z]*$");
for(String text : a) {
Matcher m = p.matcher(text);
if (m.find()) {
System.out.println("Found: " + m.group(1) + " in " + text);
// group(0) is the whole match; the term is the first capturing group, so use group(1)
}
else System.out.println("No match for: " + term + " in " + text);
}
}
Output for this is:
Found: term in 123term456
No match for: term in A123Term5
No match for: term in the book is on the table456....
In response to the question about making String term case insensitive, here's a way to build the regex string by using java.lang.Character to generate upper- and lower-case options for each letter.
String term = "This iS the teRm.";
String matchText = "123This is the term.";
StringBuilder str = new StringBuilder();
str.append("^[^A-Za-z]*(");
for (int i = 0; i < term.length(); i++) {
char c = term.charAt(i);
if (Character.isLetter(c))
str.append("(" + Character.toLowerCase(c) + "|" + Character.toUpperCase(c) + ")");
else str.append(c);
}
str.append(")[^A-Za-z]*$");
System.out.println(str.toString());
Pattern p = Pattern.compile(str.toString());
Matcher m = p.matcher(matchText);
if (m.find()) System.out.println("Found!");
else System.out.println("Not Found!");
This code outputs two lines, the first line is the regex string that's being compiled in the Pattern. "^[^A-Za-z]*((t|T)(h|H)(i|I)(s|S) (i|I)(s|S) (t|T)(h|H)(e|E) (t|T)(e|E)(r|R)(m|M).)[^A-Za-z]*$" This adjusted regex allows for letters in the term to be matched regardless of case. The second output line is "Found!" because the mixed case term is found within matchText.
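An alternative to building the (t|T)-style alternations by hand, sketched here under the assumption that the term should be treated as literal text, is to combine Pattern.quote with the Pattern.CASE_INSENSITIVE flag. Unlike the generated regex above, the trailing dot is then matched literally instead of as "any character". The names CaseInsensitiveTerm and matchesTerm are hypothetical.

```java
import java.util.regex.Pattern;

class CaseInsensitiveTerm {
    // Hypothetical helper: true if text is the term surrounded only by non-letters.
    static boolean matchesTerm(String term, String text) {
        Pattern p = Pattern.compile(
            "^[^A-Za-z]*(" + Pattern.quote(term) + ")[^A-Za-z]*$",
            Pattern.CASE_INSENSITIVE);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesTerm("This iS the teRm.", "123This is the term."));
    }
}
```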
There are several things to note:
matches requires a full string match, so [^a-z]term[^a-z] will only match a string like :term. (one non-letter, the term, one non-letter). You need to use .find() to find partial matches.
If you pass a literal string to a regex, you need to Pattern.quote it; otherwise, if it contains special chars, it will not be matched correctly.
To check if a word has some pattern before or after or at the start/end, you should either use alternations with anchors (like (?:^|[^a-z]) or (?:$|[^a-z])) or lookarounds, (?<![a-z]) and (?![a-z]).
To match any letter just use \p{Alpha} or - if you plan to match any Unicode letter - \p{L}.
It is more logical to declare the var variable as a Boolean.
Fixed code:
String url = csvRecord.get("url");
String text = csvRecord.get("review");
Boolean var = false;
for(String term : terms_1pers)
{
Matcher m = Pattern.compile("(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})").matcher(text);
// If the search must be case insensitive use
// Matcher m = Pattern.compile("(?i)(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})").matcher(text);
if(!m.find())
{
var = true;
}
}
if (!var) {
bw.write(url+";"+text+"\n");
}
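The points above about matches versus find, Pattern.quote, and lookarounds can be checked with a small self-contained sketch. WholeWordSearch and containsWord are names chosen for this illustration and are not part of the original code.

```java
import java.util.regex.Pattern;

class WholeWordSearch {
    // Hypothetical helper: true if term occurs with no Unicode letter directly before or after it.
    static boolean containsWord(String text, String term) {
        return Pattern.compile("(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})")
                      .matcher(text)
                      .find();
    }

    public static void main(String[] args) {
        // matches() must consume the whole string, so only the first call succeeds.
        System.out.println(":term.".matches("[^a-z]term[^a-z]"));
        System.out.println("a :term. b".matches("[^a-z]term[^a-z]"));
        // find() with lookarounds locates the term anywhere in the text.
        System.out.println(containsWord("Yesterday I went there", "I"));
        System.out.println(containsWord("Italy is nice", "I"));
    }
}
```

Because the term goes through Pattern.quote, a term such as a.b only matches the literal text a.b and not, for example, axb.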
You did not consider the case where the text before and after the term may contain letters,
so adding .* at the front and end should solve your problem.
for(String term : terms_1pers)
{
if (text.matches(".*[^a-zA-Z]+" + term + "[^a-zA-Z]+.*"))
{
var="true";
break; //exit the loop
}
}
if(!var.equals("true"))
{
bw.write(url+";"+text+"\n");
}
What should a Java regex look like if I want to find two matches?
1. NEW D City
2. 1259669
From
Object No: NEW D City | Item ID: 1259669
I tried with
(?<=:\s)\w+
but it only get
1. NEW
2. 1259669
https://regex101.com/r/j5jwK2/1
Using a pattern to capture both values is simpler. Here is the regex used:
Object No: ([^|]*)\| Item ID: (\d*)
And here is code generated by regex101 and adapted to produce the output you want.
final String regex = "Object No: ([^|]*)\\| Item ID: (\\d*)";
final String string = "Object No: NEW D City | Item ID: 1259669";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println(i + ": " + matcher.group(i));
}
}
Output:
1: NEW D City
2: 1259669
A similar but more generic solution would be [^:]*[:\s]*([^|]*)\|[^:]*[:\s]*(\d*) (not perfect; I didn't try to make it efficient).
You may use a combination of two splits:
String key = "Object No: NEW D City | Item ID: 1259669";
String[] parts = key.split("\\s*\\|\\s*");
List<String> result = new ArrayList<>();
for (String part : parts) {
String[] kvp = part.split(":\\s*");
if (kvp.length == 2) {
result.add(kvp[1]);
System.out.println(kvp[1]); // demo
}
}
First, you split with \\s*\\|\\s* (a | enclosed with 0+ whitespaces) and then with :\\s*, a colon followed with 0+ whitespaces.
Another approach is to use :\s*([^|]+) pattern and grab and trim Group 1 value:
String s = "Object No: NEW D City | Item ID: 1259669";
List<String> result = new ArrayList<>();
Pattern p = Pattern.compile(":\\s*([^|]+)");
Matcher m = p.matcher(s);
while(m.find()) {
result.add(m.group(1).trim());
System.out.println(m.group(1).trim()); // For demo
}
In this regex, ([^|]+) is a capturing group (pushing its contents into matcher.group(1)) that matches one or more (+) chars other than | (with the [^|] negated character class).
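If the labels are needed as well as the values, one possible extension (a sketch, not part of the answers above) is to capture both sides of each colon and store them in a map. The class name KeyValueParser, the method parse, and the pattern ([^:|]+):\s*([^|]*) are assumptions made for this example; the sketch also assumes that the values contain no colon or pipe.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyValueParser {
    // Group 1: label (no colon or pipe). Group 2: value (everything up to the next pipe).
    private static final Pattern PAIR = Pattern.compile("([^:|]+):\\s*([^|]*)");

    // Hypothetical helper: maps each trimmed label to its trimmed value, in input order.
    static Map<String, String> parse(String line) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            pairs.put(m.group(1).trim(), m.group(2).trim());
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(parse("Object No: NEW D City | Item ID: 1259669"));
    }
}
```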