Regex not working properly in all cases - java

I am using a regex to get word from string its working fine in alphanumeric case but return wrong answer if we are used arithmetic operator.
Matcher oMatcher;
Pattern oPattern;
String key = "a++";
oPattern = Pattern.compile("\\b" + key + "\\b");
oMatcher = oPattern.matcher("max winzer® build-a-chair cocktailsessel »luisa« in runder form, zum selbstgestalten");
if (oMatcher.find()) {
System.out.println("True");
}

You have to escape any potential regex special characters in key with Pattern.quote:
oPattern = Pattern.compile("\\b" + Pattern.quote(key) + "\\b");
^^^^^^^^^^^^^

Related

Why matcher.find() for input parameter always return 'false'?

I have a strange situation which I find difficult to understand regarding regex matcher.
When I pass the next input parameter issueBody to the matcher, the matcher.find() always return false, while passing a hard-coded String with the same value as the issueBody - it works as expected.
The regex function:
private Map<String, String> extractCodeSnippet(Set<String> resolvedIssueCodeLines, String issueBody) {
String codeSnippetForCodeLinePattern = "\\(Line #%s\\).*\\W\\`{3}\\W+(.*)(?=\\W+\\`{3})";
Map<String, String> resolvedIssuesMap = new HashMap<>();
for (String currentResolvedIssue : resolvedIssueCodeLines) {
String currentCodeLinePattern = String.format(codeSnippetForCodeLinePattern, currentResolvedIssue);
Pattern pattern = Pattern.compile(currentCodeLinePattern, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(issueBody);
while (matcher.find()) {
resolvedIssuesMap.put(currentResolvedIssue, matcher.group());
}
}
return resolvedIssuesMap;
}
The following always return false
Pattern pattern = Pattern.compile(currentCodeLinePattern, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(issueBody);
While the following always return true
Pattern pattern = Pattern.compile(currentCodeLinePattern, Pattern.MULTILINE);
Matcher matcher = pattern.matcher("**SQL_Injection** issue exists # **VB_3845_112_lines/encode.frm** in branch **master**\n" +
"\n" +
"Severity: High\n" +
"\n" +
"CWE:89\n" +
"\n" +
"[Vulnerability details and guidance](https://cwe.mitre.org/data/definitions/89.html)\n" +
"\n" +
"[Internal Guidance](https://checkmarx.atlassian.net/wiki/spaces/AS/pages/79462432/Remediation+Guidance)\n" +
"\n" +
"[ppttx](http://WIN2K12-TEMP/bbcl/ViewerMain.aspx?planid=1010013&projectid=10005&pathid=1)\n" +
"\n" +
"Lines: 41 42 \n" +
"\n" +
"---\n" +
"[Code (Line #41):](null#L41)\n" +
"```\n" +
" user_name = txtUserName.Text\n" +
"```\n" +
"---\n" +
"[Code (Line #42):](null#L42)\n" +
"```\n" +
" password = txtPassword.Text\n" +
"```\n" +
"---\n");
My question is - why? what is the difference between the two statements?
TL;DR:
By using Pattern.UNIX_LINES, you tell Java regex engine to match with . any char but a newline, LF. Use
Pattern pattern = Pattern.compile(currentCodeLinePattern, Pattern.UNIX_LINES);
In your hard-coded string, you have only newlines, LF endings, while your issueBody most likely contains \r\n, CRLF endings. Your pattern only matches a single non-word char with \W (see \\W\\`{3} pattern part), but CRLF consists of two non-word chars. By default, . does not match line break chars, so it does not match neither \r, CR, nor \n, LF. The \(Line #%s\).*\W\`{3} fails right because of this:
\(Line #%s\) - matches (Line #<NUMBER>)
.* - matches 0 or more chars other than any line break char (up to CR or CRLF)
\W - matches a char other than a letter/digit/_ (so, only \r or \n)
\`{3} - 3 backticks - these are only matched if there was a \n ending, not \r\n (CRLF).
Again, by using Pattern.UNIX_LINES, you tell Java regex engine to match with . any char but a newline, LF.
BTW, Pattern.MULTILINE only makes ^ match at the start of each line, and $ to match at the end of each line, and since there are neither ^, nor $ in your pattern, you may safely discard this option.

How to find match for exact word using pattern matcher in java

I have shared my sample code here. here i am trying to find word "engine" with different strings. i used word boundary to match the words in string.
it matches word if it starts with #engine(example).
it should only match with exact word.
private void checkMatch() {
String source1 = "search engines has ";
String source2 = "search engine exact word";
String source3 = "enginecheck";
String source4 = "has hashtag #engine";
String key = "engine";
System.out.println(isContain(source1, key));
System.out.println(isContain(source2, key));
System.out.println(isContain(source3, key));
System.out.println(isContain(source4, key));
}
private boolean isContain(String source, String subItem) {
String pattern = "\\b" + subItem + "\\b";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(source);
return m.find();
}
**Expected output**
false
true
false
false
**actual output**
false
true
false
true
For this case, you have to use regex OR instead of word boundary. \\b matches between a word char and non-word char (vice-versa). So your regex should find a match in #engine since # is a non-word character.
private boolean isContain(String source, String subItem) {
String pattern = "(?m)(^|\\s)" + subItem + "(\\s|$)";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(source);
return m.find();
}
or
String pattern = "(?<!\\S)" + subItem + "(?!\\S)";
Change your pattern as below.
String pattern = "\\s" + subItem + "\\b";
If you are looking for a literal text enclosed with spaces or start/end of the string, you can split the string with a mere whitespace pattern like \s+ and check if any of the chunks equals the search text.
Java demo:
String s = "Can't start the #engine here, but this engine works";
String searchText = "engine";
boolean found = Arrays.stream(s.split("\\s+"))
.anyMatch(word -> word.equals(searchText));
System.out.println(found); // => true
Change the regexp to
String pattern = "\\s"+subItem + "\\s";
I'm using the
\s A whitespace character: [ \t\n\x0B\f\r]
For more info look into the java.util.regex.Pattern javadoc
Also if you want to support strings like these:
"has hashtag engine"
"engine"
You can improve it by adding the ending/starting line terminators (^ and $)
by using this pattern:
String pattern = "(^|\\s)"+subItem + "(\\s|$)";

Replace different Regex-Matches with Match-based results in Java

One common usage for regex is the replacement of the matches with something that is based on the matches.
For example a commit-text with ticket numbers ABC-1234: some text (ABC-1234) has to be replaced with <ABC-1234>: some text (<ABC-1234>) (<> as example for some surroundings.)
This is very simple in Java
String message = "ABC-9913 - Bugfix: Some text. (ABC-9913)";
String finalMessage = message;
Matcher matcher = Pattern.compile("ABC-\\d+").matcher(message);
if (matcher.find()) {
String ticket = matcher.group();
finalMessage = finalMessage.replace(ticket, "<" + ticket + ">");
}
System.out.println(finalMessage);
results in<ABC-9913> - Bugfix: Some text. (<ABC-9913>).
But if there are different matches in the input String, this is different. I tried a slightly different code replacing if (matcher.find()) { with while (matcher.find()) {. The result is messed up with doubled replacements (<<ABC-9913>>).
How can I replace all matching values in an elegant way?
You can simply use replaceAll:
String input = "ABC-1234: some text (ABC-1234)";
System.out.println(input.replaceAll("ABC-\\d+", "<$0>"));
prints:
<ABC-1234>: some text (<ABC-1234>)
$0 is a reference to the matched string.
Java regex reference (see "Groups and capturing").
The problem is that the replace() method transforms the string over and over again.
A better way is to replace one match at a time. The matcher class has an appendReplacement-method for this.
String message = "ABC-9913, ABC-9915 - Bugfix: Some text. (ABC-9913,ABC-9915)";
Matcher matcher = Pattern.compile("ABC-\\d+").matcher(message);
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
String ticket = matcher.group();
matcher.appendReplacement(sb, "<" + ticket + ">");
}
matcher.appendTail(sb);
System.out.println(sb);

Regular expression on a string

I have a String like below
String phone = (123) 456-7890
Now I would like my program to verify if that my input is the same pattern as string 'phone'
I did the following
if(phone.contains("([0-9][0-9][0-9]) [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")) {
//display pass
}
else {
//display fail
}
It didn't work. I tried with other combinations too. nothing worked.
Question :
1. How can I achieve this without using 'Pattern' like above?
2. How to do this with pattern. I tried with pattern as below
Pattern pattern = Pattern.compile("(\d+)");
Matcher match = pattern.matcher(phone);
if (match.find()) {
//Displaypass
}
String#matches checks if a string matches a pattern:
if (phone.matches("\\(\\d{3}\\) \\d{3}-\\d{4}")) {
//Displaypass
}
The pattern is a regular expression. Therefor I had to escape the round brackets, as they have a special meaning in regex (they denote capturing groups).
contains() only checks if a string contains the substring passed to it.
I'm not going to dive too deeply into regex syntax, but there definitely is something off with your regex.
"([0-9][0-9][0-9]) [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
it containes ( and ) and those have special meaning. Escape them
"\([0-9][0-9][0-9]\) [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
and you'll also have to escape your \ for the final
"\\([0-9][0-9][0-9]\\) [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
You can write like:
Pattern pattern = Pattern.compile("\\(\\d{3}\\) \\d{3}-\\d{4}");
Matcher matcher = pattern.matcher(sPhoneNumber);
if (matcher.matches()) {
System.out.println("Phone Number Valid");
}
For more information you can visit this article.
It appears that your problem is that you didn't escape the parentheses, so your Regex is failing. Try this:
\([0-9][0-9][0-9]\) [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]
This works
String PHONE_REGEX = "[(]\\b[0-9]{3}\\b[)][ ]\\b[0-9]{3}\\b[-]\\b[0-9]{4}\\b";
String phone1 = "(1234) 891-6762";
Boolean b = phone1.matches(PHONE_REGEX);
System.out.println("is e-mail: " + phone1 + " :Valid = " + b);
String phone2 = "(143) 456-7890";
b = phone2.matches(PHONE_REGEX);
System.out.println("is e-mail: " + phone2 + " :Valid = " + b);
Output:
is phone: (1234) 891-6762 :Valid = false
is phone: (143) 456-7890 :Valid = true

How can I change Regex search in java to overlook case

How can I change the following code so it will not care about case?
public static String tagValue(String inHTML, String tag)
throws DataNotFoundException {
String value = null;
Matcher m = null;
int count = 0;
try {
String searchFor = "<" + tag + ">(.*?)</" + tag + ">";
Pattern pattern = Pattern.compile(searchFor);
m = pattern.matcher(inHTML);
while (m.find()) {
count++;
return inHTML.substring(m.start(), m.end());
// System.out.println(inHTML.substring(m.start(), m.end()));
}
} catch (Exception e) {
throw new DataNotFoundException("Can't Find " + tag + "Tag.");
}
if (count == 0) {
throw new DataNotFoundException("Can't Find " + tag + "Tag.");
}
return inHTML.substring(m.start(), m.end());
}
Give the Pattern.CASE_INSENSITIVE flag to Pattern.compile:
String searchFor = "<" + tag + ">(.*?)</" + tag + ">";
Pattern pattern = Pattern.compile(searchFor, Pattern.CASE_INSENSITIVE);
m = pattern.matcher(inHTML);
(Oh, and consider parsing XML/HTML instead of using a regular expression to match a nonregular language.)
You can also compile the pattern with the case-insensitive flag:
Pattern pattern = Pattern.compile(searchFor, Pattern.CASE_INSENSITIVE);
First, read Using regular expressions to parse HTML: why not?
To answer your question though, in general, you can just put (?i) at the beginning of the regular expression:
String searchFor = "(?i)" + "<" + tag + ">(.*?)</" + tag + ">";
The Pattern Javadoc explains
Case-insensitive matching can also be enabled via the embedded flag expression (?i).
Since you're using Pattern.compile you can also just pass the CASE_INSENSITIVE flag:
String searchFor = "<" + tag + ">(.*?)</" + tag + ">";
Pattern pattern = Pattern.compile(searchFor, Pattern.CASE_INSENSITIVE);
You should know what case-insensitive means in Java regular expressions.
By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag in conjunction with this flag.
It looks like you're matching tags, so you only want US-ASCII.

Categories