Possible backslash escaping issue when trying to perform a regex - java

Hi I have the following sentence which is within a much larger string variable:
New Component <b>TEST</b> is successfully registered.
I'm trying to perform a regex match to find this sentence within the string. The word TEST is variable and can be any word.
I'm using the following pattern in regexr which runs fine:
New Component <b>\w*<\/b> is successfully registered.
In my java code I have to write it as
Pattern p = Pattern.compile("New Component <b>\\w*<\\/b> is successfully registered.");
Matcher m = p.matcher(result.toString());
if (m.matches()) {
    System.out.println("hurray!");
}
This is because I need to escape the backslashes. However, the pattern does not match in the Java code and "hurray!" is not printed. Is there an issue with the backslashes, or with the way I have used them here, that is causing the matcher to fail?

Try adding .* to the start and end of the pattern:
Pattern p = Pattern.compile(".*New Component <b>\\w*<\\/b> is successfully registered\\..*");
matches() tries to match the entire input against the pattern. Because your target sentence is only part of a larger string, any characters before or after it cause the match to fail.
.* tells the matcher to accept 0 or more of ANY character before and after your target string.
Edit: Also, if you want to match the full stop at the end of the line, you should escape it with \. (written as \\. inside a Java string literal). An unescaped dot has a special meaning in regex: it matches any character.
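A minimal sketch of the advice above, assuming a made-up surrounding string. The class name, the helper containsSentence and the sample text are illustrative and not from the question:

```java
import java.util.regex.Pattern;

class WrappedPatternDemo {
    // The sentence pattern wrapped in .* on both sides, with the final dot escaped.
    static final Pattern WRAPPED = Pattern.compile(
            ".*New Component <b>\\w*<\\/b> is successfully registered\\..*");

    // Hypothetical helper: true if the sentence occurs somewhere in the text.
    static boolean containsSentence(String text) {
        return WRAPPED.matcher(text).matches();
    }

    public static void main(String[] args) {
        String page = "<p>Status: New Component <b>TEST</b> is successfully registered. Done.</p>";
        System.out.println(containsSentence(page));
    }
}
```

One caveat: by default the dot does not match line terminators, so if the larger string contains line breaks, the surrounding .* cannot cross them unless Pattern.DOTALL (or the embedded flag (?s)) is also set.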

Further to @dahui's answer, the other option is to replace m.matches() with m.find().
.matches() requires the regex to match the entire string.
.find() only requires the regex to match some substring of the string.
Edit:
The following does print "hurray!" when I run it:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SO {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("New Component <b>\\w*<\\/b> is successfully registered.");
        Matcher m = p.matcher("New Component <b>TEST</b> is successfully registered.");
        if (m.matches()) {
            System.out.println("hurray!");
        }
    }
}
Is it possible result.toString() isn't what you think it is?
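To illustrate the difference between matches() and find() described in this answer, here is a small sketch. The capture group around \w*, the helper componentName and the sample input are my own additions for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindVersusMatches {
    // Same sentence pattern, with an added capture group and an escaped final dot.
    static final Pattern SENTENCE = Pattern.compile(
            "New Component <b>(\\w*)</b> is successfully registered\\.");

    // Hypothetical helper: the component name, or null if the sentence is absent.
    static String componentName(String text) {
        Matcher m = SENTENCE.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String larger = "log start\nNew Component <b>TEST</b> is successfully registered.\nlog end";
        // matches() must cover the whole input, so this prints false.
        System.out.println(SENTENCE.matcher(larger).matches());
        // find() searches for a matching substring, so this prints TEST.
        System.out.println(componentName(larger));
    }
}
```

Unlike the .* wrapping approach, find() is not affected by line breaks elsewhere in the input, because the pattern itself never has to consume them.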

Related

Cannot match string with regex pattern when the string spans multiple lines

I have a string like the following:
SYBASE_OCS=OCS-12_5
SYBASE=/opt/sybase/oc12.5.1-EBF12850
//there is a newline here as well
The string at the debugger appears like this:
I am trying to match the part coming after SYBASE=, meaning I'm trying to match /opt/sybase/oc12.5.1-EBF12850.
To do that, I've written the following code:
String key = "SYBASE";
Pattern extractorPattern = Pattern.compile("^" + key + "=(.+)$");
Matcher matcher = extractorPattern.matcher(variableDefinition);
if (matcher.find()) {
    return matcher.group(1);
}
The problem I'm having is that this two-line string is not matched by my regex, even though the same regex seems to work fine on regex101.
State of my tests:
If I don't have multiple lines (e.g. if I only had SYBASE=... followed by the new line), it would match
If I evaluate the expression extractorPattern.matcher("SYBASE_OCS=OCS-12_5\\nSYBASE=/opt/sybase/oc12.5.1-EBF12850\\n") (note the double backslash in front of the new line), it would match.
I have tried applying variableDefinition.replace("\n", "\\n") to the string I pass to matcher(), but it still doesn't match.
It seems like something simple, but I can't figure it out. Can anyone please help?
Note: the string in that format is returned by a shell command, I can't really change the way it gets returned.
By default, the anchors ^ and $ anchor the match to the start and end of the entire input.
In your case you would like to match the start and end of a line within the input string. To do this you'll need to change the behavior of these anchors. This can be done by using the multi line flag.
Either by specifying it as an argument to Pattern.compile:
Pattern.compile("regex", Pattern.MULTILINE)
Or by using the embedded flag expression (?m):
Pattern.compile("(?m)^" + key + "=(.+)$");
The reason it seemed to work on regex101.com is that they enable both the global and the multi-line flag by default.
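A short sketch of the multi-line fix applied to the input from the question. The helper name valueOf is hypothetical, and wrapping the key in Pattern.quote is an extra precaution I added in case a key ever contains regex metacharacters:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineAnchors {
    // Hypothetical helper: the value assigned to key, or null if the key is absent.
    static String valueOf(String key, String definitions) {
        // MULTILINE makes ^ and $ match at line boundaries, not only at the input boundaries.
        Pattern p = Pattern.compile("^" + Pattern.quote(key) + "=(.+)$", Pattern.MULTILINE);
        Matcher m = p.matcher(definitions);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String shellOutput = "SYBASE_OCS=OCS-12_5\nSYBASE=/opt/sybase/oc12.5.1-EBF12850\n";
        // Without the flag, ^ only matches at offset 0, where the text is SYBASE_OCS, so this prints false.
        System.out.println(Pattern.compile("^SYBASE=(.+)$").matcher(shellOutput).find());
        // With the flag, the second line is found and its value is printed.
        System.out.println(valueOf("SYBASE", shellOutput));
    }
}
```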

Search and Extract a string with a specific keyword from a string

I am processing a tsv file. I have a bunch of urls in one entry and I am looking for a specific url with '.ab.' keyword in it.
This is my data: http://this/is/anexample.jpg,http://this/is/anexample.jpg,http://this/is/anexample.jpg,http://this/is/anexample.jpg,http://this/is/anexamplewith.AB.jpg
and I want output to be http://this/is/anexamplewith.ab.jpg
This is what I am using: '^http://.*[.AB.jpg]' but it's giving me entire string.
What RegEx can I use?
Thank you!
Note that ^http://.*[.AB.jpg] matches http:// at the beginning of the string and .* matches every character other than newline to the end (of line) looking for the last occurrence of the following characters - ., A, B, ., j, p, g. At the end you have g - thus the whole string is matched.
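The behaviour described above can be checked with a small sketch. The class name, the helper firstMatch and the shortened sample data are assumptions for illustration only:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharacterClassDemo {
    // The pattern from the question: the square brackets form a character class.
    static final Pattern ORIGINAL = Pattern.compile("^http://.*[.AB.jpg]");

    // Hypothetical helper: the first match of the original pattern, or null.
    static String firstMatch(String input) {
        Matcher m = ORIGINAL.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // A character class matches exactly one character from the set.
        System.out.println("g".matches("[.AB.jpg]"));        // true
        System.out.println(".AB.jpg".matches("[.AB.jpg]"));  // false
        // The greedy .* runs to the end, and the final g satisfies the class.
        String data = "http://a/x.jpg,http://a/y.AB.jpg";
        System.out.println(data.equals(firstMatch(data)));   // true
    }
}
```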
You can use
http:\/\/(?:(?!http:\/\/).)*\.ab\.(?:(?!http:\/\/).)*(?=$|http)
Regex matches:
http:\/\/ - matches http://
(?:(?!http:\/\/).)* - matches any symbol that is not starting the substring http:// (thus ensuring the shortest window between the first http:// and .ab.)
\.ab\. - literal .ab.
(?:(?!http:\/\/).)* - see above
(?=$|http) - a lookahead that tells the engine to stop in front of end of string ($) or http://.
A Java implementation:
String str = "http://this/is/anexample.jpg,http://this/is/anexample.jpg,http://this/is/anexample.jpg,http://this/is/anexample.jpg,http://this/is/anexamplewith.AB.jpg";
Pattern ptrn = Pattern.compile("(?i)http://(?:(?!http://).)*\\.ab\\.(?:(?!http://).)*(?=$|http)");
Matcher matcher = ptrn.matcher(str);
while (matcher.find()) {
    System.out.println(matcher.group(0));
}
Output of the sample demo program:
http://this/is/anexamplewith.AB.jpg
REPLACEMENT
To replace that match, you just need to use a replaceAll:
str = str.replaceAll("(?i)http://(?:(?!http://).)*\\.ab\\.(?:(?!http://).)*(?=$|http)", "");
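As a rough sketch of what that replacement leaves behind, assuming shortened sample entries (the class name and the helper removeAbUrls are hypothetical):

```java
class RemoveMatchingUrl {
    // The same regex as in the replaceAll call above.
    static final String AB_URL =
            "(?i)http://(?:(?!http://).)*\\.ab\\.(?:(?!http://).)*(?=$|http)";

    // Hypothetical helper: the entry with every URL containing .ab. removed.
    static String removeAbUrls(String entry) {
        return entry.replaceAll(AB_URL, "");
    }

    public static void main(String[] args) {
        // The matching URL is last: the comma before it is not part of the match and stays.
        System.out.println(removeAbUrls("http://this/is/anexample.jpg,http://this/is/anexamplewith.AB.jpg"));
        // The matching URL is first: the match runs up to the next http, so its comma goes too.
        System.out.println(removeAbUrls("http://a/x.ab.jpg,http://a/y.jpg"));
    }
}
```

Because the separator handling depends on where the URL sits, a trailing comma may need trimming afterwards.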

Regex matches in Ruby, but not in Java?

Just in an attempt to get more experience with regex (while also making life easier at work) I was trying to parse some filenames in Java.
My string is this: /home/user/example/Results/ExampleFilePrefix_20140324-0500_OptionalTextThatMightContainNumbers123.csv
basically the filename will always start with ExampleFilePrefix_ followed by the timestamp, and sometimes ends with OptionalTextThatMightContainNumbers123 just depending on how the file was generated. The relevant information I want is the timestamp followed by the optional text if it exists.
I was messing around with various regular expressions and while I can get them all to work with a Ruby regex parser I can't get any of them to work in Java. I didn't keep track of them as I went, but this is my most recent attempt:
_(\w+-\w+)
Which works as expected in Ruby: http://rubular.com/r/K2BiboURRo, but doesn't even come close to matching in Java: http://fiddle.re/c7m04
I don't think the problem is in the code I've written, since the online parser doesn't match either, but I'll paste it here to be sure.
private String extractFileName(String filename) {
    String resultNameBase = "RegexDidntMatch";
    Pattern pattern = Pattern.compile("_(\\w+-\\w+)", Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(filename);
    if (matcher.matches() && matcher.find()) {
        resultNameBase = matcher.group(1);
    }
    return resultNameBase;
}
As always, thanks to all in advance
First of all, call only matcher.find(), and then take group 0 instead of group 1.
if (matcher.find()) {
    resultNameBase = matcher.group();
}
This part is the problem:
if (matcher.matches() && matcher.find())
Matcher#matches() attempts to match the complete input string against your regex.
Replace that with:
if (matcher.find())
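A small sketch of the find()-only version, run against the path from the question. The class name and the helper extract are illustrative. It prints both group() and group(1) so the difference is visible: group() is the whole match including the leading underscore, while group(1) is only the parenthesised part, so which one to use depends on the output you want.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TimestampExtractor {
    static final Pattern NAME_PART = Pattern.compile("_(\\w+-\\w+)");

    // Hypothetical helper: the captured part, or a fallback marker if nothing matches.
    static String extract(String filename) {
        Matcher m = NAME_PART.matcher(filename);
        return m.find() ? m.group(1) : "RegexDidntMatch";
    }

    public static void main(String[] args) {
        String path = "/home/user/example/Results/"
                + "ExampleFilePrefix_20140324-0500_OptionalTextThatMightContainNumbers123.csv";
        Matcher m = NAME_PART.matcher(path);
        System.out.println(m.matches()); // false: the pattern does not cover the whole path
        m.reset();
        if (m.find()) {
            System.out.println(m.group());  // whole match, with the leading underscore
            System.out.println(m.group(1)); // capture group only
        }
    }
}
```

Note that \w also matches underscores, which is why the optional text after the timestamp ends up inside the same group.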

Java Regex - Extract link from HTML anchor

I have the following code
private String anchorRegex = "\\<\\s*?a\\s+.*?href\\s*?=\\s*?([^\\s]*?).*?\\>";
private Pattern anchorPattern = Pattern.compile(anchorRegex, Pattern.CASE_INSENSITIVE);
String content = getContentAsString();
Matcher matcher = anchorPattern.matcher(content);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}
The call to getContentAsString() returns the HTML content from a web page. The problem I'm having is that the only thing that gets printed in my System.out is a space. Can anyone see what's wrong with my regex?
Regex drives me crazy sometimes.
You need to delimit your capturing group from the following .*?. There are probably double quotes " around the href value, so use those:
<\s*a\s+.*?href\s*=\s*"(\S*?)".*?>
Your regex contains:
([^\s]*?).*?
The ([^\s]*?) says to reluctantly find all non-whitespace characters and save them in a group. But the reluctant *? depends on the next part, which is .; any character. So the matching of the href aborts at the first possible chance and it is the .*? which matches the rest of the URL.
The regex you should be using is this:
String anchorRegex = "(?s)<\\s*a\\s+.*?href\\s*=\\s*['\"]([^\\s>]*)['\"]";
This should be able to pull out the href without too much trouble.
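To see both the problem and the fix in action, here is a sketch that runs the original pattern and the suggested one over a made-up HTML snippet. The class name, the helper hrefs and the sample markup are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractor {
    // The pattern from the question: the reluctant group ends up capturing nothing.
    static final Pattern ORIGINAL = Pattern.compile(
            "\\<\\s*?a\\s+.*?href\\s*?=\\s*?([^\\s]*?).*?\\>", Pattern.CASE_INSENSITIVE);

    // The suggested pattern: the quotes delimit the captured value.
    static final Pattern FIXED = Pattern.compile(
            "(?s)<\\s*a\\s+.*?href\\s*=\\s*['\"]([^\\s>]*)['\"]", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: every group(1) the pattern finds in the HTML.
    static List<String> hrefs(Pattern pattern, String html) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.com/a\">A</a> and <A class='x' HREF='/b'>B</A></p>";
        System.out.println(hrefs(ORIGINAL, html)); // two empty strings
        System.out.println(hrefs(FIXED, html));    // the two href values
    }
}
```

This remains a regex over HTML, so it will still miss unquoted attribute values and other markup variations.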
The link is in capture group 2. The regex below is shown expanded and assumes dot-all mode.
Use Java delimiters as necessary.
(?s)
<a
(?=\s)
(?:[^>"']|"[^"]*"|'[^']*')*? (?<=\s) href \s*=\s* (['"]) (.*?) \1
(?:".*?"|'.*?'|[^>]*?)+
>
or not expanded, not dot-all.
<a(?=\s)(?:[^>"']|"[^"]*"|'[^']*')*?(?<=\s)href\s*=\s*(['"])([\s\S]*?)\1(?:"[\s\S]*?"|'[\s\S]*?'|[^>]*?)+>

Java Regex Matcher Question

How do I match a URL string like this:
img src = "https://stackoverflow.com/a/b/c/d/someimage.jpg"
where only the domain name and the file extension (jpg) are fixed while the rest is variable?
The following code does not seem to work:
Pattern p = Pattern.compile("<img src=\"http://stachoverflow.com/.*jpg");
// Create a matcher with an input string
Matcher m = p.matcher(url);
while (m.find()) {
    String s = m.toString();
}
There were a couple of issues with the regex matching the sample string you gave. You were close, though. Here's your code fixed to make it work:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TCPChat {
    static public void main(String[] args) {
        String url = "<img src=\"http://stackoverflow.com/a/b/c/d/someimage.jpg\">";
        Pattern p = Pattern.compile("<img src=\"http://stackoverflow.com/.*jpg\">");
        // Create a matcher with an input string
        Matcher m = p.matcher(url);
        while (m.find()) {
            String s = m.toString();
            System.out.println(s);
        }
    }
}
First, I would use the group() method to retrieve the matched text, not toString(). But it's probably just the URL part you want, so I would use parentheses to capture that part and call group(1) to retrieve it.
Second, I wouldn't assume src was the first attribute in the <img> tag. On SO, for example, it's usually preceded by a class attribute. You want to add something to match intervening attributes, but make sure it can't match beyond the end of the tag. [^<>]+ will probably suffice.
Third, I would use something more restrictive than .* to match the unknown part of the path. There's always a chance that you'll find two URLs on one line, like this:
<img src="http://so.com/foo.jpg"> blah <img src="http://so.com/bar.jpg">
In that case, the .* in your regex would bridge the gap, giving you one match where you wanted two. Again, [^<>]* will probably be restrictive enough.
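The bridging effect can be checked by counting matches on that two-image line. This is only a sketch: the class name and the helper countMatches are hypothetical, and the patterns are simplified to the fixed src-first form used in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyBridgeDemo {
    // Hypothetical helper: how many times the regex is found in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String line = "<img src=\"http://so.com/foo.jpg\"> blah <img src=\"http://so.com/bar.jpg\">";
        // Greedy .* runs from the first URL to the last jpg on the line: one match.
        System.out.println(countMatches("<img src=\"http://so\\.com/.*jpg\">", line));
        // [^<>]* cannot cross the end of a tag: two matches.
        System.out.println(countMatches("<img src=\"http://so\\.com/[^<>]*jpg\">", line));
    }
}
```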
There are several other potential problems as well. Are attribute values always enclosed in double-quotes, or could they be single-quoted, or not quoted at all? Will there be whitespace around the =? Are element and attribute names always lowercase?
...and I could go on. As has been pointed out many, many times here on SO, regexes are not really the right tool for working with HTML. They can usually handle simple tasks like this one, but it's essential that you understand their limitations.
Here's my revised version of your regex (as a Java string literal):
"(?i)<img[^<>]+src\\s*=\\s*[\"']?(http://stackoverflow\\.com/[^<>]+\\.jpg)"
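A minimal usage sketch for the revised regex, collecting group(1) from each match. The class name, the helper jpgUrls and the sample HTML are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcExtractor {
    // The revised regex from the answer, compiled once.
    static final Pattern IMG = Pattern.compile(
            "(?i)<img[^<>]+src\\s*=\\s*[\"']?(http://stackoverflow\\.com/[^<>]+\\.jpg)");

    // Hypothetical helper: the captured URL of every matching <img> tag.
    static List<String> jpgUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG.matcher(html);
        while (m.find()) {
            urls.add(m.group(1)); // group(1) is the URL, without the tag or quotes
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<img class=\"x\" src=\"http://stackoverflow.com/a/b/c/d/someimage.jpg\"> text "
                + "<IMG SRC = 'http://stackoverflow.com/e.jpg'>";
        for (String url : jpgUrls(html)) {
            System.out.println(url);
        }
    }
}
```

The sample deliberately mixes a preceding class attribute, upper-case names, single quotes and whitespace around the equals sign, which are the variations the answer lists.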
