Java parsing with Matcher and Regex - java

I have a very simple problem but I am new to Java Matcher and I am having a hard time figuring out how to use it for my specific problem.
I have a string which is something like this <not needed content>src="url"<not needed content>src="url2"<not needed content>
Where <'not needed content'> are the things I want to ignore in my string. I basically want to extract the URLs from the string.
My code currently looks like this
Pattern MY_PATTERN = Pattern.compile("\\src=\"(.*?)\\\"");
Matcher m = MY_PATTERN.matcher(content);
String s = "something";
while (m.find()) {
s = m.group(1);
}
I apologize for such a basic, and possibly duplicate, question.
Thank you.

Why didn't you try a simpler pattern? Like this one:
Pattern.compile("src=\"(.*?)\"");
(Not tested, but should be better)

You can use either of the following regexes:
src="([^"]+)
src="(.+?)"
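For reference, here is a minimal sketch of how the first of these patterns could be used to collect every URL instead of only the last one (the loop in the question overwrites s on each match). Note also that in the question's pattern the leading \\s is read by the regex engine as the whitespace class followed by rc=, so it does not match a literal src=. The class name SrcExtractor, the method name extractSrcUrls and the sample input are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SrcExtractor {
    // Matches src="..." and captures the attribute value (anything up to the next double quote).
    private static final Pattern SRC = Pattern.compile("src=\"([^\"]+)\"");

    // Hypothetical helper: returns every src value found in the input, in order.
    static List<String> extractSrcUrls(String content) {
        List<String> urls = new ArrayList<>();
        Matcher m = SRC.matcher(content);
        while (m.find()) {
            urls.add(m.group(1)); // group(1) is the text between the quotes
        }
        return urls;
    }

    public static void main(String[] args) {
        // Sample input invented for this sketch.
        String content = "<img src=\"url\" alt=\"a\"><script src=\"url2\"></script>";
        System.out.println(extractSrcUrls(content)); // prints [url, url2]
    }
}
```

Compiling the pattern once as a constant avoids recompiling it on every call.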

Related

Get link from url and get email by regex

I'm looking for good regex in java to get string url from all links and all emails. Now I have regex for links:
String linkRegex = "http[s]*://(\\w+\\.)*(\\w+)";
Pattern pattern = Pattern.compile(linkRegex);
Matcher matcher = pattern.matcher(stringAddres);
while (matcher.find()) {
String currentLink = matcher.group();
}
and I get links like http://twitter.com, but I also get incomplete ones like https://google. Is there any way to exclude links like https://google?
I also need a regex that extracts the email address from a string. For example, from this:
href="mailto:contact@example.com">contact@example.com</a></span>
I should get only contact@example.com
There are many answered questions with simple regex patterns that work for the most common addresses. Still, I would suggest this regex, based on the RFC 5322 standard:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Copied from this site.
I would just use look-behind to lock onto the interesting attributes in the text, and then just capture everything in the "...".
Like this
((?<=href="mailto:)|(?<=src="))[^"]+
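As a rough sketch of how this look-behind pattern might be used from Java: the look-behind groups consume no characters, so the whole match (group 0) is already the attribute value. The class name LookbehindExtractor, the method name extractValues and the sample HTML are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindExtractor {
    // The look-behinds only assert what precedes the match; [^"]+ then takes
    // everything up to the closing double quote.
    private static final Pattern ATTR =
            Pattern.compile("((?<=href=\"mailto:)|(?<=src=\"))[^\"]+");

    // Hypothetical helper: returns every mailto address and src value, in order.
    static List<String> extractValues(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            values.add(m.group()); // whole match; group(1) would be empty here
        }
        return values;
    }

    public static void main(String[] args) {
        // Sample input invented for this sketch.
        String html = "<a href=\"mailto:contact@example.com\">mail us</a>"
                + "<img src=\"http://example.com/a.png\">";
        System.out.println(extractValues(html));
        // prints [contact@example.com, http://example.com/a.png]
    }
}
```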

find the path param using regex in the url

What is the regular expression to extract the path parameter from a URL?
http://localhost:8080/domain/v1/809pA8
https://localhost:8080/domain/v1/809pA8
I want to retrieve the value (809pA8) from the URLs above using a regular expression, preferably in Java.
I would suggest you do something like
url.substring(url.lastIndexOf('/') + 1);
If you really prefer regexps, you could do
Matcher m = Pattern.compile("/([^/]+)$").matcher(url);
if (m.find())
value = m.group(1);
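Here is a small self-contained sketch that puts both approaches side by side. The class name LastPathSegment and the method names viaSubstring and viaRegex are made up for this example. One difference worth knowing: for a URL that ends with a slash, the substring version returns an empty string, while the regex version finds no match (the sketch returns null in that case).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastPathSegment {
    // A slash, then one or more non-slash characters, anchored at the end of the input.
    private static final Pattern LAST_SEGMENT = Pattern.compile("/([^/]+)$");

    // Plain string approach: everything after the last slash.
    static String viaSubstring(String url) {
        return url.substring(url.lastIndexOf('/') + 1);
    }

    // Regex approach: returns null when the URL has no trailing segment.
    static String viaRegex(String url) {
        Matcher m = LAST_SEGMENT.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String url = "http://localhost:8080/domain/v1/809pA8";
        System.out.println(viaSubstring(url)); // prints 809pA8
        System.out.println(viaRegex(url));     // prints 809pA8
    }
}
```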
I would try:
String url = "http://localhost:8080/domain/v1/809pA8";
String value = String.valueOf(url.subSequence(url.lastIndexOf('/'), url.length()-1));
No need for regex here, I think.
EDIT: I'm sorry I made a mistake:
String url = "http://localhost:8080/domain/v1/809pA8";
String value = String.valueOf(url.subSequence(url.lastIndexOf('/')+1, url.length()));
See this code working here: https://ideone.com/E30ddC
For your simple case, regex is overkill, as others noted. But if you have more cases and that is why you prefer regex, give Spring's AntPathMatcher#extractUriTemplateVariables a look, if you're using Spring. It's actually better equipped for extracting path variables than regex directly. Here are some good examples.

Regex matches in Ruby, but not in Java?

Just in an attempt to get more experience with regex (while also making life easier at work) I was trying to parse some filenames in Java.
My string is this: /home/user/example/Results/ExampleFilePrefix_20140324-0500_OptionalTextThatMightContainNumbers123.csv
basically the filename will always start with ExampleFilePrefix_ followed by the timestamp, and sometimes ends with OptionalTextThatMightContainNumbers123 just depending on how the file was generated. The relevant information I want is the timestamp followed by the optional text if it exists.
I was messing around with various regular expressions and while I can get them all to work with a Ruby regex parser I can't get any of them to work in Java. I didn't keep track of them as I went, but this is my most recent attempt:
_(\w+-\w+)
Which works as expected in Ruby: http://rubular.com/r/K2BiboURRo, but doesn't even come close to matching in Java: http://fiddle.re/c7m04
I don't think the problem is in the code I've written, since the online parser doesn't match either, but I'll paste it here to be sure.
private String extractFileName(String filename) {
String resultNameBase = "RegexDidntMatch";
Pattern pattern = Pattern.compile("_(\\w+-\\w+)", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(filename);
if (matcher.matches() && matcher.find()) {
resultNameBase = matcher.group(1);
}
return resultNameBase;
}
As always, thanks to all in advance
First of all, it should be only matcher.find(). Then take group 0 instead of group 1.
if (matcher.find()) {
resultNameBase = matcher.group();
}
This part is the problem:
if (matcher.matches() && matcher.find())
Matcher#matches() tries to match the complete input string against your regex, so it fails when the regex only covers part of the string.
Replace that with:
if (matcher.find())
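To make the difference concrete, here is a minimal sketch using the pattern and filename from the question. The class name FindVsMatches and the method name extract are made up for illustration. Since \w also matches underscores, group(1) picks up the timestamp together with the optional text that follows it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindVsMatches {
    private static final Pattern TIMESTAMP = Pattern.compile("_(\\w+-\\w+)");

    // Uses find(), which searches for the pattern anywhere in the input.
    static String extract(String filename) {
        Matcher matcher = TIMESTAMP.matcher(filename);
        return matcher.find() ? matcher.group(1) : "RegexDidntMatch";
    }

    public static void main(String[] args) {
        String filename = "/home/user/example/Results/"
                + "ExampleFilePrefix_20140324-0500_OptionalTextThatMightContainNumbers123.csv";

        // matches() requires the whole input to fit the pattern, so this is false.
        System.out.println(TIMESTAMP.matcher(filename).matches()); // prints false

        // find() locates the first occurrence inside the input.
        System.out.println(extract(filename));
        // prints 20140324-0500_OptionalTextThatMightContainNumbers123
    }
}
```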

Use RegEx in Java to extract parameters in between parentheses

I'm writing a utility to extract the names of header files from JSPs. I have no problem reading the JSPs line by line and finding the lines I need. I am having a problem extracting the specific text needed using regex. After looking at many similar questions I'm hitting a brick wall.
An example of the String I'll be matching from within is:
<jsp:include page="<%=Pages.getString(\"MY_HEADER\")%>" flush="true"></jsp:include>
All I need is MY_HEADER for this example. Any time I have this tag:
<%=Pages.getString
I need what comes between this:
<%=Pages.getString(\" and this: )%>
Here is what I have currently (which is not working, I might add) :
String currentLine;
while ((currentLine = fileReader.readLine()) != null)
{
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\]*)");
Matcher matcher = pattern.matcher(currentLine);
while(matcher.find()) {
System.out.println(matcher.group(1).toString());
}}
I need to be able to use the Java RegEx API and regex to extract those header names.
Any help on this issue is greatly appreciated. Thanks!
EDIT:
Resolved this issue, thankfully. The tricky part was, after being given the right regex, it had to be taken into account that the String I was feeding to the regex was always going to have two " / " characters ( (/"MY_HEADER"/) ) that needed to be escaped in the pattern.
Here is what worked (thanks to the help ;-)):
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\\"]*)");
This should do the trick:
<%=Pages\\.getString\\(\\\\\"([^\\\\]*)
Yeah that's a scary number of back slashes. matcher.group(1) should return MY_HEADER. It starts at the \" and matches everything until the next \ (which I assume here will be at \")%>.)
Of course, if your target text contains a backslash (\), this will not work. But you didn't give an indication that you'd ever be looking for something like <%=Pages.getString(\"Fun!\Yay!\")%> -- where this regex would only return Fun! and ignore the rest.
EDIT
The reason your test case was failing is because you were using this test string:
String currentLine = "<%=Pages.getString(\"MY_HEADER\")%>";
This is the equivalent of reading it in from a file and seeing:
<%=Pages.getString("MY_HEADER")%>
Note the lack of any \. You need to use this instead:
String sCurrentLine = "<%=Pages.getString(\\\"MY_HEADER\\\")%>";
Which is the equivalent of what you want.
This is test code that works:
String currentLine = "<%=Pages.getString(\\\"MY_HEADER\\\")%>";
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\]*)");
Matcher matcher = pattern.matcher(currentLine);
while(matcher.find()) {
System.out.println(matcher.group(1).toString());
}

Quickie regular expression stuck

I have a line of stringy goodness:
"B8&soundFile=http%3A%2F%2Fwww.example.com%2Faeero%2Fj34d1.mp3%2Chttp%3A%2F%2Fwww.example.com%2Faudfgo%2set4.mp3"
Can I use regular expressions to just extract the http up to mp3 for all times it exists?
I have tried reading the documents for regular expressions but none mention how to go FROM http to mp3. Can anyone help?
It would be better to use index-based String operations directly.
String data = "B8&soundFile=http%3A%2F%2Fwww.example.com%2Faeero%2Fj34d1.mp3%2Chttp%3A%2F%2Fwww.example.com%2Faudfgo%2set4.mp3";
System.out.println(data.substring(data.indexOf("http"), data.indexOf(".mp3")));
Output :
http%3A%2F%2Fwww.example.com%2Faeero%2Fj34d1
B8&soundFile=http%3A%2F%2Fwww.example.com%2Faeero%2Fj34d1.mp3%2Chttp%3A%2F%2Fwww.example.com%2Faudfgo%2set4.mp3
I probably wouldn't do this with a regex. URL decode it, break it up by tokens, and parse it using Java's URL class.
Try http.+?mp3
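A sketch of how that reluctant pattern could be applied to pull out every occurrence, not just the first. It uses http.+?\.mp3, a slightly tightened variant with the dot escaped, which is my own adjustment and not part of the suggestion above. The class name Mp3UrlFinder and the method name findAll are hypothetical. The results are still percent-encoded; this sketch does not decode them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Mp3UrlFinder {
    // .+? is reluctant: it stops at the first ".mp3" after each "http".
    private static final Pattern MP3_URL = Pattern.compile("http.+?\\.mp3");

    // Hypothetical helper: returns every http....mp3 run found in the input.
    static List<String> findAll(String input) {
        List<String> urls = new ArrayList<>();
        Matcher m = MP3_URL.matcher(input);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String input = "B8&soundFile=http%3A%2F%2Fwww.example.com%2Faeero%2Fj34d1.mp3"
                + "%2Chttp%3A%2F%2Fwww.example.com%2Faudfgo%2set4.mp3";
        for (String url : findAll(input)) {
            System.out.println(url);
        }
        // prints:
        // http%3A%2F%2Fwww.example.com%2Faeero%2Fj34d1.mp3
        // http%3A%2F%2Fwww.example.com%2Faudfgo%2set4.mp3
    }
}
```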
the following should do it (assuming you want the http and mp3 as part of your match):
.*(http.*mp3)
if you just want the bits between then:
.*http(.*)mp3
for example:
String input = "B8&soundFile=http%3A%2F%2Fwww.example.com%2Faeero%2Fj34d1.mp3%2Chttp%3A%2F%2Fwww.example.com%2Faudfgo%2set4.mp3";
Pattern p = Pattern.compile(".*(http.*mp3)");
Matcher m = p.matcher(input);
if (m.find()) {
System.out.println(m.group(1));
}
gives us
http%3A%2F%2Fwww.example.com%2Faudfgo%2set4.mp3
