Regex Lookahead & Lookbehind with Java

I am trying to parse data out of an HTML page using a Java regex, but have not had much luck. The data is dynamic and often includes zero or more spaces, tabs, and newlines. Also, depending on the number of hits, the structure of the string I'm parsing may change. Here is a sample in the cleanest format:
<div class="center">Showing 25 of 2,343,098 (search took 1.245 seconds)</div>
However it can also look like this:
<div class="center">Showing 2343098 (search took 1.245 seconds)</div>
or
<div class="center">
Showing 125
of 2,343,098
(search took 1.245 seconds)</div>
What I'm trying to parse is the 2,343,098, but since the page is HTML I have to use either "Showing" or "(search took" as anchors to search between. The spaces, tabs, and newlines are tripping me up. I've been trying to use lookahead and lookbehind, but so far no luck. Here are a few patterns I've tried:
String pattern1 = "Showing [0-9]*\\S"; // not useful
String pattern2 = "[[\\d,+\\.?\\d+]*[\\s*\\n]\\(search took"; //fails
String pattern3 = "(/i)(Showing)(.+?)(\\(search took)"; //fails
String pattern4 = "([\\s\\S]*)\\(search took"; //fails
String pattern5 = "(?s)[\\d].*?(?=\\(search took)"; //close...but fails
Pattern pattern = Pattern.compile(pattern5);
Matcher matcher = pattern.matcher(text); // text = the string I'm parsing
while(matcher.find()) {
System.out.println(matcher.group(0));
}

HTML is not a regular language and cannot be accurately parsed using regular expressions. Regex-based solutions are likely to break when the format of the markup changes in the future, whereas a parser-based solution will be more accurate.
However, if this is a one-off job, you can get away with the following regex:
Showing\s+(?:\d+\s+of\s+)?([\d,.]+)\s+\(search
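As a minimal sketch of how that regex could be applied in Java, the following runs it against the three sample strings from the question. The class and method names (HitCountExtractor, extractTotal) are made up for illustration, and the backslashes are doubled because the pattern is written as a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HitCountExtractor {
    // The answer's regex as a Java string literal; group 1 is the total hit count.
    private static final Pattern TOTAL =
            Pattern.compile("Showing\\s+(?:\\d+\\s+of\\s+)?([\\d,.]+)\\s+\\(search");

    // Returns the captured total as text, or null when the pattern is not found.
    static String extractTotal(String html) {
        Matcher m = TOTAL.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "<div class=\"center\">Showing 25 of 2,343,098 (search took 1.245 seconds)</div>",
            "<div class=\"center\">Showing 2343098 (search took 1.245 seconds)</div>",
            "<div class=\"center\">\nShowing 125\nof 2,343,098\n(search took 1.245 seconds)</div>"
        };
        for (String sample : samples) {
            System.out.println(extractTotal(sample));
        }
    }
}
```

Because \s already matches spaces, tabs, and line breaks, no DOTALL flag or lookaround is needed here; the capturing group does the work that the lookahead attempts were aiming for.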

The examples suggest
"Showing\\s+\\d+\\s+(of\\s+[\\d,.]+\\s+)?\\(search"

Related

Get link from url and get email by regex

I'm looking for a good regex in Java to extract the URL strings from all links, and also all email addresses. Right now I have this regex for links:
String linkRegex = "http[s]*://(\\w+\\.)*(\\w+)";
Pattern pattern = Pattern.compile(linkRegex);
Matcher matcher = pattern.matcher(stringAddres);
while (matcher.find()) {
String currentLink = matcher.group();
}
and I get links like http://twitter.com, but I also get https://google. Is there any way to exclude links like https://google?
I also need a regex that extracts the email address from a string. For example, from this:
href="mailto:contact@example.com">contact@example.com</a></span>
I should get only contact@example.com
There are many answered questions with simple regex patterns that work for the most common email addresses. Still, I would suggest this regex based on the RFC 5322 standard:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Copied from this site.
I would just use look-behind to lock onto the interesting attributes in the text, and then just capture everything in the "...".
Like this
((?<=href="mailto:)|(?<=src="))[^"]+
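A minimal sketch of that lookbehind pattern used from Java follows. The class name (AttributeValueExtractor), the method name, and the sample input with a made-up address are assumptions for illustration. The outer group is written as non-capturing, which does not change what is matched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeValueExtractor {
    // Lookbehind anchors the match right after href="mailto: or src=",
    // then [^"]+ consumes everything up to the closing quote.
    private static final Pattern VALUE =
            Pattern.compile("(?:(?<=href=\"mailto:)|(?<=src=\"))[^\"]+");

    // Collects every attribute value the pattern finds in the given markup.
    static List<String> extract(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = VALUE.matcher(html);
        while (m.find()) {
            values.add(m.group());
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<a href=\"mailto:contact@example.com\">contact@example.com</a></span>";
        System.out.println(extract(html));
    }
}
```

Since the lookbehind is zero-width, m.group() contains only the attribute value, so no capturing group or substring arithmetic is needed.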

Replace &amp; only in links in a partial HTML document

I've tried a few methods (jsoup shown below) to turn &amp; into & only in links. The difficulty I'm encountering suggests I'm going about this all wrong. I suspect I'll be facepalming when solutions are offered, but maybe good old regex is the best answer (as I only need to do the replacing in hrefs) unless the reader code is modified?
The parsing libraries (also tried NekoHTML) want to convert all &amp;s to &, so I'm having trouble using them to even get the true link hrefs with which to use String's replace method.
Input:
String toParse = "The Link with an encoded ampersand (&) is challenging."
Desired output:
The Link with an encoded ampersand (&) is challenging.
I'm encountering this while trying to read an RSS feed that is rendering <link>s with &amp; instead of &.
Update
I ended up using regex to identify the links, then using replace to insert a decoded link in place of the one with &amp;s. Pattern.quote() turned out to be very handy, but I had to manually close and re-open the quoted portions so I could insert a regex alternation for my ampersand condition:
final String cleanLink = StringUtils.strip(link).replaceAll(" ", "%20").replaceAll("'", "%27");
String regex = Pattern.quote(link);
// end and re-start literal matching around my or condition
regex = regex.replaceAll("&", "\\\\E(&amp;|&)\\\\Q");
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(result);
while (matcher.find()) {
int index = result.indexOf(matcher.group());
while (index != -1) {
// this replaces the links containing &amp; with the same links containing &
// because cleanLink is from the DOM and has been properly decoded
result.replace(index, index + matcher.group().length(), cleanLink);
index += cleanLink.length();
index = result.indexOf(matcher.group(), index);
linkReplaced = true;
}
}
I'm not thrilled with this approach, but I had to handle too many conditions myself without using a DOM tool to identify links.
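The quoting trick described in the update can be sketched in isolation like this. It is a simplified illustration, not the asker's full code: the class name, method names, and the example URL are hypothetical, and it assumes the raw text may contain each ampersand of the link either as "&amp;" or as a bare "&".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AmpersandLinkFixer {
    // Pattern.quote wraps the link in \Q...\E so it is matched literally.
    // At each '&' the literal section is closed, an alternation is inserted,
    // and the literal section is reopened.
    static Pattern rawLinkPattern(String decodedLink) {
        String quoted = Pattern.quote(decodedLink);
        String regex = quoted.replace("&", "\\E(?:&amp;|&)\\Q");
        return Pattern.compile(regex);
    }

    // Replaces every raw occurrence of the link with its decoded form.
    static String fix(String raw, String decodedLink) {
        return rawLinkPattern(decodedLink)
                .matcher(raw)
                .replaceAll(Matcher.quoteReplacement(decodedLink));
    }

    public static void main(String[] args) {
        String decoded = "http://example.com/?a=1&b=2"; // hypothetical link
        String raw = "<a href=\"http://example.com/?a=1&amp;b=2\">Link</a> (&amp;)";
        System.out.println(fix(raw, decoded));
    }
}
```

String.replace is used instead of replaceAll when building the regex, so the backslashes in \E and \Q need only one level of escaping. Matcher.quoteReplacement keeps any '$' or backslash in the link from being interpreted in the replacement.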
Have a look at StringEscapeUtils from Apache Commons Lang. Try unescapeHtml() on your String.

Regex matches in Ruby, but not in Java?

Just in an attempt to get more experience with regex (while also making life easier at work) I was trying to parse some filenames in Java.
My string is this: /home/user/example/Results/ExampleFilePrefix_20140324-0500_OptionalTextThatMightContainNumbers123.csv
Basically, the filename will always start with ExampleFilePrefix_ followed by the timestamp, and sometimes ends with OptionalTextThatMightContainNumbers123, depending on how the file was generated. The relevant information I want is the timestamp followed by the optional text if it exists.
I was messing around with various regular expressions and while I can get them all to work with a Ruby regex parser I can't get any of them to work in Java. I didn't keep track of them as I went, but this is my most recent attempt:
_(\w+-\w+)
Which works as expected in Ruby: http://rubular.com/r/K2BiboURRo, but doesn't even come close to matching in Java: http://fiddle.re/c7m04
I don't think it's a problem with the code I've written, given that the online parser doesn't match either, but I'll paste it here to be sure.
private String extractFileName(String filename) {
String resultNameBase = "RegexDidntMatch";
Pattern pattern = Pattern.compile("_(\\w+-\\w+)", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(filename);
if (matcher.matches() && matcher.find()) {
resultNameBase = matcher.group(1);
}
return resultNameBase;
}
As always, thanks to all in advance
First of all, call only matcher.find(). Then take group 0 instead of group 1.
if (matcher.find()) {
resultNameBase = matcher.group();
}
This part is the problem:
if (matcher.matches() && matcher.find())
Matcher#matches() tries to match the complete input string against your regex, so it returns false here and matcher.find() is never reached.
Replace that with:
if (matcher.find())
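To make the difference concrete, here is a small sketch that runs the question's pattern against the question's sample path. The class name and the extract method are made up for illustration; the pattern and the fallback string come from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileNameDemo {
    static final Pattern PATTERN = Pattern.compile("_(\\w+-\\w+)");

    // Uses find() only, so the pattern may match anywhere inside the path.
    static String extract(String filename) {
        Matcher matcher = PATTERN.matcher(filename);
        return matcher.find() ? matcher.group(1) : "RegexDidntMatch";
    }

    public static void main(String[] args) {
        String path = "/home/user/example/Results/"
                + "ExampleFilePrefix_20140324-0500_OptionalTextThatMightContainNumbers123.csv";

        // matches() demands that the whole path fit the pattern, which it cannot.
        System.out.println(PATTERN.matcher(path).matches());

        Matcher m = PATTERN.matcher(path);
        if (m.find()) {
            System.out.println(m.group());  // whole match, including the leading underscore
            System.out.println(m.group(1)); // only the parenthesized part
        }
    }
}
```

Note that \w also matches the underscore, which is why the optional text after the timestamp ends up inside the same match.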

Use RegEx in Java to extract parameters in between parentheses

I'm writing a utility to extract the names of header files from JSPs. I have no problem reading the JSPs line by line and finding the lines I need. I am having a problem extracting the specific text needed using regex. After looking at many similar questions I'm hitting a brick wall.
An example of the String I'll be matching from within is:
<jsp:include page="<%=Pages.getString(\"MY_HEADER\")%>" flush="true"></jsp:include>
All I need is MY_HEADER for this example. Any time I have this tag:
<%=Pages.getString
I need what comes between this:
<%=Pages.getString(\" and this: )%>
Here is what I have currently (which is not working, I might add) :
String currentLine;
while ((currentLine = fileReader.readLine()) != null)
{
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\]*)");
Matcher matcher = pattern.matcher(currentLine);
while(matcher.find()) {
System.out.println(matcher.group(1).toString());
}}
I need to be able to use the Java RegEx API and regex to extract those header names.
Any help on this issue is greatly appreciated. Thanks!
EDIT:
Resolved this issue, thankfully. The tricky part was that, even with the right regex, I had to take into account that the String I was feeding to the regex would always contain two "\" characters ( (\"MY_HEADER\") ) that needed to be escaped in the pattern.
Here is what worked (thanks to the help ;-)):
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\\"]*)");
This should do the trick:
<%=Pages\\.getString\\(\\\\\"([^\\\\]*)
Yeah, that's a scary number of backslashes. matcher.group(1) should return MY_HEADER. The match starts at the \" and continues until the next \ (which I assume here will be at \")%>).
Of course, if your target text contains a backslash (\), this will not work. But you didn't give an indication that you'd ever be looking for something like <%=Pages.getString(\"Fun!\Yay!\")%> -- where this regex would only return Fun! and ignore the rest.
EDIT
The reason your test case was failing is that you were using this test string:
String currentLine = "<%=Pages.getString(\"MY_HEADER\")%>";
This is the equivalent of reading it in from a file and seeing:
<%=Pages.getString("MY_HEADER")%>
Note the lack of any \. You need to use this instead:
String currentLine = "<%=Pages.getString(\\\"MY_HEADER\\\")%>";
Which is the equivalent of what you want.
This is test code that works:
String currentLine = "<%=Pages.getString(\\\"MY_HEADER\\\")%>";
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\]*)");
Matcher matcher = pattern.matcher(currentLine);
while(matcher.find()) {
System.out.println(matcher.group(1).toString());
}
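If the input might arrive in either form (with or without the literal backslashes), one possible variation is to make the backslash optional and stop the capture at either a backslash or a quote. This is a sketch under that assumption, not the pattern from the answer; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeaderNameExtractor {
    // Regex seen by the engine: <%=Pages\.getString\(\\?"([^\\"]*)
    //   \\?       an optional literal backslash before the opening quote
    //   [^\\"]*   the header name, up to the next backslash or quote
    private static final Pattern HEADER =
            Pattern.compile("<%=Pages\\.getString\\(\\\\?\"([^\\\\\"]*)");

    // Returns the header name, or null when the line has no such tag.
    static String extract(String line) {
        Matcher m = HEADER.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // As read from a file that contains the backslashes:
        System.out.println(extract("<%=Pages.getString(\\\"MY_HEADER\\\")%>"));
        // As a plain Java literal, which contains no backslashes at runtime:
        System.out.println(extract("<%=Pages.getString(\"MY_HEADER\")%>"));
    }
}
```

Compiling the pattern once as a constant also avoids recompiling it for every line read from the file.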

java Regular expression matching html

Solution: this works:
String p="<pre>[\\w\\W]*</pre>";
I want to match and capture the content enclosed by the <pre></pre> tag.
I tried the following, which does not work. What's wrong?
String p="<pre>.*</pre>";
Matcher m=Pattern.compile(p,Pattern.MULTILINE|Pattern.CASE_INSENSITIVE).matcher(input);
if(m.find()){
String g=m.group(0);
System.out.println("g is "+g);
}
Regex is in fact not the right tool for this. Use a parser. Jsoup is a nice one.
Document document = Jsoup.parse(html);
for (Element element : document.getElementsByTag("pre")) {
System.out.println(element.text());
}
The parse() method can also take a URL or File.
The reason I recommend Jsoup is that it is the least verbose of all the HTML parsers I tried. It not only provides JavaScript-like methods returning elements that implement Iterable, but also supports jQuery-like selectors, which was a big plus for me.
You want the DOTALL flag, not MULTILINE. MULTILINE changes the behavior of the ^ and $, while DOTALL is the one that lets . match line separators. You probably want to use a reluctant quantifier, too:
String p = "<pre>.*?</pre>";
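Putting DOTALL and the reluctant quantifier together, a minimal sketch could look like this. The class name, method name, and sample input are assumptions for illustration; a capturing group is added so that only the enclosed content is returned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PreContentExtractor {
    // DOTALL lets '.' cross line breaks; '.*?' stops at the first closing tag.
    private static final Pattern PRE =
            Pattern.compile("<pre>(.*?)</pre>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the content of every <pre>...</pre> block, in document order.
    static List<String> extract(String input) {
        List<String> blocks = new ArrayList<>();
        Matcher m = PRE.matcher(input);
        while (m.find()) {
            blocks.add(m.group(1));
        }
        return blocks;
    }

    public static void main(String[] args) {
        String input = "<pre>line one\nline two</pre><p>x</p><PRE>second</PRE>";
        for (String block : extract(input)) {
            System.out.println("[" + block + "]");
        }
    }
}
```

With a greedy .* under DOTALL, the match would run from the first <pre> to the last </pre> in the input, swallowing everything in between. This sketch still assumes the tags carry no attributes and are not nested.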
String stringToSearch = "H1 FOUR H1 SCORE AND SEVEN YEARS AGO OUR FATHER...";
// the case-insensitive pattern we want to search for
Pattern p = Pattern.compile("H1", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(stringToSearch);
// see if we found a match
int count = 0;
while (m.find())
count++;
System.out.println("H1 : "+count);
