Regex href parsing - java

A regex question in Java.
I'm scraping ID numbers from the href attribute of a elements. I have a bunch of links like these in a string:
Whatever
After the 'pdf' and the slash comes an ID number, which is what I'm interested in.
So I need to get all the IDs from multiple occurrences of this kind of URL in the string. What would be the best regex for it?
Thanks in advance.

If you know that the url will be exactly this, your regex can just be:
someplacelol\\.com/pdf/([0-9]+)/
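A minimal sketch of how that pattern could be applied in Java to collect every ID, looping with Matcher.find(). The class name, method name, and the sample HTML are assumptions made for illustration; the original example link was not preserved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PdfIdExtractor {
    // Pattern from the answer above; group 1 captures the digits after /pdf/.
    private static final Pattern ID_PATTERN =
            Pattern.compile("someplacelol\\.com/pdf/([0-9]+)/");

    // Returns every ID found in the input, in order of appearance.
    static List<String> extractIds(String html) {
        List<String> ids = new ArrayList<>();
        Matcher m = ID_PATTERN.matcher(html);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical input, only the domain and the /pdf/<id>/ shape come from the question.
        String html = "<a href=\"http://someplacelol.com/pdf/123/a.pdf\">A</a> "
                    + "<a href=\"http://someplacelol.com/pdf/4567/b.pdf\">B</a>";
        System.out.println(extractIds(html));
    }
}
```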

I'm no regex artist but you should be able to get the url out of the element with:
\<a\s.*?href=(?:\"([\w\.:/?=&#%_\-]*)\"|([^\"][\w\.:/?=&#%_\-]*[^\"\>])).*?\>
The first group will contain the URL when the attribute value is quoted; the second group catches an unquoted value.
From there you should be able to extract the number without too much difficulty. I tested it on the source of this page and it correctly identified the href in every a element.
Please don't comment and say it breaks for <a id="<<<>><><<>>href=" href="<a href="> because OP has stated in his description of the problem that ridiculous abuses of the HTML standard such as this one will not be present in his test cases.
Also, if for some weird reason, an element has 2 hrefs, only the first will be grabbed. You could probably address that if you cared.
Edit: added whitespace requirement after <a so it won't match things like <asdffsdfsfg href="lol">.

Related

crawling wikipedia pages with jsoup by href

I'm trying to get URLs from Wikipedia pages with jsoup using this line of code:
Elements linksOnPage = document.select("a[href~=\"/wiki/\"(([A-Za-z])*|_)]");
I want links that look like https://en.wikipedia.org/wiki/United_Stat or https://en.wikipedia.org/wiki/English_people, etc., but it doesn't work for me. So I'm looking to get links from tags whose href matches /wiki/[A-Za-z]*|_
and not something like this: https://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard
I have a couple ideas about your task:
It seems you don't want only the articles whose names contain nothing but Latin letters, so the list of allowed characters could be extended to digits etc.
Basically, what your current regexp says is "give me '/wiki/', then give me either an underscore or a sequence of English letters of any length", so it makes sense to remove the 'or' clause and include the underscore in the list of allowed characters.
To avoid special links that contain ':', you can check that the regexp match stops only after it has matched the whole href attribute. To achieve that, put '$' at the end of the regexp.
I played a bit with jsoup and something like this parsed from wikipedia pretty much what you were looking for, I think:
Elements allInfoLinks = doc.select("a[href~=\\/wiki\\/([a-zA-Z0-9_/&?]+)$]");
By the way, whenever you have issues with regular expressions, you may find https://regex101.com/ very useful for debugging.
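To see what the pattern inside that selector accepts without needing jsoup on the classpath, here is a small sketch using plain java.util.regex. It assumes the selector's pattern is applied to the href value with find-style matching (which is how jsoup's [attr~=regex] selector is documented to behave); the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class WikiLinkFilter {
    // Same expression as in the selector above, without the selector syntax around it.
    private static final Pattern ARTICLE =
            Pattern.compile("/wiki/([a-zA-Z0-9_/&?]+)$");

    // True when the href ends in /wiki/<name> and <name> uses only the allowed characters.
    static boolean isArticleLink(String href) {
        return ARTICLE.matcher(href).find();
    }

    public static void main(String[] args) {
        String[] hrefs = {
            "https://en.wikipedia.org/wiki/English_people",
            "https://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard"
        };
        for (String href : hrefs) {
            System.out.println(href + " -> " + isArticleLink(href));
        }
    }
}
```

The ':' in the second URL is not in the character class, so the match cannot reach the '$' and the link is rejected.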

Download list of pages from some domain with URL constraint

I need to download a list of all the pages on some domain that have specific URL endings.
For example, I have a webpage, like http://brnensky.denik.cz/, which is a Czech webpage with news. Every article has URL ending with post date, like http://brnensky.denik.cz/zpravy_region/ruzova-kola-usnadni-presun-po-brne-20140418.html.
So I would like to find the list of all URLs that begin with http://brnensky.denik.cz/, then whatever, and then for example -20140418.html. Is it possible to achieve?
I'm trying to solve this in Java, but also any other way would help.
Regex would be
^http://brnensky\.denik\.cz.*[0-9]{8}\.html
Logic
The URL begins with the site address and ends with date.html, where the date is always an 8-digit string.
You may have to escape '/' depending on the tool or language used to implement this expression.
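In Java specifically, '/' needs no escaping, but each backslash has to be doubled inside a string literal. A minimal sketch of filtering a list of already-collected URLs with this regex follows; the class and method names are assumptions, and how the URLs are gathered (crawling the site) is outside its scope.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class DatedUrlFilter {
    // The regex from the answer above, written as a Java string literal.
    private static final Pattern DATED_ARTICLE =
            Pattern.compile("^http://brnensky\\.denik\\.cz.*[0-9]{8}\\.html");

    // Keeps only the URLs that start with the site address and contain <8 digits>.html.
    static List<String> filter(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (DATED_ARTICLE.matcher(url).find()) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        urls.add("http://brnensky.denik.cz/");
        urls.add("http://brnensky.denik.cz/zpravy_region/ruzova-kola-usnadni-presun-po-brne-20140418.html");
        System.out.println(filter(urls));
    }
}
```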

Regular Expressions to match an <a> tag

I am writing a small java program for a class, and I can't quite figure out why my regex isn't working properly. In the special case of having 2 tags on the same line that is read in, it only matches the second one.
Here is a link that has the regex included, along with a simple set of test data:
Regex Test Link.
In my java program I have the following code:
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
String[] results;
System.out.println(p.toString());
Matcher m = null;
while ((line = input.readLine()) != null) {
    m = p.matcher(line);
    while (m.find()) {
        System.out.println("Matches: " + m.group(1));
    }
}
The goal is to extract the href value, as long as it starts with http:// and the URL either names no page (like http://www.google.com) or ends in index.htm or index.html (like http://www.google.com/index.html).
My regex works for every case above, but doesn't match in the special case of 2 tags that are on the same line.
Any help is appreciated.
Just use a proper HTML parsing library, such as HTML cleaner. It is theoretically impossible to properly parse HTML with a regex - there are so many constructs that will confound it. For example:
<![CDATA[ > bar ]]>
This is not a link. This is literal text in XHTML.
baz
This is only one link.
<a rel="next" href="bar?2">Next</a>
This is a realistic example of a link with a relation attribute and a relative URI.
<a name="foo">The href="http://example.com" part is the link destination...</a>
This is a named anchor, not a link. However your regex would parse out the literal text here as a link.
Foo
Does your regex handle line-spanning links properly?
There are all kinds of other Fun edge cases that can occur. Save yourself time and headaches. These problems have already been solved and wrapped up in nice neat libraries for you to use. Take advantage of this.
Regexes may be a powerful tool, but as they say - when all you have is a hammer, everything looks like a nail. You are currently trying to hammer in a screw.
This worked for me in that regex tester page
<a[^>]*>[^<]*</a>
Regex Solution
So I was playing around and realized my issue. I adjusted my regex a bit. My main problem was at the beginning my .* was causing everything to match up until the last tag, and therefore it was really only matching once instead of twice. I made that .* lazy and it matched twice instead of once. That was the only issue. Once that regex was added to java, my loop code worked fine.
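The effect described here can be reproduced with a small sketch. The two regexes below are hypothetical stand-ins (the poster's actual regex is not shown); they differ only in whether the leading .* is greedy or lazy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Collects group 1 of every match of regex in line.
    static List<String> hrefs(String regex, String line) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(line);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "<a href=\"http://www.google.com\">G</a> "
                    + "<a href=\"http://www.example.com/index.html\">E</a>";

        // Greedy: the leading .* swallows the first tag, so only the last href is found.
        String greedy = ".*<a\\s+href=\"([^\"]*)\"";
        // Lazy: the leading .*? stops at the first tag, so find() can continue to the second.
        String lazy = ".*?<a\\s+href=\"([^\"]*)\"";

        System.out.println("greedy: " + hrefs(greedy, line));
        System.out.println("lazy:   " + hrefs(lazy, line));
    }
}
```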
Thanks everyone that responded. While you may not have provided the answer, your comments got me thinking in the right direction!
You would have to look through all the matches you got per line and find which one looks like a url (like with some more regex ;))

REGEX: adding links in an HTML text

I have a puzzle that requires your help: I need to replace certain words with links in an HTML text.
For example, I have to replace "word" with "<a href="...">word</a>"
The difficulty is twofold:
1. not to add links in tag attributes
2. not to add links inside other links (nested links).
I found a solution for case (1) but I cannot handle case (2).
Here is my simplified code:
String text="sample text <a>sample text</a> sample <a href='http://www.sample.com'>a good sample</a>";
String wordToReplace="sample";
String pattern="\\b"+wordToReplace+"\\b(?![^<>]*+>)"; //the last part is here to solve problem (1)
String link="["+wordToReplace+"]"; //for more clarity, the generated link is replaced by [...]
System.out.println(text.replaceAll(pattern,link));
The result is:
[sample] text <a>[sample] text</a> [sample] <a href='http://www.sample.com'>a good [sample]</a>
Problem: there is a link inside another link.
Do you have any idea how to solve this problem?
Thank you in advance.
Parsing HTML with regex is always a bad idea, precisely because of odd cases such as this. It would be better to use an HTML parser. Java has a built-in HTML parser that ships with Swing (in the javax.swing.text.html package) that you might want to look into.
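If a parser is not an option, one regex-only workaround for problem (2) is to let the pattern also match whole a elements and copy those through unchanged, replacing only the word matches found outside them. This is a sketch under stated assumptions, not the parser approach recommended above: it assumes simple, well-formed markup with lowercase, non-nested a tags, and the class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkOutsideAnchors {
    static String linkify(String text, String word) {
        // Group 1 matches a whole <a>...</a> element so it can be copied unchanged.
        // The second alternative is the question's pattern: the word, not inside a tag.
        Pattern p = Pattern.compile(
                "(<a\\b[^>]*>.*?</a>)|\\b" + Pattern.quote(word) + "\\b(?![^<>]*+>)",
                Pattern.DOTALL);
        Matcher m = p.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // As in the question, the generated link is shown as [...] for clarity.
            String replacement = (m.group(1) != null) ? m.group(1) : "[" + m.group() + "]";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "sample text <a>sample text</a> sample "
                    + "<a href='http://www.sample.com'>a good sample</a>";
        System.out.println(linkify(text, "sample"));
    }
}
```

With the question's input, only the two occurrences outside the existing links are wrapped.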

Stripping off urls' in a java string

I've tried this for a couple of hours and wasn't able to do this correctly; so I figured I'd post it here. Here's my problem.
Given a string in java :
"this is <a href='something'>one \nlink</a> some text <a href='fubar'>two \nlink</a> extra text"
Now I want to strip out the link tags from this string using regular expressions, so the resulting string should look like:
"this is one \nlink some text two \nlink extra text"
I've tried all kinds of things in Java regular expressions (capturing groups, greedy quantifiers, you name it) and still can't get it to work quite right. If there's only one link tag in the string, I can get it to work easily. However, my string can have multiple URLs embedded in it, which is what's preventing my expression from working. Here's what I have so far: (?s).*(<a.*>(.*)</a>).*
Note that the string inside the link can be of variable length, which is why I have the .* in the expression.
If somebody can give me a regular expression that'll work, I'll be extremely grateful. Short of looping through each character and removing the links, I can't find a solution.
Sometimes it's easier to do it in 2 steps:
String s = "this is <a href='something'>one \nlink</a> some text <a href='fubar'>two \nlink</a> extra text";
s = s.replaceAll("<a[^>]*>", "").replaceAll("</a>", "");
Result: "this is one \nlink some text two \nlink extra text"
Here's the way I usually match tags:
<a .*?>|</a>
and replace with an empty string.
Alternatively, instead of removing the tag, you might comment it out. The match pattern would be the same, but the replacement would be:
<!--\0-->
or
<!--$0-->
If you want to have a reference to the anchor text, use this match pattern:
<a .*?>(.*?)</a>
and the replacement would be an index of 1 instead of 0.
Note: By default '.' does not match line terminators, so to match anchor text that spans lines you need a programming-language-specific flag. In Java that flag is Pattern.DOTALL (Pattern.MULTILINE only changes how ^ and $ behave). Here's a Java example:
Pattern aPattern = Pattern.compile(regexString, Pattern.DOTALL);
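Putting the capture-group pattern and the flag together, a minimal sketch of stripping the tags while keeping the anchor text might look like this. It uses Pattern.DOTALL so that '.' can cross the \n inside the link text; the class and method names are assumptions.

```java
import java.util.regex.Pattern;

class AnchorStripper {
    // Lazy quantifiers keep each match inside a single <a ...>...</a> pair;
    // DOTALL lets '.' match the newline inside the anchor text.
    private static final Pattern ANCHOR =
            Pattern.compile("<a .*?>(.*?)</a>", Pattern.DOTALL);

    // Replaces every anchor element with its text content (group 1).
    static String strip(String s) {
        return ANCHOR.matcher(s).replaceAll("$1");
    }

    public static void main(String[] args) {
        String s = "this is <a href='something'>one \nlink</a> some text "
                 + "<a href='fubar'>two \nlink</a> extra text";
        System.out.println(strip(s));
    }
}
```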
Off the top of my head
"<a [^>]*>|</a>"
