crawling wikipedia pages with jsoup by href - java

I'm trying to get URLs from Wikipedia pages with jsoup using this line of code:
Elements linksOnPage = document.select("a[href~=\"/wiki/\"(([A-Za-z])*|_)]");
to get links that look like https://en.wikipedia.org/wiki/United_Stat or https://en.wikipedia.org/wiki/English_people, etc., but it doesn't work for me. So I'm looking to get links from tags whose href matches: /wiki/[A-Za-z]*|_
and not something like this: https://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard

I have a couple of ideas about your task:
It seems you don't want only articles whose titles consist purely of Latin letters, so the list of allowed characters could be extended to digits, etc.
Basically, what your current regexp says is "give me '/wiki/', then give me either an underscore or a sequence of English letters of any length", so it makes sense to remove the 'or' clause and include the underscore in the list of allowed characters.
To avoid special links that contain ':', you can require that the regexp match ends only at the end of the href attribute. To achieve this, put '$' at the end of the regexp.
I played a bit with jsoup, and something like this pulled from Wikipedia pretty much what you were looking for, I think:
Elements allInfoLinks = doc.select("a[href~=\\/wiki\\/([a-zA-Z0-9_/&?]+)$]");
By the way, whenever you have issues with regular expressions, you may find https://regex101.com/ very useful for debugging.
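The suggestions above can be tried offline by applying the same kind of pattern to plain href strings. This is only a sketch using java.util.regex instead of jsoup: the class name WikiLinkFilter, the method names and the sample hrefs are made up for illustration, and the character class is narrowed to letters, digits and underscores. It uses find(), so the pattern may start anywhere in the href but, because of '$', must run to its end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class WikiLinkFilter {
    // "/wiki/" followed only by letters, digits or underscores up to the end of the href.
    // A ':' (as in "Wikipedia:..." or "File:...") or '%' stops the match before '$'.
    private static final Pattern ARTICLE_HREF = Pattern.compile("/wiki/[A-Za-z0-9_]+$");

    static boolean isArticleLink(String href) {
        return ARTICLE_HREF.matcher(href).find();
    }

    static List<String> keepArticleLinks(List<String> hrefs) {
        List<String> kept = new ArrayList<>();
        for (String href : hrefs) {
            if (isArticleLink(href)) {
                kept.add(href);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical sample hrefs, not taken from a real crawl.
        List<String> hrefs = Arrays.asList(
            "https://en.wikipedia.org/wiki/English_people",
            "https://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard",
            "/wiki/File:Example.jpg");
        System.out.println(keepArticleLinks(hrefs));
    }
}
```

With these samples only the English_people link is kept; titles containing other characters (parentheses, percent-encoded letters) would need a wider character class.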

Related

Specific regex name with all degrees in front of name and behind name

After long hours of trying to figure out how to write a special regex, I realized I won't be able to solve this without help, since I am a novice in regular expressions. My task is to create a regex which will extract names with degrees from HTML source code.
The website is here: http://bacula.nti.tul.cz/~jan.hybs/ada/ where you can find the source code. I need to create a regex which will take all names with degrees. The output should look something like this: prof. Ing. Josef Novak, Ph. D. etc. Simply put, everything from the column called "Propojeni" should be extracted.
Order is important for me. (I am putting the results into an ArrayList.)
I am able to write a regex for any single pattern, but not for all of the patterns which are displayed in "Propojeni".
I really appreciate any helping answer.
A proper solution shouldn't involve regex but an XML/HTML parser like jsoup.
With this tool your code could look like:
Document doc = Jsoup.connect("http://bacula.nti.tul.cz/~jan.hybs/ada/").get();
Elements personel = doc.select("tr td:eq(1)");
for (Element person : personel) {
    System.out.println(person.text());
}
select("tr td:eq(1)") finds all tr elements and, inside them, the td whose sibling index equals 1 (counting from 0). So if a tr has 3 td elements, the middle one has index 1, and that is what we are after.
Element#text() returns the text that the selected Element represents: <td><a link="foo"> bar </a></td> is shown as bar in a browser (with link decoration), and that is what text() returns.
But if you really MUST use regex (because someone is threatening you or your family), then one idea is to focus not on the content itself but on the context which guarantees that the content will be there. In your case it seems like you can look for CONTENT and select CONTENT.
So your regex can look like
String regex = "(.*?)";
and all you will need to do is extract content of (.*?) (which is group 1).
So your code can look something like
String regex = "(.*?)";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(yourHtml);
while(m.find()){
System.out.println(m.group(1));
}
The ? in (.*?) makes * reluctant, so it tries to find the shortest possible match. This code will most likely work without that ?, since . by default can't match line separators, but if your HTML looked like
foobar
then (.*) for regex (.*) would represent
foobar
^^^^^^^^^^^^^^^^^^^^^^^^
instead of
foobar
^^^
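The difference between the greedy and the reluctant quantifier can be checked with a small runnable sketch. The <td> delimiters and the two sample cells are assumptions chosen for illustration, as are the class name GreedyVsReluctant and the helper captures; only the (.*) versus (.*?) contrast comes from the explanation above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Collects group 1 of every match of the given regex in the input.
    static List<String> captures(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical one-line markup with two cells.
        String html = "<td>foo</td><td>bar</td>";
        // Greedy: .* runs to the last </td>, so both cells end up in one capture.
        System.out.println(captures("<td>(.*)</td>", html));
        // Reluctant: .*? stops at the first </td>, giving one capture per cell.
        System.out.println(captures("<td>(.*?)</td>", html));
    }
}
```

The greedy pattern produces a single capture spanning both cells, while the reluctant one produces foo and bar separately.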

Java String contains a special Char but not even one more Char

I am looking for every single URL, which is linked as "eye" in a html Document. I am using a regex pattern, because a simple contains is no solution at this point. So I got a pattern like this
Pattern:: href=\"(https?://)?[a-zA-z0-9?/&=\"+-_\\.# ]*>[Ee]ye
It works... fine... more or less... Because I get more than any URL linked as "Eye" or "eye". I'll get URLs which are linked as "eyebrights" or "eyewears", too, but that's not what I want.
Is there any way to say "get me this and ignore it, when there is more than I want"?
You should try to avoid using regex to parse XML/HTML. Use an XML/HTML parser like jsoup instead. With this library your code could look like:
Elements links = doc.select("a[href]:matches(^[eE]ye\\b)");
//Elements extends ArrayList<Element> so you can easily iterate over it
more info at http://jsoup.org/cookbook/extracting-data/selector-syntax
Add \b after eye:
href=\"(https?://)?[a-zA-z0-9?/&=\"+-_\\.# ]*>[Ee]ye\\b
\b: assert position at a word boundary.
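The effect of the word boundary can be seen by testing link texts directly. A minimal sketch, assuming the link text has already been extracted; the class name EyeLinkText and the sample texts are illustrative only.

```java
import java.util.regex.Pattern;

public class EyeLinkText {
    // ^ anchors at the start of the text; \b requires that "eye" is not
    // immediately followed by another word character (letter, digit, underscore).
    private static final Pattern EYE_WORD = Pattern.compile("^[Ee]ye\\b");

    static boolean startsWithWordEye(String linkText) {
        return EYE_WORD.matcher(linkText).find();
    }

    public static void main(String[] args) {
        String[] samples = { "eye", "Eye", "eye care", "eyebrights", "eyewears" };
        for (String text : samples) {
            System.out.println(text + " -> " + startsWithWordEye(text));
        }
    }
}
```

Note that \b only checks the boundary, so a text such as "eye care" still matches; anchoring with $ instead would restrict the match to the single word.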

extract information from xml using regular expression

I need to extract the author from the text using regex. I also need the index of every tag and author. I tried a few parsers, but none of them preserves the index correctly, so the only solution is using regex. I have the following regex, and it has a problem with "[^]".
How could I fix this regex:
<post\\s*author=\"([^\"]+)\"[^>]+>[^</post>]*</post>
in order to extract the author in following text:
<post author="luckylindyslocale" datetime="2012-03-03T04:52:00" id="p7">
<img src="http://img.photobucket.com/albums/v303/lucky196/siggies/ls1.png"/>
Grams thank you, for this wonderful tag and starting this thread. I needed something to encourage me to start making some new tags.
<img src="http://img.photobucket.com/albums/v303/lucky196/holidays/stpatlucky.jpg"/>
Cruelty is one fashion statement we can all do without. ~Rue McClanahan
</post>
Why couldn't regex:
<post\\s*author=\"([^\"]+)\"[^>]+>[^</post>]*</post>
extract the author in following text.
Because
[^</post>]*
represents a character class and will match everything but the characters <, /, p, o, s, t, and > 0 or more times.
That doesn't happen in your text. As for how to fix it, consider using the following regex
<post\s*author=\"([^\"]+?)\"[^>]+>(.|\s)*?<\/post>
// obviously, escape appropriate characters in Java String literals
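No extra flag is needed: the (.|\s) alternation already matches line breaks. (In Java, Pattern.MULTILINE only changes how ^ and $ behave; Pattern.DOTALL is the flag that lets . match line terminators.)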
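Since the question also asks for positions, here is a hedged sketch of how the author values and their offsets could be collected with Matcher.start and Matcher.end. It is a variant of the pattern above, not the same one: it uses a plain .*? with Pattern.DOTALL instead of (.|\s)*?. The class name PostAuthors, the method authorSpans and the shortened post body are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PostAuthors {
    // DOTALL lets '.' cross line breaks; the reluctant .*? stops at the first </post>.
    private static final Pattern POST = Pattern.compile(
        "<post\\s+author=\"([^\"]+)\"[^>]*>.*?</post>", Pattern.DOTALL);

    /** Returns {start, end} character offsets of each author value in the input. */
    static List<int[]> authorSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = POST.matcher(text);
        while (m.find()) {
            spans.add(new int[] { m.start(1), m.end(1) });
        }
        return spans;
    }

    public static void main(String[] args) {
        // Shortened version of the sample post from the question.
        String text =
            "<post author=\"luckylindyslocale\" datetime=\"2012-03-03T04:52:00\" id=\"p7\">\n"
            + "Grams thank you, for this wonderful tag.\n"
            + "</post>";
        for (int[] span : authorSpans(text)) {
            System.out.println(text.substring(span[0], span[1]) + " starts at " + span[0]);
        }
    }
}
```

m.start() and m.end() without a group number would give the offsets of the whole post element in the same way.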
You can just do it like the following
/<post author="(.*?)"/
The comments are correct though with Regex not being the best tool to parse HTML. But this should do what you are looking for

Undoing automatic linkification using Java and Regex

I am working with a database whose entries contain automatically generated html links: each URL was converted to
<a href="URL">URL</a>
I want to undo these links: the new software will generate the links on the fly. Is there a way in Java to use .replaceAll or a Regex method that will replace the fragments with just the URL (only for those cases where the URLs match)?
To clarify, based on the questions below: the existing entries will contain one or more instances of linkified URLs. Showing an example of just one:
I visited <a href="http://www.amazon.com/">http://www.amazon.com/</a> to buy a book.
should be replaced with
I visited http://www.amazon.com/ to buy a book.
If the URL in the href differs in any way from the link text, the replacement should not occur.
You can use this pattern with replaceAll method:
<a (?>[^h>]++|\Bh|h(?!ref\b))*href\s*=\s*["']?(http://)?([^\s"']++)["']?[^>]*>\s*+(?:http://)?\2\s*+<\/a\s*+>
replacement: $1$2
I wrote the pattern in raw form, so don't forget to escape the double quotes and double the backslashes before using it in a Java string literal.
The main point of this pattern is that the URLs are compared without the leading http://, so that more links are matched.
First, a reminder that regular expressions are not great for parsing XML/HTML: this HTML should parse out the same as what you've got, but it's really hard to write a regex for it:
<
a
foo="bar"
href="URL">
<nothing/>URL
</a
>
That's why we say "don't use regular expressions to parse XML!"
But it's often a great shortcut. What you're looking for is a back-reference:
\1
This will match when the quoted string and the contents of the a-element are the same. The \1 matches whatever was captured in group 1. You can also use named capturing groups if you like a little more documentation in your regular expressions. See Pattern for more options.
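A minimal sketch of the back-reference idea, assuming the links have the simple form <a href="URL">URL</a> with double quotes and no other attributes (real data may need a more tolerant pattern). The class name Unlinkify and the method unlink are illustrative, not part of any library.

```java
import java.util.regex.Pattern;

public class Unlinkify {
    // Group 1 captures the href value; \1 then requires the link text to be
    // exactly the same string, so links with different text are left alone.
    private static final Pattern SELF_LINK = Pattern.compile(
        "<a\\s+href=\"([^\"]+)\"\\s*>\\1</a>");

    static String unlink(String html) {
        // "$1" in the replacement inserts the captured URL.
        return SELF_LINK.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        String entry = "I visited <a href=\"http://www.amazon.com/\">http://www.amazon.com/</a> to buy a book.";
        System.out.println(unlink(entry));
    }
}
```

An entry whose link text differs from the href, such as a link labeled "click here", passes through unlink unchanged.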

Regex href parsing

A regex question in Java.
I'm scraping ID numbers from the href attribute of a elements. I have a bunch of links like these in a string:
Whatever
After the 'pdf' and slash comes an Id number, which I'm interested in.
So I must get all IDs from multiple occurrences of this kind of URL in the string. What would be the best regex for it?
Thanks in advance.
If you know that the url will be exactly this, your regex can just be:
someplacelol\\.com/pdf/([0-9]+)/
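To collect every ID rather than just the first, the pattern can be applied in a find() loop. This is a sketch under assumptions: the exact URL shape after the ID is not given in the question, so the sample links below are hypothetical, as are the class name PdfIds and the method extractIds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PdfIds {
    // Group 1 is the run of digits between "/pdf/" and the next slash.
    private static final Pattern PDF_ID = Pattern.compile("someplacelol\\.com/pdf/([0-9]+)/");

    static List<String> extractIds(String html) {
        List<String> ids = new ArrayList<>();
        Matcher m = PDF_ID.matcher(html);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical links; only the "someplacelol.com/pdf/<id>/" part follows the question.
        String html = "<a href=\"http://someplacelol.com/pdf/12345/report\">Whatever</a> "
            + "<a href=\"http://someplacelol.com/pdf/678/other\">Whatever</a>";
        System.out.println(extractIds(html));
    }
}
```

The IDs are returned as strings in document order; Integer.parseInt or Long.parseLong can convert them if numeric values are needed.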
I'm no regex artist but you should be able to get the url out of the element with:
\<a\s.*?href=(?:\"([\w\.:/?=&#%_\-]*)\"|([^\"][\w\.:/?=&#%_\-]*[^\"\>])).*?\>
The first group will contain the URL.
From there you should be able to extract the number without too much difficulty. I tested that link on the source of this page and it was able to correctly identify all of the HREFS in all of the as.
Please don't comment and say it breaks for <a id="<<<>><><<>>href=" href="<a href="> because the OP has stated in the description of the problem that ridiculous abuses of the HTML standard such as this one will not be present in his trial cases.
Also, if for some weird reason, an element has 2 hrefs, only the first will be grabbed. You could probably address that if you cared.
Edit: added whitespace requirement after <a so it won't match things like <asdffsdfsfg href="lol">.
