Jsoup select links for different websites - java

I am filtering links out of an HTML body using jsoup.
For a webpage such as https://en.wikipedia.org/wiki/Cloud_computing
I want to filter out links such as:
https://en.wikipedia.org/wiki/Light
For hash links such as en.wikipedia.org/wiki/Cloud_computing#cite_note-1
I tried doc.select("a[href*=#]").remove(); and it works well. In the page's HTML source these links look like <a href="#cite_ref-1">
But when I use doc.select("a[href]*=/]").remove(); for the other links in the page's HTML source,
there are still links that are not filtered. How is this possible?

You have a typo: there is a stray ] right after href.
doc.select("a[href]*=/]").remove();
It should be like this:
doc.select("a[href*=/]").remove();
But this would remove every link whose href contains a /.
Is this what you want, or do you want to remove every link whose href starts with /?
In that case you need this:
doc.select("a[href^=/]").remove();
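To make the difference between the two selectors concrete, here is a minimal sketch in plain Java that mirrors what *= (substring match) and ^= (prefix match) test on the href value. It does not call jsoup; the class and method names are made up for illustration, and the relative href /wiki/Light is an assumption about what the page source contains.

```java
// Plain-Java mirror of the two attribute selectors; this is not jsoup code.
final class HrefSelectorSketch {

    // Mirrors a[href*=/]: true when the value contains "/" anywhere.
    static boolean containsSlash(String href) {
        return href.contains("/");
    }

    // Mirrors a[href^=/]: true only when the value starts with "/".
    static boolean startsWithSlash(String href) {
        return href.startsWith("/");
    }

    public static void main(String[] args) {
        // "/wiki/Light" is an assumed relative link, used only for illustration.
        String[] hrefs = {
            "#cite_ref-1",
            "/wiki/Light",
            "https://en.wikipedia.org/wiki/Light"
        };
        for (String href : hrefs) {
            System.out.println(href
                    + "  contains=" + containsSlash(href)
                    + "  startsWith=" + startsWithSlash(href));
        }
    }
}
```

Under these assumptions the substring test also catches absolute URLs, while the prefix test only catches site-relative links.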

Related

Jsoup extract Hrefs from the HTML content

I am trying to get the hrefs from this site with jsoup:
https://www.amazon.de/s?k=kissen&__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&ref=nb_sb_noss_2
but it does not work.
I tried to select the links by their class like this:
Elements elements = documentMainSite.select(".a-link-normal");
and after that I tried to extract the hrefs with the following piece of code:
for (Element element : elements) {
String href = element.attributes().get("href");
}
but unfortunately it gives me nothing.
Can someone tell me where my mistake is, please?
I don't just connect to the website. I also save the hrefs in a string by extracting them with
String href = element.attributes().get("href");
After that I printed the href string, but it is empty.
On the other hand, the code works with another CSS selector, so it has nothing to do with the code itself. It's just the CSS selector (.a-link-normal) that is probably wrong.
You won't get anything by simply connecting to the URL via jsoup.
Document document = Jsoup.connect(yourUrl).get();
String bodyText = document.getElementsByTag("body").get(0).text();
Here is a translation of the body text that I got from the code above:
Enter the characters below We ask for your understanding and want to
be sure that you are not a bot. For best results, please use a browser
that accepts cookies. Type the characters you see in the image: Enter
characters Try another image Continue shopping Terms & Conditions
Privacy Policy © 1996-2015, Amazon.com, Inc. or its affiliates
You need to either get past the CAPTCHA or emulate a browser, for example with Selenium.

Avoid links navigating to same page

I am using jsoup to recursively crawl a web page. I have links like this:
<a href="#">hash</a>
<a href="#top">hashtop</a>
<a href="http://www.google.com">google</a>
I don't have a problem with links like the third one. The first and second navigate within the same page. When I fetch the anchor tags from the document, I get the parent URL itself for # and parenturl#top for the second one. I don't want to fetch those kinds of links. Can someone let me know how to avoid fetching them in jsoup?
You should be able to use the following:
doc.select("a[href~=^[^#]]")
This uses the [attr~=regex] selector syntax with a regex that will only match strings that do not start with #.
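To check what the regex itself accepts, here is a small sketch using only java.util.regex, with no jsoup involved. It assumes the selector applies the pattern with find() semantics, which is how I understand jsoup's regex attribute selector to behave; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Tests the regex from the selector against the hrefs from the question.
final class NonFragmentHrefSketch {

    // Same regex as in the selector: the first character exists and is not '#'.
    private static final Pattern NOT_HASH_FIRST = Pattern.compile("^[^#]");

    // True when a link with this href would be kept by the selector.
    static boolean isKept(String href) {
        return NOT_HASH_FIRST.matcher(href).find();
    }

    public static void main(String[] args) {
        String[] hrefs = {"#", "#top", "http://www.google.com"};
        for (String href : hrefs) {
            System.out.println(href + " -> kept=" + isKept(href));
        }
    }
}
```

Note that an empty href is also rejected, because the pattern requires at least one character.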

Reading HTML using jsoup

I am trying to get an HTML element from a website using jsoup, but the HTML that I get from Jsoup.connect(url) is incomplete compared to what I see in the browser's inspector on the website.
EDIT: this is the link I'm working with: https://www.facebook.com/livemap##35.831640894,24.82275312499999,2z
The numbers at the end are the coordinates of the map, and you don't have to sign in to access the page, so there is no authentication problem.
UPDATE:
I have found that the element I want does not get expanded when using jsoup. Is this a problem related to slow page loading? If so, how can I make sure that Jsoup.connect(url) fully loads the webpage before fetching the HTML?
From the inspector (the <div id="u_0_e"> is expanded)
From Jsoup.connect (the <div id="u_0_e"> is not expanded)
Jsoup doesn't execute JavaScript or jQuery events, so you get the initial page as it is before any JavaScript runs.

Jsoup href with function jscript

I'm using the jsoup library to extract some data from an HTML page, but now I need to jump to the next page, and the link to it is on the following line.
<a class="jsEnabled nextBtn cursorPointer" href="javascript:setSelectedLink('NextPageButton');" title="Next page" alt="Next page"></a>
That is, the link is a call to a JavaScript function. How do I get the link dynamically?
Unfortunately jsoup can't execute JavaScript. But you can use other libraries, e.g. HtmlUnit, to do so.
Did you check whether the website has a plain HTML alternative that allows you to get to the next page?

jsoup: removing iframe tags

I am using jsoup 1.6.1 and am facing a problem when I try to remove an iframe tag from HTML. When the iframe does not have a body (i.e. <iframe pro=value />), the remove() method removes all the content after that tag. Here is my sample code.
String html = "<p> This is start.</p><iframe frameborder=\"0\" marginheight=\"0\" /><p> This is end</p>";
Document doc = Jsoup.parse(html);
doc.select("iframe").remove();
System.out.println(doc.text());
It returns:
This is start.
But I am expecting the result -
This is start. This is end
Thanks in advance
It appears the closing tag for iframe is required. You can't use a self-closing tag:
http://msdn.microsoft.com/en-us/library/ie/ms535258(v=vs.85).aspx
http://stackoverflow.com/questions/923328/line-after-iframe-is-not-visible
http://www.w3resource.com/html/iframe/HTML-iframe-tag-and-element.php
So, jsoup is following the spec and taking whatever follows the iframe tag and using that as its body. When you remove the iframe, "This is end" gets removed along with it.
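If you cannot change the markup at its source, one possible workaround is to rewrite self-closed iframe tags into an explicit open/close pair before handing the string to the parser. The sketch below is only a string preprocessing step in plain Java, not a jsoup feature; the class and method names are made up, and the regex assumes that attribute values never contain a > character.

```java
// Rewrites <iframe ... /> into <iframe ...></iframe> before parsing.
final class IframeCloseSketch {

    // Assumption: no attribute value contains '>', so [^>]* stays inside the tag.
    static String expandSelfClosedIframes(String html) {
        return html.replaceAll("(?i)<iframe\\b([^>]*?)\\s*/>", "<iframe$1></iframe>");
    }

    public static void main(String[] args) {
        String html = "<p> This is start.</p>"
                + "<iframe frameborder=\"0\" marginheight=\"0\" />"
                + "<p> This is end</p>";
        // The rewritten string can then be passed to the HTML parser.
        System.out.println(expandSelfClosedIframes(html));
    }
}
```

With the iframe explicitly closed, the paragraph that follows it should no longer be treated as the iframe's body.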
