Using logical OR, NOT, and AND in jsoup selectors - Java

Using this
for (Element link : links) {
String linkHref = link.attr("href");
String linkText = link.text();
}
I can get all the links that have "a href=..".
However, some of them have
href="#"
which I don't need in my string.
So I need to do something like
String linkHref = link.attr("href and !#")
i.e. I don't want to save a link that has "#" as its href.
Is that possible, or do I have to use a regular expression instead?
Please help.

After reading your question, it looks like you want to select all anchor tags that do not have '#' as their href. You can use the :not selector:
Elements links = doc.select("a[href]"); // All anchor tags with href
links = links.select("a:not([href='#'])"); // Filter out links whose href is exactly "#"

Jsoup's select accepts a comma for combining selectors:
doc.select("[href], [src]"); // href **OR** src
For AND, just combine them in a single CSS selector. Check this answer.
doc.select("a[href]:not([href='#'])");
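
A minimal sketch of the three combinations side by side, assuming jsoup is on the classpath. The class name SelectorLogicSketch and the sample markup are made up for illustration only; they are not from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.List;

class SelectorLogicSketch {

    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
            "<a href=\"#\">top</a>"
          + "<a href=\"/docs\">docs</a>"
          + "<a name=\"anchor\">no href</a>"
          + "<img src=\"logo.png\">";

    // AND + NOT: anchors that have an href whose value is not exactly "#".
    static List<String> realLinks(Document doc) {
        return doc.select("a[href]:not([href='#'])").eachAttr("href");
    }

    // OR: any element that has an href attribute or a src attribute.
    static int hrefOrSrcCount(Document doc) {
        return doc.select("[href], [src]").size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(realLinks(doc));
        System.out.println(hrefOrSrcCount(doc));
    }
}
```

With this sample markup, realLinks should keep only the /docs link, while the OR selector also counts the "#" anchor and the image.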

Related

JSoup Extracting absolute url of a href and a div tag data simultaneously

I want to extract two adjacent tags from a website. The first is an a tag, whose href should be extracted as an absolute URL. The second is a div tag, from which I need the data inside it.
I want the output to look like the following:
100 USD http:\www.somesite..............
200 usd http:\www.thesite.............
Why? Because later I will insert them into a table in a database.
I tried the following code, but I couldn't get the absolute URL. In addition, I couldn't get rid of the tags, while I want to extract only the data (without tags).
Document doc = Jsoup.connect("http://www.bezaat.com/ksa/jeddah/cars/all/1?so=77").get();
for (Element link : doc.select("div.rightFloat.price,a[abs:href].more-details"))
{
String absHref = url.attr("abs:href");
String attr = link.absUrl("href");
System.out.println(link);
}
If I try using
System.out.println(link.text())
in my code, I lose the hyperlink completely!
Any help please?
I don't think that jsoup's CSS selector groups (i.e. the comma in the selector) guarantee any ordering in the output. At least I would not count on it, even if you find the two elements in the order you expect. Instead of using the comma selector, I would first loop over the outer containers that hold the adjacent elements you are interested in. Within each container you can then access the price and the link.
Something like this. Note that this is off the top of my head and untested!
Document doc = Jsoup.connect("http://www.bezaat.com/ksa/jeddah/cars/all/1?so=77").get();
for (Element adDiv : doc.select("div.category-listing-normal-ad")){
Element priceDiv = adDiv.select("div.rightFloat.price").first();
Element linkA = adDiv.select("a.more-details").first();
System.out.println(priceDiv.text() + " " + linkA.absUrl("href"));
}

collect only relevant links from url

What I need is to collect the relevant links from a URL. For example, from a page like http://beechplane.wordpress.com/ , I need to collect the links that point to the actual articles, i.e. links like http://beechplane.wordpress.com/2012/11/07/the-95-confidence-of-nate-silver/ , http://beechplane.wordpress.com/2012/03/06/visualizing-probability-roulette/ etc.
How can I get those links in Java? Is it possible using web crawlers?
I use the jsoup library for that.
How to get all <a> tags from a document:
Elements a = doc.select("a");
for (Element el : a) {
//process element
String href = el.attr("href");
}
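
The loop above returns every link, so a filtering step is still needed to keep only the articles. A rough sketch of one way to do that, assuming article permalinks follow the /YYYY/MM/DD/slug/ shape of the two example URLs in the question. The class name ArticleLinkFilter and the /about/ URL are hypothetical, and other sites will need a different pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ArticleLinkFilter {

    // Assumption: article permalinks look like http(s)://host/YYYY/MM/DD/slug/,
    // as in the two example URLs from the question.
    private static final Pattern ARTICLE_URL =
            Pattern.compile("https?://[^/]+/\\d{4}/\\d{2}/\\d{2}/[^/]+/?");

    // Keeps only the hrefs that match the assumed pattern, without duplicates.
    static List<String> articleLinks(List<String> hrefs) {
        List<String> result = new ArrayList<>();
        for (String href : hrefs) {
            if (ARTICLE_URL.matcher(href).matches() && !result.contains(href)) {
                result.add(href);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // In real use these strings would be the absolute hrefs collected in the loop above.
        List<String> hrefs = List.of(
                "http://beechplane.wordpress.com/",
                "http://beechplane.wordpress.com/2012/11/07/the-95-confidence-of-nate-silver/",
                "http://beechplane.wordpress.com/about/", // hypothetical non-article page
                "http://beechplane.wordpress.com/2012/03/06/visualizing-probability-roulette/");
        articleLinks(hrefs).forEach(System.out::println);
    }
}
```

Matching on the URL shape is a heuristic: it is cheap and needs no extra requests, but it only works as long as the site keeps date-based permalinks.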

unable to find absolute URL

I'm writing some code to find the absolute URLs on a single webpage:
http://explore.bfi.org.uk/4ce2b69ea7ef3
So far I get all the links of that page and print the absolute URLs.
Here is part of the code:
Elements hyperLinks = htmlDoc.select("a[href]");
for(Element link: hyperLinks)
{
System.out.println(link.attr("abs:href"));
}
This prints out a lot of URLs just like the one above. However, it seems to skip a few URLs as well. The ones it skips are the ones I actually need.
This is one of the a[href] elements it's not turning into an absolute URL:
<div class="title">Royal Review<br /></div>
It will print this line if I just print "link", but when I use "abs:href", it just prints a blank line.
I am new to Java and appreciate any feedback!
You shouldn't use "a[href]"; use "a" instead, following this example:
Document doc = Jsoup.connect("http://jsoup.org").get();
Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // "http://jsoup.org/"
So in your case:
Elements hyperLinks = htmlDoc.select("a");
for(Element link: hyperLinks)
{
System.out.println(link.attr("abs:href"));
}
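
For background, abs:href resolves the href against the document's base URI, and jsoup returns an empty string when no absolute URL can be built. The sketch below only illustrates that idea with java.net.URI from the standard library. It is not jsoup's implementation, the class and method names are hypothetical, and a missing base URI is just one possible cause of blank output, not necessarily the cause in this question.

```java
import java.net.URI;

class AbsoluteUrlSketch {

    // Resolves href against baseUri and returns "" when no absolute URL
    // can be built, similar in spirit to the blank output described above.
    static String toAbsolute(String baseUri, String href) {
        if (href == null || href.isEmpty()) {
            return "";
        }
        try {
            URI resolved = URI.create(baseUri).resolve(href);
            return resolved.isAbsolute() ? resolved.toString() : "";
        } catch (IllegalArgumentException e) {
            // Malformed base or href (for example, one containing spaces).
            return "";
        }
    }

    public static void main(String[] args) {
        // With a base URI, a relative href becomes absolute.
        System.out.println(toAbsolute("http://jsoup.org", "/"));
        // Without a base URI, there is nothing to resolve against.
        System.out.println("[" + toAbsolute("", "/docs") + "]");
    }
}
```

If blank values show up, it may be worth printing link.baseUri() and link.attr("href") for the affected elements to see which of the two is empty or malformed.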

HTML Parsing and removing anchor tags while preserving inner html using Jsoup

I have to parse some HTML and remove the anchor tags, but I need to preserve the inner HTML of the anchor tags.
For example, if my html text is:
String html = "<div> <p> some text some link text </p> </div>";
Now I can parse the above HTML and select the a tags in jsoup like this:
Document doc = Jsoup.parse(inputHtml);
//this would give me all elements which have anchor tag
Elements elements = doc.select("a");
and I can remove all of them with
element.remove()
But that removes the complete anchor tag, from the start bracket to the close bracket, and the inner HTML is lost. How can I preserve the inner HTML while removing only the start and close tags?
Also, please note: I know there are methods to get outerHTML() and innerHTML() from the element, but those methods only give me ways to retrieve the text, and the remove() method removes the complete HTML of the tag. Is there any way I can remove only the outer tags and preserve the inner HTML?
Thanks a lot in advance; I appreciate your help.
--Rajesh
Use unwrap; it preserves the inner HTML:
doc.select("a").unwrap();
Check the API docs for more info:
http://jsoup.org/apidocs/org/jsoup/select/Elements.html#unwrap%28%29
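
A small sketch of what unwrap does to a document, assuming jsoup is on the classpath. The class name UnwrapSketch, the href value, and the nested b tag are made up for this demo; they are not from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class UnwrapSketch {

    // Removes every <a> tag but leaves its child nodes in the same place.
    static Document stripAnchors(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a").unwrap();
        return doc;
    }

    public static void main(String[] args) {
        // Hypothetical input: the href and the <b> tag are made up for this demo.
        String html = "<div><p>some text <a href=\"#\">some <b>link</b> text</a></p></div>";
        Document doc = stripAnchors(html);
        // The anchor is gone, but its text and the nested <b> element remain inside <p>.
        System.out.println(doc.body().html());
    }
}
```

Because unwrap moves the child nodes up rather than copying text, nested markup inside the link survives unchanged.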
How about extracting the inner HTML first, adding it to the DOM and then removing your tags? This code is untested, but should do the trick:
Edit:
I updated the code to use replaceWith(), making the code more intuitive and probably more efficient; thanks to A.J.'s hint in the comments.
Document doc = Jsoup.parse(inputHtml);
Elements links = doc.select("a");
String baseUri = links.get(0).baseUri();
for(Element link : links) {
Node linkText = new TextNode(link.html(), baseUri);
// optionally wrap it in a tag instead:
// Element linkText = doc.createElement("span");
// linkText.html(link.html());
link.replaceWith(linkText);
}
Instead of using a text node, you can wrap the inner html in anything you want; you might even have to, if there's not just text inside your links.

JSoup select Div based on Id and href based on title

I'm using jsoup to parse an HTML response. I have multiple div tags. I have to select a div tag based on its ID.
My pseudocode looks like this:
Document divTag = Jsoup.connect(link).get();
Elements info = divTag.select("div#navDiv");
where navDiv is the ID. But it doesn't seem to work.
I would also like to select the hrefs inside the div based on their title, where hrefTitle[] would be a string array. While iterating over the hrefs, I would check whether the title is present in the string array; if so, I would add it to a list, otherwise ignore it. How do I select the hrefs inside the div, and how do I select by title? Any input is much appreciated.
But it doesn't seem to work.
It should work. Proof:
Document doc = Jsoup.parse("<html><body><div/>" +
"<div id=\"navDiv\">" +
"<a href=\"href1\">link1</a>" +
"<a href=\"href2\">link2</a>" +
"</div></body></html>");
Element div = doc.select("div#navDiv").first();
Now, we can select the a element inside the div that has (for example) an href attribute whose value is href2:
System.out.println(div.select("a[href=href2]"));
Output:
<a href="href2">link2</a>
You can find the full selector syntax here:
http://jsoup.org/apidocs/org/jsoup/select/Selector.html
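
For the second part of the question (keeping only the links whose title appears in a string array), here is a rough sketch, assuming jsoup is on the classpath. The class name TitleFilterSketch, the sample markup, and the title values are hypothetical; only the navDiv id and the hrefTitle array name come from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TitleFilterSketch {

    // Returns the href of every <a> inside div#navDiv whose title
    // attribute is one of the wanted titles.
    static List<String> hrefsByTitle(Document doc, String[] hrefTitle) {
        Set<String> wanted = new HashSet<>(Arrays.asList(hrefTitle));
        List<String> hrefs = new ArrayList<>();
        for (Element a : doc.select("div#navDiv a[href][title]")) {
            if (wanted.contains(a.attr("title"))) {
                hrefs.add(a.attr("href"));
            }
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Hypothetical markup and titles, made up for this demo.
        Document doc = Jsoup.parse(
                "<div id=\"navDiv\">"
              + "<a href=\"href1\" title=\"Home\">link1</a>"
              + "<a href=\"href2\" title=\"News\">link2</a>"
              + "<a href=\"href3\" title=\"Contact\">link3</a>"
              + "</div>");
        String[] hrefTitle = {"Home", "Contact"};
        System.out.println(hrefsByTitle(doc, hrefTitle));
    }
}
```

Putting the titles in a set keeps the lookup simple; the selector already narrows the candidates to links inside the div that carry both attributes.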
