JSoup select Div based on Id and href based on title - java

I'm using JSoup to parse an HTML response. I have multiple div tags, and I have to select one div tag based on its ID.
My pseudo code looks like this:
Document divTag = Jsoup.connect(link).get();
Elements info = divTag.select("div#navDiv");
where navDiv is the ID, but it doesn't seem to work.
I also want to select the hrefs inside the div based on a title, where hrefTitle[] is a string array. While iterating over the hrefs I would check whether the title is present in the string array; if so, I would add them to a list, otherwise ignore them. How do I select the hrefs inside the div? And how do I select the title? Any input is much appreciated.

But it doesn't seem to work.
It should work. Proof:
Document doc = Jsoup.parse("<html><body><div/>" +
    "<div id=\"navDiv\">" +
    "<a href=\"href1\">link1</a>" +
    "<a href=\"href2\">link2</a>" +
    "</div></body></html>");
Element div = doc.select("div#navDiv").first();
Now, we can select the a element inside the div that has (for example) an href attribute whose value is href2:
System.out.println(div.select("a[href=href2]"));
Output:
<a href="href2">link2</a>
You can find the full selector syntax here:
http://jsoup.org/apidocs/org/jsoup/select/Selector.html
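To cover the second part of the question (keeping only the links whose title appears in your array), a minimal sketch along these lines should work, assuming div is the element selected above and hrefTitle is the string array from the question (the title values below are placeholders; the usual java.util imports apply):
String[] hrefTitle = {"someTitle", "anotherTitle"};      // your titles of interest
List<String> wantedHrefs = new ArrayList<>();
for (Element a : div.select("a[href]")) {                // every link inside the div
    if (Arrays.asList(hrefTitle).contains(a.attr("title"))) {
        wantedHrefs.add(a.attr("href"));                 // keep the href when its title matches
    }
}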

Related

How to fetch the anchor href attribute for <div class="_6ks"> shown in the below screenshot using Selenium in Java

If you want to fetch the anchor node's href value, and the class name is not dynamic and is unique, then you can do it like below:
WebElement element = driver.findElement(By.xpath("//div[@class='_6ks']/a"));
String url = element.getAttribute("href");
System.out.println("=> The URL is : "+url);
If the above doesn't work, then share the full HTML code in text format so that it will be easy for us to track down that element.
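If you would rather avoid XPath, a CSS selector version of the same lookup should also work (an untested sketch, under the same assumption that the class name is unique and static):
WebElement element = driver.findElement(By.cssSelector("div._6ks > a"));
String url = element.getAttribute("href");
System.out.println("=> The URL is : " + url);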

How do I write a css-selector for text that is not inside any dom element [duplicate]

I am writing a JUnit test for a webpage, using Selenium, and I am trying to verify that the expected text exists within a page. The code of the webpage I am testing looks like this:
<div id="recipient_div_3" class="label_spacer">
<label class="nodisplay" for="Recipient_nickname"> recipient field: reqd info </label>
<span id="Recipient_nickname_div_2" class="required-field"> *</span>
Recipient:
</div>
I want to compare what is expected with what is on the page, so I want to use
Assert.assertTrue(). I know that to get everything from the div, I can do
String element = driver.findElement(By.cssSelector("div[id='recipient_div_3']")).getText().replaceAll("\n", " ");
but this will return "reqd info * Recipient:"
Is there any way to just get the text from the div ("Recipient") using cssSelector, without the other tags?
You can't do this with a CSS selector, because CSS selectors don't have a fine-grained enough approach to express "the text node contained in the DIV but not its other contents". You can do that with an XPath locator, though:
driver.findElement(By.xpath("//div[@id='recipient_div_3']/text()")).getText()
That XPath expression will identify just the single text node that is a direct child of the DIV, rather than all the text contained within it and its child nodes.
I am not sure if it is possible with one CSS locator, but you can get the text from the div, then get the text from the div's child nodes and subtract it. Something like this (code wasn't checked):
String temp = "";
List<WebElement> tempElements = driver.findElements(By.cssSelector("div[id='recipient_div_3'] *"));
for (WebElement tempElement : tempElements) {
temp += " " + tempElement.getText();
}
String element = driver.findElement(By.cssSelector("div[id='recipient_div_3']")).getText().replaceAll("\n", " ").replace(temp, "");
This is for the case when you are trying to avoid using XPath. XPath does allow it:
//div[@id='recipient_div_3']/text()
You could also get the inner HTML of the element and remove the tags with a regexp. Also notice: you should use the reluctant quantifier:
https://docs.oracle.com/javase/tutorial/essential/regex/quant.html
String getTextContentWithoutTags(WebElement element) {
    // innerHTML still contains the child tags, which the reluctant regex then strips out
    return element.getAttribute("innerHTML").replaceAll("<[^>]*?>", "").trim();
}

JSoup Extracting absolute url of a href and a div tag data simultaneously

I want to extract two tags from a website next to each other (adjacent). The first is an a (href) tag, which should be extracted as an absolute URL; the second is a div tag, and I should extract the data inside it.
I want the output to be like the following:
100 USD http://www.somesite..............
200 USD http://www.thesite.............
Why? Because later I will insert them into a table in a database.
I tried the following code, but I couldn't get the absolute URL, and in addition I couldn't get rid of the tags, while I want to extract only the data (without the tags).
Document doc = Jsoup.connect("http://www.bezaat.com/ksa/jeddah/cars/all/1?so=77").get();
for (Element link : doc.select("div.rightFloat.price,a[abs:href].more-details"))
{
String absHref = link.attr("abs:href");
String attr = link.absUrl("href");
System.out.println(link);
}
If I try using
System.out.println(link.text())
in my code, I will miss the hyperlink completely!
Any help please?
I don't think that Jsoup's CSS selector combinators (i.e. the comma in the selector) guarantee an ordering in the output. At least I would not count on it, even if you find the two elements in the ordering you expect. Instead of using the comma selector, I would first loop over the outer containers that hold the adjacent divs you are interested in. Within each div you can then access the price and link.
Something like this. Note that this is off the top of my head and untested!
Document doc = Jsoup.connect("http://www.bezaat.com/ksa/jeddah/cars/all/1?so=77").get();
for (Element adDiv : doc.select("div.category-listing-normal-ad")){
Element priceDiv = adDiv.select("div.rightFloat.price").first();
Element linkA = adDiv.select("a.more-details").first();
System.out.println(priceDiv.text() + " " + linkA.absUrl("href"));
}
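As a side note, absUrl("href") resolves against the document's base URI, which Jsoup.connect(...) sets automatically; if you ever parse raw HTML from a string instead, pass the base URI explicitly or the absolute URL will come back empty:
Document doc = Jsoup.parse(html, "http://www.bezaat.com/ksa/jeddah/cars/all/1?so=77");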

How to extract text from all the elements in a webpage individually, using JSoup?

The problem here is, if I do:
Document doc = Jsoup.connect(url)
.timeout(30000)
.userAgent("Mozilla")
.followRedirects(true)
.get();
System.out.println(doc.select("body").text());
I get all the text in one chunk, and I don't want that.
Suppose I write a code like this:
String part = "<div>" +
    "Primary div" +
    "<div>" +
    "Secondary div" +
    "</div>" +
    "</div>";
Document doc = Jsoup.parse(part);
Elements links = doc.select("div");
for (Element e : links) {
    System.out.println(e.text());
}
The output is:
Primary div Secondary div
Secondary div
The inner div's text gets scraped twice.
I want that the scraping output should be like this:
Primary div
Secondary div
I want the text of each element individually, excluding the text coming from its child elements.
How can this be achieved? The number of nested children can be more than just one.
You aren't getting two copies of Secondary div, you're outputting it twice: Once as part of the output of Primary div, then again on its own.
If you want just an element's own text and not the text of its child elements, use Element#ownText.
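Applied to the snippet from the question, that looks like this (a quick sketch using the same sample HTML):
Document doc = Jsoup.parse("<div>Primary div<div>Secondary div</div></div>");
for (Element e : doc.select("div")) {
    System.out.println(e.ownText());   // only this element's own text, not its children's
}
Output:
Primary div
Secondary div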

HTML Parsing and removing anchor tags while preserving inner html using Jsoup

I have to parse some HTML and remove the anchor tags, but I need to preserve the inner HTML of the anchor tags.
For example, if my HTML text is:
String html = "<div> <p> some text <a href=\"#\"> some link text </a> </p> </div>";
Now I can parse the above HTML and select the a tags in Jsoup like this:
Document doc = Jsoup.parse(inputHtml);
//this would give me all elements which have anchor tag
Elements elements = doc.select("a");
and I can remove all of them by,
element.remove()
But that would remove the complete anchor tag from the start bracket to the close bracket, and the inner HTML would be lost. How can I preserve the inner HTML while removing only the start and close tags?
Also, please note: I know there are methods to get outerHTML() and innerHTML() from the element, but those methods only give me ways to retrieve the text; the remove() method removes the complete HTML of the tag. Is there any way in which I can remove only the outer tags and preserve the innerHTML?
Thanks a lot in advance and appreciate your help.
--Rajesh
Use unwrap(); it preserves the inner HTML:
doc.select("a").unwrap();
check the api-docs for more info:
http://jsoup.org/apidocs/org/jsoup/select/Elements.html#unwrap%28%29
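Applied to the HTML from the question, that would look roughly like this (quick sketch; the exact whitespace in the output may differ because Jsoup pretty-prints):
Document doc = Jsoup.parse("<div> <p> some text <a href=\"#\"> some link text </a> </p> </div>");
doc.select("a").unwrap();
System.out.println(doc.body().html());   // the p element now contains "some text some link text" with no anchor tag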
How about extracting the inner HTML first, adding it to the DOM and then removing your tags? This code is untested, but should do the trick:
Edit:
I updated the code to use replaceWith(), making the code more intuitive and probably more efficient; thanks to A.J.'s hint in the comments.
Document doc = Jsoup.parse(inputHtml);
Elements links = doc.select("a");
String baseUri = links.get(0).baseUri();
for(Element link : links) {
Node linkText = new TextNode(link.html(), baseUri);
// optionally wrap it in a tag instead:
// Element linkText = doc.createElement("span");
// linkText.html(link.html());
link.replaceWith(linkText);
}
Instead of using a text node, you can wrap the inner html in anything you want; you might even have to, if there's not just text inside your links.
