How to get specific sub-elements of html data using Jsoup

I am trying to get all prices from an HTML file using Jsoup. The simplified HTML is structured something like this:
//some html
<div class="price-point-wrap use-roundtrippricing">
<div class="price-point-wrap-top use-roundtrippricing">
<div class="pp-from-total use-roundtrippricing">Roundtrip</div>
</div>
<div class="price-point price-point-revised use-roundtrippricing">
$509
</div>
<div class="fare-select-button-div">
<input type="button" aria-describedby="sr_product_ECONOMY_123-745|1975-UA" value="Select" class="fare-select-button">
<span class="visuallyhidden">fare for Economy (lowest)</span>
</div>
</div>
//some html
<div class="price-point-wrap use-roundtrippricing">
<div class="price-point-wrap-top use-roundtrippricing">
<div class="pp-from-total use-roundtrippricing">Roundtrip</div>
</div>
<div class="price-point price-point-revised use-roundtrippricing">
$1,046
</div>
<div class="fare-select-button-div">
<input type="button" aria-describedby="sr_product_MIN-BUSINESS-OR-FIRST_123-745|1975-UA" value="Select" class="fare-select-button">
<span class="visuallyhidden">fare for First (2-cabin, lowest)</span>
</div>
<div class="pp-remaining-seats">5 tickets left at this price</div>
</div>
//some html
This is what I have tried so far:
File input = new File("Flights.html");
Document document = Jsoup.parse(input, "UTF-8", "");
Elements prices = document.getElementsByClass("price-point");
for(Element e: prices){
System.out.println(e.toString());
}
This gives me the following result:
<div class="price-point price-point-revised use-roundtrippricing">
$509
</div>
<div class="price-point price-point-revised use-roundtrippricing">
$1,046
</div>
.....
But now I only want prices like:
509
1046
I tried a regex that keeps only the digits, e.toString().replaceAll("\\D+",""), when printing. That seems to work, but it is not how I want to achieve it. How can I get only the numbers using Jsoup?

Thanks to the comment from @Eritrean, I needed to use e.text() instead of e.toString(), which gave me
$509
$1,046
I still need to use a regex like e.text().replaceAll("[$,]", "") to get rid of the dollar signs and commas.
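For the cleanup step, here is a minimal sketch in plain Java. It assumes the strings have already been extracted with e.text() and that prices are whole dollars, as in the sample. The class name PriceText and the method toDollars are made up for illustration and are not part of Jsoup.

```java
class PriceText {
    // Hypothetical helper: turns text such as "$1,046" into the int 1046.
    // Assumes whole-dollar prices with only '$' and ',' as extra characters.
    static int toDollars(String text) {
        // Inside a character class, '$' is a literal dollar sign.
        String digits = text.trim().replaceAll("[$,]", "");
        return Integer.parseInt(digits);
    }

    public static void main(String[] args) {
        // Stand-ins for the values returned by e.text() in the loop above.
        String[] extracted = {"$509", "$1,046"};
        for (String s : extracted) {
            System.out.println(toDollars(s));
        }
    }
}
```

Parsing to an int also fails loudly with a NumberFormatException if the page ever contains something other than a plain price, which a silent replaceAll would hide.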

Related

Get first level element of HTML document - Java

I want to get the first-level elements of the HTML tag <wicket:extend> in the document below. I am using the Jericho API for HTML parsing but could not find any method to get the first-level elements.
<body>
<wicket:extend>
<div wicket:id="container1"> some elements</div>
<div wicket:id="container2">
<h2>This is container 2</h2>
</div>
<div wicket:id="container3">
<h2>This is container 3</h2>
</div>
<div id="panel2">
<h2>This is panel2</h2>
</div>
<h3>This is heading3</h3>
</wicket:extend>
Expected output
<div wicket:id="container1">
<div wicket:id="container2">
<div wicket:id="container3">
<div id="panel2">
<h3>This is heading3</h3>

parse data of certain tag which is before a particular class

I need to parse data from a web page by tag ("p"). I tried this:
Elements content = document.getElementsByTag("p");
for(Element el : content) {
System.out.println(el.text());
}
It works fine, but I also get superfluous data.
For example:
<div class="DicCellTerm">
<h1>Impossible</h1>
<div class=des>
<p class=par2><span class=hint><em>smth</em></span></p>
<p class=par2>1) (<em>with</em>) all, do</p>
<p class=par2>2) <span class=hint><em>text</em></span> some words</p>
<p class=par3>it is impossible</p>
</div>
</div>
</div><!--DicCell end-->
<div align="center" class="AdContent" id="adcontentnoprint">
<div class=SharedItems>
<div class=DicCellParent>
<span class=LinkOtherDic>+ dictionary <strong>impossible</strong> - translate</span>
<div class=DicCellOther id=diccellothershow>
<h2>impossible</h2>
<div class=des>
<p class=par1>1) important, is</p>
<p class=par1>what</p>
<p class=par1>2) true, false</p>
</div>
</div>
<!--DicCellOther end-->
</div>
<!--DicCellParent end-->
<div class=DicCellParent>
<span class=LinkOtherDic>+ translate <strong>important</strong> - dictionary</span>
<div class=DicCellOther id=diccellothershow>
<h2>importnant</h2>
<div class=des>
<p class=par1>1) müim, emiyetli; emiyet bar</p>
<p class=par1>it is very important - bu pek müimdir, bunıñ büyük emiyeti bar</p>
<p class=par1>2) qopayıp, qabarıp</p>
</div>
</div>
<!--DicCellOther end-->
</div>
<!--DicCellParent end-->
</div>
<!--SharedItems end-->
I need to get only the data in "p" tags that come before the class SharedItems.
I tried parsing the data by class "DicCellTerm" and I get the right data, but it is all written on one line, and I need it laid out as on the web page.

Elements elements = document.select(".DicCellTerm p");
This grabs all p elements inside the .DicCellTerm class; you can then iterate over elements. Here is a link to all the selectors available in jsoup, which is where I get most of my help =)
https://jsoup.org/apidocs/index.html?org/jsoup/select/Selector.html

position() function brings me wrong data

I am using Selenium and Java to write a test. I have the DOM below:
<body>
<div class='t'><span>1</span></div>
<div class='t'></div>
<div class='t'><span>2</span></div>
<div class='t'><span>3</span></div>
<div class='t'><span>4</span></div>
<div class='t'><span>5</span></div>
<div class='t'><span>6</span></div>
<div class='t'><span>7</span></div>
</body>
Why is the result the same for both
//div[position()>1 and @class='t' and .//span ]
and
//div[position()>2 and @class='t' and .//span ]
The result is:
<div class="t">
<span>2</span>
</div>
<div class="t">
<span>3</span>
</div>
<div class="t">
<span>4</span>
</div>
<div class="t">
<span>5</span>
</div>
<div class="t">
<span>6</span>
</div>
<div class="t">
<span>7</span>
</div>
For the first XPath this matches my expectation, but for the second one I think the result should be:
<div class="t">
<span>3</span>
</div>
<div class="t">
<span>4</span>
</div>
<div class="t">
<span>5</span>
</div>
<div class="t">
<span>6</span>
</div>
<div class="t">
<span>7</span>
</div>

I just figured out that the XPath should be //div[ @class='t' and .//span ][position()>2]. It first selects all div elements that have t as their class attribute and at least one <span> tag inside, and then keeps only those whose position in that filtered list is greater than 2.

The XPath below:
//div[position()>1 and @class='t' and .//span ]
specifies that the div must have class='t', contain a span tag, and have a position greater than 1. The second div has no span tag, so this XPath returns results starting from the third div.
Meanwhile, the XPath below:
//div[position()>2 and @class='t' and .//span ]
specifies that the div must have class='t', contain a span tag, and have a position greater than 2, so the results again start from the third div.
The div in the third position is
<div class='t'><span>2</span></div>
It has class='t' and a span tag, and its position is greater than 2.
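The difference between the two predicate forms can be checked outside Selenium. The sketch below uses the JDK's built-in javax.xml.xpath (XPath 1.0) rather than a browser, on the assumption that the same DOM is treated as well-formed XML. The class name PositionPredicateDemo and the method select are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PositionPredicateDemo {
    // The DOM from the question, written as well-formed XML.
    static final String XML = "<body>"
            + "<div class='t'><span>1</span></div>"
            + "<div class='t'></div>"
            + "<div class='t'><span>2</span></div>"
            + "<div class='t'><span>3</span></div>"
            + "<div class='t'><span>4</span></div>"
            + "<div class='t'><span>5</span></div>"
            + "<div class='t'><span>6</span></div>"
            + "<div class='t'><span>7</span></div>"
            + "</body>";

    // Returns the text content of every node matched by the expression.
    static List<String> select(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // position() inside a single predicate counts all sibling divs,
        // including the empty second one.
        System.out.println(select("//div[position()>1 and @class='t' and .//span]"));
        System.out.println(select("//div[position()>2 and @class='t' and .//span]"));
        // A second predicate re-counts positions within the already filtered set.
        System.out.println(select("//div[@class='t' and .//span][position()>2]"));
    }
}
```

The first two expressions both yield the divs containing 2 through 7, while the chained form yields 3 through 7, because its position() is evaluated against the filtered list instead of the original siblings.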

Positioning of jQuery/Java toggle content

I have the following HTML-Code:
<div class="test-container">
<div class="slide-button" data-content="panel1">
<p><span class="panel-icon">+</span> Test1</p>
</div>
<div id="panel1" style="display: none">
<p> Test jquery menu1 </p>
</div>
<div class="slide-button" data-content="panel2">
<p><span class="panel-icon">+</span> Test2</p>
</div>
<div id="panel2" style="display: none">
<p> Test jquery menu2 </p>
</div>
</div>
And the following jQuery/Java-Code:
$(".slide-button").on('click', function() {
var panelId = $(this).attr('data-content');
$('#'+panelId).toggle(500);
$(this).find('.panel-icon').text(function(_, txt) {
return txt === "+" ? "-" : "+";
});
});
The toggle itself works perfectly. When I click on the slide-button the content will slide-down. However, after the slide-down animation is finished the content somehow "jumps up" to its final position.
How can I avoid this "jump" and make the content stay where it is after the slide-down animation has finished?
Thanks for any help :-)

I don't know exactly what your circumstances are, and whether you need <p>aragraph tags or not, but if you switch the <p> tags inside your "panels" to <span> tags it seems to fix your issue.
The HTML code I used which fixed the jump looks like this:
<div class="test-container">
<div class="slide-button" data-content="panel1">
<p><span class="panel-icon">+</span> Test1</p>
</div>
<div id="panel1" style="display: none">
<!-- Here is the first change from a paragraph to a span tag -->
<span>Test jquery menu1</span>
</div>
<div class="slide-button" data-content="panel2">
<p><span class="panel-icon">+</span> Test2</p>
</div>
<div id="panel2" style="display: none">
<!-- Here is the second change from a paragraph to a span tag -->
<span>Test jquery menu2</span>
</div>
Also, just a friendly tip, Java is not the same as JavaScript. (: Keep that in mind when tagging your questions.

How do you get the value inside an XML document?

In the XML below I need to confirm that "Internet" is present.
<section id="landing-content">
<div id="header">
<div class="container">
<div class="row">
<div class="span12">
<h1 class="theme--primary">Internet</h1>
</div>
</div>
</div>
</div>
I tried the following:
WebElement findInternet = driver.findElement(By.xpath("//h1"));
System.out.println(findInternet);

I think this will work for you:
WebElement findInternet = driver.findElement(By.cssSelector("h1.theme--primary"));
System.out.println(findInternet.getText());
Your XPath selector will probably work as well. The key thing you were missing is that your println was printing the findInternet object itself; getText() returns the inner text of the selected element.
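If the markup is available as a well-formed XML string rather than through a live browser, the same check can be sketched with the JDK's javax.xml.xpath instead of Selenium. This is a swapped-in technique, not what the answer above uses. The snippet assumes a closed copy of the markup (the fragment in the question has no closing </section>), and the class name HeadingText and the method firstH1Text are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class HeadingText {
    // Closed copy of the fragment from the question (assumption: </section> added).
    static final String XML = "<section id=\"landing-content\">"
            + "<div id=\"header\"><div class=\"container\"><div class=\"row\">"
            + "<div class=\"span12\"><h1 class=\"theme--primary\">Internet</h1></div>"
            + "</div></div></div>"
            + "</section>";

    // Returns the text of the first h1 element, or "" if there is none.
    static String firstH1Text(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // evaluate(String, InputSource) returns the string value of the first match.
            return xpath.evaluate("//h1", new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Internet".equals(firstH1Text(XML)));
    }
}
```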
