JSoup | Fetching part of the HTML - java

I have a problem scraping a site with car ads. I would like to get the advertiser's name from it. The main problem is that the name is sometimes marked up in a different way.
1) Name is Kajetan
(https://www.otomoto.pl/oferta/mercedes-benz-klasa-e-w211-bardzo-dobry-stan-bez-wkladu-finansowego-warszawa-ryki-ID6BEBy9.html#2bd424144f)
<div class="seller-box__seller-info">
<small class="seller-box__seller-registration">Sprzedający na OTOMOTO od 2015</small>
<small class="seller-box__seller-type">Osoba prywatna</small>
<h2 class="seller-box__seller-name"> Kajetan </h2>
</div>
2) Name is AS MOTORS Centrum Pojazdów Używanych KIA
(https://www.otomoto.pl/oferta/kia-ceed-1-6-crdi-136-km-m-bws-fvat-salon-serwis-polska-ID6BHFu3.html#2bd424144f)
<div class="seller-box__seller-info">
<small class="seller-box__seller-registration">Sprzedający na OTOMOTO od 2019</small>
<small class="seller-box__seller-type">Dealer</small>
<h2 class="seller-box__seller-name">
<div class="seller-badge"> <img src="xx.jpg" data-toggle="tooltip" data-placement="bottom" title="" data-original-title="Ten dealer korzysta z pakietu usług Premium Plus" class="">
</div>
AS MOTORS Centrum Pojazdów Używanych KIA
</h2>
</div>
In the first case the solution is easy because I'll do it like this:
public static String fetchOwnerName (String html) {
Elements ownerElement = Jsoup.parse(html).getElementsByClass("seller-box__seller-info").select("h2");
String owner = StringUtils.substringBetween(String.valueOf(ownerElement), "\">", "</h2>");
return owner;
}
But in the second case the problem is that the <h2> contains an additional <div> and, what is more, the name of the advertiser is between <a href="".
How should I change the fetchOwnerName method to make it universal? I'm using the JSOUP library to parse the HTML page. Thanks for all of your suggestions.

You can get the text inside the h2 tags without worrying about additional nested tags such as div or a.
You just have to call .text():
Elements ownerElement = Jsoup.parse(html).getElementsByClass("seller-box__seller-info").select("h2");
String owner = ownerElement.text();
This works as long as the advertiser's name is the only text between the h2 tags.
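As a minimal sketch of the answer above, the method below selects the heading by its own class and reads it with text(). It assumes the jsoup library is on the classpath; the class name OwnerNameExample and the shortened HTML strings in main are made up for illustration, and the non-ASCII letters of the dealer name are written as Unicode escapes only to keep the source file ASCII.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class OwnerNameExample {

    // Returns the seller name, or null when the page has no seller-name heading.
    public static String fetchOwnerName(String html) {
        Document doc = Jsoup.parse(html);
        // Select the heading by its own class rather than any h2 inside the box.
        Element name = doc.select("h2.seller-box__seller-name").first();
        // text() combines the text of all descendants and trims surrounding whitespace.
        return name == null ? null : name.text();
    }

    public static void main(String[] args) {
        // Case 1: the heading contains only the name.
        String privateSeller =
            "<div class=\"seller-box__seller-info\">"
          + "<h2 class=\"seller-box__seller-name\"> Kajetan </h2></div>";
        // Case 2: the heading contains a badge div before the name.
        String dealer =
            "<div class=\"seller-box__seller-info\">"
          + "<h2 class=\"seller-box__seller-name\">"
          + "<div class=\"seller-badge\"><img src=\"xx.jpg\"></div>"
          + " AS MOTORS Centrum Pojazd\u00f3w U\u017cywanych KIA </h2></div>";

        System.out.println(fetchOwnerName(privateSeller));
        System.out.println(fetchOwnerName(dealer));
    }
}
```

If a nested element ever carried visible text of its own, text() would include it; in that situation Element.ownText(), which returns only the text directly inside the h2, might be the better fit.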

Related

Thymeleaf template passing objects through forms and get/post mapping

I have the following thymeleaf template
<!doctype html>
<html lang="nl" xmlns:th="http://www.thymeleaf.org">
<head th:replace="fragments::head(title='Film zoeken')">
<title>Film zoeken</title>
</head>
<body>
<nav th:replace="fragments::menu"></nav>
<form method="get" th:object="${zoekForm}" th:action="@{/film/zoeken}">
<label>
Nummer:
</label>
<span th:errors="*{id}"></span>
<input th:field="*{id}" type="number" autofocus required min="1">
<button>Zoeken</button>
</form>
<th:block th:if="${film}" th:object="${film}">
<dl>
<dt>Titel</dt>
<dd th:text="*{title}"></dd>
<dt>Regisseur</dt>
<dd th:text="*{regisseur}"></dd>
<dt>Uitgekomen op</dt>
<dd th:text="*{releaseDate}"></dd>
<dt>Karakters</dt>
<dd th:each="karakter : *{karakters}" th:text="${karakter}"></dd>
</dl>
<form th:if="not ${score}" th:object="${scoreForm}" method="post"
th:action="@{/film/{id}/score(id=${param.id})}">
<label>
Score:
<span th:errors="*{score}"></span>
<input th:field="*{score}" type="number" required min="1" max="10">
</label>
<button>Bewaren</button>
</form>
<div th:if="${score}">
Je score voor deze film is <strong th:text="${score.score}"></strong>
</div>
</th:block>
</body>
</html>
I have two @GetMapping methods and then the final @PostMapping.
The first one does a modelAndView.addObject(new ZoekForm(null));, so it shows only the first form.
Then the second @GetMapping processes the content (it shows the film data) and then calls
modelAndView.addObject("scoreForm", new ScoreForm(null));.
So far the template shows 1) a search form for a movie, 2) the film data and finally a field to give the movie a score.
I need to give a score (click the button in the second form) and show the div
<div th:if="${score}">
Je score voor deze film is <strong th:text="${score.score}"></strong>
</div>
but the page must keep showing the film data. Now, when I click that score button, everything disappears.
It seems the th:if="${film}" condition is not being met. Any clues?
(I also assume that the score object needs to be passed in the POST, because if I look up the same movie again, I must not be able to give it a score.)
Thymeleaf doesn't create objects for you. You need to pass the object to it from your controller. You can either do that in every controller method that returns this template or just use @ModelAttribute (place it on a method in your controller):
@ModelAttribute("film")
public Film methodNameDoesntMatter() {
return new Film();
}
This method runs on every request handled by the controller, so the Film object is always new. Make sure the class defines getters/setters and a no-arg constructor; this is important for Thymeleaf.

parse data of certain tag which is before a particular class

I need to parse data from a web page by tag ("p"). I tried this:
Elements content = document.getElementsByTag("p");
for(Element el : content) {
System.out.println(el.text());
}
And it works fine. But I get superfluous data.
For example:
<div class="DicCellTerm">
<h1>Impossible</h1>
<div class=des>
<p class=par2><span class=hint><em>smth</em></span></p>
<p class=par2>1) (<em>with</em>) all, do</p>
<p class=par2>2) <span class=hint><em>text</em></span> some words</p>
<p class=par3>it is impossible</p>
</div>
</div>
</div><!--DicCell end-->
<div align="center" class="AdContent" id="adcontentnoprint">
<div class=SharedItems>
<div class=DicCellParent>
<span class=LinkOtherDic>+ dictionary <strong>impossible</strong> - translate</span>
<div class=DicCellOther id=diccellothershow>
<h2>impossible</h2>
<div class=des>
<p class=par1>1) important, is</p>
<p class=par1>what</p>
<p class=par1>2) true, false</p>
</div>
</div>
<!--DicCellOther end-->
</div>
<!--DicCellParent end-->
<div class=DicCellParent>
<span class=LinkOtherDic>+ translate <strong>important</strong> - dictionary</span>
<div class=DicCellOther id=diccellothershow>
<h2>importnant</h2>
<div class=des>
<p class=par1>1) müim, emiyetli; emiyet bar</p>
<p class=par1>it is very important - bu pek müimdir, bunıñ büyük emiyeti bar</p>
<p class=par1>2) qopayıp, qabarıp</p>
</div>
</div>
<!--DicCellOther end-->
</div>
<!--DicCellParent end-->
</div>
<!--SharedItems end-->
I need to get the data in the "p" tags that appear before the SharedItems class.
I tried selecting by class "DicCellTerm" and got the right data, but it is all written on one line, and I need it laid out as on the web page.
Elements elements = document.select(".DicCellTerm p");
This grabs every p inside the element with class DicCellTerm; you can then iterate over elements. Here is a link to all the selectors jsoup supports, which is where I get most of my help:
https://jsoup.org/apidocs/index.html?org/jsoup/select/Selector.html
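A small sketch of how that selector could be combined with a loop so that each paragraph ends up on its own line, as on the web page. It assumes jsoup is on the classpath; the class name TermParagraphsExample and the trimmed HTML in main are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TermParagraphsExample {

    // Collects the text of every <p> that sits inside an element with class DicCellTerm.
    public static List<String> termParagraphs(String html) {
        Document document = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element p : document.select(".DicCellTerm p")) {
            // Calling text() per element keeps the paragraphs separate.
            lines.add(p.text());
        }
        return lines;
    }

    public static void main(String[] args) {
        String html =
            "<div class=\"DicCellTerm\"><h1>Impossible</h1><div class=des>"
          + "<p class=par2>1) (<em>with</em>) all, do</p>"
          + "<p class=par3>it is impossible</p></div></div>"
          + "<div class=SharedItems><div class=des>"
          + "<p class=par1>1) important, is</p></div></div>";

        // One paragraph per line; the SharedItems paragraph is not selected.
        for (String line : termParagraphs(html)) {
            System.out.println(line);
        }
    }
}
```

Calling text() on the whole Elements collection joins everything into a single string, which is probably why the output appeared on one line in the question.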

Webdriver getText not returning any meaningful value

I have the following html in a page
<div class="mst_updt" style=""> 27/06/2017 12:02:31 </div>
I am trying to extract the dynamic date value inside the div:
WebElement webElement =driver.findElement(By.cssSelector(".mst_updt"));
String text = webElement.getText();
System.out.println("i am text : " + text);
System.out.println("Most read 1 : " + webElement.getText());
String a = driver.findElement(By.cssSelector(".mst_updt")).getText();
System.out.println("Most read 2 : " + a);
System.out.println(webElement);
Boolean isTheTextPresent = driver.getPageSource().contains("mst_updt");
System.out.println("And did we find the string ? : " + isTheTextPresent);
As you can see, I am trying various methods. Here are the results I am getting. Why can't I extract the date and time?
i am text :
Most read 1 :
Most read 2 :
[[FirefoxDriver: firefox on ANY (e4a9c548-6146-4685-9944-6b5d51308bff)] -> css selector: .mst_updt]
And did we find the string ? : true
Full code, which should help:
<div class="content">
<div role="main" class="content-inner content-full-width">
<div class="main-content">
<section class="component-list">
<div class="section group">
<div class="col-lrg span-lrg_1_of_3">
<div class="component-weekly-wrap">
<header class="header-weekly-wrap">
<h4 itemprop="name">
<a class="section-title-link" href="example.com/static/survey-panel">Computer Survey Panel</a>
</h4>
</header>
<article>
<img alt="" src="/w-images/fa8edbd1-2c78-4415-9abf-334f7087ff8b/2/CTGRS17OA344213-370x229.jpg" />
<div class="col-inner weekly-wrap-details">
<p><p>Join our <strong>Research Survey Panel</strong> and earn an Amazon voucher for each survey you complete!</p>
<p><strong>Find out more</strong></p></p>
<!--<a class="btn download" href="#">More information</a> -->
</div>
</article>
</div>
</div>
<!--most read homepage updated START-->
<div class= "mst_updt" style="display:none;">
28/06/2017 12:38:15
</div>
<!--most read homepage updated END-->
<div class="col-lrg component-list-most-read span-lrg_2_of_3">
<div class="col-inner component-most-read">
<header class="header-most-read">
<h4 itemprop="name">Most read</h4>
</header>
<div class="ol">
<ol>
<li>
</li>
<li>
</li>
<li>
</li>
<li>
</li>
<li>
</li>
</ol>
</div>
</div>
</div>
</div>
</section>
</div>
</div>
</div>
Here is the answer to your question:
Locating a particular div by its class attribute to get its text should be avoided, because the same class can be applied to multiple div tags. Consider using another locator, preferably an XPath or CSS selector that identifies this div uniquely, and then use the getText() method to retrieve the text.
In this case, since you have copied only the single div tag, it is tough to help you. But you can construct an XPath that traverses the HTML DOM by binding to an id or name attribute at the parent level and then matches the class attribute of this div.
Try this XPath; if it works, it means the locator you are using is giving you the wrong element.
//*[contains(text(),'27/06/2017')]
The solution was supplied by Rafał Laskowski.
I could get the value with either the cssSelector or the XPath; it had to be read using getAttribute("innerHTML"). The div is hidden (style="display:none;" in the full code), and getText() returns only visible text, which is why it came back empty.
String a = driver.findElement(By.cssSelector(".mst_updt")).getAttribute("innerHTML");
String b = driver.findElement(By.xpath("//html/body/div[3]/div[2]/div[7]/div/div/section/div/div[2]")).getAttribute("innerHTML");
System.out.println("Most read 2 : " + a);
System.out.println("Most read 3 : " + b);

How to click Buy Now button using selenium?

The source HTML looks like this:
<script id="during-reserve-tpl" type="text/x-lodash-template">
<div class="gd-row">
<div class="gd-col gu16">
<div class="emailModule message module-tmargin">
<div class="error-msg"></div>
<div class="register brdr-btm">
<div class="jbv jbv-orange jbv-buy-big jbv-reserve">Buy Now</div>
</div>
<div class="topTextWrap brdr-btm tmargin20">
<div class="subHeading">
Only one phone per registered user
<p>
First come, first serve!
</p>
</div>
</div>
</div>
</div>
</div>
</script>
When I code: IWebElement buy = driver.FindElement(By.CssSelector(".jbv.jbv-orange.jbv-buy-big.jbv-reserve")); it says Element not found.
I tried By.ClassName with white spaces, but it says compound class names are not supported.
Is there any alternative to click it ?
driver.FindElement(By.CssSelector("div.jbv.jbv-orange.jbv-buy-big.jbv-reserve"))
In the above example the CSS selector looks for a div tag that carries all of the listed classes; each dot takes the place of a space in the class attribute.
Try By.XPath("//*[contains(@class, 'jbv')]") and see if it works.
You can try either of these:
IWebElement buy = driver.FindElement(By.CssSelector("div.register>div"));
OR
IWebElement buy = driver.FindElement(By.CssSelector("div.register"));

Extracting href from a class within other div/id classes with jsoup

Hello, I am trying to extract the first href from within the "title" class from the following source (the source is only part of the whole page; however, I am using the entire page):
<div id="atfResults" class="list results ">
<div id="result_0" class="result firstRow product" name="0006754023">
<div id="srNum_0" class="number">1.</div>
<div class="image">
<a href="http://www.amazon.co.uk/Essential-Modern-Classics-J-Tolkien/dp/0006754023/ref=sr_1_1?ie=UTF8&qid=1316504574&sr=8-1">
<img src="http://ecx.images-amazon.com/images/I/31ZcWU6HN4L._AA115_.jpg" class="productImage" alt="Product Details">
</a>
</div>
<div class="data">
<div class="title">
<a class="title titleHover" href="http://www.amazon.co.uk/Essential-Modern-Classics-J-Tolkien/dp/0006754023/ref=sr_1_1?ie=UTF8&qid=1316504574&sr=8-1">Essential Modern Classics - The Hobbit</a>
<span class="ptBrand">by J. R. R. Tolkien</span>
<span class="bindingAndRelease">(<span class="binding">Paperback</span> - 2 Apr 2009)</span>
</div>
I have tried several variations of both the select function and getElementsByClass, but all have given me a "null" value, such as:
Document firstSearchPage = Jsoup.connect(fullST).get();
Element link = firstSearchPage.select("div.title").first();
If someone could help me with a solution to this problem and recommend some reading so I can avoid this problem in future, it would be greatly appreciated.
The CSS selector div.title returns a <div class="title">, not a link as you seem to think. If you want an <a class="title">, then you should use the a.title selector.
Element link = document.select("a.title").first();
String href = link.absUrl("href");
// ...
Or if an <a class="title"> can appear elsewhere in the document outside a <div class="title"> before that point, then you need the following more specific selector:
Element link = document.select("div.title a.title").first();
String href = link.absUrl("href");
// ...
This will return the first <a class="title"> which is a descendant of a <div class="title">.
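A minimal sketch of the more specific selector with a null check, since first() returns null when nothing matches. It assumes jsoup is on the classpath; the class name FirstTitleLinkExample, the shortened relative href and the base URI passed to Jsoup.parse are illustrative assumptions, not taken from the real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class FirstTitleLinkExample {

    // Returns the absolute URL of the first title link, or null when there is none.
    public static String firstTitleHref(String html, String baseUri) {
        // The base URI lets absUrl() resolve relative href values.
        Document document = Jsoup.parse(html, baseUri);
        Element link = document.select("div.title a.title").first();
        return link == null ? null : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html =
            "<div class=\"image\"><a href=\"/dp/0006754023\"><img src=\"x.jpg\"></a></div>"
          + "<div class=\"data\"><div class=\"title\">"
          + "<a class=\"title titleHover\" href=\"/dp/0006754023\">"
          + "Essential Modern Classics - The Hobbit</a></div></div>";

        // The image link is skipped because it is not inside a div.title.
        System.out.println(firstTitleHref(html, "http://www.amazon.co.uk/"));
    }
}
```

When the document is fetched with Jsoup.connect(url).get(), the base URI is already set from the request URL, so absUrl("href") should work there without passing it explicitly.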
