Extract text in order using jsoup - java

I want to extract the text of the elements with class "jobtitle" and the text of the elements with class "summary". Many elements on the page share these class names, so I want the output paired up: the first job title followed by its summary, then the second job title followed by its summary, and so on.
The following code works, but it prints all the titles first and then the text of all the summary elements. How do I modify the code so that each title is printed together with its own summary?
<div class=" row result" id="p_64c5268586001bd2" data-jk="64c5268586001bd2" itemscope="" itemtype="http://schema.org/JobPosting" data-tn-component="organicJob">
<h2 id="jl_64c5268586001bd2" class="jobtitle">
<a rel="nofollow" href="/rc/clk?jk=64c5268586001bd2" target="_blank" onmousedown="return rclk(this,jobmap[0],0);" onclick="return rclk(this,jobmap[0],true,0);" itemprop="title" title="Fashion Assistant" class="turnstileLink" data-tn-element="jobTitle"><b>Fashion</b> Assistant</a>
</h2>
<span class="company" itemprop="hiringOrganization" itemtype="http://schema.org/Organization">
<span itemprop="name">
<a href="/cmp/Itv?from=SERP&campaignid=serp-linkcompanyname&fromjk=64c5268586001bd2&jcid=3bf3e8a57da58ff5" target="_blank">
ITV Jobs</a></span>
</span>
<a data-tn-element="reviewStars" data-tn-variant="cmplinktst2" class="turnstileLink " href="/cmp/Itv/reviews?jcid=3bf3e8a57da58ff5" title="Itv Jobs reviews" onmousedown="this.href = appendParamsOnce(this.href, '?campaignid=cmplinktst2&from=SERP&jt=Fashion+Assistant&fromjk=64c5268586001bd2');" target="_blank">
<span class="ratings"><span class="rating" style="width:49.5px;"><!-- --> </span></span><span class="slNoUnderline">28 reviews</span></a>
<span itemprop="jobLocation" itemscope="" itemtype="http://schema.org/Place"> <span class="location" itemprop="address" itemscope="" itemtype="http://schema.org/Postaladdress"><span itemprop="addressLocality">London</span></span></span>
<table cellpadding="0" cellspacing="0" border="0">
<tbody><tr>
<td class="snip">
<div>
<span class="summary" itemprop="description">
Do you have a passion for <b>Fashion</b>? You will be responsible for running our <b>fashion</b> cupboard, managing a team of interns and liaising with press officers to...</span>
</div>
doc = Jsoup.connect("http://www.indeed.co.uk/jobs?q=fashion&l=England").timeout(5000).get();
Elements f = doc.select(".jobtitle");
Elements e = doc.select(".summary");
System.out.println("Title: " + f.text());
System.out.println("Details: "+ e.text());

Iterate over titles and then find the summary for each title:
for (Element title : doc.select(".jobtitle")) {
    Element summary = title.parent().select(".summary").first();
    System.out.format("Title: %s. Summary: %s%n", title.text(), summary.text());
}
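A variation on the loop above, as a minimal sketch: select each result container first and read both fields from it, so a posting without a summary cannot cause a NullPointerException. It assumes jsoup is on the classpath and that every posting is wrapped in a div carrying the class "result", as in the HTML from the question. The inline HTML and the class name PairedExtraction are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PairedExtraction {

    // Trimmed-down stand-in for the search page: two postings, same class names.
    static final String HTML =
        "<div class='row result'><h2 class='jobtitle'><a><b>Fashion</b> Assistant</a></h2>"
      + "<span class='summary'>Do you have a passion for <b>Fashion</b>?</span></div>"
      + "<div class='row result'><h2 class='jobtitle'><a>Stylist</a></h2>"
      + "<span class='summary'>Second summary.</span></div>";

    // Returns one "Title: ... | Summary: ..." line per posting, in document order.
    static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element result : doc.select("div.result")) {
            Element title = result.selectFirst(".jobtitle");
            Element summary = result.selectFirst(".summary");
            // selectFirst returns null when nothing matches, so guard both lookups.
            String t = title != null ? title.text() : "";
            String s = summary != null ? summary.text() : "";
            lines.add("Title: " + t + " | Summary: " + s);
        }
        return lines;
    }

    public static void main(String[] args) {
        extract(HTML).forEach(System.out::println);
    }
}
```

Against the live page, the string passed to Jsoup.parse would be replaced by the Document returned from Jsoup.connect(...).get().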

Related

How can I select an element from a drop-down built with a div tag [duplicate]

I'm trying to select an option from a drop-down that is built with a div tag instead of a select. With the code below I can open the drop-down, but I cannot select an option.
This is the HTML:
<div id="selectator_LocationListDD" class="selectator_element single options-hidden" style="width: 100%; min-height: 35px; padding: 6px 12px; flex-grow: 0; position: relative;">
<span class="selectator_textlength" style="position: absolute; visibility: hidden;">
</span>
<div class="selectator_selected_items">
<div class="selectator_selected_item selectator_value_">
<div class="selectator_selected_item_title">--Select--</div>
<div class="selectator_selected_item_subtitle"></div>
</div>
</div>
<input class="selectator_input" placeholder="Search here..." autocomplete="false">
<ul class="selectator_options" style="top: 73px;"><li class="selectator_option selectator_value_">
<div class="selectator_option_title">--Select--</div><div class="selectator_option_subtitle">
</div>
<div class="selectator_option_subtitle2">
</div>
<div class="selectator_option_subtitle3">
</div>
</li>
<li class="selectator_option selectator_value_CST0003970">
<div class="selectator_option_title">21ST STREET</div>
<div class="selectator_option_subtitle">1031 21st</div>
<div class="selectator_option_subtitle2">Lewiston, ID</div>
</li>
<li class="selectator_option selectator_value_CST0003214">
<div class="selectator_option_title">3RD & STEVENS</div>
<div class="selectator_option_subtitle">508 W Third Ave</div>
<div class="selectator_option_subtitle2">Spokane, WA</div>
</li>
<li class="selectator_option selectator_value_CST0003956 active">
<div class="selectator_option_title">9TH AVE</div>
<div class="selectator_option_subtitle">600 S 9th Ave</div>
<div class="selectator_option_subtitle2">Walla Walla, WA</div>
</li>
<li class="selectator_option selectator_value_CST0003991">
<div class="selectator_option_title">10TH & BANNOCK</div>
<div class="selectator_option_subtitle">950 W Bannock St, Ste 100</div>
<div class="selectator_option_subtitle2">Boise, ID</div>
</li>
</ul>
</div>
The code I have so far is:
Page Object:
@FindBy(id="selectator_LocationListDD")
WebElement locationDD;
public void select_locationEI(int index) throws InterruptedException {
    Thread.sleep(2000);
    locationDD.click();
    Select locationEI = new Select(locationDD);
    locationEI.selectByIndex(index+1);
    // wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath("//div[@class=\"selectator_selected_item selectator_value_\"]//li[" + (index+1) + "]"))).click();
}
Step definition:
@When("user added equipment for each location")
public void user_added_equipment_for_each_location() throws InterruptedException {
    atmAgreement = new AgreementsATM(driver);
    for (int ei : emptyLocation) {
        atmAgreement.click_addNewEquipment_tab();
        loaderVisibilityWait();
        loaderInvisibilityWait();
        atmAgreement.select_locationEI(ei);
        atmAgreement.enter_modelText();
        String dt = reader.getCellData("ATM", "Depositor Type", 2);
        atmAgreement.select_depositorType(dt);
        String manufacture = reader.getCellData("ATM", "Manufacturer", 2);
        atmAgreement.select_manufacturer(manufacture);
        atmAgreement.enter_terminalID();
        atmAgreement.click_addButtonEI();
        loaderVisibilityWait();
    }
    emptyLocation.clear();
}
I got an org.openqa.selenium.support.ui.UnexpectedTagNameException: Element should have been "select" but was "div".
I'm not sure how to handle this as I've only worked with selects before.
Let's say I wanted to select "9TH AVE" for the agent code. How would I go about this?
Any help is appreciated! Thanks.
Use this XPath with findElements to get all the option titles:
//ul//li/div[@class='selectator_option_title']
Store the result in a list, for example yourListOFElements.
Once you have the list of WebElements, iterate over it and read each entry's innerHTML:
yourListOFElements.get(i).getAttribute("innerHTML")
Compare that value with the text you need, and when it matches, click that element.
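As a variation on the loop described above, the text comparison can be moved into the XPath itself, so only the wanted option is matched. The sketch below is a hedged illustration, not Selenium code: it evaluates the expression with the JDK's built-in XPath engine against a trimmed, well-formed copy of the drop-down markup, which lets the expression be checked without a browser. The class name OptionXPathSketch and the helper names are made up, and the helper assumes the title contains no single quote.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class OptionXPathSketch {

    // Well-formed excerpt of the drop-down from the question ("&" must be "&amp;" in XML).
    static final String SAMPLE =
        "<ul class='selectator_options'>"
      + "<li class='selectator_option selectator_value_CST0003214'>"
      + "<div class='selectator_option_title'>3RD &amp; STEVENS</div></li>"
      + "<li class='selectator_option selectator_value_CST0003956 active'>"
      + "<div class='selectator_option_title'>9TH AVE</div></li>"
      + "</ul>";

    // XPath for the <li> whose title div holds exactly the given text.
    // Assumption: the title contains no single quote.
    static String optionXPath(String title) {
        return "//ul[@class='selectator_options']/li"
             + "[div[@class='selectator_option_title'][normalize-space(.)='" + title + "']]";
    }

    // Evaluates the XPath with the JDK engine and returns the class attribute of each match.
    static List<String> matchedClasses(String xml, String title) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(optionXPath(title), doc, XPathConstants.NODESET);
            List<String> classes = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                classes.add(((Element) nodes.item(i)).getAttribute("class"));
            }
            return classes;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(optionXPath("9TH AVE"));
        System.out.println(matchedClasses(SAMPLE, "9TH AVE"));
    }
}
```

In the page object, the same string could presumably be passed to By.xpath after the drop-down has been opened, for example driver.findElement(By.xpath(optionXPath("9TH AVE"))).click(); that line is untested here.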
As far as I can see, your drop-down contains a search field:
<input class="selectator_input" placeholder="Search here..." autocomplete="false">
The best way is to:
Select the main div with id="selectator_LocationListDD".
Select the search field inside the main div.
Type a unique part of the option name into the search field.
Click the only <li> that is still displayed inside the main div.
That way you avoid the index, which can change frequently, and rely on the option text instead. The text most likely comes from your own test data, so you have full control over it.

Can't figure out how to scrape specific text - Using Jsoup

I just started learning how to use JSoup. I think I've successfully selected this section of the html, and I successfully took "DARK SOULS III Deluxe Edition" out by doing .select("span.title").text but I was trying to get the prices, in this case $84.98 and $55.23. I tried doing .select("div.col search_price responsive_secondrow").text but it comes up as blank. I was wondering if someone could help me figure out how to extract that part, thanks in advance! Here's the html of the section of the page.
The full html is view-source:http://store.steampowered.com/search/?filter=topsellers
<a href="http://store.steampowered.com/sub/94174/?snr=1_7_7_topsellers_150_1" data-ds-packageid="94174" data-ds-appid="374320,442010"onmouseover="GameHover( this, event, 'global_hover', {"type":"sub","id":94174,"public":1,"v6":1} );" onmouseout="HideGameHover( this, event, 'global_hover' )" class="search_result_row ds_collapse_flag" >
<div class="col search_capsule"><img src="http://cdn.edgecast.steamstatic.com/steam/subs/94174/capsule_sm_120.jpg?t=1476893662"></div>
<div class="responsive_search_name_combined">
<div class="col search_name ellipsis">
<span class="title">DARK SOULS III Deluxe Edition</span>
<p>
<span class="platform_img win"></span> </p>
</div>
<div class="col search_released responsive_secondrow">12 Apr, 2016</div>
<div class="col search_reviewscore responsive_secondrow">
<span class="search_review_summary positive" data-store-tooltip="Very Positive<br>86% of the 29,204 user reviews for games in this bundle are positive.">
</span>
</div>
<div class="col search_price_discount_combined responsive_secondrow">
<div class="col search_discount responsive_secondrow">
<span>-35%</span>
</div>
<div class="col search_price discounted responsive_secondrow">
<span style="color: #888888;"><strike>$84.98</strike></span><br>$55.23 </div>
</div>
</div>
<div style="clear: left;"></div>
</a>
Use doc.select("a.search_result_row") instead:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupSteamTest {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://store.steampowered.com/search/?filter=topsellers")
                .userAgent("Mozilla")
                .get();
        // Each search result is one <a class="search_result_row"> element.
        Elements rows = doc.select("a.search_result_row");
        for (Element row : rows) {
            System.out.println(row.text());
        }
    }
}
You will get a list like this:
PLAYERUNKNOWN'S BATTLEGROUNDS 23 Mar, 2017 29,99€
Steel Division: Normandy 44 Coming Soon 39,99€
DARK SOULS™ III 11 Apr, 2016 -50% 59,99€ 29,99€
Your particular problem comes from the div that has multiple classes.
To select an element that has multiple classes, use a dot instead of a space in your select:
doc.select("div.col.search_price.discounted.responsive_secondrow");
Take a look at this question: JSOUP get element with multiple classes
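To make the difference concrete, here is a small sketch that runs both selectors against the price cell from the question and then splits the two prices. It assumes jsoup is on the classpath; the class name PriceSelectorSketch and its helper methods are made up for illustration. Reading the current price with ownText() is one possible approach, not something the answers above prescribe.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PriceSelectorSketch {

    // Price cell copied from the question's HTML.
    static final String HTML =
        "<div class=\"col search_price discounted responsive_secondrow\">"
      + "<span style=\"color: #888888;\"><strike>$84.98</strike></span><br>$55.23 </div>";

    // Dots chain class names on the SAME element. Class matching is per whole token,
    // so this does not match "search_price_discount_combined".
    static Element priceCell(String html) {
        return Jsoup.parse(html).selectFirst("div.col.search_price");
    }

    // Original price: the text inside <strike> (empty when there is no discount).
    static String originalPrice(Element cell) {
        return cell.select("strike").text();
    }

    // Current price: the text node sitting directly inside the div, after the <br>.
    static String currentPrice(Element cell) {
        return cell.ownText().trim();
    }

    // Number of matches for the selector tried in the question.
    static int spacedSelectorMatches(String html) {
        Document doc = Jsoup.parse(html);
        // Spaces mean "descendant": this looks for <search_price> and
        // <responsive_secondrow> TAGS nested inside div.col.
        return doc.select("div.col search_price responsive_secondrow").size();
    }

    public static void main(String[] args) {
        System.out.println("spaced selector matches: " + spacedSelectorMatches(HTML));
        Element cell = priceCell(HTML);
        System.out.println(originalPrice(cell) + " -> " + currentPrice(cell));
    }
}
```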

Multiple fields within an id - Webdriver - Java

The input fields I need to grab are inside the element with @id="contractorsWrapper".
In this example there are 2 input fields inside that wrapper (the number is dynamic and depends on the case), each with @class="contactEntry".
What I'm trying to do is find out how many className=contactEntry fields there are within id=contractorsWrapper, and then be able to input text into each of them independently.
<div id="contractorsWrapper" class="contactInputAndInfoDisplays_wrapper">
<div id="contractorsRow_5d1532ba-b37e-4aac-85c2-4a5e6c6c2796" class="contactInputAndInfoDisplay">
<div class="contactName">
<div class="contactFlag"/>
<a class="smallRemove removeAContact" href="#"/>
<span class="littleGreyTitles">
Name
<br/>
</span>
<input class="contactEntry " type="text" value=""/>
</div>
<div class="descriptionInput littleGreyTitles">
Description
<br/>
<input type="text"/>
</div>
<a class="contactLink" href="#" style="display: none;"/>
</div>
<div class="spacerDiv1"/>
<div id="contractorsRow_5fc58f1a-906f-4239-93ae-b0a2e4b8b70c" class="contactInputAndInfoDisplay">
<div class="contactName">
<div class="contactFlag"/>
<a class="smallRemove removeAContact" href="#"/>
<span class="littleGreyTitles">
Name
<br/>
</span>
<input class="contactEntry " type="text" value=""/>
</div>
<div class="descriptionInput littleGreyTitles">
Description
<br/>
<input type="text"/>
</div>
<a class="contactLink" href="#" style="display: none;"/>
</div>
<div class="spacerDiv1"/>
</div>
Find your wrapper:
WebElement wrapperElement = driver.findElement(By.id("contractorsWrapper"));
Number of input elements:
wrapperElement.findElements(By.className("contactEntry")).size();
I don't know what you mean by "input text into them independently", but here's how you could enter the same thing in all of them:
for (WebElement element : wrapperElement.findElements(By.className("contactEntry"))) {
    element.sendKeys("keysToSend");
}
Update, after more details from the OP:
If you want to insert a different string into each element, you can use an ArrayList:
// add as many entries as you need
List<String> namesList = new ArrayList<String>();
namesList.add("John Doe");
namesList.add("Jane Doe");
...
// then
int count = 0;
for (WebElement element : wrapperElement.findElements(By.className("contactEntry"))) {
    element.sendKeys(namesList.get(count++));
}
Of course, you then need to make sure that your list always has at least as many entries as there are input elements; otherwise namesList.get throws an IndexOutOfBoundsException.
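One way to guard against a list that is too short is to stop at the shorter of the two lists. The sketch below is a hypothetical helper, not part of Selenium: fillInOrder and FillInOrderSketch are invented names, and StringBuilder objects stand in for the input elements so that it runs without a browser.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiConsumer;

public class FillInOrderSketch {

    // Sends values.get(i) to fields.get(i) and stops at the shorter list,
    // so a missing value never causes an IndexOutOfBoundsException.
    // Returns how many fields were filled.
    static <T> int fillInOrder(List<T> fields, List<String> values, BiConsumer<T, String> typer) {
        int n = Math.min(fields.size(), values.size());
        for (int i = 0; i < n; i++) {
            typer.accept(fields.get(i), values.get(i));
        }
        return n;
    }

    public static void main(String[] args) {
        // Three "fields" but only two names: the third field is left untouched.
        List<StringBuilder> fields = Arrays.asList(
                new StringBuilder(), new StringBuilder(), new StringBuilder());
        List<String> names = Arrays.asList("John Doe", "Jane Doe");
        int filled = fillInOrder(fields, names, (field, name) -> field.append(name));
        System.out.println(filled + " of " + fields.size() + " fields filled");
    }
}
```

With Selenium, the call would presumably look like fillInOrder(wrapperElement.findElements(By.className("contactEntry")), namesList, (el, name) -> el.sendKeys(name)), and comparing the returned count with the number of elements shows whether any field was skipped.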

Thymeleaf: Loop iteration in <div> tag and create rows.

I am new to Thymeleaf and am trying to iterate over values using the th:each attribute, but I get the wrong output. I am using <div> elements instead of a table. When Thymeleaf renders the page, the values of all objects overwrite the first row and the remaining rows are shown empty. The following is my code:
My Spring MVC controller code:
ProductCategory category = new ProductCategory();
category.setId(BigInteger.valueOf(558711));
category.setTitle("Category 1");
category.setStatus(FinEnum.STATUS.IN_ACTIVE.getStatus());
ProductCategory category2 = new ProductCategory();
category.setId(BigInteger.valueOf(558722));
category.setTitle("Category 2");
category.setStatus(FinEnum.STATUS.ACTIVE.getStatus());
List<ProductCategory> categories = new ArrayList<ProductCategory>();
categories.add(category);
categories.add(category2);
model.addAttribute("categories", categories);
return "admin/product/view-categories";
My thymeleaf code:
<div class="row-area" th:each="category: ${categories}">
<div class="column2 tariff-date" style="width: 15%;"><span th:text="${category.id}">Dummy Data</span></div>
<div class="column2 tariff-date" style="width: 15%;"><span th:text="${category.title}">Dummy Data</span></div>
<div class="column2 tariff-date" style="width: 13%;"><span th:text="${category.status}">Dummy Data</span></div>
<div class="column5 icons middle-area" style="margin-left: 7px; width: 40%;">
<a class="icon7" href="javascript:void(0)" style="width: 140px;">View Sub Category</a>
<a class="icon2" href="javascript:void(0)"><p>Edit</p></a>
<div th:switch="${category.status}" style="margin-left: 195px;">
<a class="icon8" href="javascript:void(0)" th:case="'Inactive'" style="width: 88px;">Deactivate</a>
<a class="icon9" href="javascript:void(0)" th:case="'Active'">Active</a>
</div>
<a class="icon14" href="javascript:void(0)" style="width: 60px;"><p>Delete</p></a>
</div>
</div>
My Output is:
The problem has nothing to do with Thymeleaf, it's just a simple typo. After the line:
ProductCategory category2 = new ProductCategory();
you are still modifying (overwriting) the category object instead of category2. Therefore, the properties of category2 never got set. Corrected code should be:
category2.setId(BigInteger.valueOf(558722));
category2.setTitle("Category 2");
category2.setStatus(FinEnum.STATUS.ACTIVE.getStatus());
I tested this locally and saw both "rows" of data after the fix.

How do I correctly parse data using JSoup (java)

I want to parse data (CompanyName, Location, jobDescription, ...) out of this HTML using JSoup (Java). I get stuck when trying to iterate over the job listings.
The HTML extract below is one of many "joblisting" divs that I want to iterate over and extract the data from. I just can't work out how to iterate over these specific div elements. Sorry for the beginner question, but maybe someone who already knows which function to use can help. Is it select?
<div class="between_listings"><!-- local.spacer --></div>
<div id="joblisting-2944914" class="joblisting listing-even listing-even company-98028 " itemscope itemtype="http://schema.org/JobPosting">
<div class="company_logo" itemprop="hiringOrganization" itemscope itemtype="http://schema.org/Organization">
<a href="/stellenangebote-des-unternehmens--Delivery-Hero-Holding-GmbH--98028.html" title="Jobs Delivery Hero Holding GmbH" itemprop="url">
<img src="/upload_de/logo/D/logoDelivery-Hero-Holding-GmbH-98028DE.gif" alt="Logo Delivery Hero Holding GmbH" itemprop="image" width="160" height="80" />
</a>
</div>
<div class="job_info">
<div class="h3 job_title">
<a id="jobtitle-2944914" href="/stellenangebote--Junior-Business-Intelligence-Analyst-CRM-m-f-Berlin-Delivery-Hero-Holding-GmbH--2944914-inline.html?ssaPOP=204&ssaPOR=203" title="Arbeiten bei Delivery Hero Holding GmbH" itemprop="url">
<span itemprop="title">Junior Business Intelligence Analyst / CRM (m/f)</span>
</a>
</div>
<div class="h3 company_name" itemprop="hiringOrganization" itemscope itemtype="http://schema.org/Organization">
<span itemprop="name">Delivery Hero Holding GmbH</span>
</div>
</div>
<div class="job_location_date">
<div class="job_location target-location">
<div class="job_location_info" itemprop="jobLocation" itemscope itemtype="http://schema.org/Place">
<div class="h3 locality" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="addressLocality"> Berlin</span>
</div>
<span class="location_actions">
<a href="javaScript:PopUp('http://www.stepstone.de/5/standort.html?OfferId=2944914&ssaPOP=203&ssaPOR=203','resultList',800,520,1)" class="action_showlistingonmap showlabel" title="Google Maps" itemprop="maps">
<span class="location-icon"><!-- --></span>
<span class="location-label">Google Maps</span>
</a>
</span>
</div>
</div>
<div class="job_date_added" itemprop="datePosted"><time datetime="2014-07-04">04.07.14</time></div>
</div>
<div class="job_actions">
</div>
</div>
<div class="between_listings"><!-- local.spacer --></div>
Java code:
File input = new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt");
// Load file into extraction1
Document ParseResult = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements jobListingElements = ParseResult.select(".joblisting");
for (Element jobListingElement : jobListingElements) {
    jobListingElement.select(".companyName span[itemprop=\"name\"]");
    // other element properties
    System.out.println(jobListingElements);
}
Thank you!
So you already have your Jsoup document, right? Then it is pretty easy, provided the CSS class joblisting does not appear anywhere else.
Document document = Jsoup.parse(new File("d:/bla.html"), "utf-8");
Elements elements = document.select(".joblisting");
for (Element element : elements) {
    Elements jobTitleElement = element.select(".job_title span");
    Elements companyNameElement = element.select(".company_name span[itemprop=name]");
    String companyName = companyNameElement.text();
    String jobTitle = jobTitleElement.text();
    System.out.println(companyName);
    System.out.println(jobTitle);
}
I don't know why the attribute [itemprop*=\"name\"] selector does not find the span (Further reading: http://jsoup.org/cookbook/extracting-data/selector-syntax )
Got it: span[itemprop=name] without any quotes or escapes. Other attributes or values also should work to get a more specific selection.
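The selector can be tried out on an inline copy of the listing, as in the sketch below. It assumes jsoup is on the classpath; the class name JobListingSketch and the describe helper are made up for illustration. One detail taken from the question's HTML: the div's class is company_name, whereas the question's code selects .companyName, which matches nothing in that markup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JobListingSketch {

    // Shortened copy of one "joblisting" div from the question.
    static final String HTML =
        "<div id='joblisting-2944914' class='joblisting listing-even company-98028'>"
      + "<div class='h3 job_title'><a><span itemprop='title'>"
      + "Junior Business Intelligence Analyst / CRM (m/f)</span></a></div>"
      + "<div class='h3 company_name'><span itemprop='name'>Delivery Hero Holding GmbH</span></div>"
      + "<div class='h3 locality'><span itemprop='addressLocality'> Berlin</span></div>"
      + "</div>";

    // Reads company, title and city from a single listing element.
    static String describe(Element listing) {
        String company = listing.select(".company_name span[itemprop=name]").text();
        String title = listing.select(".job_title span[itemprop=title]").text();
        String city = listing.select("span[itemprop=addressLocality]").text();
        return company + " | " + title + " | " + city;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        // The class in the markup is "company_name", so ".companyName" finds nothing.
        System.out.println(doc.select(".companyName").size());
        for (Element listing : doc.select(".joblisting")) {
            System.out.println(describe(listing));
        }
    }
}
```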
