I'm relatively new to using jsoup, and I can't seem to find the correct query to parse out the value I'm looking for. The HTML is as follows.
<img src='http://rootzwiki.com/public/style_images/ginger/t_unread.png' alt='New Replies' /><br />
</a>
</td>
<td class='col_f_content '>
<h4><a id="tid-link-12251" href="http://rootzwiki.com/topic/12251-romlte-rootzboat-403-v61/" title='View topic, started 17 December 2011 - 09:32 AM' class='topic_title'>[ROM][LTE] RootzBoat 4.0.3 V6.1</a></h4>
<br />
<span class='desc lighter blend_links'>
Started by <a hovercard-ref="member" hovercard-id="5" class="_hovertrigger url fn " href='http://rootzwiki.com/user/5-birdman/'>birdman</a>, 17 Dec 2011
</span>
<ul class='mini_pagination'>
<li><a href="http://rootzwiki.com/topic/12251-romlte-rootzboat-403-v61/" title='Go to page 1'>1</a></li>
<li><a href="http://rootzwiki.com/topic/12251-romlte-rootzboat-403-v61/page__st__10" title='Go to page 2'>2</a></li>
<li><a href="http://rootzwiki.com/topic/12251-romlte-rootzboat-403-v61/page__st__20" title='Go to page 3'>3</a></li>
<li><a href="http://rootzwiki.com/topic/12251-romlte-rootzboat-403-v61/page__st__1990" title='Go to page 200'>200 →</a></li>
</ul>
</td>
<td class='col_f_preview __topic_preview'>
<a href='http://rootzwiki.com/topic/12251-romlte-rootzboat-403-v61/' class='expander closed' title='Preview this topic'> </a>
</td>
<td class='col_f_views desc blend_links'>
<ul>
<li>
<span class='ipsBadge ipsBadge_orange'>Hot</span>
1,999 replies
</li>
<li class='views desc'>180,213 views</li>
</ul>
</td>
<td class='col_f_post'>
<a href='http://rootzwiki.com/user/49940-jakeday/' class='ipsUserPhotoLink left'>
<img src='http://rootzwiki.com/uploads/profile/photo-thumb-49940.jpg' class='ipsUserPhoto ipsUserPhoto_mini' />
</a>
<ul class='last_post ipsType_small'>
<li><a hovercard-ref="member" hovercard-id="49940" class="_hovertrigger url fn " href='http://rootzwiki.com/user/49940-jakeday/'>jakeday</a></li>
<li>
<a href='http://rootzwiki.com/topic/12251-romlte-rootzboat-403-v61/page__view__getlastpost' title='Go to last post'>Today, 04:20 AM</a>
</li>
</ul>
</td>
I need to parse out birdman from there. I know that once I've defined the element, I can get "birdman" out with author.text();, but I cant figure out how to define the author element. I thought perhaps the following block of code would work, but as I mentioned, I'm pretty new to jsoup and html and it obviously didnt work. Theres nothing wrong with the connection, and jsoup is working for the other values I parsed out.
TitleResults titleArray = new TitleResults();
Document doc = null;
try {
doc = Jsoup.connect(Constants.FORUM).get();
} catch (IOException e) {
e.printStackTrace();
}
Elements threads = doc.select(".topic_title");
for (Element thread : threads) {
titleArray = new TitleResults();
//Thread title
threadTitle = thread.text();
titleArray.setItemName(threadTitle);
//Thread link
String threadStr = thread.attr("abs:href");
String endTag = "/page__view__getnewpost"; //trim link
threadStr = new String(threadStr.replace(endTag, ""));
threadArray.add(threadStr);
titleArray.setAuthorDate("Author/Date");
results.add(titleArray);
}
Elements authors = doc.select("a[hovercard-ref]");
for (Element author : authors) {
if (author.attr("abs:href").contains("/user/")){
Log.d("POC", "SUCCESS " + author.attr("abs:href"));
} else {
Log.d("POC", "FAILURE " + author.text());
}
}
}
I think you're thinking too hard ;)
To get the birdman portion of the link, just use the following:
Elements authors = doc.select("a");
for (Element author : authors) {
Log.d("POC", author.text());
}
The "a" retrieves all links. After that you can just use the .text() like you said to retrieve the value.
Selvin answered it in the comments. I wasnt getting the source correctly and it was causing errors.
http://pastebin.com/xfUQkGw0
Related
I followed this tutorial:
https://www.youtube.com/watch?v=Aie8n12EFQc
Pagination works. How can I restrict the number of page links below the table? If there is to much records, it can be to much numbers.
Numbers are below closing /table tag, it's not inside the table, but below (like in tutorial).
I know there are some solutions in Stackoverflow but they are mostly in php code. Can it be done somehow with java server-side or with Thymeleaf? Or can you send me an already good answered question on this topic.
My thymeleaf:
<div th:if = "${totalPages > 1}">
<nav aria-label="Page navigation" class="paging">
<ul class="pagination">
<li class="page-item">
<a class="page-link" th:href="#{/customer_list?pageNo=1}">First</a>
</li>
<li class="page-item">
<a class="page-link" th:href="#{/customer_list?pageNo={currentPage}(currentPage=${currentPage-1})}">Previous</a>
</li>
<li th:each="i: ${#numbers.sequence(1, totalPages)}" th:classappend="${i==currentPage} ? 'page-item active' : 'page-item'">
<a class="page-link" th:href="#{/customer_list?pageNo={i}(i=${i})}">[[${i}]]</a>
</li>
<li class="page-item">
<a class="page-link" th:href="#{/customer_list?pageNo={currentPage}(currentPage=${currentPage+1})}">Next</a>
</li>
<li class="page-item">
<a class="page-link" th:href="#{/customer_list?pageNo={totalPages}(totalPages=${totalPages})}">Last</a>
</li>
</ul>
</nav>
</div>
Controller method in Java: (pageSize only 2, for experimenting)
#GetMapping("/customer_list")
public String customerList(#RequestParam(name = "pageNo", required = false) Integer pageNo ,Model model)
{
if(pageNo == null)
pageNo = 1;
int pageSize = 2;
try
{
Page<CustomerDTO> pageOfCustomers = customerService.findPaginated(pageNo, pageSize);
List<CustomerDTO> listOfCustomers = pageOfCustomers.getContent();
model.addAttribute("listOfCustomers", listOfCustomers);
model.addAttribute("currentPage", pageNo);
model.addAttribute("totalPages", pageOfCustomers.getTotalPages());
model.addAttribute("totalItems", pageOfCustomers.getTotalElements());
} catch (Exception e)
{
log.error("Error in retrieving the list of customers!", e);
e.printStackTrace();
}
return "customer_list";
}
Have a look to this post : pagination
It should answer your question.
In a nutshell you add to your template an array with the length equals to the number of pages.
Then with some tests you display either dots or the link to the page.
It will give something like :
<< < ... 3456789 ... > >>
i am trying to display two "text text-pass" from html in chrome browser to my print console, apparently, it did not work, any advise please?
my browser html code
<a href="/abc/123" class="active">
<div class="sidebar-text">
<span class="text text-pass"> </span> </a>
<a href="/abc/1234" class="active">
<div class="sidebar-text">
<span class="text text-pass"> </span> </a>
My code
String 123= driver.findElement(By.xpath("//*[#id="js-app"]/div/div/div[2]/div[1]/div/div/ul/li[5]/a")).getText();
System.out.println(123);
String 1234= driver.findElement(By.xpath("//*[#id="js-app"]/div/div/div[2]/div[1]/div/div/ul/li[5]/a")).getText();
System.out.println(1234);
You can use .findElements to get multiple elements with the same pattern, it will return a list collection.
UPDATE
Refers to your comment, you need put the string into a list again and check with the Collection.contains() method:
List<String> results = new ArrayList<>();
List<WebElement> elements = driver.findElements(By.xpath("//div[#class='sidebar-text']//span"));
for(WebElement element: elements) {
String attr = element.getAttribute("class");
results.add(attr);
System.out.println(attr);
}
if(results.contains("text text-fail")) {
System.out.println("this is list contains 'text text-fail'");
}
Try this Code :
String pass = driver.findElement(By.xpath("//*[#class='sidebar-text']/span")).getAttribute("class");
System.out.println(pass);
I want to extract the text inside the "job title" and the text inside "summary" class. There are many with the same class names. So I want the job title of the first one and summary of it. And then the job title of the next one and the summary of it. In that order.
The following code works. But it first gives all the titles and then all the text inside all the summary classes. I want the first job title and the first summary. Then the second job title and the second summary and so on. How do I modify the code for this? Please help.
<div class=" row result" id="p_64c5268586001bd2" data-jk="64c5268586001bd2" itemscope="" itemtype="http://schema.org/JobPosting" data-tn-component="organicJob">
<h2 id="jl_64c5268586001bd2" class="jobtitle">
<a rel="nofollow" href="/rc/clk?jk=64c5268586001bd2" target="_blank" onmousedown="return rclk(this,jobmap[0],0);" onclick="return rclk(this,jobmap[0],true,0);" itemprop="title" title="Fashion Assistant" class="turnstileLink" data-tn-element="jobTitle"><b>Fashion</b> Assistant</a>
</h2>
<span class="company" itemprop="hiringOrganization" itemtype="http://schema.org/Organization">
<span itemprop="name">
<a href="/cmp/Itv?from=SERP&campaignid=serp-linkcompanyname&fromjk=64c5268586001bd2&jcid=3bf3e8a57da58ff5" target="_blank">
ITV Jobs</a></span>
</span>
<a data-tn-element="reviewStars" data-tn-variant="cmplinktst2" class="turnstileLink " href="/cmp/Itv/reviews?jcid=3bf3e8a57da58ff5" title="Itv Jobs reviews" onmousedown="this.href = appendParamsOnce(this.href, '?campaignid=cmplinktst2&from=SERP&jt=Fashion+Assistant&fromjk=64c5268586001bd2');" target="_blank">
<span class="ratings"><span class="rating" style="width:49.5px;"><!-- -> </span></span><span class="slNoUnderline">28 reviews</span></a>
<span itemprop="jobLocation" itemscope="" itemtype="http://schema.org/Place"> <span class="location" itemprop="address" itemscope="" itemtype="http://schema.org/Postaladdress"><span itemprop="addressLocality">London</span></span></span>
<table cellpadding="0" cellspacing="0" border="0">
<tbody><tr>
<td class="snip">
<div>
<span class="summary" itemprop="description">
Do you have a passion for <b>Fashion</b>? You will be responsible for running our <b>fashion</b> cupboard, managing a team of interns and liaising with press officers to...</span>
</div>
doc = Jsoup.connect("http://www.indeed.co.uk/jobs?q=fashion&l=England").timeout(5000).get();
Elements f = doc.select(".jobtitle");
Elements e = doc.select(".summary");
System.out.println("Title: " + f.text());
System.out.println("Details: "+ e.text());
Iterate over titles and then find the summary for each title:
for (Element title : doc.select(".jobtitle")) {
Element summary = title.parent().select(".summary").first();
System.out.format("Title: %s. Summary: %s%n", title.text(), summary.text());
}
I want to parse the data out of this HTML (CompanyName, Location, jobDescription,...) using JSoup (java). I get stuck when trying to iterate the joblistings
The extract from the HTML is one of many "JOBLISTING" divs which I want to iterate and extract the Data out of it. I just can't handle how to iterate the specific div objects. Sorry for this noob question, but maybe someone can help me who already knows which function to use. Select?
<div class="between_listings"><!-- local.spacer --></div>
<div id="joblisting-2944914" class="joblisting listing-even listing-even company-98028 " itemscope itemtype="http://schema.org/JobPosting">
<div class="company_logo" itemprop="hiringOrganization" itemscope itemtype="http://schema.org/Organization">
<a href="/stellenangebote-des-unternehmens--Delivery-Hero-Holding-GmbH--98028.html" title="Jobs Delivery Hero Holding GmbH" itemprop="url">
<img src="/upload_de/logo/D/logoDelivery-Hero-Holding-GmbH-98028DE.gif" alt="Logo Delivery Hero Holding GmbH" itemprop="image" width="160" height="80" />
</a>
</div>
<div class="job_info">
<div class="h3 job_title">
<a id="jobtitle-2944914" href="/stellenangebote--Junior-Business-Intelligence-Analyst-CRM-m-f-Berlin-Delivery-Hero-Holding-GmbH--2944914-inline.html?ssaPOP=204&ssaPOR=203" title="Arbeiten bei Delivery Hero Holding GmbH" itemprop="url">
<span itemprop="title">Junior Business Intelligence Analyst / CRM (m/f)</span>
</a>
</div>
<div class="h3 company_name" itemprop="hiringOrganization" itemscope itemtype="http://schema.org/Organization">
<span itemprop="name">Delivery Hero Holding GmbH</span>
</div>
</div>
<div class="job_location_date">
<div class="job_location target-location">
<div class="job_location_info" itemprop="jobLocation" itemscope itemtype="http://schema.org/Place">
<div class="h3 locality" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="addressLocality"> Berlin</span>
</div>
<span class="location_actions">
<a href="javaScript:PopUp('http://www.stepstone.de/5/standort.html?OfferId=2944914&ssaPOP=203&ssaPOR=203','resultList',800,520,1)" class="action_showlistingonmap showlabel" title="Google Maps" itemprop="maps">
<span class="location-icon"><!-- --></span>
<span class="location-label">Google Maps</span>
</a>
</span>
</div>
</div>
<div class="job_date_added" itemprop="datePosted"><time datetime="2014-07-04">04.07.14</time></div>
</div>
<div class="job_actions">
</div>
</div>
<div class="between_listings"><!-- local.spacer --></div>
File input = new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt"); // Load file into extraction1 Document ParseResult = Jsoup.parse(input, "UTF-8", "http://example.com/"); Elements jobListingElements = ParseResult.select(".joblisting"); for (Element jobListingElement: jobListingElements) { jobListingElement.select(".companyName span[itemprop=\"name\"]"); // other element properties System.out.println(jobListingElements);
Java code:
File input = new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt");
// Load file into extraction1
Document ParseResult = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements jobListingElements = ParseResult.select(".joblisting");
for (Element jobListingElement: jobListingElements) {
jobListingElement.select(".companyName span[itemprop=\"name\"]");
// other element properties
System.out.println(jobListingElements);
}
Thank you!
So you got your Jsoup document right? Than it seems pretty easy if the css class joblisting does not appear anywhere else.
Document document = Jsoup.parse(new File("d:/bla.html"), "utf-8");
Elements elements = document.select(".joblisting");
for (Element element : elements) {
Elements jobTitleElement = element.select(".job_title span");
Elements companyNameElement = element.select(".company_name spanspan[itemprop=name]");
String companyName = companyNameElement.text();
String jobTitle = jobTitleElement.text();
System.out.println(companyName);
System.out.println(jobTitle);
}
I don't know why the attribute [itemprop*=\"name\"] selector does not find the span (Further reading: http://jsoup.org/cookbook/extracting-data/selector-syntax )
Got it: span[itemprop=name] without any quotes or escapes. Other attributes or values also should work to get a more specific selection.
I have:
<table class="cast_list">
<tr><td colspan="4" class="castlist_label"></td></tr>
<tr class="odd">
<td class="primary_photo">
<a href="/name/nm0000209/?ref_=ttfc_fc_cl_i1" ><img height="44" width="32" alt="Tim Robbins" title="Tim Robbins"src="http://ia.media-imdb.com/images/G/01/imdb/images/nopicture/32x44/name-2138558783._V379389446_.png"class="loadlate hidden " loadlate="http://ia.media-imdb.com/images/M/MV5BMTI1OTYxNzAxOF5BMl5BanBnXkFtZTYwNTE5ODI4._V1_SY44_CR1,0,32,44_AL_.jpg" /></a> </td>
<td class="itemprop" itemprop="actor" itemscope itemtype="http://schema.org/Person">
<a href="/name/nm0000209/?ref_=ttfc_fc_cl_t1" itemprop='url'> <span class="itemprop" itemprop="name">Tim Robbins</span>
</a> </td>
<td class="ellipsis">
...
</td>
how can I get only the information inside the second td class? (td class= itemprop). I want to get "/name/nm0000209/?ref_=ttfc_fc_cl_t1" and "Tim Robbins".
This is my code:
Elements elms = doc.getElementsByClass("cast_list").first().getElementsByTag("table");
Elements tds = elms.select("td");
for(Element td : tds){
if(td.attr("class").contains("itemprop")){
Elements links = tds.select("a[href]");
for(Element link : links){
if(link.attr("href").contains("name/nm"))
{
String castname = link.text();
String castImdbId = link.attr("href");
System.out.println("CastName:" + castname + "\n");
System.out.println("CastImdbID:" + castImdbId + "\n");
}
but it also returns the text of the link inside td class="primary_phptp" which is null, this is part of my output:
CastName:
CastImdbID:/name/nm0000209/?ref_=ttfc_fc_cl_i1
CastName:Tim Robbins
CastImdbID:/name/nm0000209/?ref_=ttfc_fc_cl_t1
CastName:
......
Could someone please let me know where is my problem? I think the condition if(td.attr("class").contains("itemprop")) does not work at all.
Thanks,
Use a different css selector instead of td. Since the right <td> is identified be the class, why not use it:
td.itemprop
Your java code then would start like this instead
Elements tds = elms.select("td.itemprop");