I am very new to Jsoup and have only been using it for a couple of days, learning mostly from this website. Now I'm trying to get some information from the HTML below:
<td class="day no-repetition">Sun</td>
<td class="full-date" nowrap="nowrap">17/05/15</td>
<td class="competition">PRL</td>
<td class="team team-a ">
<a href="/teams/england/sunderland-association-football-club/683/" title="Sunderland">
Sunderland
</a>
</td>
<td class="score-time score">
<a href="/matches/2015/05/16/england/premier-league/sunderland-association-football-club/leicester-city-fc/1704225/" class="result-draw">
0 - 0
</a>
</td>
<td class="team team-b ">
<a href="/teams/england/leicester-city-fc/682/" title="Leicester City">
Leicester City
</a>
</td>
<td class="events-button button first-occur">
</td>
<td class="info-button button">
More info
</td>
I need to extract the home team, the score and the away team from the above, but I am having trouble doing so. I need both the link and the text itself. Below is the code I have:
try {
    Document doc = Jsoup.connect(URL).get();
    Element table = doc.select("table[class=matches]").first();
    Elements rows = table.select("tr");
    for (int i = 0; i < rows.size(); i++) {
        Element row = rows.get(i);
        Elements data = row.select("td[class=team.team-a]");
        System.out.println(data.text());
    }
} catch (IOException e) {
    e.printStackTrace();
}
This hasn't worked so far. I tried 'team.team-a', 'team.team.a' and all other variations of it. I did manage to get the data in the 'competition' class: it works when I replace ("td[class=team.team-a]") with ("td[class=competition]"). However, this doesn't work with any of the classes that contain a link.
Assistance would be highly appreciated!
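The attribute selector td[class=team.team-a] compares the entire class attribute against that literal string, so it never matches class="team team-a ". Use class selectors instead and join multiple classes with a dot: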
td.team.team-a > a // first team
td.team.team-b > a // second team
td.score > a // score
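As a rough sketch of how these selectors could be used to get both the link and the text, the example below parses a trimmed copy of the row from the question. The class name MatchRowExample, the helper extract and the wrapping table markup are assumptions made for illustration; only Jsoup.parse, select, first, text and attr come from Jsoup itself.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MatchRowExample {
    // Trimmed copy of the row from the question, wrapped in a table so the parser keeps the cells.
    static final String HTML =
        "<table class=\"matches\"><tr>"
        + "<td class=\"team team-a \"><a href=\"/teams/england/sunderland-association-football-club/683/\" title=\"Sunderland\">Sunderland</a></td>"
        + "<td class=\"score-time score\"><a href=\"/matches/2015/05/16/england/premier-league/sunderland-association-football-club/leicester-city-fc/1704225/\" class=\"result-draw\">0 - 0</a></td>"
        + "<td class=\"team team-b \"><a href=\"/teams/england/leicester-city-fc/682/\" title=\"Leicester City\">Leicester City</a></td>"
        + "</tr></table>";

    // Returns "text -> href" for the home team, the score and the away team of every row.
    static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        String[] selectors = {"td.team.team-a > a", "td.score > a", "td.team.team-b > a"};
        List<String> result = new ArrayList<>();
        for (Element row : doc.select("table.matches tr")) {
            for (String selector : selectors) {
                Element link = row.select(selector).first();
                if (link != null) { // rows without such a cell are skipped
                    result.add(link.text() + " -> " + link.attr("href"));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        extract(HTML).forEach(System.out::println);
    }
}
```

When the document is fetched with Jsoup.connect(URL).get(), link.absUrl("href") can be used instead of attr("href") to resolve the relative paths against the page URL.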
My table looks like below:
clm1 clm2 clm3
1 b hi
2 c hello
3 d hi
The requirement is to find every 'hi' and click on another cell in the same row. For example, for the first occurrence I have to click on 1, and then for the next occurrence on 3.
I am able to find 'hi' with the code below, but how do I find the corresponding cell in the same row?
List<WebElement> rows = driver.findElements(By.xpath("//span[text()='hi']"));
<table>
<tr class="abc">
<td class="efg">
<a id="asg">1</a>
</td>
</tr>
<tr class="abc">
<td class="efg">
<span>1</span>
</td>
</tr>
<tr class="abc">
<td class="efg">
<span>1</span>
</td>
</tr>
</table>
Please ignore the typo as I am typing from mobile.
Any help will be highly appreciated.
By.xpath("//td[./span[text()='hi']]/../td[1]") would return the first column of that matching row.
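One way to sanity-check an expression like this without starting a browser is to run it through the XPath engine that ships with the JDK. The sketch below is only that: the class SameRowXPath, the helper firstCells and the sample markup (an a element in the first column, span elements in the others) are assumptions for illustration, and a real page may be structured differently.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SameRowXPath {
    // Hypothetical, well-formed version of the three-row table from the question.
    static final String TABLE =
        "<table>"
        + "<tr><td><a>1</a></td><td><span>b</span></td><td><span>hi</span></td></tr>"
        + "<tr><td><a>2</a></td><td><span>c</span></td><td><span>hello</span></td></tr>"
        + "<tr><td><a>3</a></td><td><span>d</span></td><td><span>hi</span></td></tr>"
        + "</table>";

    // Returns the text of the first cell of every row that has a cell whose span reads `text`.
    static List<String> firstCells(String xml, String text) {
        String expression = "//td[./span[text()='" + text + "']]/../td[1]";
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList cells = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < cells.getLength(); i++) {
                result.add(cells.item(i).getTextContent().trim());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstCells(TABLE, "hi")); // prints [1, 3]
    }
}
```

In Selenium the same expression would be passed to driver.findElements(By.xpath(...)), and each returned element clicked in turn.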
Here is something simple in C#. I hope that you will be able to convert it to Java.
var tableRows = driver.FindElements(By.TagName("tr"));
foreach (var tableRow in tableRows)
{
    var td = tableRow.FindElements(By.TagName("td"));
    // td[2] is the third column (clm3); td[0] is the first column (clm1)
    if (td[2].Text.Contains("hi"))
    {
        td[0].FindElement(By.TagName("a")).Click();
    }
}
If I understand correctly, you should use XPath:
.//yourow/span[not(text()='Hi')][..//span[text()='Hi']]
This returns all siblings in the HTML tree that do not have the text "Hi" but are in the same row as a cell that contains the text "Hi".
I am writing code to detect repeating tag patterns in a web page. Here is an example.
<body>
<table width="200" border="1">
<tr>
<td>Name</td>
<td>Place</td>
<td>Animal</td>
</tr>
<p>hello World</p>
<tr>
<td>Jack</td>
<td>New york</td>
<td>Lion</td>
</tr>
<b>Code Works</b>
<tr>
<td>George</td>
<td>Sydney</td>
<td>Tiger</td>
</tr>
<tr>
<td>Tina</td>
<td>Delhi</td>
<td>Cat</td>
</tr>
</table>
<table>
<tbody>
<tr>
<td> </td>
<td>
1
2
3
4
5
</td>
</tr>
</tbody>
</table>
</body>
For the tag pattern above, I need to find the tags that occur repeatedly and discard those that are not part of the pattern, such as the b and p tags. In the first table, the tr and td tags are repeated. In the second table, the 'a' tag is repeated.
This is what I have done till now:
Parsed to DOM tree using Jsoup.
Then used node visitor class to traverse the tree. Using head and tail methods, I can enter and exit tags.
But I am confused about how to proceed further.
Note: The tag patterns are not fixed. They will vary depending on the structure of the web page. Any kind of help will be appreciated.
But I am confused about how to proceed further.
Your confusion is spreading and has reached us too. However, I'll try to give you a hint.
You can count the tags in your HTML code. If the count for a tag reaches a certain threshold, you can consider this tag as "repeatedly occurring".
// Needs org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
// java.util.HashMap and java.util.Map
// Load document
String html = ...
Document doc = Jsoup.parse(html);
// Count tags by tag name (keying the map by Element would not group equal tags)
String tagsSelector = "*";
Map<String, Integer> tagsCountByType = new HashMap<>();
for (Element e : doc.select(tagsSelector)) {
    // merge() stores 1 for a new tag name, otherwise adds 1 to the existing count
    tagsCountByType.merge(e.tagName(), 1, Integer::sum);
}
// Find tag with a count greater than a given threshold
// ...
I didn't test the code. Just take it as an idea, some sort of inspiration.
Another idea, you can narrow down the tagsSelector. For example:
// All elements (tags) under any table directly under body.
String tagsSelector = "body > table *";
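The thresholding step that the answer leaves open could be sketched as follows. To keep the sketch independent of any parser, it works on a plain list of tag names, such as the ones Element.tagName() would give for each selected element; the class RepeatedTags, both method names and the choice of "count reaches the threshold" (>=) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RepeatedTags {
    // Counts how often each tag name occurs, remembering the order of first appearance.
    static Map<String, Integer> countTags(List<String> tagNames) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : tagNames) {
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    // Keeps only the tag names whose count reaches the given threshold.
    static List<String> repeatedTags(List<String> tagNames, int threshold) {
        List<String> repeated = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : countTags(tagNames).entrySet()) {
            if (entry.getValue() >= threshold) {
                repeated.add(entry.getKey());
            }
        }
        return repeated;
    }

    public static void main(String[] args) {
        // Tag names in document order for the first two rows of the sample table.
        List<String> names = List.of(
            "table", "tr", "td", "td", "td", "p", "tr", "td", "td", "td", "b");
        System.out.println(repeatedTags(names, 2)); // prints [tr, td]
    }
}
```

A fixed threshold is the simplest choice; since the question says the patterns vary from page to page, the threshold would probably need tuning per page or per container.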
I am trying to get the innermost web elements in a page when the page contains similar, nested WebElements.
Consider the example below:
<body>
<table id="level1">
<tr>
<td>
<table id="level2">
<tr>
<td>
<table id="level3">
<tr>
<td>
<p>Test</p>
</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</table>
<table id="level1_table2">
<tr>
<td>
<table id="level2_table2">
<tr>
<td></td>
</tr>
</table>
</td>
</tr>
</table>
</body>
So when I search the page with driver.findElements for elements with the tag "table" that have some text, such as "Test",
I get 5 WebElements in total, namely "level1", "level2", "level3", "level1_table2" and "level2_table2".
What I want to achieve is to have a list of innermost(nested) elements which satisfy my search criteria .
So the List I should get should only have 2 WebElements namely - "level3" and "level2_table2".
I am looking for something along the lines of recursion. Can somebody help me out?
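You don't need recursion; all you need is the right XPath expression. The leading // matches tables at any depth, and the predicate keeps only the tables that contain no other table:
driver.findElements(By.xpath("//table[not(.//table)]"))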
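To see what an innermost-table expression selects, it can be evaluated against the sample markup with the JDK's own XPath engine, without a browser. This is a sketch under assumptions: the class InnermostTables and the helper innermostIds are made up for illustration, and the expression used is //table[not(.//table)], with a leading // so that tables at any depth are considered.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class InnermostTables {
    // The nested tables from the question, written as one well-formed string.
    static final String PAGE =
        "<body>"
        + "<table id=\"level1\"><tr><td>"
        + "<table id=\"level2\"><tr><td>"
        + "<table id=\"level3\"><tr><td><p>Test</p></td></tr></table>"
        + "</td></tr></table>"
        + "</td></tr></table>"
        + "<table id=\"level1_table2\"><tr><td>"
        + "<table id=\"level2_table2\"><tr><td></td></tr></table>"
        + "</td></tr></table>"
        + "</body>";

    // Returns the id of every table that has no table among its descendants.
    static List<String> innermostIds(String xml) {
        String expression = "//table[not(.//table)]";
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList tables = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < tables.getLength(); i++) {
                ids.add(((Element) tables.item(i)).getAttribute("id"));
            }
            return ids;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(innermostIds(PAGE)); // prints [level3, level2_table2]
    }
}
```

If only innermost tables that contain a given text are wanted, a second predicate such as [.//*[contains(text(),'Test')]] could be appended to the expression.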
I would use this strategy:
Search for the WebElements that contain the text Test.
For each WebElement, walk up to the first ancestor whose tag name is table.
Here it is in Java:
// Find every element whose own text contains "Test"
List<WebElement> elementsWithTest = driver.findElements(By.xpath("//*[contains(text(),'Test')]"));
List<WebElement> result = new ArrayList<>();
for (WebElement element : elementsWithTest) {
    // Walk up the tree until the nearest enclosing table is reached
    // (this assumes every match has a table ancestor)
    WebElement parent = element.findElement(By.xpath(".."));
    while (!"table".equals(parent.getTagName())) {
        parent = parent.findElement(By.xpath(".."));
    }
    if ("table".equals(parent.getTagName())) {
        result.add(parent);
    }
}
System.out.println(result);
Hope that helps.
I have:
<table class="cast_list">
<tr><td colspan="4" class="castlist_label"></td></tr>
<tr class="odd">
<td class="primary_photo">
<a href="/name/nm0000209/?ref_=ttfc_fc_cl_i1" ><img height="44" width="32" alt="Tim Robbins" title="Tim Robbins"src="http://ia.media-imdb.com/images/G/01/imdb/images/nopicture/32x44/name-2138558783._V379389446_.png"class="loadlate hidden " loadlate="http://ia.media-imdb.com/images/M/MV5BMTI1OTYxNzAxOF5BMl5BanBnXkFtZTYwNTE5ODI4._V1_SY44_CR1,0,32,44_AL_.jpg" /></a> </td>
<td class="itemprop" itemprop="actor" itemscope itemtype="http://schema.org/Person">
<a href="/name/nm0000209/?ref_=ttfc_fc_cl_t1" itemprop='url'> <span class="itemprop" itemprop="name">Tim Robbins</span>
</a> </td>
<td class="ellipsis">
...
</td>
How can I get only the information inside the second td (td class="itemprop")? I want to get "/name/nm0000209/?ref_=ttfc_fc_cl_t1" and "Tim Robbins".
This is my code:
Elements elms = doc.getElementsByClass("cast_list").first().getElementsByTag("table");
Elements tds = elms.select("td");
for (Element td : tds) {
    if (td.attr("class").contains("itemprop")) {
        Elements links = tds.select("a[href]");
        for (Element link : links) {
            if (link.attr("href").contains("name/nm")) {
                String castname = link.text();
                String castImdbId = link.attr("href");
                System.out.println("CastName:" + castname + "\n");
                System.out.println("CastImdbID:" + castImdbId + "\n");
            }
        }
    }
}
but it also returns the text of the link inside td class="primary_photo", which is empty. This is part of my output:
CastName:
CastImdbID:/name/nm0000209/?ref_=ttfc_fc_cl_i1
CastName:Tim Robbins
CastImdbID:/name/nm0000209/?ref_=ttfc_fc_cl_t1
CastName:
......
Could someone please let me know where my problem is? I think the condition if(td.attr("class").contains("itemprop")) does not work at all.
Thanks,
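Use a different CSS selector instead of td. Inside your loop, tds.select("a[href]") searches every cell rather than only the current td, so the class check has no effect on which links are returned. Since the right <td> is identified by its class, why not use it: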
td.itemprop
Your java code then would start like this instead
Elements tds = elms.select("td.itemprop");
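Put together, the extraction could look roughly like the sketch below, which runs on a shortened copy of the markup from the question. The class CastLinks, the helper extract and the "name|href" output format are assumptions for illustration; the selector td.itemprop a[href] simply combines the class selector with the link selector from the question.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CastLinks {
    // Shortened copy of the cast_list row from the question.
    static final String HTML =
        "<table class=\"cast_list\"><tr class=\"odd\">"
        + "<td class=\"primary_photo\"><a href=\"/name/nm0000209/?ref_=ttfc_fc_cl_i1\">"
        + "<img alt=\"Tim Robbins\" title=\"Tim Robbins\"></a></td>"
        + "<td class=\"itemprop\" itemprop=\"actor\"><a href=\"/name/nm0000209/?ref_=ttfc_fc_cl_t1\" itemprop=\"url\">"
        + " <span class=\"itemprop\" itemprop=\"name\">Tim Robbins</span></a></td>"
        + "<td class=\"ellipsis\">...</td>"
        + "</tr></table>";

    // Returns "name|href" for every cast link that sits inside a td with class itemprop.
    static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        List<String> cast = new ArrayList<>();
        for (Element link : doc.select("table.cast_list td.itemprop a[href]")) {
            if (link.attr("href").contains("name/nm")) {
                cast.add(link.text() + "|" + link.attr("href"));
            }
        }
        return cast;
    }

    public static void main(String[] args) {
        extract(HTML).forEach(System.out::println);
    }
}
```

Because the selector starts at td.itemprop, the image link in the primary_photo cell is never visited, so no empty name is printed.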
I am currently experimenting with jsoup and my goal is to extract data from this retail website, in the form of:
Title: blabl
Link: foba
Grösse: 9999
KP: FALSE
Miete: TRUE
Preis: 1923,23
So far I have written this test program:
public class jsoup_test {
    public static void main(String[] args) throws IOException {
        String url = "http://derstandard.at/anzeiger/immoweb/Suchergebnis.aspx?Regionen=9&Bezirke=&Arten=&AngebotTyp=&timestamp=1363305908912";
        print("Fetching %s...", url);
        Document doc = Jsoup.connect(url).get();
        Elements price = doc.select("tr.topangebot");
        Elements price1 = doc.select("tr.white");
        System.out.println("--------------------------------");
        System.out.println(price);
        System.out.println("--------------------------------");
        System.out.println(price1);
    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }
}
However, this program gives me my data like this:
<tr id="ctl00_Body_mc_cErgebnisListe1_ctl02_InseratInfoTR" class="topangebot">
<td class="BildTD" rowspan="2"> <img border="0" src="http://images.derstandard.at/t/22/upload/imagesanzeiger/immoupload/2013/02/27/277515f7-f935-4a13-83fb-dbe3af930e28.jpg" alt="" /> </td>
<td class="TitleTD" rowspan="2"> <span class="neu">TOP!</span> <strong>Gehobene Qualität, Design und exquisite Ausführung: Dachausbau mit Weitblick und 100 m² Terrasse</strong><br />Wien 16.,Ottakring, Dachgeschoss<br /><span style="color: gray">Erstbezug, Küche, Parkettboden, Hauptmiete, Terrasse, Lift, Keller, Altbau, Kabel/Sat-TV, Barrierefrei</span> </td>
<td class="GroessenTD" rowspan="2"> <span class="strong">125 m²</span><br /><span class="strong">4 </span>Zimmer </td>
<td class="PreisTD" style="border:none;"> <span class="light">Miete</span> 2.190 <br /> </td>
</tr>
<tr id="ctl00_Body_mc_cErgebnisListe1_ctl02_MerklisteTR" class="topangebot">
<td class="merkliste"> </td>
</tr>
<tr id="ctl00_Body_mc_cErgebnisListe1_ctl03_InseratInfoTR" class="topangebot">
<td class="BildTD" rowspan="2"> <img border="0" src="http://images.derstandard.at/t/22/upload/imagesanzeiger/immoimporte/justimmo2/files.justimmo.at/public/pic/big/AEs_YegpKC.JPG" alt="" /> </td>
<td class="TitleTD" rowspan="2"> <span class="neu">TOP!</span> <strong>HS-IMMO: 14. PREISSENSATION Eckzinshaus 1414m² Leerstand - Gesamtnutzfläche 1670m² + Rohdachboden ca. 700m² erzielbar ( Baubescheid ) € 1555.-/m² NFL</strong><br />Wien 14.,Penzing, Zinshaus<br /><span style="color: gray">Parkettboden, Altbau, Kabel/Sat-TV</span> </td>
<td class="GroessenTD" rowspan="2"> <span class="strong">1.670 m²</span><br /> </td>
<td class="PreisTD" style="border:none;"> <span class="light">KP</span> 2.590.000 <br /> </td>
</tr>...
This is not in a human-readable format. So my question is: how can I get jsoup to extract the data directly in the format I want?
Thanks for your replies!
For example, to select the title you need to do something like this:
String title = doc.select("tr.topangebot > td.TitleTD").first().text();
you can navigate the page using DOM if you know the page structure:
http://jsoup.org/cookbook/extracting-data/dom-navigation
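Extending that idea to the other fields, a sketch that prints each listing in roughly the requested format might look like this. It runs on a cut-down, hypothetical copy of the rows shown in the question; the class ListingFormatter and the helper format are made up for illustration, and the Link field is left out because the rows shown contain no anchor to read it from.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ListingFormatter {
    // Hypothetical sample, cut down from the rows printed in the question.
    static final String HTML =
        "<table>"
        + "<tr class=\"topangebot\">"
        + "<td class=\"TitleTD\"><span class=\"neu\">TOP!</span> <strong>Dachausbau mit Weitblick</strong></td>"
        + "<td class=\"GroessenTD\"><span class=\"strong\">125 m²</span></td>"
        + "<td class=\"PreisTD\"><span class=\"light\">Miete</span> 2.190</td>"
        + "</tr>"
        + "<tr class=\"topangebot\"><td class=\"merkliste\"></td></tr>"
        + "</table>";

    // Builds one multi-line text block per listing row.
    static List<String> format(String html) {
        Document doc = Jsoup.parse(html);
        List<String> listings = new ArrayList<>();
        for (Element row : doc.select("tr.topangebot, tr.white")) {
            Element title = row.select("td.TitleTD strong").first();
            Element size = row.select("td.GroessenTD span.strong").first();
            Element price = row.select("td.PreisTD").first();
            if (title == null || size == null || price == null) {
                continue; // e.g. the "merkliste" rows carry no listing data
            }
            String kind = price.select("span.light").text(); // "Miete" or "KP"
            listings.add("Title: " + title.text() + "\n"
                + "Grösse: " + size.text() + "\n"
                + "KP: " + (kind.equals("KP") ? "TRUE" : "FALSE") + "\n"
                + "Miete: " + (kind.equals("Miete") ? "TRUE" : "FALSE") + "\n"
                + "Preis: " + price.ownText()); // text of the cell itself, without the span
        }
        return listings;
    }

    public static void main(String[] args) {
        format(HTML).forEach(System.out::println);
    }
}
```

With the live page, Jsoup.connect(url).get() would replace Jsoup.parse(html); the selectors may need adjusting if the site's markup has changed since the question was asked.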
This question has a bunch of good web scrapers
Web scraping with Java
I like to use Jsoup because its methods were literally built for DOM traversal. So, if you are good at HTML, CSS, and jQuery, this library was built for you. Yes, the Jsoup approach may be too fast. Yes, it may not suit your needs. But when it comes to gathering any type of information from any type of website, Jsoup is flexible enough to meet your needs.