How can I find the Table id of the large table on in the following url: http://en.wikipedia.org/wiki/States_and_territories_of_India
I was able to see the classes wikitable sortable jquery-tablesorter
This is the table which has list of states in India. I was able confirm from firebug that this table = wikitable sortable jquery-tablesorter is having the list of states. How can I get the ID of that table?
What is the CSS equivalent to get all the names in that table?
I want to get only the states... the first column. I am using jsoup.
If this is still pending issue, here is how you can get list of states in India :
public static void main(String[] args) throws IOException
{
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/States_and_territories_of_India").get();
Elements tables = doc.select("table");
for (Element table : tables) {
Element tableCaption = table.getElementsByTag("big").first();
if (tableCaption != null && tableCaption.text().equals("States of India")) {
Document statesDoc = Jsoup.parse(table.toString());
Elements states = statesDoc.select("tr td:eq(0)");
for (Element state : states) {
System.out.println(state.text().replaceAll("\\[\\d\\]", ""));
}
}
}
}
There is no ID on that table. If you want to get the content of the table which has the class "wikitable". Use Jsoup with this code
package com.main;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class Main {
public static void main (String args[]){
Document doc;
try {
doc = Jsoup.connect("http://en.wikipedia.org/wiki/States_and_territories_of_India").get();
Elements newsHeadlines = doc.select("table.wikitable").get(0).select("td:eq(0) a");
System.out.println(newsHeadlines.html());
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
So it looks like you're trying to screenscrape this table.
The answer to your question is there there is no id on that particular <table>.
The html that starts the table is:
<table class="wikitable sortable jquery-tablesorter" style="width:70%;">
As you can see there is no id attribute for that element.
What libraries are you using to parse the HTML? In JavaScript you could use document.getElementsByClassName('wikitable')[0] and find that uniquely on the page. But the syntax you would use will depend on what kind of HTML DOM traversing are available to you.
The id element is optional; not every element on a page will have one. This table doesn't.
Using JQuery. You want the first table with classes wikitable sortable jquery-table-sorter.
$(".wikitable.sortable.jquery-table-sorter").first()
Although, the css classes could change at any time so I wouldn't rely on that. It might be worth asking someone who can edit the wiki page to add an id to all the tables.
Related
Very new to JSoup, trying to retrieve a changeable value that is stored within an tag, specifically from the following website and html.
Snapshot of HTML
the results after "consitituency/" are changeable and dependent on the input of the user. I am able to retrieve the h2 tags themselves but not the information within. At the moment the best return I can get is just tags using the method below
The desired return would be something that I can substring down into
Dublin Bay South
The actual return is
<well.col-md-4.h2></well.col-md-4.h2>
private String jSoupTDRequest(String aLine1, String aLine3) throws IOException {
String constit = "";
String h2 = "h2";
String url = "https://www.whoismytd.com/search?utf8=✓&form-input="+aLine1+"%2C+"+aLine3+"+Ireland";
//Switch to try catch if time
Document doc = Jsoup.connect(url)
.timeout(6000).get();
//Scrape elements from relevant section
Elements body = doc.select("well.col-md-4.h2");
Element e = new Element("well.col-md-4.h2");
constit = e.toString();
return constit;
I am extremely new to JSoup and scraping in general. Would appreciate any input from someone who knows what they're doing or any alternate ways to try and get the desired result
Change your scraping elements from relevant section code as follows:
Select the very first <div class="well"> element first.
Element tdsDiv = doc.select("div.well").first();
Select the very first <a> link element next. This link points to the constituency.
Element constLink = tdsDiv.select("a").first();
Get the constituency name by grabbing this link's text content.
constit = constLink.text();
import org.junit.jupiter.api.Test;
import java.io.IOException;
#DisplayName("JSoup, how to return data from a dynamic <a href> tag")
class JsoupQuestionTest {
private static final String URL = "https://www.whoismytd.com/search?utf8=%E2%9C%93&form-input=Kildare%20Street%2C%20Dublin%2C%20Ireland";
#Test
void findSomeText() throws IOException {
String expected = "Dublin Bay South";
Document document = Jsoup.connect(URL).get();
String actual = document.getElementsByAttributeValue("href", "/constituency/dublin-bay-south").text();
Assertions.assertEquals(expected, actual);
}
}
I'm scrapping IMDB chart of 250 movies. I want to store each movie name in an array, but I don't know why it puts all the movie names into the first index, i.e Array[0].
Below is my code.
Can anyone please help me out. I've to complete another project and this is the main thing that is needed.
If you can direct me any website or tutorial I'll be very thankful to you.
try {
Document doc = Jsoup.connect("http://www.imdb.com/chart/top").userAgent("Mozilla").get();
int counterVariable = 0;
for (Element el : doc.select(".lister-list")) {
mString[counterVariable] = el.select(".titleColumn").text();
totalNumberOfLines++;
counterVariable++;
}
} catch (Exception e) {
System.out.println("Sorry website couldn't be opened");
System.out.println(e);
}
System.out.println(mString[0]);// It's putting all the names into this index
The problem is that you have only one element matching selector .lister-list, so iterating over it does not make much sense. When you call el.select(".titleColumn").text(); Jsoup concatenates text from all matching elements. This is why you get all results in one element. Instead you can try to select all td tags with class tittleColumn that are children of tr element that are child of .lister-list
for (Element el : doc.select(".lister-list > tr > td.titleColumn")) {
mString[counterVariable] = el.text();
totalNumberOfLines++;
counterVariable++;
}
More about jsoup css selectors you can learn here.
Hi i need to scrape a web site using JSOUP and i needed to get output in key- value pairs can anyone suggest me.
The url which i need to scrape is https://www.cpsc.gov/Recalls?field_rc_date_value%5Bmin%5D&field_rc_date_value%5Bmax%5D&field_rc_heading_value=&field_rc_hazard_description_value=&field_rc_manufactured_in_value=&field_rc_manufacturers_value=&field_rc_number_value=
The code which i written is:
package com.jaysons;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class ScrapeBody {
public static void main( String[] args ) throws IOException{
String url = "https://www.cpsc.gov/Recalls?field_rc_date_value%5Bmin%5D&field_rc_date_value%5Bmax%5D&field_rc_heading_value=&field_rc_hazard_description_value=&field_rc_manufactured_in_value=&field_rc_manufacturers_value=&field_rc_number_value=";
Document doc = Jsoup.connect(url).get();
Elements content = doc.select("div.views-field views-field-php");
doc = Jsoup.parse( content.html().replaceAll("</div>", "</div><span>")
.replaceAll("<div", "</span><div") );
Elements labels = doc.select("div.remedy");
for (Element label : labels) {
System.out.println(String.format("%s %s", label.text().trim(),
label.nextElementSibling().text()));
}
}
}
i need output in key value pairs like
Date:OCTOBER 20, 2017
remedy:
units:
website:http://www.bosch-home.com/us
phone:(888) 965-5813
kindly let me know where did i do mistake
Theres no need to reassign and re-parse the value of the content variable.
Elements content = doc.select("div.views-field >span");
for (Element viewField : content) {
/*
each viewField corresponds to one
<div class="views-field views-field-php">
<span class="field-content">
<a href="/Recalls/2018/BSH-Home-Appliances-amplía-retiro-del-mercado-de-lavavajillas">
<div class="date">
October 20, 2017
</div>
...
</span>
</div>
*/
Elements divs = viewField.getElementsByTag("div");
for (Element div : divs) {
String className = div.className();
if (className.equals("date")) {
// store and extract date
} else if (className.equals("...")) {
// do something else
} // else...
}
}
Not only you can select subelements by tag, but also by name, by some attributes etc. Check the official documentation for more info: https://jsoup.org/cookbook/extracting-data/dom-navigation
Disclaimer: I could not test the code right now.
I have looked through multiple forms before asking this question.Basically, what i need is to select part of the text in a HTML file. the html is constructed something like this
<div class = "pane big">
<code>
<pre>
SomeText
<a id="par1" href="#par1">¶</a>
MoreText
.
.
.
<a id="par2" href="#par2">¶</a>
MoreText
</pre>
</code>
</div>
So what i need to do, is to extract the text under the href tag par1 by itself and then get the text under par2 href tag separately. i tried to use Jsoup but all i could do is to select the whole text withing the div. Also tried XPath but the expression that I'm evaluating is not accepted. not sure maybe because it's not an XML file to begin with.
and example of XPath expressions that I used is .
/html/body/div/div[2]/code[2]/pre/text()[3]
and CSS
body > div > div.pane.big > code:nth-child(7) > pre
It's not possible to do that with pure CSS selectors, additional extracting and appending logic in Java code needed:
Select pre element
Split it to sequence of text parts by a element as splitter.
Skip 1st element and join two (or more) next parts.
Here simple code sample for that (JDK 1.8 style with stream API and old JDK 1.5 - 1.7 style):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import java.io.File;
import java.io.IOException;
import static java.util.Arrays.stream;
import static java.util.stream.Collectors.joining;
public class SimpleParser {
public static void main(String[] args) throws IOException {
final Document document = Jsoup.parse(new File("div.html"), "UTF-8");
final Elements elements = document.select("div.pane.big pre");
System.out.println("JDK 1.8 style");
System.out.println(
stream(elements.html().split("\\s+<a.+</a>\\s+"))
.skip(1)
.collect(joining("\n")
));
System.out.println("\nJDK 1.7 style");
String[] textParts = elements.html().split("\\s+<a.+</a>\\s+");
StringBuilder resultText = new StringBuilder();
for (int i = 1; i < textParts.length; i++) {
resultText.append(textParts[i] + "\n");
}
System.out.println(resultText.toString());
}
}
P.S. Note that last tag div in your HTML code sample should be closed-tag.
Wait ,so you need the part inside the href tag,right ?Say we have
<a id="par1" href="#iNeedThisPart">¶</a> , then do you want 'iNeedThisPart'?
If that is indeed what you want ,then you need to use the css query a[href] ,which would select all 'a' tags with 'href' attribute. The JSoup code for the same will be as follows:
public List<String> getTextWithinHrefAttribute(final File file) throws IOException{
final List<String> hrefTexts = new ArrayList<>();
final Document document=Jsoup.parse(file,"utf-8");
final Elements ahrefs =document.select("a[href]");
for(final Element ahref : ahrefs ){
hrefTexts.add(ahref.attr("href"));
}
return hrefTexts;
}
I am assuming that you are parsing from a file, and not crawling a web page.
I am trying to scrape the Top Stories section in google news for all the titles. In order to only get the titles in the Top Stories section, I must narrow into this tag:
<div class="section top-stories-section" id=":2r">..</div>
This is the code I use (in Eclipse):
public static void main(String[] args) throws IOException {
// fetches & parses HTML
String url = "http://news.google.com";
Document document = Jsoup.connect(url).get();
// Extract data
Element topStories = document.getElementById(":2r").;
Elements titles = topStories.select("span.titletext");
// Output data
for (Element title : titles) {
System.out.println("Title: " + title.text());
}
}
I always seem to be getting a NullPointerException. It doesn't work either, when I try to reach the Top Stories like this:
Element topStories = document.select("#:2r").first();
Am I missing something? Shouldn't this be working? I am relatively new to this, please help and thank you!
Judging from the error message (and actually looking at the page) that div tag doesn't contain an id attribute. Instead you could select based on the CSS class
Element topStories = document.select("div.section.top-stories-section").first();