Select a particular HTML table with JSOUP - java

I have my code as:
public static void main(String[] args) throws IOException {
    org.jsoup.nodes.Document doc = Jsoup.connect("https://ms.wikipedia.org/wiki/Malaysia").get();
    org.jsoup.select.Elements rows = doc.select("tr");
    for (org.jsoup.nodes.Element row : rows) {
        org.jsoup.select.Elements columns = row.select("td");
        for (org.jsoup.nodes.Element column : columns) {
            System.out.print(column.text());
        }
        System.out.println();
    }
}
It prints every table row on the webpage. Is it possible to print only one selected table from the page?

Try to select a particular table element first and then loop over its nested elements.
public static void main(String[] args) throws IOException {
    Document doc = Jsoup.connect("https://ms.wikipedia.org/wiki/Malaysia").get();
    // get(1) picks the second table with class "wikitable" (the index is zero-based)
    Element table = doc.select("table.wikitable").get(1);
    Elements body = table.select("tbody");
    Elements rows = body.select("tr");
    for (Element row : rows) {
        // each row holds a header cell (th) followed by a value cell (td)
        System.out.print(row.select("th").text());
        System.out.print(row.select("td").text());
        System.out.println();
    }
}
Output:
Ibu negaraKuala Lumpur
Pusat pentadbiranPutrajaya
Tarikh Hari Kebangsaan31 Ogos 1957
Cogan Kata NegaraBersekutu Bertambah Mutu
BenuaAsia, Asia Tenggara
Koordinat Geografi2 30 U, 112 30 T
Jumlah hujan tahunan2000mm ~ 2500mm
IklimTropika dengan suhu 24–35 Darjah Celsius
Bunga kebangsaanBunga Raya
Binatang rasmiHarimau
Puncak tertinggiGunung Kinabalu, Banjaran Crocker (4175m)
Puncak tertinggi SemenanjungGunung Tahan, Banjaran Tahan (2187 m)
Banjaran terpanjangBanjaran Titiwangsa (500 km)
Sungai terpanjangSungai Rajang, Sarawak (563 km)
Sungai terpanjang di SemenanjungSungai Pahang (475 km)
Jambatan terpanjangJambatan Pulau Pinang (13.5 km)
Gua terbesarGua Niah, Sarawak
Bangunan tertinggiMenara Berkembar Petronas (452m)
Negeri terbesarSarawak (124,450 km persegi)
Negeri terkecilPerlis (810 km persegi)
Tempat paling lembapBukit Larut (lebih 5080 mm)
Tempat paling keringJelebu (kurang daripada 1500 mm)
Kawasan paling padatKuala Lumpur (6074/km², 15,543/batu persegi)
Penanaman eksport utamaKelapa sawit dan getah
See the jsoup documentation for more on selecting elements.

A more robust way to do this is to grab the table by its title. The title sits in a span inside the heading that precedes the table, not in the table itself, and CSS has no parent selector, so you can combine a CSS selector with jsoup API calls to get from the title to the table.
public static void main(String[] args) throws IOException {
    Document doc = Jsoup.connect("https://ms.wikipedia.org/wiki/Malaysia").get();
    Element table = doc.select("span#Trivia").parents().first().nextElementSibling();
    Elements rows = table.select("tr");
    for (Element row : rows) {
        String header = row.select("th").text();
        String value = row.select("td").text();
        System.out.println(header + ": " + value);
    }
}
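jsoup's selector syntax also has a `:has(...)` pseudo-selector, which lets a heading be matched by the span it contains, so the same lookup can be written as a single query. The sketch below runs on a small inline HTML snippet instead of the live page. The markup, the class name `TableByHeadingDemo` and the helper `rowsUnder` are made up for illustration, and it assumes the table directly follows the h2, which may not hold for the current Wikipedia markup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class TableByHeadingDemo {
    // Hypothetical markup: two headings, each directly followed by a table.
    static final String HTML =
        "<h2><span id=\"Intro\">Intro</span></h2>"
        + "<table><tr><th>x</th><td>skip</td></tr></table>"
        + "<h2><span id=\"Trivia\">Trivia</span></h2>"
        + "<table><tr><th>Ibu negara</th><td>Kuala Lumpur</td></tr>"
        + "<tr><th>Benua</th><td>Asia</td></tr></table>";

    // Returns "header: value" for every row of the table that
    // immediately follows the h2 containing span#headingId.
    static List<String> rowsUnder(String html, String headingId) {
        Document doc = Jsoup.parse(html);
        Element table = doc.selectFirst("h2:has(span#" + headingId + ") + table");
        List<String> lines = new ArrayList<>();
        if (table == null) {
            return lines; // no such heading, or no table right after it
        }
        for (Element row : table.select("tr")) {
            lines.add(row.select("th").text() + ": " + row.select("td").text());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : rowsUnder(HTML, "Trivia")) {
            System.out.println(line);
        }
    }
}
```

The null check matters on real pages: if the heading is missing or something else sits between the heading and the table, `selectFirst` returns null instead of a table.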

Related

How to get Ticker symbol from table using jsoup?

I'm trying to get the symbols from the table at Yahoo Finance, but I can't figure out why my code doesn't detect the table.
This is what I tried:
public String[] getTrendingTickers() {
    String[] trendingTickers = new String[30];
    int numTickers = 0;
    String url = "https://finance.yahoo.com/trending-tickers/";
    try {
        Document document = Jsoup.connect(url).get();
        for (Element row : document.select("table.W(100%) tr")) {
            String ticker = row.select(
                    ".Fz\\(s\\).Ta\\(start\\)\\!.Bgc\\(\\$lv2BgColor\\).Z\\(1\\).Bgc\\(\\$lv3BgColor\\).Pos\\(st\\).simpTblRow\\:h_Bgc\\(\\$hoverBgColor\\).Pend\\(10px\\).Start\\(0\\).Pend\\(15px\\).Pstart\\(6px\\).Ta\\(start\\).Va\\(m\\)")
                    .text();
            System.out.println(ticker);
            trendingTickers[numTickers] = ticker;
            numTickers++;
        }
    } catch (Exception e) {
        System.out.println(e);
    }
    return trendingTickers;
}
With the error org.jsoup.select.Selector$SelectorParseException: Could not parse query 'table.W(100%).tr': unexpected token at '(100%).tr'
Here is some sample code that creates a list of all the symbols in the table of the page you reference:
Document document = Jsoup.connect("https://finance.yahoo.com/trending-tickers/").get();
Element table = document.select("table tbody").first();
List<String> symbols = new ArrayList<>();
for (Element row : table.select("tr")) {
    symbols.add(row.select("td").first().text());
}
System.out.println(symbols);
See https://jsoup.org/apidocs/org/jsoup/select/Selector.html for details on the selector syntax.
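The parse error in the question comes from the parentheses and percent sign in the class name W(100%), which the selector parser rejects when they are not escaped. One way to sidestep that, sketched below on an inline snippet instead of the live page, is `getElementsByClass`, which takes the class name as plain text and does not parse it as a selector. The markup, the class name `AwkwardClassDemo` and the helper `firstColumn` are made up for illustration, and Yahoo's real markup may differ.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class AwkwardClassDemo {
    // Hypothetical markup: one table whose class name contains characters
    // that are not valid in an unescaped CSS class selector.
    static final String HTML =
        "<table class=\"W(100%)\"><tbody>"
        + "<tr><td>AAA</td><td>1.00</td></tr>"
        + "<tr><td>BBB</td><td>2.00</td></tr>"
        + "</tbody></table>"
        + "<table class=\"other\"><tbody><tr><td>ZZZ</td></tr></tbody></table>";

    // Collects the text of the first cell of every body row in each
    // table that carries the given class name.
    static List<String> firstColumn(String html, String tableClass) {
        Document doc = Jsoup.parse(html);
        List<String> symbols = new ArrayList<>();
        // The class name is matched literally, so no escaping is needed.
        for (Element table : doc.getElementsByClass(tableClass)) {
            for (Element row : table.select("tbody tr")) {
                Element firstCell = row.selectFirst("td");
                if (firstCell != null) {
                    symbols.add(firstCell.text());
                }
            }
        }
        return symbols;
    }

    public static void main(String[] args) {
        System.out.println(firstColumn(HTML, "W(100%)"));
    }
}
```

Generated utility class names like this one tend to change between site releases, so a selector based on document structure, as in the answer above, is usually less brittle.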

How can I print the contents of this HTML table using JSoup?

I will start off by stating that working with HTML and JSoup for that matter is very foreign to me so if this comes across as a stupid question, I apologize.
What I am trying to achieve with my code is to print the contents from the table on this link https://www.stormshield.one/pve/stats/daviddean/sch into my console in a format like this for each entry:
Wall Launcher
50
grade grade grade grade grade
15% ImpactKnockback
42% Reload Speed
15% Impact Knockback
42% Reload Speed
15% ImpactKnockback
42% Durability
My main issue is pretty much supplying the correct name for the table and the rows, once I can do that the formatting isn't really an issue for me.
This is the code I have tried to use to no avail:
public static void main(String[] args) throws IOException {
    Document doc = Jsoup.connect("https://www.stormshield.one/pve/stats/daviddean/sch").get();
    for (Element table : doc.select("table schematics")) {
        for (Element row : table.select("tr")) {
            Elements tds = row.select("td");
            System.out.println(tds.get(0).text() + ":" + tds.get(1).text());
        }
    }
}
You need to find your table element, then its head and rows.
Be careful: the table you want is not always the first() one on the page. I use first() here only as an example.
Here is what you need to do:
Document doc = null;
try {
    doc = Jsoup.connect("https://www.stormshield.one/pve/stats/daviddean/sch").get();
} catch (IOException e) {
    e.printStackTrace();
}
Element table = doc.body().getElementsByTag("table").first();
Element thead = table.getElementsByTag("thead").first();
StringBuilder headBuilder = new StringBuilder();
for (Element th : thead.getElementsByTag("th")) {
    headBuilder.append(th.text());
    headBuilder.append(" ");
}
System.out.println(headBuilder.toString());
Element tbody = table.getElementsByTag("tbody").first();
for (Element tr : tbody.getElementsByTag("tr")) {
    StringBuilder rowBuilder = new StringBuilder();
    for (Element td : tr.getElementsByTag("td")) {
        rowBuilder.append(td.text());
        rowBuilder.append(" ");
    }
    System.out.println(rowBuilder.toString());
}

Jsoup - arrangement of table data from website

I want to get the table from https://ms.wikipedia.org/wiki/Malaysia.
Here is the table I want from the website.
But the result is not what I want.
I have got 2 questions:
First, how can I arrange the output in rows and columns, like the table in my picture? Here is the code I use to get the data:
String URL = "https://ms.wikipedia.org/wiki/Malaysia";
Document doc = Jsoup.connect(URL).get();
Elements trs = doc.select("#mw-content-text > div > table:nth-child(148)");
String currentRow = null;
for (Element tr : trs) {
    Elements tdDay = tr.select("tr:has(th)");
    currentRow = tdDay.text();
    System.out.print(currentRow);
}
Second, is the selector in my code the best way to scrape one particular element from a page such as https://ms.wikipedia.org/wiki/Malaysia? I am using
Elements trs = doc.select("#mw-content-text > div > table:nth-child(148)");
The page has three tables with the class wikitable (<table class="wikitable">), so how can I select only one particular table?
The page you linked contains several wikitable tables, so you need to work out the selector for the data you want. The rows of that table keep their data in <th> and <td> cells, which you can select row by row:
for (int i = 1; i <= x; i++) {
    Elements trs = doc.select("#mw-content-text > div > table:nth-child(148) > tbody > tr:nth-child(" + i + ") > th");
    Elements tds = doc.select("#mw-content-text > div > table:nth-child(148) > tbody > tr:nth-child(" + i + ") > td");
    // use trs.text() and tds.text() here
}
Here x is the number of rows in the table. The loop starts at 1 because nth-child counts from 1.
public static void main(String[] args) throws IOException {
    String URL = "https://ms.wikipedia.org/wiki/Malaysia";
    Document doc = Jsoup.connect(URL).get();
    // Select the element that directly follows the h2 containing "Trivia"
    // and whose class attribute is exactly "wikitable"
    Element table = doc.select("h2:contains(Trivia)+[class=\"wikitable\"]").first();
    // Then select each row of the table
    Elements trs = table.select("tr");
    // For each row, take the first and second child, i.e. columns one and two
    for (Element tr : trs) {
        Element th = tr.child(0);
        Element td = tr.child(1);
        System.out.printf("%-40s %-40s%n", th.text(), td.text());
    }
}
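The `%-40s` conversions in the printf call above are what line the output up in columns: each value is left-justified and padded with spaces to a fixed width. Here is a stand-alone sketch of that formatting with plain Java and no jsoup. The class `ColumnFormatDemo` and the helper `formatRow` are hypothetical names, and the width of 20 is an arbitrary choice.

```java
class ColumnFormatDemo {
    // Left-justifies both values and pads each one to `width` characters,
    // with a single space between the two columns.
    static String formatRow(String header, String value, int width) {
        String pattern = "%-" + width + "s %-" + width + "s";
        return String.format(pattern, header, value);
    }

    public static void main(String[] args) {
        // Two rows taken from the Trivia table output shown earlier.
        System.out.println(formatRow("Ibu negara", "Kuala Lumpur", 20));
        System.out.println(formatRow("Pusat pentadbiran", "Putrajaya", 20));
    }
}
```

Values longer than the width are not truncated, so pick a width at least as long as the longest header if the second column has to stay aligned.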

Get all <p> texts after <div> and between <h2> by using Jsoup

<h2><span class="mw-headline" id="The_battle">The battle</span></h2>
<div class="thumb tright"></div>
<p>text I want</p>
<p>text I want</p>
<p>text I want</p>
<p>text I want</p>
<h2>Second Title I want to stop collecting p tags after</h2>
I am learning Jsoup by trying to scrape all the p tags from a Wikipedia page, arranged by title. I can scrape all the p tags between h2 tags, with the help of this question:
extract unidentified html content from between two tags, using jsoup? regex?
by using
Elements elements = docx.select("span.mw-headline, h2 ~ p");
but I can't scrape them when there is a <div> between them. Here is the Wikipedia page I am working on:
https://simple.wikipedia.org/wiki/Battle_of_Hastings
How can I grab all the p tags where they are between two specific h2 tags?
Preferably ordered by id.
Try this option: Elements elements = doc.select("span.mw-headline, h2 ~ div, h2 ~ p");
Sample code:
package jsoupex;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Example program that prints each section headline of a page,
 * followed by the text of the div and p elements selected with it.
 */
public class stackoverflw {
    public static void main(String[] args) throws IOException {
        String url = "https://simple.wikipedia.org/wiki/Battle_of_Hastings";
        System.out.println("Fetching " + url + "...");
        Document doc = Jsoup.connect(url).get();
        Elements elements = doc.select("span.mw-headline, h2 ~ div, h2 ~ p");
        for (Element elem : elements) {
            // Frame every headline with a line of asterisks above and below
            if (elem.hasClass("mw-headline")) {
                System.out.println("************************");
            }
            System.out.println(elem.text());
            if (elem.hasClass("mw-headline")) {
                System.out.println("************************");
            } else {
                System.out.println("");
            }
        }
    }
}
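import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public static void main(String[] args) {
    String entity =
        "<h2><span class=\"mw-headline\" id=\"The_battle\">The battle</span></h2>" +
        "<div class=\"thumb tright\"></div>" +
        "<p>text I want</p>" +
        "<p>text I want</p>" +
        "<p>text I want</p>" +
        "<p>text I want</p>" +
        "<h2>Second Title I want to stop collecting p tags after</h2>";
    // Parse with the XML parser so the fragment is not wrapped in html/body
    Document element = org.jsoup.Jsoup.parse(entity, "", Parser.xmlParser());
    element.outputSettings().prettyPrint(false);
    element.outputSettings().outline(false);
    List<TextNode> text = getAllTextNodes(element);
    for (TextNode t : text) {
        System.out.println(t.text());
    }
}

// Collects the direct text nodes of every element in the document
private static List<TextNode> getAllTextNodes(Element newElementValue) {
    List<TextNode> textNodes = new ArrayList<>();
    Elements elements = newElementValue.getAllElements();
    for (Element e : elements) {
        for (TextNode t : e.textNodes()) {
            textNodes.add(t);
        }
    }
    return textNodes;
}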
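To collect only the paragraphs that sit between one particular h2 and the next one, another option is to find the heading by the id of its headline span and then walk its following siblings until the next h2. The sketch below does that on an inline snippet modelled on the one in the question. It assumes the paragraphs are direct siblings of the h2, which may not hold on the live page, and the names `SectionParagraphsDemo` and `paragraphsAfter` are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class SectionParagraphsDemo {
    // Hypothetical markup modelled on the snippet in the question.
    static final String HTML =
        "<h2><span class=\"mw-headline\" id=\"The_battle\">The battle</span></h2>"
        + "<div class=\"thumb tright\"></div>"
        + "<p>first</p>"
        + "<p>second</p>"
        + "<h2>Second title</h2>"
        + "<p>not wanted</p>";

    // Returns the text of every p between the h2 that contains the
    // element with the given id and the next h2 sibling.
    static List<String> paragraphsAfter(Document doc, String headlineId) {
        List<String> texts = new ArrayList<>();
        Element heading = doc.getElementById(headlineId);
        // Climb from the headline span up to its enclosing h2.
        while (heading != null && !heading.tagName().equals("h2")) {
            heading = heading.parent();
        }
        if (heading == null) {
            return texts; // id not found, or not inside an h2
        }
        // Walk the following siblings and stop at the next h2.
        for (Element sib = heading.nextElementSibling();
             sib != null && !sib.tagName().equals("h2");
             sib = sib.nextElementSibling()) {
            if (sib.tagName().equals("p")) {
                texts.add(sib.text()); // divs and other elements are skipped
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        for (String text : paragraphsAfter(doc, "The_battle")) {
            System.out.println(text);
        }
    }
}
```

Because the walk skips anything that is not a p, a div between the heading and the paragraphs does not interrupt the collection.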

How to read word document and get parts of it with all styles using docx4j

I am using docx4j to work with Word document formatting. I have one Word document that is divided into a number of tables. I want to read all the tables and, when I find certain keywords, copy that content to another Word document with all its formatting. My Word document is as follows.
From the document above, I want to take the content that is below Some Title. Here my keyword is Sample Text, so whenever Sample Text is repeated, the content needs to be copied to the new Word document.
I am using the following code.
MainDocumentPart mainDocumentPart = null;
WordprocessingMLPackage docxFile = WordprocessingMLPackage.load(new File(fileName));
mainDocumentPart = docxFile.getMainDocumentPart();
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
ClassFinder finder = new ClassFinder(Tbl.class);
new TraversalUtil(mainDocumentPart.getContent(), finder);
Tbl tbl = null;
int noTbls = 0;
int noRows = 0;
int noCells = 0;
int noParas = 0;
int noTexts = 0;
for (Object table : finder.results) {
    noTbls++;
    tbl = (Tbl) table;
    // Get all the Rows in the table
    List<Object> allRows = DocxUtility.getDocxUtility()
            .getAllElementFromObject(tbl, Tr.class);
    for (Object row : allRows) {
        Tr tr = (Tr) row;
        noRows++;
        // Get all the Cells in the Row
        List<Object> allCells = DocxUtility.getDocxUtility()
                .getAllElementFromObject(tr, Tc.class);
        for (Object cell : allCells) {
            Tc tc = (Tc) cell;
            noCells++;
            // Get all the Paragraphs in the Cell
            List<Object> allParas = DocxUtility.getDocxUtility()
                    .getAllElementFromObject(tc, P.class);
            for (Object para : allParas) {
                P p = (P) para;
                noParas++;
                // Get all the Runs in the Paragraph
                List<Object> allRuns = DocxUtility.getDocxUtility()
                        .getAllElementFromObject(p, R.class);
                for (Object run : allRuns) {
                    R r = (R) run;
                    // Get the Text in the Run
                    List<Object> allText = DocxUtility.getDocxUtility()
                            .getAllElementFromObject(r, Text.class);
                    for (Object text : allText) {
                        noTexts++;
                        Text txt = (Text) text;
                    }
                    System.out.println("No of Text in Para No: " + noParas + " are: " + noTexts);
                }
            }
            System.out.println("No of Paras in Cell No: " + noCells + " are: " + noParas);
        }
        System.out.println("No of Cells in Row No: " + noRows + " are: " + noCells);
    }
    System.out.println("No of Rows in Table No: " + noTbls + " are: " + noRows);
}
System.out.println("Total no of Tables: " + noTbls);
Assuming your text is in a single run (i.e. not split across runs), you can search for it via XPath, or you can traverse the document manually using TraversalUtil. See docx4j's Getting Started guide for more info.
So finding your content is fairly easy. Copying the formatting it uses, and any rels (relationships) in it, is complicated in the general case. See my post http://www.docx4java.org/blog/2010/11/merging-word-documents/ for more on the issues involved.
