Jsoup - arrangement of table data from website - java

I want to get the table from https://ms.wikipedia.org/wiki/Malaysia.
Here is the table I want from the website.
But the result is not what I want.
I have two questions:
First, how can I arrange the output in rows and columns, like the table in my picture? Below is the code I use to get the data.
String URL = "https://ms.wikipedia.org/wiki/Malaysia";
Document doc = Jsoup.connect(URL).get();
Elements trs = doc.select("#mw-content-text > div > table:nth-child(148)");
String currentRow = null;
for (Element tr : trs){
Elements tdDay = tr.select("tr:has(th)");
currentRow = tdDay.text();
System.out.print(currentRow);
}
Second, is my selector the best way to scrape one particular element from a page such as https://ms.wikipedia.org/wiki/Malaysia? I am currently using
Elements trs = doc.select("#mw-content-text > div > table:nth-child(148)");
The page has three tables with the class wikitable (<table class="wikitable">), so how can I select only one particular table?

The page you linked contains several wikitables, so first work out the selector for the table's data. The cells are <th> and <td> elements, and you can select them row by row:
for (int i = 1; i <= x; i++) {
// nth-child is 1-based, so i runs from 1 to x
Elements ths = doc.select("#mw-content-text > div > table:nth-child(148) > tbody > tr:nth-child(" + i + ") > th");
Elements tds = doc.select("#mw-content-text > div > table:nth-child(148) > tbody > tr:nth-child(" + i + ") > td");
System.out.println(ths.text() + " " + tds.text());
}
Here x is the number of rows in the table, so the loop scrapes every row.

public static void main(String[] args) throws IOException{
String URL = "https://ms.wikipedia.org/wiki/Malaysia";
Document doc = Jsoup.connect(URL).get();
//Select the table which is under the header containing "Trivia"
//having the value "wikitable" for the class attribute
Element table = doc.select("h2:contains(Trivia)+[class=\"wikitable\"]").first();
//then select each row of the table
Elements trs = table.select("tr");
//for each row get first and second child corresponding to column 1 and two of table
for (Element tr : trs){
Element th = tr.child(0);
Element td = tr.child(1);
System.out.printf("%-40s %-40s%n",th.text(), td.text());
}
}
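The printf format in the answer above is what lines the output up in columns. Here is a minimal sketch of just that part, using plain Java and no jsoup. The class name, the helper formatRows and the sample rows are made up for illustration; in real code the two strings would come from th.text() and td.text(). It also skips rows with fewer than two cells, which is an assumption about how you would want header-only rows handled.

```java
import java.util.List;

class TableFormatDemo {

    // Formats each two-cell row as two left-aligned columns of the given width.
    static String formatRows(List<String[]> rows, int width) {
        String pattern = "%-" + width + "s %-" + width + "s%n";
        StringBuilder out = new StringBuilder();
        for (String[] row : rows) {
            // Skip rows that do not have both a header cell and a value cell.
            if (row.length < 2) {
                continue;
            }
            out.append(String.format(pattern, row[0], row[1]));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample rows standing in for the scraped cells.
        List<String[]> rows = List.of(
                new String[] {"Ibu negara", "Kuala Lumpur"},
                new String[] {"Tajuk sahaja"},
                new String[] {"Mata wang", "Ringgit"});
        System.out.print(formatRows(rows, 12));
    }
}
```

The width passed to formatRows plays the same role as the 40 in "%-40s": each cell is padded on the right to that many characters.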

Related

Parsing currency exchange data from https://uzmanpara.milliyet.com.tr/doviz-kurlari/

I wrote this code with some help. It works the first 10 times, but after that it gives me null values:
String url = "https://uzmanpara.milliyet.com.tr/doviz-kurlari/";
//Document doc = Jsoup.parse(url);
Document doc = null;
try {
doc = Jsoup.connect(url).timeout(6000).get();
} catch (IOException ex) {
Logger.getLogger(den3.class.getName()).log(Level.SEVERE, null, ex);
}
int i = 0;
String[] currencyStr = new String[11];
String[] buyStr = new String[11];
String[] sellStr = new String[11];
Elements elements = doc.select(".borsaMain > div:nth-child(2) > div:nth-child(1) > table.table-markets");
for (Element element : elements) {
Elements curreny = element.parent().select("td:nth-child(2)");
Elements buy = element.parent().select("td:nth-child(3)");
Elements sell = element.parent().select("td:nth-child(4)");
System.out.println(i);
currencyStr[i] = curreny.text();
buyStr[i] = buy.text();
sellStr[i] = sell.text();
System.out.println(String.format("%s [buy=%s, sell=%s]",
curreny.text(), buy.text(), sell.text()));
i++;
}
for(i = 0; i < 11; i++){
System.out.println("currency: " + currencyStr[i]);
System.out.println("buy: " + buyStr[i]);
System.out.println("sell: " + sellStr[i]);
}
That is the code. I guess it is a connection problem, but I could not solve it. I use NetBeans. Do I have to change the connection properties in NetBeans, or should I add something more to the code?
Can you help me?
There's nothing wrong with the connection. Your query simply doesn't match the page structure.
Somewhere on your page, there's an element with class borsaMain, that has a direct child with class detL. And then somewhere in the descendants tree of detL, there is your table. You can write this as the following CSS element selector query:
.borsaMain > .detL table
There will be two tables in the result, but I suspect you are looking for the first one.
So basically, you want something like:
Element table = doc.selectFirst(".borsaMain > .detL table");
for (Element row : table.select("tr:has(td)")) {
// your existing loop code
}

Selenium web table: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1

I need to find the select and input tags in a web table, so I wrote the code below to get the tag name in each row.
The steps are:
Create a list of elements to get the number of rows in the table.
Declare a variable i for looping.
Find the select tag in each row. If the row contains a select tag, pass it the input value; otherwise pass a different value to the input tag.
// web Table
WebElement table = d.findElement(By.xpath("//*[@id='ui-grid']/div/div/div/div[2]/table/tbody"));
List<WebElement> trcount = table.findElements(By.tagName("tr"));
int size = trcount.size();
System.out.println(size);
//Using size created the for loop to find each row available in table.
for(int i=1;i<size;i++) {
//Declare the Xpath to find the particular row
By tag = By.xpath("(//*[@id='ui-grid']/div/div/div/div[2]/table/tbody/tr/td/span/select)["+i+"]");
By Input_tag = By.xpath("(//*[@id='ui-grid']/div/div/div/div[2]/table/tbody/tr/td/span/input)["+i+"]");
List<WebElement> tdcount = trcount.get(i).findElements(tag);
String tag1 = tdcount.get(i).getTagName();
System.out.println(tag1);
if(tag1.equals("select")){
Select level = new Select(d.findElement(tag));
level.selectByVisibleText("YES");
}else {
d.findElement(Input_tag).sendKeys("12");
}
}
Expected Result:
If a row contains a select tag, a value should be selected from the dropdown; otherwise, if it contains an input tag, a value should be typed into it.
Actual Result:
java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
//Get the number of count in web table.
WebElement table = d.findElement(By.xpath("//*[@id='ui-grid']/div/div/div/div[2]/table/tbody"));
List<WebElement> trcount = table.findElements(By.tagName("tr"));
int size = trcount.size();
System.out.println(size);
//Select particular tag(Select tag)
List<WebElement> Select = table.findElements(By.xpath("//*[@id='ui-grid']/div/div/div/div[2]/table/tbody/tr/td/span/select"));
int select_size = Select.size();
System.out.println(select_size);
//If the web table has select tags, process them; the catch handles a NoSuchElementException.
try{
for(int j=1;j<=select_size;j++) {
By tag = By.xpath("(//*[@class='ng-star-inserted']/span/select)["+j+"]");
System.out.println("Test");
}
}catch(Exception e){
System.out.println((e.getMessage()));
}
//Select particular tag(Input tag)
List<WebElement> Select1 = table.findElements(By.xpath("//*[@id='ui-grid']/div/div/div/div[2]/table/tbody/tr/td/span/input"));
int select_input = Select1.size();
System.out.println(select_input);
try{
for(int i=1;i<=select_input;i++) {
By tag1 = By.xpath("(//*[@class='ng-star-inserted']/span/input)["+i+"]");
System.out.println("Test");
d.findElement(tag1).sendKeys("12345");
Thread.sleep(3000);
//d.findElement(Accept_button).click();
}
}catch(Exception e){
System.out.println((e.getMessage()));
}

Missing Table Elements When Scraping

URL: https://stats.nba.com/player/1628381/defense-dash/
Attempting to get:
`<table>
<tbody>
<!----><tr data-ng-repeat="(i, row) in page" index="0">
<td class="player">Overall</td>
<td>45</td>
<td>45</td>
<td>5.7</td>
<td>12.3</td>
<td>46.6</td>
<td>100%</td>
<td>46.7</td>
<td>-0.1</td>
</tr><!---->
</tbody>
</table> `
My code:
public static void getData(String url, String Name, int ID) throws
IOException
{
String html = Jsoup.connect(url).execute().body();
html = html.replaceAll("<!---->", "");
html = html.replaceAll("<!--", "");
html = html.replaceAll("-->", "");
Document doc = Jsoup.parse(html);
Elements tableElements = doc.select("table");
System.out.println("Elements " + tableElements);
for (Element tableElement : tableElements)
{
String tableId = tableElement.id();
if (tableId.isEmpty()) {
continue;
}
String fileName = "table" + Name + tableId + ID + ".csv";
System.out.println(fileName);
FileWriter writer = new FileWriter(new File("C:\\Users\\noman\\eclipse-workspace\\Senior Project\\src\\", fileName));
//System.out.println(doc);
Elements tableRowElements = tableElement.select(":not(thead) tr td");
for (int i = 0; i < tableRowElements.size(); i++) {
Element row = tableRowElements.get(i);
Elements rowItems = row.select("td");
for (int j = 0; j < rowItems.size(); j++) {
writer.append(rowItems.get(j).text());
if (j != rowItems.size() - 1) {
writer.append(',');
}
}
writer.append('\n');
}
// close the file for this table before moving on to the next one
writer.close();
}
}
The problem is that no elements are found. The same code works perfectly on another site, which (seemingly) stores its data no differently.
Is there something about this website that prevents web scraping, or perhaps a subtle difference?
Please note that the HTML provided is a shortened version.
As said in the comments, the data you are looking for is loaded dynamically, but you can fetch it with a simple GET request to this link:
https://stats.nba.com/stats/playerdashptshotdefend?DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&Period=0&PlayerID=1628381&Season=2018-19&SeasonSegment=&SeasonType=Regular+Season&TeamID=0&VsConference=&VsDivision=
EDIT
To find this link, I used the browser's developer tools and checked the XHR requests.
You can see that the link includes several parameters, among them PlayerID, which is identical to the number that appears in your initial link. By changing its value you can get the stats of other players.
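To make the parameter swap concrete, here is a rough sketch that rebuilds the link above and varies only PlayerID and Season. It uses the JDK's own java.net.http client (Java 11+) instead of jsoup. The class and method names are mine, and I have not verified what the server requires: it may reject requests that lack browser-like headers, so treat fetch as an untested assumption.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.stream.Collectors;

class PlayerStatsUrl {

    static final String ENDPOINT = "https://stats.nba.com/stats/playerdashptshotdefend";

    // Rebuilds the query string of the link above, varying only PlayerID and Season.
    static String buildUrl(int playerId, String season) {
        String[][] params = {
            {"DateFrom", ""}, {"DateTo", ""}, {"GameSegment", ""}, {"LastNGames", "0"},
            {"LeagueID", "00"}, {"Location", ""}, {"Month", "0"}, {"OpponentTeamID", "0"},
            {"Outcome", ""}, {"PORound", "0"}, {"PerMode", "PerGame"}, {"Period", "0"},
            {"PlayerID", String.valueOf(playerId)}, {"Season", season},
            {"SeasonSegment", ""}, {"SeasonType", "Regular Season"}, {"TeamID", "0"},
            {"VsConference", ""}, {"VsDivision", ""}
        };
        String query = Arrays.stream(params)
                .map(p -> p[0] + "=" + URLEncoder.encode(p[1], StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
        return ENDPOINT + "?" + query;
    }

    // Sends the GET request and returns the raw response body.
    // Needs network access; extra headers may be required (not verified here).
    static String fetch(String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }

    public static void main(String[] args) {
        // Only prints the URL; call fetch(...) yourself to download the data.
        System.out.println(buildUrl(1628381, "2018-19"));
    }
}
```

With the player ID and season from the question, buildUrl reproduces the link above exactly, so you can check it against what the developer tools show before changing any values.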

Trouble Getting information from html tables in java

I want to get information from the first table on this site:
Link
This is the code I have:
Document document = Jsoup.parse(DownloadPage("http://www.transtejo.pt/clientes/horarios" +
"-ligacoes-fluviais/ligacao-barreiro-terreiro-do-paco/#dias-uteis"));
Elements table = document.select("table.easy-table-creator:nth-child(1) tbody");
Elements trAll = table.select("tr");
//For the Table Hour
Elements tr_first = table.select("tr:nth-child(1)");
Element tr = tr_first.get(1);
Elements td = tr.getElementsByTag("td");
for(int i = 0; i < td.size(); i++) {
Log.d("TIME TABLE:"," " + td.get(i).text());
for(int i1 = 1; i1 < trAll.size(); i1++) {
Elements td_inside = trAll.get(i1).getElementsByTag("td");
Log.d("TD INSIDE:"," " + td_inside.get(i).text());
}
}
Right now I am able to get information, but the problem is that I am getting content from other tables: all the tables have the same class name, and I am having trouble specifying the table I need. I am also getting an IndexOutOfBoundsException.
This is the log:
Loglink
The kind of log I want is something like this:
The hour (TIME TABLE), followed by all the minutes listed below it (TD INSIDE) for that hour, then on to the next hour (...)
Thanks for your time.
[EDIT]
Better log example
Check first table.
TIME TABLE: 05H
TD INSIDE: 15
TD INSIDE: 45
TIME TABLE: 06H
TD INSIDE: 15
TD INSIDE: 35
TD INSIDE: 45
TD INSIDE: 55
TIME TABLE: 07H
TD INSIDE: 05
TD INSIDE: 15
TD INSIDE: 20
TD INSIDE: 25
TD INSIDE: 35
TD INSIDE: 40
TD INSIDE: 50
TD INSIDE: 55
(...)
You can do it like this:
Element table = document
.select("table.easy-table-creator:nth-child(1) tbody").first();
Elements trAll = table.select("tr");
Elements trAllBody = table.select("tr:not(:first-child)");
// For the Table Hour
Element trFirst = trAll.first();
Elements tds = trFirst.select("td");
for(int i = 0; i < tds.size(); i++){
Element td = tds.get(i);
Log.d("TIME TABLE:", " " + td.text());
String query = "td:nth-child(" + (i + 1) + ")";
Elements subTds = trAllBody.select(query);
for (int j = 0; j < subTds.size(); j++) {
Element subTd = subTds.get(j);
String tdText = subTd.text();
if(!tdText.isEmpty()){
Log.d("TD INSIDE:", " " + subTd.text());
}
}
}
Some interesting points:
your table.easy-table-creator:nth-child(1) tbody selector was selecting all the tables in the page;
with a progressive select you can retrieve all the tds in a given column: td:nth-child(index);
trAllBody here contains all the trs that are not the first one (using the tr:not(:first-child) selector).
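The column-by-column walk described in the points above can also be sketched without jsoup, which makes the indexing easier to see. Everything here is illustrative: the class name, the walkColumns helper and the sample rows (modelled on the log in the question) are assumptions, with plain lists of strings standing in for the tr and td elements. The bounds check is there because rows shorter than the header row are one likely source of the IndexOutOfBoundsException.

```java
import java.util.ArrayList;
import java.util.List;

class ColumnWalkDemo {

    // Returns log-style lines: each header cell followed by the non-empty cells below it.
    static List<String> walkColumns(List<List<String>> rows) {
        List<String> lines = new ArrayList<>();
        if (rows.isEmpty()) {
            return lines;
        }
        List<String> header = rows.get(0);
        for (int col = 0; col < header.size(); col++) {
            lines.add("TIME TABLE: " + header.get(col));
            for (int r = 1; r < rows.size(); r++) {
                List<String> row = rows.get(r);
                // Guard against rows that are shorter than the header row.
                if (col < row.size() && !row.get(col).isEmpty()) {
                    lines.add("TD INSIDE: " + row.get(col));
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical sample: first row holds the hours, the rest hold minutes.
        List<List<String>> rows = List.of(
                List.of("05H", "06H"),
                List.of("15", "15"),
                List.of("45", "35"),
                List.of("", "45"));
        walkColumns(rows).forEach(System.out::println);
    }
}
```

The outer loop fixes a column and the inner loop walks down the rows, which is the same order the td:nth-child(index) query gives you in the jsoup version.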

How to generate XPath query matching a specific element in Jsoup?

Hi, this is my web page:
<html>
<head>
</head>
<body>
<div> text div 1</div>
<div>
<span>text of first span </span>
<span>text of second span </span>
</div>
<div> text div 3 </div>
</body>
</html>
I'm using jsoup to parse it, and then I walk through all the elements in the page and get their paths:
Document doc = Jsoup.parse(new File("C:\\Users\\HC\\Desktop\\dataset\\index.html"), "UTF-8");
Elements elements = doc.body().select("*");
ArrayList all = new ArrayList();
for (Element element : elements) {
if (!element.ownText().isEmpty()) {
StringBuilder path = new StringBuilder(element.nodeName());
String value = element.ownText();
Elements p_el = element.parents();
for (Element el : p_el) {
path.insert(0, el.nodeName() + '/');
}
all.add(path + " = " + value + "\n");
System.out.println(path +" = "+ value);
}
}
return all;
My code gives me this result:
html/body/div = text div 1
html/body/div/span = text of first span
html/body/div/span = text of second span
html/body/div = text div 3
In fact, I want to get a result like this:
html/body/div[1] = text div 1
html/body/div[2]/span[1] = text of first span
html/body/div[2]/span[2] = text of second span
html/body/div[3] = text div 3
Could anyone please give me an idea of how to reach this result? Thanks in advance.
As asked, here is an idea.
I'm quite sure there are better solutions for getting the XPath of a given node, for example using XSLT as in the answer to "Generate/get xpath from XML node java".
Here is a possible solution based on your current attempt.
For each (parent) element, check whether there is more than one element with this name.
Pseudo code: if (count(el.select('../' + el.nodeName())) > 1)
If true, count the preceding-sibling:: elements with the same name and add 1.
count(el.select('preceding-sibling::' + el.nodeName())) + 1
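The pseudo code above could look roughly like this in real Java. Note that this sketch swaps jsoup for the JDK's built-in org.w3c.dom parser so that it runs without extra dependencies; that only works because the sample page happens to be well-formed XML, which real-world HTML often is not. The class and method names are mine, not part of any library.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class IndexedPathDemo {

    // Counts sibling elements with the same tag name, before or after the given element.
    static int countSameName(Element el, boolean preceding) {
        int count = 0;
        Node n = preceding ? el.getPreviousSibling() : el.getNextSibling();
        while (n != null) {
            if (n instanceof Element && n.getNodeName().equals(el.getNodeName())) {
                count++;
            }
            n = preceding ? n.getPreviousSibling() : n.getNextSibling();
        }
        return count;
    }

    // Builds a path such as html/body/div[2]/span[1]; an index is added only
    // when the parent holds more than one element with that name.
    static String pathOf(Element el) {
        StringBuilder path = new StringBuilder();
        for (Node n = el; n instanceof Element; n = n.getParentNode()) {
            Element e = (Element) n;
            int before = countSameName(e, true);
            int after = countSameName(e, false);
            String step = e.getNodeName();
            if (before + after > 0) {
                step += "[" + (before + 1) + "]";
            }
            path.insert(0, path.length() == 0 ? step : step + "/");
        }
        return path.toString();
    }

    // Concatenates the direct text children of an element, trimmed.
    static String ownText(Element el) {
        StringBuilder text = new StringBuilder();
        for (Node c = el.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                text.append(c.getNodeValue());
            }
        }
        return text.toString().trim();
    }

    // Returns one "path = text" line per element that has text of its own.
    static List<String> describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList elements = doc.getElementsByTagName("*");
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < elements.getLength(); i++) {
                Element el = (Element) elements.item(i);
                String text = ownText(el);
                if (!text.isEmpty()) {
                    lines.add(pathOf(el) + " = " + text);
                }
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head></head><body><div> text div 1</div>"
                + "<div><span>text of first span </span><span>text of second span </span></div>"
                + "<div> text div 3 </div></body></html>";
        describe(page).forEach(System.out::println);
    }
}
```

For the page in the question this gives html/body/div[1], html/body/div[2]/span[1], html/body/div[2]/span[2] and html/body/div[3], each followed by its text. With jsoup the same counting could be done over the element's sibling elements, but I have not written that version out here.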
This is my solution to this problem:
StringBuilder absPath=new StringBuilder();
Elements parents = htmlElement.parents();
for (int j = parents.size()-1; j >= 0; j--) {
Element element = parents.get(j);
absPath.append("/");
absPath.append(element.tagName());
absPath.append("[");
absPath.append(element.siblingIndex());
absPath.append("]");
}
This would be easier if you traversed the document from the root to the leaves instead of the other way round. That way you can easily group the elements by tag name and handle multiple occurrences accordingly. Here is a recursive approach:
private final List<String> path = new ArrayList<>();
private final List<String> all = new ArrayList<>();
public List<String> getAll() {
return Collections.unmodifiableList(all);
}
public void parse(Document doc) {
path.clear();
all.clear();
parse(doc.children());
}
private void parse(List<Element> elements) {
if (elements.isEmpty()) {
return;
}
Map<String, List<Element>> grouped = elements.stream().collect(Collectors.groupingBy(Element::tagName));
for (Map.Entry<String, List<Element>> entry : grouped.entrySet()) {
List<Element> list = entry.getValue();
String key = entry.getKey();
if (list.size() > 1) {
int index = 1;
// use paths with index
key += "[";
for (Element e : list) {
path.add(key + (index++) + "]");
handleElement(e);
path.remove(path.size() - 1);
}
} else {
// use paths without index
path.add(key);
handleElement(list.get(0));
path.remove(path.size() - 1);
}
}
}
private void handleElement(Element e) {
String value = e.ownText();
if (!value.isEmpty()) {
// add entry
all.add(path.stream().collect(Collectors.joining("/")) + " = " + value);
}
// process children of element
parse(e.children());
}
Here is the solution in Kotlin. It's correct, and it works. The other answers are wrong and caused me hours of lost work.
fun Element.xpath(): String = buildString {
val parents = parents()
for (j in (parents.size - 1) downTo 0) {
val parent = parents[j]
append("/*[")
append(parent.siblingIndex() + 1)
append(']')
}
append("/*[")
append(siblingIndex() + 1)
append(']')
}
