Trouble getting information from HTML tables in Java

I want to get information from the first table on this page:
http://www.transtejo.pt/clientes/horarios-ligacoes-fluviais/ligacao-barreiro-terreiro-do-paco/#dias-uteis
This is the code I have:
Document document = Jsoup.parse(DownloadPage("http://www.transtejo.pt/clientes/horarios" +
        "-ligacoes-fluviais/ligacao-barreiro-terreiro-do-paco/#dias-uteis"));

Elements table = document.select("table.easy-table-creator:nth-child(1) tbody");
Elements trAll = table.select("tr");

// For the table hour
Elements tr_first = table.select("tr:nth-child(1)");
Element tr = tr_first.get(1);
Elements td = tr.getElementsByTag("td");

for (int i = 0; i < td.size(); i++) {
    Log.d("TIME TABLE:", " " + td.get(i).text());
    for (int i1 = 1; i1 < trAll.size(); i1++) {
        Elements td_inside = trAll.get(i1).getElementsByTag("td");
        Log.d("TD INSIDE:", " " + td_inside.get(i).text());
    }
}
Right now I am able to get information; the problem is that I am getting content from other tables too, because all the tables have the same class name and I am having trouble targeting the specific table I need. I am also getting an IndexOutOfBoundsException.
This is the log of that: Loglink
The type of log I want is something like this:
the hour (TIME TABLE), then for that hour all the lines below it with the minutes (TD INSIDE), and then move on to the next hour (...)
Thanks for your time.
[EDIT]
A better example of the log I want, based on the first table:
TIME TABLE: 05H
TD INSIDE: 15
TD INSIDE: 45
TIME TABLE: 06H
TD INSIDE: 15
TD INSIDE: 35
TD INSIDE: 45
TD INSIDE: 55
TIME TABLE: 07H
TD INSIDE: 05
TD INSIDE: 15
TD INSIDE: 20
TD INSIDE: 25
TD INSIDE: 35
TD INSIDE: 40
TD INSIDE: 50
TD INSIDE: 55
(...)

You can do it like this:
Element table = document
        .select("table.easy-table-creator:nth-child(1) tbody").first();
Elements trAll = table.select("tr");
Elements trAllBody = table.select("tr:not(:first-child)");

// For the table hour
Element trFirst = trAll.first();
Elements tds = trFirst.select("td");

for (int i = 0; i < tds.size(); i++) {
    Element td = tds.get(i);
    Log.d("TIME TABLE:", " " + td.text());

    String query = "td:nth-child(" + (i + 1) + ")";
    Elements subTds = trAllBody.select(query);
    for (int j = 0; j < subTds.size(); j++) {
        Element subTd = subTds.get(j);
        String tdText = subTd.text();
        if (!tdText.isEmpty()) {
            Log.d("TD INSIDE:", " " + subTd.text());
        }
    }
}
Some interesting points:
- your table.easy-table-creator:nth-child(1) tbody selector was selecting all the tables on the page;
- with a progressive select you can retrieve all the tds in a given column: td:nth-child(index);
- trAllBody here contains all the trs that are not the first one (using the tr:not(:first-child) selector).

Related

Parsing currency exchange data from https://uzmanpara.milliyet.com.tr/doviz-kurlari/

I am preparing a program and I wrote this code with some help. It worked the first 10 times, but then it started giving me NULL values.
String url = "https://uzmanpara.milliyet.com.tr/doviz-kurlari/";

//Document doc = Jsoup.parse(url);
Document doc = null;
try {
    doc = Jsoup.connect(url).timeout(6000).get();
} catch (IOException ex) {
    Logger.getLogger(den3.class.getName()).log(Level.SEVERE, null, ex);
}

int i = 0;
String[] currencyStr = new String[11];
String[] buyStr = new String[11];
String[] sellStr = new String[11];

Elements elements = doc.select(".borsaMain > div:nth-child(2) > div:nth-child(1) > table.table-markets");
for (Element element : elements) {
    Elements curreny = element.parent().select("td:nth-child(2)");
    Elements buy = element.parent().select("td:nth-child(3)");
    Elements sell = element.parent().select("td:nth-child(4)");

    System.out.println(i);
    currencyStr[i] = curreny.text();
    buyStr[i] = buy.text();
    sellStr[i] = sell.text();

    System.out.println(String.format("%s [buy=%s, sell=%s]",
            curreny.text(), buy.text(), sell.text()));
    i++;
}

for (i = 0; i < 11; i++) {
    System.out.println("currency: " + currencyStr[i]);
    System.out.println("buy: " + buyStr[i]);
    System.out.println("sell: " + sellStr[i]);
}
Here is the code. I guess it is a connection problem, but I could not solve it. I use NetBeans; do I have to change NetBeans' connection settings, or do I have to add something more to the code?
Can you help me?
There's nothing wrong with the connection. Your query simply doesn't match the page structure.
Somewhere on your page there's an element with class borsaMain that has a direct child with class detL, and somewhere in the descendant tree of detL there is your table. You can write this as the following CSS selector query:
.borsaMain > .detL table
There will be two tables in the result, but I suspect you are looking for the first one.
So basically, you want something like:
Element table = doc.selectFirst(".borsaMain > .detL table");
for (Element row : table.select("tr:has(td)")) {
    // your existing loop code
}
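To make that concrete, here is a minimal sketch of what the loop body could look like, assuming the column layout from the question's own code (currency name in the 2nd cell, buy in the 3rd, sell in the 4th) still matches the page:

Element table = doc.selectFirst(".borsaMain > .detL table");
if (table != null) {
    for (Element row : table.select("tr:has(td)")) {
        // assumption: the same td positions the question used for currency / buy / sell
        String currency = row.select("td:nth-child(2)").text();
        String buy = row.select("td:nth-child(3)").text();
        String sell = row.select("td:nth-child(4)").text();
        System.out.println(String.format("%s [buy=%s, sell=%s]", currency, buy, sell));
    }
}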

Data is not written to the Excel sheet in a formatted way

I want to write my data to Excel, but it is not written in a formatted way.
driver.get("https://careernavigator.naukri.com/sales-executive-retail-careers-in-mahindra-and-mahindra-financial-services-15731");

List<WebElement> row = driver.findElements(By.xpath("(//*[name()='svg'])[2]//*[name()='rect' and @height='40']"));
List<WebElement> column = driver.findElements(By.xpath("(//*[name()='svg'])[2]//*[name()='text']//*[name()='tspan' and (@dy=4 or @dy='3.5')]"));

for (int i = 0; i < column.size(); i++) {
    System.out.println(column.get(i).getText());
    XSSFRow row1 = sheet.createRow(i);
    for (int j = 0; j < 4; j++) {
        Cell cell1 = row1.createCell(j);
        cell1.setCellValue(column.get(j).getText());
The issue appears to be with your for-loop structure. First, a couple of observations.
You need a high implicit wait here because the site is slow to load these elements.
Your rows XPath finds nothing, but your column XPath works, even though it did not when I ported it to Protractor.
I also sketch an alternative using the locator By.id("f1") and splitting the resulting text, but it turns out that is equivalent to what your "column" XPath finds.
The key here is knowing when a new row begins and when the items are no longer data you actually want. A new row starts when the index is a multiple of 3, since each row has 3 fields (a name and 2 numbers); the data we do not care about begins with the value "1". I did not write the code to put the data into Excel, but I indicate where you would (a sketch of that Apache POI step appears after the output below).
Here is my code (yours will vary slightly depending on where your ChromeDriver, or whichever driver you use, is located, and because you have the Excel sheet pieces):
import java.util.List;
import java.util.concurrent.TimeUnit;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class ShonaDriver {

    public static void main(String[] args) {
        System.setProperty("webdriver.chrome.driver", "bin/chromedriver");
        WebDriver driver = new ChromeDriver();
        driver.manage().timeouts().implicitlyWait(90, TimeUnit.SECONDS);
        driver.manage().window().maximize();
        driver.get(
                "https://careernavigator.naukri.com/sales-executive-retail-careers-in-mahindra-and-mahindra-financial-services-15731");

        // the row xpath that was once here found nothing
        List<WebElement> column = driver.findElements(
                By.xpath("(//*[name()='svg'])[2]//*[name()='text']//*[name()='tspan' and (@dy=4 or @dy='3.5')]"));

        int spreadSheetRowNum = 0;
        int spreadSheetColumnNum = 0;
        WebElement f1 = driver.findElement(By.id("f1"));

        for (int i = 0; i < column.size(); i++) {
            if (column.get(i).getText().equalsIgnoreCase("1")) {
                // reached end of meaningful fields
                break;
            }
            if (i % 3 == 0) { // start of new row
                spreadSheetRowNum++;
                System.out.println("here create row: " + spreadSheetRowNum);
                spreadSheetColumnNum = 1; // assuming excel column A is column 1
            } else {
                spreadSheetColumnNum++;
            }
            System.out.println("for column list " + i + " text is:");
            System.out.println(column.get(i).getText());
            System.out.println("write it to row " + spreadSheetRowNum + " column " + spreadSheetColumnNum);
        }

        String[] f1Arr = f1.getText().split("\n");
        System.out.println("if you prefer to use the f1 array, its contents are:");
        for (int i = 0; i < f1Arr.length; i++) {
            System.out.println("f1[" + i + "] = " + f1Arr[i]);
        }
        driver.close();
    }
}
and here is the output I get indicating what is there and what you need to do at the various steps:
here create row: 1
for column list 0 text is:
Mahindra and Mahindra Financia..
write it to row 1 column 1
for column list 1 text is:
4.8
write it to row 1 column 2
for column list 2 text is:
2.4
write it to row 1 column 3
here create row: 2
for column list 3 text is:
Tata Motors
write it to row 2 column 1
for column list 4 text is:
5.0
write it to row 2 column 2
for column list 5 text is:
2.6
write it to row 2 column 3
here create row: 3
for column list 6 text is:
Mahindra and Mahindra
write it to row 3 column 1
for column list 7 text is:
4.6
write it to row 3 column 2
for column list 8 text is:
2.9
write it to row 3 column 3
if you prefer to use the f1 array, its contents are:
f1[0] = Mahindra and Mahindra Financia..
f1[1] = 4.8
f1[2] = 2.4
f1[3] = Tata Motors
f1[4] = 5.0
f1[5] = 2.6
f1[6] = Mahindra and Mahindra
f1[7] = 4.6
f1[8] = 2.9
f1[9] = 1
f1[10] = 1
f1[11] = 2
f1[12] = 2
f1[13] = 3
f1[14] = 3
f1[15] = 4
f1[16] = 4
f1[17] = 5
f1[18] = 5
f1[19] = 6
f1[20] = 6
f1[21] = 7
f1[22] = 7
f1[23] = 8
f1[24] = 8
f1[25] = Avg.Exp
f1[26] = Avg.Sal
f1[27] = In lacs
f1[28] = Avg.Exp
f1[29] = Avg.Sal
f1[30] = In lacs
f1[31] = Top Companies
f1[32] = Top Companies
f1[33] = Top Companies
f1[34] = Top Companies
f1[35] = 1.6
f1[36] = 4.9
f1[37] = 2.4
f1[38] = 1.6
f1[39] = 7.0
f1[40] = 2.6
f1[41] = 1.3
f1[42] = 7.2
f1[43] = 2.9
f1[44] = View 22 more Companies.
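Since the Excel-writing part was intentionally left out above, here is a rough sketch of how the row/column bookkeeping could map onto Apache POI, reusing the column list from the code above. This is my own illustration, not part of the original answer; it assumes the usual org.apache.poi.xssf.usermodel imports and a method that may throw IOException:

XSSFWorkbook workbook = new XSSFWorkbook();
XSSFSheet sheet = workbook.createSheet("careers");
XSSFRow excelRow = null;
for (int i = 0; i < column.size(); i++) {
    if (column.get(i).getText().equalsIgnoreCase("1")) {
        break; // end of meaningful fields, same condition as the loop above
    }
    if (i % 3 == 0) {
        excelRow = sheet.createRow(i / 3); // one spreadsheet row per group of 3 fields
    }
    excelRow.createCell(i % 3).setCellValue(column.get(i).getText());
}
try (FileOutputStream out = new FileOutputStream("careers.xlsx")) {
    workbook.write(out);
}
workbook.close();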

Missing Table Elements When Scraping

URL: https://stats.nba.com/player/1628381/defense-dash/
Attempting to get:
<table>
<tbody>
<!----><tr data-ng-repeat="(i, row) in page" index="0">
<td class="player">Overall</td>
<td>45</td>
<td>45</td>
<td>5.7</td>
<td>12.3</td>
<td>46.6</td>
<td>100%</td>
<td>46.7</td>
<td>-0.1</td>
</tr><!---->
</tbody>
</table>
My code:
public static void getData(String url, String Name, int ID) throws IOException
{
    String html = Jsoup.connect(url).execute().body();
    html = html.replaceAll("<!---->", "");
    html = html.replaceAll("<!--", "");
    html = html.replaceAll("-->", "");
    Document doc = Jsoup.parse(html);

    Elements tableElements = doc.select("table");
    System.out.println("Elements " + tableElements);

    for (Element tableElement : tableElements)
    {
        String tableId = tableElement.id();
        if (tableId.isEmpty()) {
            continue;
        }
        String fileName = "table" + Name + tableId + ID + ".csv";
        System.out.println(fileName);
        FileWriter writer = new FileWriter(new File("C:\\Users\\noman\\eclipse-workspace\\Senior Project\\src\\", fileName));
        //System.out.println(doc);

        Elements tableRowElements = tableElement.select(":not(thead) tr td");
        for (int i = 0; i < tableRowElements.size(); i++) {
            Element row = tableRowElements.get(i);
            Elements rowItems = row.select("td");
            for (int j = 0; j < rowItems.size(); j++) {
                writer.append(rowItems.get(j).text());
                if (j != rowItems.size() - 1) {
                    writer.append(',');
                }
            }
            writer.append('\n');
        }
The problem is that no elements are found. This same code works perfectly on another site that has (seemingly) no differences in how it stores its data.
Is there something different about this website that prevents web scraping, or maybe a subtle difference?
Please note the HTML provided is a shortened version.
As said in the comments, the data you are looking for is loaded dynamically, but you can fetch it with a simple GET request to this link:
https://stats.nba.com/stats/playerdashptshotdefend?DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&Period=0&PlayerID=1628381&Season=2018-19&SeasonSegment=&SeasonType=Regular+Season&TeamID=0&VsConference=&VsDivision=
EDIT
To find this link I used the browser's developer tools and checked the XHR requests.
You can see that the link includes several parameters, among them PlayerID, which is identical to the number that appears in your initial link. By changing its value you can get stats for other players.
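As an illustration (not part of the original answer), the JSON endpoint can be fetched with Jsoup itself by ignoring the content type. The extra headers are an assumption on my part, since stats.nba.com tends to reject requests that do not look like a browser:

String endpoint = "https://stats.nba.com/stats/playerdashptshotdefend"
        + "?DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00&Location=&Month=0"
        + "&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&Period=0&PlayerID=1628381"
        + "&Season=2018-19&SeasonSegment=&SeasonType=Regular+Season&TeamID=0&VsConference=&VsDivision=";

String json = Jsoup.connect(endpoint)
        .ignoreContentType(true)            // the response is JSON, not HTML
        .userAgent("Mozilla/5.0")           // assumption: a browser-like User-Agent
        .referrer("https://stats.nba.com/") // assumption: some endpoints also check the referrer
        .execute()
        .body();

System.out.println(json); // parse with any JSON library (e.g. Gson or org.json)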

Jsoup - arrangement of table data from website

I want to get the table from https://ms.wikipedia.org/wiki/Malaysia.
Here is the table I want from the website.
But the result is not what I want.
I have 2 questions:
1st question: how can I arrange the data in rows and columns like the table in my picture? Below is my source code for getting the data.
String URL = "https://ms.wikipedia.org/wiki/Malaysia";
Document doc = Jsoup.connect(URL).get();
Elements trs = doc.select("#mw-content-text > div > table:nth-child(148)");
String currentRow = null;
for (Element tr : trs) {
    Elements tdDay = tr.select("tr:has(th)");
    currentRow = tdDay.text();
    System.out.print(currentRow);
}
2nd question: looking at my source code, is this the best way to scrape the particular data I want from a page such as https://ms.wikipedia.org/wiki/Malaysia, i.e. by using
Elements trs = doc.select("#mw-content-text > div > table:nth-child(148)");
The website has 3 tables with the class name wikitable (<table class="wikitable">), so how can I select only the particular table I need?
Since the website you provided has several wikitable tables in it, you can try to work out the selector for the data in the table; I found it contains <td> and <th> cells.
for (int i = 1; i <= x; i++) {
    Elements trs = doc.select("#mw-content-text > div > table:nth-child(148) > tbody > tr:nth-child(" + i + ") > th");
    Elements tds = doc.select("#mw-content-text > div > table:nth-child(148) > tbody > tr:nth-child(" + i + ") > td");
}
Try this, where x in the for loop is the number of rows in the table, so it can scrape the data.
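As a supplement to the answer above (my own suggestion, not from the original post): since the question says there are 3 tables with the class wikitable, one simple way to pick a particular one is to select them all and index into the result:

Elements wikitables = doc.select("table.wikitable");
// index is 0-based; pick whichever of the 3 wikitable tables you need
Element wanted = wikitables.get(0);
for (Element row : wanted.select("tr")) {
    System.out.println(row.text());
}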
public static void main(String[] args) throws IOException {
    String URL = "https://ms.wikipedia.org/wiki/Malaysia";
    Document doc = Jsoup.connect(URL).get();

    // Select the table which is under the header containing "Trivia",
    // having the value "wikitable" for the class attribute
    Element table = doc.select("h2:contains(Trivia)+[class=\"wikitable\"]").first();

    // then select each row of the table
    Elements trs = table.select("tr");

    // for each row get first and second child corresponding to column 1 and 2 of the table
    for (Element tr : trs) {
        Element th = tr.child(0);
        Element td = tr.child(1);
        System.out.printf("%-40s %-40s%n", th.text(), td.text());
    }
}

How to generate XPath query matching a specific element in Jsoup?

Hi, this is my web page:
<html>
<head>
</head>
<body>
<div> text div 1</div>
<div>
<span>text of first span </span>
<span>text of second span </span>
</div>
<div> text div 3 </div>
</body>
</html>
I'm using jsoup to parse it, and then browse all elements inside the page and get their paths:
Document doc = Jsoup.parse(new File("C:\\Users\\HC\\Desktop\\dataset\\index.html"), "UTF-8");
Elements elements = doc.body().select("*");
ArrayList all = new ArrayList();

for (Element element : elements) {
    if (!element.ownText().isEmpty()) {
        StringBuilder path = new StringBuilder(element.nodeName());
        String value = element.ownText();

        Elements p_el = element.parents();
        for (Element el : p_el) {
            path.insert(0, el.nodeName() + '/');
        }

        all.add(path + " = " + value + "\n");
        System.out.println(path + " = " + value);
    }
}
return all;
My code gives me this result:
html/body/div = text div 1
html/body/div/span = text of first span
html/body/div/span = text of second span
html/body/div = text div 3
In fact I want to get a result like this:
html/body/div[1] = text div 1
html/body/div[2]/span[1] = text of first span
html/body/div[2]/span[2] = text of second span
html/body/div[3] = text div 3
Please, could anyone give me an idea of how to reach this result? :) Thanks in advance.
As asked, here is an idea, even though I'm quite sure there are better solutions for getting the XPath of a given node. For example, use XSLT as in the answer to "Generate/get xpath from XML node java".
Here is a possible solution based on your current attempt.
For each (parent) element, check whether there is more than one element with this name.
Pseudo code: if (count(el.select("../" + el.nodeName())) > 1)
If true, count the preceding-sibling:: elements with the same name and add 1:
count(el.select("preceding-sibling::" + el.nodeName())) + 1
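Here is a rough Java sketch of that idea (my own illustration, and hedged: Jsoup's select() is CSS-based and does not understand XPath axes such as preceding-sibling::, so the counting is done by walking the element's siblings directly):

static String xpathOf(Element leaf) {
    StringBuilder path = new StringBuilder();
    Element el = leaf;
    while (el != null && !"#root".equals(el.nodeName())) {
        int position = 1;          // XPath positions are 1-based
        boolean needsIndex = false;
        // count preceding siblings with the same name to find the position
        for (Element sib = el.previousElementSibling(); sib != null; sib = sib.previousElementSibling()) {
            if (sib.nodeName().equals(el.nodeName())) {
                position++;
                needsIndex = true;
            }
        }
        // a following sibling with the same name also forces an index
        for (Element sib = el.nextElementSibling(); sib != null; sib = sib.nextElementSibling()) {
            if (sib.nodeName().equals(el.nodeName())) {
                needsIndex = true;
            }
        }
        path.insert(0, "/" + el.nodeName() + (needsIndex ? "[" + position + "]" : ""));
        el = el.parent();
    }
    return path.toString();
}

For the question's HTML this yields paths like /html/body/div[2]/span[1], which matches the requested output apart from the leading slash.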
This is my solution to this problem:
StringBuilder absPath = new StringBuilder();
Elements parents = htmlElement.parents();

for (int j = parents.size() - 1; j >= 0; j--) {
    Element element = parents.get(j);
    absPath.append("/");
    absPath.append(element.tagName());
    absPath.append("[");
    absPath.append(element.siblingIndex());
    absPath.append("]");
}
This would be easier if you traversed the document from the root to the leaves instead of the other way round. That way you can easily group the elements by tag name and handle multiple occurrences accordingly. Here is a recursive approach:
private final List<String> path = new ArrayList<>();
private final List<String> all = new ArrayList<>();

public List<String> getAll() {
    return Collections.unmodifiableList(all);
}

public void parse(Document doc) {
    path.clear();
    all.clear();
    parse(doc.children());
}

private void parse(List<Element> elements) {
    if (elements.isEmpty()) {
        return;
    }
    Map<String, List<Element>> grouped = elements.stream().collect(Collectors.groupingBy(Element::tagName));
    for (Map.Entry<String, List<Element>> entry : grouped.entrySet()) {
        List<Element> list = entry.getValue();
        String key = entry.getKey();
        if (list.size() > 1) {
            int index = 1;
            // use paths with index
            key += "[";
            for (Element e : list) {
                path.add(key + (index++) + "]");
                handleElement(e);
                path.remove(path.size() - 1);
            }
        } else {
            // use paths without index
            path.add(key);
            handleElement(list.get(0));
            path.remove(path.size() - 1);
        }
    }
}

private void handleElement(Element e) {
    String value = e.ownText();
    if (!value.isEmpty()) {
        // add entry
        all.add(path.stream().collect(Collectors.joining("/")) + " = " + value);
    }
    // process children of element
    parse(e.children());
}
Here is the solution in Kotlin. It's correct, and it works. The other answers are wrong and caused me hours of lost work.
fun Element.xpath(): String = buildString {
    val parents = parents()
    for (j in (parents.size - 1) downTo 0) {
        val parent = parents[j]
        append("/*[")
        append(parent.siblingIndex() + 1)
        append(']')
    }
    append("/*[")
    append(siblingIndex() + 1)
    append(']')
}
