Missing Table Elements When Scraping


URL: https://stats.nba.com/player/1628381/defense-dash/
Attempting to get:
`<table>
<tbody>
<!----><tr data-ng-repeat="(i, row) in page" index="0">
<td class="player">Overall</td>
<td>45</td>
<td>45</td>
<td>5.7</td>
<td>12.3</td>
<td>46.6</td>
<td>100%</td>
<td>46.7</td>
<td>-0.1</td>
</tr><!---->
</tbody>
</table> `
My code:
public static void getData(String url, String Name, int ID) throws
IOException
{
String html = Jsoup.connect(url).execute().body();
html = html.replaceAll("<!---->", "");
html = html.replaceAll("<!--", "");
html = html.replaceAll("-->", "");
Document doc = Jsoup.parse(html);
Elements tableElements = doc.select("table");
System.out.println("Elements " + tableElements);
for (Element tableElement : tableElements)
{
String tableId = tableElement.id();
if (tableId.isEmpty()) {
continue;
}
String fileName = "table" + Name + tableId + ID + ".csv";
System.out.println(fileName);
FileWriter writer = new FileWriter(new File("C:\\Users\\noman\\eclipse-workspace\\Senior Project\\src\\", fileName));
//System.out.println(doc);
Elements tableRowElements = tableElement.select(":not(thead) tr td");
for (int i = 0; i < tableRowElements.size(); i++) {
Element row = tableRowElements.get(i);
Elements rowItems = row.select("td");
for (int j = 0; j < rowItems.size(); j++) {
writer.append(rowItems.get(j).text());
if (j != rowItems.size() - 1) {
writer.append(',');
}
}
writer.append('\n');
}
The problem is that no elements are found. The same code works perfectly on another site, which (seemingly) stores its data no differently.
Is there something about this website that prevents web scraping, or is there a subtle difference I'm missing?
Please note that the HTML provided is a shortened version.

As mentioned in the comments, the data you are looking for is loaded dynamically. However, you can fetch it with a simple GET request to this link:
https://stats.nba.com/stats/playerdashptshotdefend?DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&Period=0&PlayerID=1628381&Season=2018-19&SeasonSegment=&SeasonType=Regular+Season&TeamID=0&VsConference=&VsDivision=
EDIT
To find this link, I used the browser's developer tools and looked at the XHR requests.
You can see that the link includes several parameters, among them PlayerID, whose value is identical to the number that appears in your initial link. By changing its value you can get the stats of other players.
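As a rough illustration of the point about PlayerID, here is a minimal sketch that rebuilds the link above for an arbitrary player and requests it with the JDK's own HttpClient (Java 11+) instead of Jsoup. The class and method names are made up for this example, and the User-Agent header is only an assumption about what the server may want; the response is expected to be JSON rather than HTML, so it needs a JSON parser rather than table selectors.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of the original question or answer.
class DefenseDashSketch {

    // Rebuilds the link from the answer; only PlayerID and Season vary.
    static String buildUrl(int playerId, String season) {
        return "https://stats.nba.com/stats/playerdashptshotdefend"
                + "?DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00&Location=&Month=0"
                + "&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&Period=0"
                + "&PlayerID=" + playerId
                + "&Season=" + season
                + "&SeasonSegment=&SeasonType=Regular+Season&TeamID=0&VsConference=&VsDivision=";
    }

    // Sends a plain GET request and returns the raw response body.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                // Assumption: a browser-like User-Agent; the site may require other headers too.
                .header("User-Agent", "Mozilla/5.0")
                .GET()
                .build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Calling DefenseDashSketch.buildUrl(1628381, "2018-19") reproduces the link above; passing a different ID gives the link for another player.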

Related

Parsing currency exchange data from https://uzmanpara.milliyet.com.tr/doviz-kurlari/

I wrote this code with some help. It works the first 10 times, but then it gives me NULL values:
String url = "https://uzmanpara.milliyet.com.tr/doviz-kurlari/";
//Document doc = Jsoup.parse(url);
Document doc = null;
try {
doc = Jsoup.connect(url).timeout(6000).get();
} catch (IOException ex) {
Logger.getLogger(den3.class.getName()).log(Level.SEVERE, null, ex);
}
int i = 0;
String[] currencyStr = new String[11];
String[] buyStr = new String[11];
String[] sellStr = new String[11];
Elements elements = doc.select(".borsaMain > div:nth-child(2) > div:nth-child(1) > table.table-markets");
for (Element element : elements) {
Elements curreny = element.parent().select("td:nth-child(2)");
Elements buy = element.parent().select("td:nth-child(3)");
Elements sell = element.parent().select("td:nth-child(4)");
System.out.println(i);
currencyStr[i] = curreny.text();
buyStr[i] = buy.text();
sellStr[i] = sell.text();
System.out.println(String.format("%s [buy=%s, sell=%s]",
curreny.text(), buy.text(), sell.text()));
i++;
}
for(i = 0; i < 11; i++){
System.out.println("currency: " + currencyStr[i]);
System.out.println("buy: " + buyStr[i]);
System.out.println("sell: " + sellStr[i]);
}
That is the code. I suspect a connection problem, but I could not solve it. I use NetBeans. Do I have to change the connection properties of NetBeans, or should I add something more to the code?
Can you help me?

There's nothing wrong with the connection. Your query simply doesn't match the page structure.
Somewhere on your page there's an element with class borsaMain that has a direct child with class detL. Somewhere among the descendants of detL is your table. You can write this as the following CSS selector query:
.borsaMain > .detL table
There will be two tables in the result, but I suspect you are looking for the first one.
So basically, you want something like:
Element table = doc.selectFirst(".borsaMain > .detL table");
for (Element row : table.select("tr:has(td)")) {
// your existing loop code
}

Xpages - Passing <ahref> in Arraylist

I'm trying to add an a href link to an ArrayList. The entry is added to the ArrayList fine, but the link is broken: everything after the question mark (?) in the URL is missing from the link.
Is there anything that I'm missing? Code below:
private String processUpdate(Database dbCurrent) throws NotesException {
int intCountSuccessful = 0;
View vwLookup = dbCurrent.getView("DocsDistribution");
ArrayList<String> listArray = new ArrayList<String>();
Document doc = vwLookup.getFirstDocument();
while (doc != null) {
String paperDistro = doc.getItemValueString("DistroRecords");
if (paperDistro.equals("")) {
String ref = doc.getItemValueString("ref");
String unid = doc.getUniversalID();
// the link generated when adding to Arraylist is broken
listArray.add("" + ref + "");
}
Document tmppmDoc = vwLookup.getNextDocument(doc);
doc.recycle();
doc = tmppmDoc;
}
Collections.sort(listArray);
String listString = "";
for (String s : listArray) {
listString += s + ", \t";
}
return listString;
}

You have a problem with the " escaping around the unid value, which is why your URL becomes gandhi.w3schools.com/testbox.nsf/distro.xsp?documentId="+ unid + "&action=openDocument.
It would be easier to read if you used String.format() and single quotes to generate the a tag:
listArray.add(String.format(
"<a href='gandhi.w3schools.com/testbox.nsf/distro.xsp?documentId=%s&action=openDocument'>%s</a>",
unid, ref));
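To make the effect of String.format() concrete, here is a small self-contained sketch with made-up values for unid and ref (the class name and sample values are hypothetical). Because the attribute uses single quotes, no escaping is needed inside the Java string, and everything after the question mark stays inside the href.

```java
// Hypothetical demo class; only the format string comes from the answer above.
class LinkFormatSketch {

    // Builds the anchor tag with the document ID in the query string and ref as link text.
    static String link(String unid, String ref) {
        return String.format(
                "<a href='gandhi.w3schools.com/testbox.nsf/distro.xsp?documentId=%s&action=openDocument'>%s</a>",
                unid, ref);
    }

    public static void main(String[] args) {
        // Sample values, chosen only for illustration.
        System.out.println(link("ABC123", "REF-1"));
        // prints: <a href='gandhi.w3schools.com/testbox.nsf/distro.xsp?documentId=ABC123&action=openDocument'>REF-1</a>
    }
}
```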

convert XML to a custom Excel with Java

I need advice on how to convert XML to a custom Excel file with Java.
I need to convert XML to Excel with a custom layout. I found POI, and it seems it can help with this task, but I have no experience with it. As I understand it, POI works best with in-memory trees like DOM. I started to parse my XML (I can only show a small part of it; it's really big and deeply nested):
<advantage>
<companies>
<name>Name1</name>
<name>Name2</name>
<name>Name3</name>
<name>Name4</name>
<name>Name6</name>
</companies>
<companyPreCode>
<PreCode>1</PreCode>
<PreCode>2</PreCode>
<PreCode>3</PreCode>
<PreCode>4</PreCode>
<PreCode>6</PreCode>
</companyPreCode>
using DOM, as I saw in an online tutorial, like this:
Document xmlDoc = getDocument("./src/xmlForTest.xml");
xmlDoc.getDocumentElement().normalize();
System.out.println("Root element of the doc is :\" "+ xmlDoc.getDocumentElement().getNodeName() + "\"");
NodeList listOfAdvantage = xmlDoc.getElementsByTagName("advantage"); //first we need to find total number of Advantage blocks
int totalAdvantage = listOfAdvantage.getLength();
System.out.println("Total no of advantage : " + totalAdvantage);
for (int s = 0; s < listOfAdvantage.getLength(); s++) //get into advantage
{
Node AdvantageNode = listOfAdvantage.item(s);
System.out.println("advantage number : " + s);
if (AdvantageNode.getNodeType() == Node.ELEMENT_NODE)
{
Element AdvantageElement = (Element) AdvantageNode;
NodeList CompanyList = AdvantageElement.getElementsByTagName("companies"); // find node companies
System.out.println("companies number : " + CompanyList.getLength());
for(int cl = 0; cl < CompanyList.getLength(); cl++) {
NodeList CompanyNameList = CompanyList.item(cl).getChildNodes(); //AdvantageElement.getElementsByTagName("name");
for (int j = 0; j < CompanyNameList.getLength(); j++) {
Node childNode = CompanyNameList.item(j);
if ("name".equals(childNode.getNodeName())) {
for (int nl = 0; nl < CompanyNameList.getLength(); nl++) {
Element CompanyNameElement = (Element) CompanyNameList.item(nl);
NodeList textFNList = CompanyNameElement.getChildNodes();
System.out.println("Company: " + nl + " :" + (textFNList.item(0)).getNodeValue().trim());
CompaniesNames.add((textFNList.item(0)).getNodeValue().trim());
}
}
}
}
}// end of if clause
}// end of for loop with s var
Now I have several questions:
How can I make this parsing easier? My file is big, and in some places the same tag is used for different things; for example, Name can belong to a company, a product or a person. Retrieving them one by one the way I did is getting hard.
How do I feed this data into POI later so I can start building my Excel files? Right now I have a set of ArrayLists with the data from different tags, and I don't know what to do next with them.
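On the first question, one possible way to shorten the nested loops is XPath from the JDK (javax.xml.xpath), where the full path distinguishes a company name from any other element that happens to share a tag. This is only a sketch under the assumption that the structure matches the excerpt above; the class and method names are invented here, and it does not cover the POI part.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not taken from the question.
class AdvantageXPathSketch {

    // Returns the trimmed text of every node matched by the XPath expression.
    static List<String> texts(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("could not evaluate " + expression, e);
        }
    }
}
```

With the excerpt above, texts(xml, "/advantage/companies/name") would collect the company names and texts(xml, "/advantage/companyPreCode/PreCode") the pre-codes, each as one list that could later be written out row by row.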

How to generate XPath query matching a specific element in Jsoup?

Hi, this is my web page:
<html>
<head>
</head>
<body>
<div> text div 1</div>
<div>
<span>text of first span </span>
<span>text of second span </span>
</div>
<div> text div 3 </div>
</body>
</html>
I'm using jsoup to parse it, and then I walk through all elements in the page and get their paths:
Document doc = Jsoup.parse(new File("C:\\Users\\HC\\Desktop\\dataset\\index.html"), "UTF-8");
Elements elements = doc.body().select("*");
ArrayList all = new ArrayList();
for (Element element : elements) {
if (!element.ownText().isEmpty()) {
StringBuilder path = new StringBuilder(element.nodeName());
String value = element.ownText();
Elements p_el = element.parents();
for (Element el : p_el) {
path.insert(0, el.nodeName() + '/');
}
all.add(path + " = " + value + "\n");
System.out.println(path +" = "+ value);
}
}
return all;
My code gives me this result:
html/body/div = text div 1
html/body/div/span = text of first span
html/body/div/span = text of second span
html/body/div = text div 3
In fact, I want a result like this:
html/body/div[1] = text div 1
html/body/div[2]/span[1] = text of first span
html/body/div[2]/span[2] = text of second span
html/body/div[3] = text div 3
Could anyone give me an idea of how to reach this result? Thanks in advance.

As asked, here is an idea.
I'm quite sure there are better solutions for getting the XPath of a given node, for example using XSLT as in the answer to "Generate/get xpath from XML node java".
Here is a possible solution based on your current attempt.
For each (parent) element, check whether there is more than one element with this name.
Pseudo code: if (count(el.select('../' + el.nodeName())) > 1)
If true, count the preceding siblings with the same name and add 1.
count(el.select('preceding-sibling::' + el.nodeName())) + 1
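The pseudo code above can be sketched in runnable form as follows. Note the swap: this uses the JDK's org.w3c.dom API rather than Jsoup, so it only illustrates the counting idea on a well-formed document, and all class and method names are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch using the JDK DOM API, not Jsoup.
class IndexedPathSketch {

    // Parses a well-formed document from a string.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    // True if the parent has more than one element child with this element's name.
    static boolean hasSameNameSibling(Element el) {
        int count = 0;
        for (Node n = el.getParentNode().getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && n.getNodeName().equals(el.getNodeName())) {
                count++;
            }
        }
        return count > 1;
    }

    // Number of preceding siblings with the same name, plus one.
    static int position(Element el) {
        int pos = 1;
        for (Node n = el.getPreviousSibling(); n != null; n = n.getPreviousSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && n.getNodeName().equals(el.getNodeName())) {
                pos++;
            }
        }
        return pos;
    }

    // Walks from the element up to the root, prepending one step per ancestor.
    static String pathOf(Element el) {
        StringBuilder path = new StringBuilder();
        for (Node n = el; n != null && n.getNodeType() == Node.ELEMENT_NODE; n = n.getParentNode()) {
            Element e = (Element) n;
            String step = e.getNodeName() + (hasSameNameSibling(e) ? "[" + position(e) + "]" : "");
            path.insert(0, "/" + step);
        }
        return path.substring(1); // drop the leading slash to match the format in the question
    }
}
```

For the page in the question, pathOf on the second span yields html/body/div[2]/span[2], and on the first div it yields html/body/div[1].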

This is my solution to this problem:
StringBuilder absPath=new StringBuilder();
Elements parents = htmlElement.parents();
for (int j = parents.size()-1; j >= 0; j--) {
Element element = parents.get(j);
absPath.append("/");
absPath.append(element.tagName());
absPath.append("[");
absPath.append(element.siblingIndex());
absPath.append("]");
}

This would be easier if you traversed the document from the root to the leaves instead of the other way round. This way you can easily group the elements by tag name and handle multiple occurrences accordingly. Here is a recursive approach:
private final List<String> path = new ArrayList<>();
private final List<String> all = new ArrayList<>();
public List<String> getAll() {
return Collections.unmodifiableList(all);
}
public void parse(Document doc) {
path.clear();
all.clear();
parse(doc.children());
}
private void parse(List<Element> elements) {
if (elements.isEmpty()) {
return;
}
Map<String, List<Element>> grouped = elements.stream().collect(Collectors.groupingBy(Element::tagName));
for (Map.Entry<String, List<Element>> entry : grouped.entrySet()) {
List<Element> list = entry.getValue();
String key = entry.getKey();
if (list.size() > 1) {
int index = 1;
// use paths with index
key += "[";
for (Element e : list) {
path.add(key + (index++) + "]");
handleElement(e);
path.remove(path.size() - 1);
}
} else {
// use paths without index
path.add(key);
handleElement(list.get(0));
path.remove(path.size() - 1);
}
}
}
private void handleElement(Element e) {
String value = e.ownText();
if (!value.isEmpty()) {
// add entry
all.add(path.stream().collect(Collectors.joining("/")) + " = " + value);
}
// process children of element
parse(e.children());
}

Here is the solution in Kotlin. It's correct, and it works. The other answers are wrong and caused me hours of lost work.
fun Element.xpath(): String = buildString {
val parents = parents()
for (j in (parents.size - 1) downTo 0) {
val parent = parents[j]
append("/*[")
// elementSiblingIndex() counts element siblings only; siblingIndex() would also count text nodes
append(parent.elementSiblingIndex() + 1)
append(']')
}
append("/*[")
append(elementSiblingIndex() + 1)
append(']')
}

using Jsoup to extract a table inside several divs

I am trying to use jsoup to access a table embedded inside multiple divs of an HTML page. The table is under the outer div with id "content-top". The inner divs leading to the table are: content-top -> center -> middle-right-col -> result.
Under the div result is the table round. This is the table I want to access, and whose rows I need to traverse to print out the data they contain. Below is the Java code I have been trying to use, but it yields no results:
Document doc = Jsoup.connect("http://www.calculator.com/#").data("express", "sin(x)").data("calculate","submit").post();
// give the application time to calculate result before retrieving result from results table
try {
Thread.sleep(10000);
}
catch(InterruptedException ex)
{
Thread.currentThread().interrupt();
}
Elements content = doc.select("div#result") ;
Element tables = content.get(0) ;
Elements table_rows = tables.select("tr") ;
Iterator iterRows = table_rows.iterator();
while (iterRows.hasNext()) {
Element tr = (Element)iterRows.next();
Elements table_data = tr.select("td");
Iterator iterData = table_data.iterator();
int tdCount = 0;
String f_x_value = null;
String result = null;
// process new line
while (iterData.hasNext()) {
Element td = (Element)iterData.next();
switch (tdCount++) {
case 1:
f_x_value = td.text();
f_x_value = td.select("a").text();
break;
case 2:
result = td.text();
result = td.select("a").text();
break;
}
}
System.out.println(f_x_value + " " + result ) ;
}
The above code crashes and does not do what I want. Can anyone please help me?

public static String do_conversion (String str)
{
char c;
String output = "{";
for(int i = 0; i < str.length(); i++)
{
c = str.charAt(i);
if(c=='e')
output += "{mathrm{e}}";
else if(c=='(')
output += '{';
else if(c==')')
output += '}';
else if(c=='+')
output += "{cplus}";
else if(c=='-')
output += "{cminus}";
else if(c=='*')
output += "{cdot}";
else if(c=='/')
output += "{cdivide}";
else output += c; // else copy the character normally
}
output += ", mathrm{d}x}";
return output;
}
#Syam S

The page doesn't directly give you a table in a div with id "result". It makes an AJAX call to a PHP file, which does the processing. So what you need to do first is build a JSON object like
{"expression":"sin(x)","intVar":"x","upperBound":"","lowerBound":"","simplifyExpressions":false,"latex":"\\displaystyle\\int\\limits^{}_{}{\\sin\\left(x\\right)\\, \\mathrm{d}x}"}
The expression key holds the expression that you want to evaluate, and latex is a MathJax expression. You then post this to int.php, which expects two arguments: q, which is the above JSON, and v, which seems to be a constant value, 1380119311. I didn't understand what this is.
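As a sketch of how the two arguments could be assembled by hand, the following builds the JSON for q and the URL-encoded form body using only the JDK (URLEncoder.encode with a Charset needs Java 10+). The class and method names are hypothetical, the field order simply copies the JSON shown above, and the fixed intVar of "x" is an assumption.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the field names mirror the JSON shown above.
class IntegralRequestSketch {

    // Escapes backslashes and double quotes so the value is a valid JSON string.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds the JSON for the q argument; intVar is assumed to be "x".
    static String buildQuery(String expression, String latex) {
        return "{\"expression\":\"" + escape(expression) + "\","
                + "\"intVar\":\"x\",\"upperBound\":\"\",\"lowerBound\":\"\","
                + "\"simplifyExpressions\":false,"
                + "\"latex\":\"" + escape(latex) + "\"}";
    }

    // Encodes q and v as an application/x-www-form-urlencoded body.
    static String formBody(String q, String v) {
        return "q=" + URLEncoder.encode(q, StandardCharsets.UTF_8)
                + "&v=" + URLEncoder.encode(v, StandardCharsets.UTF_8);
    }
}
```

Jsoup's data("q", q).data("v", ...) calls do this encoding for you; the sketch only shows what ends up in the request.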
Now this will return a response like
<html>
<head></head>
<body>
<table class="round">
<tbody>
<tr class="">
<th>$f(x) =$</th>
<td>$\sin\left(x\right)$</td>
</tr>
<tr class="sep odd">
<th>$\displaystyle\int{f(x)}\, \mathrm{d}x =$</th>
<td>$-\cos\left(x\right)$</td>
</tr>
</tbody>
</table>
<!-- Finished in 155 ms -->
<p id="share"> <img src="layout/32x32xshare.png.pagespeed.ic.i3iroHP5fI.png" width="32" height="32" /> <a id="share-link" href="http://www.integral-calculator.com/#expr=sin%28x%29" onclick="window.prompt("To copy this link to the clipboard, press Ctrl+C, Enter.", $("share-link").href); return false;">Direct link to this calculation (for sharing)</a> </p>
</body>
</html>
The table in this response gives you the result, and the site uses MathJax to render it.
A sample program would be:
import java.io.IOException;
import org.apache.commons.lang3.StringEscapeUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class JsoupParser6 {
public static void main(String[] args) {
try {
// Integral
String url = "http://www.integral-calculator.com/int.php";
String q = "{\"expression\":\"sin(4x) * e^(-x)\",\"intVar\":\"x\",\"upperBound\":\"\",\"lowerBound\":\"\",\"simplifyExpressions\":false,\"latex\":\"\\\\displaystyle\\\\int\\\\limits^{}_{}{\\\\sin\\\\left(4x\\\\right){\\\\cdot}{\\\\mathrm{e}}^{-x}\\\\, \\\\mathrm{d}x}\"}";
Document integralDoc = Jsoup.connect(url).data("q", q).data("v", "1380119311").post();
System.out.println(integralDoc);
System.out.println("\n*******************************\n");
//Differential
url = "http://www.derivative-calculator.net/diff.php";
q = "{\"expression\":\"sin(x)\",\"diffVar\":\"x\",\"diffOrder\":1,\"simplifyExpressions\":false,\"showSteps\":false,\"latex\":\"\\\\dfrac{\\\\mathrm{d}}{\\\\mathrm{d}x}\\\\left(\\\\sin\\\\left(x\\\\right)\\\\right)\"}";
Document differentialDoc = Jsoup.connect(url).data("q", q).data("v", "1380119305").post();
System.out.println(differentialDoc);
System.out.println("\n*******************************\n");
//Calculus
url = "http://calculus-calculator.com/calculation/integrate.php";
Document calculusDoc = Jsoup.connect(url).data("expression", "sin(x)").data("intvar", "x").post();
String outStr = StringEscapeUtils.unescapeJava(calculusDoc.toString());
Document formattedOutPut = Jsoup.parse(outStr);
formattedOutPut.body().html(formattedOutPut.select("div.isteps").toString());
System.out.println(formattedOutPut);
} catch (IOException e) {
e.printStackTrace();
}
}
}
Update based on comment.
The unescape works perfectly well. In MathJax you can right-click and view the command. So if you go to your site http://calculus-calculator.com/, try the sin(x) equation there, right-click the result and view the TeX command,
you can see that the commands are exactly the ones we get after unescaping. The demo site is not rendering them. That may be a limitation of the demo site, that's all.
