How to generate XPath query matching a specific element in Jsoup?

How to generate XPath query matching a specific element in Jsoup? - java

_ Hi , this is my web page :
<html>
<head>
</head>
<body>
<div> text div 1</div>
<div>
<span>text of first span </span>
<span>text of second span </span>
</div>
<div> text div 3 </div>
</body>
</html>
I'm using jsoup to parse it , and then browse all elements inside the page and get their paths :
Document doc = Jsoup.parse(new File("C:\\Users\\HC\\Desktop\\dataset\\index.html"), "UTF-8");
Elements elements = doc.body().select("*");
ArrayList all = new ArrayList();
for (Element element : elements) {
if (!element.ownText().isEmpty()) {
StringBuilder path = new StringBuilder(element.nodeName());
String value = element.ownText();
Elements p_el = element.parents();
for (Element el : p_el) {
path.insert(0, el.nodeName() + '/');
}
all.add(path + " = " + value + "\n");
System.out.println(path +" = "+ value);
}
}
return all;
my code give me this result :
html/body/div = text div 1
html/body/div/span = text of first span
html/body/div/span = text of second span
html/body/div = text div 3
in fact i want get result like this :
html/body/div[1] = text div 1
html/body/div[2]/span[1] = text of first span
html/body/div[2]/span[2] = text of second span
html/body/div[3] = text div 3
please could any one give me idea how to get reach this result :) . thanks in advance.

As asked here a idea.
Even if I'm quite sure that there better solutions to get the xpath for a given node. For example use xslt as in the answer to "Generate/get xpath from XML node java".
Here the possible solution based on your current attempt.
For each (parent) element check if there are more than one element with this name.
Pseudo code: if ( count (el.select('../' + el.nodeName() ) > 1)
If true count the preceding-sibling:: with same name and add 1.
count (el.select('preceding-sibling::' + el.nodeName() ) +1

This is my solution to this problem:
StringBuilder absPath=new StringBuilder();
Elements parents = htmlElement.parents();
for (int j = parents.size()-1; j >= 0; j--) {
Element element = parents.get(j);
absPath.append("/");
absPath.append(element.tagName());
absPath.append("[");
absPath.append(element.siblingIndex());
absPath.append("]");
}

This would be easier, if you traversed the document from the root to the leafs instead of the other way round. This way you can easily group the elements by tag-name and handle multiple occurences accordingly. Here is a recursive approach:
private final List<String> path = new ArrayList<>();
private final List<String> all = new ArrayList<>();
public List<String> getAll() {
return Collections.unmodifiableList(all);
}
public void parse(Document doc) {
path.clear();
all.clear();
parse(doc.children());
}
private void parse(List<Element> elements) {
if (elements.isEmpty()) {
return;
}
Map<String, List<Element>> grouped = elements.stream().collect(Collectors.groupingBy(Element::tagName));
for (Map.Entry<String, List<Element>> entry : grouped.entrySet()) {
List<Element> list = entry.getValue();
String key = entry.getKey();
if (list.size() > 1) {
int index = 1;
// use paths with index
key += "[";
for (Element e : list) {
path.add(key + (index++) + "]");
handleElement(e);
path.remove(path.size() - 1);
}
} else {
// use paths without index
path.add(key);
handleElement(list.get(0));
path.remove(path.size() - 1);
}
}
}
private void handleElement(Element e) {
String value = e.ownText();
if (!value.isEmpty()) {
// add entry
all.add(path.stream().collect(Collectors.joining("/")) + " = " + value);
}
// process children of element
parse(e.children());
}

Here is the solution in Kotlin. It's correct, and it works. The other answers are wrong and caused me hours of lost work.
fun Element.xpath(): String = buildString {
val parents = parents()
for (j in (parents.size - 1) downTo 0) {
val parent = parents[j]
append("/*[")
append(parent.siblingIndex() + 1)
append(']')
}
append("/*[")
append(siblingIndex() + 1)
append(']')
}

Related

Parsing currency exchange data from https://uzmanpara.milliyet.com.tr/doviz-kurlari/

I prepare the program and I wrote this code with helping but the first 10 times it works then it gives me NULL values,
String url = "https://uzmanpara.milliyet.com.tr/doviz-kurlari/";
//Document doc = Jsoup.parse(url);
Document doc = null;
try {
doc = Jsoup.connect(url).timeout(6000).get();
} catch (IOException ex) {
Logger.getLogger(den3.class.getName()).log(Level.SEVERE, null, ex);
}
int i = 0;
String[] currencyStr = new String[11];
String[] buyStr = new String[11];
String[] sellStr = new String[11];
Elements elements = doc.select(".borsaMain > div:nth-child(2) > div:nth-child(1) > table.table-markets");
for (Element element : elements) {
Elements curreny = element.parent().select("td:nth-child(2)");
Elements buy = element.parent().select("td:nth-child(3)");
Elements sell = element.parent().select("td:nth-child(4)");
System.out.println(i);
currencyStr[i] = curreny.text();
buyStr[i] = buy.text();
sellStr[i] = sell.text();
System.out.println(String.format("%s [buy=%s, sell=%s]",
curreny.text(), buy.text(), sell.text()));
i++;
}
for(i = 0; i < 11; i++){
System.out.println("currency: " + currencyStr[i]);
System.out.println("buy: " + buyStr[i]);
System.out.println("sell: " + sellStr[i]);
}
here is the code, I guess it is a connection problem but I could not solve it I use Netbeans, Do I have to change the connection properties of Netbeans or should I have to add something more in the code
can you help me?

There's nothing wrong with the connection. Your query simply doesn't match the page structure.
Somewhere on your page, there's an element with class borsaMain, that has a direct child with class detL. And then somewhere in the descendants tree of detL, there is your table. You can write this as the following CSS element selector query:
.borsaMain > .detL table
There will be two tables in the result, but I suspect you are looking for the first one.
So basically, you want something like:
Element table = doc.selectFirst(".borsaMain > .detL table");
for (Element row : table.select("tr:has(td)")) {
// your existing loop code
}

finding the number of open closed html tags in a string

I trying to figure out the best way to find the number of valid HTML tags in a string.
The assumption is that the tag is valid only if it has an opening and closing tag
this is an example of a test case
INPUT
"html": "<html><head></head><body><div><div></div></div>"
Output
"validTags":3

If you need to parse HTML
Do not do it yourself. There is no need to reinvent the wheel. There is a plethora of libraries for parsing HTML. Use the proper tool for the proper job.
Concentrate your efforts on the rest of your project. Sure, you could implement your own function that parses a string, looks for < and >, and acts appropriately. But HTML might be slightly more complex than you imagine, or you might end up needing more HTML parsing than just counting tags.
Maybe in the future you'llwant to count <br/> and <br /> as well. Or you'll want to find the depth of the HTML tree.
Maybe your homemade code doesn't account for all possible combinations of escaping characters, nested tags, etc. How many correct tags are there in the string:
<a><b><c><d e><f g="<h></h>"><i j="<k>" l="</k>"></i></f></e d></b></c></ a >
In a comment, user dbl linked to a similar question with links to libraries: How to validate HTML from java ?
If you want to count open-closed tag pairs as a learning project
Here is a proposed algorithm in pseudocode, as a recursive function:
function count_tags(s):
tag, remainder = find_next_tag(s)
found, inside, after = find_closing_tag(tag, remainder)
if (found)
return 1 + count_tags(inside) + count_tags(after)
else
return count_tags(inside)
Examples
on the string hello <a>world<c></c></a><b></b>, we will get:
tag = "<a>"
remainder = "world<c></c></a><b></b>"
found = true
inside = "world<c></c>"
after = "<b></b>"
return 1 + count_tags("world<c></c>") + count_tags("<b></b>")
on the string <html><head></head>:
tag = "<html>"
remainder = "<head></head>"
found = false
inside = "<head></head>"
after = ""
return count_tags("<head></head>")
on the string <a><b></a></b>:
tag = "<a>"
remainder = "<b></a></b>"
found = true
inside = "<b>"
after = "</b>"
return 1 + count_tags("<b>") + count_tags("</b>")

I wrote a function that would do exactly this.
static int checkValidTags(String html,String[] openTags, String[] closeTags) {
//openTags and closeTags must have the same length;
//This function keeps track of all opening tags.
//and removes the opening and closing tags if the tag is closed correctly
//It can even detect when there are labels added to the tags.
HashMap<Character,Integer> open = new HashMap<>();
HashMap<Character,Integer> close = new HashMap<>();
//Use a start character, this is 1 because 0 would be a string terminator.
int startChar = 1;
for(int i = 0; i < openTags.length; i++) {
open.put((char)startChar, i);
close.put((char)(startChar+1), i);
html = html.replaceAll(openTags[i],""+ (char)startChar);
html = html.replaceAll(closeTags[i],""+(char)(startChar+1));
startChar+=2;
}
List<List<Integer>> startIndexes = new ArrayList<>();
int validLabels = 0;
for(int i = 0; i < openTags.length; i++) {
startIndexes.add(new ArrayList<>());
}
for(int i = 0; i < html.length(); i++) {
char c = html.charAt(i);
if(open.get(c)!=null) {
startIndexes.get(open.get(c)).add(0,i);
}
if(close.get(c)!=null&&!startIndexes.get(close.get(c)).isEmpty()) {
String closed = html.substring(startIndexes.get(close.get(c)).get(0),i);
for(int k = 0; k < startIndexes.size(); k++) {
if(!startIndexes.get(k).isEmpty()) {
int p = startIndexes.get(k).get(0);
if(p > startIndexes.get(close.get(c)).get(0)) {
startIndexes.get(k).remove(0);
}
}
}
startIndexes.get(close.get(c)).remove(0);
html.replace(closed, "");
validLabels++;
}
}
return validLabels;
}
And to use it in your example you would do like this:
String html = "<html><head></head><body><div><div></div></div>";
int validTags = checkValidTags(html,new String[] {
//Add here all the tags you are looking for.
//Remove the trailing '>' so it can detect extra tags appended to it
"<html","<head","<body","<div"
}, new String[]{
"</html>","</head>","</body>","</div>"
});
System.out.println(validTags);
Output:
3

Missing Table Elements When Scraping

URL: https://stats.nba.com/player/1628381/defense-dash/
Attempting to get:
`<table>
<tbody>
<!----><tr data-ng-repeat="(i, row) in page" index="0">
<td class="player">Overall</td>
<td>45</td>
<td>45</td>
<td>5.7</td>
<td>12.3</td>
<td>46.6</td>
<td>100%</td>
<td>46.7</td>
<td>-0.1</td>
</tr><!---->
</tbody>
</table> `
My coding:
public static void getData(String url, String Name, int ID) throws
IOException
{
String html = Jsoup.connect(url).execute().body();
html = html.replaceAll("<!---->", "");
html = html.replaceAll("<!--", "");
html = html.replaceAll("-->", "");
Document doc = Jsoup.parse(html);
Elements tableElements = doc.select("table");
System.out.println("Elements " + tableElements);
for (Element tableElement : tableElements)
{
String tableId = tableElement.id();
if (tableId.isEmpty()) {
continue;
}
String fileName = "table" + Name + tableId + ID + ".csv";
System.out.println(fileName);
FileWriter writer = new FileWriter(new File("C:\\Users\\noman\\eclipse-workspace\\Senior Project\\src\\", fileName));
//System.out.println(doc);
Elements tableRowElements = tableElement.select(":not(thead) tr td");
for (int i = 0; i < tableRowElements.size(); i++) {
Element row = tableRowElements.get(i);
Elements rowItems = row.select("td");
for (int j = 0; j < rowItems.size(); j++) {
writer.append(rowItems.get(j).text());
if (j != rowItems.size() - 1) {
writer.append(',');
}
}
writer.append('\n');
}
Problem is no elements are being found. this same code works on another site perfectly which (seemingly) no differences in how they store data
Is there something different with this website preventing web-scraping? or a subtle difference maybe?
Please note HTML code provided is a shorten version

As said at the comments, the data you are looking for is loaded dynamically, but, you can fetch it with a simple GET request from this link -
https://stats.nba.com/stats/playerdashptshotdefend?DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&Period=0&PlayerID=1628381&Season=2018-19&SeasonSegment=&SeasonType=Regular+Season&TeamID=0&VsConference=&VsDivision=
EDIT
To find this link I've used the browser's developer tools and checked for xhr requests.
You can see that the link includes several parameters, among them the playerID which is identical to the number that appears in your intial link. By changing its value you can get stats of other players.

using Jsoup to extract a table inside several divs

I am trying to use jsoup so as to have access to a table embedded inside multiple div's of an html page.The table is under the outer division with id "content-top". I will give the inner divs leading to the table: content-top -> center -> middle-right-col -> result .
Under the div result; is table round. This is the table that i want to access and whose rows I need to traverse and print out the data contained in them. Below is the java code I have been trying to use but yielding no results :
Document doc = Jsoup.connect("http://www.calculator.com/#").data("express", "sin(x)").data("calculate","submit").post();
// give the application time to calculate result before retrieving result from results table
try {
Thread.sleep(10000);
}
catch(InterruptedException ex)
{
Thread.currentThread().interrupt();
}
Elements content = doc.select("div#result") ;
Element tables = content.get(0) ;
Elements table_rows = tables.select("tr") ;
Iterator iterRows = table_rows.iterator();
while (iterRows.hasNext()) {
Element tr = (Element)iterRows.next();
Elements table_data = tr.select("td");
Iterator iterData = table_data.iterator();
int tdCount = 0;
String f_x_value = null;
String result = null;
// process new line
while (iterData.hasNext()) {
Element td = (Element)iterData.next();
switch (tdCount++) {
case 1:
f_x_value = td.text();
f_x_value = td.select("a").text();
break;
case 2:
result = td.text();
result = td.select("a").text();
break;
}
}
System.out.println(f_x_value + " " + result ) ;
}
The above code crashes and hardly does what I want it to do. PLEASE CAN ANYONE PLEASE HELP ME !!!

public static String do_conversion (String str)
{
char c;
String output = "{";
for(int i = 0; i < str.length(); i++)
{
c = str.charAt(i);
if(c=='e')
output += "{mathrm{e}}";
else if(c=='(')
output += '{';
else if(c==')')
output += '}';
else if(c=='+')
output += "{cplus}";
else if(c=='-')
output += "{cminus}";
else if(c=='*')
output += "{cdot}";
else if(c=='/')
output += "{cdivide}";
else output += c; // else copy the character normally
}
output += ", mathrm{d}x}";
return output;
}
#Syam S

The page doesnt directly give you a table in a div with id as "result". It uses an ajax class to a php file and get the process done. So what you need to do here is to first build a json like
{"expression":"sin(x)","intVar":"x","upperBound":"","lowerBound":"","simplifyExpressions":false,"latex":"\\displaystyle\\int\\limits^{}_{}{\\sin\\left(x\\right)\\, \\mathrm{d}x}"}
The expression key hold the expression that you want to evaluate, the latex is a mathjax expression and then post it to int.php. This expects two arguments namely q which is the above json and v which seems to a constant value 1380119311. I didnt understand what this is.
Now this will return a response like
<html>
<head></head>
<body>
<table class="round">
<tbody>
<tr class="">
<th>$f(x) =$</th>
<td>$\sin\left(x\right)$</td>
</tr>
<tr class="sep odd">
<th>$\displaystyle\int{f(x)}\, \mathrm{d}x =$</th>
<td>$-\cos\left(x\right)$</td>
</tr>
</tbody>
</table>
<!-- Finished in 155 ms -->
<p id="share"> <img src="layout/32x32xshare.png.pagespeed.ic.i3iroHP5fI.png" width="32" height="32" /> <a id="share-link" href="http://www.integral-calculator.com/#expr=sin%28x%29" onclick="window.prompt("To copy this link to the clipboard, press Ctrl+C, Enter.", $("share-link").href); return false;">Direct link to this calculation (for sharing)</a> </p>
</body>
</html>
The table in this expression gives you the result and the site uses mathjax to display it like
A sample program would be
import java.io.IOException;
import org.apache.commons.lang3.StringEscapeUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class JsoupParser6 {
public static void main(String[] args) {
try {
// Integral
String url = "http://www.integral-calculator.com/int.php";
String q = "{\"expression\":\"sin(4x) * e^(-x)\",\"intVar\":\"x\",\"upperBound\":\"\",\"lowerBound\":\"\",\"simplifyExpressions\":false,\"latex\":\"\\\\displaystyle\\\\int\\\\limits^{}_{}{\\\\sin\\\\left(4x\\\\right){\\\\cdot}{\\\\mathrm{e}}^{-x}\\\\, \\\\mathrm{d}x}\"}";
Document integralDoc = Jsoup.connect(url).data("q", q).data("v", "1380119311").post();
System.out.println(integralDoc);
System.out.println("\n*******************************\n");
//Differential
url = "http://www.derivative-calculator.net/diff.php";
q = "{\"expression\":\"sin(x)\",\"diffVar\":\"x\",\"diffOrder\":1,\"simplifyExpressions\":false,\"showSteps\":false,\"latex\":\"\\\\dfrac{\\\\mathrm{d}}{\\\\mathrm{d}x}\\\\left(\\\\sin\\\\left(x\\\\right)\\\\right)\"}";
Document differentialDoc = Jsoup.connect(url).data("q", q).data("v", "1380119305").post();
System.out.println(differentialDoc);
System.out.println("\n*******************************\n");
//Calculus
url = "http://calculus-calculator.com/calculation/integrate.php";
Document calculusDoc = Jsoup.connect(url).data("expression", "sin(x)").data("intvar", "x").post();
String outStr = StringEscapeUtils.unescapeJava(calculusDoc.toString());
Document formattedOutPut = Jsoup.parse(outStr);
formattedOutPut.body().html(formattedOutPut.select("div.isteps").toString());
System.out.println(formattedOutPut);
} catch (IOException e) {
e.printStackTrace();
}
}
}
Update based on comment.
The unescape works perfectly well. In MathJax you could right click and view the command. So if you go to your site http://calculus-calculator.com/ and try the sin(x) equation there and right click the result and view TexCommand like
The you could see the commands are exactly the ones which we get after unsescape. The demo site is not rendering it. May be a limitation of the demo site, thats all.

JSoup parsing data from within a tag

I am managing to parse most of the data I need except for one as it is contained within the a href tag and I am needing the number that appears after "mmsi="
Sunsail 4013
my current parser fetches all the other data I need and is below. I tried a few things out the code commented out returns unspecified occasionally for an entry. Is there any way I can add to my code below so that when the data is returned the number "235083844" returns before the name "Sunsail 4013"?
try {
File input = new File("shipMove.txt");
Document doc = Jsoup.parse(input, null);
Elements tables = doc.select("table.shipInfo");
for( Element element : tables )
{
Elements tdTags = element.select("td");
//Elements mmsi = element.select("a[href*=/showship.php?mmsi=]");
// Iterate over all 'td' tags found
for( Element td : tdTags ){
// Print it's text if not empty
final String text = td.text();
if( text.isEmpty() == false )
{
System.out.println(td.text());
}
}
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Example of data parsed and html file here

You can use attr on an Element object to retrieve a particular attribute's value
Use substring to get the required value if the String pattern is consistent
Code
// Using just your anchor html tag
String html = "Sunsail 4013";
Document doc = Jsoup.parse(html);
// Just selecting the anchor tag, for your implementation use a generic one
Element link = doc.select("a").first();
// Get the attribute value
String url = link.attr("href");
// Check for nulls here and take the substring from '=' onwards
String id = url.substring(url.indexOf('=') + 1);
System.out.println(id + " "+ link.text());
Gives,
235083844 Sunsail 4013
Modified condition in your for loop from your code:
...
for (Element td : tdTags) {
// Print it's text if not empty
final String text = td.text();
if (text.isEmpty() == false) {
if (td.getElementsByTag("a").first() != null) {
// Get the attribute value
String url = td.getElementsByTag("a").first().attr("href");
// Check for nulls here and take the substring from '=' onwards
String id = url.substring(url.indexOf('=') + 1);
System.out.println(id + " "+ td.text());
}
else {
System.out.println(td.text());
}
}
}
...
The above code would print the desired output.

If you need value of attribute, you should use attr() method.
for( Element td : tdTags ){
Elements aList = td.select("a");
for(Element a : aList){
String val = a.attr("href");
if(StringUrils.isNotBlank(val)){
String yourId = val.substring(val.indexOf("=") + 1);
}
}

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to generate XPath query matching a specific element in Jsoup? - java

Related

Parsing currency exchange data from https://uzmanpara.milliyet.com.tr/doviz-kurlari/

finding the number of open closed html tags in a string

Missing Table Elements When Scraping

using Jsoup to extract a table inside several divs

JSoup parsing data from within a tag

Categories

Resources