Extract a specific line from a webpage using JSoup for Java - java

Hi I want to scrape some text from a website using the JSoup library. I have tried the following code, and that gives me the whole webpage, I want to just extract a specific line. Here is the code I am using:
Document doc = null;
try {
doc = Jsoup.connect("http://www.example.com").get();
} catch (IOException e) {
e.printStackTrace();
}
String text = doc.html();
System.out.println(text);
That prints out the following
<html>
<head></head>
<body>
Martin,James,28,London,20k
<br /> Sarah,Jackson,43,Glasgow,32k
<br /> Alex,Cook,22,Liverpool,18k
<br /> Jessica,Adams,34,London,27k
<br />
</body>
</html>
How can I extract just the 6th line that reads Alex,Cook,22,Liverpool,18k and put it into an array where each element is a word before a comma (eg: [0] = Alex, [1] = Cook, etc)

Maybe you have to format (?) the Result a bit:
Document doc = Jsoup.connect("http://www.example.com").get();
int count = 0; // Count Nodes
for( Node n : doc.body().childNodes() )
{
if( n instanceof TextNode )
{
if( count == 2 ) // Node 'Alex'
{
String t[] = n.toString().split(","); // you have an array with each word as string now
System.out.println(Arrays.toString(t)); // eg. output
}
count++;
}
}
Output:
[ Alex, Cook, 22, Liverpool, 18k ]
Edit:
Since you cant select TextNode's by its ccntent (only possible with Elements) you need a small workaround:
for( Node n : doc.body().childNodes() )
{
if( n instanceof TextNode )
{
str = n.toString().trim();
if( str.toLowerCase().startsWith("alex") ) // Node 'Alex'
{
String t[] = n.toString().split(","); // you have an array with each word as string now
System.out.println(Arrays.toString(t)); // eg. output
}
}
}

Related

How to generate XPath query matching a specific element in Jsoup?

_ Hi , this is my web page :
<html>
<head>
</head>
<body>
<div> text div 1</div>
<div>
<span>text of first span </span>
<span>text of second span </span>
</div>
<div> text div 3 </div>
</body>
</html>
I'm using jsoup to parse it , and then browse all elements inside the page and get their paths :
Document doc = Jsoup.parse(new File("C:\\Users\\HC\\Desktop\\dataset\\index.html"), "UTF-8");
Elements elements = doc.body().select("*");
ArrayList all = new ArrayList();
for (Element element : elements) {
if (!element.ownText().isEmpty()) {
StringBuilder path = new StringBuilder(element.nodeName());
String value = element.ownText();
Elements p_el = element.parents();
for (Element el : p_el) {
path.insert(0, el.nodeName() + '/');
}
all.add(path + " = " + value + "\n");
System.out.println(path +" = "+ value);
}
}
return all;
my code give me this result :
html/body/div = text div 1
html/body/div/span = text of first span
html/body/div/span = text of second span
html/body/div = text div 3
in fact i want get result like this :
html/body/div[1] = text div 1
html/body/div[2]/span[1] = text of first span
html/body/div[2]/span[2] = text of second span
html/body/div[3] = text div 3
please could any one give me idea how to get reach this result :) . thanks in advance.
As asked here a idea.
Even if I'm quite sure that there better solutions to get the xpath for a given node. For example use xslt as in the answer to "Generate/get xpath from XML node java".
Here the possible solution based on your current attempt.
For each (parent) element check if there are more than one element with this name.
Pseudo code: if ( count (el.select('../' + el.nodeName() ) > 1)
If true count the preceding-sibling:: with same name and add 1.
count (el.select('preceding-sibling::' + el.nodeName() ) +1
This is my solution to this problem:
StringBuilder absPath=new StringBuilder();
Elements parents = htmlElement.parents();
for (int j = parents.size()-1; j >= 0; j--) {
Element element = parents.get(j);
absPath.append("/");
absPath.append(element.tagName());
absPath.append("[");
absPath.append(element.siblingIndex());
absPath.append("]");
}
This would be easier, if you traversed the document from the root to the leafs instead of the other way round. This way you can easily group the elements by tag-name and handle multiple occurences accordingly. Here is a recursive approach:
private final List<String> path = new ArrayList<>();
private final List<String> all = new ArrayList<>();
public List<String> getAll() {
return Collections.unmodifiableList(all);
}
public void parse(Document doc) {
path.clear();
all.clear();
parse(doc.children());
}
private void parse(List<Element> elements) {
if (elements.isEmpty()) {
return;
}
Map<String, List<Element>> grouped = elements.stream().collect(Collectors.groupingBy(Element::tagName));
for (Map.Entry<String, List<Element>> entry : grouped.entrySet()) {
List<Element> list = entry.getValue();
String key = entry.getKey();
if (list.size() > 1) {
int index = 1;
// use paths with index
key += "[";
for (Element e : list) {
path.add(key + (index++) + "]");
handleElement(e);
path.remove(path.size() - 1);
}
} else {
// use paths without index
path.add(key);
handleElement(list.get(0));
path.remove(path.size() - 1);
}
}
}
private void handleElement(Element e) {
String value = e.ownText();
if (!value.isEmpty()) {
// add entry
all.add(path.stream().collect(Collectors.joining("/")) + " = " + value);
}
// process children of element
parse(e.children());
}
Here is the solution in Kotlin. It's correct, and it works. The other answers are wrong and caused me hours of lost work.
fun Element.xpath(): String = buildString {
val parents = parents()
for (j in (parents.size - 1) downTo 0) {
val parent = parents[j]
append("/*[")
append(parent.siblingIndex() + 1)
append(']')
}
append("/*[")
append(siblingIndex() + 1)
append(']')
}

Java Jsoup - Element isn't removed from Elements

I will start from beginning, there's html with pattern like this:
<div id="post_message_(some numeric id)">
<div style="some style things">
<div class="smallfont" style="some style">useless text</div>
<table cellpading="6" cellspaceing=.......> a lot of text inside i dont need</table>
</div>
Text i need
</div>
those div's with styles and that table is optional, sometimes there's just
<div id="post">
Text i need
</div>
And i want to parse that text to String. Here;s the code I'm using
Elements divsInside = element.getElementById("post_message_" + id).getElementsByTag("div");
for(Element div : divsInside) {
if(div != null && div.attr("style").equals("margin:20px; margin-top:5px; ")) {
System.out.println(div.html());
div.remove();
System.out.println("div removed");
}
}
I added those print lines to check if it finds them and yes, it does find correct ones, but later when I'm parsing it to String:
String message = Jsoup.parse(divsInside.html().replaceAll("(?i)<br[^>]*>", "br2n")).text()
.replaceAll("br2n", "\n");
String contains all that removed stuff again for some reasons.
I tried removing them by iterators, or making full for and removing elements by indexes, buut the result is the same.
So you want to get Text i need. Use Element's ownText() method which Gets the text owned by this element only; does not get the combined text of all children.
private static void test(String htmlFile) {
File input = null;
Document doc = null;
Element specificIdDiv = null;
try {
input = new File(htmlFile);
doc = Jsoup.parse(input, "ASCII", "");
doc.outputSettings().charset("ASCII");
doc.outputSettings().escapeMode(EscapeMode.base);
/** Get Element id = post_message_1 **/
specificIdDiv = doc.getElementById("post_message_1");
if (specificIdDiv != null ) {
System.out.println("content: " + specificIdDiv.ownText());
}
} catch (Exception e) {
e.printStackTrace();
}
}

using Jsoup to extract a table inside several divs

I am trying to use jsoup so as to have access to a table embedded inside multiple div's of an html page.The table is under the outer division with id "content-top". I will give the inner divs leading to the table: content-top -> center -> middle-right-col -> result .
Under the div result; is table round. This is the table that i want to access and whose rows I need to traverse and print out the data contained in them. Below is the java code I have been trying to use but yielding no results :
Document doc = Jsoup.connect("http://www.calculator.com/#").data("express", "sin(x)").data("calculate","submit").post();
// give the application time to calculate result before retrieving result from results table
try {
Thread.sleep(10000);
}
catch(InterruptedException ex)
{
Thread.currentThread().interrupt();
}
Elements content = doc.select("div#result") ;
Element tables = content.get(0) ;
Elements table_rows = tables.select("tr") ;
Iterator iterRows = table_rows.iterator();
while (iterRows.hasNext()) {
Element tr = (Element)iterRows.next();
Elements table_data = tr.select("td");
Iterator iterData = table_data.iterator();
int tdCount = 0;
String f_x_value = null;
String result = null;
// process new line
while (iterData.hasNext()) {
Element td = (Element)iterData.next();
switch (tdCount++) {
case 1:
f_x_value = td.text();
f_x_value = td.select("a").text();
break;
case 2:
result = td.text();
result = td.select("a").text();
break;
}
}
System.out.println(f_x_value + " " + result ) ;
}
The above code crashes and hardly does what I want it to do. PLEASE CAN ANYONE PLEASE HELP ME !!!
public static String do_conversion (String str)
{
char c;
String output = "{";
for(int i = 0; i < str.length(); i++)
{
c = str.charAt(i);
if(c=='e')
output += "{mathrm{e}}";
else if(c=='(')
output += '{';
else if(c==')')
output += '}';
else if(c=='+')
output += "{cplus}";
else if(c=='-')
output += "{cminus}";
else if(c=='*')
output += "{cdot}";
else if(c=='/')
output += "{cdivide}";
else output += c; // else copy the character normally
}
output += ", mathrm{d}x}";
return output;
}
#Syam S
The page doesnt directly give you a table in a div with id as "result". It uses an ajax class to a php file and get the process done. So what you need to do here is to first build a json like
{"expression":"sin(x)","intVar":"x","upperBound":"","lowerBound":"","simplifyExpressions":false,"latex":"\\displaystyle\\int\\limits^{}_{}{\\sin\\left(x\\right)\\, \\mathrm{d}x}"}
The expression key hold the expression that you want to evaluate, the latex is a mathjax expression and then post it to int.php. This expects two arguments namely q which is the above json and v which seems to a constant value 1380119311. I didnt understand what this is.
Now this will return a response like
<html>
<head></head>
<body>
<table class="round">
<tbody>
<tr class="">
<th>$f(x) =$</th>
<td>$\sin\left(x\right)$</td>
</tr>
<tr class="sep odd">
<th>$\displaystyle\int{f(x)}\, \mathrm{d}x =$</th>
<td>$-\cos\left(x\right)$</td>
</tr>
</tbody>
</table>
<!-- Finished in 155 ms -->
<p id="share"> <img src="layout/32x32xshare.png.pagespeed.ic.i3iroHP5fI.png" width="32" height="32" /> <a id="share-link" href="http://www.integral-calculator.com/#expr=sin%28x%29" onclick="window.prompt("To copy this link to the clipboard, press Ctrl+C, Enter.", $("share-link").href); return false;">Direct link to this calculation (for sharing)</a> </p>
</body>
</html>
The table in this expression gives you the result and the site uses mathjax to display it like
A sample program would be
import java.io.IOException;
import org.apache.commons.lang3.StringEscapeUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class JsoupParser6 {
public static void main(String[] args) {
try {
// Integral
String url = "http://www.integral-calculator.com/int.php";
String q = "{\"expression\":\"sin(4x) * e^(-x)\",\"intVar\":\"x\",\"upperBound\":\"\",\"lowerBound\":\"\",\"simplifyExpressions\":false,\"latex\":\"\\\\displaystyle\\\\int\\\\limits^{}_{}{\\\\sin\\\\left(4x\\\\right){\\\\cdot}{\\\\mathrm{e}}^{-x}\\\\, \\\\mathrm{d}x}\"}";
Document integralDoc = Jsoup.connect(url).data("q", q).data("v", "1380119311").post();
System.out.println(integralDoc);
System.out.println("\n*******************************\n");
//Differential
url = "http://www.derivative-calculator.net/diff.php";
q = "{\"expression\":\"sin(x)\",\"diffVar\":\"x\",\"diffOrder\":1,\"simplifyExpressions\":false,\"showSteps\":false,\"latex\":\"\\\\dfrac{\\\\mathrm{d}}{\\\\mathrm{d}x}\\\\left(\\\\sin\\\\left(x\\\\right)\\\\right)\"}";
Document differentialDoc = Jsoup.connect(url).data("q", q).data("v", "1380119305").post();
System.out.println(differentialDoc);
System.out.println("\n*******************************\n");
//Calculus
url = "http://calculus-calculator.com/calculation/integrate.php";
Document calculusDoc = Jsoup.connect(url).data("expression", "sin(x)").data("intvar", "x").post();
String outStr = StringEscapeUtils.unescapeJava(calculusDoc.toString());
Document formattedOutPut = Jsoup.parse(outStr);
formattedOutPut.body().html(formattedOutPut.select("div.isteps").toString());
System.out.println(formattedOutPut);
} catch (IOException e) {
e.printStackTrace();
}
}
}
Update based on comment.
The unescape works perfectly well. In MathJax you could right click and view the command. So if you go to your site http://calculus-calculator.com/ and try the sin(x) equation there and right click the result and view TexCommand like
The you could see the commands are exactly the ones which we get after unsescape. The demo site is not rendering it. May be a limitation of the demo site, thats all.

JSoup parsing data from within a tag

I am managing to parse most of the data I need except for one as it is contained within the a href tag and I am needing the number that appears after "mmsi="
Sunsail 4013
my current parser fetches all the other data I need and is below. I tried a few things out the code commented out returns unspecified occasionally for an entry. Is there any way I can add to my code below so that when the data is returned the number "235083844" returns before the name "Sunsail 4013"?
try {
File input = new File("shipMove.txt");
Document doc = Jsoup.parse(input, null);
Elements tables = doc.select("table.shipInfo");
for( Element element : tables )
{
Elements tdTags = element.select("td");
//Elements mmsi = element.select("a[href*=/showship.php?mmsi=]");
// Iterate over all 'td' tags found
for( Element td : tdTags ){
// Print it's text if not empty
final String text = td.text();
if( text.isEmpty() == false )
{
System.out.println(td.text());
}
}
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Example of data parsed and html file here
You can use attr on an Element object to retrieve a particular attribute's value
Use substring to get the required value if the String pattern is consistent
Code
// Using just your anchor html tag
String html = "Sunsail 4013";
Document doc = Jsoup.parse(html);
// Just selecting the anchor tag, for your implementation use a generic one
Element link = doc.select("a").first();
// Get the attribute value
String url = link.attr("href");
// Check for nulls here and take the substring from '=' onwards
String id = url.substring(url.indexOf('=') + 1);
System.out.println(id + " "+ link.text());
Gives,
235083844 Sunsail 4013
Modified condition in your for loop from your code:
...
for (Element td : tdTags) {
// Print it's text if not empty
final String text = td.text();
if (text.isEmpty() == false) {
if (td.getElementsByTag("a").first() != null) {
// Get the attribute value
String url = td.getElementsByTag("a").first().attr("href");
// Check for nulls here and take the substring from '=' onwards
String id = url.substring(url.indexOf('=') + 1);
System.out.println(id + " "+ td.text());
}
else {
System.out.println(td.text());
}
}
}
...
The above code would print the desired output.
If you need value of attribute, you should use attr() method.
for( Element td : tdTags ){
Elements aList = td.select("a");
for(Element a : aList){
String val = a.attr("href");
if(StringUrils.isNotBlank(val)){
String yourId = val.substring(val.indexOf("=") + 1);
}
}

how to extract a value of variable from a javascript function in the html page while parsing an html page

I have to parse an html page. I have to extract the value of the name element in the below html which is assigned to a javascript function. How do I do it using JSoup.
<input type="hidden" name="fields.DEPTID.value"/>
JS:
departmentId.onChange = function(value) {
var departmentId = dijit.byId("departmentId");
if (value == null || value == "") {
document.transferForm.elements["fields.DEPTID.value"].value = "";
document.transferForm.elements["fields.DEPTID_DESC.value"].value = "";
} else {
document.transferForm.elements["fields.DEPTID.value"].value = value;
document.transferForm.elements["fields.DEPTID_DESC.value"].value = departmentId.getDisplayedValue();
var locationID = departmentId.store.getValue(departmentId.item, "loctID");
var locationDesc = departmentId.store.getValue(departmentId.item, "loct");
locationComboBox = dijit.byId("locationId");
if (locationComboBox != null) {
if (locationID != "") {
setLocationComboBox(locationID, locationDesc);
} else {
setLocationComboBox("AMFL", "AMFL - AMY FLORIDA");
}
}
}
};
I'll try to teach you form the top:
//Connect to the url, and get its source html
Document doc = Jsoup.connect("url").get();
//Get ALL the elements in the page that meet the query
//you passed as parameter.
//I'm querying for all the script tags that have the
//name attribute inside it
Elements elems = doc.select("script[name]");
//That Elements variable is a collection of
//Element. So now, you'll loop through it, and
//get all the stuff you're looking for
for (Element elem : elems) {
String name = elem.attr("name");
//Now you have the name attribute
//Use it to whatever you need.
}
Now if you want some help with the Jsoup querys to get any other elements you might want, here you go the API documentation to help: Jsoup selector API
Hope that helped =)

Categories