Get all <p> texts after <div> and between <h2> by using Jsoup

<h2><span class="mw-headline" id="The_battle">The battle</span></h2>
<div class="thumb tright"></div>
<p>text I want</p>
<p>text I want</p>
<p>text I want</p>
<p>text I want</p>
<h2>Second Title I want to stop collecting p tags after</h2>
I am learning Jsoup by trying to scrape all the p tags, arranged by title, from a Wikipedia page. I can scrape all the p tags between h2 tags, with the help of this question:
extract unidentified html content from between two tags, using jsoup? regex?
by using
Elements elements = docx.select("span.mw-headline, h2 ~ p");
but I can't scrape them when there is a <div> between them. Here is the Wikipedia page I am working on:
https://simple.wikipedia.org/wiki/Battle_of_Hastings
How can I grab all the p tags that sit between two specific h2 tags?
Preferably ordered by id.

Try this option: Elements elements = doc.select("span.mw-headline, h2 ~ div, h2 ~ p");
Sample code:
package jsoupex;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
/**
 * Example program that prints each section headline of a page, followed by
 * the text of the div and p elements that come after an h2.
 */
public class stackoverflw {
    public static void main(String[] args) throws IOException {
        String url = "https://simple.wikipedia.org/wiki/Battle_of_Hastings";
        System.out.println("Fetching " + url + "...");
        Document doc = Jsoup.connect(url).get();
        // Headline spans plus every div and p that follows an h2 as a sibling
        Elements elements = doc.select("span.mw-headline, h2 ~ div, h2 ~ p");
        for (Element elem : elements) {
            if (elem.hasClass("mw-headline")) {
                System.out.println("************************");
            }
            System.out.println(elem.text());
            if (elem.hasClass("mw-headline")) {
                System.out.println("************************");
            } else {
                System.out.println("");
            }
        }
    }
}
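Note that "h2 ~ p" matches every p that follows any h2 as a sibling, so it does not stop at the next heading. A minimal sketch of collecting only the paragraphs between one h2 and the next is shown here. It assumes the h2, div and p elements are direct siblings, as in the snippet from the question; the class name SectionParagraphs and the method paragraphsAfter are made up for illustration, and a live page may nest its headings differently.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Jsoup.
public class SectionParagraphs {

    // Walks the siblings that follow the given h2 and collects the text of
    // each p, stopping at the next h2. Other siblings (such as div) are skipped.
    static List<String> paragraphsAfter(Element h2) {
        List<String> texts = new ArrayList<>();
        for (Element sib = h2.nextElementSibling();
             sib != null && !sib.tagName().equals("h2");
             sib = sib.nextElementSibling()) {
            if (sib.tagName().equals("p")) {
                texts.add(sib.text());
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        // Assumed markup, modelled on the snippet in the question.
        String html = "<h2><span class=\"mw-headline\" id=\"The_battle\">The battle</span></h2>"
                + "<div class=\"thumb tright\"></div>"
                + "<p>first</p><p>second</p>"
                + "<h2>Second title</h2>"
                + "<p>not collected</p>";
        Document doc = Jsoup.parse(html);
        // The headline span sits inside the h2, so step up to its parent.
        Element heading = doc.select("span#The_battle").first().parent();
        System.out.println(paragraphsAfter(heading));
    }
}
```

Calling paragraphsAfter once per h2 gives the paragraphs grouped by heading, which is the arrangement the question asks for.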

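import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;
// Class name chosen for illustration only
public class TextNodeExample {
    public static void main(String[] args) {
        String entity =
            "<h2><span class=\"mw-headline\" id=\"The_battle\">The battle</span></h2>" +
            "<div class=\"thumb tright\"></div>" +
            "<p>text I want</p>" +
            "<p>text I want</p>" +
            "<p>text I want</p>" +
            "<p>text I want</p>" +
            "<h2>Second Title I want to stop collecting p tags after</h2>";
        // Parse with the XML parser so the fragment is kept as written
        Document element = Jsoup.parse(entity, "", Parser.xmlParser());
        element.outputSettings().prettyPrint(false);
        element.outputSettings().outline(false);
        List<TextNode> text = getAllTextNodes(element);
    }
    // Collects the text nodes of every element in the document
    private static List<TextNode> getAllTextNodes(Element newElementValue) {
        List<TextNode> textNodes = new ArrayList<>();
        Elements elements = newElementValue.getAllElements();
        for (Element e : elements) {
            textNodes.addAll(e.textNodes());
        }
        return textNodes;
    }
}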

Related

How to find specific elements within a larger element in HTML Java

Document doc = Jsoup.parse(url1, 3*1000);
// Note that at this point I have already parsed the HTML, found all the h2 headings and analysed them.
// Now I want to go further and analyse all h4 headings within the h2 section.
String subHead = "A h2 heading";
print("Printing h4 titles of : " + subHead);
Elements sibHead; // variable that stores all elements between this h2 title and the next
String bodySelect = "h2";
Elements kpageE = doc.select(bodySelect);
for (Element e : kpageE) {
    String estring = e.text();
    print(estring + "--------------------------------------------");
    if (estring.contentEquals(subHead)) {
        sibHead = e.nextElementSiblings(); // this prints all elements in the h2 title section, but I want only the h4 titles
        for (Element ei : sibHead) {
            String eistr = ei.text();
            print(eistr);
        }
    }
}
I have already parsed the HTML and have a list of all h2 elements. Now I want specific elements between one h2 element and the next; more specifically, I want all h4 elements.

With Jsoup you can use the getElementsByTag method, which Document inherits from Element. It retrieves all the elements with a given tag name.
Here is an example of its use:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class App {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://inscription.devlab.umontp.fr/").get();
            Elements h4elements = doc.getElementsByTag("h4");
            for (Element h4 : h4elements) {
                System.out.println(h4.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
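getElementsByTag("h4") returns every h4 in the document, not only those inside one h2 section. A rough sketch of restricting the search to a single section follows. It assumes the section content sits in siblings of the h2, possibly wrapped in containers; SectionHeadings and h4TitlesUnder are hypothetical names used only for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Jsoup.
public class SectionHeadings {

    // Returns the text of every h4 found between the h2 whose text equals
    // h2Title and the next h2. An h4 nested inside a sibling container is
    // included, because select() matches the element itself and its descendants.
    static List<String> h4TitlesUnder(Document doc, String h2Title) {
        List<String> titles = new ArrayList<>();
        for (Element h2 : doc.select("h2")) {
            if (!h2.text().equals(h2Title)) {
                continue;
            }
            for (Element sib = h2.nextElementSibling();
                 sib != null && !sib.tagName().equals("h2");
                 sib = sib.nextElementSibling()) {
                for (Element h4 : sib.select("h4")) {
                    titles.add(h4.text());
                }
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        // Assumed markup for illustration.
        String html = "<h2>A h2 heading</h2><h4>first</h4><div><h4>nested</h4></div>"
                + "<h2>Another heading</h2><h4>other section</h4>";
        Document doc = Jsoup.parse(html);
        System.out.println(h4TitlesUnder(doc, "A h2 heading"));
    }
}
```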

Select a particular HTML table with JSOUP

My code is:
public static void main(String[] args) throws IOException {
    org.jsoup.nodes.Document doc = Jsoup.connect("https://ms.wikipedia.org/wiki/Malaysia").get();
    org.jsoup.select.Elements rows = doc.select("tr");
    for (org.jsoup.nodes.Element row : rows) {
        org.jsoup.select.Elements columns = row.select("td");
        for (org.jsoup.nodes.Element column : columns) {
            System.out.print(column.text());
        }
        System.out.println();
    }
}
It prints all the table rows on the web page. Is it possible to print only one selected table from the page?

Try to select a particular table element first, and then loop over its nested elements.
public static void main(String[] args) throws IOException {
    Document doc = Jsoup.connect("https://ms.wikipedia.org/wiki/Malaysia").get();
    // get(1) picks the second table with class "wikitable" on the page
    Element table = doc.select("table.wikitable").get(1);
    Elements body = table.select("tbody");
    Elements rows = body.select("tr");
    for (Element row : rows) {
        System.out.print(row.select("th").text());
        System.out.print(row.select("td").text());
        System.out.println();
    }
}
Output:
Ibu negaraKuala Lumpur
Pusat pentadbiranPutrajaya
Tarikh Hari Kebangsaan31 Ogos 1957
Cogan Kata NegaraBersekutu Bertambah Mutu
BenuaAsia, Asia Tenggara
Koordinat Geografi2 30 U, 112 30 T
Jumlah hujan tahunan2000mm ~ 2500mm
IklimTropika dengan suhu 24–35 Darjah Celsius
Bunga kebangsaanBunga Raya
Binatang rasmiHarimau
Puncak tertinggiGunung Kinabalu, Banjaran Crocker (4175m)
Puncak tertinggi SemenanjungGunung Tahan, Banjaran Tahan (2187 m)
Banjaran terpanjangBanjaran Titiwangsa (500 km)
Sungai terpanjangSungai Rajang, Sarawak (563 km)
Sungai terpanjang di SemenanjungSungai Pahang (475 km)
Jambatan terpanjangJambatan Pulau Pinang (13.5 km)
Gua terbesarGua Niah, Sarawak
Bangunan tertinggiMenara Berkembar Petronas (452m)
Negeri terbesarSarawak (124,450 km persegi)
Negeri terkecilPerlis (810 km persegi)
Tempat paling lembapBukit Larut (lebih 5080 mm)
Tempat paling keringJelebu (kurang daripada 1500 mm)
Kawasan paling padatKuala Lumpur (6074/km², 15,543/batu persegi)
Penanaman eksport utamaKelapa sawit dan getah
Read more in the Jsoup documentation.

The best way to do this is to grab the table by its title. Since the title is embedded in a cousin element of the table, and CSS has no parent selector, you can use a combination of CSS and Jsoup API calls to achieve this.
public static void main(String[] args) throws IOException {
    Document doc = Jsoup.connect("https://ms.wikipedia.org/wiki/Malaysia").get();
    // Go from the title span up to its parent, then across to the table that follows it
    Element table = doc.select("span#Trivia").parents().first().nextElementSibling();
    Elements rows = table.select("tr");
    for (Element row : rows) {
        String header = row.select("th").text();
        String value = row.select("td").text();
        System.out.println(header + ": " + value);
    }
}

Jsoup casting Element as TextNode causes exception

What I am trying to parse using Jsoup is the following: "Numéro d'arrêt : 5216" and "Numéro NOR : 63066", but nothing seems to work. Any advice and suggestions will be greatly appreciated:
`
<div class="1st">
<p>
<h3>Numérotation : </h3>
Numéro d'arrêt : 5216
<br />
Numéro NOR : 63066
</div>
`
UPDATE:
I got this code to run, but it keeps throwing this exception: org.jsoup.nodes.Element cannot be cast to org.jsoup.nodes.TextNode
Document tunisie = Jsoup.connect("http://www.juricaf.org/arret/TUNISIE-COURDECASSATION-20060126-5216").get();
for (Element titres : tunisie.select("div.arret")) {
    String titre = titres.select("h1#titre").first().text();
    System.out.println(titre);
    System.out.println("\n");
}
for (Element node : tunisie.select("h3")) {
    TextNode numérodarrêt = (TextNode) node.nextSibling();
    System.out.println(" " + numérodarrêt.text());
    System.out.println("\n");
}
// NuméroNOR et Identifiant URN LEX
for (Element element2 : tunisie.select("br")) {
    TextNode NuméroNOR_IdentifiantURNLEX = (TextNode) element2.nextSibling();
    System.out.println(" " + NuméroNOR_IdentifiantURNLEX.text());
    System.out.println("\n");
}
UPDATE:
Here is what I am trying to parse; the image link is below.
Parsing text outside html tags

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class JsoupTest {
    public static void main(String[] args) throws IOException {
        Document tunisie = Jsoup.connect("http://www.juricaf.org/arret/TUNISIE-COURDECASSATION-20060126-5216").get();
        // get the first div with class arret
        Element arret = tunisie.select("div.arret").first();
        // select the h1 tag by its ID to get the title
        String titre = arret.select("#titre").text();
        System.out.println(titre);
        // to get the text after h3, select h3 and go to its next sibling
        String txtAfterFirstH3 = arret.select("h3").first().nextSibling().toString();
        System.out.println(txtAfterFirstH3);
        // select the first br by its index (the first br has index 0) and call nextSibling to get the text after it
        String txtAfterFirstBr = arret.getElementsByTag("br").get(0).nextSibling().toString();
        System.out.println(txtAfterFirstBr);
        // the same as above, only with the next index
        String txtAfterSecondBr = arret.getElementsByTag("br").get(1).nextSibling().toString();
        System.out.println(txtAfterSecondBr);
    }
}

String html = "<div class=\"1st\">\n" +
"<p>\n" +
"<h3>Numérotation : </h3>\n" +
"Numéro d'arrêt : 5216\n" +
"<br />\n" +
"Numéro NOR : 63066\n" +
"</div>";
Document doc = Jsoup.parse(html);
Elements divs = doc.select("div.1st");
for(Element e : divs){
System.out.println(e.ownText());
}
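The ClassCastException in the question occurs whenever the node after an h3 or br is another element rather than text. One way to avoid it, sketched here under the assumption that only the text directly following each tag is wanted, is to check the sibling with instanceof before casting. TextAfterTags and textAfter are hypothetical names, and the sample markup is a simplified, ASCII-only version of the snippet in the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Jsoup.
public class TextAfterTags {

    // For each element matching cssQuery, returns the text node that directly
    // follows it. Siblings that are elements, or missing, are skipped instead
    // of being cast, so no ClassCastException can occur.
    static List<String> textAfter(Document doc, String cssQuery) {
        List<String> out = new ArrayList<>();
        for (Element el : doc.select(cssQuery)) {
            Node next = el.nextSibling();
            if (next instanceof TextNode) {
                String text = ((TextNode) next).text().trim();
                if (!text.isEmpty()) {
                    out.add(text);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed markup: simplified from the question, accents dropped.
        String html = "<div class=\"arret\">\n"
                + "<h3>Numerotation : </h3>\n"
                + "Numero d'arret : 5216\n"
                + "<br />\n"
                + "Numero NOR : 63066\n"
                + "</div>";
        Document doc = Jsoup.parse(html);
        for (String line : textAfter(doc, "h3, br")) {
            System.out.println(line);
        }
    }
}
```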

using Jsoup to extract a table inside several divs

I am trying to use Jsoup to access a table embedded inside multiple divs of an HTML page. The table is under the outer div with id "content-top". The inner divs leading to the table are: content-top -> center -> middle-right-col -> result.
Under the div "result" is a table with class "round". This is the table that I want to access and whose rows I need to traverse, printing out the data they contain. Below is the Java code I have been trying to use, but it yields no results:
Document doc = Jsoup.connect("http://www.calculator.com/#").data("express", "sin(x)").data("calculate","submit").post();
// give the application time to calculate result before retrieving result from results table
try {
    Thread.sleep(10000);
} catch (InterruptedException ex) {
    Thread.currentThread().interrupt();
}
Elements content = doc.select("div#result");
Element tables = content.get(0);
Elements table_rows = tables.select("tr");
Iterator iterRows = table_rows.iterator();
while (iterRows.hasNext()) {
    Element tr = (Element) iterRows.next();
    Elements table_data = tr.select("td");
    Iterator iterData = table_data.iterator();
    int tdCount = 0;
    String f_x_value = null;
    String result = null;
    // process new line
    while (iterData.hasNext()) {
        Element td = (Element) iterData.next();
        switch (tdCount++) {
            case 1:
                f_x_value = td.text();
                f_x_value = td.select("a").text();
                break;
            case 2:
                result = td.text();
                result = td.select("a").text();
                break;
        }
    }
    System.out.println(f_x_value + " " + result);
}
The above code crashes and does not do what I want it to do. Can anyone please help me?

public static String do_conversion(String str) {
    char c;
    String output = "{";
    for (int i = 0; i < str.length(); i++) {
        c = str.charAt(i);
        if (c == 'e')
            output += "{mathrm{e}}";
        else if (c == '(')
            output += '{';
        else if (c == ')')
            output += '}';
        else if (c == '+')
            output += "{cplus}";
        else if (c == '-')
            output += "{cminus}";
        else if (c == '*')
            output += "{cdot}";
        else if (c == '/')
            output += "{cdivide}";
        else
            output += c; // else copy the character normally
    }
    output += ", mathrm{d}x}";
    return output;
}
@Syam S

The page doesn't directly give you a table in a div with the id "result". It makes an ajax call to a PHP file to get the processing done. So what you need to do here is first build a JSON string like
{"expression":"sin(x)","intVar":"x","upperBound":"","lowerBound":"","simplifyExpressions":false,"latex":"\\displaystyle\\int\\limits^{}_{}{\\sin\\left(x\\right)\\, \\mathrm{d}x}"}
The expression key holds the expression that you want to evaluate, and latex is the corresponding MathJax expression. You then post it to int.php. This expects two arguments: q, which is the above JSON, and v, which seems to be a constant value, 1380119311. I didn't understand what this is.
Now this will return a response like
<html>
<head></head>
<body>
<table class="round">
<tbody>
<tr class="">
<th>$f(x) =$</th>
<td>$\sin\left(x\right)$</td>
</tr>
<tr class="sep odd">
<th>$\displaystyle\int{f(x)}\, \mathrm{d}x =$</th>
<td>$-\cos\left(x\right)$</td>
</tr>
</tbody>
</table>
<!-- Finished in 155 ms -->
<p id="share"> <img src="layout/32x32xshare.png.pagespeed.ic.i3iroHP5fI.png" width="32" height="32" /> <a id="share-link" href="http://www.integral-calculator.com/#expr=sin%28x%29" onclick="window.prompt("To copy this link to the clipboard, press Ctrl+C, Enter.", $("share-link").href); return false;">Direct link to this calculation (for sharing)</a> </p>
</body>
</html>
The table in this response gives you the result, and the site uses MathJax to render it.
A sample program would be
import java.io.IOException;
import org.apache.commons.lang3.StringEscapeUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class JsoupParser6 {
    public static void main(String[] args) {
        try {
            // Integral
            String url = "http://www.integral-calculator.com/int.php";
            String q = "{\"expression\":\"sin(4x) * e^(-x)\",\"intVar\":\"x\",\"upperBound\":\"\",\"lowerBound\":\"\",\"simplifyExpressions\":false,\"latex\":\"\\\\displaystyle\\\\int\\\\limits^{}_{}{\\\\sin\\\\left(4x\\\\right){\\\\cdot}{\\\\mathrm{e}}^{-x}\\\\, \\\\mathrm{d}x}\"}";
            Document integralDoc = Jsoup.connect(url).data("q", q).data("v", "1380119311").post();
            System.out.println(integralDoc);
            System.out.println("\n*******************************\n");
            // Differential
            url = "http://www.derivative-calculator.net/diff.php";
            q = "{\"expression\":\"sin(x)\",\"diffVar\":\"x\",\"diffOrder\":1,\"simplifyExpressions\":false,\"showSteps\":false,\"latex\":\"\\\\dfrac{\\\\mathrm{d}}{\\\\mathrm{d}x}\\\\left(\\\\sin\\\\left(x\\\\right)\\\\right)\"}";
            Document differentialDoc = Jsoup.connect(url).data("q", q).data("v", "1380119305").post();
            System.out.println(differentialDoc);
            System.out.println("\n*******************************\n");
            // Calculus
            url = "http://calculus-calculator.com/calculation/integrate.php";
            Document calculusDoc = Jsoup.connect(url).data("expression", "sin(x)").data("intvar", "x").post();
            String outStr = StringEscapeUtils.unescapeJava(calculusDoc.toString());
            Document formattedOutPut = Jsoup.parse(outStr);
            formattedOutPut.body().html(formattedOutPut.select("div.isteps").toString());
            System.out.println(formattedOutPut);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Update based on comment.
The unescape works perfectly well. In MathJax you can right-click and view the command. So if you go to your site http://calculus-calculator.com/, try the sin(x) equation there, right-click the result and view the TeX command,
then you can see that the commands are exactly the ones we get after the unescape. The demo site is not rendering it. It may be a limitation of the demo site, that's all.

Jsoup selecting and replacing multiple <a> elements

So I am just trying out the Jsoup API and have a simple question. I have a string and would like to keep it intact except for what my method changes. I want the string to pass through this method and have the <a> elements that wrap the links taken out. Right now I have:
public class jsTesting {
    public static void main(String[] args) {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link and after that is a second link called <a href='http://example2.com/'><b>example2</b></a></p>";
        Elements select = Jsoup.parse(html).select("a");
        String linkHref = select.attr("href");
        System.out.println(linkHref);
    }
}
This returns only the first URL, unwrapped. I would like all URLs unwrapped, as well as the original string. Thanks in advance.
EDIT: SOLUTION:
Thanks a lot for the answer. I edited it only slightly to get the results I wanted. Here is the full solution that I am using:
public class jsTesting {
    public static void main(String[] args) {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link and after that is a second link called <a href='http://example2.com/'><b>example2</b></a></p>";
        Document doc = Jsoup.parse(html);
        // unwrap() acts on every matched element, so a single call is enough
        doc.select("a[href]").unwrap();
        System.out.println(doc.text());
    }
}
Thanks again

Here's the corrected code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class jsTesting {
    public static void main(String[] args) {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link and after that is a second link called <a href='http://example2.com/'><b>example2</b></a></p>";
        Elements links = Jsoup.parse(html).select("a[href]"); // every a element that has an href
        for (Element link : links) {
            // Do whatever you want here
            System.out.println("Link Attr : " + link.attr("abs:href"));
            System.out.println("Link Text : " + link.text());
        }
    }
}
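The question asks for both things at once: every URL, and the original markup with the <a> wrappers removed. A minimal sketch that combines the two follows. It assumes the hrefs should be read before the links are unwrapped, and UnwrapLinks and unwrapLinks are hypothetical names for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Jsoup.
public class UnwrapLinks {

    // Records the href of every link, then removes the <a> wrappers while
    // keeping their children (here the <b> elements) in place.
    static List<String> unwrapLinks(Document doc) {
        List<String> hrefs = new ArrayList<>();
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            hrefs.add(link.attr("href"));
        }
        links.unwrap();
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link and after that is a second link called <a href='http://example2.com/'><b>example2</b></a></p>";
        Document doc = Jsoup.parse(html);
        System.out.println(unwrapLinks(doc));
        // The markup that remains, without the <a> elements
        System.out.println(doc.body().html());
    }
}
```

Reading the attributes first matters because an unwrapped element is detached from the document, so selecting "a[href]" afterwards finds nothing.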