I am trying to parse the following with jsoup: Numéro d'arrêt : 5216 and Numéro NOR : 63066, but nothing seems to work. Any advice or suggestions would be greatly appreciated:
`
<div class="1st">
<p>
<h3>Numérotation : </h3>
Numéro d'arrêt : 5216
<br />
Numéro NOR : 63066
</div>
`
UPDATE:
I got this code to run, but it keeps throwing this exception: org.jsoup.nodes.Element cannot be cast to org.jsoup.nodes.TextNode:
Document tunisie = Jsoup.connect("http://www.juricaf.org/arret/TUNISIE-COURDECASSATION-20060126-5216").get();
for (Element titres : tunisie.select("div.arret")){
String titre = titres.select("h1#titre").first().text();
System.out.println(titre);
System.out.println("\n");
}
for (Element node : tunisie.select("h3")) {
TextNode numérodarrêt = (TextNode) node.nextSibling();
System.out.println(" " + numérodarrêt.text());
System.out.println("\n");
}
//NuméroNOR et Identifiant URN LEX
for (Element element2 : tunisie.select("br")) {
TextNode NuméroNOR_IdentifiantURNLEX = (TextNode) element2.nextSibling();
System.out.println(" " + NuméroNOR_IdentifiantURNLEX.text());
System.out.println("\n");
}
UPDATE:
Here is what I am trying to parse; the image link is below.
Parsing text outside html tags
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class JsoupTest {
public static void main(String[] args) throws IOException {
Document tunisie = Jsoup.connect("http://www.juricaf.org/arret/TUNISIE-COURDECASSATION-20060126-5216").get();
// get the first div in class arret
Element arret = tunisie.select("div.arret").first();
// select h1 tag by its ID to get the title
String titre = arret.select("#titre").text();
System.out.println(titre);
// to get the text after h3 select h3 and go to next sibling
String txtAfterFirstH3 = arret.select("h3").first().nextSibling().toString();
System.out.println(txtAfterFirstH3);
// select first br by its index; note first br has the index 0; and call nextSibling to get the text after the br tag
String txtAfterFirstBr = arret.getElementsByTag("br").get(0).nextSibling().toString();
System.out.println(txtAfterFirstBr);
// the same as above only with next index
String txtAfterSecondBr = arret.getElementsByTag("br").get(1).nextSibling().toString();
System.out.println(txtAfterSecondBr);
}
}
String html = "<div class=\"1st\">\n" +
"<p>\n" +
"<h3>Numérotation : </h3>\n" +
"Numéro d'arrêt : 5216\n" +
"<br />\n" +
"Numéro NOR : 63066\n" +
"</div>";
Document doc = Jsoup.parse(html);
Elements divs = doc.select("div.1st");
for(Element e : divs){
System.out.println(e.ownText());
}
<h2><span class="mw-headline" id="The_battle">The battle</span></h2>
<div class="thumb tright"></div>
<p>text I want</p>
<p>text I want</p>
<p>text I want</p>
<p>text I want</p>
<h2>Second Title I want to stop collecting p tags after</h2>
I am learning Jsoup by trying to scrape all the p tags, grouped by title, from a Wikipedia page. I can scrape all the p tags between h2 headings with the help of this question:
extract unidentified html content from between two tags, using jsoup? regex?
by using
Elements elements = docx.select("span.mw-headline, h2 ~ p");
but I can't scrape them when there is a <div> between them. Here is the Wikipedia page I am working on:
https://simple.wikipedia.org/wiki/Battle_of_Hastings
How can I grab all the p tags where they are between two specific h2 tags?
Preferably ordered by id.
Try this option: Elements elements = doc.select("span.mw-headline, h2 ~ div, h2 ~ p");
Sample code:
package jsoupex;
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
/**
* Example program to print each headline and the text that follows it.
*/
public class stackoverflw {
public static void main(String[] args) throws IOException {
//Validate.isTrue(args.length == 1, "usage: supply url to fetch");
//String url = "http://localhost/stov_wiki.html";
String url = "https://simple.wikipedia.org/wiki/Battle_of_Hastings";
//args[0];
System.out.printf("Fetching %s...%n", url);
Document doc = Jsoup.connect(url).get();
Elements elements = doc.select("span.mw-headline, h2 ~ div, h2 ~ p");
for (Element elem : elements) {
if ( elem.hasClass("mw-headline")) {
System.out.println("************************");
}
System.out.println(elem.text());
if ( elem.hasClass("mw-headline")) {
System.out.println("************************");
} else {
System.out.println("");
}
}
}
}
public static void main(String[] args) {
String entity =
"<h2><span class=\"mw-headline\" id=\"The_battle\">The battle</span></h2>" +
"<div class=\"thumb tright\"></div>" +
"<p>text I want</p>" +
"<p>text I want</p>" +
"<p>text I want</p>" +
"<p>text I want</p>" +
"<h2>Second Title I want to stop collecting p tags after</h2>";
Document element = org.jsoup.Jsoup.parse(entity,"", Parser.xmlParser());
element.outputSettings().prettyPrint(false);
element.outputSettings().outline(false);
List<TextNode>text=getAllTextNodes(element);
}
private static List<TextNode> getAllTextNodes(Element newElementValue) {
List<TextNode>textNodes = new ArrayList<>();
Elements elements = newElementValue.getAllElements();
for (Element e : elements){
for (TextNode t : e.textNodes()){
textNodes.add(t);
}
}
return textNodes;
}
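For the original goal of grabbing only the p tags that sit between two specific h2 headings, one option is to walk the siblings that follow the first heading and stop at the next one, which also steps over any div in between. Below is a minimal sketch of that walk. To keep it runnable without extra dependencies it uses the JDK's built-in DOM parser on a small well-formed snippet instead of jsoup; the class name ParagraphsUnderHeading, the helper names and the sample markup are made up for illustration. In jsoup the same loop could presumably be written with nextElementSibling() and tagName().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ParagraphsUnderHeading {

    // Collects the text of every <p> sibling that follows the given heading,
    // stopping at the next sibling with the same tag name as the heading.
    static List<String> paragraphsAfter(Element heading) {
        List<String> result = new ArrayList<>();
        for (Node n = heading.getNextSibling(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip text nodes and comments between the elements
            }
            String tag = n.getNodeName();
            if (tag.equals(heading.getNodeName())) {
                break; // reached the next heading
            }
            if (tag.equals("p")) {
                result.add(n.getTextContent());
            }
        }
        return result;
    }

    // Hypothetical helper: parses a well-formed snippet with the JDK parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("snippet is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<body>"
                + "<h2><span id=\"The_battle\">The battle</span></h2>"
                + "<div class=\"thumb tright\"></div>"
                + "<p>first</p><p>second</p>"
                + "<h2>Second title</h2>"
                + "<p>not collected</p>"
                + "</body>";
        Element firstH2 = (Element) parse(xml).getElementsByTagName("h2").item(0);
        System.out.println(paragraphsAfter(firstH2)); // prints [first, second]
    }
}
```

Because the walk only looks at direct siblings of the heading, it assumes the h2 and p elements share the same parent, as they do in the sample above.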
Hi, this is my web page:
<html>
<head>
</head>
<body>
<div> text div 1</div>
<div>
<span>text of first span </span>
<span>text of second span </span>
</div>
<div> text div 3 </div>
</body>
</html>
I'm using jsoup to parse it, then browse all the elements inside the page and get their paths:
Document doc = Jsoup.parse(new File("C:\\Users\\HC\\Desktop\\dataset\\index.html"), "UTF-8");
Elements elements = doc.body().select("*");
ArrayList all = new ArrayList();
for (Element element : elements) {
if (!element.ownText().isEmpty()) {
StringBuilder path = new StringBuilder(element.nodeName());
String value = element.ownText();
Elements p_el = element.parents();
for (Element el : p_el) {
path.insert(0, el.nodeName() + '/');
}
all.add(path + " = " + value + "\n");
System.out.println(path +" = "+ value);
}
}
return all;
My code gives me this result:
html/body/div = text div 1
html/body/div/span = text of first span
html/body/div/span = text of second span
html/body/div = text div 3
In fact, I want to get a result like this:
html/body/div[1] = text div 1
html/body/div[2]/span[1] = text of first span
html/body/div[2]/span[2] = text of second span
html/body/div[3] = text div 3
Could anyone give me an idea of how to reach this result? Thanks in advance.
As asked, here is an idea.
I'm quite sure there are better ways to get the XPath for a given node, for example using XSLT as in the answer to "Generate/get xpath from XML node java".
Here is a possible solution based on your current attempt.
For each (parent) element, check whether there is more than one element with this name.
Pseudo code: if (count(el.select('../' + el.nodeName())) > 1)
If true, count the preceding-sibling:: elements with the same name and add 1.
count(el.select('preceding-sibling::' + el.nodeName())) + 1
This is my solution to this problem:
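StringBuilder absPath = new StringBuilder();
Elements parents = htmlElement.parents();
for (int j = parents.size() - 1; j >= 0; j--) {
    Element element = parents.get(j);
    // XPath positions are 1-based and count only element siblings with the same tag name;
    // siblingIndex() is 0-based and also counts text nodes, so count the siblings explicitly.
    int position = 1;
    for (Element prev = element.previousElementSibling(); prev != null; prev = prev.previousElementSibling()) {
        if (prev.tagName().equals(element.tagName())) {
            position++;
        }
    }
    absPath.append("/").append(element.tagName()).append("[").append(position).append("]");
}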
This would be easier if you traversed the document from the root to the leaves instead of the other way round. That way you can group the elements by tag name and handle multiple occurrences accordingly. Here is a recursive approach:
private final List<String> path = new ArrayList<>();
private final List<String> all = new ArrayList<>();
public List<String> getAll() {
return Collections.unmodifiableList(all);
}
public void parse(Document doc) {
path.clear();
all.clear();
parse(doc.children());
}
private void parse(List<Element> elements) {
if (elements.isEmpty()) {
return;
}
Map<String, List<Element>> grouped = elements.stream().collect(Collectors.groupingBy(Element::tagName));
for (Map.Entry<String, List<Element>> entry : grouped.entrySet()) {
List<Element> list = entry.getValue();
String key = entry.getKey();
if (list.size() > 1) {
int index = 1;
// use paths with index
key += "[";
for (Element e : list) {
path.add(key + (index++) + "]");
handleElement(e);
path.remove(path.size() - 1);
}
} else {
// use paths without index
path.add(key);
handleElement(list.get(0));
path.remove(path.size() - 1);
}
}
}
private void handleElement(Element e) {
String value = e.ownText();
if (!value.isEmpty()) {
// add entry
all.add(path.stream().collect(Collectors.joining("/")) + " = " + value);
}
// process children of element
parse(e.children());
}
Here is the solution in Kotlin. It's correct, and it works. The other answers are wrong and caused me hours of lost work.
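fun Element.xpath(): String = buildString {
    val parents = parents()
    for (j in (parents.size - 1) downTo 0) {
        val parent = parents[j]
        append("/*[")
        // elementSiblingIndex() counts element siblings only, which is what /*[n] selects;
        // siblingIndex() would also count text nodes
        append(parent.elementSiblingIndex() + 1)
        append(']')
    }
    append("/*[")
    append(elementSiblingIndex() + 1)
    append(']')
}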
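To make the counting rule from the answers above concrete (an index is added only when a tag name occurs more than once among its siblings, and it is 1-based and counts same-named element siblings only), here is a small sketch that reproduces the requested output for the sample page. It is only an illustration under assumptions: it uses the JDK's built-in DOM parser rather than jsoup so that it runs without extra dependencies, it relies on the sample page being well-formed, and the class and method names (IndexedPaths, step, path, describe) are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class IndexedPaths {

    // Returns "name" or "name[i]": i is 1-based, counts only siblings with the
    // same tag name, and is added only when there is more than one of them.
    static String step(Element el) {
        int position = 0;
        int total = 0;
        for (Node n = el.getParentNode().getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && n.getNodeName().equals(el.getNodeName())) {
                total++;
                if (n.isSameNode(el)) {
                    position = total;
                }
            }
        }
        return total > 1 ? el.getNodeName() + "[" + position + "]" : el.getNodeName();
    }

    // Joins the steps of all ancestors and of the element itself with "/".
    static String path(Element el) {
        StringBuilder sb = new StringBuilder(step(el));
        for (Node p = el.getParentNode(); p instanceof Element; p = p.getParentNode()) {
            sb.insert(0, step((Element) p) + "/");
        }
        return sb.toString();
    }

    // Text held directly by the element, ignoring the text of child elements.
    static String ownText(Element el) {
        StringBuilder sb = new StringBuilder();
        for (Node n = el.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.TEXT_NODE) {
                sb.append(n.getNodeValue());
            }
        }
        return sb.toString().trim();
    }

    // Depth-first walk that records "path = text" for elements with own text.
    static void walk(Element el, List<String> out) {
        String text = ownText(el);
        if (!text.isEmpty()) {
            out.add(path(el) + " = " + text);
        }
        for (Node n = el.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                walk((Element) n, out);
            }
        }
    }

    static List<String> describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> out = new ArrayList<>();
            walk(doc.getDocumentElement(), out);
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("input is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head></head><body>"
                + "<div> text div 1</div>"
                + "<div><span>text of first span </span><span>text of second span </span></div>"
                + "<div> text div 3 </div>"
                + "</body></html>";
        describe(page).forEach(System.out::println);
    }
}
```

Unlike the grouping approach, this walk keeps the entries in document order, at the cost of rescanning the siblings for every element.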
Am I missing something? Is there a better way to do this?
INPUT:
<span style="FONT-FAMILY: 'Lucida Sans','sans-serif'; COLOR: #003572; FONT-SIZE: 9pt;
mso-fareast-font-family: Calibri; mso-ansi-language: EN-US; mso-fareast-language: EN-US;
mso-bidi-language: AR-SA; mso-fareast-theme-font: minor-latin">Dr. Who is
<u>usually</u> available for consultations Mon - Thurs afternoons and Friday 9a-
12p at 555-1212. </span>
DESIRED OUTPUT:
<span style="COLOR: #003572; FONT-SIZE: 9pt;">Dr. Who is
<u>usually</u> available for consultations Mon - Thurs
afternoons and Friday 9a-12p at 555-1212. </span>
MY CODE SO FAR:
//cleans the HTML within the Week Long note before writing to the DB
Whitelist wl = new Whitelist();
wl = Whitelist.simpleText();
wl.addTags("br");
wl.addTags("p");
wl.addTags("span");
wl.addAttributes(":all","style");
Document doc =
Jsoup.parse(
"<html><head></head><body>"+ds.getWeeklongNote()+"</body></html>");
Elements e = doc.select("*");
for (Element el : e){
for (Attribute attr : el.attributes()){
if (attr.getKey().equals("span")){
String newValue = "";
String s = attr.getValue();
String[] values = s.split(";");
for (String value : values){
if (value.startsWith("COLOR")||value.startsWith("FONT-SIZE")){
newValue += attr.getKey()+"="+attr.getValue()+";";
}
}
attr.setValue(newValue);
}
}
}
doc.html(e.outerHtml());
ds.setWeekLongNote(Jsoup.clean(doc.body().outerHtml(), wl));
Try this:
Document doc = Jsoup.parse(html);
Elements e = doc.getElementsByTag("body");
Log.i("Span element: "+e.get(0).nodeName(), ""+e.get(0).nodeName());
e = e.get(0).getElementsByTag("span");
Attributes styleAtt = e.get(0).attributes();
Attribute a = styleAtt.asList().get(0);
if(a.getKey().equals("style")){
String[] items = a.getValue().trim().split(";");
String newValue = "";
for(String item: items){
if(item.contains("COLOR:")||item.contains("FONT-SIZE:")){
Log.i("Style Item: ", ""+item);
newValue = newValue.concat(item).concat(";");
}
}
a.setValue(newValue);
Log.i("New Attribute: ",""+newValue);
}
Log.i("FINAL HTML: ",""+e.outerHtml());
doc.html(e.outerHtml());
}
Output:
08-17 18:28:07.692: I/FINAL HTML:(8148): <span style=" COLOR: #003572; FONT-SIZE: 9pt;">Dr. Who is <u>usually</u> available for consultations Mon - Thurs afternoons and Friday 9a- 12p at 555-1212. </span>
Cheers,
If you have more than one span element you can use this code snippet:
Document document = Jsoup.parse(html);
Vector<String> allowedItems = new Vector<String>();
allowedItems.add("color");
allowedItems.add("font-size");
Elements e = document.getElementsByTag("span");
for (Element element : e) {
String[] styles = element.attr("style").split(";");
Vector<String> filteredItems = new Vector<String>();
for (String item : styles) {
String key = (item.split(":"))[0].trim().toLowerCase();
if ( allowedItems.contains(key) ){
filteredItems.add(item);
}
}
if( filteredItems.size() == 0 ){
element.removeAttr("style");
}else{
element.attr("style",StringUtils.join(filteredItems, ";"));
}
}
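The filtering step itself does not depend on jsoup, so it can be pulled out into a plain string function and tried on its own. The following is a rough sketch of that idea, not a full CSS parser: it assumes no declaration value contains a semicolon, and the names StyleFilter and filter are made up for illustration. The returned string is what would be passed back to element.attr("style", ...), or the attribute removed when the result is empty, as in the snippet above.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.StringJoiner;

class StyleFilter {

    // Keeps only the declarations whose property name is in `allowed`
    // (compared case-insensitively) and joins them with "; ".
    static String filter(String style, Set<String> allowed) {
        StringJoiner kept = new StringJoiner("; ");
        for (String declaration : style.split(";")) {
            int colon = declaration.indexOf(':');
            if (colon < 0) {
                continue; // empty or malformed declaration
            }
            String property = declaration.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            if (allowed.contains(property)) {
                kept.add(declaration.trim());
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        Set<String> allowed = new LinkedHashSet<>(Arrays.asList("color", "font-size"));
        String style = "FONT-FAMILY: 'Lucida Sans','sans-serif'; COLOR: #003572; FONT-SIZE: 9pt; "
                + "mso-fareast-font-family: Calibri; mso-ansi-language: EN-US";
        System.out.println(filter(style, allowed)); // prints COLOR: #003572; FONT-SIZE: 9pt
    }
}
```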
//remove style attribute
Elements elms = doc.select("*").not("img");
for (Element e : elms) {
// attr() returns "" (never null) when the attribute is absent, so test with hasAttr()
if (e.hasAttr("style")) {
e.removeAttr("style");
}
}
I use Jsoup to get the elements from a web page:
Elements addresses = doc.select("address > div");
and the result is like this:
<address>
<div>
7135 S Kingery Hwy<br>Willowbrook, IL 60527
</div>
<div class="phone">
(630) 288-6635
</div>
</address>
I am having a hard time retrieving the address from the tag. I use the text() method:
for (Element address : addresses) {
Log.i("addresses", address.text() );
}
and the result is:
7135 S Kingery Hwy Willowbrook, IL 60527
(630) 288-6635
How can I filter it to retrieve only the address and also replace the br tag with a newline? Expected result:
7135 S Kingery Hwy
Willowbrook, IL 60527
You can try this:
Elements addresses = doc.select("address > :not(div[class=phone])");
for (Element address : addresses) {
for (Node node : address.childNodes()) {
if (node.nodeName().equals("br")) {
continue;
}
String text = node.toString().trim();
System.out.println(text);
}
}
So I am just trying out the Jsoup API and have a simple question. I have a string and would like to keep it intact, except that my method should take out the elements that wrap the links. Right now I have:
public class jsTesting {
public static void main(String[] args) {
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link and after that is a second link called <a href='http://example2.com/'><b>example2</b></a></p>";
Elements select = Jsoup.parse(html).select("a");
String linkHref = select.attr("href");
System.out.println(linkHref);
}}
This returns only the first URL, unwrapped. I would like all URLs unwrapped, as well as the original string. Thanks in advance.
EDIT: SOLUTION:
Thanks a lot for the answer. I edited it only slightly to get the results I wanted. Here is the full solution that I am using:
public class jsTesting {
public static void main(String[] args) {
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link and after that is a second link called <a href='http://example2.com/'><b>example2</b></a></p>";
Document doc = Jsoup.parse(html);
// unwrap() on the selection removes every <a> in one call but keeps its children
doc.select("a").unwrap();
System.out.println(doc.text());
}
}
Thanks again
Here's the corrected code:
public class jsTesting {
public static void main(String[] args) {
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link and after that is a second link called <a href='http://example2.com/'><b>example2</b></a></p>";
Elements links = Jsoup.parse(html).select("a[href]"); // a with href;
for (Element link : links) {
//Do whatever you want here
System.out.println("Link Attr : " + link.attr("abs:href"));
System.out.println("Link Text : " + link.text());
}
}
}