I'm trying to use dom4j to parse an XHTML document. If I simply print out the document I can see the entire document, so I know it is being loaded correctly. The two divs that I'm trying to select are at the exact same level in the document:
html
  body
    div
      table
        tbody
          tr
            td
              table
                tbody
                  tr
                    td
                      div class="definition"
                      div class="example"
My code is
List<Element> list = document.selectNodes("//html/body/div/table/tbody/tr/td/table/tbody/tr/td");
but the list is empty when I do System.out.println(list);
If I only do List<Element> list = document.selectNodes("//html"); it does actually return a list with one element in it. So I'm confused about what's wrong with my XPath and why it won't find those divs.
Try declaring the XHTML namespace for the XPath, e.g. bind it to the prefix x and use //x:html/x:body/... as the XPath expression (see also this article, which is however for Groovy, not plain Java). Something like the following should do it in Java:
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import org.dom4j.xpath.DefaultXPath;

DefaultXPath xpath = new DefaultXPath("//x:html/x:body/...");
Map<String, String> namespaces = new TreeMap<String, String>();
namespaces.put("x", "http://www.w3.org/1999/xhtml");
xpath.setNamespaceURIs(namespaces);
List list = xpath.selectNodes(document); // a raw List in dom4j 1.x; List<Node> in 2.x
(untested)
What about just "//div"? Or "//html/body/div/table/tbody"? I've found long literal XPath expressions hard to debug, as it's easy for my eyes to get tricked... so I break them down until it DOES work and then build back up again.
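For instance, a quick probing loop (a sketch assuming the same dom4j document as in the question, and ignoring namespaces for the moment) shows exactly where the expression stops matching:

// print how many nodes each progressively longer path matches
String[] probes = {
    "//html",
    "//html/body/div",
    "//html/body/div/table",
    "//html/body/div/table/tbody"
};
for (String probe : probes) {
    System.out.println(probe + " -> " + document.selectNodes(probe).size());
}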
An alternative could be:
//div[@class='definition' or @class='example']
This searches for div elements anywhere in the document whose class attribute equals "definition" or "example".
I find this approach more clearly illustrates what you are trying to retrieve from the page. An added benefit is that if the structure of the page changes but the div classes stay the same, your XPath doesn't need to be updated.
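Note that if the document is in the XHTML namespace, the div element itself still needs a namespace prefix; here is a minimal dom4j sketch (untested) combining this expression with the namespace binding from the earlier answer, assuming the same document variable:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.dom4j.xpath.DefaultXPath;

// attributes are not in the XHTML namespace, so @class needs no prefix
DefaultXPath divXpath = new DefaultXPath("//x:div[@class='definition' or @class='example']");
Map<String, String> ns = new HashMap<String, String>();
ns.put("x", "http://www.w3.org/1999/xhtml");
divXpath.setNamespaceURIs(ns);
List divs = divXpath.selectNodes(document);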
You can also check that your XPath works against an HTML document using the following Firefox plugin, which is very useful.
Firefox Plugin - XPath Checker 0.4.4
Related
I'm trying to write an XPath for a randomly generated name or number (which is usually the project name), which produces XPaths like the ones below:
//*[#id="job_10"]/td[3]/a
//*[#id="job_11"]/td[3]/a
//*[#id="job_12"]/td[3]/a
10,11,12 are the project numbers and can be words too.
Any suggestions?
Try the below XPath expression:
//*[starts-with(@id, "job_")]/td[3]/a
This matches elements whose id attribute starts with job_, regardless of whether a string or a number follows.
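As an illustration, here is a minimal, self-contained sketch evaluating that expression with the standard javax.xml.xpath API; the table markup is a made-up stand-in for your page:

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class StartsWithDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<table>"
                + "<tr id='job_10'><td>a</td><td>b</td><td><a href='#'>proj 10</a></td></tr>"
                + "<tr id='job_foo'><td>a</td><td>b</td><td><a href='#'>proj foo</a></td></tr>"
                + "</table>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // matches ids with numeric or word suffixes after "job_"
        NodeList links = (NodeList) xpath.evaluate(
                "//*[starts-with(@id, 'job_')]/td[3]/a", doc, XPathConstants.NODESET);
        System.out.println(links.getLength()); // prints 2
    }
}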
Are you trying to select an HTML element? If so, you might be better off using a CSS selector instead of XPath.
If you have the HTML as a String, you can use JSoup to parse the document, then use a CSS selector to find the last row in the table.
String htmlDocument = ...
Document doc = Jsoup.parse(htmlDocument);
// select() returns Elements, so take the first (here: only) match
Element anchor = doc.select("tr[id^=job_]:last-child td:nth-child(3) a").first();
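Note that first() returns null when nothing matches, so a quick guard is worth adding (a small sketch, assuming the anchor holds the project link):

if (anchor != null) {
    System.out.println(anchor.text() + " -> " + anchor.attr("href"));
}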
Hi, I am using JSoup to parse an HTML file. After parsing, I want to check whether the file contains the <html> tag. I am using the following code to check that:
htmlDom = parser.parse("<p>My First Heading</p>clk");
Elements pe = htmlDom.select("html");
System.out.println("size "+pe.size());
The output I get is "size 1" even though there is no <html> tag present. My guess is that this is because the <html> tag is not mandatory and is implicit. The same is the case for the <head> and <body> tags. Is there any way I can check for sure whether these tags are present in the input file?
Thank you.
It does not return 1 because the tag is implicit, but because it is present in the Document object htmlDom after you have parsed the custom HTML.
That is because Jsoup tries to conform to the HTML5 parsing rules, and thus adds missing elements and tries to fix a broken document structure. I'm quite sure you would get 1 in return if you ran the following as well:
Elements pe = htmlDom.select("head");
System.out.println("size "+pe.size());
To parse the HTML without Jsoup trying to clean it or make it valid, you can instead use the included XML parser, as below, which will parse the HTML as it is.
String customHtml = "<p>My First Heading</p>clk";
Document customDoc = Jsoup.parse(customHtml, "", Parser.xmlParser());
So, as opposed to your assumption in the comments of the question, this is very much possible to do with Jsoup.
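For example, repeating your check against the XML-parsed document should now print 0, since the XML parser does not insert the implicit wrapper elements (a quick sketch reusing customDoc from above):

Elements pe = customDoc.select("html");
System.out.println("size " + pe.size()); // prints "size 0"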
I have to parse some HTML and remove the anchor tags, but I need to preserve the innerHTML of the anchor tags.
For example, if my HTML text is:
String html = "<div> <p> some text <a href=\"...\">some link text</a> </p> </div>";
Now I can parse the above HTML and select the anchor tags in Jsoup like this:
Document doc = Jsoup.parse(inputHtml);
// this gives me all the anchor elements
Elements elements = doc.select("a");
and I can remove all of them by calling
element.remove()
But that removes the complete anchor tag from the opening bracket to the closing bracket, and the inner HTML is lost. How can I preserve the inner HTML while removing only the opening and closing tags?
Also, please note: I know there are methods to get outerHTML() and innerHTML() from the element, but those methods only give me ways to retrieve the text; the remove() method removes the complete HTML of the tag. Is there any way I can remove only the outer tags and preserve the inner HTML?
Thanks a lot in advance and appreciate your help.
--Rajesh
Use unwrap(); it preserves the inner HTML:
doc.select("a").unwrap();
check the api-docs for more info:
http://jsoup.org/apidocs/org/jsoup/select/Elements.html#unwrap%28%29
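For example (a quick sketch using the HTML string from the question above):

Document doc = Jsoup.parse(html);
doc.select("a").unwrap();
// the <a> tags are removed, but "some link text" is kept in place
System.out.println(doc.body().html());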
How about extracting the inner HTML first, adding it to the DOM and then removing your tags? This code is untested, but should do the trick:
Edit:
I updated the code to use replaceWith(), making the code more intuitive and probably more efficient; thanks to A.J.'s hint in the comments.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

Document doc = Jsoup.parse(inputHtml);
Elements links = doc.select("a");
String baseUri = links.get(0).baseUri();
for (Element link : links) {
    // a TextNode escapes markup, so this is only safe when the links
    // contain plain text (see the note below)
    Node linkText = new TextNode(link.html(), baseUri);
    // optionally wrap it in a tag instead:
    // Element linkText = doc.createElement("span");
    // linkText.html(link.html());
    link.replaceWith(linkText);
}
Instead of using a text node, you can wrap the inner html in anything you want; you might even have to, if there's not just text inside your links.
I'm trying to figure out a way to get all hyperlinks in a webpage, except if they are in an anchor tag (<a>).
For this I'm using the Jericho parser.
My initial approach was to take the difference between
List<Element> elementList = source.getAllElements(); and
source.getAllElements(HTMLElementName.A), but other elements might also contain an anchor link within them, so I don't think that's the right approach.
I recommend Jsoup for HTML processing.
Here's an example of how you can get all links (i.e. a tags with an href attribute):

Document doc = Jsoup.connect("http:// - link here -").get(); // connect to the website and parse its HTML
Elements links = doc.select("a[href]"); // select all 'a' tags with an 'href' attribute

for (Element element : links) { // iterate over all links
    // process each element, e.g. read the absolute URL:
    // String url = element.attr("abs:href");
}
Documentation:
Selector API (DOM API is available too)
Cookbook (Examples)
list links (Example)
JavaDoc
By the way, can you explain this a bit more?
"except if they are in an anchor tag"
I have the below string of HTML and I want to build a DOM tree and get the name/value pairs. How can I do this using an HTML parser, an XML parser, or a regexp? Any code snippet would be useful. Thanks.
<$$TagStarts>
<==0>Name0</==0><##0>Value0</##0>
<==1>Name1</==1><##1>Value1</##1>
<==2>Name2</==2><##2>Value2</##2>
<==3>Name3</==3><##3>Value3</##3>
<==4>Name4</==4><##4>Value4</##4>
<==5>Name5</==5><##5>Value5</##5>
</$$TagStarts>
Assuming the tag names are just a sample and you will have some meaningful tag names...
Try using any of the following HTML parsers...
http://home.ccil.org/~cowan/XML/tagsoup/
http://nekohtml.sourceforge.net/
http://jtidy.sourceforge.net/
They will give you a W3C-compliant document object. After that it is just a matter of getElementsByTagName or getElementById, or using XPath or XQuery, to get the elements from the DOM.
Otherwise you can use the following... They have their own document object implementation...
http://htmlcleaner.sourceforge.net/ [It also has some basic XPath support]
http://jsoup.org/ [It has jquery like query API]
Added: check this:
http://jsoup.org/cookbook/extracting-data/selector-syntax
I would recommend either Jsoup or NekoHTML.
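For illustration, here is a minimal Jsoup sketch, assuming the tags are renamed to something meaningful; the entry/name/value tag names here are hypothetical, not from your input:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class NameValueDemo {
    public static void main(String[] args) {
        // hypothetical renaming of the sample markup from the question
        String xml = "<entries>"
                + "<entry><name>Name0</name><value>Value0</value></entry>"
                + "<entry><name>Name1</name><value>Value1</value></entry>"
                + "</entries>";
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        for (Element entry : doc.select("entry")) {
            System.out.println(entry.select("name").text()
                    + " = " + entry.select("value").text());
        }
    }
}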