How to find specific elements within a larger element in HTML Java - java

Document doc = Jsoup.parse(url1, 3 * 1000);
String subHead = "A h2 heading"; // at this point I have already parsed the HTML, found all the h2 headings and analysed them, but now I want to go further and analyse all h4 headings within the h2 section
print("Printing h4 titles of: " + subHead);
Elements sibHead; // stores all elements between this h2 title and the next
String bodySelect = "h2";
Elements kpageE = doc.select(bodySelect);
for (Element e : kpageE) {
    String estring = e.text();
    print(estring + "--------------------------------------------");
    if (estring.contentEquals(subHead)) {
        sibHead = e.nextElementSiblings(); // this prints all elements after the h2 title, but I want only the h4 titles
        for (Element ei : sibHead) {
            String eistr = ei.text();
            print(eistr);
        }
    }
}
I have already parsed the HTML and have a list of all H2 elements; now I want the specific elements between one H2 element and the next, more specifically all the H4 elements.

With Jsoup you can use the .getElementsByTag method of the Document class, which retrieves all the elements with a given tag name.
Here is an example of use:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class App {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://inscription.devlab.umontp.fr/").get();
            Elements h4elements = doc.getElementsByTag("h4");
            for (Element h4 : h4elements) {
                System.out.println(h4.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
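Note that getElementsByTag collects every h4 in the whole document; to keep only the h4 headings that sit under one particular h2, as the question asks, you can walk the following siblings until the next h2 appears. A minimal sketch, with made-up heading text and an inline HTML string for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SectionH4s {
    public static void main(String[] args) {
        String html = "<h2>A h2 heading</h2>"
                + "<h4>First sub-topic</h4><p>text</p>"
                + "<h4>Second sub-topic</h4>"
                + "<h2>Next section</h2><h4>Not wanted</h4>";
        Document doc = Jsoup.parse(html);
        for (Element h2 : doc.select("h2")) {
            if (!h2.text().equals("A h2 heading")) continue;
            // walk siblings until the next h2 ends the section
            Element sib = h2.nextElementSibling();
            while (sib != null && !sib.tagName().equals("h2")) {
                if (sib.tagName().equals("h4")) {
                    System.out.println(sib.text());
                }
                sib = sib.nextElementSibling();
            }
        }
    }
}
```

The same walk works on a fetched page; only the heading text and HTML above are invented for the example.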


XML Parsing, Dynamic Structure, Content

Want to Achieve:
Get an unknown XML file's elements (the element names, and how many elements there are in the XML file).
Then get all the attributes with their names and values, to use later (e.g. for comparison with another XML file).
I have researched this a lot already, and many more sources besides.
Does anyone have an idea for this?
I don't want to pre-define more than 500 tables like in the previous code snippet; somehow I should be able to get the number of elements and the element names themselves dynamically.
EDIT!
Example1
<Root Attri1="" Attri2="">
    <element1 EAttri1="" EAttri2=""/>
    <Element2 EAttri1="" EAttri2="">
        <nestedelement3 NEAttri1="" NEAttri2=""/>
    </Element2>
</Root>
Example2
<Root Attri1="" Attri2="" Attr="" At="">
    <element1 EAttri1="" EAttri2="">
        <nestedElement2 EAttri1="" EAttri2="">
            <nestedelement3 NEAttri1="" NEAttri2=""/>
        </nestedElement2>
    </element1>
</Root>
Program Snippet:
String Example1[] = {"element1", "Element2", "nestedelement3"};
String Example2[] = {"element1", "nestedElement2", "nestedelement3"};
for (int i = 0; i < Example1.length; i++) {
    NodeList elements = oldDOC.getElementsByTagName(Example1[i]);
    for (int j = 0; j < elements.getLength(); j++) {
        Node nodeinfo = elements.item(j);
        for (int l = 0; l < nodeinfo.getAttributes().getLength(); l++) {
            .....
        }
    }
}
Output:
The expected result is to get all the elements and all the attributes out of the XML file without pre-defining anything, e.g.:
Elements: element1 Element2 nestedelement3
Attributes: Attri1 Attri2 EAttri1 EAttri2 EAttri1 EAttri2 NEAttri1 NEAttri2
The right tool for this job is XPath.
It allows you to collect all or some elements and attributes based on various criteria, and it is the closest you will get to a "universal" XML parser.
Here is the solution I came up with. It first finds all element names in the given XML document, then counts each element's occurrences and collects the results into a map; the same is done for attributes.
I added inline comments, and the method/variable names should be self-explanatory.
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.function.*;
import java.util.stream.*;
import org.w3c.dom.*;
import javax.xml.parsers.*;
import javax.xml.xpath.*;
public class TestXpath {
    public static void main(String[] args) {
        XPath xPath = XPathFactory.newInstance().newXPath();
        try (InputStream is = Files.newInputStream(Paths.get("C://temp/test.xml"))) {
            // parse the file into an XML document
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document xmlDocument = builder.parse(is);
            // find all element names in the document
            Set<String> allElementNames = findNames(xmlDocument, xPath.compile("//*[name()]"));
            // for each name, count occurrences, and collect to a map
            Map<String, Integer> elementsAndOccurrences = allElementNames.stream()
                    .collect(Collectors.toMap(Function.identity(), name -> countElementOccurrences(xmlDocument, name)));
            System.out.println(elementsAndOccurrences);
            // find all attribute names in the document
            Set<String> allAttributeNames = findNames(xmlDocument, xPath.compile("//@*"));
            // for each name, count occurrences, and collect to a map
            Map<String, Integer> attributesAndOccurrences = allAttributeNames.stream()
                    .collect(Collectors.toMap(Function.identity(), name -> countAttributeOccurrences(xmlDocument, name)));
            System.out.println(attributesAndOccurrences);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static Set<String> findNames(Document xmlDoc, XPathExpression xpathExpr) {
        try {
            NodeList nodeList = (NodeList) xpathExpr.evaluate(xmlDoc, XPathConstants.NODESET);
            // convert the NodeList to a set of node names
            return IntStream.range(0, nodeList.getLength())
                    .mapToObj(i -> nodeList.item(i).getNodeName())
                    .collect(Collectors.toSet());
        } catch (XPathExpressionException e) {
            e.printStackTrace();
        }
        return new HashSet<>();
    }

    public static int countElementOccurrences(Document xmlDoc, String elementName) {
        return countOccurrences(xmlDoc, "count(//*[name()='" + elementName + "'])");
    }

    public static int countAttributeOccurrences(Document xmlDoc, String attributeName) {
        return countOccurrences(xmlDoc, "count(//@*[name()='" + attributeName + "'])");
    }

    public static int countOccurrences(Document xmlDoc, String xpathExpr) {
        XPath xPath = XPathFactory.newInstance().newXPath();
        try {
            Number count = (Number) xPath.compile(xpathExpr).evaluate(xmlDoc, XPathConstants.NUMBER);
            return count.intValue();
        } catch (XPathExpressionException e) {
            e.printStackTrace();
        }
        return 0;
    }
}
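As a quick check of the counting idea, the same XPath count can be evaluated against the Example1 document inline. This sketch parses the XML from a string rather than a file (the class name is just for illustration):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class CountDemo {
    public static void main(String[] args) throws Exception {
        // the Example1 document from the question, as an inline string
        String xml = "<Root Attri1=\"\" Attri2=\"\">"
                + "<element1 EAttri1=\"\" EAttri2=\"\"/>"
                + "<Element2 EAttri1=\"\" EAttri2=\"\">"
                + "<nestedelement3 NEAttri1=\"\" NEAttri2=\"\"/>"
                + "</Element2></Root>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        // EAttri1 occurs on both element1 and Element2
        Number n = (Number) xPath.compile("count(//@*[name()='EAttri1'])")
                .evaluate(doc, XPathConstants.NUMBER);
        System.out.println(n.intValue()); // prints 2
    }
}
```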

Get all <p> texts after <div> and between <h2> by using Jsoup

<h2><span class="mw-headline" id="The_battle">The battle</span></h2>
<div class="thumb tright"></div>
<p>text I want</p>
<p>text I want</p>
<p>text I want</p>
<p>text I want</p>
<h2>Second Title I want to stop collecting p tags after</h2>
I am learning Jsoup by trying to scrape all the p tags, arranged by title, from a Wikipedia page. I can scrape all the p tags between h2 elements with the help of this question:
extract unidentified html content from between two tags, using jsoup? regex?
by using
Elements elements = docx.select("span.mw-headline, h2 ~ p");
but I can't scrape them when there is a <div> between them. Here is the Wikipedia page I am working on:
https://simple.wikipedia.org/wiki/Battle_of_Hastings
How can I grab all the p tags where they are between two specific h2 tags?
Preferably ordered by id.
Try this option: Elements elements = doc.select("span.mw-headline, h2 ~ div, h2 ~ p");
sample code :
package jsoupex;
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
/**
* Example program to list links from a URL.
*/
public class stackoverflw {
    public static void main(String[] args) throws IOException {
        //Validate.isTrue(args.length == 1, "usage: supply url to fetch");
        //String url = "http://localhost/stov_wiki.html";
        String url = "https://simple.wikipedia.org/wiki/Battle_of_Hastings";
        //args[0];
        System.out.println("Fetching " + url + "...");
        Document doc = Jsoup.connect(url).get();
        Elements elements = doc.select("span.mw-headline, h2 ~ div, h2 ~ p");
        for (Element elem : elements) {
            if (elem.hasClass("mw-headline")) {
                System.out.println("************************");
            }
            System.out.println(elem.text());
            if (elem.hasClass("mw-headline")) {
                System.out.println("************************");
            } else {
                System.out.println("");
            }
        }
    }
}
// needs: org.jsoup.parser.Parser, org.jsoup.nodes.TextNode, java.util.List, java.util.ArrayList
public static void main(String[] args) {
    String entity =
            "<h2><span class=\"mw-headline\" id=\"The_battle\">The battle</span></h2>" +
            "<div class=\"thumb tright\"></div>" +
            "<p>text I want</p>" +
            "<p>text I want</p>" +
            "<p>text I want</p>" +
            "<p>text I want</p>" +
            "<h2>Second Title I want to stop collecting p tags after</h2>";
    Document element = org.jsoup.Jsoup.parse(entity, "", Parser.xmlParser());
    element.outputSettings().prettyPrint(false);
    element.outputSettings().outline(false);
    List<TextNode> text = getAllTextNodes(element);
}

private static List<TextNode> getAllTextNodes(Element newElementValue) {
    List<TextNode> textNodes = new ArrayList<>();
    Elements elements = newElementValue.getAllElements();
    for (Element e : elements) {
        for (TextNode t : e.textNodes()) {
            textNodes.add(t);
        }
    }
    return textNodes;
}
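Another way to attack the original problem is a plain sibling walk that stops at the next h2 and dives into any intervening div for its p tags. A sketch along those lines, with the HTML string abbreviated from the question (class name and caption text are made up):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SectionParagraphs {
    public static void main(String[] args) {
        String html = "<h2><span class=\"mw-headline\" id=\"The_battle\">The battle</span></h2>"
                + "<div class=\"thumb tright\"><p>caption text</p></div>"
                + "<p>text I want</p>"
                + "<p>text I want</p>"
                + "<h2>Second Title</h2><p>not wanted</p>";
        Document doc = Jsoup.parse(html);
        // find the h2 that encloses the target headline span
        Element start = doc.selectFirst("span#The_battle").parent();
        for (Element sib = start.nextElementSibling();
             sib != null && !sib.tagName().equals("h2");
             sib = sib.nextElementSibling()) {
            if (sib.tagName().equals("p")) {
                System.out.println(sib.text());
            } else {
                // a wrapper such as <div class="thumb"> may hold p tags of its own
                for (Element p : sib.select("p")) {
                    System.out.println(p.text());
                }
            }
        }
    }
}
```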

Java - HTML code: extract part of the tag

I have to extract some integers from a tag of a html code.
For example, if I have:
<tag blabla="title"><a href="/test/tt123"> TEST 1 </tag>
I did that by removing all the chars and leaving only the digits, and it worked until the title name contained another digit, so I got "1231".
str.replaceAll("[^\\d.]", "");
How can I extract only the "123" integer? Thanks for your help!
Jsoup is a good API for working with HTML. Using it, you could do:
String html = "<tag blabla=\"title\"><a href=\"/test/tt123\"> TEST 1 <tag>";
Document doc = Jsoup.parseBodyFragment(html);
String value = doc.select("a").get(0).attr("href").replaceAll("[^\\d.]", "");
System.out.println(value);
You could do this (a method that removes all duplicate digits in any number):
int[] foo = new int[str.length()];
for (int i = 0; i < str.length(); i++) {
    foo[i] = Character.getNumericValue(str.charAt(i));
}
Set<Integer> set = new HashSet<Integer>();
for (int i = 0; i < foo.length; i++) {
    set.add(foo[i]);
}
Now you have a set where all duplicate digits from the string are removed. I only saw your last comment now, so this answer might not be very useful to you. What you could do is take the first three digits of the foo array as well, which will give you 123.
First use XPath to parse out only the href value, then apply your replaceAll to achieve what you desired.
And you don't have to download any additional frameworks or libraries for this to work.
Here's a quick demo class on how this works:
package com.example.test;
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;
public class Test {
    public static void main(String[] args) {
        String xml = "<tag blabla=\"title\"><a href=\"/test/tt123\"> TEST 1 </a></tag>";
        XPath xPath = XPathFactory.newInstance().newXPath();
        InputSource source = new InputSource(new StringReader(xml));
        String hrefValue = null;
        try {
            hrefValue = (String) xPath.evaluate("//@href", source, XPathConstants.STRING);
        } catch (XPathExpressionException e) {
            e.printStackTrace();
        }
        String numbers = hrefValue.replaceAll("[^\\d.]", "");
        System.out.println(numbers);
    }
}
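If you'd rather not involve an XML parser at all, a plain regex over the raw string is another option. The sketch below assumes the digits you want sit inside the href attribute value, as in the question's example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefDigits {
    public static void main(String[] args) {
        String html = "<tag blabla=\"title\"><a href=\"/test/tt123\"> TEST 1 </tag>";
        // capture the first run of digits inside the href attribute only,
        // so digits elsewhere in the title text are ignored
        Matcher m = Pattern.compile("href=\"[^\"]*?(\\d+)").matcher(html);
        if (m.find()) {
            System.out.println(m.group(1)); // prints 123
        }
    }
}
```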

get all the children for a given xml in java

I am basically following the example here
http://www.mkyong.com/java/how-to-read-xml-file-in-java-jdom-example/
So rather than doing something like
node.getChildText("firstname")
right?
This works fine, but is there a way to get all the "keys", so that I can then query them to get the values? Just like we do when parsing JSON:
JSONObject json = (JSONObject) parser.parse(value);
for (Object key : json.keySet()) {
Object val = json.get(key);
}
rather than hardcoding keys and values?
Thanks
Code for reference:
package org.random_scripts;
import java.io.File;
import java.io.IOException;
import java.util.List;
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.JDOMException;
import org.jdom2.input.SAXBuilder;
public class XMLReader {
    public static void main(String[] args) {
        SAXBuilder builder = new SAXBuilder();
        File xmlFile = new File("data.xml");
        try {
            Document document = (Document) builder.build(xmlFile);
            Element rootNode = document.getRootElement();
            List list = rootNode.getChildren("staff");
            List children = rootNode.getChildren();
            System.out.println(children);
            for (int i = 0; i < list.size(); i++) {
                Element node = (Element) list.get(i);
                System.out.println("First Name : " + node.getChildText("firstname"));
                System.out.println("Last Name : " + node.getChildText("lastname"));
                System.out.println("Nick Name : " + node.getChildText("nickname"));
                System.out.println("Salary : " + node.getChildText("salary"));
            }
        } catch (IOException io) {
            System.out.println(io.getMessage());
        } catch (JDOMException jdomex) {
            System.out.println(jdomex.getMessage());
        }
    }
}
Well, if you wanted to write out all of the children of the node, you could do something like this:
List children = rootNode.getChildren();
for (int i = 0; i < children.size(); i++) {
    Element node = (Element) children.get(i);
    List dataNodes = node.getChildren();
    for (int j = 0; j < dataNodes.size(); ++j) {
        Element dataNode = (Element) dataNodes.get(j);
        System.out.println(dataNode.getName() + " : " + dataNode.getText());
    }
}
This would let you write out all of the children without knowing the names, with the only downside being that you wouldn't have "pretty" names for the fields (i.e. "First Name" instead of "firstname"). Of course, you'd have the same limitation in JSON - I don't know of an easy way to get pretty names for the fields unless your program has some knowledge about what the children are, which is the thing you seem to be trying to avoid.
The above code only provides the list of first-level children under the tag.
For example:
<parent>
    <child1>
        <childinternal></childinternal>
    </child1>
    <child2></child2>
</parent>
The above code only prints child1 and child2; if you want to print even the internal nodes in depth, you have to make a recursive call.
To find out whether a child has more nodes in it, use the JDOM API child.getContentSize(); if it is greater than 1, it means the child has more nodes.
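To make the recursion concrete, here is a sketch using the JDK's built-in DOM API (not JDOM, so the method names differ slightly). It prints every element and attribute at any depth, using the <parent> example above with a made-up attribute added:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

public class RecursiveWalk {
    static void walk(Node node, int depth) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            // print the element name, indented by depth
            System.out.println("  ".repeat(depth) + node.getNodeName());
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node a = attrs.item(i);
                System.out.println("  ".repeat(depth) + "@" + a.getNodeName() + "=" + a.getNodeValue());
            }
        }
        // recurse into every child node
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, depth + 1);
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<parent><child1 a=\"1\"><childinternal/></child1><child2/></parent>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        walk(doc.getDocumentElement(), 0);
    }
}
```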

How to get a table from an html page using JAVA

I am working on a project where I am trying to fetch financial statements from the internet and use them in a Java application to automatically create ratios and charts.
The site I am using requires a login and password to get to the tables.
The tag is TBODY, but there are two other TBODYs in the HTML.
How can I use Java to print my table to a txt file that I can then use in my application?
What would be the best way to go about this, and what should I read up on?
If this were my project, I'd look into using an HTML parser, something like jsoup (although others are available). The jsoup site has a tutorial, and after playing with it a while, you'll likely find it pretty easy to use.
For example, for an HTML table like the one on the page referenced below, jsoup could parse it like so:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class TableEg {
    public static void main(String[] args) {
        String html = "http://publib.boulder.ibm.com/infocenter/iadthelp/v7r1/topic/" +
                "com.ibm.etools.iseries.toolbox.doc/htmtblex.htm";
        try {
            Document doc = Jsoup.connect(html).get();
            Elements tableElements = doc.select("table");
            Elements tableHeaderEles = tableElements.select("thead tr th");
            System.out.println("headers");
            for (int i = 0; i < tableHeaderEles.size(); i++) {
                System.out.println(tableHeaderEles.get(i).text());
            }
            System.out.println();
            Elements tableRowElements = tableElements.select(":not(thead) tr");
            for (int i = 0; i < tableRowElements.size(); i++) {
                Element row = tableRowElements.get(i);
                System.out.println("row");
                Elements rowItems = row.select("td");
                for (int j = 0; j < rowItems.size(); j++) {
                    System.out.println(rowItems.get(j).text());
                }
                System.out.println();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Resulting in the following output:
headers
ACCOUNT
NAME
BALANCE
row
0000001
Customer1
100.00
row
0000002
Customer2
200.00
row
0000003
Customer3
550.00
