How to get node contents from JDOM - java

I'm writing an application in java using import org.jdom.*;
My XML is valid, but sometimes it contains HTML tags. For example, something like this:
<program-title>Anatomy & Physiology</program-title>
<overview>
<content>
For more info click here
<p>Learn more about the human body. Choose from a variety of Physiology (A&P) designed for complementary therapies.&#160; Online studies options are available.</p>
</content>
</overview>
<key-information>
<category>Health & Human Services</category>
So my problem is with the < p > tags inside the overview.content node.
I was hoping that this code would work :
Element overview = sds.getChild("overview");
Element content = overview.getChild("content");
System.out.println(content.getText());
but it returns blank.
How do I return all the text ( nested tags and all ) from the overview.content node ?
Thanks

content.getText() returns only the element's immediate text, so it is only useful for leaf elements that contain nothing but text.
The trick is to use org.jdom.output.XMLOutputter (here with the compact format):
public static void main(String[] args) throws Exception {
SAXBuilder builder = new SAXBuilder();
String xmlFileName = "a.xml";
Document doc = builder.build(xmlFileName);
Element root = doc.getRootElement();
Element overview = root.getChild("overview");
Element content = overview.getChild("content");
XMLOutputter outp = new XMLOutputter();
outp.setFormat(Format.getCompactFormat());
//outp.setFormat(Format.getRawFormat());
//outp.setFormat(Format.getPrettyFormat());
//outp.getFormat().setTextMode(Format.TextMode.PRESERVE);
StringWriter sw = new StringWriter();
outp.output(content.getContent(), sw);
StringBuffer sb = sw.getBuffer();
System.out.println(sb.toString());
}
Output
For more info clickhere<p>Learn more about the human body. Choose from a variety of Physiology (A&P) designed for complementary therapies.&#160; Online studies options are available.</p>
Do explore the other formatting options and adapt the code above to your needs. The Format Javadoc describes them:
"Class to encapsulate XMLOutputter format options. Typical users can use the standard format configurations obtained by getRawFormat() (no whitespace changes), getPrettyFormat() (whitespace beautification), and getCompactFormat() (whitespace normalization). "

You could try using the getValue() method for the closest approximation, but what this does is concatenate all text within the element and its descendants. This won't give you the <p> tag in any form. If that tag is in your XML like you've shown, it has become part of the XML markup. It would need to be escaped as &lt;p&gt; or embedded in a CDATA section to be treated as text.
Alternatively, if you know all elements that either may or may not appear in your XML, you could apply an XSLT transformation that turns stuff which isn't intended as markup into plain text.
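To make the difference concrete, here is a minimal sketch. It assumes the JDK's built-in org.w3c.dom parser instead of JDOM so that it runs without extra dependencies; getTextContent() plays roughly the role that getValue() plays in JDOM, and the class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class TextVersusMarkup {
    // Parses the XML and returns the concatenated text of the root element
    // and all of its descendants (markup itself is not included).
    static String rootText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // A real <p> element is markup, so only its text survives.
        System.out.println(rootText("<content>Hi <p>there</p></content>"));
        // Escaped, the tag is ordinary character data.
        System.out.println(rootText("<content>Hi &lt;p&gt;there&lt;/p&gt;</content>"));
        // Wrapped in CDATA, the tag is ordinary character data as well.
        System.out.println(rootText("<content>Hi <![CDATA[<p>there</p>]]></content>"));
    }
}
```

The first call yields the text without the tag; the other two return the tag as part of the text, which is what the asker seems to be after.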

Well, maybe that's what you need:
import java.io.StringReader;
import java.util.List;
import org.custommonkey.xmlunit.XMLTestCase;
import org.custommonkey.xmlunit.XMLUnit;
import org.jdom.Content;
import org.jdom.Document;
import org.jdom.input.SAXBuilder;
import org.jdom.output.XMLOutputter;
import org.testng.annotations.Test;
import org.xml.sax.InputSource;
public class HowToGetNodeContentsJDOM extends XMLTestCase
{
private static final String XML = "<root>\n" +
" <program-title>Anatomy & Physiology</program-title>\n" +
" <overview>\n" +
" <content>\n" +
" For more info click here\n" +
" <p>Learn more about the human body. Choose from a variety of Physiology (A&P) designed for complementary therapies.&#160; Online studies options are available.</p>\n" +
" </content>\n" +
" </overview>\n" +
" <key-information>\n" +
" <category>Health & Human Services</category>\n" +
" </key-information>\n" +
"</root>";
private static final String EXPECTED = "For more info click here\n" +
"<p>Learn more about the human body. Choose from a variety of Physiology (A&P) designed for complementary therapies.&#160; Online studies options are available.</p>";
@Test
public void test() throws Exception
{
XMLUnit.setIgnoreWhitespace(true);
Document document = new SAXBuilder().build(new InputSource(new StringReader(XML)));
List<Content> content = document.getRootElement().getChild("overview").getChild("content").getContent();
String out = new XMLOutputter().outputString(content);
assertXMLEqual("<root>" + EXPECTED + "</root>", "<root>" + out + "</root>");
}
}
Output:
PASSED: test on instance null(HowToGetNodeContentsJDOM)
===============================================
Default test
Tests run: 1, Failures: 0, Skips: 0
===============================================
I am using JDom with generics: http://www.junlu.com/list/25/883674.html
Edit: Actually that's not that much different from Prashant Bhate's answer. Maybe you need to tell us what you are missing...

If you're also generating the XML file you should be able to encapsulate your html data in <![CDATA[]]> so that it isn't parsed by the XML parser.

The problem is that the <content> node doesn't have a text child; it has a <p> child that happens to contain text.
Try this:
Element overview = sds.getChild("overview");
Element content = overview.getChild("content");
Element p = content.getChild("p");
System.out.println(p.getText());
If you want all the immediate child nodes, call p.getChildren(). If you want to get ALL the child nodes, you'll have to call it recursively.
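A rough sketch of that recursion, written against the JDK's org.w3c.dom API rather than JDOM so it runs without extra libraries (with JDOM the same walk would go through getChildren()). The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DescendantWalker {
    // Parses an XML string and returns its root element.
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    // Returns the names of all descendant elements in document order.
    static List<String> descendantNames(Element parent) {
        List<String> names = new ArrayList<>();
        collect(parent, names);
        return names;
    }

    // Visits each child element, records it, then recurses into it.
    private static void collect(Element parent, List<String> names) {
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(child.getNodeName());
                collect((Element) child, names);
            }
        }
    }
}
```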

Not particularly pretty but works fine (using JDOM API):
public static String getRawText(Element element) {
if (element.getContent().size() == 0) {
return "";
}
StringBuffer text = new StringBuffer();
for (int i = 0; i < element.getContent().size(); i++) {
final Object obj = element.getContent().get(i);
if (obj instanceof Text) {
text.append( ((Text) obj).getText() );
} else if (obj instanceof Element) {
Element e = (Element) obj;
text.append( "<" ).append( e.getName() );
// dump all attributes
for (Attribute attribute : (List<Attribute>)e.getAttributes()) {
text.append(" ").append(attribute.getName()).append("=\"").append(attribute.getValue()).append("\"");
}
text.append(">");
text.append( getRawText( e )).append("</").append(e.getName()).append(">");
}
}
return text.toString();
}
Prashant Bhate's solution is nicer though!

If you want to output the content of some JDOM node, just use
System.out.println(new XMLOutputter().outputString(node));

Related

How to get the values of child nodes in JDOM

I am trying to get a value by parsing an XML document using the JDOM library.
I want to get the values of the driverJar tags, which are child nodes based on the driverJars tag, but I can't get the values.
<connection>
<driverJars>
<driverJar>ojdbc11.jar</driverJar>
<driverJar>orai18n.jar</driverJar>
<driverJar>test.jar</driverJar>
</driverJars>
</connection>
I tried:
(The document is already loaded at this point.)
if (element.getChild(DRIVER_JARS) != null) {
Element driverJarsElement = element.getChild(DRIVER_JARS);
List<Element> driverJarElementList = driverJarsElement.getChildren(DRIVER_JAR);
for (int i = 0; i < driverJarElementList.size(); i++) {
Element driverJarElement = driverJarElementList.get(i);
System.out.println(driverJarElement.getText()); // [Element: <driverJar/>]
}
}
For a single child I can get the value, but when I loop over the list of children I cannot get each child's value by index.
The comment in the code above shows what System.out.println prints.
How can I get the value?
What I want to get from the xml above is the String values of ojdbc11.jar, orai18n.jar, and test.jar.
Full code example
<connection>
<productId>oracle_10g</productId>
<productName>Oracle 9i ~ 21c</productName>
<driverJars>
<driverJar>ojdbc8.jar</driverJar>
<driverJar>orai18n.jar</driverJar>
</driverJars>
</connection>
String productId = element.getChildTextTrim(PRODUCT_ID); // oracle_10g
String productName = element.getChildTextTrim(PRODUCT_NAME); // Oracle 9i ~ 21c
Element driverJarsElement = element.getChild(DRIVER_JARS);
List<Element> driverJarElementList = driverJarsElement.getChildren(DRIVER_JAR);
if (element.getChild(DRIVER_JARS) != null) {
for (int i = 0; i < driverJarElementList.size(); i++) {
description.setDriverJars(new ArrayList<String>(Arrays.asList(driverJarElementList.get(i).toString())));
}
}
(The reason I wrote that in setDriverJars is because it is a List.)
The code above does the following:
(1) After loading the document, insert values into the fields declared in the object description.
(2) And make a copy of the object.
(3) Analyze the element and reconstruct the description using the copy.
(The method used to reconstruct the description has a different logic from the method in (1).)
In (1), I want to get values from xml, but I can't get values for multiple child nodes.
While the code in your question is not a minimal, reproducible example, the code below is essentially the same. One difference is that in the below code, I first get the root element from the DOM that is created from the XML file.
import java.io.File;
import java.io.IOException;
import java.util.List;
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.JDOMException;
import org.jdom2.input.SAXBuilder;
public class JdomTest {
private static final String DRIVER_JARS = "driverJars";
private static final String DRIVER_JAR = "driverJar";
public static void main(String[] args) {
File xmlFile = new File("connects.xml");
SAXBuilder saxBuilder = new SAXBuilder();
try {
Document doc = saxBuilder.build(xmlFile);
Element root = doc.getRootElement();
Element driverJarsElement = root.getChild(DRIVER_JARS);
List<Element> driverJarElementList = driverJarsElement.getChildren(DRIVER_JAR);
for (int i = 0; i < driverJarElementList.size(); i++) {
Element driverJarElement = driverJarElementList.get(i);
System.out.println(driverJarElement.getText());
}
}
catch (JDOMException | IOException x) {
x.printStackTrace();
}
}
}
Here are the contents of file connects.xml
<connection>
<productId>oracle_10g</productId>
<productName>Oracle 9i ~ 21c</productName>
<driverJars>
<driverJar>ojdbc11.jar</driverJar>
<driverJar>orai18n.jar</driverJar>
</driverJars>
</connection>
And here is the output I get when I run the above code:
ojdbc11.jar
orai18n.jar
My environment is JDK 17.0.4 on Windows 10 (64 bit) and JDOM 2.0.6
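One thing worth noting about the "full code example" in the question: it calls toString() on the element instead of getText(), which appears to be where the [Element: <driverJar/>] output comes from, and it replaces the list on every loop iteration. A minimal sketch of collecting all values first and setting them once could look like this. It uses the JDK's DOM parser instead of JDOM so it runs without dependencies, and the class and method names are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DriverJarReader {
    // Returns the trimmed text of every <driverJar> element in document order.
    static List<String> readDriverJars(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = doc.getElementsByTagName("driverJar");
        List<String> jars = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // Read the element's text, not its toString() representation.
            jars.add(nodes.item(i).getTextContent().trim());
        }
        return jars;
    }
}
```

With the list built this way, a single description.setDriverJars(jars) call after the loop would keep all entries rather than only the last one.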

stax xml confusion with getname function

I have an XML file like this:
<comment type="PTM">
<text evidence="19">Sumoylated following its interaction with PIAS1 and UBE2I.</text>
</comment>
<comment type="PTM">
<text evidence="17">Ubiquitinated, leading to proteasomal degradation.</text>
</comment>
<comment type="disease">
<text>A chromosomal aberration involving ZMYND11 is a cause of acute poorly differentiated myeloid leukemia. Translocation (10;17)(p15;q21) with MBTD1.</text>
</comment>
<comment type="disease" evidence="23">
<disease id="DI-04257">
<name>Mental retardation, autosomal dominant 30</name>
<acronym>MRD30</acronym>
<description>A disorder characterized by significantly below average general intellectual functioning associated with impairments in adaptive behavior and manifested during the developmental period. MRD30 patients manifest mild intellectual disability and subtle facial dysmorphisms, including hypertelorism, ptosis, and a wide mouth.</description>
<dbReference type="MIM" id="616083"/>
</disease>
<text>The disease is caused by mutations affecting the gene represented in this entry.</text>
</comment>
<comment type="similarity">
<text evidence="8">Contains 1 bromo domain.</text>
</comment>
<comment type="similarity">
<text evidence="9">Contains 1 MYND-type zinc finger.</text>
</comment>
I use stax to extract the disease information. This is part of my code:
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLEventReader eventReader = factory.createXMLEventReader( new FileReader(p));
while(eventReader.hasNext()){
XMLEvent event = eventReader.nextEvent();
switch(event.getEventType()){
case XMLStreamConstants.START_ELEMENT:
StartElement startElement = event.asStartElement();
String qName = startElement.getName().getLocalPart();
if (qName.equalsIgnoreCase("comment")) {
System.out.println("Start Element : comment");
Iterator<Attribute> attributes = startElement.getAttributes();
Attribute a = attributes.next();
System.out.println("ATRIBUTES " + a.getName());
type = a.getValue();
System.out.println("Roll No : " + type);
} else if(qName.equalsIgnoreCase("text") && type.equals("disease")){ text = true; }
break;
case XMLStreamConstants.CHARACTERS:
Characters characters = event.asCharacters();
if(text){ res = res + " " + characters.getData();
//System.out.println("TEXT: " + res);
text = false;
}
break;
case XMLStreamConstants.END_ELEMENT:
EndElement endElement = event.asEndElement();
if(endElement.getName().getLocalPart().equalsIgnoreCase("comment")){
//System.out.println("End Element : comment");
//System.out.println();
}
break;
For this type of line:
<comment type="disease">
I can extract the info correctly, but when I try to find comment type "disease" in this line:
<comment type="disease" evidence="23">
it gives me type=evidence and not type=disease as it should be. Therefore it doesn't save anything from this kind of line.
First of all, can we please get in the habit of using useful variable names? You have the following variables with their types: a (Attribute), text (boolean), qName (String)... These variables leave me scratching my head and wondering what they are:
a - Not a useful name; it should really be something like typeAttr, noting that it holds the type="" attribute.
text - It's a boolean?! Maybe collectText would be more appropriate, since it signals that the next text event's value should be collected.
qName - It's a String holding the local part of a QName; if it's not a QName, don't name it as one.
But that's enough ranting, you get the idea. Your problem lies in where you get the attribute. In XML, attributes have no significant order, so you should not expect them to be returned in the order in which they are written. In your code you have the following:
Iterator<Attribute> attributes = startElement.getAttributes();
Attribute a = attributes.next();
System.out.println("ATRIBUTES " + a.getName());
type = a.getValue();
Here you get the first attribute from the element and set the type equal to its value. As I mentioned the XML attributes have no specific order so you are getting the evidence attribute. You should be getting the attribute by name:
Attribute a = startElement.getAttributeByName(QName.valueOf("type"));
System.out.println("ATRIBUTES " + a.getName());
type = a.getValue();
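Here is a self-contained sketch of the lookup by name, reading from a string so it can be run as is. The class and method names are made up; it also guards against a comment element that has no type attribute, since getAttributeByName returns null in that case.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

class CommentTypeReader {
    private static final QName TYPE = new QName("type");

    // Returns the type attribute of the first <comment> element,
    // or null if there is no such element or attribute.
    static String firstCommentType(String xml) throws Exception {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                StartElement start = event.asStartElement();
                if ("comment".equals(start.getName().getLocalPart())) {
                    // Look the attribute up by name; its position does not matter.
                    Attribute typeAttr = start.getAttributeByName(TYPE);
                    return typeAttr == null ? null : typeAttr.getValue();
                }
            }
        }
        return null;
    }
}
```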
Sorry, no direct answer, but a comment on how to use StAX or XmlPull effectively: streaming XML parsers are designed to be friendly to recursive descent parsing (avoiding explicit state modeling, something you'd often need with a SAX parser). In your case I'd expect the following methods (rejecting or ignoring all unexpected content):
Comment parseComment(XMLEventReader eventReader) {
// call parseText and parseDisease for the corresponding element starts
}
Text parseText(XMLEventReader eventReader) {
}
Disease parseDisease(XMLEventReader eventReader) {
}
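A minimal sketch of how such methods might be filled in. To stay short it returns plain strings instead of the Comment, Text and Disease types named in the stubs, and it only collects the <text> children of comments whose type is "disease"; every name here is an assumption for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

class DiseaseTextParser {
    // Collects the <text> content of every <comment type="disease">.
    static List<String> parseDiseaseTexts(String xml) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        List<String> texts = new ArrayList<>();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement() && isNamed(event.asStartElement(), "comment")) {
                parseComment(reader, event.asStartElement(), texts);
            }
        }
        return texts;
    }

    // Consumes events up to the matching </comment>.
    private static void parseComment(XMLEventReader reader, StartElement comment,
                                     List<String> texts) throws XMLStreamException {
        Attribute typeAttr = comment.getAttributeByName(new QName("type"));
        boolean isDisease = typeAttr != null && "disease".equals(typeAttr.getValue());
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                if (isDisease && isNamed(event.asStartElement(), "text")) {
                    texts.add(reader.getElementText()); // also consumes </text>
                } else {
                    skipElement(reader); // ignore unexpected or unwanted content
                }
            } else if (event.isEndElement()) {
                return; // children are fully consumed, so this is </comment>
            }
        }
    }

    // Skips the rest of the element whose start tag was just read.
    private static void skipElement(XMLEventReader reader) throws XMLStreamException {
        int depth = 1;
        while (depth > 0 && reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                depth++;
            } else if (event.isEndElement()) {
                depth--;
            }
        }
    }

    private static boolean isNamed(StartElement start, String name) {
        return name.equals(start.getName().getLocalPart());
    }
}
```

Because each method consumes exactly one element, no boolean flags have to be carried between events.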
That said, there is a tradeoff: if you don't need the streaming aspect (performance), you may be better off just parsing to a DOM and then extracting the information as needed by walking or peeking into the DOM, avoiding a low-level XML API altogether.
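For comparison, a sketch of the DOM route using only the JDK: parse the whole document, then pick out the disease texts with one XPath expression. It assumes the comment elements are not in a namespace; the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DiseaseXPath {
    // Returns the text of every <text> child of a <comment type="disease">.
    static List<String> diseaseTexts(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // The predicate selects on the attribute value, regardless of attribute order.
        NodeList nodes = (NodeList) xpath.evaluate(
                "//comment[@type='disease']/text", doc, XPathConstants.NODESET);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent());
        }
        return texts;
    }
}
```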
By using StAX I assume you are dealing with a large document, or a platform with limited resources... The fact is that memory overhead is largely a DOM-related issue. VTD-XML, on the other hand, is far more efficient than DOM while retaining virtually all the benefits of the DOM style of coding. Please read this research paper for more info:
http://sdiwc.us/digitlib/journal_paper.php?paper=00000582.pdf
import com.ximpleware.*;
public class queryAttr {
public static void main(String[] s) throws VTDException{
VTDGen vg = new VTDGen();
vg.selectLcDepth(5);// improve XPath performance for deep document
if (!vg.parseFile("input.xml", false))
return;
VTDNav vn = vg.getNav();
AutoPilot ap = new AutoPilot(vn);
ap.selectXPath("/root/comment[@type='disease' and @evidence='23']");
int i=0,j=0;
while((i=ap.evalXPath())!=-1){
if (vn.toElement(VTDNav.FIRST_CHILD)){
System.out.println(" element name: "+ vn.toString(vn.getCurrentIndex()));
j=vn.getText(); // index of the text node, or -1 if there is none
if (j!=-1)
System.out.println(""+vn.toString(j));
if (vn.toElement(VTDNav.NS)){
System.out.println(" element name: "+ vn.toString(vn.getCurrentIndex()));
j=vn.getText();
if (j!=-1)
System.out.println("text node==>"+vn.toString(j));
}
if (vn.toElement(VTDNav.NS)){
System.out.println(" element name: "+ vn.toString(vn.getCurrentIndex()));
j=vn.getText();
if (j!=-1)
System.out.println("text node==>"+vn.toString(j));
}
if (vn.toElement(VTDNav.NS)){
System.out.println(" element name: "+ vn.toString(vn.getCurrentIndex()));
j=vn.getText();
if (j!=-1)
System.out.println("text node==>"+vn.toString(j));
}
vn.toElement(VTDNav.PARENT);
}
}
}
}

Compare two XML strings ignoring element order

Suppose I have two xml strings
<test>
<elem>a</elem>
<elem>b</elem>
</test>
<test>
<elem>b</elem>
<elem>a</elem>
</test>
How to write a test that compares those two strings and ignores the element order?
I want the test to be as short as possible, no place for 10-line XML parsing etc. I'm looking for a simple assertion or something similar.
I have this (which doesn't work)
Diff diff = XMLUnit.compareXML(expectedString, actualString);
XMLAssert.assertXMLEqual("meh", diff, true);
For XMLUnit 2.0 (which is what I was looking for), this is now done with DefaultNodeMatcher:
Diff diff = DiffBuilder.compare(Input.fromFile(control))
.withTest(Input.fromFile(test))
.withNodeMatcher(new DefaultNodeMatcher(ElementSelectors.byNameAndText))
.build();
Hope this helps other people googling...
XMLUnit will do what you want, but you have to specify the elementQualifier. With no elementQualifier specified it will only compare the nodes in the same position.
For your example you want an ElementNameAndTextQualifier, which considers a node similar if one exists that matches the element name and its text value, something like:
Diff diff = new Diff(control, toTest);
// we don't care about ordering
diff.overrideElementQualifier(new ElementNameAndTextQualifier());
XMLAssert.assertXMLEqual(diff, true);
You can read more about it here: http://xmlunit.sourceforge.net/userguide/html/ar01s03.html#ElementQualifier
My original answer is outdated. If I had to build it again, I would use XMLUnit 2 and xmlunit-matchers. Please note that for XMLUnit a different order is always 'similar', not 'equal'.
@Test
public void testXmlUnit() {
String myControlXML = "<test><elem>a</elem><elem>b</elem></test>";
String expected = "<test><elem>b</elem><elem>a</elem></test>";
assertThat(myControlXML, isSimilarTo(expected)
.withNodeMatcher(new DefaultNodeMatcher(ElementSelectors.byNameAndText)));
// In case you want to ignore whitespace, add ignoreWhitespace().normalizeWhitespace()
assertThat(myControlXML, isSimilarTo(expected)
.ignoreWhitespace()
.normalizeWhitespace()
.withNodeMatcher(new DefaultNodeMatcher(ElementSelectors.byNameAndText)));
}
If somebody still wants to use a pure Java implementation, here it is. This implementation extracts the content from the XML and compares the lists ignoring order.
public static Document loadXMLFromString(String xml) throws Exception {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
InputSource is = new InputSource(new StringReader(xml));
return builder.parse(is);
}
@Test
public void test() throws Exception {
Document doc = loadXMLFromString("<test>\n" +
" <elem>b</elem>\n" +
" <elem>a</elem>\n" +
"</test>");
XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile("//test//elem");
NodeList all = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
List<String> values = new ArrayList<>();
if (all != null && all.getLength() > 0) {
for (int i = 0; i < all.getLength(); i++) {
values.add(all.item(i).getTextContent());
}
}
Set<String> expected = new HashSet<>(Arrays.asList("a", "b"));
assertThat("List equality without order",
values, containsInAnyOrder(expected.toArray()));
}
Cross-posting from Compare XML ignoring order of child elements
I had a similar need this evening, and couldn't find something that fit my requirements.
My workaround was to sort the two XML files I wanted to diff, sorting alphabetically by the element name. Once they were both in a consistent order, I could diff the two sorted files using a regular visual diff tool.
If this approach sounds useful to anyone else, I've shared the python script I wrote to do the sorting at http://dalelane.co.uk/blog/?p=3225
Just as an example of how to compare more complex xml elements matching based on equality of attribute name. For instance:
<request>
<param name="foo" style="" type="xs:int"/>
<param name="Cookie" path="cookie" style="header" type="xs:string" />
</request>
vs.
<request>
<param name="Cookie" path="cookie" style="header" type="xs:string" />
<param name="foo" style="query" type="xs:int"/>
</request>
With following custom element qualifier:
final Diff diff = XMLUnit.compareXML(controlXml, testXml);
diff.overrideElementQualifier(new ElementNameAndTextQualifier() {
@Override
public boolean qualifyForComparison(final Element control, final Element test) {
// this condition is copied from super.super class
if (!(control != null && test != null
&& equalsNamespace(control, test)
&& getNonNamespacedNodeName(control).equals(getNonNamespacedNodeName(test)))) {
return false;
}
// matching based on 'name' attribute
if (control.hasAttribute("name") && test.hasAttribute("name")) {
if (control.getAttribute("name").equals(test.getAttribute("name"))) {
return true;
}
}
return false;
}
});
XMLAssert.assertXMLEqual(diff, true);
For me, I also needed to add the checkForSimilar() method on the DiffBuilder.
Without it, the assertion failed, saying that the sequence of the nodes was not the same (the position in the child list was not the same).
My code was:
Diff diff = DiffBuilder.compare(Input.fromFile(control))
.withTest(Input.fromFile(test))
.withNodeMatcher(new DefaultNodeMatcher(ElementSelectors.byNameAndText))
.checkForSimilar()
.build();
I don't know which versions the other solutions used, but nothing worked (or at least nothing was simple), so here's my solution for anyone with the same pains.
P.S. I hate it when people leave out the imports or the fully qualified class names of static methods.
@Test
void given2XMLS_are_different_elements_sequence_with_whitespaces(){
String testXml = "<struct><int>3</int> <boolean>false</boolean> </struct>";
String expected = "<struct><boolean>false</boolean><int>3</int></struct>";
XmlAssert.assertThat(testXml).and(expected)
.ignoreWhitespace()
.normalizeWhitespace()
.withNodeMatcher(new DefaultNodeMatcher(ElementSelectors.byNameAndText))
.areSimilar();
}
OPTION 1
If the XML code is simple, try this:
String testString = ...
assertTrue(testString.matches("(?m)^<test>(\\s*<elem>(a|b)</elem>\\s*){2}</test>$"));
OPTION 2
If the XML is more elaborate, load it with an XML parser and compare the actual nodes found with your reference nodes.
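A minimal sketch of Option 2 with the JDK's DOM parser, assuming it is enough to compare the direct children of the root by name and text. The class and method names are made up. Unlike the pattern in Option 1, which would also accept two <elem>a</elem> entries, this treats the children as a multiset, so duplicates count.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class UnorderedXmlCompare {
    // Returns "name=text" for each child element of the root, sorted.
    static List<String> sortedChildren(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList children = doc.getDocumentElement().getChildNodes();
        List<String> values = new ArrayList<>();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Whitespace-only text nodes between elements are skipped here.
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                values.add(child.getNodeName() + "=" + child.getTextContent().trim());
            }
        }
        Collections.sort(values);
        return values;
    }

    // True if both documents have the same root children, in any order.
    static boolean sameChildrenIgnoringOrder(String xmlA, String xmlB) throws Exception {
        return sortedChildren(xmlA).equals(sortedChildren(xmlB));
    }
}
```

In a test this boils down to a single assertTrue(UnorderedXmlCompare.sameChildrenIgnoringOrder(expected, actual)) line.";
String ba = "<test>\n<elem>b</elem>\n<elem>a</elem>\n</test>";
String aa = "<test><elem>a</elem><elem>a</elem></test>";
if (!UnorderedXmlCompare.sameChildrenIgnoringOrder(ab, ba)) throw new AssertionError("ab vs ba should match");
if (UnorderedXmlCompare.sameChildrenIgnoringOrder(ab, aa)) throw new AssertionError("ab vs aa should differ");
</test>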
I ended up rewriting the XML and comparing the rewritten versions. Let me know if it helps any of you who stumble on this issue.
import org.apache.commons.lang3.StringUtils;
import org.jdom2.Attribute;
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.input.SAXBuilder;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;
import org.xmlunit.builder.DiffBuilder;
import org.xmlunit.diff.DefaultNodeMatcher;
import org.xmlunit.diff.Diff;
import org.xmlunit.diff.ElementSelectors;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;
public class XmlRewriter {
private static String rewriteXml(String xml) throws Exception {
SAXBuilder builder = new SAXBuilder();
Document document = builder.build(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
Element root = document.getRootElement();
XMLOutputter xmlOutputter = new XMLOutputter(Format.getPrettyFormat());
root.sortChildren((o1, o2) -> {
if(!StringUtils.equals(o1.getName(), o2.getName())){
return o1.getName().compareTo(o2.getName());
}
// get attributes
int attrCompare = transformToStr(o1.getAttributes()).compareTo(transformToStr(o2.getAttributes()));
if(attrCompare!=0){
return attrCompare;
}
if(o1.getValue()!=null && o2.getValue()!=null){
return o1.getValue().compareTo(o2.getValue());
}
return 0;
});
return xmlOutputter.outputString(root);
}
private static String transformToStr(List<Attribute> attributes){
return attributes.stream().map(e-> e.getName()+":"+e.getValue()).sorted().collect(Collectors.joining(","));
}
public static boolean areXmlSimilar(String xml1, String xml2) throws Exception {
Diff diff = DiffBuilder.compare(rewriteXml(xml1)).withTest(rewriteXml(xml2))
.normalizeWhitespace()
.ignoreWhitespace()
.ignoreComments()
.checkForSimilar()
.withNodeMatcher(new DefaultNodeMatcher(ElementSelectors.byNameAndText))
.build();
return !diff.hasDifferences();
}
// move below into another test class..
@Test
public void compareXml() throws Exception {
String xml1 = "<<your first XML str here>>";
String xml2 = "<<another XML str here>>";
assertTrue(XmlRewriter.areXmlSimilar(xml1, xml2));
}
}

How to get h2 Tag of a table using Jsoup

I need some help scraping a webpage with Jsoup. I want to parse player profiles from the hcfactions webpage and gather their kills and deaths. The problem I'm running into is that each profile page is dynamically created and will only have said tables if the player has kills or deaths. So in order to tell which table I'm parsing, I need to get the header text that's set after the call.
example web page: http://www.hcfactions.net/index.php?action=playerinfo&player=Djmaddox.
Below is a html segment from the web page I'm scraping:
<table class='table-bordered'><h2 style='text-align:center'>Deaths</h2>
<tr><td>Date</td><td>Reason</td><td>Details</td></tr><tr><td>Dec 11 5:27pm CST</td>.....
I have this code that pulls the tables and counts entries, but it won't pull the h2 tags with it for me to select.
public void getPlayerDetails(String name) {
String data = "";
Avatar temp = _db.getPlayer(name);
playerUrl = "http://www.hcfactions.net/index.php?action=playersearch&player=" + name;
try {
// data = Jsoup.connect(url)
// .url(url).get().html();
playerDoc = Jsoup.connect(playerUrl).get();
} catch (IOException ex) {
Logger.getLogger(JParser.class.getName()).log(Level.SEVERE, null, ex);
}
if (playerDoc.select("table").size() == 1) {
return;
} else if (playerDoc.select("table").size() >= 2) {
for (int x = 1; x < playerDoc.select("table").size(); x++) {
System.out.println("deaths");
Element table = playerDoc.select("table").get(x);
Iterator<Element> ite = table.select("tr").iterator();
int count = 0;
while (ite.hasNext()) {
data = ite.next().text();
count++;
}
if (count > 0) {
temp.setDeaths(count - 1);
}
}
}
}
The tag <h2> is on an invalid position. That's why JSoup cannot find it I think. You have to extract it yourself with regular expressions. You can get the content of the <h2> with the following code:
String tableToString = "<table class='table-bordered'><h2 style='text-align:center'>Deaths</h2>" + "<tr>" + "<td>Date</td>" + "<td>Reason</td>" + "<td>Details</td>" + "</tr>" + "</table>";
String regex = "<h2.*>(.*)?</h2>";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(tableToString);
if (matcher.find()) {
System.out.println(matcher.group(1));
}
You can init tableToString with table.toString() from your code.
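One caveat with the pattern above: the greedy .* can run across several headings when more than one <h2> ends up on the same line, so only the last heading would be captured. A small sketch with a non-greedy pattern that collects every heading, using only java.util.regex; the class and method names are made up, and this is still a regex over HTML, so it is only meant for simple, known markup like the snippet in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadingExtractor {
    // [^>]* keeps the attribute part inside one tag; (.*?) stops at the first </h2>.
    private static final Pattern H2 = Pattern.compile(
            "<h2[^>]*>(.*?)</h2>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed text of every <h2> element found in the HTML string.
    static List<String> h2Texts(String html) {
        List<String> headings = new ArrayList<>();
        Matcher matcher = H2.matcher(html);
        while (matcher.find()) {
            headings.add(matcher.group(1).trim());
        }
        return headings;
    }
}
```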
As ka3ak says, the <h2> is mispositioned. But you don't have to abandon your parser and resort to regex for that. Assuming JSoup is a decent HTML parser (never used it myself), the <h2> element should be the element immediately preceding the <table> element. Get your 'select' statement to look for it there.
Elements headers=playerDoc.select("div.span10.offset1 h2");
IMHO your selections seem a little bit overcomplicated, but maybe it has to be like that. Anyway, the snippet above will get you every h2 tag present in the proper container.
Later on you can select the required tables like this: Elements tables=playerDoc.select("div.span10.offset1 table"); and apply the proper data digging to them. The headers will be in the same order as the tables, of course. I think that my job is done here :)

How to insert/replace XML tag in XmlDocument?

I have a XmlDocument in java, created with the Weblogic XmlDocument parser.
I want to replace the content of a tag in this XMLDocument with my own data, or insert the tag if it isn't there.
<customdata>
<tag1 />
<tag2>mfkdslmlfkm</tag2>
<location />
<tag3 />
</customdata>
For example I want to insert a URL in the location tag:
<location>http://something</location>
but otherwise leave the XML as is.
Currently I use a XMLCursor:
XmlObject xmlobj = XmlObject.Factory.parse(a.getCustomData(), options);
XmlCursor xmlcur = xmlobj.newCursor();
while (xmlcur.hasNextToken()) {
boolean found = false;
if (xmlcur.isStart() && "schema-location".equals(xmlcur.getName().toString())) {
xmlcur.setTextValue("http://replaced");
System.out.println("replaced");
found = true;
} else if (xmlcur.isStart() && "customdata".equals(xmlcur.getName().toString())) {
xmlcur.push();
} else if (xmlcur.isEnddoc()) {
if (!found) {
xmlcur.pop();
xmlcur.toEndToken();
xmlcur.insertElementWithText("schema-location", "http://inserted");
System.out.println("inserted");
}
}
xmlcur.toNextToken();
}
I tried to find a "quick" xquery way to do this since the XmlDocument has an execQuery method, but didn't find it very easy.
Does anyone have a better way than this? It seems a bit elaborate.
How about an XPath based approach? I like this approach as the logic is super-easy to understand. The code is pretty much self-documenting.
If your xml document is available to you as an org.w3c.dom.Document object (as most parsers return), then you could do something like the following:
// get the list of customdata nodes
NodeList customDataNodeSet = findNodes(document, "//customdata" );
for (int i=0 ; i < customDataNodeSet.getLength() ; i++) {
Node customDataNode = customDataNodeSet.item( i );
// get the location nodes (if any) within this one customdata node
NodeList locationNodeSet = findNodes(customDataNode, "location" );
if (locationNodeSet.getLength() > 0) {
// replace
locationNodeSet.item( 0 ).setTextContent( "http://stackoverflow.com/" );
}
else {
// insert
Element newLocationNode = document.createElement( "location" );
newLocationNode.setTextContent("http://stackoverflow.com/" );
customDataNode.appendChild( newLocationNode );
}
}
And here's the helper method findNodes that does the XPath search.
private NodeList findNodes( Object obj, String xPathString )
throws XPathExpressionException {
XPath xPath = XPathFactory.newInstance().newXPath();
XPathExpression expression = xPath.compile( xPathString );
return (NodeList) expression.evaluate( obj, XPathConstants.NODESET );
}
How about an object oriented approach? You could deserialise the XML to an object, set the location value on the object, then serialise back to XML.
XStream makes this really easy.
For example, you would define the main object, which in your case is CustomData (I'm using public fields to keep the example simple):
public class CustomData {
public String tag1;
public String tag2;
public String location;
public String tag3;
}
Then you initialize XStream:
XStream xstream = new XStream();
// if you need to output the main tag in lowercase, use the following line
xstream.alias("customdata", CustomData.class);
Now you can construct an object from XML, set the location field on the object and regenerate the XML:
CustomData d = (CustomData)xstream.fromXML(xml);
d.location = "http://stackoverflow.com";
xml = xstream.toXML(d);
How does that sound?
If you don't know the schema the XStream solution probably isn't the way to go. At least XStream is on your radar now, might come in handy in the future!
You should be able to do this with XQuery.
Try:
fn:replace(string,pattern,replace)
I am new to XQuery myself and I have found it to be a painful query language to work with, but it does work quite well once you get over the initial learning curve.
I still wish there were an easier way that was as efficient.
