XML getting text around a tag

I have an XML document with the structure below, and I want to retrieve the text around (both left and right of) a tag, using Java and dom4j:
<article>
<article-meta></article-meta>
<body>
<p>
Extensible Markup Language (XML) is a markup language that defines a set of
rules for encoding documents in a format that is both human-readable and machine-
readable <ref id="1">1</ref>. It is defined in the XML 1.0 Specification produced
by the W3C, and several other related specifications
</p>
<p>
Many application programming interfaces (APIs) have been developed to aid
software developers with processing XML <ref id="2">2</ref>. data, and several schema
systems exist to aid in the definition of XML-based languages.
</p>
</body>
</article>
I want to retrieve the text around the <ref> tag. For example, the output for this XML would be:
<ref id="1">1</ref>
left : both human-readable and machine-
readable
right : It is defined in the XML 1.0 Specification

Try
import java.util.List;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.Node;
import org.dom4j.io.SAXReader;
public class TestDom4j {
public static Document getDocument(final String xmlFileName) {
Document document = null;
SAXReader reader = new SAXReader();
try {
document = reader.read(xmlFileName);
} catch (DocumentException e) {
e.printStackTrace();
}
return document;
}
/**
* @param args command-line arguments (unused)
*/
public static void main(String[] args) {
String xmlFileName = "data.xml";
String xPath = "//article/body/p";
Document document = getDocument(xmlFileName);
List<Node> nodes = document.selectNodes(xPath);
for (Node node : nodes) {
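String nodeXml = node.asXML();
int refStart = nodeXml.indexOf("<ref");
int refEnd = nodeXml.indexOf("</ref>");
if (refStart < 0 || refEnd < 0) continue; // skip paragraphs without a <ref>
// strip the leading "<p>" (3 chars) and the trailing "</p>" (4 chars)
System.out.println("Left >> " + nodeXml.substring(3, refStart).trim());
System.out.println("Right >> " + nodeXml.substring(refEnd + 6, nodeXml.length() - 4).trim());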
}
}
}
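The string slicing above depends on the exact serialized form of each <p>. As an alternative sketch, the text on either side of a <ref> can be read from its sibling nodes. This version uses the JDK's built-in org.w3c.dom API rather than dom4j, and the class and method names (RefContext, sideText, parse) are made up for illustration. It assumes each <ref> sits directly inside its paragraph.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RefContext {
    // Concatenates the text of all siblings on one side of the given node.
    // forward == false walks left (previous siblings), true walks right.
    static String sideText(Node ref, boolean forward) {
        StringBuilder sb = new StringBuilder();
        Node n = forward ? ref.getNextSibling() : ref.getPreviousSibling();
        while (n != null) {
            String t = n.getTextContent();
            if (forward) sb.append(t); else sb.insert(0, t);
            n = forward ? n.getNextSibling() : n.getPreviousSibling();
        }
        // Collapse line breaks and runs of spaces left over from the source layout.
        return sb.toString().trim().replaceAll("\\s+", " ");
    }

    // Parses an XML string with the JDK's default DOM parser.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<body><p>both human-readable and machine-readable "
                + "<ref id=\"1\">1</ref>. It is defined in the XML 1.0 Specification</p></body>";
        NodeList refs = parse(xml).getElementsByTagName("ref");
        for (int i = 0; i < refs.getLength(); i++) {
            Node ref = refs.item(i);
            System.out.println("ref   : " + ref.getTextContent());
            System.out.println("left  : " + sideText(ref, false));
            System.out.println("right : " + sideText(ref, true));
        }
    }
}
```

If a paragraph holds several <ref> elements, the left and right text of one will include the text of the others, so it may need to be cut at the nearest neighbouring <ref>.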

Related

Convert relative to absolute links using jsoup

I'm using jsoup to clean an HTML page. The problem is that when I save the HTML locally, the images do not show because they all use relative links.
Here's some example code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class so2 {
public static void main(String[] args) {
String html = "<html><head><title>The Title</title></head>"
+ "<body><p><img width=\"437\" src=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" height=\"418\" class=\"documentimage\"></p></body></html>";
Document doc = Jsoup.parse(html,"https://whatever.com"); // baseUri seems to be ignored??
System.out.println(doc);
}
}
Output:
<html>
<head>
<title>The Title</title>
</head>
<body>
<p><img width="437" src="/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif" height="418" class="documentimage"></p>
</body>
</html>
The output still shows the image links as src="/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif".
I would like it to convert them to src="https://whatever.com/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif".
Can anyone show me how to get jsoup to convert all the links to absolute links?

You can select all the links and convert their href attributes to absolute URLs using Element.absUrl().
Example in your code:
EDIT (added processing of images)
public static void main(String[] args) {
String html = "<html><head><title>The Title</title></head>"
+ "<body><p><img width=\"437\" src=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" height=\"418\" class=\"documentimage\"></p></body></html>";
Document doc = Jsoup.parse(html,"https://whatever.com");
Elements select = doc.select("a");
for (Element e : select){
// baseUri will be used by absUrl
String absUrl = e.absUrl("href");
e.attr("href", absUrl);
}
//now we process the imgs
select = doc.select("img");
for (Element e : select){
e.attr("src", e.absUrl("src"));
}
System.out.println(doc);
}
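To see what kind of resolution is involved, here is a small sketch using only java.net.URI from the JDK. It is not jsoup's implementation, just an illustration of how a relative reference is combined with a base URI. The class name ResolveDemo, the helper toAbsolute, and the sample URLs are assumptions made for the example.

```java
import java.net.URI;

public class ResolveDemo {
    // Resolves a possibly relative reference against a base URI.
    static String toAbsolute(String base, String ref) {
        return URI.create(base).resolve(ref).toString();
    }

    public static void main(String[] args) {
        String base = "https://whatever.com/articles/index.html";
        // Root-relative: replaces the whole path of the base.
        System.out.println(toAbsolute(base, "/data/abstract/a.gif"));
        // Document-relative: resolved against the base's directory.
        System.out.println(toAbsolute(base, "img/b.gif"));
        // Already absolute: returned unchanged.
        System.out.println(toAbsolute(base, "http://other.example/c.gif"));
    }
}
```

Note that a base URI with an empty path (no trailing slash) can give surprising results for document-relative references with java.net.URI, so the sketch uses a base that includes a path.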

how to extract the text after certain tags using jsoup

I am using jsoup to extract tweet text. The HTML structure is:
<p class="js-tweet-text tweet-text">#sexyazzjas There is so much love in the air, Jasmine! Thanks for the shout out. <a href="/search?q=%23ATTLove&src=hash" data-query-source="hashtag_click" class="twitter-hashtag pretty-link js-nav" dir="ltr" ><s>#</s><b>ATTLove</b></a></p>
What I want to get is: There is so much love in the air, Jasmine! Thanks for the shout out.
I also want to extract all the tweet text in the entire page.
I am new to Java and my code has bugs. Please help me, thank you.
Below is my code:
package htmlparser;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class tweettxt {
public static void main(String[] args) {
Document doc;
try {
// need http protocol
doc = Jsoup.connect("https://twitter.com/ATT/").get();
// get page title
String title = doc.title();
System.out.println("title : " + title);
Elements links = doc.select("p class="js-tweet-text tweet-text"");
for (Element link : links) {
System.out.println("\nlink : " + link.attr("p"));
System.out.println("text : " + link.text());
}
} catch (IOException e) {
e.printStackTrace();
}
}
}

Although I agree with Robin Green about using the API rather than jsoup in this case, here is a working solution for what you asked, both to close this topic and to help future readers who have trouble with either of the following:
Writing a selector for a tag that has two or more classes.
Getting the direct text of a jsoup element that contains other elements.
public static void main(String[] args) {
Document doc;
try {
// need http protocol
doc = Jsoup.connect("https://twitter.com/ATT/").get();
// get page title
String title = doc.title();
System.out.println("title : " + title);
//select this <p class="js-tweet-text tweet-text"></p>
Elements links = doc.select("p.js-tweet-text.tweet-text");
for (Element link : links) {
System.out.println("\nlink : " + link.attr("p"));
/*use ownText() instead of text() in order to grab the direct text of
<p> and not the text that belongs to <p>'s children*/
System.out.println("text : " + link.ownText());
}
} catch (IOException e) {
e.printStackTrace();
}
}

Parsing SOAP Response in Java

I do not succeed in parsing a SOAP Response in Java (using Bonita Open Solution BPM).
I have the following SOAP response (searching for a document in the IBM Content Manager; the SOAP Response returns 1 matching document)
<soapenv:Envelope xmlns:soapenv="http://www.w3.org/2003/05/soap-envelope" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<soapenv:Body>
<ns1:RunQueryReply xmlns="http://www.ibm.com/xmlns/db2/cm/beans/1.0/schema" xmlns:ns1="http://www.ibm.com/xmlns/db2/cm/beans/1.0/schema">
<ns1:RequestStatus success="true"></ns1:RequestStatus>
<ns1:ResultSet count="1">
<ns1:Item URI="http://xxxx/CMBSpecificWebService/CMBGetPIDUrl?pid=96 3 ICM8 ICMNLSDB16 ICCSPArchivSuche59 26 A1001001A12D18B30015E9357518 A12D18B30015E935751 14 1087&server=ICMNLSDB&dsType=ICM">
<ns1:ItemXML>
<ICCSPArchivSuche ICCCreatedBy="EBUSINESS\iccadmin" ICCCreatedDate="2012-04-18T10:51:26.000000" ICCFileName="Golem_Artikel.txt" ICCFolderPath="" ICCLastModifiedDate="2012-04-18T10:51:28.000000" ICCLibrary="Dokumente" ICCModifiedBy="EBUSINESS\iccadmin" ICCSharePointGUID="c43f9c93-a228-43f9-8232-06bdea4695d1" ICCSharePointVersion="1.0 " ICCSite="Archiv Suche" cm:PID="96 3 ICM8 ICMNLSDB16 ICCSPArchivSuche59 26 A1001001A12D18B30015E9357518 A12D18B30015E935751 14 1087" xmlns:cm="http://www.ibm.com/xmlns/db2/cm/api/1.0/schema">
<cm:properties type="document">
<cm:lastChangeUserid value="ICCCMADMIN"/>
<cm:lastChangeTime value="2012-04-18T11:00:15.914"/>
<cm:createUserid value="ICCCMADMIN"/>
<cm:createTime value="2012-04-18T11:00:15.914"/>
<cm:semanticType value="1"/>
<cm:ACL name="DocRouteACL"/>
<cm:lastOperation name="RETRIEVE" value="SUCCESS"/>
</cm:properties>
<cm:resourceObject CCSID="0" MIMEType="text/plain" RMName="rmdb" SMSCollName="CBR.CLLCT001" externalObjectName=" " originalFileName="" resourceFlag="2" resourceName=" " size="702" textSearchable="true" xsi:type="cm:TextObjectType">
<cm:URL value="http://cmwin01.ebusiness.local:9080/icmrm/ICMResourceManager/A1001001A12D18B30015E93575.txt?order=retrieve&item-id=A1001001A12D18B30015E93575&version=1&collection=CBR.CLLCT001&libname=icmnlsdb&update-date=2012-04-18+11%3A00%3A15.001593&token=A4E6.IcQyRE6_QbBPESDGxK2;&content-length=0"/>
</cm:resourceObject>
</ICCSPArchivSuche>
</ns1:ItemXML>
</ns1:Item>
</ns1:ResultSet>
</ns1:RunQueryReply>
</soapenv:Body>
</soapenv:Envelope>
I would like to get the filename (ICCFileName="Golem_Artikel.txt") and the url to this file ( <cm:URL value="http://cmwin01.ebusiness.local:9080/icmrm/ICMResourceManager/A10...) in string Variables using Java. I read several articles on how to do this (Can't process SOAP response , How to do the Parsing of SOAP Response) but without success. Here is what I tried:
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
// Clean response xml document
responseDocumentBody.normalizeDocument();
// Get result node
NodeList resultList = responseDocumentBody.getElementsByTagName("ICCSPArchivSuche");
Element resultElement = (Element) resultList.item(0);
String XMLData = resultElement.getTextContent();
// Check for empty result
if ("Data Not Found".equalsIgnoreCase(XMLData))
return null;
DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
InputSource inputSource = new InputSource();
inputSource.setCharacterStream(new StringReader(XMLData));
Document doc = documentBuilder.parse(inputSource);
Node node = doc.getDocumentElement();
String result = doc.getNodeType();
return result;
From Bonita, I only get responseDocumentBody or responseDocumentEnvelope (org.w3c.dom.Document) as webservice response. Therefore, I need to navigate from the SOAP Body to my variables. I would be pleased if someone could help.
Best regards

If you do a lot of work with this, I would definitely recommend using JAXB, as MGoron suggests. If this is a one-off exercise, XPath could also work well.
/*
* Must use a namespace aware factory
*/
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
Document doc = dbf.newDocumentBuilder().parse(...);
/*
* Create an XPath object
*/
XPath p = XPathFactory.newInstance().newXPath();
/*
* Must use a namespace context
*/
p.setNamespaceContext(new NamespaceContext() {
public Iterator getPrefixes(String namespaceURI) {
return null;
}
public String getPrefix(String namespaceURI) {
return null;
}
public String getNamespaceURI(String prefix) {
if (prefix.equals("ns1"))
return "http://www.ibm.com/xmlns/db2/cm/beans/1.0/schema";
if (prefix.equals("cm"))
return "http://www.ibm.com/xmlns/db2/cm/api/1.0/schema";
return null;
}
});
/*
* Find the ICCFileName attribute
*/
Node iccsFileName = (Node) p.evaluate("//ns1:ICCSPArchivSuche/@ICCFileName", doc, XPathConstants.NODE);
System.out.println(iccsFileName.getNodeValue());
/*
* Find the URL
*/
Node url = (Node) p.evaluate("//ns1:ICCSPArchivSuche/cm:resourceObject/cm:URL/@value", doc, XPathConstants.NODE);
System.out.println(url.getNodeValue());
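For anyone who wants to try the approach without a live web service, here is a self-contained sketch of the same idea using only JDK classes. The XML is a heavily trimmed stand-in for the response in the question, the URL value is a placeholder, and the names SoapXPathDemo and eval are made up for illustration. It also shows why the prefix matters: ICCSPArchivSuche inherits the default namespace, so an unprefixed XPath name does not match it.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class SoapXPathDemo {
    // Prefixes used in the XPath expressions, mapped to the namespace URIs from the response.
    static final Map<String, String> NS = Map.of(
            "ns1", "http://www.ibm.com/xmlns/db2/cm/beans/1.0/schema",
            "cm", "http://www.ibm.com/xmlns/db2/cm/api/1.0/schema");

    // Parses the XML namespace-aware and returns the string value of the expression.
    static String eval(String xml, String expr) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        XPath xp = XPathFactory.newInstance().newXPath();
        xp.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return NS.getOrDefault(prefix, XMLConstants.NULL_NS_URI);
            }
            public String getPrefix(String uri) { return null; }
            public Iterator<String> getPrefixes(String uri) { return Collections.emptyIterator(); }
        });
        return xp.evaluate(expr, doc);
    }

    // Trimmed sample; the URL value is a placeholder, not the real resource URL.
    static final String SAMPLE =
            "<ns1:RunQueryReply xmlns=\"http://www.ibm.com/xmlns/db2/cm/beans/1.0/schema\""
          + " xmlns:ns1=\"http://www.ibm.com/xmlns/db2/cm/beans/1.0/schema\"><ns1:ItemXML>"
          + "<ICCSPArchivSuche ICCFileName=\"Golem_Artikel.txt\""
          + " xmlns:cm=\"http://www.ibm.com/xmlns/db2/cm/api/1.0/schema\">"
          + "<cm:resourceObject><cm:URL value=\"http://example.invalid/file.txt\"/></cm:resourceObject>"
          + "</ICCSPArchivSuche></ns1:ItemXML></ns1:RunQueryReply>";

    public static void main(String[] args) throws Exception {
        System.out.println(eval(SAMPLE, "//ns1:ICCSPArchivSuche/@ICCFileName"));
        System.out.println(eval(SAMPLE, "//ns1:ICCSPArchivSuche/cm:resourceObject/cm:URL/@value"));
        // Unprefixed name: matches nothing, so the result is an empty string.
        System.out.println("[" + eval(SAMPLE, "//ICCSPArchivSuche/@ICCFileName") + "]");
    }
}
```

Since Bonita already hands over an org.w3c.dom.Document, the parsing step could be skipped there, provided that document was built by a namespace-aware parser.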

Get the RunQueryReply schema.
Map the XSD to Java classes using JAXB.
Unmarshal the response string into a JAXB class object.

Below is the code to do this in VTD-XML. It basically consists of two XPath queries, each returning one result. The code is robust in that it doesn't assume those queries return a non-empty result.
import com.ximpleware.*;
public class parseSOAP {
public static void main(String[] s) throws VTDException, Exception{
VTDGen vg = new VTDGen();
vg.selectLcDepth(5);// soap has deep nesting so set to 5 to speed up navigation
if (!vg.parseFile("d:\\xml\\soap2.xml", true))
return;
VTDNav vn = vg.getNav();
AutoPilot ap =new AutoPilot(vn);
//declare name space for xpath
ap.declareXPathNameSpace("ns", "http://www.ibm.com/xmlns/db2/cm/beans/1.0/schema");
ap.declareXPathNameSpace("ns1", "http://www.ibm.com/xmlns/db2/cm/beans/1.0/schema");
ap.declareXPathNameSpace("cm", "http://www.ibm.com/xmlns/db2/cm/api/1.0/schema");
ap.declareXPathNameSpace("soapenv", "http://www.w3.org/2003/05/soap-envelope");
// ICCSPArchivSuche inherits the default namespace, so it needs a prefix to match in XPath 1.0
ap.selectXPath("/soapenv:Envelope/soapenv:Body/ns1:RunQueryReply/ns1:ResultSet/ns1:Item/ns1:ItemXML//ns1:ICCSPArchivSuche/@ICCFileName");
int i=0;
if ((i=ap.evalXPath())!=-1){
System.out.println("file name ==>"+vn.toString(i+1));
}
ap.selectXPath("/soapenv:Envelope/soapenv:Body/ns1:RunQueryReply/ns1:ResultSet/ns1:Item/ns1:ItemXML//ns1:ICCSPArchivSuche/cm:resourceObject/cm:URL/@value");
if ((i=ap.evalXPath())!=-1){
System.out.println("URL ==>"+vn.toString(i+1));
}
}
}

Is there a way to give a Java Document an XML schema for XPath queries

javax.xml.parsers.DocumentBuilder can build a document from a single stream which is the XML file. However, I can't find any way to also give it a schema file.
Is there a way to do this so that my XPath queries can perform type aware queries and return typed data?

The JAXP API is designed for XPath 1.0 and has never been upgraded to handle 2.0 concepts like schema-aware queries. If you are using Saxon, use the s9api interface instead of JAXP.
Here's an example of schema aware XPath taken from s9apiExamples.java in the saxon-resources download:
/**
* Demonstrate use of a schema-aware XPath expression
*/
private static class XPathC implements S9APIExamples.Test {
public String name() {
return "XPathC";
}
public boolean needsSaxonEE() {
return true;
}
public void run() throws SaxonApiException {
Processor proc = new Processor(true);
SchemaManager sm = proc.getSchemaManager();
sm.load(new StreamSource(new File("data/books.xsd")));
SchemaValidator sv = sm.newSchemaValidator();
sv.setLax(false);
XPathCompiler xpath = proc.newXPathCompiler();
xpath.declareNamespace("saxon", "http://saxon.sf.net/"); // not actually used, just for demonstration
xpath.importSchemaNamespace(""); // import schema for the non-namespace
DocumentBuilder builder = proc.newDocumentBuilder();
builder.setLineNumbering(true);
builder.setWhitespaceStrippingPolicy(WhitespaceStrippingPolicy.ALL);
builder.setSchemaValidator(sv);
XdmNode booksDoc = builder.build(new File("data/books.xml"));
// find all the ITEM elements, and for each one display the TITLE child
XPathSelector verify = xpath.compile(". instance of document-node(schema-element(BOOKLIST))").load();
verify.setContextItem(booksDoc);
if (((XdmAtomicValue)verify.evaluateSingle()).getBooleanValue()) {
XPathSelector selector = xpath.compile("//schema-element(ITEM)").load();
selector.setContextItem(booksDoc);
QName titleName = new QName("TITLE");
for (XdmItem item: selector) {
XdmNode title = getChild((XdmNode)item, titleName);
System.out.println(title.getNodeName() +
"(" + title.getLineNumber() + "): " +
title.getStringValue());
}
} else {
System.out.println("Verification failed");
}
}
// Helper method to get the first child of an element having a given name.
// If there is no child with the given name it returns null
private static XdmNode getChild(XdmNode parent, QName childName) {
XdmSequenceIterator iter = parent.axisIterator(Axis.CHILD, childName);
if (iter.hasNext()) {
return (XdmNode)iter.next();
} else {
return null;
}
}
}
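If the goal is only to have the document checked against a schema while it is being built (not typed XPath results), plain JAXP can do that through DocumentBuilderFactory.setSchema. The sketch below is a minimal illustration under that assumption. The inline one-element schema and the names SchemaParseDemo and parseValidated are made up for the example. XPath queries run through JAXP on the resulting document are still XPath 1.0 and return untyped values.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class SchemaParseDemo {
    // Hypothetical schema: a single <price> element holding a decimal.
    static final String XSD =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
          + "<xs:element name='price' type='xs:decimal'/>"
          + "</xs:schema>";

    // Builds a DOM document and validates it against XSD while parsing.
    static Document parseValidated(String xml) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setSchema(schema);
        DocumentBuilder db = dbf.newDocumentBuilder();
        // Without a handler that throws, validation errors are only reported, not raised.
        db.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { }
            public void error(SAXParseException e) throws SAXParseException { throw e; }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        return db.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document ok = parseValidated("<price>9.99</price>");
        System.out.println("accepted: " + ok.getDocumentElement().getTextContent());
        try {
            parseValidated("<price>cheap</price>");
        } catch (SAXParseException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```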

Convert XML to JSON format

I have to convert a docx file (which is in Open XML format) into JSON. I need some guidelines on how to do it. Thanks in advance.

You may take a look at the Json-lib Java library, that provides XML-to-JSON conversion.
String xml = "<hello><test>1.2</test><test2>123</test2></hello>";
XMLSerializer xmlSerializer = new XMLSerializer();
JSON json = xmlSerializer.read( xml );
If you need the root tag too, simply add an outer dummy tag:
String xml = "<hello><test>1.2</test><test2>123</test2></hello>";
XMLSerializer xmlSerializer = new XMLSerializer();
JSON json = xmlSerializer.read("<x>" + xml + "</x>");

There is no direct mapping between XML and JSON; XML carries with it type information (each element has a name) as well as namespacing. Therefore, unless each JSON object has type information embedded, the conversion is going to be lossy.
But that doesn't necessarily matter. What does matter is that the consumer of the JSON knows the data contract. For example, given this XML:
<books>
<book author="Jimbo Jones" title="Bar Baz">
<summary>Foo</summary>
</book>
<book title="Don't Care" author="Fake Person">
<summary>Dummy Data</summary>
</book>
</books>
You could convert it to this:
{
"books": [
{ "author": "Jimbo Jones", "title": "Bar Baz", "summary": "Foo" },
{ "author": "Fake Person", "title": "Don't Care", "summary": "Dummy Data" }
]
}
And the consumer wouldn't need to know that each object in the books collection was a book object.
Edit:
If you have an XML Schema for the XML and are using .NET, you can generate classes from the schema using xsd.exe. Then, you could parse the source XML into objects of these classes, then use a DataContractJsonSerializer to serialize the classes as JSON.
If you don't have a schema, it will be hard getting around manually defining your JSON format yourself.

The XML class in the org.json namespace provides you with this functionality.
You have to call the static toJSONObject method, which is documented as follows:
Converts a well-formed (but not necessarily valid) XML string into a JSONObject. Some information may be lost in this transformation because JSON is a data format and XML is a document format. XML uses elements, attributes, and content text, while JSON uses unordered collections of name/value pairs and arrays of values. JSON does not like to distinguish between elements and attributes. Sequences of similar elements are represented as JSONArrays. Content text may be placed in a "content" member. Comments, prologs, DTDs, and <[ [ ]]> are ignored.

If you are dissatisfied with the various implementations, try rolling your own. Here is some code I wrote this afternoon to get you started. It works with net.sf.json and Apache Commons Lang:
static public JSONObject readToJSON(InputStream stream) throws Exception {
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
SAXParser parser = factory.newSAXParser();
SAXJsonParser handler = new SAXJsonParser();
parser.parse(stream, handler);
return handler.getJson();
}
And the SAXJsonParser implementation:
package xml2json;
import net.sf.json.*;
import org.apache.commons.lang.StringUtils;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import java.util.ArrayList;
import java.util.List;
public class SAXJsonParser extends DefaultHandler {
static final String TEXTKEY = "_text";
JSONObject result;
List<JSONObject> stack;
public SAXJsonParser(){}
public JSONObject getJson(){return result;}
public String attributeName(String name){return "@"+name;}
public void startDocument () throws SAXException {
stack = new ArrayList<JSONObject>();
stack.add(0,new JSONObject());
}
public void endDocument () throws SAXException {result = stack.remove(0);}
public void startElement (String uri, String localName,String qName, Attributes attributes) throws SAXException {
JSONObject work = new JSONObject();
for (int ix=0;ix<attributes.getLength();ix++)
work.put( attributeName( attributes.getLocalName(ix) ), attributes.getValue(ix) );
stack.add(0,work);
}
public void endElement (String uri, String localName, String qName) throws SAXException {
JSONObject pop = stack.remove(0); // examine stack
Object stashable = pop;
if (pop.containsKey(TEXTKEY)) {
String value = pop.getString(TEXTKEY).trim();
if (pop.keySet().size()==1) stashable = value; // single value
else if (StringUtils.isBlank(value)) pop.remove(TEXTKEY);
}
JSONObject parent = stack.get(0);
if (!parent.containsKey(localName)) { // add new object
parent.put( localName, stashable );
}
else { // aggregate into arrays
Object work = parent.get(localName);
if (work instanceof JSONArray) {
((JSONArray)work).add(stashable);
}
else {
parent.put(localName,new JSONArray());
parent.getJSONArray(localName).add(work);
parent.getJSONArray(localName).add(stashable);
}
}
}
public void characters (char ch[], int start, int length) throws SAXException {
JSONObject work = stack.get(0); // aggregate characters
String value = (work.containsKey(TEXTKEY) ? work.getString(TEXTKEY) : "" );
work.put(TEXTKEY, value+new String(ch,start,length) );
}
public void warning (SAXParseException e) throws SAXException {
System.out.println("warning e=" + e.getMessage());
}
public void error (SAXParseException e) throws SAXException {
System.err.println("error e=" + e.getMessage());
}
public void fatalError (SAXParseException e) throws SAXException {
System.err.println("fatalError e=" + e.getMessage());
throw e;
}
}

Converting complete docx files into JSON does not look like a good idea, because docx is a document-centric XML format and JSON is a data-centric format. XML in general is designed to be both document-centric and data-centric. Though it is technically possible to convert document-centric XML into JSON, handling the generated data might be overly complex. Try to focus on the data you actually need and convert only that part.

If you need to be able to manipulate your XML before it gets converted to JSON, or want fine-grained control of your representation, go with XStream. It's really easy to convert between: xml-to-object, json-to-object, object-to-xml, and object-to-json. Here's an example from XStream's docs:
XML
<person>
<firstname>Joe</firstname>
<lastname>Walnes</lastname>
<phone>
<code>123</code>
<number>1234-456</number>
</phone>
<fax>
<code>123</code>
<number>9999-999</number>
</fax>
</person>
POJO (DTO)
public class Person {
private String firstname;
private String lastname;
private PhoneNumber phone;
private PhoneNumber fax;
// ... constructors and methods
}
Convert from XML to POJO:
String xml = "<person>...</person>";
XStream xstream = new XStream();
Person person = (Person)xstream.fromXML(xml);
And then from POJO to JSON:
XStream xstream = new XStream(new JettisonMappedXmlDriver());
String json = xstream.toXML(person);
Note: although the method is named toXML(), XStream will produce JSON here, since the Jettison driver is used.

If you have a valid DTD file for the XML snippet, then you can easily convert XML to JSON and JSON to XML using the open-source EclipseLink JAR. A detailed sample Java project can be found here: http://www.cubicrace.com/2015/06/How-to-convert-XML-to-JSON-format.html

I have come across a tutorial, hope it helps you.
http://www.techrecite.com/xml-to-json-data-parser-converter

Use
xmlSerializer.setForceTopLevelObject(true)
to include root element in resulting JSON.
Your code would be like this
String xml = "<hello><test>1.2</test><test2>123</test2></hello>";
XMLSerializer xmlSerializer = new XMLSerializer();
xmlSerializer.setForceTopLevelObject(true);
JSON json = xmlSerializer.read(xml);

Docx4j
I've used docx4j before, and it's worth taking a look at.
unXml
You could also check out my open source unXml-library that is available on Maven Central.
It is lightweight, and has a simple syntax to pick out XPaths from your xml, and get them returned as Json attributes in a Jackson ObjectNode.
