Extract values from xml file using Java

My response contains the XML below, and I want to retrieve the value of bEntityID (328) from it:
<?xml version="1.0" encoding="UTF-8"?>
<ns2:aResponse xmlns:ns2="http://www.***.com/F1/F2/F3/2011-09-11">
<createBEntityResponse bEntityID="328" />
</ns2:aResponse>
I am trying the following, but I get null:
System.out.println("bEntitytID="+XmlPath.with(response.asString())
.getInt("aResponse.createBEntityResponse.bEntityID"));
Any suggestion for getting BEntityID from this response?

I don't recommend using regex to extract element values, but if you are desperate, try the code below:
public class xmlValue {
public static void main(String[] args) {
String xml = "<ns2:aResponse xmlns:ns2=\"http://www.***.com/F1/F2/F3/2011-09-11\">\n" +
" <createBEntityResponse bEntityID=\"328\" />\n" +
"</ns2:aResponse>";
System.out.println(getTagValue(xml,"createBEntityResponse bEntityID"));
}
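public static String getTagValue(String xml, String tagName){
// Split on the literal name passed in (quoted so it is not treated as a regex)
String[] s = xml.split(java.util.regex.Pattern.quote(tagName));
// The first quoted string after the name is the attribute value
String[] valuesBetweenQuotes = s[1].split("\"");
return valuesBetweenQuotes[1];
}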
}
Output: 328
Note: Better solution is to use XML parsers
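As a rough illustration of the parser-based route mentioned in the note above, here is a minimal sketch that uses only the DOM parser bundled with the JDK. The class name BEntityIdReader and the helper firstAttribute are made up for this example; the XML is the response from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class BEntityIdReader {

    // Returns the given attribute of the first element with the given tag name,
    // or null if no such element exists.
    static String firstAttribute(String xml, String tagName, String attribute) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            if (nodes.getLength() == 0) {
                return null;
            }
            return ((Element) nodes.item(0)).getAttribute(attribute);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<ns2:aResponse xmlns:ns2=\"http://www.***.com/F1/F2/F3/2011-09-11\">\n"
                + "  <createBEntityResponse bEntityID=\"328\" />\n"
                + "</ns2:aResponse>";
        // Prints 328
        System.out.println(firstAttribute(xml, "createBEntityResponse", "bEntityID"));
    }
}
```

Unlike the split-based version, this does not depend on the exact spacing or attribute order in the response.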
This will fetch the first tag value:
public static String getTagValue(String xml, String tagName){
return xml.split("<"+tagName+">")[1].split("</"+tagName+">")[0];
}
Other way around is to use JSoup:
Document doc = Jsoup.parse(xml, "", Parser.xmlParser()); //parse the whole xml doc
for (Element e : doc.select("tagName")) {
System.out.println(e); //select the specific tag and prints
}

I think the best way is to deserialize the XML into a POJO and then read the value:
entityResponse.getEntityId();

I tried with the same XML file and was able to get the value of bEntityId with the following code. Hope it helps.
@Test
public void xmlPathTests() {
try {
File xmlExample = new File(System.getProperty("user.dir"), "src/test/resources/Data1.xml");
String xmlContent = FileUtils.readFileToString(xmlExample);
XmlPath xmlPath = new XmlPath(xmlContent).setRoot("aResponse");
System.out.println(" Entity ::" + xmlPath.getInt("createBEntityResponse.@bEntityID"));
assertEquals(328, xmlPath.getInt("createBEntityResponse.@bEntityID"));
} catch (Exception e) {
e.printStackTrace();
}
}

Related

How to improve performance of querying xml file with VTD-XML and XPath?

I am querying XML files of around 1 MB (20k+ lines). I use XPath to describe what I want and the VTD-XML library to retrieve it. I think I have a performance problem.
I make about 5k+ queries against each XML file, and it takes approximately 16-17 seconds to retrieve all values. Is this normal performance for such a task? How can I improve it?
I am using the VTD-XML library with the AutoPilot navigation approach, which lets me use XPath. The implementation is as follows:
private VTDGen vg = new VTDGen();
private VTDNav vn;
private AutoPilot ap = new AutoPilot();
public void init(String xml) {
log.info("Creating document");
xml = xml.replace("<?xml version=\"1.0\"?>", "<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
vg.setDoc(bytes);
try {
vg.parse(true);
vn = vg.getNav();
} catch (ParseException e) {
e.printStackTrace();
}
log.info("Document created");
}
public String parseXmlOrReturnNull(String query) {
String xPathStringVal = null;
try {
ap.selectXPath(query);
ap.bind(vn);
int i = -1;
while ((i = ap.evalXPath()) != -1) {
xPathStringVal = vn.getXPathStringVal();
}
}catch (XPathEvalException e) {
e.printStackTrace();
} catch (NavException e) {
e.printStackTrace();
} catch (XPathParseException e) {
e.printStackTrace();
}
return xPathStringVal;
}
My XML files have a specific format: they are divided into many parts (segments), and my queries are the same for every segment (I run them in a loop). For example, part of the XML:
<segment>
<a>
<b>value1</b>
<c>
<d>value2</d>
<e>value3</e>
</c>
</a>
</segment>
<segment>
<a>
<b>value4</b>
<c>
<d>value5</d>
<e>value6</e>
<f>value6</f>
</c>
</a>
</segment>
...
If I want to get value1 in first segment I am using query:
//segment[1]/a/b
for value 4 in second segment
//segment[2]/a/b
etc.
My intuition is that every query in my approach is independent (it knows nothing about the other queries), so AutoPilot, my iterator, always starts at the beginning of the file.
My question is: is there any way to position AutoPilot at the beginning of the segment being processed, and move it to the next segment when I finish querying? I think that if my method started searching from a specified point rather than from the beginning, it would be much faster.
Another option is to split the XML file into small XML files (one file per segment) and query those.
What do you think? Thanks in advance.
Minor: the replace is not needed, because UTF-8 is the default when the XML declaration names no encoding; you would only need to patch the declaration if it named a different encoding.
The segment XPath should be evaluated only once, so that each lookup does not rescan from the first segment up to the requested index.
If you need a List representation, you could use JAXB with annotations.
Event-based parsing without building a DOM object (SAXParser) is probably best.
DefaultHandler handler = new DefaultHandler() {
@Override
public void startElement(String uri,
String localName, String qName, Attributes attributes) throws SAXException {
// called when an element opens
}
@Override
public void endElement(String uri,
String localName, String qName) throws SAXException {
// called when an element closes
}
@Override
public void characters(char ch[], int start, int length) throws SAXException {
// called with the text content between tags
}
};
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
InputStream in = new ByteArrayInputStream(bytes);
parser.parse(in, handler);
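The advice above to evaluate the segment XPath only once can be sketched as follows. This is not VTD-XML code: it uses the JDK's javax.xml.xpath API to show the idea of locating all segments in one pass and then running a short relative query against each segment node. The class name SegmentQueries, the method valuesPerSegment, and the wrapping root element in the usage note are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class illustrating "select segments once, query relatively".
class SegmentQueries {

    // Evaluates relativeQuery once per <segment>, using each segment node as the context.
    static List<String> valuesPerSegment(String xml, String relativeQuery) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Locate every segment with a single document-wide query.
            NodeList segments = (NodeList) xpath.evaluate("//segment", doc, XPathConstants.NODESET);
            // Compile the per-segment query once and reuse it for every segment.
            XPathExpression expr = xpath.compile(relativeQuery);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < segments.getLength(); i++) {
                values.add(expr.evaluate(segments.item(i)));
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate query", e);
        }
    }
}
```

For a document that wraps the two segments from the question in a single root element, valuesPerSegment(xml, "a/b") would return value1 and value4 in document order, without any //segment[n] lookups.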

Get a part of a webpage using JSOUP

I am trying to programmatically search Google for a word's meaning and save the meaning to a file on my computer. I can successfully request the page and get the response as a Document (org.jsoup.nodes.Document), but I do not know how to extract only the word meaning from it. Please find the screenshot where I have indicated the part of the data that I need.
The response HTML is so big that I can't tell which element contains the data I want. Please help. Here is what I have done so far:
public class Search {
private static Pattern patternDomainName;
private Matcher matcher;
private static final String DOMAIN_NAME_PATTERN
= "([a-zA-Z0-9]([a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+[a-zA-Z]{2,6}";
static {
patternDomainName = Pattern.compile(DOMAIN_NAME_PATTERN);
}
public static void main(String[] args) {
Search obj = new Search();
Set<String> result = obj.getDataFromGoogle("debug%20meaning");
for(String temp : result){
System.out.println(temp);
}
System.out.println(result.size());
}
public String getDomainName(String url){
String domainName = "";
matcher = patternDomainName.matcher(url);
if (matcher.find()) {
domainName = matcher.group(0).toLowerCase().trim();
}
return domainName;
}
private Set<String> getDataFromGoogle(String query) {
Set<String> result = new HashSet<String>();
String request = "https://www.google.com/search?q=" + query + "&num=20";
System.out.println("Sending request..." + request);
try {
// need http protocol, set this as a Google bot agent :)
Document doc = Jsoup
.connect(request)
.userAgent(
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
.timeout(5000).get();
/**********Here comes my data fetching logic*****************
* Don't know where to find my desired data in such a big HTML document
*/
/*
String sc = doc.html().replaceAll("\\n", "");
System.out.println(doc.html());
*/
} catch (IOException e) {
e.printStackTrace();
}
return result;
}
}
The Google Dictionary API is deprecated!
Instead of scraping the Google search results page, which is what you are doing currently, you can get the same data from the http://google-dictionary.so8848.com/ service, which should be easier to scrape.

using xpath with namespace from a java class

I am trying to parse an xml document with namespace using XPATH. I have read how it is supposed to be done. I have implemented NamespaceContext as well. But, I still am not getting the values. I think I am missing something simple.
My xml input is
<?xml version="1.0" encoding="UTF-8"?>
<ns1:customer xmlns:ns1="http://test/ns1">
<ns1:name>john</ns1:name>
</ns1:customer>
My Main file is TestXMLPath
public static void main(String[] args) throws Exception {
String myInputXML = "src/testxmlpath/input-with-namespace.xml";
DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
String expression ="/ns1:customer/ns1:name";
Document document = db.parse(new File(myInputXML)) ;
XPath xpath = XPathFactory.newInstance().newXPath();
xpath.setNamespaceContext(new SimpleNamespaceContextImpl());
String value = xpath.evaluate(expression,document);
System.out.println("value" + value);
}
my NamespaceContext implementation is
public class SimpleNamespaceContextImpl implements NamespaceContext {
@Override
public String getNamespaceURI(String prefix) {
System.out.println("getNameSpace for prefix "+prefix);
if (prefix == null) {
throw new NullPointerException("Null prefix");
} else if ("ns1".equals(prefix)) {
return "http://test/ns1";
} else if ("xml".equals(prefix)) {
return XMLConstants.XML_NS_URI;
} else {
return XMLConstants.XML_NS_URI;
}
}
@Override
public String getPrefix(String namespaceURI) {
return "ns1";
}
@Override
public Iterator getPrefixes(String namespaceURI) {
return null;
}
}
I print out when a method gets called. Here is the output.
getNameSpace for prefix ns1
getNameSpace for prefix ns1
value
BUILD SUCCESSFUL
I can't understand, why won't it work ??
Any help will be greatly appreciated.
Thanks
Works fine for me. Output:
getNameSpace for prefix ns1
getNameSpace for prefix ns1
valuejohn
Are you sure you're loading the right document? I'm using Xerces to build the document and Saxon to evaluate the XPath. A dump of the relevant classes:
class com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl
class com.sun.org.apache.xerces.internal.dom.DeferredDocumentImpl
class net.sf.saxon.xpath.XPathFactoryImpl
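One possible explanation for the empty result, assuming the JDK's built-in XPath implementation rather than Saxon: DocumentBuilderFactory is not namespace-aware by default, so a prefixed expression such as /ns1:customer/ns1:name finds nothing in the resulting DOM. The sketch below enables namespace awareness before parsing. The class name NamespaceAwareXPath and the method customerName are invented for the example, and it parses from a string instead of the file used in the question.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
class NamespaceAwareXPath {

    static final String NS1 = "http://test/ns1";

    static String customerName(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            // Without this line the DOM carries no namespace information.
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    // Unknown prefixes map to "no namespace", not to the XML namespace.
                    return "ns1".equals(prefix) ? NS1 : XMLConstants.NULL_NS_URI;
                }
                @Override
                public String getPrefix(String uri) {
                    return NS1.equals(uri) ? "ns1" : null;
                }
                @Override
                public Iterator<String> getPrefixes(String uri) {
                    return NS1.equals(uri)
                            ? Collections.singletonList("ns1").iterator()
                            : Collections.<String>emptyIterator();
                }
            });
            return xpath.evaluate("/ns1:customer/ns1:name", doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }
}
```

With the XML from the question this should return john; whether it matches the asker's exact setup depends on which XPath implementation is on the classpath.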

List HTML tags from a String

I have a String, and I want to list all the HTML tags present in it. Is there any library available to do this job?
Any information will be very helpful to me.
You can use the below code to extract only the HTML tags from your String.
package com.overflow.stack;
/**
*
* @author sarath_sivan
*/
public class ExtractHtmlTags {
public static void getHtmlTags(String html) {
int beginIndex = 0;
while(beginIndex!=-1) {
beginIndex = html.indexOf("<", 0);
int endIndex = html.indexOf(">", beginIndex+1);
String htmlTag = "";
try {
if(beginIndex!=-1) {
htmlTag = html.substring(beginIndex, endIndex+1);
}
} catch(Exception e) {
e.printStackTrace();
}
System.out.println(htmlTag);
html = html.substring(endIndex+1, html.length());
}
}
public static void main(String[] args) {
String html = "<html><body><h2>List HTML tags from a String</h2>hello<br /></body></html>";
ExtractHtmlTags.getHtmlTags(html);
}
}
But, I don't understand what you are trying to do with the extracted HTML tags. Good luck!
You can try http://jsoup.org/
I'm not sure it can return a list of tags directly, but you can build the list by iterating over the DOM.
The parser from HTMLUnit can take a String and return a structured result:
http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/html/HTMLParser.html
page = Nokogiri::HTML(open('http://yoursite.com'))
page.css("*").map{|x| x.name}.flatten.uniq
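For a dependency-free Java take on the question, the distinct tag names can be collected with java.util.regex. This is only a sketch: a regular expression is not a real HTML parser and will misfire on comments, scripts, or attribute values that contain angle brackets. The class name HtmlTagNames and the method tagNames are made up for the example.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; a proper HTML parser is more robust than this.
class HtmlTagNames {

    // Matches opening or self-closing tags such as "<h2>" or "<br />" and captures the name.
    private static final Pattern OPEN_TAG = Pattern.compile("<([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*>");

    // Returns the distinct tag names in order of first appearance, lower-cased.
    static Set<String> tagNames(String html) {
        Set<String> names = new LinkedHashSet<>();
        Matcher m = OPEN_TAG.matcher(html);
        while (m.find()) {
            names.add(m.group(1).toLowerCase());
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "<html><body><h2>List HTML tags from a String</h2>hello<br /></body></html>";
        // Prints [html, body, h2, br]
        System.out.println(tagNames(html));
    }
}
```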

Convert XML to JSON format

I have to convert docx file format (which is in openXML format) into JSON format. I need some guidelines to do it. Thanks in advance.
You may take a look at the Json-lib Java library, that provides XML-to-JSON conversion.
String xml = "<hello><test>1.2</test><test2>123</test2></hello>";
XMLSerializer xmlSerializer = new XMLSerializer();
JSON json = xmlSerializer.read( xml );
If you need the root tag too, simply add an outer dummy tag:
String xml = "<hello><test>1.2</test><test2>123</test2></hello>";
XMLSerializer xmlSerializer = new XMLSerializer();
JSON json = xmlSerializer.read("<x>" + xml + "</x>");
There is no direct mapping between XML and JSON; XML carries with it type information (each element has a name) as well as namespacing. Therefore, unless each JSON object has type information embedded, the conversion is going to be lossy.
But that doesn't necessarily matter. What does matter is that the consumer of the JSON knows the data contract. For example, given this XML:
<books>
<book author="Jimbo Jones" title="Bar Baz">
<summary>Foo</summary>
</book>
<book title="Don't Care" author="Fake Person">
<summary>Dummy Data</summary>
</book>
</books>
You could convert it to this:
{
"books": [
{ "author": "Jimbo Jones", "title": "Bar Baz", "summary": "Foo" },
{ "author": "Fake Person", "title": "Don't Care", "summary": "Dummy Data" }
]
}
And the consumer wouldn't need to know that each object in the books collection was a book object.
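The mapping above can be sketched by hand with the JDK's DOM parser, to make the "data contract" point concrete: the converter below knows that each book has author and title attributes and a summary child, and flattens all three into one JSON object. The class name BooksToJson and its methods are invented for the example, and the JSON is built with plain string concatenation rather than a JSON library.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical converter tied to the <books> layout shown above.
class BooksToJson {

    // Wraps a string in quotes and escapes characters JSON does not allow raw.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    static String toJson(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList books = doc.getElementsByTagName("book");
            List<String> items = new ArrayList<>();
            for (int i = 0; i < books.getLength(); i++) {
                Element book = (Element) books.item(i);
                NodeList summaries = book.getElementsByTagName("summary");
                // A missing <summary> becomes an empty string in this sketch.
                String summary = summaries.getLength() > 0 ? summaries.item(0).getTextContent() : "";
                items.add("{ \"author\": " + quote(book.getAttribute("author"))
                        + ", \"title\": " + quote(book.getAttribute("title"))
                        + ", \"summary\": " + quote(summary) + " }");
            }
            return "{ \"books\": [" + String.join(", ", items) + "] }";
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not convert XML", e);
        }
    }
}
```

Note that the element/attribute distinction is lost on purpose here, which is exactly the lossy step the answer describes.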
Edit:
If you have an XML Schema for the XML and are using .NET, you can generate classes from the schema using xsd.exe. Then, you could parse the source XML into objects of these classes, then use a DataContractJsonSerializer to serialize the classes as JSON.
If you don't have a schema, it will be hard getting around manually defining your JSON format yourself.
The XML class in the org.json namespace provides you with this functionality.
You have to call the static toJSONObject method
Converts a well-formed (but not necessarily valid) XML string into a JSONObject. Some information may be lost in this transformation because JSON is a data format and XML is a document format. XML uses elements, attributes, and content text, while JSON uses unordered collections of name/value pairs and arrays of values. JSON does not like to distinguish between elements and attributes. Sequences of similar elements are represented as JSONArrays. Content text may be placed in a "content" member. Comments, prologs, DTDs, and <[ [ ]]> are ignored.
If you are dissatisfied with the various implementations, try rolling your own. Here is some code I wrote this afternoon to get you started. It works with net.sf.json and Apache commons-lang:
static public JSONObject readToJSON(InputStream stream) throws Exception {
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
SAXParser parser = factory.newSAXParser();
SAXJsonParser handler = new SAXJsonParser();
parser.parse(stream, handler);
return handler.getJson();
}
And the SAXJsonParser implementation:
package xml2json;
import net.sf.json.*;
import org.apache.commons.lang.StringUtils;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import java.util.ArrayList;
import java.util.List;
public class SAXJsonParser extends DefaultHandler {
static final String TEXTKEY = "_text";
JSONObject result;
List<JSONObject> stack;
public SAXJsonParser(){}
public JSONObject getJson(){return result;}
public String attributeName(String name){return "@"+name;}
public void startDocument () throws SAXException {
stack = new ArrayList<JSONObject>();
stack.add(0,new JSONObject());
}
public void endDocument () throws SAXException {result = stack.remove(0);}
public void startElement (String uri, String localName,String qName, Attributes attributes) throws SAXException {
JSONObject work = new JSONObject();
for (int ix=0;ix<attributes.getLength();ix++)
work.put( attributeName( attributes.getLocalName(ix) ), attributes.getValue(ix) );
stack.add(0,work);
}
public void endElement (String uri, String localName, String qName) throws SAXException {
JSONObject pop = stack.remove(0); // examine stack
Object stashable = pop;
if (pop.containsKey(TEXTKEY)) {
String value = pop.getString(TEXTKEY).trim();
if (pop.keySet().size()==1) stashable = value; // single value
else if (StringUtils.isBlank(value)) pop.remove(TEXTKEY);
}
JSONObject parent = stack.get(0);
if (!parent.containsKey(localName)) { // add new object
parent.put( localName, stashable );
}
else { // aggregate into arrays
Object work = parent.get(localName);
if (work instanceof JSONArray) {
((JSONArray)work).add(stashable);
}
else {
parent.put(localName,new JSONArray());
parent.getJSONArray(localName).add(work);
parent.getJSONArray(localName).add(stashable);
}
}
}
public void characters (char ch[], int start, int length) throws SAXException {
JSONObject work = stack.get(0); // aggregate characters
String value = (work.containsKey(TEXTKEY) ? work.getString(TEXTKEY) : "" );
work.put(TEXTKEY, value+new String(ch,start,length) );
}
public void warning (SAXParseException e) throws SAXException {
System.out.println("warning e=" + e.getMessage());
}
public void error (SAXParseException e) throws SAXException {
System.err.println("error e=" + e.getMessage());
}
public void fatalError (SAXParseException e) throws SAXException {
System.err.println("fatalError e=" + e.getMessage());
throw e;
}
}
Converting complete docx files into JSON does not look like a good idea, because docx is a document-centric XML format and JSON is a data-centric format. XML in general is designed to be both document- and data-centric. Though it is technically possible to convert document-centric XML into JSON, handling the generated data might be overly complex. Try to focus on the data you actually need and convert only that part.
If you need to be able to manipulate your XML before it gets converted to JSON, or want fine-grained control of your representation, go with XStream. It's really easy to convert between: xml-to-object, json-to-object, object-to-xml, and object-to-json. Here's an example from XStream's docs:
XML
<person>
<firstname>Joe</firstname>
<lastname>Walnes</lastname>
<phone>
<code>123</code>
<number>1234-456</number>
</phone>
<fax>
<code>123</code>
<number>9999-999</number>
</fax>
</person>
POJO (DTO)
public class Person {
private String firstname;
private String lastname;
private PhoneNumber phone;
private PhoneNumber fax;
// ... constructors and methods
}
Convert from XML to POJO:
String xml = "<person>...</person>";
XStream xstream = new XStream();
Person person = (Person)xstream.fromXML(xml);
And then from POJO to JSON:
XStream xstream = new XStream(new JettisonMappedXmlDriver());
String json = xstream.toXML(person);
Note: although the method reads toXML() XStream will produce JSON, since the Jettison driver is used.
If you have a valid DTD file for the XML snippet, you can easily convert XML to JSON and JSON to XML using the open source EclipseLink JAR. A detailed sample Java project can be found here: http://www.cubicrace.com/2015/06/How-to-convert-XML-to-JSON-format.html
I have come across a tutorial, hope it helps you.
http://www.techrecite.com/xml-to-json-data-parser-converter
Use
xmlSerializer.setForceTopLevelObject(true)
to include root element in resulting JSON.
Your code would be like this
String xml = "<hello><test>1.2</test><test2>123</test2></hello>";
XMLSerializer xmlSerializer = new XMLSerializer();
xmlSerializer.setForceTopLevelObject(true);
JSON json = xmlSerializer.read(xml);
Docx4j
I've used docx4j before, and it's worth taking a look at.
unXml
You could also check out my open source unXml-library that is available on Maven Central.
It is lightweight, and has a simple syntax to pick out XPaths from your xml, and get them returned as Json attributes in a Jackson ObjectNode.
