Java, XML parsing - java

I need some help reading an XML document.
I have a Person class and I want to create a List of Person objects from the XML.
The XML looks something like this:
<root>
    <field1></field1>
    <field2></field2>
    <field3></field3>
    <Persons>
        <id></id>
        <List>
            <Person>
                <Name>...</Name>
                <LastName>...</LastName>
            </Person>
            <Person>
                <Name>...</Name>
                <LastName>...</LastName>
            </Person>
            <Person>
                <Name>...</Name>
                <LastName>...</LastName>
            </Person>
        </List>
    </Persons>
    <field4></field4>
    <field5></field5>
    <field6></field6>
</root>
I'm using the DOM parser (org.w3c.dom).
Can anyone please show me the best way to get the Persons information?
Thanks

If you only want to read the information, you are better off (after loading the DOM) running XPath queries against it. XPath is part of the J2SE API. Ask if you need specific examples.
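A minimal sketch of the XPath approach, assuming the element names from the question's sample (root, Persons, List, Person, Name, LastName). The Person class here is only a hypothetical stand-in for the asker's own class, and the XML is read from a String just to keep the example self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PersonXPathDemo {
    // Hypothetical stand-in for the asker's Person class.
    static class Person {
        final String name;
        final String lastName;

        Person(String name, String lastName) {
            this.name = name;
            this.lastName = lastName;
        }
    }

    static List<Person> readPersons(String xml) throws Exception {
        // Load the DOM first, as in the question.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Select every <Person> element under /root/Persons/List.
        NodeList nodes = (NodeList) xpath.evaluate(
                "/root/Persons/List/Person", doc, XPathConstants.NODESET);

        List<Person> persons = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node personNode = nodes.item(i);
            // Relative expressions are evaluated against the current <Person>.
            String name = xpath.evaluate("Name", personNode);
            String lastName = xpath.evaluate("LastName", personNode);
            persons.add(new Person(name, lastName));
        }
        return persons;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><Persons><id>1</id><List>"
                + "<Person><Name>Ann</Name><LastName>Lee</LastName></Person>"
                + "</List></Persons></root>";
        for (Person p : readPersons(xml)) {
            System.out.println(p.name + " " + p.lastName);
        }
    }
}
```

For a file on disk, DocumentBuilder.parse(File) can be used instead of the InputSource.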

You can use the Simple API for XML (SAX). You may also use the Streaming API for XML (StAX) (tutorial).
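As a rough illustration of the StAX route, here is a sketch that pulls Name and LastName out of each Person element with the JDK's XMLStreamReader. The element names come from the question's sample; representing each person as a two-element String array is only an assumption to keep the example short.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class PersonStaxDemo {
    // Each entry is {name, lastName}; swap in your own Person class as needed.
    static List<String[]> readPersons(String xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String[]> persons = new ArrayList<>();
        String[] current = null; // the <Person> currently being read, if any

        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String tag = reader.getLocalName();
                if (tag.equals("Person")) {
                    current = new String[2];
                    persons.add(current);
                } else if (current != null && tag.equals("Name")) {
                    current[0] = reader.getElementText();
                } else if (current != null && tag.equals("LastName")) {
                    current[1] = reader.getElementText();
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && reader.getLocalName().equals("Person")) {
                current = null; // left the <Person> element
            }
        }
        reader.close();
        return persons;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><Persons><id>1</id><List>"
                + "<Person><Name>Ann</Name><LastName>Lee</LastName></Person>"
                + "</List></Persons></root>";
        for (String[] p : readPersons(xml)) {
            System.out.println(p[0] + " " + p[1]);
        }
    }
}
```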

I prefer JAXB. It's also part of the J2SE API (bundled with Java SE 6 through 10; from Java 11 on it is a separate dependency).
Ask if you need help.

I hate to just leave this here, but I answered a similar question here.
In Java you have quite a few options for actually parsing the XML. XPath will be the slowest, but it gives you a nice expression language to query the content with. DOM will be the second slowest, but it gives you an in-memory tree model of your doc to walk. SAX will be faster, but requires that you build the list as it parses through the doc on the fly. StAX will be the fastest, but requires that you write some code specific to your format to build your list.
I would also recommend a library I wrote called SJXP, which gives you the performance of StAX with the ease of XPath... it is the perfect blend of the two.
You write rules like "/root/Persons/List/Person/Name" and give it your doc, and it will fire every time it hits a name and call a user-provided callback for you, handing you the name it found.
You create a few rules for all the values you want and voilà... you can create a START_TAG rule for the "/root/Persons/List/Person" open-tag and create a new "Person p = new Person()" in your code; then, as every sub-element hits, you just set the appropriate value on the person, something like this (as an example):
IRule linkRule = new DefaultRule(Type.CHARACTER, "/root/Persons/List/Person/Name") {
    @Override
    public void handleParsedCharacters(XMLParser parser, String text, Object userObject) {
        // Get the last person we added on open-tag.
        Person p = personList.get(personList.size() - 1);
        // <Name> tag was parsed, 'text' is our parsed Name. Set it.
        p.setName(text);
    }
};
The nice thing about SJXP is that the memory overhead is lower than the other parser approaches and the performance is higher (SAX will parse the elements on a match; StAX-based parsing doesn't parse the elements out of the stream until they are requested).
You will end up writing equally confusing code just to traverse your DOM and all the Node elements to build your list.
Lastly, if you feel comfortable with XML->Object mapping, you could do what another person said and leverage JAXB. You will need to write a schema for your XML files; JAXB will then generate Java objects that map perfectly to them. Then you can just map your XML file directly to your Java object and call something like "persons.getList()" or whatever JAXB generates for you.
The memory overhead and performance will be roughly on par with DOM parsing in that case.

XPath is one solution.
If you do not want to use another library, then try defining a DTD and using an ID attribute; most parsers have a getElementById(id) function.

Another easy way is to use regular expressions:
Pattern pattern = Pattern.compile("<Person>.*?<Name>(.*?)</Name>.*?<LastName>(.*?)</LastName>.*?</Person>", Pattern.MULTILINE | Pattern.DOTALL);
Matcher matcher = pattern.matcher(xml);
while (matcher.find())
{
    String name = matcher.group(1);
    String lastName = matcher.group(2);
}
Store the name and lastName in your own Persons data structure.
Define the compiled Pattern as a constant outside your method, because compiling it takes time.
Please see
http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html

Related

Why am I getting no spaces between text node values?

I am using an XPath expression to get text nodes from an XML document like the one below:
<company>
    <emp>
        <dept>Acct</dept>
        <salary>1000</salary>
        <proj>
            <under>E01</under>
            <under>E02</under>
        </proj>
        <name>John Doe</name>
        <gender>male</gender>
    </emp>
</company>
I have written the following XPath expression to get the text values:
normalize-space(string(//emp))
It is extracting the correct values and the output is like below:
Acct1000E01E02John Doemale
Notice that there are no spaces between the text node values from different nodes.
I actually want the output value to be in this way:
`Acct 1000 E01 E02 John Doe`
I have used javax.xml.xpath to parse and build the tree as follows:
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document document = builder.parse(new File("/employees.xml"));
XPath xpath = XPathFactory.newInstance().newXPath();
String expression = "normalize-space(string(//emp))";
String output = (String) xpath.compile(expression).evaluate(document, XPathConstants.STRING);
I am using Java SE 10 here, so the XPath version is 1.0.
Is there a better way to extract the text values?
I am pretty new to XPath so any suggestions would be helpful.
You are almost right here.
Picking the not operator is the right way to go.
It should be something like this:
/html/body/company/emp/*[not(self::gender)]
That is, all child nodes of emp except the gender node.
Here is a full example in JavaScript:
let xpathExpression = '/html/body/company/emp/*[not(self::gender)]';
let contextNode = window.document;
let xpathResult = document.evaluate(xpathExpression, contextNode,
null, XPathResult.ANY_TYPE, null);
console.log(xpathResult.iterateNext());
console.log(xpathResult.iterateNext());
console.log(xpathResult.iterateNext());
console.log(xpathResult.iterateNext());
Oh dear, this one is complicated...
First of all, you haven't tagged your question with an XPath version. Usually people who aren't aware of XPath versions are using the ancient version 1.0, so I'll make that assumption: sorry if it's wrong.
In XPath 1.0, a function that is given a node-set and that expects a string uses the string value of the first node in the node-set, taken in document order.
In your query
normalize-space(string(//emp))
//emp selects a node-set, which happens to contain a single node, so string() takes the string value of that node. The string value of an element node is the concatenation of all its descendant text nodes. The normalize-space function removes leading and trailing whitespace, and normalizes each internal run of whitespace to a single space character.
You have shown your XML in indented form as
<company>
    <emp>
        <dept>Acct</dept>
        <salary>1000</salary>
etc, so it's reasonable to expect that the whitespace between elements forms part of the string value of the <emp> element. But you haven't told us how the document was parsed and turned into a node tree. Parsers often provide multiple options on how to do this, in particular, on how to handle the whitespace between element nodes. Most retain the whitespace by default, unless perhaps there is a schema or DTD that tells the parser that the whitespace is insignificant. Microsoft's MSXML parser, notoriously, drops the whitespace by default, which causes considerable problems when you are using XML to represent narrative documents, but actually makes life easier for people using XML for this kind of non-document data.
Your parser, for one reason or another (we can't tell) seems to have deleted the whitespace between element nodes. No XPath query is going to bring it back again. You may have options when building the document to retain the whitespace; that depends on the tools you are using.
Your second question asks about dropping one of the elements in the input. That's beyond the scope of XPath. XPath can only select nodes from the input, it cannot modify them in any way. To modify the tree, you need XSLT or XQuery.
Your attempt to solve the problem with //emp[not(descendant::gender)] is hopelessly doomed because this will only select employees that have no descendant element named gender. You appear to be guessing the semantics rather than using a specification or tutorial.
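If the whitespace between elements really is absent from the tree, one workaround on the Java side is to select the individual text nodes and join them yourself, rather than relying on the string value of emp. The sketch below assumes the sample document from the question; the class name EmpText and the helper joinText are made up for illustration. The second expression leaves out the gender text purely by selection, without modifying the tree.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class EmpText {
    // Evaluates an XPath 1.0 expression that selects text nodes and joins
    // the non-blank ones with a single space.
    static String joinText(String xml, String expression) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList textNodes = (NodeList) xpath.evaluate(
                expression, document, XPathConstants.NODESET);

        List<String> parts = new ArrayList<>();
        for (int i = 0; i < textNodes.getLength(); i++) {
            String value = textNodes.item(i).getNodeValue().trim();
            if (!value.isEmpty()) { // skip whitespace-only nodes between elements
                parts.add(value);
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<company><emp><dept>Acct</dept><salary>1000</salary>"
                + "<proj><under>E01</under><under>E02</under></proj>"
                + "<name>John Doe</name><gender>male</gender></emp></company>";
        // Text nodes under <emp>, except those whose parent is <gender>.
        System.out.println(joinText(xml, "//emp//text()[not(parent::gender)]"));
        // prints: Acct 1000 E01 E02 John Doe
    }
}
```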

Converting a String to XML and finding a particular node in the XML

I am unsure how to implement this requirement.
I receive XML as a string from the database and need to find the value of particular elements inside it. My thought was:
1 - Convert the String to XML.
2 - Loop over the XML using NodeList and DocumentBuilder, OR use JAXB. Which one is the better option?
I'd definitely recommend JAXB instead of doing it by hand but if you're a bit masochistic it's doable by hand :3
One more option is to use Regular Expressions or use Groovy:)
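For the by-hand route with DocumentBuilder and NodeList, a minimal sketch could look like the following. The class and method names (XmlStringLookup, firstValue) and the sample XML are assumptions for illustration; the question does not show the actual document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XmlStringLookup {
    // Parses the XML held in a String and returns the text of the first
    // element with the given tag name, or null if there is no such element.
    static String firstValue(String xml, String tagName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList matches = doc.getElementsByTagName(tagName);
        return matches.getLength() == 0 ? null : matches.item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample standing in for the string read from the database.
        String xml = "<order><id>42</id><status>shipped</status></order>";
        System.out.println(firstValue(xml, "status"));
    }
}
```

Wrapping the String in an InputSource over a StringReader avoids any charset guessing, since the text is already decoded.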

In an XML document, is it possible to tell the difference between an entity-encoded character and one that is not?

I am being fed an XML document with metadata about online resources that I need to parse. Among the different metadata items is a collection of tags, which are comma-delimited. Here is an example:
<tags>Research skills, Searching&#44; evaluating and referencing</tags>
The issue is that one of these "tags" contains a comma in it. The comma within the tag is encoded, but the commas intended to delimit tags are not. I am (currently) using the getText() method on org.dom4j.Node to read the text content of the <tags> element, which returns a String.
The problem is that I am not able -- as far as I'm aware -- to differentiate the encoded comma (from the ones that aren't encoded) in the String I receive.
Short of writing my own XML parser, is there another way to access the text content of this node in a more "raw" state? (viz. a state where the encoded comma is still encoded.)
When you use dom4j or DOM, all the entities are already resolved, so you would need to go back to the parsing step to catch character references.
SAX is a more low-level interface and, via its LexicalHandler interface, supports being notified when the parser encounters entity references, but it does not report character references. So it seems that you would really need to write your own parser, or patch an existing one.
But in the end it would be best if you could change the schema of your document:
<tags>
<tag>Research skills</tag>
<tag>Searching, evaluating and referencing</tag>
</tags>
In your current document character references are used to act as metadata. XML elements are a better way to express that.
Using LexEv from http://andrewjwelch.com/lexev/, putting xercesImpl.jar from Apache Xerces on the class path, I am able to compile and run some short sample using dom4j:
LexEv lexEv = new LexEv();
SAXReader reader = new SAXReader(lexEv);
Document doc = reader.read("input1.xml");
System.out.println(doc.getRootElement().asXML());
If the input1.xml has your sample XML snippet, then the output is
<tags xmlns:lexev="http://andrewjwelch.com/lexev">Research skills, Searching<lexev:char-ref name="#44">,</lexev:char-ref> evaluating and referencing</tags>
So that way you could get a representation of your input where a pure character and a character reference can be distinguished.
As far as I know, every XML processing framework (except vtd-xml) resolves entities during parsing.
With vtd-xml you can distinguish a character from its entity-encoded counterpart by using VTDNav's toRawString() method.

Are there any Java HTML parsers where the generated Nodes retain indexes to the original text?

I'd like to query an HTML document as XML (e.g. with XPath), so I need to pass the HTML through some form of HTML cleaner.
But I'd also like to make modifications to the original source string based on the results of the queries.
Is there a Java HTML parser around that retains indexes to the original source string, so I can locate a node and modify the correct part of the original string?
Cheers.
It sounds like Jericho is almost exactly what you want. It is a robust HTML parser designed specifically for making unintrusive modifications to the source document.
While it doesn't come with DOM, SAX, or StAX interfaces, it has custom APIs that are similar enough to those standards that you should be able to adapt your approach to them fairly easily, or write an adapter between whatever you are using and Jericho. For instance, you can do XPath queries on Jericho documents using Jaxen -- see this blog entry for an example.
Jericho has begin and end attributes for every element, and even for parts of the element like the tag name or even an attribute name, so you can edit the document yourself with that information, but where Jericho really shines is the OutputDocument class, which lets you specify replacements directly by calling the appropriate methods with the Jericho elements that match your query instead of having to explicitly call getBegin() and getEnd() on them and pass that to some replacement method.
We use the Jericho HTML parser to do the parsing and HtmlCleaner to do the actual clean-up.
We had problems with Jericho's behavior within a server app (memory management, logging) that we fixed. (The original developer didn't think our issues were important enough to put in the main code branch.) Our fork is on GitHub.
We also made fixes to htmlcleaner.
I don't know about the "retain indexes to the original text" part but Jericho is a very good HTML parser library.
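Here is an example of how to remove every span from an HTML document:
import java.util.List;
import net.htmlparser.jericho.OutputDocument;
import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.Tag;

public static String removeSpans(String html) {
    Source source = new Source(html);
    source.fullSequentialParse();
    OutputDocument outputDocument = new OutputDocument(source);
    // getAllTags() returns start tags and end tags alike.
    List<Tag> tags = source.getAllTags();
    for (Tag tag : tags) {
        String tagname = tag.getName().toLowerCase();
        if (tagname.equals("span")) {
            // remove the <span> or </span> tag itself; its content stays
            outputDocument.remove(tag);
        }
    }
    return outputDocument.toString();
}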
I guess you could use HTML Parser.
You can get indexes to original Page using getStartPosition() and getEndPosition() from class Node.
As others have suggested, you probably want to build the DOM. This basically just means constructing the node tree; it won't alter the document source unless you use an HTML cleaner like JTidy. Then you have easy access to the document and can modify it as required. I would suggest dom4j; it has a good API and XPath support too.
Re your "indexing" requirement, during your traversal/querying of the document you can cache in a list or map any elements or nodes that you wish to modify the text of at a later point.
This works great:
http://jtidy.sourceforge.net/
Example:
Tidy tidy = new Tidy(); // obtain a new Tidy instance
tidy.setXHTML(true); // set desired config options using the Tidy setters
... // (equivalent to command line options)
tidy.parse(inputStream, System.out);
For crawling the DOM, I recommend using JDOM; it's way faster than simple XML.
http://www.jdom.org/
// Note: the calls below are the JDK's org.w3c.dom / JAXP API.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.newDocument();
Element root = doc.createElement("root");
Text text = doc.createTextNode("This is the root");
root.appendChild(text);
doc.appendChild(root);
As far as implementation is concerned, I would make a new document and add nodes to it from the source.
You could try ANTLR with an HTML grammar.
You could take (at least) 2 approaches - try and use it as an actual HTML parser, and then get the indexes into the original string that you are interested in.
Or, it also has built-in support for doing in-place transformations on source text, where you define the transformations that you want to perform on the text as part of the grammar.

How to generate an *exact* copy of an XML document with resolved entities

Given an XML document like this:
<!DOCTYPE doc SYSTEM 'http://www.blabla.com/mydoc.dtd'>
<author>john</author>
<doc>
<title>&title;</title>
</doc>
I want to parse the above XML document and generate a copy of it with all of its entities already resolved. So given the above XML document, the parser should output:
<!DOCTYPE doc SYSTEM 'http://www.blabla.com/mydoc.dtd'>
<author>john</author>
<doc>
<title>Stack Overflow Madness</title>
</doc>
I know that you could implement an org.xml.sax.EntityResolver to resolve entities, but what I don't know is how to properly generate a copy of the XML document with everything still intact (except its entities). By everything, I mean the whitespaces, the dtd at the top of the document, the comments, and any other things except the entities that should have been resolved previously. If this is not possible, please suggest a way that at least can preserve most of the things (e.g. all but no comments).
Note also that I am restricted to the pure Java API provided by Sun, so no third party libraries can be used here.
Thanks very much!
EDIT: The above XML document is a much simplified version of the original document. The original involves very complex entity resolution using an EntityResolver, whose significance I have greatly reduced in this question. What I am really interested in is how to produce an exact copy of the XML document with an XML parser that uses an EntityResolver to resolve the entities.
You almost certainly cannot do this using any XML parser I've heard of, and certainly the Sun XML parsers cannot do it. They will happily discard details that have no significance as far as the meaning of the XML is concerned. For example,
<title>Stack Overflow Madness</title>
and
<title >Stack Overflow Madness</title >
are indistinguishable from the perspective of the XML syntax, and the Sun parsers (rightly) treat them as identical.
I think your choices are to do the replacement treating the XML as text (as @Wololo suggests) or relax your requirements.
By the way, you can probably use an XmlEntityResolver independently of the XML parser. Or create a class that does the same thing. This may mean that String.replace... is not the answer, but you should be able to implement an ad-hoc expander that iterates over the characters in a character buffer, expanding them into a second one.
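A rough sketch of such an ad-hoc expander, treating the XML purely as text. The class name EntityExpander and the idea of passing the entity values in as a Map are assumptions for illustration; in practice the values would come from your own resolution logic. References that are not in the map (such as &amp; or character references) are copied through unchanged, and the sketch makes no attempt to skip comments or CDATA sections.

```java
import java.util.Collections;
import java.util.Map;

class EntityExpander {
    // Copies 'source' into a second buffer, replacing each "&name;" whose
    // name is a key in 'entities' and leaving every other character as-is.
    static String expand(String source, Map<String, String> entities) {
        StringBuilder out = new StringBuilder(source.length());
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            // For '&', look for the terminating ';' of a possible reference.
            int end = (c == '&') ? source.indexOf(';', i) : -1;
            if (end > i + 1) {
                String replacement = entities.get(source.substring(i + 1, end));
                if (replacement != null) {
                    out.append(replacement);
                    i = end + 1; // continue after the ';'
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> entities =
                Collections.singletonMap("title", "Stack Overflow Madness");
        System.out.println(expand("<title>&title;</title>", entities));
        // prints: <title>Stack Overflow Madness</title>
    }
}
```

Because everything outside the matched references is copied verbatim, whitespace, the DOCTYPE line and comments survive untouched, which is the property the question asks for.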
Is it possible for you to read in the XML template as a string?
And with the string do something like
String s = "<title>&title;</title>";
s = s.replace("&title;", "Stack Overflow Madness");
SaveXml(s); // placeholder for however you write the result back out
