Best method to parse various custom XML documents in Java

What is the best method to parse multiple, discrete, custom XML documents with Java?

I would use StAX to parse XML; it's fast and easy to use. I used it on my last project to parse XML files of up to 24 MB. There's a nice introduction on java.net that covers everything you need to know to get started.

Basically, you have two main XML parsing methods in Java:
SAX, where you use a handler to grab only what you want from your XML and discard the rest
DOM, which parses the whole file and lets you access all elements as a tree.
Another very useful XML parsing method, more recent than these and included in the JRE only since Java 6, is StAX. StAX was conceived as a middle ground between the tree-based approach of DOM and the event-based approach of SAX. It is similar to SAX in that parsing very large documents is easy, but here the application "pulls" information from the parser instead of the parser "pushing" events to the application. You can find more explanation on this subject here.
So, depending on what you want to achieve, you can use one of these approaches.
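To make the "pull" idea concrete, here is a minimal sketch using the StAX API that ships with the JDK. The class name StaxPullSketch, the method textOf and the sample XML are made up for illustration; the code assumes the target elements contain only text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxPullSketch {

    // Collects the text of every <elementName> element by pulling events from the parser.
    static List<String> textOf(String xml, String elementName) throws XMLStreamException {
        List<String> values = new ArrayList<>();
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                // The application asks for the next event; nothing is pushed to it.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    // Assumption: the element holds text only, no child elements.
                    values.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return values;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<books><book><title>A</title></book><book><title>B</title></book></books>";
        System.out.println(textOf(xml, "title")); // prints [A, B]
    }
}
```

Because the loop only keeps the values it asks for, memory use stays small even for large inputs, which is the property StAX shares with SAX.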

Use the dom4j library.
First read the document:
import java.net.URL;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.io.SAXReader;
public class Foo {
    public Document parse(URL url) throws DocumentException {
        SAXReader reader = new SAXReader();
        Document document = reader.read(url);
        return document;
    }
}
Then use XPath to get to the values you need (this also requires import org.dom4j.Node):
public String get_author(Document document) {
    Node node = document.selectSingleNode( "//AppealRequestProcessRequest/author" );
    String author = node.getText();
    return author;
}

You will want to use org.xml.sax.XMLReader (http://docs.oracle.com/javase/7/docs/api/org/xml/sax/XMLReader.html).

If you only need to parse, then I would recommend using an XPath library. Here is a nice reference: http://www.ibm.com/developerworks/library/x-javaxpathapi.html
But you may want to consider turning the XML into objects, and then the sky is the limit.
For that you may use XStream, a great library that I use a lot.

Below is code that extracts a value using vtd-xml.
import com.ximpleware.*;
import java.io.IOException;
public class extractValue {
    public static void main(String s[]) throws VTDException, IOException {
        VTDGen vg = new VTDGen();
        // parseFile returns false when the file cannot be parsed
        if (!vg.parseFile("input.xml", false))
            return;
        VTDNav vn = vg.getNav();
        AutoPilot ap = new AutoPilot(vn);
        ap.selectXPath("/aa/bb[name='k1']/value");
        int i = 0;
        while ((i = ap.evalXPath()) != -1) {
            System.out.println(" value ===>" + vn.toString(i));
        }
    }
}

Related

In Java how do I evaluate XPATH expression on XML using SAX Parser?

In Java how do I evaluate XPATH expression on XML using SAX Parser?
I need a more dynamic way because the XML format is not fixed, so I should be able to pass the following:
xpath as string
xml as string / input source
Something like Utility.evaluate("/test/@id='123'", "")

Here is an example:
//First create a Document
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new File("test.xml"));
//Init the xpath factory
XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "/company/employee";
//read a nodelist using xpath
NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(doc, XPathConstants.NODESET);
EDIT :
If you want to use a SAX parser, then you can't use the XPath object of Java, see https://docs.oracle.com/javase/7/docs/api/javax/xml/xpath/package-summary.html
The XPath language provides a simple, concise syntax for selecting nodes from an XML document. XPath also provides rules for converting a node in an XML document object model (DOM) tree to a boolean, double, or string value. XPath is a W3C-defined language and an official W3C recommendation; the W3C hosts the XML Path Language (XPath) Version 1.0 specification.
XPath started in life in 1999 as a supplement to the XSLT and XPointer languages, but has more recently become popular as a stand-alone language, as a single XPath expression can be used to replace many lines of DOM API code.
If you want to use SAX, you can look at the libraries listed in this question: Is there any XPath processor for SAX model?
However, the mechanics of XPath do not really suit SAX. A SAX parser does not build an XML tree in memory, so XPath cannot be used efficiently: it cannot see nodes that have not been loaded.
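For very simple absolute paths you can get part of the way with plain SAX by tracking the stack of open elements yourself. The following is only a sketch under that assumption, not an XPath engine: SaxPathSketch and attributeValues are hypothetical names, and the handler supports nothing beyond "attribute X of elements at exactly this path".

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxPathSketch {

    // Returns the values of attribute attrName on every element whose absolute
    // path (for example "/test/item") equals elementPath.
    static List<String> attributeValues(String xml, String elementPath, String attrName)
            throws Exception {
        List<String> found = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>(); // names of currently open elements

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                open.addLast(qName);
                String current = "/" + String.join("/", open);
                if (current.equals(elementPath)) {
                    String value = atts.getValue(attrName);
                    if (value != null) {
                        found.add(value);
                    }
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                open.removeLast();
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return found;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<test><item id='123'/><other><item id='999'/></other><item id='456'/></test>";
        System.out.println(attributeValues(xml, "/test/item", "id")); // prints [123, 456]
    }
}
```

Anything that needs to look backwards or sideways in the document (predicates on siblings, parent axes and so on) does not fit this model, which is why general XPath needs a tree.",
        "/test/item", "id");
    if (!ids.equals(java.util.Arrays.asList("123", "456"))) throw new AssertionError(ids);
} catch (Exception e) {
    throw new AssertionError(e);
}
</test>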

Only a small subset of XPath is amenable to streamed evaluation, that is, evaluation on the fly while parsing the input document. There are therefore not many streaming XPath processors around; most of them are the product of academic research projects.
One thing you could try is Saxon-EE streamed XQuery. This is a small subset of XQuery that allows streamed execution (it will allow expressions like your example). Details at
http://www.saxonica.com/documentation/#!sourcedocs/streaming/streamed-query

Oracle's XQuery processor for Java will "dynamically" stream path expressions:
https://docs.oracle.com/database/121/ADXDK/adx_j_xqj.htm#ADXDK99930
Specifically, there is information on streaming here, including an example:
https://docs.oracle.com/database/121/ADXDK/adx_j_xqj.htm#ADXDK119
But it will not stream using SAX. You must bind the input XML as either StAX, InputStream, or Reader to get streaming evaluation.

You can use a SAXSource with XPath using Saxon, but - and this is important - be aware that the underlying implementation will almost certainly still be loading and buffering some or all of the document in memory in order to evaluate the xpath. It probably won't be a full DOM tree (Saxon relies on its own structure called TinyTree, which supports lazy-loading and various other optimizations), so it's better than using most DOM implementations, but it still involves loading the document into memory. If your concern is memory load for large data sets, it probably won't help you much, and you'd be better off using one of the streaming xpath/xquery options suggested by others.
An implementation of your utility method might look something like this:
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import org.xml.sax.InputSource;
import net.sf.saxon.xpath.XPathFactoryImpl;
public class XPathUtils {
public static Object evaluate(String xpath, String xml, QName returnType)
throws Exception {
SAXParser parser = (SAXParser) SAXParserFactory.newInstance()
.newSAXParser();
InputSource source = new InputSource(new StringReader(xml));
SAXSource saxSource = new SAXSource(parser.getXMLReader(), source);
XPath xPath = new XPathFactoryImpl().newXPath();
return xPath.evaluate(xpath, saxSource, returnType);
}
public static String xpathString(String xpath, String xml)
throws Exception {
return (String) evaluate(xpath, xml, XPathConstants.STRING);
}
public static boolean xpathBool(String xpath, String xml) throws Exception {
return (Boolean) evaluate(xpath, xml, XPathConstants.BOOLEAN);
}
public static Number xpathNumber(String xpath, String xml) throws Exception {
return (Number) evaluate(xpath, xml, XPathConstants.NUMBER);
}
public static void main(String[] args) throws Exception {
System.out.println(xpathString("/root/@id", "<root id='12345'/>"));
}
}
This works because the Saxon XPath implementation supports SAXSource as a context for evaluate(). Be aware that trying this with the built-in Apache XPath implementation will throw an exception.

High performance HTML parsing library [duplicate]

I'm working on an app which scrapes data from a website and I was wondering how I should go about getting the data. Specifically I need data contained in a number of div tags which use a specific CSS class - Currently (for testing purposes) I'm just checking for
div class = "classname"
in each line of HTML - This works, but I can't help but feel there is a better solution out there.
Is there any nice way where I could give a class a line of HTML and have some nice methods like:
boolean usesClass(String CSSClassname);
String getText();
String getLink();

Another library that might be useful for HTML processing is jsoup.
Jsoup tries to clean malformed HTML and allows HTML parsing in Java using a jQuery-like tag selector syntax.
http://jsoup.org/

The main problem, as stated in preceding comments, is malformed HTML, so an HTML cleaner or HTML-to-XML converter is a must. Once you have the XML code (XHTML) there are plenty of tools to handle it. You could process it with a simple SAX handler that extracts only the data you need, or with any tree-based method (DOM, JDOM, etc.) that even lets you modify the original code.
Here is sample code that uses HtmlCleaner to get all DIVs that use a certain class and print out all text content inside them.
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
/**
* @author Fernando Miguélez Palomo <fernandoDOTmiguelezATgmailDOTcom>
*/
public class TestHtmlParse
{
static final String className = "tags";
static final String url = "http://www.stackoverflow.com";
TagNode rootNode;
public TestHtmlParse(URL htmlPage) throws IOException
{
HtmlCleaner cleaner = new HtmlCleaner();
rootNode = cleaner.clean(htmlPage);
}
List getDivsByClass(String CSSClassname)
{
List divList = new ArrayList();
TagNode divElements[] = rootNode.getElementsByName("div", true);
for (int i = 0; divElements != null && i < divElements.length; i++)
{
String classType = divElements[i].getAttributeByName("class");
if (classType != null && classType.equals(CSSClassname))
{
divList.add(divElements[i]);
}
}
return divList;
}
public static void main(String[] args)
{
try
{
TestHtmlParse thp = new TestHtmlParse(new URL(url));
List divs = thp.getDivsByClass(className);
System.out.println("*** Text of DIVs with class '"+className+"' at '"+url+"' ***");
for (Iterator iterator = divs.iterator(); iterator.hasNext();)
{
TagNode divElement = (TagNode) iterator.next();
System.out.println("Text child nodes of DIV: " + divElement.getText().toString());
}
}
catch(Exception e)
{
e.printStackTrace();
}
}
}

Several years ago I used JTidy for the same purpose:
http://jtidy.sourceforge.net/
"JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.
JTidy was written by Andy Quick, who later stepped down from the maintainer position. Now JTidy is maintained by a group of volunteers.
More information on JTidy can be found on the JTidy SourceForge project page ."

You might be interested in TagSoup, a Java HTML parser able to handle malformed HTML. XML parsers would work only on well-formed XHTML.

The HTMLParser project (http://htmlparser.sourceforge.net/) might be a possibility. It seems to be pretty decent at handling malformed HTML. The following snippet should do what you need:
Parser parser = new Parser(htmlInput);
CssSelectorNodeFilter cssFilter =
new CssSelectorNodeFilter("DIV.targetClassName");
NodeList nodes = parser.parse(cssFilter);

Jericho: http://jericho.htmlparser.net/docs/index.html
Easy to use, supports HTML that is not well formed, and has a lot of examples.

HTMLUnit might be of help. It does a lot more stuff too.
http://htmlunit.sourceforge.net/

Let's not forget Jerry, it's jQuery in Java: a fast and concise Java library that simplifies HTML document parsing, traversing and manipulating; it includes support for CSS3 selectors.
Example:
Jerry doc = jerry(html);
doc.$("div#jodd p.neat").css("color", "red").addClass("ohmy");
Example:
doc.form("#myform", new JerryFormHandler() {
public void onForm(Jerry form, Map<String, String[]> parameters) {
// process form and parameters
}
});
Of course, these are just some quick examples to give a feel for how it all looks.

The nu.validator project is an excellent, high performance HTML parser that doesn't cut corners correctness-wise.
The Validator.nu HTML Parser is an implementation of the HTML5 parsing algorithm in Java. The parser is designed to work as a drop-in replacement for the XML parser in applications that already support XHTML 1.x content with an XML parser and use SAX, DOM or XOM to interface with the parser. Low-level functionality is provided for applications that wish to perform their own IO and support document.write() with scripting. The parser core compiles on Google Web Toolkit and can be automatically translated into C++. (The C++ translation capability is currently used for porting the parser for use in Gecko.)

You can also use XWiki HTML Cleaner:
It uses HTMLCleaner and extends it to generate valid XHTML 1.1 content.

If your HTML is well-formed, you can easily employ an XML parser to do the job for you... If you're only reading, SAX would be ideal.

How to parse advanced XML files in Java

I've seen numerous examples about how to read XML files in Java. But they only show simple XML files. For example they show how to extract first and last names from an XML file. However I need to extract data from a collada XML file. Like this:
<library_visual_scenes>
<visual_scene id="ID1">
<node name="SketchUp">
<instance_geometry url="#ID2">
<bind_material>
<technique_common>
<instance_material symbol="Material2" target="#ID3">
<bind_vertex_input semantic="UVSET0" input_semantic="TEXCOORD" input_set="0" />
</instance_material>
</technique_common>
</bind_material>
</instance_geometry>
</node>
</visual_scene>
</library_visual_scenes>
This is only a small part of a collada file. Here I need to extract the id of visual_scene, and then the url of instance_geometry and last the target of instance_material. Of course I need to extract much more, but I don't understand how to use it really and this is a place to start.
I have this code so far:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = null;
try {
builder = factory.newDocumentBuilder();
}
catch( ParserConfigurationException error ) {
Log.e( "Collada", error.getMessage() ); return;
}
Document document = null;
try {
document = builder.parse( string );
}
catch( IOException error ) {
Log.e( "Collada", error.getMessage() ); return;
}
catch( SAXException error ) {
Log.e( "Collada", error.getMessage() ); return;
}
NodeList library_visual_scenes = document.getElementsByTagName( "library_visual_scenes" );
It seems like most examples on the web are similar to this one: http://www.easywayserver.com/blog/java-how-to-read-xml-file/
I need help figuring out what to do when I want to extract deeper tags or find a good tutorial on reading/parsing XML files.

Really, your parsing per se is already done when you call builder.parse(string). What you need to know now is how to select/query information from the parsed XML document.
I would agree with @khachik regarding how to do that. Elaborating a little (since no one else has posted an answer):
XPath is the most convenient way to extract information, and if your input document is not huge, XPath is fast enough. Here is a good starting tutorial on XPath in Java. XPath is also recommended if you need random access to the XML data (i.e. if you have to go back and forth extracting data from the tree in a different order than it appears in the source document), since SAX is designed for linear access.
Some sample XPath expressions:
extract the id of visual_scene: /*/visual_scene/@id
the url of instance_geometry: /*/visual_scene/node/instance_geometry/@url
the url of instance_geometry for node whose name is SketchUp: /*/visual_scene/node[@name = 'SketchUp']/instance_geometry/@url
the target of instance_material: /*/visual_scene/node/instance_geometry/bind_material/technique_common/instance_material/@target
Since COLLADA models can be really large, you might need to do a SAX-based filter, which will allow you to process the document in stream mode without having to keep it all in memory at once. But if your existing code to parse the XML is already performing well enough, you may not need SAX. SAX is more complicated to use for extracting specific data than XPath.
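As a rough illustration of how such expressions can be evaluated with the JDK's javax.xml.xpath API, here is a sketch run against a trimmed copy of the fragment from the question. The class name ColladaXPathSketch and the helper eval are invented for the example. A complete COLLADA file normally declares a default namespace on its root element, which this fragment does not, so namespace handling is left out.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class ColladaXPathSketch {

    // A trimmed copy of the fragment from the question (no namespace declared).
    static final String XML =
            "<library_visual_scenes>"
          + "<visual_scene id=\"ID1\">"
          + "<node name=\"SketchUp\">"
          + "<instance_geometry url=\"#ID2\">"
          + "<bind_material><technique_common>"
          + "<instance_material symbol=\"Material2\" target=\"#ID3\"/>"
          + "</technique_common></bind_material>"
          + "</instance_geometry>"
          + "</node>"
          + "</visual_scene>"
          + "</library_visual_scenes>";

    // Evaluates an XPath expression against the fragment and returns its string value.
    static String eval(String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, new InputSource(new StringReader(XML)));
    }

    public static void main(String[] args) throws XPathExpressionException {
        System.out.println(eval("/*/visual_scene/@id"));                                              // ID1
        System.out.println(eval("/*/visual_scene/node[@name = 'SketchUp']/instance_geometry/@url")); // #ID2
        System.out.println(eval("//instance_material/@target"));                                      // #ID3
    }
}
```

Note that each call to eval re-parses the input. With a real file you would parse once into a Document, as the question's code already does, and evaluate every expression against that Document.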

You are using DOM in your code.
DOM creates a tree structure of the xml file it parsed, and you have to traverse the tree to get the information in various nodes.
In your code all you did is create the tree representation. I.e.
document = builder.parse( string );//document is loaded in memory as tree
Now you should reference the DOM apis to see how to get the information you need.
NodeList library_visual_scenes = document.getElementsByTagName( "library_visual_scenes" );
For instance this method returns a NodeList of all elements with the specified name.
Now you should loop over the NodeList:
for (int i = 0; i < library_visual_scenes.getLength(); i++) {
    Element element = (Element) library_visual_scenes.item(i);
    // Note: the first child may be a whitespace text node rather than <visual_scene>
    Node visual_scene = element.getFirstChild();
    if (visual_scene.getNodeType() == Node.ELEMENT_NODE)
    {
        String id = ((Element) visual_scene).getAttribute("id");
        System.out.println("id=" + id);
    }
}
DISCLAIMER: This is sample code that I have not compiled. It shows the concept; you should look into the DOM APIs.
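One practical wrinkle with this kind of traversal is that getFirstChild() often returns the whitespace text node between two tags rather than the element you expect. A small helper that returns only child elements avoids that. The sketch below uses hypothetical names (DomWalkSketch, childElements, sceneGeometry) and a shortened version of the question's XML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class DomWalkSketch {

    // Returns the direct child elements of parent, skipping text nodes and comments.
    static List<Element> childElements(Element parent) {
        List<Element> result = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                result.add((Element) child);
            }
        }
        return result;
    }

    // Returns "id -> url" for each instance_geometry found below each visual_scene.
    static List<String> sceneGeometry(String xml) throws Exception {
        Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        List<String> result = new ArrayList<>();
        for (Element scene : childElements(document.getDocumentElement())) {
            if (!"visual_scene".equals(scene.getTagName())) {
                continue;
            }
            // getElementsByTagName on an Element searches all of its descendants.
            NodeList geometries = scene.getElementsByTagName("instance_geometry");
            for (int i = 0; i < geometries.getLength(); i++) {
                Element geometry = (Element) geometries.item(i);
                result.add(scene.getAttribute("id") + " -> " + geometry.getAttribute("url"));
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<library_visual_scenes>\n"
                + "  <visual_scene id=\"ID1\">\n"
                + "    <node name=\"SketchUp\"><instance_geometry url=\"#ID2\"/></node>\n"
                + "  </visual_scene>\n"
                + "</library_visual_scenes>";
        System.out.println(sceneGeometry(xml)); // prints [ID1 -> #ID2]
    }
}
```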

EclipseLink JAXB (MOXy) has a useful @XmlPath extension for leveraging XPath to populate an object. It may be what you are looking for. Note: I am the MOXy tech lead.
The following example maps a simple address object to Google's representation of geocode information:
package blog.geocode;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.bind.annotation.XmlType;
import org.eclipse.persistence.oxm.annotations.XmlPath;
@XmlRootElement(name="kml")
@XmlType(propOrder={"country", "state", "city", "street", "postalCode"})
public class Address {
    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:SubAdministrativeArea/ns:Locality/ns:Thoroughfare/ns:ThoroughfareName/text()")
    private String street;
    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:SubAdministrativeArea/ns:Locality/ns:LocalityName/text()")
    private String city;
    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:AdministrativeAreaName/text()")
    private String state;
    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:CountryNameCode/text()")
    private String country;
    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:SubAdministrativeArea/ns:Locality/ns:PostalCode/ns:PostalCodeNumber/text()")
    private String postalCode;
}
For the rest of the example see:
http://bdoughan.blogspot.com/2010/09/xpath-based-mapping-geocode-example.html

Nowadays, several java RAD tools have java code generators from given DTDs, so you can use them.

Java XML Parsing and original byte offsets

I'd like to parse some well-formed XML into a DOM, but I'd like know the offset of each node's tag in the original media.
For example, if I had an XML document with the content something like:
<html>
<body>
<div>text</div>
</body>
</html>
I'd like to know that the <div> node starts at offset 13 in the original media, and (more importantly) that "text" starts at offset 18.
Is this possible with standard Java XML parsers? JAXB? If no solution is easily available, what type of changes are necessary along the parsing path to make this possible?

The SAX API provides a rather obscure mechanism for this - the org.xml.sax.Locator interface. When you use the SAX API, you subclass DefaultHandler and pass that to the SAX parse methods, and the SAX parser implementation is supposed to inject a Locator into your DefaultHandler via setDocumentLocator(). As the parsing proceeds, the various callback methods on your ContentHandler are invoked (e.g. startElement()), at which point you can consult the Locator to find out the parsing position (via getColumnNumber() and getLineNumber())
Technically, this is optional functionality, but the javadoc says that implementations are "strongly encouraged" to provide it, so you can likely assume the SAX parser built into JavaSE will do it.
Of course, this does mean using the SAX API, which is no one's idea of fun, but I can't see a way of accessing this information using a higher-level API.
edit: Found this example.
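A minimal sketch of the Locator mechanism might look like the following. LocatorSketch and elementLines are made-up names. Note that the Locator reports line and column numbers rather than byte offsets, and that its Javadoc describes the reported position as the first character after the text associated with the event, so for startElement() it points just past the start tag.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

class LocatorSketch {

    // Records "name:line" for every start tag, using the Locator supplied by the parser.
    static List<String> elementLines(String xml) throws Exception {
        List<String> positions = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private Locator locator;

            @Override
            public void setDocumentLocator(Locator locator) {
                // Called by the parser, if it supports locators, before startDocument().
                this.locator = locator;
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // Providing a Locator is optional, so guard against null.
                int line = (locator != null) ? locator.getLineNumber() : -1;
                positions.add(qName + ":" + line);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return positions;
    }

    public static void main(String[] args) throws Exception {
        // One entry per element, e.g. the line on which each start tag ends.
        System.out.println(elementLines("<html>\n<body>\n<div>text</div>\n</body>\n</html>"));
    }
}
```

To get from line and column to a character offset you would still need to index the line starts of the original input yourself.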

Use the XMLStreamReader and its getLocation() method to obtain a Location object. location.getCharacterOffset() gives the offset of the current location: a byte offset when the input is a file or byte stream, and a character offset when the input is a character source such as the FileReader used here.
import java.io.FileReader;
import javax.xml.stream.Location;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
public class Runner {
    public static void main(String argv[]) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try {
            XMLStreamReader streamReader = factory.createXMLStreamReader(
                    new FileReader("D:\\BigFile.xml"));
            while (streamReader.hasNext()) {
                streamReader.next();
                if (streamReader.getEventType() == XMLStreamReader.START_ELEMENT) {
                    Location location = streamReader.getLocation();
                    System.out.println("offset: " + location.getCharacterOffset());
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

What is the best/simplest way to read in an XML file in Java application? [closed]

Currently our Java application uses the values held within a tab delimited *.cfg file. We need to change this application so that it now uses an XML file.
What is the best/simplest library to use in order to read in values from this file?

There are of course a lot of good solutions based on what you need. If it is just configuration, you should have a look at Jakarta commons-configuration and commons-digester.
You could always use the standard JDK method of getting a document :
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
[...]
File file = new File("some/path");
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document document = db.parse(file);

XML Code:
<?xml version="1.0"?>
<company>
<staff id="1001">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<nickname>mkyong</nickname>
<salary>100000</salary>
</staff>
<staff id="2001">
<firstname>low</firstname>
<lastname>yin fong</lastname>
<nickname>fong fong</nickname>
<salary>200000</salary>
</staff>
</company>
Java Code:
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.dom.Node;
import org.w3c.dom.Element;
import java.io.File;
public class ReadXMLFile {
public static void main(String argv[]) {
try {
File fXmlFile = new File("/Users/mkyong/staff.xml");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
doc.getDocumentElement().normalize();
System.out.println("Root element :" + doc.getDocumentElement().getNodeName());
NodeList nList = doc.getElementsByTagName("staff");
System.out.println("----------------------------");
for (int temp = 0; temp < nList.getLength(); temp++) {
Node nNode = nList.item(temp);
System.out.println("\nCurrent Element :" + nNode.getNodeName());
if (nNode.getNodeType() == Node.ELEMENT_NODE) {
Element eElement = (Element) nNode;
System.out.println("Staff id : "
+ eElement.getAttribute("id"));
System.out.println("First Name : "
+ eElement.getElementsByTagName("firstname")
.item(0).getTextContent());
System.out.println("Last Name : "
+ eElement.getElementsByTagName("lastname")
.item(0).getTextContent());
System.out.println("Nick Name : "
+ eElement.getElementsByTagName("nickname")
.item(0).getTextContent());
System.out.println("Salary : "
+ eElement.getElementsByTagName("salary")
.item(0).getTextContent());
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
Output:
----------------
Root element :company
----------------------------
Current Element :staff
Staff id : 1001
First Name : yong
Last Name : mook kim
Nick Name : mkyong
Salary : 100000
Current Element :staff
Staff id : 2001
First Name : low
Last Name : yin fong
Nick Name : fong fong
Salary : 200000
I recommend reading this: Normalization in DOM parsing with java - how does it work?
Example source.

"What is the best/simplest library to use in order to read in values from this file?"
As you're asking for the simplest library, I feel obliged to add an approach quite different to that in Guillaume's top-voted answer. (Of the other answers, sjbotha's JDOM mention is closest to what I suggest).
I've come to think that for XML handling in Java, using the standard JDK tools is certainly not the simplest way, and that only in some circumstances (such as not being able to use 3rd party libraries, for some reason) it is the best way.
Instead, consider using a good XML library, such as XOM. Here's how to read an XML file into a nu.xom.Document object:
import nu.xom.Builder;
import nu.xom.Document;
import java.io.File;
[...]
File file = new File("some/path");
Document document = new Builder().build(file);
So, this was just a little bit simpler, as reading the file into org.w3c.dom.Document wasn't very complicated either, in the "pure JDK" approach. But the advantages of using a good library only start here! Whatever you're doing with your XML, you'll often get away with much simpler solutions, and less of your own code to maintain, when using a library like XOM. As examples, consider this vs. this, or this vs. this, or this post containing both XOM and W3C DOM examples.
Others will provide counter-arguments (like these) for why sticking to Java's standard XML APIs may be worth it - these probably have merit, at least in some cases, although personally I don't subscribe to all of them. In any case, when choosing one way or the other, it's good to be aware of both sides of the story.
(This answer is part of my evaluation of XOM, which is a strong contender in my quest for finding the best Java XML library to replace dom4j.)

Is there a particular reason you have chosen XML config files? I have done XML configs in the past, and they have often turned out to be more of a headache than anything else.
I guess the real question is whether using something like the Preferences API might work better in your situation.
Reasons to use the Preferences API over a roll-your-own XML solution:
Avoids typical XML ugliness (DocumentFactory, etc), along with avoiding 3rd party libraries to provide the XML backend
Built in support for default values (no special handling required for missing/corrupt/invalid entries)
No need to sanitize values for XML storage (CDATA wrapping, etc)
Guaranteed status of the backing store (no need to constantly write XML out to disk)
Backing store is configurable (file on disk, LDAP, etc.)
Multi-threaded access to all preferences for free

JAXB is simple to use and is included in Java SE 6. With JAXB, or another XML data binding library such as Simple, you don't have to handle the XML yourself; most of the work is done by the library. The basic usage is to add annotations to your existing POJO. These annotations are then used to generate an XML Schema for your data and also when reading/writing your data from/to a file.

I've only used jdom. It's pretty easy.
Go here for documentation and to download it: http://www.jdom.org/
If you have a very very large document then it's better not to read it all into memory, but use a SAX parser which calls your methods as it hits certain tags and attributes. You have to then create a state machine to deal with the incoming calls.

Look into JAXB.

The simplest by far will be Simple (http://simple.sourceforge.net); you only need to annotate a single object like so:
@Root
public class Entry {
    @Attribute
    private String a;
    @Attribute
    private int b;
    @Element
    private Date c;
    public String getSomething() {
        return a;
    }
}
@Root
public class Configuration {
    @ElementList(inline=true)
    private List<Entry> entries;
    public List<Entry> getEntries() {
        return entries;
    }
}
Then all you have to do to read the whole file is specify the location, and it will parse and populate the annotated POJOs. This does all the type conversions and validation. You can also annotate for persister callbacks if required. Reading it can be done like so:
Serializer serializer = new Persister();
Configuration configuration = serializer.read(Configuration.class, fileLocation);

Depending on your application and the scope of the cfg file, a properties file might be the easiest. Sure, it isn't as elegant as XML, but it is certainly easier.
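For comparison, reading a properties file with the JDK's java.util.Properties takes only a few lines. This is a sketch with invented names (PropertiesConfigSketch, the server.* keys); a string stands in for the file so the example is self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

class PropertiesConfigSketch {

    // Loads key=value pairs; in an application the Reader would wrap a file.
    static Properties load(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    public static void main(String[] args) throws IOException {
        Properties props = load("server.host=example.org\nserver.port=8080\n");
        String host = props.getProperty("server.host");
        // The second argument is the default used when the key is missing.
        int timeout = Integer.parseInt(props.getProperty("server.timeout", "30"));
        System.out.println(host + " " + timeout); // prints example.org 30
    }
}
```

If the file has to be XML, the same class also offers loadFromXML(InputStream), which reads a simple XML properties format.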

Use java.beans.XMLDecoder, part of core Java SE since 1.4.
XMLDecoder input = new XMLDecoder(new FileInputStream("some/path.xml"));
MyConfig config = (MyConfig) input.readObject();
input.close();
It's easy to write the configuration files by hand, or use the corresponding XMLEncoder with some setup to write new objects at run-time.

This is what I use. http://marketmovers.blogspot.com/2014/02/the-easy-way-to-read-xml-in-java.html It sits on top of the standard JDK tools, so if it's missing some feature you can always use the JDK version.
This really makes things easier for me. It's especially nice when I'm reading a config file that was saved by an older version of the software, or was manually edited by a user. It's very robust and won't throw an exception if some data is not exactly in the format you expect.

Here's a really simple API that I created for reading simple XML files in Java. It's incredibly simple and easy to use. Hope it's useful for you.
http://argonrain.wordpress.com/2009/10/27/000/
