Java XML Parsing and original byte offsets - java

I'd like to parse some well-formed XML into a DOM, but I'd also like to know the offset of each node's tag in the original media.
For example, if I had an XML document with the content something like:
<html>
<body>
<div>text</div>
</body>
</html>
I'd like to know that the <div> node starts at offset 13 in the original media, and (more importantly) that "text" starts at offset 18.
Is this possible with standard Java XML parsers? JAXB? If no solution is easily available, what type of changes are necessary along the parsing path to make this possible?

The SAX API provides a rather obscure mechanism for this: the org.xml.sax.Locator interface. When you use the SAX API, you subclass DefaultHandler and pass that to the SAX parse methods, and the SAX parser implementation is supposed to inject a Locator into your DefaultHandler via setDocumentLocator(). As the parsing proceeds, the various callback methods on your ContentHandler are invoked (e.g. startElement()), at which point you can consult the Locator to find out the parsing position (via getColumnNumber() and getLineNumber()). Note that these are line and column positions, not byte offsets.
Technically, this is optional functionality, but the javadoc says that implementations are "strongly encouraged" to provide it, so you can likely assume the SAX parser built into JavaSE will do it.
Of course, this does mean using the SAX API, which is no one's idea of fun, but I can't see a way of accessing this information using a higher-level API.
edit: Found this example.
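Separately from the linked example, here is a minimal sketch of the Locator mechanism described above. The class name LocatorDemo and the helper locate() are made up for illustration. It records the line and column the parser reports at each startElement() call. SAX defines this position as the end of the text for the event, so expect it to point just past the start tag, not at its first character.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: collects "name@line:column" for every start tag.
class LocatorDemo extends DefaultHandler {
    private Locator locator;                       // injected by the parser, may stay null
    final List<String> positions = new ArrayList<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (locator != null) {
            positions.add(qName + "@" + locator.getLineNumber() + ":" + locator.getColumnNumber());
        }
    }

    // Parses the given XML text and returns the recorded positions.
    static List<String> locate(String xml) {
        try {
            LocatorDemo handler = new LocatorDemo();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.positions;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html>\n<body>\n<div>text</div>\n</body>\n</html>";
        locate(xml).forEach(System.out::println);
    }
}
```

Turning a line/column pair into an offset in the original media would need extra bookkeeping on your side (for example, an index of line start offsets), which this sketch does not attempt.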

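Use XMLStreamReader and its getLocation() method to obtain a Location object. According to the Location javadoc, location.getCharacterOffset() returns a byte offset when the input is a file or byte stream, and a character offset when the input is a Reader, so the example below reads from an InputStream. Exactly which point of the tag is reported may depend on the StAX implementation.
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.Location;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
public class Runner {
    public static void main(String[] argv) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Read bytes, not characters, so that the reported offsets are byte offsets
        try (InputStream in = new FileInputStream("D:\\BigFile.xml")) {
            XMLStreamReader streamReader = factory.createXMLStreamReader(in);
            while (streamReader.hasNext()) {
                streamReader.next();
                if (streamReader.getEventType() == XMLStreamReader.START_ELEMENT) {
                    Location location = streamReader.getLocation();
                    System.out.println("byte location: " + location.getCharacterOffset());
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}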

Related

Converting java xml sax event calls to an xml string

Does the Java XML SAX API provide a ContentHandler subclass which converts the event calls to an XML string? For example, the following calls to this handler should produce the following XML:
XMLPrinterHandler h; // hypothetical handler class
String data = "hello";
h.startDocument();
h.startElement("", "element", "element", new AttributesImpl());
h.characters(data.toCharArray(), 0, data.length());
h.endElement("", "element", "element");
h.endDocument();
System.out.println(h.getXml());
This should print:
<element>hello</element>
I'm dealing with some code which encodes some data as xml and would like to know the intermediate output. The encoding class takes a ContentHandler and calls the appropriate methods on it to encode the data.
You want:
SAXTransformerFactory f = (SAXTransformerFactory) TransformerFactory.newInstance();
TransformerHandler t = f.newTransformerHandler();
t.setResult(new StreamResult(System.out));
t.startDocument();
etc
The TransformerHandler performs a "null transformation" from SAX input to lexical XML output.
You can also use
t.getTransformer().setOutputProperty()
to set serialization properties such as indenting, based on the properties defined in the XSLT specification. (The standard JDK TransformerHandler gives you XSLT 1.0 serialization properties; if you want the extended set defined in XSLT 3.0 plus Saxon extensions, use the Saxon implementation.)
Personally I find that writing Java code as a direct client of the SAX ContentHandler interface is very clumsy. I much prefer the XMLStreamWriter interface.
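As a rough, self-contained sketch of the TransformerHandler approach above: the class name SaxToString and the helper render() are invented for illustration, and the output goes to a StringWriter rather than System.out so that the result is available as a String. The XML declaration is suppressed through a standard output property so that only the element itself is serialized.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class: feeds SAX events into a TransformerHandler
// and returns the serialized XML as a String.
class SaxToString {
    static String render(String text) {
        try {
            // Assumes the default TransformerFactory supports SAX (the JDK one does)
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            // The same event sequence as in the question
            handler.startDocument();
            handler.startElement("", "element", "element", new AttributesImpl());
            handler.characters(text.toCharArray(), 0, text.length());
            handler.endElement("", "element", "element");
            handler.endDocument();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render("hello"));
    }
}
```

In the asker's scenario, the handler would be passed to the encoding class in place of the direct calls shown here.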

Parsing XML strings in MATLAB

I need to parse an XML string with MATLAB (caution: without file I/O, so I don't want to write the string to a file and then read them). I'm receiving the strings from an HTTP connection and the parsing should be very fast. I'm mostly concerned about reading the values of certain tags in the entire string
The net is full of death threats about parsing XML with regexp so I didn't want to get into that just yet. I know MATLAB has seamless java integration but I'm not very java savvy. Is there a quick way to get certain values from XML very very rapidly?
For example I want to get the 'volume' information from this string below and write this to a variable.
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<root>
<volume>256</volume>
<length>0</length>
<time>0</time>
<state>stop</state>
....
For what it's worth, below is the Matlab executable Java code to perform the required task, without writing to an intermediate file:
%An XML formatted string
strXml = [...
'<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>' char(10)...
'<root>' char(10) ...
' <volume>256</volume>' char(10) ...
' <length>0</length>' char(10) ...
' <time>0</time>' char(10) ...
' <state>stop</state>' char(10) ...
'</root>' ];
%"simple" java code to create a document from said string
xmlDocument = javax.xml.parsers.DocumentBuilderFactory.newInstance().newDocumentBuilder.parse(java.io.StringBufferInputStream(strXml));
%"intuitive" methods to explore the xmlDocument
nodeList = xmlDocument.getElementsByTagName('volume');
numberOfNodes = nodeList.getLength();
firstNode = nodeList.item(0);
firstNodeContent = firstNode.getTextContent;
disp(firstNodeContent); %Returns '256'
As an alternative, if your application allows it, consider passing the URL directly into your XML parser. Untested java code is below, but that probably also opens up the Matlab built-in xslt function as well.
xmlDocument = javax.xml.parsers.DocumentBuilderFactory.newInstance().newDocumentBuilder.parse('URL_AS_A_STRING_HERE');
Documentation here. Start at the "javax.xml.parsers" package.
There's an entire class of functions for dealing with xml, including xmlread and xmlwrite. Those should be pretty useful for your problem.
I am not familiar with Matlab's APIs at all, but I would point out that using the DOM method outlined by Pursuit will take the most time/memory if you only want specific values out of the XML stream you are getting back over the HTTP connection.
While StAX will give you the fastest parsing approach in Java, using the API can be unwieldy, especially if you are not that familiar with Java. You could use SJXP, which is an extremely thin abstraction on top of StAX parsing in Java (disclaimer: I am the author). It lets you define paths to the elements you want; you then give the parser a stream (your HTTP stream in this case) and it pulls out all the values for you.
As an example, let's say you wanted the /root/state and /root/volume values out of the example XML you posted. The actual Java would look something like this:
// Create /root/state rule
IRule stateRule = new DefaultRule(Type.CHARACTER, "/root/state") {
    @Override
    public void handleParsedCharacters(XMLParser parser, String text, Object userObject) {
        System.out.println("State is: " + text);
    }
};
// Create /root/volume rule
IRule volRule = new DefaultRule(Type.CHARACTER, "/root/volume") {
    @Override
    public void handleParsedCharacters(XMLParser parser, String text, Object userObject) {
        System.out.println("Volume is: " + text);
    }
};
// Create the parser with the given rules
XMLParser parser = new XMLParser(stateRule, volRule);
You can do all of that initialization on program start then at some point later when you are processing the stream from your HTTP connection, you would do something like:
parser.parse(httpConnection.getInputStream());
or the like; then all of your handler code you defined in your rules will get called as the parser runs through the stream of characters from the HTTP connection.
As I mentioned I am not familiar with Matlab and don't know the proper ways to "Matlab-i-fy" this code, but it looks like from the first example you can more or less just use the Java APIs directly in which case this solution will both be faster and use significantly less memory for parsing if that is important than the DOM approach.
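For comparison, here is a minimal sketch of the same kind of extraction using the JDK's own StAX API with no extra library. The class name VolumeReader and the helper firstValue() are invented for illustration. It parses directly from a String, so no intermediate file is involved, and it stops as soon as the requested element is found.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: pulls the text of one element out of an XML string.
class VolumeReader {
    // Returns the text of the first element with the given name, or null if there is none.
    static String firstValue(String xml, String tag) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && tag.equals(reader.getLocalName())) {
                        return reader.getElementText(); // stop early, skip the rest
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\" ?>\n"
                + "<root>\n<volume>256</volume>\n<length>0</length>\n"
                + "<time>0</time>\n<state>stop</state>\n</root>";
        System.out.println("Volume is: " + firstValue(xml, "volume"));
    }
}
```

Whether this is noticeably faster than the DOM approach for documents this small is something you would have to measure.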

High performance HTML parsing library [duplicate]

I'm working on an app which scrapes data from a website and I was wondering how I should go about getting the data. Specifically I need data contained in a number of div tags which use a specific CSS class - Currently (for testing purposes) I'm just checking for
div class = "classname"
in each line of HTML - This works, but I can't help but feel there is a better solution out there.
Is there any nice way where I could give a class a line of HTML and have some nice methods like:
boolean usesClass(String CSSClassname);
String getText();
String getLink();
Another library that might be useful for HTML processing is jsoup.
Jsoup tries to clean malformed HTML and allows html parsing in Java using jQuery like tag selector syntax.
http://jsoup.org/
The main problem, as stated in the preceding comments, is malformed HTML, so an HTML cleaner or HTML-to-XML converter is a must. Once you have the XML code (XHTML) there are plenty of tools to handle it. You could process it with a simple SAX handler that extracts only the data you need, or with any tree-based method (DOM, JDOM, etc.) that even lets you modify the original code.
Here is a sample code that uses HTML cleaner to get all DIVs that use a certain class and print out all Text content inside it.
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
/**
* @author Fernando Miguélez Palomo <fernandoDOTmiguelezATgmailDOTcom>
*/
public class TestHtmlParse
{
static final String className = "tags";
static final String url = "http://www.stackoverflow.com";
TagNode rootNode;
public TestHtmlParse(URL htmlPage) throws IOException
{
HtmlCleaner cleaner = new HtmlCleaner();
rootNode = cleaner.clean(htmlPage);
}
List getDivsByClass(String CSSClassname)
{
List divList = new ArrayList();
TagNode divElements[] = rootNode.getElementsByName("div", true);
for (int i = 0; divElements != null && i < divElements.length; i++)
{
String classType = divElements[i].getAttributeByName("class");
if (classType != null && classType.equals(CSSClassname))
{
divList.add(divElements[i]);
}
}
return divList;
}
public static void main(String[] args)
{
try
{
TestHtmlParse thp = new TestHtmlParse(new URL(url));
List divs = thp.getDivsByClass(className);
System.out.println("*** Text of DIVs with class '"+className+"' at '"+url+"' ***");
for (Iterator iterator = divs.iterator(); iterator.hasNext();)
{
TagNode divElement = (TagNode) iterator.next();
System.out.println("Text child nodes of DIV: " + divElement.getText().toString());
}
}
catch(Exception e)
{
e.printStackTrace();
}
}
}
Several years ago I used JTidy for the same purpose:
http://jtidy.sourceforge.net/
"JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.
JTidy was written by Andy Quick, who later stepped down from the maintainer position. Now JTidy is maintained by a group of volunteers.
More information on JTidy can be found on the JTidy SourceForge project page."
You might be interested by TagSoup, a Java HTML parser able to handle malformed HTML. XML parsers would work only on well formed XHTML.
The HTMLParser project (http://htmlparser.sourceforge.net/) might be a possibility. It seems to be pretty decent at handling malformed HTML. The following snippet should do what you need:
Parser parser = new Parser(htmlInput);
CssSelectorNodeFilter cssFilter =
new CssSelectorNodeFilter("DIV.targetClassName");
NodeList nodes = parser.parse(cssFilter);
Jericho: http://jericho.htmlparser.net/docs/index.html
Easy to use, supports not well formed HTML, a lot of examples.
HTMLUnit might be of help. It does a lot more stuff too.
http://htmlunit.sourceforge.net/
Let's not forget Jerry; it's jQuery in Java: a fast and concise Java library that simplifies HTML document parsing, traversing and manipulating, and includes support for CSS3 selectors.
Example:
Jerry doc = jerry(html);
doc.$("div#jodd p.neat").css("color", "red").addClass("ohmy");
Example:
doc.form("#myform", new JerryFormHandler() {
public void onForm(Jerry form, Map<String, String[]> parameters) {
// process form and parameters
}
});
Of course, these are just some quick examples to give a feel for how it all looks.
The nu.validator project is an excellent, high performance HTML parser that doesn't cut corners correctness-wise.
The Validator.nu HTML Parser is an implementation of the HTML5 parsing algorithm in Java. The parser is designed to work as a drop-in replacement for the XML parser in applications that already support XHTML 1.x content with an XML parser and use SAX, DOM or XOM to interface with the parser. Low-level functionality is provided for applications that wish to perform their own IO and support document.write() with scripting. The parser core compiles on Google Web Toolkit and can be automatically translated into C++. (The C++ translation capability is currently used for porting the parser for use in Gecko.)
You can also use XWiki HTML Cleaner:
It uses HTMLCleaner and extends it to generate valid XHTML 1.1 content.
If your HTML is well-formed, you can easily employ an XML parser to do the job for you... If you're only reading, SAX would be ideal.
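To make the SAX suggestion concrete, here is a small sketch that collects the text of every div carrying a given CSS class. It assumes the input is well-formed XHTML; real-world HTML would first need one of the cleaners mentioned above. The class name DivTextHandler and the helper divTexts() are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: gathers the text content of <div class="..."> elements.
class DivTextHandler extends DefaultHandler {
    private final String cssClass;
    private final List<String> texts = new ArrayList<>();
    private StringBuilder current; // non-null while inside a matching div
    private int depth;             // nesting level of child elements inside that div

    DivTextHandler(String cssClass) {
        this.cssClass = cssClass;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (current != null) {
            depth++; // a child element of the div being collected
            return;
        }
        String classes = atts.getValue("class");
        // The class attribute may hold several space-separated names
        if ("div".equalsIgnoreCase(qName) && classes != null
                && Arrays.asList(classes.trim().split("\\s+")).contains(cssClass)) {
            current = new StringBuilder();
            depth = 0;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) {
            return;
        }
        if (depth == 0) { // closing tag of the matching div itself
            texts.add(current.toString().trim());
            current = null;
        } else {
            depth--;
        }
    }

    // Parses well-formed XHTML text and returns the text of each matching div.
    static List<String> divTexts(String xhtml, String cssClass) {
        try {
            DivTextHandler handler = new DivTextHandler(cssClass);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.texts;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div class=\"tags\">java <b>xml</b></div></body></html>";
        divTexts(page, "tags").forEach(System.out::println);
    }
}
```

Matching divs nested inside another matching div are folded into the outer one's text in this sketch, which may or may not be what you want.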

Ignore SOAP tags in XML file

I have an XML file with some SOAP tags that I want to ignore.
I was parsing the XML file with a pull parser, but it stopped working once those SOAP tags came along.
The XML file looks something like:
<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body>
<ns1:getAllUsersListResponse xmlns:ns1="http://webservice.business.ese.wiccore.myent.com/">
<return xsi:type="xs:string" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema"><![CDATA[<User>
and inside the tag <User> come all the tags that I want to parse (and I know how with pull-parser) and then
</User>]]></return>
<return xsi:type="xs:string" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema"><![CDATA[<User>
until
</User>]]></return>
</ns1:getAllUsersListResponse>
</soap:Body>
</soap:Envelope>
The thing is, I know how to parse normal tags, but I don't want to parse these SOAP tags; I want to IGNORE them! Does anyone know how to achieve this?
Not being overly familiar with pull parsing (I'm typically a SAX guy), I'm probably not the most authoritative source on such things, but here goes...
I believe most (if not all) Java pull parsers should expose CDATA sections using a specific CDATA node (I believe in StAX, for example, the relevant event type is XMLStreamConstants.CDATA). As such, you'll want to parse your document and pull out that CDATA section (inside the SOAP <return> element) and extract its contents.
The contents of that section are the document you are interested in, so then you'd want to in turn run a new pull-parse over the contents you just extracted.
I'm sorry I can't be more help. Hopefully there will be someone else out there that can flesh the details out a bit more for you.
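The two-pass idea described above could be sketched with the JDK's StAX pull parser roughly as follows. The class and method names are invented, and the child element used in main() is a made-up stand-in for whatever sits inside <User>. Instead of looking for CDATA events directly, the sketch calls getElementText() on each <return> element, which hands back the CDATA content as plain text. Each payload is then parsed on its own, which also avoids gluing several <User> documents into one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: first pass extracts the embedded documents,
// second pass parses each one separately.
class SoapCdataExtractor {
    // First pass: returns the text of every <return> element in the SOAP envelope.
    static List<String> extractPayloads(String soapXml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(soapXml));
            List<String> payloads = new ArrayList<>();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "return".equals(reader.getLocalName())) {
                    payloads.add(reader.getElementText());
                }
            }
            reader.close();
            return payloads;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("envelope parse failed", e);
        }
    }

    // Second pass: parses one payload and lists its element names in document order.
    static List<String> elementNames(String payloadXml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(payloadXml));
            List<String> names = new ArrayList<>();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
            reader.close();
            return names;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("payload parse failed", e);
        }
    }

    public static void main(String[] args) {
        String soap = "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">"
                + "<soap:Body><return><![CDATA[<User><name>ann</name></User>]]></return>"
                + "</soap:Body></soap:Envelope>";
        for (String payload : extractPayloads(soap)) {
            System.out.println(payload + " -> " + elementNames(payload));
        }
    }
}
```

In real code the second pass would be your existing pull-parsing logic for the <User> content.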
EDIT: in response to comments, you can achieve this using SAX as follows (exception handling omitted for brevity):
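import java.io.InputStream;
import java.io.StringReader;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;
import org.xml.sax.helpers.XMLReaderFactory;
class MyParsingApp extends DefaultHandler2 // see note 1
{
    private boolean inCdata, parsingSubDocument;
    private String subDocument;
    public static void main (String args[]) throws Exception
    {
        InputStream stream = ... // see note 2
        XMLReader reader = XMLReaderFactory.createXMLReader(); // see note 3
        MyParsingApp handler = new MyParsingApp ( );
        reader.setContentHandler (handler);
        // startCDATA()/endCDATA() are LexicalHandler callbacks, so the handler
        // must also be registered through this SAX2 property to receive them
        reader.setProperty ("http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse (new InputSource(stream));
        // main is static, so the fields are reached through the handler instance
        handler.parsingSubDocument = true;
        reader.parse (new InputSource(new StringReader(handler.subDocument)));
        ...
    }
    public MyParsingApp ( )
    {
        inCdata = parsingSubDocument = false;
        subDocument = "";
    }
    @Override
    public void startCDATA() throws SAXException
    {
        inCdata = true;
    }
    @Override
    public void endCDATA() throws SAXException
    {
        inCdata = false;
    }
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException
    {
        // Only text inside a CDATA section belongs to the embedded document
        if (inCdata)
            subDocument += new String(ch, start, length); // see note 4
    }
}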
Some important notes:
1. Normally you would use a separate class as your content handler, probably one for the "main" document (including SOAP elements), and one for your "target" document (in the CDATA section). I've not done so here just to keep it as short as possible.
2. I'm not sure what format your XML is in, but I'm assuming it's in an InputStream here. The InputSource class will happily use an InputStream, a Reader or a String specifying a filename to read from. Use whatever suits you best.
3. You will need to use a SAX2 reader to be able to handle CDATA content. Your default SAX reader may or may not be SAX2 compliant. As such, you may need to (for example) manually create an instance of a particular SAX2 parser. You can find a list of some SAX2 parsers here, if that's the case.
4. There are probably more efficient ways of doing this too (StringBuffer/StringBuilder might be options). Again, I'm just doing it this way for simplicity.
5. I've not actually tested this code. Your mileage may vary.
If you've not used SAX before, it's probably also worth running through the SAX Quickstart Guide.

Best method to parse various custom XML documents in Java

What is the best method to parse multiple, discrete, custom XML documents with Java?
I would use Stax to parse XML, it's fast and easy to use. I've been using it on my last project to parse XML files up to 24MB. There's a nice introduction on java.net, which tells you everything you need to know to get started.
Basically, you have two main XML parsing methods in Java :
SAX, where you use a handler to grab only what you want from your XML and ditch the rest
DOM, which parses the whole file and allows you to grab all elements in a more tree-like fashion.
Another very useful XML parsing method, somewhat more recent than these two and included in the JRE only since Java 6, is StAX. StAX was conceived as a middle ground between the tree-based approach of DOM and the event-based approach of SAX. It is similar to SAX in that parsing very large documents is easy, but here the application "pulls" information from the parser instead of the parser "pushing" events to the application. You can find more explanation on this subject here.
So, depending on what you want to achieve, you can use one of these approaches.
Use the dom4j library
First read the document
import java.net.URL;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.io.SAXReader;
public class Foo {
public Document parse(URL url) throws DocumentException {
SAXReader reader = new SAXReader();
Document document = reader.read(url);
return document;
}
}
Then use XPATH to get to the values you need
// Requires: import org.dom4j.Node;
public String get_author(Document document) {
    Node node = document.selectSingleNode( "//AppealRequestProcessRequest/author" );
    String author = node.getText();
    return author;
}
You will want to use org.xml.sax.XMLReader (http://docs.oracle.com/javase/7/docs/api/org/xml/sax/XMLReader.html).
If you only need to parse then I would recommend using XPath library. Here is a nice reference: http://www.ibm.com/developerworks/library/x-javaxpathapi.html
But you may want to consider turning XMLs to objects and then the sky is the limit.
For that you may use XStream; it is a great library which I use a lot.
Below is code for extracting a value using vtd-xml.
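As a minimal sketch of the XPath suggestion, this uses only the JDK's javax.xml.xpath package. The class name XPathDemo, the helper evaluate() and the sample author value are invented for illustration; the element names are borrowed from the dom4j example earlier in this thread.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: evaluates an XPath expression against an XML string.
class XPathDemo {
    // Returns the string value of the expression (empty string if nothing matches).
    static String evaluate(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<AppealRequestProcessRequest><author>Jane</author>"
                + "</AppealRequestProcessRequest>";
        System.out.println(evaluate(xml, "/AppealRequestProcessRequest/author"));
    }
}
```

This builds a full DOM first, so it suits small and medium documents better than very large ones.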
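import com.ximpleware.*;
import java.io.IOException;
public class extractValue{
    public static void main(String s[]) throws VTDException, IOException{
        VTDGen vg = new VTDGen();
        // Stop if the file cannot be parsed (second argument: namespace awareness off)
        if (!vg.parseFile("input.xml", false))
            return;
        VTDNav vn = vg.getNav();
        AutoPilot ap = new AutoPilot(vn);
        // text() selects the text node, so toString(i) yields the value and not the tag name
        ap.selectXPath("/aa/bb[name='k1']/value/text()");
        int i=0;
        while ((i=ap.evalXPath())!=-1){
            System.out.println(" value ===>"+vn.toString(i));
        }
    }
}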
