Parsing XML strings in MATLAB - java

I need to parse an XML string with MATLAB (caution: without file I/O, so I don't want to write the string to a file and then read it back). I'm receiving the strings over an HTTP connection, and the parsing should be very fast. I'm mostly concerned with reading the values of certain tags in the entire string.
The net is full of death threats about parsing XML with regexp, so I didn't want to get into that just yet. I know MATLAB has seamless Java integration, but I'm not very Java savvy. Is there a quick way to get certain values out of XML very rapidly?
For example I want to get the 'volume' information from this string below and write this to a variable.
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<root>
<volume>256</volume>
<length>0</length>
<time>0</time>
<state>stop</state>
....

For what it's worth, below is MATLAB code that calls Java to perform the required task, without writing to an intermediate file:
%An XML formatted string
strXml = [...
'<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>' char(10)...
'<root>' char(10) ...
' <volume>256</volume>' char(10) ...
' <length>0</length>' char(10) ...
' <time>0</time>' char(10) ...
' <state>stop</state>' char(10) ...
'</root>' ];
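%"simple" java code to create a document from said string
%(an InputSource over a StringReader avoids the deprecated java.io.StringBufferInputStream)
xmlDocument = javax.xml.parsers.DocumentBuilderFactory.newInstance().newDocumentBuilder.parse(org.xml.sax.InputSource(java.io.StringReader(strXml)));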
%"intuitive" methods to explore the xmlDocument
nodeList = xmlDocument.getElementsByTagName('volume');
numberOfNodes = nodeList.getLength();
firstNode = nodeList.item(0);
firstNodeContent = firstNode.getTextContent;
disp(firstNodeContent); %Returns '256'
As an alternative, if your application allows it, consider passing the URL directly to the XML parser. Untested Java code is below; this approach probably also opens up the MATLAB built-in xslt function.
xmlDocument = javax.xml.parsers.DocumentBuilderFactory.newInstance().newDocumentBuilder.parse('URL_AS_A_STRING_HERE');
Documentation here. Start at the "javax.xml.parsers" package.
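If you end up calling the parser from Java itself rather than from MATLAB, the same in-memory approach might look like the sketch below. The class name XmlStringDom and the helper firstText are made up for illustration; the string is wrapped in a StringReader, so no file is involved.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XmlStringDom {
    // Returns the text of the first element named tagName, or null if there is none.
    static String firstText(String xml, String tagName) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // An InputSource over a StringReader parses the string entirely in memory.
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        NodeList nodes = doc.getElementsByTagName(tagName);
        return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : null;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\" ?>\n"
                + "<root><volume>256</volume><length>0</length>"
                + "<time>0</time><state>stop</state></root>";
        System.out.println(firstText(xml, "volume"));
    }
}
```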

MATLAB has a whole set of functions for dealing with XML, including xmlread and xmlwrite. Those should be pretty useful for your problem.

I am not familiar with MATLAB's APIs at all, but I would point out that the DOM method outlined by Pursuit will take the most time and memory if you only want specific values out of the XML stream you are getting back over the HTTP connection.
While StAX will give you the fastest parsing approach in Java, the API can be unwieldy, especially if you are not that familiar with Java. You could use SJXP, which is an extremely thin abstraction on top of StAX parsing in Java (disclaimer: I am the author). It lets you define paths to the elements you want; you then give the parser a stream (your HTTP stream in this case) and it pulls out all the values for you.
As an example, let's say you wanted the /root/state and /root/volume values out of the example XML you posted. The actual Java would look something like this:
// Create /root/state rule
IRule stateRule = new DefaultRule(Type.CHARACTER, "/root/state") {
    @Override
    public void handleParsedCharacters(XMLParser parser, String text, Object userObject) {
        System.out.println("State is: " + text);
    }
};
// Create /root/volume rule
IRule volRule = new DefaultRule(Type.CHARACTER, "/root/volume") {
    @Override
    public void handleParsedCharacters(XMLParser parser, String text, Object userObject) {
        System.out.println("Volume is: " + text);
    }
};
// Create the parser with the given rules
XMLParser parser = new XMLParser(stateRule, volRule);
You can do all of that initialization on program start. Then, at some point later when you are processing the stream from your HTTP connection, you would do something like:
parser.parse(httpConnection.getInputStream());
or the like; all of the handler code you defined in your rules will then get called as the parser runs through the stream of characters from the HTTP connection.
As I mentioned, I am not familiar with MATLAB and don't know the proper way to "MATLAB-ify" this code, but judging from the first example you can more or less use the Java APIs directly. In that case this solution will be both faster and significantly lighter on memory than the DOM approach, if that is important to you.
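For comparison, here is a minimal sketch of the same idea using only the StAX API that ships with the JDK (javax.xml.stream), without SJXP. The class name StaxVolume and the helper firstText are hypothetical; the reader stops as soon as the requested element is found, so the rest of the stream is never parsed.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class StaxVolume {
    // Returns the text of the first <tagName> element, or null if none is found.
    static String firstText(InputStream in, String tagName) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(tagName)) {
                    // getElementText reads the text up to the matching end tag.
                    return reader.getElementText();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><volume>256</volume><length>0</length>"
                + "<time>0</time><state>stop</state></root>";
        // In real use the stream would come from the HTTP connection.
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(firstText(in, "volume"));
    }
}
```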

Related

Most efficient way to read and edit an xml file

I have an xml template file, some fields are blank and need to be filled by my application.
This has to result in an xml string representation of that file which will be given to another routine.
So, let's take this simple xml as example:
<root>
<name anAttr=""></name>
<age></age>
</root>
As you can see I'd have to read the xml and, in the parsing process, add some contents to it.
I thought about using a SAX parser, and in the handler I would do something like this:
final StringBuilder finalXml = new StringBuilder();
DefaultHandler handler = new DefaultHandler() {
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) throws SAXException {
        finalXml.append("<" + qName + ">");
        if (qName.equals("name")) {
            finalXml.append("donald");
        }
    }
};
Would it be correct/efficient this way? Or is there a better way?
I've used dom4j when I have wanted to parse XML in Java, and it's quite efficient.
If you have a choice of technology, then I would suggest using JAXB.
It will unmarshal the XML into a Java object; you make the modifications on that object and then marshal the modified object into a new XML file.
It has a bit of a learning curve, but the code will be readable and maintainable.
For a basic tutorial of JAXB please refer to URL
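As a further option for a template this small, the standard DOM and Transformer APIs in the JDK can fill in the blanks and hand back a string. The sketch below is only one way to do it; the class name TemplateFiller, the method fill, and the sample values are assumptions made for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class TemplateFiller {
    // Parses the template, fills the blank fields, and serializes the result to a string.
    static String fill(String template, String name, String attr, String age) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(template)));
        Element nameEl = (Element) doc.getElementsByTagName("name").item(0);
        nameEl.setTextContent(name);
        nameEl.setAttribute("anAttr", attr);
        doc.getElementsByTagName("age").item(0).setTextContent(age);

        // A Transformer created without a stylesheet performs the identity transform.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String template = "<root><name anAttr=\"\"></name><age></age></root>";
        System.out.println(fill(template, "donald", "x", "42"));
    }
}
```

Unlike the hand-built StringBuilder approach, the serializer takes care of escaping, attributes, and closing tags.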

How to match JAXB elements in CIM/RDF?

While trying to load a model from a CIM/XML file according to IEC 61970 (Common Information Model, for power system models), I ran into a problem.
In JAXB, references between elements in an object graph are expressed with @XmlIDREF and @XmlID, and the two values must be equal to match. But in CIM/RDF a reference to a resource through an ID, e.g. rdf:resource="#_37C0E103000D40CD812C47572C31C0AD", contains the "#" character. Consequently JAXB is unable to match "GeographicalRegion" against "SubGeographicalRegion.Region" when the "#" character is present in the rdf:resource attribute.
Here is an example:
<cim:GeographicalRegion rdf:ID="_37C0E103000D40CD812C47572C31C0AD">
<cim:IdentifiedObject.name>GeoRegion</cim:IdentifiedObject.name>
<cim:IdentifiedObject.localName>OpenCIM3bus</cim:IdentifiedObject.localName>
</cim:GeographicalRegion>
<cim:SubGeographicalRegion rdf:ID="_ID_SubGeographicalRegion">
<cim:IdentifiedObject.name>SubRegion</cim:IdentifiedObject.name>
<cim:IdentifiedObject.localName>SubRegion</cim:IdentifiedObject.localName>
<cim:SubGeographicalRegion.Region rdf:resource="#_37C0E103000D40CD812C47572C31C0AD"/>
</cim:SubGeographicalRegion>
I realize you're asking for a solution using JAXB, but I would urge you to consider an RDF-based solution as it is more flexible and robust. You're basically trying to reinvent what RDF parsers already have built in. RDF/XML is a difficult format to parse, it doesn't make much sense to try and hack your own parsing together - especially since files that have very different XML structures can express exactly the same information: this only becomes apparent when looking at the level of the RDF. You may find that your JAXB parser workaround works on one CIM/RDF file but completely fails on another.
So, here's an example of how to process your file using the Sesame RDF API. No inferencing is involved, this just parses the file and puts it in an in-memory RDF model, which you can then manipulate and query from any angle.
Assuming the root element of your CIM file looks something like this:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:cim="http://example.org/cim/">
(only a guess of course, but I need prefixes for a proper example)
Then you can do the following, using Sesame's Rio RDF/XML parser:
String baseURI = "http://example.org/my/file";
FileInputStream in = new FileInputStream("/path/to/my/cim.rdf");
Model model = Rio.parse(in, baseURI, RDFFormat.RDFXML);
This creates an in-memory RDF model of your document. You can then simply filter-query over that. For example, to print out the properties of all resources that have _37C0E103000D40CD812C47572C31C0AD as their SubGeographicalRegion.Region:
String CIM_NS = "http://example.org/cim/";
ValueFactory vf = ValueFactoryImpl.getInstance();
URI subRegion = vf.createURI(CIM_NS, "SubGeographicalRegion.Region");
URI res = vf.createURI("http://example.org/my/file#_37C0E103000D40CD812C47572C31C0AD");
Set<Resource> subs = model.filter(null, subRegion, res).subjects();
for (Resource sub: subs) {
System.out.println("resource: " + sub + " has the following properties: ");
for (URI prop: model.filter(sub, null, null).predicates()) {
System.out.println(prop + ": " + model.filter(sub, prop, null).objectValue());
}
}
Of course at this point you can also choose to convert the model to some other syntax format for further handling by your application - as you see fit. The point is that the difference between the identifiers with the leading # and without has been resolved for you by the RDF/XML parser.
This is of course personal opinion only, since I don't know the details of your use case, but I think you'll find that this is quite quick and flexible. I should also point out that although the above solution keeps the entire model in memory, you can easily adapt this to a more streaming (and therefore less memory-intensive) approach if you find your files are too big.
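To make the point about the leading # concrete: in RDF/XML, rdf:ID="x" names the resource <base#x>, while rdf:resource="#x" is a relative reference resolved against the same base URI, so both end up as the same absolute URI. The sketch below imitates that resolution with java.net.URI only. It is not how Sesame implements it, and the class and method names are hypothetical.

```java
import java.net.URI;

class FragmentResolution {
    // rdf:ID="x" identifies the resource <base#x>.
    static URI fromRdfId(URI base, String id) {
        return base.resolve("#" + id);
    }

    // rdf:resource="#x" is a relative URI reference resolved against the base.
    static URI fromRdfResource(URI base, String reference) {
        return base.resolve(reference);
    }

    public static void main(String[] args) {
        URI base = URI.create("http://example.org/my/file");
        URI declared = fromRdfId(base, "_37C0E103000D40CD812C47572C31C0AD");
        URI referenced = fromRdfResource(base, "#_37C0E103000D40CD812C47572C31C0AD");
        System.out.println(declared);
        System.out.println(declared.equals(referenced));
    }
}
```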

How to extract data from a lot of URLs?

I have about 3200 URLs pointing to small XML files that contain some data as strings. The XML files are displayed (not downloaded) when I go to the URLs. I need to extract some data from all those XML files and save it in a single .txt file, XML file, or whatever. How can I automate this process?
*Note: This is what the files look like. I need to copy the 'location' and 'title' from all of them and put them in one single file. What approach can achieve this?
<?xml version="1.0"?>
<playlist xmlns="http://xspf.org/ns/0/" version="1">
<tracklist>
<location>http://radiotool.com/fransn.mp3</location>
<title>France, Paris radio 104.5</title>
</tracklist>
</playlist>
*edit: Fixed XML.
It's easy enough with XQuery or XSLT, though the details will depend on how the URLs are held. If they're in a Java List, then (with Saxon at least) you can supply this list as a parameter to the following query:
declare variable $urls as xs:string* external;
<data>{
for $u in $urls return doc($u)//*:tracklist
}</data>
The Java code would be something like:
// query holds the XQuery text shown above; inputUrls is the list of URL strings
Processor proc = new Processor(false);
XQueryCompiler c = proc.newXQueryCompiler();
XQueryEvaluator q = c.compile(query).load();
List<XdmItem> urls = new ArrayList<>();
for (String url : inputUrls) {
    urls.add(new XdmAtomicValue(url));
}
q.setExternalVariable(new QName("urls"), new XdmValue(urls));
q.setDestination(...)
q.run();
Have a look at the JSoup library here: http://jsoup.org/
It has facilities for fetching and fixing up the contents of a URL. It is intended for HTML, though, so I'm not sure how well it will work for XML, but it is worth a look.
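If you would rather stay with the JDK alone, a plain DOM parse plus XPath is enough for files this small. The sketch below is one possible shape, with a hypothetical class PlaylistExtractor; it reads the sample document from memory, and in real use you would pass the stream from new URL(u).openStream() for each of your URLs and write the collected lines to a file.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

class PlaylistExtractor {
    static final String SAMPLE = "<?xml version=\"1.0\"?>"
            + "<playlist xmlns=\"http://xspf.org/ns/0/\" version=\"1\"><tracklist>"
            + "<location>http://radiotool.com/fransn.mp3</location>"
            + "<title>France, Paris radio 104.5</title>"
            + "</tracklist></playlist>";

    // Returns "title<TAB>location" for one playlist document.
    static String extract(InputStream in) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // the playlist uses a default namespace
        Document doc = factory.newDocumentBuilder().parse(in);
        XPath xpath = XPathFactory.newInstance().newXPath();
        // local-name() avoids having to register the playlist namespace.
        String title = xpath.evaluate("//*[local-name()='title']", doc);
        String location = xpath.evaluate("//*[local-name()='location']", doc);
        return title + "\t" + location;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the real inputs; each entry would normally be fetched from a URL.
        List<String> documents = Arrays.asList(SAMPLE);
        StringBuilder out = new StringBuilder();
        for (String xml : documents) {
            try (InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))) {
                out.append(extract(in)).append(System.lineSeparator());
            }
        }
        System.out.print(out);
    }
}
```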

Storing html values in xml

I'm trying to figure out a way to pull specific information (name, description, id, etc.) out of an HTML file, leaving the unwanted information behind, and store it in an XML file.
I thought of trying XSLT since it can do XML to HTML, but it doesn't seem to work the other way around.
I honestly don't know what other language I should try to accomplish this. I know basic Java and JavaScript but am not too sure whether they can do it. I'm kind of lost on getting this started.
I'm open to any advice/help, and willing to learn a new language too, as I'm just doing this for fun.
There are a number of Java libraries for handling HTML input that isn't well-formed (according to XML). These libraries also have built-in methods for querying or manipulating the document, but it's important to realize that once you've parsed the document it's usually pretty easy to treat it as though it were XML in the first place (using the standard Java XML interfaces). In other words, you only need these libraries to parse the malformed input; the other utilities they provide are mostly superfluous.
Here's an example that shows parsing HTML using HtmlCleaner and then converting that object into a standard org.w3c.dom.Document:
TagNode tagNode = new HtmlCleaner().clean("<html><div><p>test");
DomSerializer ser = new DomSerializer(new CleanerProperties());
org.w3c.dom.Document doc = ser.createDOM(tagNode);
In Jsoup, simply parse the input and serialize it into a string:
String text = Jsoup.parse("<html><div><p>test").outerHtml();
And convert that string into a W3C Document using one of the methods described here:
How to parse a String containing XML in Java and retrieve the value of the root node?
You can now use the standard JAXP interfaces to transform this document:
TransformerFactory tFact = TransformerFactory.newInstance();
Transformer transformer = tFact.newTransformer();
Source source = new DOMSource(doc);
Result result = new StreamResult(System.out);
transformer.transform(source, result);
Note: Provide some XSLT source to tFact.newTransformer() to do something more useful than the identity transform.
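As an illustration of that note, the sketch below passes a small stylesheet to newTransformer() and uses it to pull names and ids out of an already cleaned-up document. The markup (div elements with class "item", span elements with class "name"), the stylesheet, and the class name XsltExtract are all invented for the example. It reads the input through a StreamSource to stay self-contained, where the code above would pass its DOMSource instead.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltExtract {
    // Hypothetical stylesheet: emits one <item> per div with class "item".
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<items><xsl:for-each select='//div[@class=\"item\"]'>"
      + "<item id='{@id}'><xsl:value-of select='normalize-space(span[@class=\"name\"])'/></item>"
      + "</xsl:for-each></items>"
      + "</xsl:template></xsl:stylesheet>";

    // Applies the stylesheet to well-formed (already cleaned) markup.
    static String extract(String wellFormedHtml) throws Exception {
        TransformerFactory tFact = TransformerFactory.newInstance();
        Transformer transformer = tFact.newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(wellFormedHtml)),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><div class=\"item\" id=\"7\">"
                + "<span class=\"name\"> Widget </span><p>unwanted text</p>"
                + "</div></body></html>";
        System.out.println(extract(html));
    }
}
```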
I would use HTMLAgilityPack or Chris Lovett's SGMLReader.
Or, simply HTML Tidy.
Ideally, you can treat your HTML as XML. If you're lucky, it will already be XHTML, and you can process it as XML directly. If not, use something like http://nekohtml.sourceforge.net/ (an HTML tag balancer, etc.) to turn the HTML into something XML-compliant so that you can use XSLT.
I have a specific example and some notes around doing this on my personal blog at http://blogger.ziesemer.com/2008/03/scraping-suns-bug-database.html.
TagSoup
JSoup
Beautiful Soup

Best approach to serialize XML to stream with Java?

We serialize/deserialize XML using XStream... and just got an OutOfMemory exception.
Firstly I don't understand why we're getting the error as we have 500MB allocated to the server.
Question is - what changes should we make to stay out of trouble? We want to ensure this implementation scales.
Currently we have ~60K objects, each ~50 bytes. We load the 60K POJO's in memory, and serialize them to a String which we send to a web service using HttpClient. When receiving, we get the entire String, then convert to POJO's. The XML/object hierarchy is like:
<root>
<meta>
<date>10/10/2009</date>
<type>abc</type>
</meta>
<data>
<field>x</field>
</data>
[thousands of <data>]
</root>
I gather the best approach is to not hold all the POJOs in memory and not write the contents to a single String. Instead we should write the individual <data> POJOs to a stream. XStream supports this, but it seems the <meta> element wouldn't be supported. The data would need to be in the form:
<root>
<data>
<field>x</field>
</data>
[thousands of <data>]
</root>
So what approach is easiest to stream the entire tree?
You definitely want to avoid serializing your POJOs into a humongous String and then writing that String out. Use the XStream APIs to serialize the POJOs directly to your OutputStream. I ran into the same situation earlier this year when I found that I was generating 200-300 MB XML documents and getting OutOfMemoryErrors. It was very easy to make the switch.
And ditto of course for the reading side. Don't read the XML into a String and ask XStream to deserialize from that String: deserialize directly from the InputStream.
You mention a second issue regarding not being able to serialize the <meta> element and the <data> elements. I don't think this is an XStream problem or limitation as I routinely serialize much more complex structures on the order of:
<myobject>
<item>foo</item>
<anotheritem>foo</anotheritem>
<alist>
<alistitem>
<value1>v1</value1>
<value2>v2</value2>
<value3>v3</value3>
...
</alistitem>
...
<alistitem>
<value1>v1</value1>
<value2>v2</value2>
<value3>v3</value3>
...
</alistitem>
</alist>
<anotherlist>
<anotherlistitem>
<valA>A</valA>
<valB>B</valB>
<valC>C</valC>
...
</anotherlistitem>
...
</anotherlist>
</myobject>
I've successfully serialized and deserialized nested lists too.
Not sure what the problem is here...you've found your answer on that webpage.
The example code on the link you provided suggests:
Writer someWriter = new FileWriter("filename.xml");
ObjectOutputStream out = xstream.createObjectOutputStream(someWriter, "root");
out.writeObject(dataObject);
// iterate over your objects...
out.close();
and for reading nearly identical but with Reader for Writer and Input for Output:
Reader someReader = new FileReader("filename.xml");
ObjectInputStream in = xstream.createObjectInputStream(someReader);
DataObject foo = (DataObject)in.readObject();
// do some stuff here while there's more objects...
in.close();
I'd suggest using tools like Visual VM or Eclipse Memory Analyzer to make sure you don't have a memory leak/problem.
Also, how do you know each object is 50 bytes? That doesn't sound likely.
Use XMLStreamWriter (or XStream) to serialize it; you can write whatever you want to it. If you have the option of getting the input stream instead of the entire string, use a SAXParser. It is event based and, although the implementation may be a little clumsy, you will be able to read any XML that is thrown at you, even if the XML is huge (I have parsed XML files of 2 GB and more with SAXParser).
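A rough sketch of the XMLStreamWriter route, writing the <meta> block once and then one <data> element per item straight to the OutputStream, so the whole document never exists as a single String. The class name StreamingExport and its helpers are hypothetical, and the field values stand in for whatever the real POJOs contain.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.util.Arrays;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class StreamingExport {
    // Writes <root><meta>...</meta><data>...</data>...</root> element by element.
    static void write(OutputStream out, String date, String type, Iterable<String> fields)
            throws XMLStreamException {
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
        w.writeStartDocument("UTF-8", "1.0");
        w.writeStartElement("root");
        w.writeStartElement("meta");
        leaf(w, "date", date);
        leaf(w, "type", type);
        w.writeEndElement(); // </meta>
        for (String field : fields) { // could equally be an iterator over a database cursor
            w.writeStartElement("data");
            leaf(w, "field", field);
            w.writeEndElement(); // </data>
        }
        w.writeEndElement(); // </root>
        w.writeEndDocument();
        w.close(); // flushes the writer; the underlying stream stays open
    }

    private static void leaf(XMLStreamWriter w, String name, String text)
            throws XMLStreamException {
        w.writeStartElement(name);
        w.writeCharacters(text);
        w.writeEndElement();
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream(); // stand-in for the HTTP body
        write(out, "10/10/2009", "abc", Arrays.asList("x", "y"));
        System.out.println(out.toString("UTF-8"));
    }
}
```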
Just as a side note, you should pass the binary data, not a String, to an XML parser. XML parsers determine the encoding of the bytes that follow from the XML declaration at the beginning of the document:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
A String has already been decoded with some encoding. It's better practice to let the XML parser read the original stream than to create a String from it first.
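A small sketch of why this matters, with a made-up document declared as ISO-8859-1 (the class name EncodingDemo is hypothetical). Handing the raw bytes to the parser lets it honor the declaration, whereas decoding them into a String with the wrong charset first damages the text before the parser ever sees it.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

class EncodingDemo {
    // Parses raw bytes; the parser picks the charset from the XML declaration.
    static String rootText(InputStream in) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(in).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>Caf\u00e9</name>"
                .getBytes(StandardCharsets.ISO_8859_1);

        // Correct: the bytes go to the parser untouched.
        String parsed = rootText(new ByteArrayInputStream(bytes));
        System.out.println(parsed.equals("Caf\u00e9"));

        // Risky: decoding with a guessed charset replaces the accented character.
        String guessed = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(guessed.contains("\uFFFD"));
    }
}
```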
