Can I use ANTLR for both two-way parsing/generating? - java

I need to both parse incoming messages and generate outgoing messages in EDIFACT format (basically a structured delimited format).
I would like to have a Java model that will be generated by parsing a message. Then I would like to use the same model to create an instance and generate a message.
The first half is fine, I've used ANTLR before to go from raw -> Java objects. But I've never done the reverse, or if I have it's been custom.
Does ANTLR support generating using a grammar or is it really just a parse-only tool?
EDIT:
Expansion - I want to define two things ideally. A grammar that describes the raw message (EDIFACT in this case but pretend it's CSV if you like). And a Java object model.
I know I can write an ANTLR grammar to get from the raw -> Java model. e.g. Parsing a SQL string -> Java model which I've done before. But I need to go the other way as well ideally without changing the grammar.
If you liken it to JAXB (XML world), I really want JAXB for EDIFACT (rather than XML).

Can ANTLR do what you are asking? Yes, although it might require multiple grammars.
To me, this sounds like you want your parser to create an AST. Have one tree walker do all the required Java object creation (possibly a second grammar), and then a second tree walker create the output messages (a third grammar); you can even use StringTemplate for that if you want. Maybe you can get away with two grammars.
But at this point actual details would have to be given for any more help, what the AST will look like for a specific input and what the output message should be.

I have never done it myself (I have also used ANTLR for parsing only), but I know for sure that ANTLR can be used as a generator as well.
In fact, it uses a library called StringTemplate (by the same author) for its own code generation.
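As a rough illustration of the round trip being asked about (raw message to Java model and back), here is a minimal hand-written sketch that uses neither ANTLR nor StringTemplate. The names EdifactRoundTrip, Segment, parse and generate are made up for this example. It assumes the default EDIFACT delimiters (' as segment terminator, + as element separator) and ignores the release character and component splitting, so it is not a complete EDIFACT implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EdifactRoundTrip {

    /** Hypothetical model class: a tag plus its data elements, e.g. BGM+220+PO123+9. */
    static final class Segment {
        final String tag;
        final List<String> elements;

        Segment(String tag, List<String> elements) {
            this.tag = tag;
            this.elements = elements;
        }
    }

    /** Raw text -> model. Assumes ' ends a segment and + separates elements (no escapes). */
    static List<Segment> parse(String raw) {
        List<Segment> segments = new ArrayList<>();
        for (String line : raw.split("'")) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] parts = line.trim().split("\\+", -1);
            List<String> elements =
                    new ArrayList<>(Arrays.asList(parts).subList(1, parts.length));
            segments.add(new Segment(parts[0], elements));
        }
        return segments;
    }

    /** Model -> raw text; the inverse of parse for input without escape characters. */
    static String generate(List<Segment> segments) {
        StringBuilder out = new StringBuilder();
        for (Segment segment : segments) {
            out.append(segment.tag);
            for (String element : segment.elements) {
                out.append('+').append(element);
            }
            out.append('\'');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "UNH+1+ORDERS:D:96A:UN'BGM+220+PO123+9'";
        List<Segment> model = parse(raw);
        model.get(1).elements.set(1, "PO999"); // change a value in the model
        System.out.println(generate(model));   // write the modified message back out
    }
}
```

With ANTLR, parse would be replaced by the generated parser plus a tree walker, and generate by a second walker or a StringTemplate template, as the answers above suggest.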

Related

Xtext cardinality meta model

I am currently working on a project where I am creating a feature model out of an Xtext grammar. My task is to transform the grammar syntax into a CSV file that can be imported into the Eclipse plug-in pure::variants.
A feature model is basically a tree of features. These features are of different types (mandatory, optional, alternative, etc.).
For constructing the tree, I am using the generated Ecore metamodel of my Xtext grammar. This file (.ecore) is basically an XML file with the objects of the grammar. It is consistent, simple, and easy to create a tree out of.
My problem is that I need to assign types (mandatory, alternative, etc.) to the nodes of my created tree. These feature types correspond to cardinality operators. These operators are written in an Xtext grammar like this: (no operator), '?', '*' and '+' (this can be seen in the Xtext user manual, section 2.1.3: https://www.eclipse.org/Xtext/documentation/1_0_1/xtext.pdf). The problem is that these cardinalities of the Xtext grammar don't seem to be anywhere to be found. I thought that they would appear in the .ecore or .genmodel files, but there are no cardinalities there at all.
I imagine that if Xtext is able to check and enforce these cardinalities, it has to have some kind of metamodel where these cardinalities can be seen and are easily accessible (something like an .xml file similar to the .ecore or .genmodel file).
So my question is: is there some kind of Xtext-generated file that contains these cardinalities? If there is not, I would have to somehow get these cardinalities out of the grammar itself, but that would be unnecessarily time-consuming and complicated, maybe even impossible, because the written grammar is really complex and doesn't fully correspond to the Ecore metamodel I am building my feature tree from.
The only generated file I was able to find that contains something "maybe useful" is XXXXGrammarAccess.java (XXXX stands for the name of the grammar), which is a complex generated file with a lot of library dependencies, and I have no idea how to get these cardinalities out of it or whether that is even possible. I imagine that there is a possibility, because this file uses a lot of IGrammarAccess methods, such as getRule(), getKeyword() and more, but I am not able to use this file or print anything out of it, because it is a generated file and I am not able to run it on its own.
If the kind of metamodel I am looking for does not exist, is there any other way to get these cardinalities during generation?
Thank you very much for your answers.
First of all, the cardinalities in the metamodel and the grammar do not have to match 100%. The cardinality validation in the parser is different from the one in Ecore.
The lower bound of 1 (for required) is deliberately not reflected in the Ecore model, in order to prevent really ugly error messages. The upper bound of 1 or -1 (= *) is reflected in the Ecore model, though.
This was a deliberate decision when Xtext was created 10 years ago.
The grammar access just gives you access to the grammar at runtime.
Can you elaborate on why you actually care?
The Xtext grammar is itself a model, an instance of http://www.eclipse.org/2008/Xtext. (It used to be possible to demonstrate this by opening a *.xtext file with the Sample Reflective Ecore Editor, but unfortunately the use of classpath: URIs has broken it again.) Nonetheless you can open a *.xtext file programmatically as an EMF Resource and see everything that is in the grammar. See https://git.eclipse.org/c/ocl/org.eclipse.ocl.git/tree/examples/org.eclipse.ocl.examples.xtext2lpg/src/org/eclipse/ocl/examples/xtext2lpg/xtext2xbnf.qvto for the first stage of a transformation chain that starts by reading an Xtext grammar and ends up with an LPG grammar.

Parsing very large XML files and marshalling to Java Objects

I have the following issue: I have very large XML files (300+ MB), and I need to parse them in order to add some of their values to the DB. The structure of these files is also very complex. I want to use a StAX parser, as it offers the nice possibility of pull-parsing (and thus processing) only parts of the XML file at a time, without loading the whole thing into memory. On the other hand, getting the values with StAX (at least from these XML files) is cumbersome; I need to write a ton of code. From this latter point of view, it would help me immensely if I could unmarshal the XML file to Java objects (like JAXB does); however, this would load the whole file plus a ton of object instances into memory all at once.
My question is: is there some way to pull-parse (or just partially parse) the file sequentially, and then unmarshal only those parts to Java objects, so I can deal with them easily without bogging down on memory?
I would recommend Eclipse EMF. But it has the same problem: if you give it the file name, it will parse the whole thing. There are some options to reduce how much is loaded, but I didn't bother much with them as we run on machines with 96 GB RAM. :)
Anyway, if your XML format is well defined, then one workaround is to fool EMF by breaking the whole file down into several smaller (but still well-defined) XML snippets, then feeding it each snippet one after the other. I don't know JAXB, but perhaps the same workaround can be applied there as well, which is what I would recommend, because EMF is too big a hammer for such a small issue.
Just to elaborate a bit if your XML looks like this:
<tag1>
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
............
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
</tag1>
Then it can be broken down into separate XML documents, each starting with <tag2> and ending with </tag2>. In Java most parsers accept a stream, so parse with whatever you want, create a StringReader or similar for each <tag2> in a loop, and pass it to JAXB or EMF.
HTH
Well, first off I want to thank the two people who answered my question. I ended up not using those suggestions, partly because the proposed technologies are a bit far from, let's say, "standard" Java XML parsing, and it feels weird going that far when a similar tool is already present in Java, and partly because I did in fact find a solution that uses only Java APIs to accomplish this.
I will not detail too much the solution I found, because I've already finished the implementation, and it's quite a big chunk of code to place here (I use Spring Batch on top of it all, with a ton of configuration and stuff).
I will however make a small comment on what I finally ended up doing:
The big idea here is that if you have an XML document AND its corresponding XSD schema, you can parse and unmarshal it with JAXB, and you can do it in chunks; those chunks can be read with an event parser such as StAX and then passed to the JAXB Unmarshaller.
In practice this means that you must first decide on a good place in your XML file where you can say "this part here has A LOT of repetitive structure, I will treat those repetitions one at a time". Those repetitive parts are usually the same (child) tag repeated many times inside a parent tag. So all you have to do is make an event listener in your StAX parser that is triggered at the start of each of those child tags, then stream the content of that child tag over to JAXB, unmarshal it with JAXB, and process it.
Really the idea is excellently described in this article, which I followed (true, it's from 2006, but it deals with JDK 1.6 which at that time was pretty new, so version-wise it's not that old at all):
http://www.javarants.com/2006/04/30/simple-and-efficient-xml-parsing-using-jaxb-2-0/
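To make the chunked approach described above a bit more concrete, here is a minimal sketch using only the StAX API that ships with the JDK. The element names (items, item, id, name) and the Item class are invented for the example, and readItem is a hand-written stand-in for the JAXB step; with JAXB on the classpath, that method could instead hand the positioned reader to Unmarshaller.unmarshal(XMLStreamReader, Class). Each record is passed to a callback as soon as it is read, so only one record is held in memory at a time.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class ChunkedXmlReader {

    /** Hypothetical record type standing in for a JAXB-generated class. */
    static final class Item {
        String id;
        String name;
    }

    /** Streams through the document and hands each <item> to the callback. */
    static void readItems(Reader source, Consumer<Item> handler) throws XMLStreamException {
        XMLStreamReader xsr = XMLInputFactory.newInstance().createXMLStreamReader(source);
        try {
            while (xsr.hasNext()) {
                if (xsr.next() == XMLStreamConstants.START_ELEMENT
                        && "item".equals(xsr.getLocalName())) {
                    handler.accept(readItem(xsr));
                }
            }
        } finally {
            xsr.close();
        }
    }

    /** Reads one <item> subtree of text-only children; stops on its closing tag. */
    static Item readItem(XMLStreamReader xsr) throws XMLStreamException {
        Item item = new Item();
        while (xsr.hasNext()) {
            int event = xsr.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String field = xsr.getLocalName();
                String text = xsr.getElementText(); // reads up to the field's end tag
                if ("id".equals(field)) {
                    item.id = text;
                } else if ("name".equals(field)) {
                    item.name = text;
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "item".equals(xsr.getLocalName())) {
                break;
            }
        }
        return item;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<items><item><id>1</id><name>a</name></item>"
                + "<item><id>2</id><name>b</name></item></items>";
        readItems(new StringReader(xml),
                item -> System.out.println(item.id + " " + item.name));
    }
}
```

For a real 300+ MB file the StringReader would be replaced by a reader or stream over the file, and the callback would write to the database.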
Document projection might be the answer here. Saxon and a number of other XQuery processors offer this as an option. If you have a reasonably simple query that selects a small amount of data from a large document, the query processor analyses the query to work out which parts of the tree need to be available for the query, and which can be discarded during processing. The resulting tree can often be only 1% of the size of the full document. Details for Saxon here:
http://saxonica.com/documentation/sourcedocs/projection.xml

parsing/scanning/tokenizing "raw XML"

I have an application where I need to parse or tokenize XML and preserve the raw text (e.g. don't parse entities, don't convert whitespace in attributes, keep attribute order, etc.) in a Java program.
I've spent several hours today trying to use StAX, SAX, XSLT, TagSoup, etc. before realizing that none of them do this. I can't afford to spend much more time attacking this problem, and parsing the text manually seems highly nontrivial. Is there any Java library that can help me tokenize the XML?
edit: why am I doing this? -- I have a large XML file that I want to make a small number of localized changes programmatically, that need to be reviewed. It is highly valuable to be able to use a diff tool. If the parser/filter normalizes the XML, then all I see is "red ink" in the diff tool. The application that produces the XML in the first place isn't something that I can easily have changed to produce "canonical XML", if there is such a thing.
I think you might have to generate your own grammar.
Some links:
Parsing XML with ANTLR Tutorial
ANTXR
XPA
http://www.google.com/search?q=antlr+xml
I don't think any XML parser will do what you want. Why? For instance, the XML spec doesn't enforce attribute ordering. I think you're going to have to parse it yourself, and that is non-trivial.
Why do you have to do this? I'm guessing you have some client 'XML' that enforces or relies on non-standard construction. In that case I'd push back and get that fixed, rather than jump through numerous hoops to try to accommodate this.
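If you do end up parsing it yourself, a rough starting point is a regex-based tokenizer that splits the input into markup and text tokens without interpreting anything, so that concatenating the tokens reproduces the original text exactly. The sketch below is only that: the class name RawXmlTokenizer is made up, and it does not handle every XML construct (for example, a DOCTYPE with an internal subset would not tokenize correctly).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RawXmlTokenizer {

    // Alternatives are tried in order: comment, CDATA section, processing
    // instruction, tag (quoted attribute values may contain '>'), text run.
    private static final Pattern TOKEN = Pattern.compile(
            "<!--.*?-->"
            + "|<!\\[CDATA\\[.*?\\]\\]>"
            + "|<\\?.*?\\?>"
            + "|<(?:[^>\"']|\"[^\"]*\"|'[^']*')*>"
            + "|[^<]+",
            Pattern.DOTALL);

    /** Splits the input into raw tokens; entities, whitespace and order are untouched. */
    static List<String> tokenize(String xml) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(xml);
        int pos = 0;
        while (pos < xml.length()) {
            // Every token must start exactly where the previous one ended.
            if (!matcher.find(pos) || matcher.start() != pos) {
                throw new IllegalArgumentException("Cannot tokenize at offset " + pos);
            }
            tokens.add(matcher.group());
            pos = matcher.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        String xml = "<a  b='1'   a=\"x &amp; &gt; y\"><!-- c > d -->t&#65;xt</a>";
        for (String token : tokenize(xml)) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Because the tokens are kept verbatim, a localized change can be made to one token and the rest written back unchanged, which keeps the diff small.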
I'm not entirely sure that I understand what it is you are trying to do. Have you tried using CDATA regions for the parts of the document you don't want the parser to touch?
Also relying on attribute order is not a good idea - if I remember the XML standard correctly then order is never to be expected.
It sounds like you are dealing with some malformed XML and that it would be easier to first turn it into proper XML.

XML template in Java

I need to generate XML documents that differ only in the values that the tags contain.
Is it possible to create a template XML and then write only the values each time? (I do not want to go the JAXB way as these are small XMLs and are not worth creating objects for them).
Is this a good approach?
Any thoughts?
You can use freemarker or velocity for templating in java -- or even just add PHP tags to a sample XML to generate from a template.
I think as a general rule, though, once you start conditionally adding elements or attributes, or looping to generate multiples, you're better off generating your XML, though I agree that getting it into the format you want (not what the generator wants) is sometimes a pain.
As long as the XML file to be produced is small, simple and mostly consistent in format, I tend to buck the trend: I simply create and write a text string.
writer.format("<?xml version='1.0'?><root><tag1>%s</tag1></root>", value1);
kinda thing.
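A slightly fuller sketch of that string-template idea, for what it's worth: the template is a plain format string and each value is escaped before it is substituted, so characters like & and < in the data don't break the output. The class name, the second tag and the escape helper are illustrative assumptions, not part of any library.

```java
public class XmlTemplate {

    // The fixed part of the document; only the %s slots change between outputs.
    private static final String TEMPLATE =
            "<?xml version='1.0'?><root><tag1>%s</tag1><tag2>%s</tag2></root>";

    /** Escapes the characters that are unsafe inside XML element content. */
    static String escape(String value) {
        return value.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    /** Fills the template with the two escaped values. */
    static String render(String value1, String value2) {
        return String.format(TEMPLATE, escape(value1), escape(value2));
    }

    public static void main(String[] args) {
        System.out.println(render("Fish & Chips", "a < b"));
    }
}
```

Values destined for attributes would also need their quote characters escaped, which this sketch does not cover.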
Although you are against JAXB (which I have yet to use), I would recommend a comparable way to do this with Apache XMLBeans.
This requires you to use an XML schema, but in my experience it is worth it...

Castor and sockets

I'm new to Castor and data binding in general. I'm working on an application that, in part, needs to take data off of a socket and unmarshall the data to make POJOs. Now, I've got the socket stuff down, and I've even generated and compiled java files thanks to Ant and Castor.
Here's the problem: the data stream that I'll receive could be one of about 9 different objects. That is, I receive a stream of text (XML) that represents an object with stuff that I'll operate on, again depending on the object type. If it were just one object, it'd be easy: call the unmarshall commands on it and go on my merry way. But since it could be one of many kinds of objects, how do I know what to unmarshall? I read up on mapping, but either I didn't get it, or it seems like a static mapping, not a dynamic mapping.
Any help out there?
You are right, Castor expects a static mapping. But you can work with that. You can write some code that will modify the incoming xml so that, on your side, Castor can use one schema, and on your clients' side they don't have to change their schemas.
Change the schema that Castor expects to get to something with a common root-element, with under that your nine different alternatives for your different objects (I think you can restrict it so the schema will allow only one of the nine, if that doesn't work out you could just make all the sub-elements optional).
Then you can write code that modifies the incoming xml to wrap your incoming xml with that common root-element, then feeds the wrapped xml into a stream that gets read by the Castor unmarshaller.
There are at least 3 different ways to implement the xml-wrapping part: SAX, XSLT, and XML libraries (like JDOM, DOM4J, and XOM--I prefer XOM but any of them will work).
The SAX way is probably best if you're already familiar with SAX or if one of the other ways has worked but come up short on performance. If I had to implement that then I would create an XMLFilter that takes in xml and writes xml out, stacking that on top of another piece that writes xml to an OutputStream, and writing a wrapper method around the unmarshalling stuff to feed the incoming stream to the xmlreader, copy the OutputStream to another InputStream (an easy way is to use commons-io), and feed the new InputStream to the Castor unmarshaller.
With XSLT there is no fooling with SAX. Although XSLT has a reputation for being painful at times, this seems to me like it might be a relatively straightforward transformation, though I haven't taken a stab at it. It has been a long time since I used XSLT for anything. I am not sure about performance either, though I wouldn't write it off out of hand.
Using XOM or JDOM or DOM4J to wrap the XML is also possible, and the learning curve is a lot lower than for SAX or XSLT. The downside is the whole XML document tends to get sucked into memory at once so if you deal with big enough documents you could run out of memory.
I have a similar thing in Jibx where all of the incoming message objects implement a base interface which has a field denoting the message type.
The text/xml is serialized into the base interface and I then used the command pattern to call the respective business logic depending upon the message type which is defined in the base interface.
Not sure if this is possible using castor but take a look at Jibx as the performance is fantastic.
http://jibx.sourceforge.net/
I appreciate your insights. You both have given me some good information to go on and new knowledge that I didn't have. In the end, I got the process to work via a hack. I grab the text stream, parse out the root tag of the message, and then switch on it to determine the right object to create. I'm unmarshalling all of my objects independently and everyone is happy on our end.
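For anyone landing here later, the "peek at the root tag, then switch on it" approach can be sketched roughly as follows, using the JDK's StAX reader to find the root element name instead of string parsing. The class name, the message types (order, invoice) and the handler map are hypothetical; in a real application each handler would call the Castor unmarshaller for the matching class rather than return a string.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class RootTagDispatcher {

    /** Returns the local name of the document's root element. */
    static String rootTag(String xml) throws XMLStreamException {
        XMLStreamReader xsr = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (xsr.hasNext()) {
                if (xsr.next() == XMLStreamConstants.START_ELEMENT) {
                    return xsr.getLocalName(); // first start tag is the root
                }
            }
            throw new IllegalArgumentException("No root element found");
        } finally {
            xsr.close();
        }
    }

    /** Picks the handler registered for the root tag and applies it to the message. */
    static Object dispatch(String xml, Map<String, Function<String, Object>> handlers)
            throws XMLStreamException {
        String tag = rootTag(xml);
        Function<String, Object> handler = handlers.get(tag);
        if (handler == null) {
            throw new IllegalArgumentException("Unknown message type: " + tag);
        }
        return handler.apply(xml);
    }

    public static void main(String[] args) throws Exception {
        // Placeholder handlers; real ones would unmarshal into the matching class.
        Map<String, Function<String, Object>> handlers = new HashMap<>();
        handlers.put("order", xml -> "Order message");
        handlers.put("invoice", xml -> "Invoice message");
        System.out.println(
                dispatch("<?xml version='1.0'?><invoice><id>7</id></invoice>", handlers));
    }
}
```

Using a map of handlers rather than a switch keeps the nine message types in one table, which is close to the command-pattern idea mentioned in the JiBX answer.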
