Xtext cardinality meta model - java

I am currently working on a project where I am creating a feature model out of an Xtext grammar. My task is to transform the grammar syntax into a CSV file importable into the Eclipse plug-in pure::variants.
A feature model is basically a tree of features. These features have different types (mandatory, optional, alternative, etc.).
For constructing the tree, I am using the generated Ecore meta model of my Xtext grammar. This file (.ecore) is basically an XML file describing the objects of the grammar. It is consistent, simple, and easy to build a tree from.
My problem is that I need to assign types (mandatory, alternative, etc.) to the nodes of my tree. These feature types correspond to cardinality operators, which are written in an Xtext grammar as "?", "*", "+", or no operator at all (see section 2.1.3 of the Xtext user manual: https://www.eclipse.org/Xtext/documentation/1_0_1/xtext.pdf). The problem is that these cardinalities don't seem to be recorded anywhere I can find. I thought they would appear in the .ecore or .genmodel files, but there are no cardinalities there at all.
I imagine that if Xtext is able to check and enforce these cardinalities, it must have some kind of meta model where they are visible and easy to retrieve (something like an XML file similar to the .ecore or .genmodel files).
So my question is: is there some Xtext-generated file that contains these cardinalities? If there is not, I would have to somehow extract them from the grammar itself, which would be unnecessarily time-consuming and complicated, maybe even impossible, because the written grammar doesn't fully correspond to the Ecore metamodel I am building my feature tree from, and it is really complex.
The only generated file I was able to find that contains something possibly useful is XXXXGrammarAccess.java (XXXX stands for the name of the grammar). It is a complex generated file with a lot of library dependencies, and I have no idea how to get the cardinalities out of it, or whether that is even possible. I imagine there is a possibility, because this file uses many IGrammarAccess methods such as getRule(), getKeyword() and more, but I cannot use this file or print anything from it, because it is generated and cannot be run on its own.
If the meta model I am looking for does not exist, is there any way to obtain these cardinalities some other way during generation?
Thank you very much for your answers.

First of all, the cardinalities in the metamodel and the grammar do not have to match 100%; the cardinality validation done by the parser is different from the one done by Ecore.
The lower bound of 1 (for required) is deliberately not mapped to Ecore, to prevent really ugly error messages. The upper bound (:1, or :-1 for *) is reflected in the Ecore, though.
This was a deliberate decision when Xtext was created 10 years ago.
The grammar access just gives you access to the grammar at runtime; a sketch of reading the cardinalities from it follows below.
Can you elaborate on why you actually care?
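If the cardinality operators are all you need, the grammar model reachable through the generated XXXXGrammarAccess class (IGrammarAccess.getGrammar()) already carries them: every AbstractElement answers getCardinality() with "?", "*", "+" or null. A minimal sketch, assuming the Xtext runtime library is on the classpath:

```java
import org.eclipse.emf.common.util.TreeIterator;
import org.eclipse.emf.ecore.EObject;
import org.eclipse.xtext.AbstractRule;
import org.eclipse.xtext.Assignment;
import org.eclipse.xtext.Grammar;

public class CardinalityDumper {
    // Walk a grammar and print the cardinality operator of every assignment.
    // The Grammar instance could come from an injected XXXXGrammarAccess.getGrammar().
    public static void dump(Grammar grammar) {
        for (AbstractRule rule : grammar.getRules()) {
            TreeIterator<EObject> it = rule.eAllContents();
            while (it.hasNext()) {
                EObject obj = it.next();
                if (obj instanceof Assignment) {
                    Assignment a = (Assignment) obj;
                    // getCardinality() returns "?", "*", "+" or null (exactly once)
                    System.out.println(rule.getName() + "." + a.getFeature()
                            + " -> " + (a.getCardinality() == null ? "1" : a.getCardinality()));
                }
            }
        }
    }
}
```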

The Xtext grammar is itself a model, an instance of http://www.eclipse.org/2008/Xtext. (It used to be possible to demonstrate this by opening a *.xtext file with the Sample Reflective Ecore Editor, but unfortunately the use of classpath: URIs has broken it again.) Nonetheless you can open a *.xtext file programmatically as an EMF Resource and see everything that is in the grammar. See https://git.eclipse.org/c/ocl/org.eclipse.ocl.git/tree/examples/org.eclipse.ocl.examples.xtext2lpg/src/org/eclipse/ocl/examples/xtext2lpg/xtext2xbnf.qvto for the first stage of a transformation chain that starts by reading an Xtext grammar and ends up with an LPG grammar.
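For example, a minimal standalone sketch of that programmatic loading (assuming the Xtext runtime jars are on the classpath; "MyDsl.xtext" is a placeholder for your grammar file):

```java
import org.eclipse.emf.common.util.URI;
import org.eclipse.emf.ecore.resource.Resource;
import org.eclipse.xtext.Grammar;
import org.eclipse.xtext.XtextStandaloneSetup;
import org.eclipse.xtext.resource.XtextResourceSet;

public class LoadGrammar {
    public static void main(String[] args) {
        // Register the Xtext language itself, so that *.xtext files can be parsed
        XtextResourceSet rs = new XtextStandaloneSetup()
                .createInjectorAndDoEMFRegistration()
                .getInstance(XtextResourceSet.class);
        Resource r = rs.getResource(URI.createFileURI("MyDsl.xtext"), true);
        // The root object of the resource is the grammar model itself
        Grammar grammar = (Grammar) r.getContents().get(0);
        System.out.println("Loaded grammar: " + grammar.getName());
    }
}
```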


Configurable HTML information extraction

Scenario:
I'm doing some HTML information extraction using a crawler. Right now, most of the rules for extraction are hardcoded (not the tags or things like that, but the loops, nested elements, etc.).
For instance, one common task is as follows:
Obtain the table with ID X. If it doesn't exist, additional mechanisms to find the info may be triggered.
Find a row which contains some info. Usually the match is a regexp against a specific column.
Retrieve the data in a different column (usually marked in the td, or previously detected in the header).
The way I'm currently doing this (a rough sketch in code follows the list) is:
Query for the body of the first table with id X (X is in a config file). Some websites on my list are buggy and duplicate that id on elements other than tables -.-
Iterate over the interesting cells, executing a regexp on cell.text() (the regexp is in the config file)
Get the parent row of the matching cells, and obtain the cell I need from the row (the identifier of the row is in the config file)
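A minimal illustration of those three steps using jsoup (the library choice and the config values are assumptions for the sketch, not necessarily what is actually in use):

```java
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableExtractor {
    // tableId, rowPattern and targetColumn stand in for values from the config file
    public static String extract(String html, String tableId, String rowPattern, int targetColumn) {
        Document doc = Jsoup.parse(html);
        // "table#id" only matches table elements, so buggy pages that reuse
        // the id on other elements are filtered out by the selector itself
        Element table = doc.select("table#" + tableId).first();
        if (table == null) return null; // fallback mechanisms would go here
        Pattern p = Pattern.compile(rowPattern);
        for (Element cell : table.select("td")) {
            if (p.matcher(cell.text()).find()) {
                Element row = cell.parent(); // the enclosing <tr>
                return row.select("td").get(targetColumn).text();
            }
        }
        return null;
    }
}
```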
Having all this hardcoded for the most part (except column names, table ids, etc.) gives me the benefit of being easy to implement and more efficient than a generic parser; however, it is less configurable, and some changes in the target websites force me to touch code, which makes it harder to delegate the task.
Question
Is there any language (preferably with a Java implementation available) which allows one to consistently define rules for extractions like those? I'm using CSS-style selectors for some tasks, but others are not so simple, so my best guess is that there must be something extending that which would let a non-programmer maintainer add/modify rules on demand.
I would accept a Nutch-based answer, if there's one, as we're studying migrating our crawlers to Nutch, although I'd prefer a generic Java solution.
I was thinking about writing a parser generator and creating my own set of rules to allow users/maintainers to generate parsers, but it really feels like reinventing the wheel for no reason.
I'm doing something somewhat similar - not exactly what you're searching for, but maybe you can get some ideas.
First the crawling part:
I'm using Scrapy on Python 3.7.
For my project, this had the advantage that it's very flexible and an easy crawling framework to build upon. Things like delays between requests, HTTP header language, etc. can mostly be configured.
For the information extraction part and rules:
In my last generation of crawler (I'm now working on the 3rd gen; the 2nd one is still running but not as scalable) I used JSON files to hold the XPath / CSS rules for every page. So on starting my crawler, it loaded the JSON file for the specific page currently being crawled, and a generic crawler knew what to extract based on the loaded JSON file.
This approach isn't easily scalable since one config file per domain has to be created.
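In Java, the same per-domain rules idea might look roughly like this (a sketch only; the file names, the rule format, and the jsoup/Jackson dependencies are all assumptions):

```java
import java.io.File;
import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RuleDrivenExtractor {
    public static void main(String[] args) throws Exception {
        // rules/example-domain.json might contain: {"title": "h1", "price": "td.price"}
        @SuppressWarnings("unchecked")
        Map<String, String> rules = new ObjectMapper()
                .readValue(new File("rules/example-domain.json"), Map.class);
        Document doc = Jsoup.parse(new File("pages/example-domain/index.html"), "UTF-8");
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            // each rule maps a field name to a CSS selector a maintainer can edit
            System.out.println(rule.getKey() + " = " + doc.select(rule.getValue()).text());
        }
    }
}
```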
Currently, I'm still using Scrapy, with a starting list of 700 domains to crawl, and the crawler is now only responsible for downloading whole websites as HTML files.
These are being stored in tar archives by a shell script.
Afterward, a Python script goes through all members of the tar archives and analyzes the content for the information I'm looking to extract.
Here, as you said, it's a bit like re-inventing the wheel or writing a wrapper around an existing library.
In Python, one can use BeautifulSoup for removing all tags like script and style etc.
Then you can extract, for instance, all the text.
Or you could focus on tables only, extract all tables into dicts, and then analyze them with regexes or similar.
There are libraries like DragNet for boilerplate removal.
And there are some specific approaches on how to extract table structured information.

Why shouldn't XML comments contain hidden commands?

I'm reading Core Java vol. 2 by Cay Horstmann, and in the chapter about XML where he talks about XML comments he says:
Comments should only be information for human readers. They should never contain hidden commands; use processing instructions for commands.
What does he mean by hidden commands, why can't I use them in XML comments, and how do I use processing instructions for them?
XML comments shouldn't contain out-of-band (hidden) data or commands because the purpose of XML is to communicate information within a mutually agreed upon framework.
Neither the rules of well-formedness that define the basis of XML itself nor the common XML schema languages that define further constraints of an XML document's vocabulary and grammar have a means to define the contents of a comment beyond that of basic text. This is by design and mirrors similar design decisions regarding comments in many programming languages.
Instead of adding flags, or worse, a micro-language within XML comments, surface data as XML elements and attributes, and surface commands as processing instructions so that the entire existing ecosystem of parsers, schemas, validators, and established standards may be leveraged in reading and writing the data.
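As a concrete illustration, here is a minimal sketch with the standard DOM API, contrasting a comment with a processing instruction (the render PI is made up for the example):

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

public class PiDemo {
    public static void main(String[] args) throws Exception {
        // The comment is for humans; the command travels as a processing instruction
        String xml = "<?xml version=\"1.0\"?>"
                + "<!-- reviewed by the payroll team -->"
                + "<?render mode=\"compact\"?>"
                + "<doc>salary data goes here</doc>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                ProcessingInstruction pi = (ProcessingInstruction) n;
                // Any conforming parser exposes the PI; a comment's text would
                // have to be screen-scraped with ad-hoc conventions instead
                System.out.println("target=" + pi.getTarget() + ", data=" + pi.getData());
            }
        }
    }
}
```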
Some characters have a different meaning in XML.
If you place a character like "<" inside an XML element, it will generate an error because the parser interprets it as the start of a new element.
This will generate an XML error because the parser will look for a matching closing tag:
salary < 1000
(This can cause major problems once the application goes live)
When you use commands inside a comment, they can cause parsing errors like the one above. And because the command is hidden inside a comment, it becomes difficult to find the root cause of the parsing issue, since we may not think to look inside the comment. Hence it is better to avoid hidden commands inside comments.
For two reasons:
To reduce misinterpretation and possible mistakes, even comments should not include anything that is likely to be taken for an executable statement. XML documents are for transferring data, and the patterns in them should be defined completely clearly.
It is also a matter of secure programming. To prevent anyone from treating the files as executable, XML files should never contain hidden commands; this prevents abuse of the code. With hidden commands, someone could copy the files somewhere else to run them, or misuse the commands.

How to color keywords in a Java text editor?

I have a small working text editor written in Java using JTextPane and I would like to color specific words. It would work in the same fashion that keywords in Java (within Eclipse) are colored; if they are a keyword, they are highlighted after the user is done typing them. I am new to text editors implemented in Java, any ideas?
In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. A program or function that performs lexical analysis is called a lexical analyzer, lexer, tokenizer,[1] or scanner. A lexer often exists as a single function which is called by a parser or another function, or can be combined with the parser in scannerless parsing.
Having said that, it is no trivial task. You need a high-level library to ease it. What is the way out?
Use ANTLR. Here is what its site says:
ANTLR is a powerful parser generator that you can use to read, process, execute, or translate structured text or binary files. It’s widely used in academia and industry to build all sorts of languages, tools, and frameworks....
NetBeans IDE parses C++ with ANTLR.
There, problem solved. The author of ANTLR also has a book on how to use ANTLR, which you may want to buy if you want to learn how to use it.
Having given you enough brain melt, there is an out-of-the-box solution available for you: JSyntaxPane. Just like any JComponent, you initialize it and pop it into a JFrame. It works like a charm. It supports a whole lot of languages apart from Java.
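If you'd rather roll the highlighting yourself on a plain JTextPane, a minimal sketch (the keyword list and color are just examples) could look like this:

```java
import java.awt.Color;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import javax.swing.JTextPane;
import javax.swing.text.BadLocationException;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.StyleConstants;
import javax.swing.text.StyledDocument;

public class KeywordHighlighter {
    // A small sample of Java keywords; a real editor would list all of them
    private static final Pattern KEYWORDS =
            Pattern.compile("\\b(public|private|class|void|int|return)\\b");

    public static void highlight(JTextPane pane) {
        StyledDocument doc = pane.getStyledDocument();
        SimpleAttributeSet plain = new SimpleAttributeSet();
        SimpleAttributeSet keyword = new SimpleAttributeSet();
        StyleConstants.setForeground(keyword, new Color(127, 0, 85)); // Eclipse-like purple
        StyleConstants.setBold(keyword, true);
        try {
            String text = doc.getText(0, doc.getLength());
            doc.setCharacterAttributes(0, text.length(), plain, true); // reset everything
            Matcher m = KEYWORDS.matcher(text);
            while (m.find()) { // re-color each keyword occurrence
                doc.setCharacterAttributes(m.start(), m.end() - m.start(), keyword, true);
            }
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

You would call highlight() from a DocumentListener, wrapped in SwingUtilities.invokeLater so that the attribute changes do not happen while the document is still notifying its listeners.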

Can I use ANTLR for both two-way parsing/generating?

I need to both parse incoming messages and generate outgoing messages in EDIFACT format (basically a structured delimited format).
I would like to have a Java model that will be generated by parsing a message. Then I would like to use the same model to create an instance and generate a message.
The first half is fine, I've used ANTLR before to go from raw -> Java objects. But I've never done the reverse, or if I have, it's been custom.
Does ANTLR support generating using a grammar or is it really just a parse-only tool?
EDIT:
Expansion - I want to define two things ideally. A grammar that describes the raw message (EDIFACT in this case but pretend it's CSV if you like). And a Java object model.
I know I can write an ANTLR grammar to get from the raw -> Java model. e.g. Parsing a SQL string -> Java model which I've done before. But I need to go the other way as well ideally without changing the grammar.
If you liken it to JAXB (XML world), I really want JAXB for EDIFACT (rather than XML).
Can ANTLR do what you are asking? YES, although it might require multiple grammars.
To me, this sounds like you want to create an AST from your parser. Have one tree walker do all the Java object creation required (a second grammar, possibly). And then a second tree walker to create the output messages (a third grammar), and you can even use StringTemplate if you want. Maybe you can get away with two grammars.
But at this point actual details would be needed for any more help: what the AST will look like for a specific input, and what the output message should be.
I have never done it myself (I also used ANTLR for parsing only) but I know for sure that ANTLR can be used as a generator as well.
In fact, it uses a library called StringTemplate for its own code generation (by the same author).
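For the generating half, the StringTemplate (ST4) side on its own is straightforward; here is a minimal sketch with a made-up EDIFACT-like segment template:

```java
import org.stringtemplate.v4.ST;

public class SegmentDemo {
    public static void main(String[] args) {
        // ST4 uses <...> as its default attribute delimiters
        ST segment = new ST("NAD+<qualifier>+<partyId>::<agency>'");
        segment.add("qualifier", "BY");        // values would come from the Java model
        segment.add("partyId", "5412345000013");
        segment.add("agency", "9");
        System.out.println(segment.render()); // NAD+BY+5412345000013::9'
    }
}
```

So one realistic split is an ANTLR grammar for parsing raw -> model, plus a set of templates for model -> raw, accepting that the two directions are described separately rather than by a single grammar.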

JAXB Compiler and Attribute Order [duplicate]

I would like to control the attribute order in .java files generated by the JAXB compiler.
I'm aware that attribute order is not important for XML validation. The order is important for textual comparison of marshalled XML in a regression-test environment. The order of attributes in a generated .java file directly affects the order of the attributes in marshalled XML tags.
Every time the JAXB compiler is run, attribute groups appear in a different order, even with no changes to the schema. There is no apparent option on the compiler to prevent this behavior.
I would like to avoid running a post-compilation script to alphabetically reorder attributes in the generated .java files since this breaks up the attribute groups, but I'm not sure there is another option.
Any suggestions are much appreciated.
Thanks,
Dave
Apparently, in JAXB 2.0 you can use the annotation @XmlAccessorOrder or @XmlType(propOrder = ...).
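A minimal sketch of how those annotations are applied (the class and fields are made up; note that propOrder governs element properties, while @XmlAccessorOrder pins property order in general):

```java
import javax.xml.bind.annotation.XmlAccessOrder;
import javax.xml.bind.annotation.XmlAccessorOrder;
import javax.xml.bind.annotation.XmlAttribute;
import javax.xml.bind.annotation.XmlRootElement;

// Alphabetical instead of source order, so regenerating the class
// cannot silently shuffle the marshalled output
@XmlAccessorOrder(XmlAccessOrder.ALPHABETICAL)
@XmlRootElement(name = "payment")
public class Payment {
    @XmlAttribute public long amount;
    @XmlAttribute public String currency;
    // For element properties you could instead use
    // @XmlType(propOrder = {"amount", "currency"}) on the class
}
```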
I'd recommend using an XML parser to validate the output instead of doing textual comparisons. If you're going to be parsing the xml to re-order it anyway, you may as well just do the comparison using XML tools.
Edit:
Attempting to control the generated XML by manipulating the Java source code order seems like a fragile way of doing things. Granted, this is for testing only, so if something breaks the code might still work properly. People change source code order all the time, sometimes by accident, and it will be annoying or a subtle source of problems if you have to rely on a certain ordering.
As for ways of comparing the XML data using XML tools, I've never personally done this on a large scale, but this link mentions a few free tools. For me the extension to JUnit that provides XML-related assertions would be my first step, as that could integrate well with my existing tests. Otherwise, since you're mainly looking for exact equivalence, you could just parse the two XML files, then iterate over the nodes in the 'expected' file and see if those nodes are present in the 'actual' file. Then just check for any other nodes that you don't expect to see.
If you need to perform textual comparison of XML documents, there are better ways of doing it than trying to control the output of an XML framework that does not treat attribute ordering as significant.
For example, there's XMLUnit, which is a junit extension specifically for XML assertions, and it handles whitespace and ordering quite nicely.
A more general solution is XOM's Canonicalizer, which outputs XML DOMs such that the attribute ordering and whitespace is predictable. Very handy.
So... let JAXB (or whatever) generate the XML as it sees fit, then run the outputs through XMLUnit or XOM, and compare. This has the added advantage of not depending on JAXB, it'll work with any generated XML.
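For instance, a minimal sketch with the classic XMLUnit 1.x API (the document strings are made up):

```java
import org.custommonkey.xmlunit.Diff;
import org.custommonkey.xmlunit.XMLUnit;

public class CompareXml {
    public static void main(String[] args) throws Exception {
        XMLUnit.setIgnoreWhitespace(true); // formatting differences don't matter
        String expected = "<person age=\"42\" name=\"Dave\"/>";
        String actual = "<person name=\"Dave\" age=\"42\"/>"; // attributes reordered
        Diff diff = new Diff(expected, actual);
        // Attribute order is a recoverable difference, so the documents
        // still compare as similar
        System.out.println("similar: " + diff.similar()); // true
    }
}
```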
