Beginner's guide to writing grammar - java

The application I am working on imports a lot of data from files and updates database columns accordingly. I need to come up with a custom rule engine that validates the input values and transforms the data accordingly. For example:
One of the fields in our application is Product Name. One of the rules we need to implement is to convert the product name to upper case if the input value from the file is in lower case. Similarly, there are many text/mathematical transformations that need to be done. For these reasons, we need a custom rule engine where we define the rules for each attribute, parse them, and then apply them.
I do know that ANTLR is one of the parser generators available for Java. I am seeking advice on the following questions:
1> General information on how a parser generator works and best practices for implementing a grammar.
2> Since I need to design this rule engine completely, can anyone point me to a sample rule engine that I can refer to, right from UI to database design? I am using GWT for the UI, Java for core logic and Oracle for the database.
3> Are there any other parser generators around for Java?
4> Though I do want to follow the path of defining my own grammar and using a parser generator to build this rule engine, is there any other approach I should consider?

You might want to consider just using Drools (also known as JBoss Rules), which is a Java-based rules engine. Alternatively, a scripting engine may be another way to implement your rules (e.g. Mozilla Rhino, a JavaScript engine written in Java).
Writing your own in this situation seems like overkill, but it may allow you to provide better security guarantees if end users are going to be creating the rules / scripts.
EDIT to address questions in comments:
I suggest using an existing rules engine (à la JBoss Rules/Drools) instead of writing your own parser and grammar (for the rule component). Take a look here for instance: Drools.
For specialized logic that rules may need to use (db access or computation libraries) you should write a single Java API used by your rules (so that rules are not deeply accessing your other code since that can lead to bugs if/when you refactor). This advice applies regardless of which rules engine you use (your own or an existing one).
I assume that you already have the data format of your data input files solved and that you are only looking for a solution to the rule format and rule parsing.

There is JavaCC, which is a parser generator, and there is Groovy for evaluating rules. Whether or not you use a script engine depends on the grammar. If the rules can't be expressed in JavaScript, Java, Python, etc., and you want to write them in a new language, then you have to use a parser generator. Either way, you can do anything you want inside methods that you create and then call them from the rules; the rules themselves are then evaluated by the script engine.
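The idea of keeping the real work in ordinary Java methods and letting the rules merely name them can be tried out before choosing any parser or script engine. The following is a minimal sketch under assumptions: the class name AttributeRules, the rule names "trim", "collapseSpaces" and "upper", and the idea of storing a list of rule names per attribute are all illustrative and not taken from Drools, JavaCC or any other library mentioned here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: rules are named Java functions, and a per-attribute
// configuration lists which rule names to apply, in order.
class AttributeRules {
    // Registry of available transformations (names are illustrative).
    private static final Map<String, UnaryOperator<String>> RULES = new LinkedHashMap<>();
    static {
        RULES.put("trim", String::trim);
        RULES.put("upper", s -> s.toUpperCase(Locale.ROOT));
        RULES.put("collapseSpaces", s -> s.replaceAll("\\s+", " "));
    }

    // Applies the named rules to a value, failing fast on an unknown rule name.
    static String apply(List<String> ruleNames, String value) {
        String result = value;
        for (String name : ruleNames) {
            UnaryOperator<String> rule = RULES.get(name);
            if (rule == null) {
                throw new IllegalArgumentException("Unknown rule: " + name);
            }
            result = rule.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        // In a real system this list would be loaded per attribute, e.g. from the database.
        List<String> productNameRules = List.of("trim", "collapseSpaces", "upper");
        System.out.println(apply(productNameRules, "  acme   widget ")); // prints ACME WIDGET
    }
}
```

If a list of rule names per attribute turns out to be expressive enough, no grammar is needed at all; a parser generator or script engine only becomes necessary once rules need conditions or arguments.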

Related

How to evaluate user expressions in a sandbox

I want my app to evaluate an expression from an untrusted user, that I'll be reading from a JSON file. Such as:
value = "(getTime() == 60) AND isFoo('bar')"
I've found many threads about this here on StackOverflow, usually recommending Java's own ScriptEngine class, which can run JavaScript, or recommending an existing library such as JEXL, MVEL, or any other from this list:
http://java-source.net/open-source/expression-languages
But they all seem to rely on a trusted user (ex.: a configuration file you write yourself and want to do some scripting in it). But in my case, I want my expression evaluation to run in a secure sandbox. So the user cannot do something as simple as:
value = "while(true)" // or
value = "new java.io.File(\"R:/t.txt\").delete()" // this works on MVEL
And lock up my app, or access unwanted resources.
1) So can any of those existing libraries be easily configured to run in a sandbox? By 'easily', I mean a high-level configuration API that would be faster for me to use than writing my own expression evaluator. After doing a little bit of my own research, both JEXL and MVEL seem to be out.
2) Or is there an existing expression language that is so simple that it cannot be exploited by an untrusted user? All the ones I found are very complex, and implement things like loops, import statements etc. All I need is to parse math, logic operators and my own defined variables and methods. Anything beyond that is outside of my scope.
3) If the only solution is to write my own expression evaluator, then where can I find some guidance on how to write a consistent security model? I'm new to this, and have no idea what the common tricks used for code injection are, which is why I wanted to avoid having to write this on my own.
I would recommend embedding Rhino, which lets the user write JavaScript. It fits your criteria in (2) perfectly, being a Java library that lets you run JavaScript (or call Java from JavaScript).
You set up a context, and the user only has access to what you put in the context or make accessible from it. The JavaScript expressions can be as simple as the case you show above, or as complex as they need to be. Embedding Rhino and exposing a limited set of objects was a great way to enable all sorts of user scripting in a past project, and that was some years ago; Rhino is quite mature now.
You've also got the advantage that if your problem requires it, you may well be able to set it up so that the same expressions will happily run client or server side.
More information on embedding Rhino to accomplish what you need at http://www.mozilla.org/rhino/tutorial.html#runScript
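For options (2) and (3) in the question, an alternative to embedding a full scripting language is a deliberately tiny evaluator whose grammar simply has no loops, no object creation and no member access, so inputs like while(true) or new java.io.File(...) are rejected at parse time. The following is a minimal sketch under assumptions, not code from JEXL, MVEL or Rhino: the class name SafeExpr is hypothetical, only ==, AND, OR, parentheses, numbers, single-quoted strings and calls to whitelisted functions are supported, and the bodies given to getTime and isFoo are made up for the demonstration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: recursive-descent evaluator for a very small language.
class SafeExpr {
    private static final Pattern TOKEN =
        Pattern.compile("\\s*(\\d+(?:\\.\\d+)?|'[^']*'|[A-Za-z_]\\w*|==|[(),])");

    private final List<String> tokens = new ArrayList<>();
    private final Map<String, Function<List<Object>, Object>> functions;
    private int pos = 0;

    private SafeExpr(String src, Map<String, Function<List<Object>, Object>> functions) {
        this.functions = functions;
        Matcher m = TOKEN.matcher(src);
        int end = 0;
        while (end < src.length() && m.region(end, src.length()).lookingAt()) {
            tokens.add(m.group(1));
            end = m.end();
        }
        // Anything the tokenizer does not know (dots, braces, operators) is an error.
        if (!src.substring(end).trim().isEmpty()) {
            throw new IllegalArgumentException("Unexpected input at offset " + end);
        }
    }

    // Entry point: only functions present in the map can ever be called.
    static Object eval(String src, Map<String, Function<List<Object>, Object>> functions) {
        SafeExpr p = new SafeExpr(src, functions);
        Object value = p.or();
        if (p.pos != p.tokens.size()) {
            throw new IllegalArgumentException("Unexpected token: " + p.peek());
        }
        return value;
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : null; }

    private boolean accept(String t) {
        if (t.equals(peek())) { pos++; return true; }
        return false;
    }

    private void expect(String t) {
        if (!accept(t)) throw new IllegalArgumentException("Expected " + t + " but got " + peek());
    }

    private static boolean bool(Object v) {
        if (v instanceof Boolean) return (Boolean) v;
        throw new IllegalArgumentException("Expected a boolean but got: " + v);
    }

    private Object or() {                       // or := and ('OR' and)*
        Object left = and();
        while (accept("OR")) { Object right = and(); left = bool(left) | bool(right); }
        return left;
    }

    private Object and() {                      // and := cmp ('AND' cmp)*
        Object left = cmp();
        while (accept("AND")) { Object right = cmp(); left = bool(left) & bool(right); }
        return left;
    }

    private Object cmp() {                      // cmp := primary ('==' primary)?
        Object left = primary();
        if (accept("==")) { Object right = primary(); return Objects.equals(left, right); }
        return left;
    }

    private Object primary() {                  // number | 'string' | name(args) | (expr)
        String t = peek();
        if (t == null) throw new IllegalArgumentException("Unexpected end of input");
        pos++;
        if (t.equals("(")) { Object v = or(); expect(")"); return v; }
        if (Character.isDigit(t.charAt(0))) return Double.valueOf(t);
        if (t.charAt(0) == '\'') return t.substring(1, t.length() - 1);
        Function<List<Object>, Object> f = functions.get(t);
        if (f == null) throw new IllegalArgumentException("Unknown function: " + t);
        expect("(");
        List<Object> args = new ArrayList<>();
        if (!accept(")")) {
            do { args.add(or()); } while (accept(","));
            expect(")");
        }
        return f.apply(args);
    }

    public static void main(String[] args) {
        Map<String, Function<List<Object>, Object>> fns = Map.of(
            "getTime", a -> 60.0,                          // assumed stand-in implementation
            "isFoo", a -> "bar".equals(a.get(0)));         // assumed stand-in implementation
        System.out.println(eval("(getTime() == 60) AND isFoo('bar')", fns)); // prints true
    }
}
```

This only restricts the language surface. The whitelisted functions themselves still run with the application's full privileges, so they need to be written defensively.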

Automatically generating Java source code

I'm looking for a way to automatically generate source code for new methods within an existing Java source code file, based on the fields defined within the class.
In essence, I'm looking to execute the following steps:
Read and parse SomeClass.java
Iterate through all fields defined in the source code
Add source code method someMethod()
Save SomeClass.java (Ideally, preserving the formatting of the existing code)
What tools and techniques are best suited to accomplish this?
EDIT
I don't want to generate code at runtime; I want to augment existing Java source code
What you want is a Program Transformation system.
Good ones have parsers for the language you care about, build ASTs representing the program for the parsed code, provide you with access to the AST for analysis and modification, and can regenerate source text from the AST. Your remark about "scanning the fields" is just a kind of traversal of the AST representing the program. For each interesting analysis result you produce, you want to make a change to the AST, perhaps somewhere else, but nonetheless in the AST.
And after all the changes are made, you want to regenerate text with comments (as originally entered, or as you have constructed in your new code).
There are several tools that do this specifically for Java.
Jackpot provides a parser, builds ASTs, and lets you code Java procedures to do what you want with the trees. Upside: easy conceptually. Downside: you write a lot more Java code to climb around/hack at trees than you'd expect. Jackpot only works with Java.
Stratego and TXL parse your code, build ASTs, and let you write "source-to-source" transformations (using the syntax of the target language, e.g., Java in this case) to express patterns and fixes. Additional good news: you can define any programming language you like as the target language to be processed, and both of these have Java definitions.
But they are weak on analysis: often you need symbol tables and data flow analysis to really make the analyses and changes you need. And they insist that everything is a rewrite rule, whether that helps you or not; this is a little like insisting you only need a hammer in your toolbox; after all, everything can be treated like a nail, right?
Our DMS Software Reengineering Toolkit allows the definition of an arbitrary target language (and has many predefined languages including Java), includes all the source-to-source transformation capabilities of Stratego and TXL and the procedural capability of Jackpot, and additionally provides symbol tables, control and data flow analysis information. The compiler guys taught us these things were necessary to build strong compilers (= "analysis + optimizations + refinement") and it is true of code generation systems too, for exactly the same reasons. Using this approach you can generate code and optimize it to the extent you have the knowledge to do so. One example, similar to your serialization ideas, is to generate fast XML readers and writers for specified XML DTDs; we've done that with DMS for Java and COBOL.
DMS has been used to read/modify/write many kinds of source files. A nice example that will make the ideas clear can be found in this technical paper, which shows how to modify code to insert instrumentation probes: Branch Coverage Made Easy.
A simpler but more complete example of defining an arbitrary language and transformations to apply to it, using the same ideas, can be found at How to transform Algebra.
Have a look at Java Emitter Templates. They allow you to create java source files by using a mark up language. It is similar to how you can use a scripting language to spit out HTML except you spit out compilable source code. The syntax for JET is very similar to JSP and so isn't too tricky to pick up. However this may be an overkill for what you're trying to accomplish. Here are some resources if you decide to go down that path:
http://www.eclipse.org/articles/Article-JET/jet_tutorial1.html
http://www.ibm.com/developerworks/library/os-ecemf2
http://www.vogella.de/articles/EclipseJET/article.html
Modifying the same Java source file with auto-generated code is a maintenance nightmare. Consider generating a new class that extends your current class and adds the desired method. Use reflection to read the user-defined class and create Velocity templates for the auto-generated classes. Then, for each user-defined class, generate its extending class. Integrate the code generation phase into your build lifecycle.
Or you may use 'bytecode enhancement' techniques to enhance the classes without having to modify the source code.
Updates:
Mixing auto-generated code into hand-written files always poses the risk that someone will later modify it just to tweak a small behavior. At the next build, those changes will be lost.
You will have to rely solely on the comments at the top of the auto-generated source to prevent developers from doing so.
Version control: let's say you update the template of someMethod(). Now every source file gets a new version, even though the change is auto-generated, and you will see redundant history.
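The subclass-generation suggestion in this answer can be sketched with plain reflection and string building; in a real build a template engine such as Velocity would probably replace the StringBuilder. This is a sketch under assumptions: SubclassGenerator, Person, PersonGenerated and describe() are hypothetical names, and private fields are skipped because a generated subclass could not read them directly.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Hypothetical sketch: emit the source of a subclass that adds one method
// built from the fields of a hand-written base class.
class SubclassGenerator {
    // Stand-in for a hand-written, user-defined class.
    static class Person {
        protected String name;
        protected int age;
    }

    static String generate(Class<?> base, String generatedName) {
        String baseName = base.getSimpleName();
        StringBuilder src = new StringBuilder();
        src.append("// GENERATED FILE - do not edit, regenerate instead\n");
        src.append("class ").append(generatedName)
           .append(" extends ").append(baseName).append(" {\n");
        src.append("    public String describe() {\n");
        src.append("        return \"").append(baseName).append("{\"");
        String separator = "";
        for (Field field : base.getDeclaredFields()) {
            int mods = field.getModifiers();
            // Skip fields the subclass cannot or should not print.
            if (Modifier.isStatic(mods) || Modifier.isPrivate(mods) || field.isSynthetic()) {
                continue;
            }
            src.append(" + \"").append(separator).append(field.getName())
               .append("=\" + ").append(field.getName());
            separator = ", ";
        }
        src.append(" + \"}\";\n");
        src.append("    }\n");
        src.append("}\n");
        return src.toString();
    }

    public static void main(String[] args) {
        // A build step would write this text to a separate generated-sources folder.
        System.out.print(generate(Person.class, "PersonGenerated"));
    }
}
```

Because the output goes into its own file, the hand-written class is never touched, and the generated file can stay out of version control.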
You can use cglib to generate code at runtime.
Iterating through the fields and defining someMethod is a pretty vague problem statement, so it's hard to give you a very useful answer, but Eclipse's refactoring support provides some excellent tools. It'll give you constructors which initialize a selected set of the defined members, and it'll also define a toString method for you.
I don't know what other someMethod()'s you'd want to consider, but there's a start for you.
I'd be very wary of injecting generated code into files containing hand-written code. Hand-written code should be checked into revision control, but generated code should not be; the code generation should be done as part of the build process. You'd have to structure your build process so that for each file you make a temporary copy, inject the generated source code into it, and compile the result, without touching the original source file that the developers work on.
Antlr is really a great tool that can be used very easily for transforming Java source code to Java source code.

Java Collada Parser - XML Pull based implementation

I am looking at a set of parsers generated for Atom, XAL, KML etc., seemingly using an automated technique with an XML pull-based parser. The clue towards the automation is the presence of "package.html" in all XML-to-Java mapped class folders. I would like to produce a similar one for the rather large Collada 1.4 spec. My first attempt with Altova ran into small problems due to the "enum" keyword. I am sure I can fix it in the next run with appropriate renaming. Khronos admits that the 1.4 spec was not designed to be friendly to automated parser generation.
The actual parsers, i.e. the XAL parser, Atom parser etc., implement the XMLEventParser interface. I would like to know if anybody has encountered/used this pattern. If so, which tool can be used to map the XSD to a set of classes that simply give access to the data components of the nodes using getters and setters?
I'm not sure I understand your question, but it appears that you want to process XML formats like Atom and represent it in objects with getters/setters. This can easily be done with JAXB.
For an example see:
http://bdoughan.blogspot.com/2010/09/processing-atom-feeds-with-jaxb.html
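For comparison with the pull-based pattern described in the question, here is what a hand-written pull mapping looks like with StAX (javax.xml.stream), which ships with the JDK. This is only an illustrative sketch: FeedPullParser and Entry are hypothetical names, the feed is a simplified Atom-like document without namespaces, and this is not the XMLEventParser interface the question refers to. A binding tool such as JAXB would generate the equivalent of Entry from a schema instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: pull events from the XML stream and fill plain objects.
class FeedPullParser {
    // Data holder with getter/setter, as a binding tool would generate.
    static class Entry {
        private String title;
        public String getTitle() { return title; }
        public void setTitle(String title) { this.title = title; }
    }

    static List<Entry> parse(String xml) {
        List<Entry> entries = new ArrayList<>();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            Entry current = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("entry".equals(name)) {
                        current = new Entry();
                    } else if ("title".equals(name) && current != null) {
                        // Only titles inside an <entry> are mapped; the feed title is ignored.
                        current.setTitle(reader.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "entry".equals(reader.getLocalName())) {
                    entries.add(current);
                    current = null;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
        return entries;
    }

    public static void main(String[] args) {
        String xml = "<feed><title>Feed</title>"
                   + "<entry><title>First</title></entry>"
                   + "<entry><title>Second</title></entry></feed>";
        for (Entry e : parse(xml)) {
            System.out.println(e.getTitle());
        }
    }
}
```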

Coding a parser for a domain specific language in Java

We want to design a simple domain-specific language for writing test scripts to automatically test an XML-based interface of one of our applications. A sample test would be:
Get an input XML file from network shared folder or subversion repository
Import the XML file using the interface
Check if the import result message was successful
Export the XML corresponding to the object that was just imported using the interface and check if it is correct.
If the domain-specific language can be declarative and its statements look as close as possible to my sentences in the sample above, it will be awesome, because people won't necessarily have to be programmers to understand/write/maintain the tests. Something like:
newObject = GET FILE "http://svn/repos/template1.xml"
responseMessage = IMPORT newObject
newObjectID = GET PROPERTY '/object/id/' FROM responseMessage
(..)
But then I'm not sure how to implement a simple parser for that language in Java. Back in school, 10 years ago, I coded a language parser using Lex and Yacc for the C language. Maybe an approach would be to use some equivalent for Java?
Or, I could give up the idea of having a declarative language and choose an XML-based language instead, which would possibly be easier to create a parser for? What approach would you recommend?
You could try JavaCC or Antlr for creating a parser for your domain specific language. If the editors of that file are not programmers, I would prefer this approach over XML.
Take a look at Xtext - it will take a grammar definition and generate a parser as well as a fully-featured Eclipse editor plugin with syntax highlighting and syntax checking.
ANTLR should suffice
ANTLR, ANother Tool for Language Recognition, is a language tool that provides a framework for constructing recognizers, interpreters, compilers, and translators from grammatical descriptions containing actions in a variety of target languages. ANTLR provides excellent support for tree construction, tree walking, translation, error recovery, and error reporting.
Look at the ANTLR library. You'll have to use an EBNF grammar to describe your language and then use ANTLR to generate Java classes from your grammar.
Have a look at how Cucumber defines its test cases at http://cukes.info/ (it can run in JRuby).
Or, I could give up the idea of having a declarative language and choose an XML-based language instead, which would possibly be easier to create a parser for? What approach would you recommend?
This could be easily done using XML to describe your test scenarios.
<GETFILE object="newObject" file="http://svn/repos/template1.xml"/>
Since your example syntax is quite simple, it should also be possible to simply use StringTokenizer to tokenize and parse this kind of script.
If you want to introduce more complex expressions or control structures, you are probably better off choosing ANTLR.
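The StringTokenizer approach suggested above can be sketched as follows. It is a minimal illustration under assumptions: ScriptLineParser and Statement are hypothetical names, every line is assumed to have the shape target = VERB arg..., and quoted arguments must not contain whitespace, since StringTokenizer splits on it blindly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

// Hypothetical sketch: split one script line into target, verb and arguments.
class ScriptLineParser {
    static final class Statement {
        final String target;
        final String verb;
        final List<String> args;

        Statement(String target, String verb, List<String> args) {
            this.target = target;
            this.verb = verb;
            this.args = args;
        }

        @Override
        public String toString() {
            return target + " <- " + verb + " " + args;
        }
    }

    static Statement parse(String line) {
        StringTokenizer st = new StringTokenizer(line); // splits on whitespace
        if (st.countTokens() < 3) {
            throw new IllegalArgumentException("Expected 'target = VERB ...': " + line);
        }
        String target = st.nextToken();
        if (!"=".equals(st.nextToken())) {
            throw new IllegalArgumentException("Expected '=' after target: " + line);
        }
        String verb = st.nextToken().toUpperCase(Locale.ROOT);
        List<String> args = new ArrayList<>();
        while (st.hasMoreTokens()) {
            args.add(stripQuotes(st.nextToken()));
        }
        return new Statement(target, verb, args);
    }

    // Removes one pair of matching single or double quotes, if present.
    private static String stripQuotes(String token) {
        if (token.length() >= 2) {
            char first = token.charAt(0);
            char last = token.charAt(token.length() - 1);
            if (first == last && (first == '"' || first == '\'')) {
                return token.substring(1, token.length() - 1);
            }
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(parse("newObject = GET FILE \"http://svn/repos/template1.xml\""));
        System.out.println(parse("responseMessage = IMPORT newObject"));
        System.out.println(parse("newObjectID = GET PROPERTY '/object/id/' FROM responseMessage"));
    }
}
```

An interpreter would then dispatch on the verb. Once arguments need embedded spaces, nesting or conditions, a generated parser becomes the easier route.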
I realize this thread is 3 years old but still feel prompted to offer my take on it. The questioner asked if Java could be used for a DSL to look as closely as possible like
Get an input XML file from network shared folder or subversion repository
Import the XML file using the interface
Check if the import result message was successful
Export the XML corresponding to the object that was just imported using the interface and check if it is correct.
The answer is yes it can be done, and has been done for similar needs. Many years ago I built a Java DSL framework that - with simple customization - could allow the following syntax to be used for compilable, runnable code:
file InputFile
message Message
get InputFile from http://<....>
import Message from InputFile
if validate Message export Message
else
begin
! Signal an error
end
In the above, the keywords file, message, get, import, validate and export are all custom keywords, each one requiring two simple classes of less than a page of code to implement their compiler and runtime functions. As each piece of functionality is completed it is dropped into the framework, where it is immediately available to do its job.
Note that this is just one possible form; the exact syntax can be freely chosen by the implementor. The system is effectively a DIY high-level assembly language, using pre-written Java classes to perform all the functional blocks, both for compiling and for the runtime. The framework defines where these bits of functionality have to be placed, and provides the necessary abstract classes and interfaces to be implemented.
The system meets the primary need of clarity, where non-programmers can easily see what's happening. Changes can be made quickly and run immediately as compilation is almost instantaneous.
Complete (open) source code is available on request. There's a generic Java version and also one for Android.

Java; Runtime Interpretation; Strategies To Add Plugins

I'm starting on my first large project. It will be a program very similar to Rosetta Stone: a program for learning a foreign language, written in Java using Swing. In my program I plan on the user being able to select downloaded courses to learn from. I will be able to create an English course since I am a native English speaker. However, I want people who speak other languages to be able to write courses for users as well (this is an essential part for my program to work).
Since I want the users to be able to download courses for the languages they want, having them hard-coded into the program is out of the question. The courses need to be interpreted at runtime. Also, since I want others to collaborate on my work (i.e. make courses), I need to make it easy for them to do so.
What would be the best way to go about doing this?
The idea I have come up with is having a strict empty course outline (hard-coded) with a simple XML file which details the text and sounds to be used. The drawback to this is that it severely limits the author. Different languages may need to start out with learning different parts.
Any advice on the problem at hand as well as the project as a whole will be greatly appreciated. Any links to any relevant resources or information would also be greatly appreciated.
Thank you for your time and effort,
Joseph Pond
Simply, you should base your program on a system such as Eclipse RCP, or the Netbeans Platform. Both of these systems already deal with exactly this problem, and both are perfectly adequate for this task. They're not just for IDEs.
It's a larger first step as you will need to learn one of these platforms beyond simply just Swing.
But, they solve the problem, and their overall organization and technique will serve your program well anyway.
Don't reinvent this wheel, just learn one of these instead.
If you are set on doing this from scratch (Will's idea isn't bad), what I would do is first lay down the file format that would be easiest to create your language course in. It could be XML, plain text or some other format you come up with yourself.
You will probably need some flexibility in the language format because you will want to actually be able to specify things like questions and answers. XML is a pain because of all the extra terminators, but it gives a good amount of metadata. If you like XML for that, you may consider defining your language file in YAML; it gives you the data of XML but uses whitespace as delimiters instead of angle brackets.
You probably also want to define your file in the language it's created for, so you might or might not want to require English words as keys. If you don't want any English, you may have to skip both XML and YAML and come up with your own file format, possibly one where the layout and/or special symbols define the flow and "functionality".
Once you have defined the file format, you won't have to worry about hard-coding anything... you won't be able to because it will already be in the file.
Plug-in functionality would be nice as well... This is where your definition file also contains information that tells you what class to instantiate (reflectively) and use to parse/display the data. In that way you could add new types of questions just by delivering a new jar file.
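The reflective instantiation described above can be sketched as follows. This is only an illustration under assumptions: PluginLoaderSketch, QuestionType and MultipleChoice are hypothetical names, the class name would in practice be read from the course definition file, and classes delivered in a separate jar would additionally need a class loader that can see that jar (for example a URLClassLoader).

```java
// Hypothetical sketch: the course file names a class, the program instantiates it.
class PluginLoaderSketch {
    // Contract every question-type plug-in has to implement.
    interface QuestionType {
        String render(String data);
    }

    // One example plug-in; new ones could be shipped in additional jar files.
    static class MultipleChoice implements QuestionType {
        public MultipleChoice() { }

        @Override
        public String render(String data) {
            return "[multiple choice] " + data;
        }
    }

    // Loads the named class and checks that it really is a QuestionType.
    static QuestionType load(String className) {
        try {
            Class<?> cls = Class.forName(className);
            return cls.asSubclass(QuestionType.class).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot load question type: " + className, e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a value read from the course definition file.
        String classNameFromFile = MultipleChoice.class.getName();
        QuestionType question = load(classNameFromFile);
        System.out.println(question.render("Which word means 'house'?"));
    }
}
```

Checking the loaded class against the interface before instantiating it keeps a malformed course file from creating arbitrary objects of the wrong type.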
If this is confusing, sorry, this is difficult in a one-way forum because I can't look at your face and see if you're following me or if I'm even going in the right direction. If you think I'm on the right track and want more details (I've done a bit of this stuff before) feel free to leave a follow-up question (or an email address) in a comment and I'd be glad to discuss it with you further.
If I was doing this, I'd seriously consider using Eclipse EMF to model the "language" for defining courses. EMF is rather daunting to start with, but it gives you:
A high-level model that can be entered/edited in a variety of ways.
An automatic mechanism for serializing "instances" (i.e. courses) to XML. (And you can tinker with the serialization if you choose.)
Automatically generated Java classes for in-memory representations of your instances. These provide APIs that are tuned to your model, and generic ones that are the EMF equivalent of Java reflection ... but based on EMF model classes rather than Java classes.
An automatically generated tree editor for your "instances".
Hooks for implementing your own constraints / validation rules to say what is a valid "course".
Related Eclipse plugins offer:
Mappings to text-based languages with generation of parsers/unparsers
Mappings to graphical languages; e.g. notations using boxes / arrows / etc
Various more advanced persistence mechanisms
Comparisons/differencing, model-to-model transformations, constraints in OCL, etc
I've used EMF in a couple of largish projects, and the main point that keeps me coming back for more is ease of model evolution ... compared with building everything at a lower level of abstraction. If my model (language) needs to be extended / changed, I can make the necessary changes using the EMF Model editor, regenerate the code, extend my custom code to do the right stuff with the extensions, and I'm pretty much done (modulo conversion of stored instances).
