I'm working on a tool, as part of a Java project, to evaluate a custom domain-specific, rule-like expression such as
min-5 avg datalist > Number
with the individual tokens meaning the following:
min-5 : optional minimum (or maximum, in which case max-5) number of occurrences of the following term
avg : an optional aggregation function which operates on the following token datalist (can also be sum or anything similar)
datalist : A list of data (type: integer/ double) which will be available before the evaluation of the entire expression starts, can be reduced to a single value by the preceding aggregation function
operator: conditional operator < or > or =
Number: value for the conditional operator
Note(s):
The optional number of occurrences and the aggregation cannot both appear in the same expression; that would make no sense.
There can be multiple of the above expressions, chained with and/or
These expressions are external input, not pre-defined
The evaluation of this expression should output a boolean
As I am rather new to expression evaluation/parsing, I am looking for an elegant way to solve this, possibly with a Java framework or tool.
What I've tried so far:
Parsing by hand which turned out not so nicely
Trying to use Janino Expression Evaluator, but I don't know how to handle this programmatically
I am looking for an elegant way to solve this and am thankful for any suggestions.
What you are trying to build is a DSL (domain-specific language), and the elegant way to solve your problem is to write a grammar for your language and let that grammar drive the parsing.
Take a look at JavaCC or ANTLR.
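If a parser generator feels heavy for a grammar this small, a hand-written recursive-descent evaluator is also workable. The following is only a minimal sketch under several assumptions that are not in the question: tokens are separated by whitespace, "and" binds tighter than "or", min-N/max-N count the list elements that satisfy the comparison, only avg and sum are supported as aggregations, and every clause carries one of the two prefixes. The class name RuleEvaluator and the map of named data lists are hypothetical.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: evaluates rules such as "min-3 datalist > 5 or avg datalist < 2". */
final class RuleEvaluator {
    private final String[] tokens;
    private final Map<String, List<Double>> data;
    private int pos = 0;

    private RuleEvaluator(String rule, Map<String, List<Double>> data) {
        this.tokens = rule.trim().split("\\s+"); // assumption: whitespace-separated tokens
        this.data = data;
    }

    static boolean evaluate(String rule, Map<String, List<Double>> data) {
        RuleEvaluator evaluator = new RuleEvaluator(rule, data);
        boolean result = evaluator.orExpr();
        if (evaluator.pos != evaluator.tokens.length) {
            throw new IllegalArgumentException("Unexpected token: " + evaluator.tokens[evaluator.pos]);
        }
        return result;
    }

    // orExpr := andExpr ("or" andExpr)*
    private boolean orExpr() {
        boolean value = andExpr();
        while (peekIs("or")) {
            pos++;
            boolean right = andExpr(); // always parse the right side, even if value is already true
            value = value || right;
        }
        return value;
    }

    // andExpr := clause ("and" clause)*
    private boolean andExpr() {
        boolean value = clause();
        while (peekIs("and")) {
            pos++;
            boolean right = clause();
            value = value && right;
        }
        return value;
    }

    // clause := ("min-N" | "max-N" | "avg" | "sum") listName ("<" | ">" | "=") number
    private boolean clause() {
        String head = next();
        if (head.matches("(min|max)-\\d+")) {
            boolean atLeast = head.startsWith("min");
            int limit = Integer.parseInt(head.substring(4));
            List<Double> values = list(next());
            String op = next();
            double bound = Double.parseDouble(next());
            long hits = values.stream().filter(v -> compare(v, op, bound)).count();
            return atLeast ? hits >= limit : hits <= limit;
        }
        if (head.equals("avg") || head.equals("sum")) {
            List<Double> values = list(next());
            double sum = values.stream().mapToDouble(Double::doubleValue).sum();
            double reduced = head.equals("sum") ? sum : sum / values.size();
            String op = next();
            return compare(reduced, op, Double.parseDouble(next()));
        }
        throw new IllegalArgumentException("Expected min-N, max-N, avg or sum but found: " + head);
    }

    private static boolean compare(double left, String op, double right) {
        switch (op) {
            case "<": return left < right;
            case ">": return left > right;
            case "=": return left == right;
            default: throw new IllegalArgumentException("Unknown operator: " + op);
        }
    }

    private List<Double> list(String name) {
        List<Double> values = data.get(name);
        if (values == null) throw new IllegalArgumentException("Unknown data list: " + name);
        return values;
    }

    private boolean peekIs(String word) {
        return pos < tokens.length && tokens[pos].equals(word);
    }

    private String next() {
        if (pos >= tokens.length) throw new IllegalArgumentException("Unexpected end of rule");
        return tokens[pos++];
    }
}
```

A call such as RuleEvaluator.evaluate("min-3 datalist > 5", data) returns the boolean the question asks for. Once the rules grow beyond this shape, a generated parser will probably give better error messages than this sketch does.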
I have to give the user the option to enter in a text field a mathematical formula and then save it in the DB as a String. That is easy enough, but I also need to retrieve it and use it to do calculations.
For example, assume I allow someone to specify the formula of employee salary calculation which I must save in String format in the DB.
GROSS_PAY = BASIC_SALARY - NO_PAY + TOTAL_OT + ALLOWANCE_TOTAL
Assume that terms such as GROSS_PAY and BASIC_SALARY are known to us and that we can work out what they evaluate to. The real issue is that we can't predict which combinations of such terms and mathematical operators the user may choose to enter (not just +, -, ×, / but also the radical sign, indicating roots, as well as powers and so on). So how do we interpret this formula in string form once we have retrieved it from the DB, so that we can do calculations based on the composition of the formula?
Building an expression evaluator is actually fairly easy.
See my SO answer on how to write a parser. With a BNF for the range of expression operators and operands you exactly want, you can follow this process to build a parser for exactly those expressions, directly in Java.
The answer links to a second answer that discusses how to evaluate the expression as you parse it.
So, you read the string from the database, collect the set of possible variables that can occur in the expression, and then parse/evaluate the string. If you don't know the variables in advance (seems like you must), you can parse the expression twice, the first time just to get the variable names.
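To make the parse-and-evaluate idea concrete, here is a minimal recursive-descent sketch. It is not the code from the linked answers; the class name FormulaEvaluator, the supported operators (+, -, *, /, ^, unary minus, parentheses) and the map of variable values are assumptions chosen for illustration. It evaluates only the right-hand side of a formula, so the "GROSS_PAY =" part would have to be split off first.

```java
import java.util.Map;

/** Hypothetical sketch: recursive-descent evaluation of a formula with named variables. */
final class FormulaEvaluator {
    private final String src;
    private final Map<String, Double> vars;
    private int pos = 0;

    private FormulaEvaluator(String src, Map<String, Double> vars) {
        this.src = src;
        this.vars = vars;
    }

    static double evaluate(String formula, Map<String, Double> vars) {
        FormulaEvaluator evaluator = new FormulaEvaluator(formula, vars);
        double value = evaluator.expression();
        evaluator.skipSpaces();
        if (evaluator.pos != formula.length()) {
            throw new IllegalArgumentException("Unexpected input at position " + evaluator.pos);
        }
        return value;
    }

    // expression := term (('+' | '-') term)*
    private double expression() {
        double value = term();
        while (true) {
            if (eat('+')) value += term();
            else if (eat('-')) value -= term();
            else return value;
        }
    }

    // term := factor (('*' | '/') factor)*
    private double term() {
        double value = factor();
        while (true) {
            if (eat('*')) value *= factor();
            else if (eat('/')) value /= factor();
            else return value;
        }
    }

    // factor := '-' factor | primary ('^' factor)?   ('^' is right-associative)
    private double factor() {
        if (eat('-')) return -factor();
        double base = primary();
        return eat('^') ? Math.pow(base, factor()) : base;
    }

    // primary := number | variable | '(' expression ')'
    private double primary() {
        if (eat('(')) {
            double value = expression();
            if (!eat(')')) throw new IllegalArgumentException("Missing ')' at position " + pos);
            return value;
        }
        skipSpaces();
        int start = pos;
        if (pos < src.length() && isNumberChar(src.charAt(pos))) {
            while (pos < src.length() && isNumberChar(src.charAt(pos))) pos++;
            return Double.parseDouble(src.substring(start, pos));
        }
        while (pos < src.length() && (Character.isLetterOrDigit(src.charAt(pos)) || src.charAt(pos) == '_')) pos++;
        String name = src.substring(start, pos);
        Double value = vars.get(name); // variables are bound by lookup, never by executing user text
        if (value == null) {
            throw new IllegalArgumentException("Unknown variable '" + name + "' at position " + start);
        }
        return value;
    }

    private static boolean isNumberChar(char c) {
        return Character.isDigit(c) || c == '.';
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    // Consumes the given character (after optional whitespace) if it is next in the input.
    private boolean eat(char expected) {
        skipSpaces();
        if (pos < src.length() && src.charAt(pos) == expected) {
            pos++;
            return true;
        }
        return false;
    }
}
```

Because unknown names and stray characters raise an exception that carries a position, the same structure can report to the user where a stored formula is broken. Roots could be added as a named function or expressed as a fractional power.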
As described in "Evaluating a math expression given in string form", there is a JavaScript engine available in Java that can evaluate a string containing operators.
Hope this helps.
You could build a string representation of a class that effectively wraps your expression and compile it using the system JavaCompiler, although that requires a file system. Alternatively, you can evaluate strings directly using JavaScript or Groovy. In each case, you need to figure out a way to bind variables. One approach would be to use a regex to find known variable names and replace them with a call to a binding function:
getValue("BASIC_SALARY") - getValue("NO_PAY") + getValue("TOTAL_OT") + getValue("ALLOWANCE_TOTAL")
or
getBASIC_SALARY() - getNO_PAY() + getTOTAL_OT() + getALLOWANCE_TOTAL()
This approach, however, exposes you to all kinds of injection-type security bugs, so it would not be appropriate if security is required. The approach is also weak when it comes to error diagnostics: how will you tell the user why their expression is broken?
An alternative is to use something like ANTLR to generate a parser in Java. It's not too hard and there are a lot of examples. This approach will provide both security (users can't inject malicious code because it won't parse) and diagnostics.
This is the code sample that I want to parse. I want getSaveablePaymentMethodsSmartList() as a token when I override the function in the parserBaseListener.java file created by ANTLR.
/** @suppress */
public any function getSaveablePaymentMethodsSmartList() {
    if(!structKeyExists(variables, "saveablePaymentMethodsSmartList")) {
        variables.saveablePaymentMethodsSmartList = getService("paymentService").getPaymentMethodSmartList();
        variables.saveablePaymentMethodsSmartList.addFilter('activeFlag', 1);
        variables.saveablePaymentMethodsSmartList.addFilter('allowSaveFlag', 1);
        variables.saveablePaymentMethodsSmartList.addInFilter('paymentMethodType', 'creditCard,giftCard,external,termPayment');
        if(len(setting('accountEligiblePaymentMethods'))) {
            variables.saveablePaymentMethodsSmartList.addInFilter('paymentMethodID', setting('accountEligiblePaymentMethods'));
        }
    }
    return variables.saveablePaymentMethodsSmartList;
}
I already have the grammar that parses function declarations, but I need a new rule that can associate doctype comments with a function declaration and give the function name as a separate token if there is a doctype comment associated with it.
The grammar looks like this:
functionDeclaration
    : accessType? typeSpec? FUNCTION identifier
      LEFTPAREN parameterList? RIGHTPAREN
      functionAttribute* body=compoundStatement
    ;
You want grammar rules that:
return X if something "far away" in the source is an A,
return Y if something far away is a B (or ...).
In general, this is context dependency. It is not handled well by context free grammars, which is something that ANTLR is trying to approximate with its BNF rules. In essence, what you think you want to do is to encode history of what the parser has seen long ago, to influence what is being produced now. Generally that is hard.
The usual solution to something like this is to not address it in the grammar at all. Instead:
have the grammar rules produce an X regardless of what is far away,
build a tree as you parse (ANTLR does this for you); this captures not only X but everything about the parsed entity, including tokens for A that are far away
walk over the tree, interpreting a found X as Y if the tree contains the A (usually far away in the tree).
For your specific case of docstring-influences-function name, you can probably get away with encoding far away history.
You need (IMHO, ugly) grammar rules that look something like this:
functionDeclaration: documented_function | undocumented_function ;
documented_function: docstring accessType? typeSpec? FUNCTION
    documented_function_identifier rest_of_function ;
undocumented_function: accessType? typeSpec? FUNCTION
    identifier rest_of_function ;
rest_of_function: // avoids duplication, not pretty
    LEFTPAREN parameterList? RIGHTPAREN
    functionAttribute* body=compoundStatement ;
You have to recognize the docstring as an explicit token that can be "seen" by the parser, which means modifying your lexer to turn docstrings from comments (e.g., whitespace) into tokens. [This is the first ugly thing.] Then, having seen such a docstring, the lexer has to switch to a lexical mode that will pick up identifier-like text and produce documented_function_identifier, and then switch back to normal mode. [This is the second ugly thing.] What you are doing is literally implementing a context dependency.
The reason you can accomplish this in spite of my remarks about context dependency is that A is not very far away; it is within a few tokens of X.
So, you could do it this way. I would not do this; you are trying to make the parser do too much. Stick to the "usual solution". (You'll have a different problem: your A is a comment/whitespace, and probably isn't stored in the tree by ANTLR. You'll have to solve that; I'm not an ANTLR expert.)
I am trying to understand if I'm missing something, because it seems very bizarre to me why Jsoup includes the current element in the search performed by select.
For example (Scala code):
val el = doc.select("div").first
el.select("div").contains(el) // => true
What is the point of this? I see very limited cases where you'd actually want this. Do I need to always use el.children.select instead? Is there a nicer method?
Side question: Is there a nicer way to do el.children.select(s).first? In Ruby Nokogiri it would be el.at_css(s) which is much shorter, is there a similar option in Jsoup?
As to why the select method was implemented the way it was, my only guess is that it's the most straightforward way to do it if we take into consideration the structure that holds the data resulting from your query.
If we think about el, we will see that it is a "tree" representation of the elements that you asked for, having as root the first parent div node. Then you call select on that tree. Now it all depends on how you decide to see this tree. Should we treat this "tree" as a whole (include root) or not (discard root)? It's a matter of taste I guess.
Judging from my own experience, a lot of people using Jsoup have probably had some experience with DOM parsing in jQuery. The equivalent would be something like $("div").first().find("div"), where find is documented as
Get the descendants of each element in the current set of matched elements, filtered by a selector, jQuery object, or element.
This is in agreement with what you stated. It's just a matter of how the two libraries "see" the resulting tree. Jsoup treats the root as one of the nodes, while jQuery differentiates the root (as far as find is concerned).
About the second part of your question.
val el = doc.select("div").first
el.children.select(s).first
No, there isn't. The only way is to change the CSS selector.
val result = doc.select("div:eq(0) " + s).first;
I'm looking for an XPath evaluator that doesn't build the whole DOM document in order to look up nodes: the goal is to process a large amount of XML data (ideally over 2 GB) with the SAX model, which is very good for memory management, while still being able to search for nodes.
Thank you all for the support!
For all those who say it's not possible: after asking the question, I found a project named "saxpath" (http://www.saxpath.org/), but I can't find any project that implements it.
My current list (compiled from web search results and the other answers) is:
http://code.google.com/p/xpath4sax/
http://spex.sourceforge.net/
https://github.com/santhosh-tekuri/jlibs/wiki/XMLDog (also contains a performance chart)
http://www.cs.umd.edu/projects/xsq/ (university project, dead for 10 years, GPL)
MIT-Licensed approach http://softwareengineeringcorner.blogspot.com/2012/01/conveniently-processing-large-xml-files.html
Other parsers/memory models supporting fast XPath:
http://vtd-xml.sourceforge.net/ ("The world's fastest XPath 1.0 implementation.")
http://jaxen.codehaus.org/ (contains http://www.saxpath.org/)
http://www.saxonica.com/documentation/sourcedocs/streaming/streamable-xpath.html
The next step is to use the examples of XMLDog and compare the performance of all these approaches. Then, the test cases should be extended to the supported XPath expressions.
We regularly parse 1GB+ complex XML files by using a SAX parser which extracts partial DOM trees that can be conveniently queried using XPath. I blogged about it here: http://softwareengineeringcorner.blogspot.com/2012/01/conveniently-processing-large-xml-files.html - Sources are available on github - MIT License.
XPath DOES work with SAX, and most XSLT processors (especially Saxon and Apache Xalan) support executing XPath expressions inside XSLTs on a SAX stream without building the entire DOM.
They manage to do this, very roughly, as follows:
Examining the XPath expressions they need to match
Receiving SAX events and testing if that node is needed or will be needed by one of the XPath expressions.
Ignoring the SAX event if it is of no use for the XPath expressions.
Buffering it if it's needed
How they buffer it is also very interesting, because while some simply create DOM fragments here and there, others use highly optimized tables for quick lookup and reduced memory consumption.
How much they manage to optimize largely depends on the kind of XPath queries they find. As the already posted Saxon documentation clearly explains, queries that move "up" and then traverse "horizontally" (sibling by sibling) obviously require the entire document to be there, but most queries require just a few nodes to be kept in RAM at any moment.
I'm pretty sure of this because, when I was still building web apps every day with Cocoon, we hit the XSLT memory footprint problem each time we used a "//something" expression inside an XSLT, and quite often we had to rework XPath expressions to allow better SAX optimization.
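The filtering steps described in this answer can be illustrated with a small hand-written SAX handler. This is only a sketch of the general idea, not how Saxon or Xalan are implemented: it supports a single absolute path made of child steps (no predicates, no "//"), it tracks the path of currently open elements, and it buffers character data only while inside a matching element. The class name StreamingPathMatcher and the method select are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: collects the text of elements matching a simple path such as /doc/nodes/node. */
final class StreamingPathMatcher extends DefaultHandler {
    private final String targetPath;
    private final Deque<String> openElements = new ArrayDeque<>();
    private final List<String> matches = new ArrayList<>();
    private StringBuilder buffer; // non-null only while inside a matching element

    private StreamingPathMatcher(String targetPath) {
        this.targetPath = targetPath;
    }

    static List<String> select(String xml, String targetPath) {
        StreamingPathMatcher handler = new StreamingPathMatcher(targetPath);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return handler.matches;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        openElements.addLast(qName);
        if (currentPath().equals(targetPath)) {
            buffer = new StringBuilder(); // this node is needed: start buffering
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (buffer != null) {
            buffer.append(ch, start, length); // text outside matching elements is ignored
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (buffer != null && currentPath().equals(targetPath)) {
            matches.add(buffer.toString());
            buffer = null; // release the buffer as soon as the element is closed
        }
        openElements.removeLast();
    }

    private String currentPath() {
        return "/" + String.join("/", openElements);
    }
}
```

Memory use here is bounded by the depth of the document plus the size of one matching element at a time (the result list aside), which is the property that makes this style attractive for very large files. A real input would be passed as a stream or file rather than a String.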
SAX is forward-only, while XPath queries can navigate the document in any direction (consider the parent::, ancestor::, preceding:: and preceding-sibling:: axes). I don't see how this would be possible in general. The best approximation would be some sort of lazy-loading DOM, but depending on your queries this may or may not give you any benefit: there is always a worst-case query such as //*[. != preceding::*].
Sorry, a slightly late answer here - it seems that this is possible for a subset of XPath - in general it's very difficult due to the fact that XPath can match both forwards and backwards from the "current" point. I'm aware of two projects that solve it to some degree using state machines: http://spex.sourceforge.net & http://www.cs.umd.edu/projects/xsq. I haven't looked at them in detail but they seem to use a similar approach.
I'll toss in a plug for a new project of mine, called AXS. It's at https://code.google.com/p/annotation-xpath-sax/ and the idea is that you annotate methods with (forward-axis-only) XPath statements and they get called when the SAX parser is at a node that matches it. So with a document
<doc>
  <nodes>
    <node name="a">text of node 1</node>
    <node name="b">text of node 2</node>
    <node otherattr="I have attributes!">text of node 3</node>
  </nodes>
</doc>
you can do things like
@XPath("/nodes/node")
void onNode(String nodeText)
{
    // will be called with "text of node [123]"
}
or
@XPathStart("//node[@name='']")
void onNode3(Attrs node3Attrs) { ... }
or
@XPathEnd("/nodes/node[2]")
void iDontCareAboutNode3() throws SAXException
{
    throw new StopParsingException();
}
Of course, the library is so new that I haven't even made a release of it yet, but it's MIT licensed, so feel free to give it a try and see if it matches your need. (I wrote it to do HTML screen-scraping with low enough memory requirements that I can run it on old Android devices...) If you find bugs, please let me know by filing them on the googlecode site!
There are SAX/StAX-based XPath implementations, but they only support a small subset of XPath expressions and axes, largely due to the forward-only nature of SAX/StAX. The best alternative I am aware of is extended VTD-XML: it supports full XPath, partial document loading via memory mapping, and a maximum document size of 256 GB, but you will need a 64-bit JVM to use it to its full potential.
What you could do is hook an XSL transformer to a SAX input source. Your processing will be sequential, and the XSL processor will attempt to consume the input as it arrives and shape it into whatever result you specified. You can use this to pull a path's value out of the stream. This would come in especially handy if you wanted to produce a bunch of different XPath results in one pass.
You'll get (typically) an XML document as a result, but you could pull your expected output out of, say, a StreamResult with not too much hassle.
Have a look at the streaming mode of the Saxon-SA XSLT-processor.
http://www.saxonica.com/documentation/sourcedocs/serial.html
"The rules that determine whether a path expression can be streamed are:
The expression to be streamed starts with a call on the document() or doc() function.
The path expression introduced by the call on doc() or document must conform to a subset of XPath defined as follows:
any XPath expression is acceptable if it conforms to the rules for path expressions appearing in identity constraints in XML Schema. These rules allow no predicates; the first step (but only the first) can be introduced with "//"; the last step can optionally use the attribute axis; all other steps must be simple Axis Steps using the child axis.
In addition, Saxon allows the expression to contain a union, for example doc()/(*/ABC | /XYZ). Unions can also be expressed in abbreviated form, for example the above can be written as doc()//(ABC|XYZ).
The expression must either select elements only, or attributes only, or a mixture of elements and attributes.
Simple filters (one or more) are also supported. Each filter may apply to the last step or to the expression as a whole, and it must only use downward selection from the context node (the self, child, attribute, descendant, descendant-or-self, or namespace axes). It must not be positional (that is, it must not reference position() or last(), and must not be numeric: in fact, it must be such that Saxon can determine at compile time that it will not be numeric). Filters cannot be applied to unions or to branches of unions. Any violation of these conditions causes the expression to be evaluated without the streaming optimization.
These rules apply after other optimization rewrites have been applied to the expression. For example, some FLWOR expressions may be rewritten to a path expression that satisfies these rules.
The optimization is enabled only if explicitly requested, either by using the saxon:stream() extension function, or the saxon:read-once attribute on an XSLT xsl:copy-of instruction, or the XQuery pragma saxon:stream. It is available only if the stylesheet or query is processed using Saxon-SA."
Note: This facility is most likely available only in the commercial version. I have used Saxon extensively in the past, and it is a nice piece of work.
Mmh I don't know if I really understand you. As far as I know, the SAX model is event oriented. That means, you do something if a certain node is encountered during the parsing. Yeah, it is better for memory but I don't see how you would like to get XPath into it. As SAX does not build a model, I don't think that this is possible.
I don't think XPath works with SAX, but you might take a look at StAX, which is an extended streaming XML API for Java.
http://en.wikipedia.org/wiki/StAX
The standard javax xpath API technically already works with streams; javax.xml.xpath.XPathExpression can be evaluated against an InputSource, which in turn can be constructed with a Reader. I don't think it constructs a DOM under the covers.
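For reference, this is roughly what that call looks like. The sketch only demonstrates the API shape; the class name XPathOverReader is hypothetical. Whether the input is really streamed is an implementation detail that the API does not promise, and common implementations are understood to parse the source into an in-memory tree before evaluating, so memory use on large files should be measured rather than assumed.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

/** Hypothetical sketch: evaluates an XPath expression against XML supplied through a Reader. */
final class XPathOverReader {
    static String evaluate(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            XPathExpression compiled = xpath.compile(expression);
            // The InputSource wraps a Reader; no Document is built by the caller.
            return compiled.evaluate(new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + expression, e);
        }
    }
}
```

With a real file, the StringReader would be replaced by a FileReader or an InputStream-based InputSource.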
Have you also tried QuiXPath (https://code.google.com/p/quixpath/)?