Performing complicated XPath queries in Scala


What's the simplest API to use in scala to perform the following XPath queries on a document?
//s:Annotation[@type='attitude']/s:Content/s:Parameter[@role='type' and not(text())]
//s:Annotation[s:Content/s:Parameter[@role='id' and not(text())]]/@type
(s is defined as a nickname for a particular namespace)
The only documentation I can find on Scala's XML libraries has no information on performing complicated real XPath queries.
I used to like JDOM for this purpose (in Java), but since JDOM doesn't support generics, it will be painful to work with in Scala. (Other XML libraries for Java have tended to be even more painful in Java, but I admit I don't know the landscape real well.)

//s:Annotation[@type='attitude']/s:Content/s:Parameter[@role='type' and not(text())]
Well, I don't understand the s: notation, and couldn't find it in the XPath spec either. However, ignoring that, this would look like this:
(
  (xml
    \\ "Annotation"
    filter (_ \ "@type" contains Text("attitude"))
  )
  \ "Content"
  \ "Parameter"
  // not(text()): keep only elements that have no text child
  filter (el => (el \ "@role" contains Text("type")) && !el.child.exists(_.isInstanceOf[Text]))
)
Note the necessity of parentheses because of the higher precedence of \ over filter. I have changed the formatting to a multi-line expression as the Scala equivalent is just way too verbose for a single line.
I can't answer about namespaces, though. No clue how to work with them on searches, if it's even possible. The docs mention @{uri}attribute for prefixed attributes, but do not mention anything about prefixed elements. Also, note that you need to pass a URI which resolves to the namespace you want, as literal namespace prefixes in searches are not supported.
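For the namespace part, one possible route is the JDK's own javax.xml.xpath API, which resolves prefixes once a NamespaceContext is registered and can be called from Scala as-is. The following is only a minimal sketch in Java: the class name, the select helper, and the URI http://example.com/s bound to the s prefix are assumptions made for illustration, not anything from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespacedXPath {
    // Hypothetical namespace URI for the "s" prefix; use the real one from your documents.
    static final String S_URI = "http://example.com/s";

    /** Evaluates an XPath expression and returns the text content of each matched node. */
    static List<String> select(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default; prefixed names cannot match without it
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "s".equals(prefix) ? S_URI : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return S_URI.equals(uri) ? "s" : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    return Collections.<String>emptyList().iterator();
                }
            });

            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }
}
```

Calling select with the second query from the question returns the type attribute values of the matching annotations. If the same expression is evaluated many times, XPath.compile can be used to prepare it once.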

I think I'm going to go with lightly pimping XOM. It's a bit of a shame the XOM authors decided against exposing collections of child nodes and the like, but they had more work and less advantage to doing so in Java than in Scala. (And it is an otherwise well-designed library.)
EDIT: I wound up pimping JDOM after all, because XOM doesn't compile XPath queries ahead of time. Since most of my effort was directed towards XPath this time, I was able to come up with a good model that sidesteps most of the generics issues. It shouldn't be too hard to come up with reasonable genericized versions of the methods getChildren and getAttributes and getAdditionalNamespaces in org.jdom.Element (by pimping the library with new methods that have slightly changed names.) I don't think there's a fix for getContent, and I'm not sure about getDescendants.

Scales Xml adds both string based full XPath evaluation and an internal DSL providing a fairly complete coverage for querying

I guess when scalaxmljaxen is mature, we'll be able to do this reliably on scala's built-in XML classes.

I would suggest using kantan.xpath:
import kantan.xpath._
import kantan.xpath.implicits._
input.evalXPath[List[String]](xp"/annotation[@type='attitude']/content/parameter[@role='type' and not(text())]/@value")
This yields:
res1: kantan.xpath.XPathResult[List[String]] = Success(List(foobar))

Related

Higher-level, semantic search-and-replace in Java code from command-line

Command-line tools like grep, sed, awk, and perl allow one to carry out textual search-and-replace operations.
However, is there any tool that would allow me to carry out semantic search-and-replace operations in a Java codebase, from command-line?
The Eclipse IDE allows me, e.g., to easily rename a variable, a field, a method, or a class. But I would like to be able to do the same from command-line.
The rename operation above is just one example. I would further like to be able to select the replacee text with additional semantic constraints such as:
only the scopes of methods M1, M2 of classes C, D, and E;
only all variables or fields of class C;
all expressions in which a variable of some class occurs;
only the scope of the class definition of a variable;
only the scopes of all overridden versions of method M of class C;
etc.
Having selected the code using such arbitrary semantic constraints, I would like to be able to then carry out arbitrary transformations on it.
So, basically, I would need access to the symbol-table of the code.
Question:
Is there an existing tool for this type of work, or would I have to build one myself?
Even if I have to build one myself, do any tools or libraries exist that would at least provide me the symbol-table of Java code, on top of which I could add my own search-and-replace and other refactoring operations?

The only tool that I know can do this easily is the long awaited Refaster. However it is still impossible to use it outside of Google. See [the research paper](http://research.google.com/pubs/pub41876.html) and status on using Refaster outside of Google.
I am the author of AutoRefactor, and I am very interested in implementing this feature as part of this project. Please follow up on the github issue if you would like to help.

What you want is the ability to find code according to syntax, constrained by various semantic conditions, and then be able to replace the found code with new syntax.
Access to the symbol table (symbol type/scope/mentions in scope) is just one kind of semantic constraint. You'll probably want others, such as control flow sequencing (this happens after that) and data flow reaching (data produced here is consumed there). In fact there are an unbounded number of semantic conditions you might consider important, depending on the properties of the language (does this function access data in parallel to that function?) or your application interests (is this matrix an upper triangular matrix?).
In general you can't have a tool that has all possible semantic conditions of interest off the shelf. That means you need to be able to express new semantic conditions when you discover the need for them.
The best you might hope for is a tool that
knows the language syntax
has some standard semantic properties built in (my preference is symbol tables, control and data flow analysis)
can express patterns on the source in terms of the source code
can constrain the patterns based on such semantic properties
can be extended with new semantic analyses to provide additional properties
There is a classic category of tools that do this, called source-to-source program transformation systems.
My company offers the DMS Software Reengineering Toolkit, which is one of these. DMS has been used to carry out production transformations at scale on a wide variety of languages (including OP's target: Java). DMS's rewrite rules are of the form:
rule <rule_name>(syntax_parameters): syntax_category =
<match_pattern> -> <replacement_pattern>
if <semantic_condition>;
You can see a lot more detail of what the pattern language and rewrite rules look like here: DMS Rewrite Rules.
It is worth noting that the rewrite rules represent operations on trees. This means that while they might look like text string matches, they are not. Consequently a rewrite rule matches in spite of any whitespace issues (and in DMS's case, even in spite of differences in number radix or character string escapes). This makes the DMS pattern matches far more effective than a regex, and a lot easier to write since you don't have to worry about these issues.
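To make the tree-versus-text point concrete, here is a toy sketch in Java. It is not DMS and does not use DMS's rule language; the Node class and the single rule "e + 0 -> e" are invented for illustration. The rule is applied to an already-parsed expression tree, so spacing or line breaks in the original source text cannot affect whether it matches.

```java
class TreeRewriteSketch {
    /** A tiny expression tree: either a leaf token or a binary operation. */
    static final class Node {
        final String op;      // null for a leaf
        final String value;   // leaf text such as "x" or "0"; null for an operation
        final Node left;
        final Node right;

        Node(String value) {
            this.op = null;
            this.value = value;
            this.left = null;
            this.right = null;
        }

        Node(String op, Node left, Node right) {
            this.op = op;
            this.value = null;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf(String text) {
            return op == null && value.equals(text);
        }

        @Override
        public String toString() {
            return op == null ? value : "(" + left + " " + op + " " + right + ")";
        }
    }

    /** Applies the rule "e + 0 -> e" bottom-up across the whole tree. */
    static Node simplify(Node node) {
        if (node.op == null) {
            return node; // leaves are left unchanged
        }
        Node left = simplify(node.left);
        Node right = simplify(node.right);
        if (node.op.equals("+") && right.isLeaf("0")) {
            return left; // the match is on tree shape, not on characters
        }
        return new Node(node.op, left, right);
    }
}
```

A semantic condition, in this picture, would be an extra check (for example against a symbol table) guarding the point where the rule returns its replacement.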
This Software Recommendations link shows how one can define rules with DMS, and (as per OP's request) "run them from the command line". This isn't as succinct as running SED, but then it is doing much more complex tasks.
DMS has a Java front end with symbol tables, control and data flow analysis. If one wants additional semantic analyses, one codes them in DMS's underlying programming language.

OGNL Expression Parsing vs Compilation

In OGNL, it is recommended to parse expressions that are reused in order to improve performance.
When consulting the API, I also noticed that there is a compileExpression method:
After searching thoroughly for information on compilation vs parsing, the only article I could find is part of the Struts documentation, and mentions how to do it, but not what it does compared to parsing.
Under what conditions should you use compilation instead of parsing alone, and are there significant performance benefits to be gained from compiling an expression compared to simply parsing that same expression?
From the method signatures, it appears that Ognl.parseExpression() produces an input-independent object, but Ognl.compileExpression() produces an object that depends upon the given input (root and context). Is this correct?

That http://struts.apache.org/release/2.3.x/docs/ognl-expression-compilation.html link is pretty old and I'm not sure if it's outdated or not but it's the only real documentation I ever wrote on how to use the javassist-based expression JIT code.
It's only a relevant concern if your own use of something either directly or indirectly using ognl shows a performance hit in that area. The normal expression evaluation mechanism is probably more than adequate for most needs but this extra step turns what is basically a java reflection chain of invocation calls into pure java equivalents so it eliminates almost entirely any hit you might otherwise incur using OGNL because of reflection.
Really, if you aren't sure if you need it you probably don't. Sorry I never got around to integrating the concept thoroughly into OGNL without so much scary looking extra work. Probably would've been best as an optional configuration setting in OGNL that was turned off or on but .. Feel free to fork on github if you want. =)

Formal or Practical XML Tag Length Limit?

I've not managed to find any mention of a limit to xml tag length on the web. I'm looking to build XML Schemas that act as a specification for 3rd parties to send data to us.
The Schema (and the data) are supposed to conform to our custom ontology/data dictionary thingy which is hierarchical and user-customizable.
The natural mapping is for nodes in the hierarchy to be used to name types and tags in the XSD/XML. Because however leaf node names in the ontology do not have to be unique, I am considering encoding the full path of nodes in the hierarchy as the tag name, suitably mangled for XML lexical rules.
So if my ontology has multiple 'lisa' nodes meaning different things as they are at different places in the hierarchy I could use the full path to the nodes to generate different XML types/tag names, so you can have
<abe_homer_lisa> simpsons lisa ... </abe_homer_lisa>
<applei_appleii_lisa> ... apple lisa </applei_appleii_lisa>
<mona_lisa> and paintings </mona_lisa>
... data for any of the different 'lisa' types in the same file without ambiguity.
I can't find anything on the web that specifies a maximum tag length (or a minimum supported tag length for standards-compliant engines). (Good summary of the lexical rules for XML here)
The same thing was asked about attribute length and if the standard specifies no limit for attributes then I doubt there's one for tags, but there may be a practical limit.
I suspect even a practical limit would be vastly bigger than my needs (I would expect things to be smaller than 255 chars most of the time); basically if the Java XML processors, standard ETL tools and the common XSLT processors can all handle tags much bigger than this then it won't be an issue.

I think you're unlikely to find tools that can't handle names of say 1K characters, at which point you're hitting serious performance and usability problems rather than hard limits.
But your design is wrong. XML is hierarchic, take advantage of the fact rather than trying to fight it.

There is no limit to tag name lengths that I know of but there can be some implementation limits depending on the tool that tries to parse the XML even if the XML specification may not mention any limits.
On the other hand why not use XML's native & inherently hierarchical structure. Why encode everything as <abe_homer_lisa> instead of encoding it as:
<abe>
<homer>
<lisa>simpsons lisa</lisa>
</homer>
</abe>
<applei>
<appleii>
<lisa> ... apple lisa </lisa>
</appleii>
</applei>

I would strongly suggest using an established XML mechanism to distinguish elements, namely namespaces. That way you would have e.g.
<lisa xmlns="http://example.com/simpsons">..</lisa>
<lisa xmlns="http://example.com/apple">...</lisa>
Both the W3C schema language as well as XSLT and XPath fully support namespaces.
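As a rough illustration of how a consumer tells the two lisa elements apart, here is a sketch using the JDK's DOM parser in Java. The class and method names are made up for the example; the URIs are the example.com ones from above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class LisaNamespaces {
    /** Counts the lisa elements that belong to the given namespace URI. */
    static int count(String xml, String namespaceUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups to work
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagNameNS(namespaceUri, "lisa").getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }
}
```

The local name is lisa in both cases; only the namespace URI passed to the lookup decides which elements are returned.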

Based on the comments of Michael Kay (something of an expert on XML) and Mihai Stancu above I'd say the answer to my original question was:
No official limit
Tools likely to support 1000+ chars as an absolute minimum
Likely to hit problems in performance [given an XML tool processing those files would have to do lots of string indexing and comparison on very long strings] and usability way before then
XML namespaces and/or using the structure of the document tree to provide discriminating context would probably be better ways of "uniquifying" tag names
I was after an answer to that very specific question about legal tag length, and since I found the same question asked about attribute length but not tags I thought it might be worth having "an" answer around in case someone else googles it. Thanks to all respondents. Valid points about whether my design was sensible; I'll explain the rationale elsewhere.
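Since any practical limit depends on the tool, a quick check against the parser you actually use may be more convincing than a general statement. Below is a small Java sketch of such a check against the JDK's built-in DOM parser; the class and method names are invented for the example, and it says nothing about other processors. Some parsers also expose configurable processing limits, so a result on one installation is not a guarantee for another.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class LongTagNameCheck {
    /** Parses a document whose root tag name has the given length and returns the parsed name's length. */
    static int parsedRootNameLength(int length) {
        StringBuilder name = new StringBuilder();
        for (int i = 0; i < length; i++) {
            name.append('a');
        }
        String xml = "<" + name + ">x</" + name + ">";
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTagName().length();
        } catch (Exception e) {
            // A parser that rejects the name would end up here.
            throw new IllegalStateException("Parser rejected a name of length " + length, e);
        }
    }
}
```

Calling parsedRootNameLength(255) covers the size mentioned in the question; larger values can be tried the same way.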

Thanks to those who pointed out there might be more sensible ways to address the underlying problem (ensuring types/tag names in an XML schema are unique).
Re using a hierarchy of nodes to provide the context:
I agree this would generally be appropriate. However (I didn't really explain my precise problem domain in the q) in this particular case, the user-configurable grouping of items in the tree-structure data dictionary I have to deal with is pretty arbitrary and has almost nothing to do with relationships in the data that the dictionary describes.
So in the
<abe>
<homer>
<lisa>lisa1</lisa>
</homer>
</abe>
example should another lisa node be under the same homer node, or a different one? Should the homers be under the same abe node or not? In the case of the data in question, the distinction is more or less meaningless: it would be like grouping data according to the page of an index it happened to be referenced on in a particular book. I suppose I could just make an arbitrary call and lock it down in the XSD.
If using something like XSL to extract data then it wouldn't matter, //abe/homer/lisa would get all of the lisa nodes irrespective of how they were grouped together. In practice someone is likely to be generating these from CSV files or whatever so I'd prefer as flat a structure as possible.
Ditto for namespaces: although they're designed for this very purpose (providing context for a name and ensuring that accidental clashes do not cause ambiguity when different types of data are bundled together in a file), in practice they'd add an extra layer of complexity to whoever generates the data from source systems.
In my precise circumstances, I expect name clashes in this arbitrary grouping to be pretty unlikely (and reflect poor usage), and hence just need reasonable handling, without imposing an undue penalty on the majority case.

Contrary to conventional wisdom, I would strongly advise against using the so-called XML Namespaces mechanism. Over the longer haul, it will cause you pain. Just say no to namespaces. You do not need them.
Your intuition that elements can be distinguished by their context - represented, in this case, by their "paths" - is correct. However, your idea of encoding the entire path into the name of an element may not be optimal. Consider instead using the simple name, along with an attribute to hold the context or path. (Name this attribute 'context' or 'path' or anything more evocative!) This will be enough to distinguish the meanings.[*]
For varying content models, you can use a variant of the same technique. Give each different type a circumstantially convenient name, and record the "real" name in another attribute named, say 'ontology'.
As for your question, the XML spec does not place any inherent limitation on the length of names, although for purely technical reasons you may find a limit of 65536 characters quoted in some places. That same "limitation" may also apply to the length of attribute value literals. At an average of 20 characters per atomic name, 20 levels of hierarchy would still amount to fewer than 500 bytes for a path, so you probably have little to worry about.
[*] Note: this technique is actually very old, but almost completely forgotten in XML mindspace. In HTML, for example, there is a single element type named INPUT to cover all sorts of GUI controls, and yet there is no confusion, thanks to the 'type' attribute.

Java 1.5: mathematical formula parser

Hello, I often develop JTableModels in which some cells must contain the result of applying a simple mathematical formula. These formulas can have:
Operators (+,-,*,/)
Number constants
Other cell references (which contains numbers)
Parameters (numbers with a reference name like "INTEREST_RATE")
I usually solve this by writing a small calculator class that parses the formula, whose syntax I define. The calculator class uses a stack for the calculations, and the syntax always uses Polish notation.
But the Polish notation is unnatural for me and for my users. So my question is...
Is there a library that runs on 1.5 JVMs, handles my requirements, and uses normal notation (with brackets; I don't know the name of this notation style) for formulas?
P.S. It can be assumed that the formulas are always syntactically correct, and I can preprocess the numbers that are not constants to provide their values.

Have you thought about the benefits of JSR-223? In a few words, this spec allows Java developers to integrate dynamic languages and their parsers with great ease. With such a parser, your need to define a parser turns into the need to define an internal DSL, which comes down to simply creating a good API and letting your users choose whichever JavaScript/Groovy/Scala/whatever syntax they happen to prefer.

Try JEP.
You can define new variables for the parser, so it can handle reference names like "INTEREST_RATE". But you have to define them beforehand.
As for cell references, you will have to extract the numbers and edit the expression accordingly, or there might be some options which I'm not yet aware of.

If you can't use Java 6 and its scripting support then have a look at the Apache Bean Scripting Framework (BSF). From that page:
... BSF 3.x will run on Java 1.4+, allowing access to JSR-223 scripting for Java 1.4 and Java 1.5.

I released an expression evaluator based on Dijkstra's Shunting Yard algorithm, under the terms of the Apache License 2.0:
http://projects.congrace.de/exp4j/index.html
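For readers who want to see the idea behind the shunting-yard approach, here is a minimal sketch in Java. It is not exp4j's implementation: the class name, the evaluate signature, and the choice to resolve both parameters and cell references through one map are assumptions for illustration. It handles +, -, *, / and brackets, evaluates directly with two stacks instead of first producing postfix output, does not support unary minus, and assumes syntactically correct input as stated in the question. It sticks to java.util.Stack and avoids newer language features so that it should also compile on a 1.5 JVM.

```java
import java.util.Map;
import java.util.Stack;

class InfixEvaluator {
    /** Evaluates an infix formula; names (parameters or cell references) are looked up in the map. */
    static double evaluate(String formula, Map<String, Double> names) {
        Stack<Double> values = new Stack<Double>();
        Stack<Character> operators = new Stack<Character>();
        int i = 0;
        while (i < formula.length()) {
            char c = formula.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isDigit(c) || c == '.') {
                int start = i;
                while (i < formula.length()
                        && (Character.isDigit(formula.charAt(i)) || formula.charAt(i) == '.')) {
                    i++;
                }
                values.push(Double.parseDouble(formula.substring(start, i)));
            } else if (Character.isLetter(c) || c == '_') {
                int start = i;
                while (i < formula.length()
                        && (Character.isLetterOrDigit(formula.charAt(i)) || formula.charAt(i) == '_')) {
                    i++;
                }
                String name = formula.substring(start, i);
                Double value = names.get(name);
                if (value == null) {
                    throw new IllegalArgumentException("Unknown name: " + name);
                }
                values.push(value);
            } else if (c == '(') {
                operators.push(c);
                i++;
            } else if (c == ')') {
                // Apply everything back to the matching opening bracket.
                while (operators.peek() != '(') {
                    apply(values, operators.pop());
                }
                operators.pop(); // discard the "("
                i++;
            } else if (precedence(c) > 0) {
                // Left-associative: first apply stacked operators of equal or higher precedence.
                while (!operators.isEmpty() && precedence(operators.peek()) >= precedence(c)) {
                    apply(values, operators.pop());
                }
                operators.push(c);
                i++;
            } else {
                throw new IllegalArgumentException("Unexpected character: " + c);
            }
        }
        while (!operators.isEmpty()) {
            apply(values, operators.pop());
        }
        return values.pop();
    }

    /** Returns 0 for anything that is not a binary operator, including "(". */
    static int precedence(char op) {
        switch (op) {
            case '+': case '-': return 1;
            case '*': case '/': return 2;
            default: return 0;
        }
    }

    /** Pops two operands, applies the operator, and pushes the result. */
    static void apply(Stack<Double> values, char op) {
        double right = values.pop();
        double left = values.pop();
        switch (op) {
            case '+': values.push(left + right); break;
            case '-': values.push(left - right); break;
            case '*': values.push(left * right); break;
            case '/': values.push(left / right); break;
            default: throw new IllegalArgumentException("Unknown operator: " + op);
        }
    }
}
```

A table model could fill the map with the current cell values and named parameters before each evaluation, for example A1 and INTEREST_RATE for the formula (A1 + 2) * INTEREST_RATE.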

There's a commercial tool called formula4j which may be useful to some.
It has no direct help for cell references. You would have to handle those yourself, and translate the cell references into values.

Implement a Custom Escaper in Freemarker

Freemarker has the ability to do text escaping using something like this:
<#escape x as x?html>
Foo: ${someVal}
Bar: ${someOtherVal}
</#escape>
xml, xhtml, and html are all built in escapers. Is there a way to register a custom written escaper? I want to generate CSV and have each individual element escaped and that seems like a good mechanism.
I'm trying to do this in Struts 2 if that matters as well.

You seem to be confusing two concepts here. ?xml, ?xhtml and ?html are string built-ins.
<#escape> OTOH is syntax sugar to save you from typing the same expression over and over again. It can be used with any expression, it's not limited to built-ins.
That said, there's unfortunately no built-in for csv string escaping and there's no way to write your own without modifying FreeMarker source (though if you do want to go this way it's pretty straightforward - take a look at freemarker.core.BuiltIn). Perhaps you can get by with ?replace using regex or just write / expose an appropriate method and invoke it in your template.
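For the "write an appropriate method" route, the escaping logic itself is small. The Java sketch below follows the common CSV convention of quoting a field that contains a comma, a double quote, or a line break, and doubling any embedded quotes. The class and method names are made up, and how the method gets exposed to the template depends on your setup and is left out here.

```java
class CsvEscaper {
    /** Returns the field unchanged unless it needs quoting; null becomes an empty field. */
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.indexOf(',') >= 0
                || field.indexOf('"') >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return field;
        }
        // Double embedded quotes, then wrap the whole field in quotes.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }
}
```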

The Javadoc for HtmlEscaper indicates how to instantiate/register that in code (see the header), so I suspect if you implement your own TemplateTransformModel, and register it in a similar fashion then that should work.
