Extracting information from AST - java

I am trying to use ANTLR to extract information from a PL/SQL file. I am using the porcelli PL/SQL grammar, with which ANTLR produces an AST from my input PL/SQL file. I need to read the returned "CommonTree" class (which represents the AST) and obtain different pieces of information, say, the names of tables and their related columns. I was wondering whether it would make sense to use the visitor pattern to collect information about tables and the columns related to a particular table. For instance, a query like this
SELECT s.name from students s, departments d WHERE d.did=10 and s.sid=d.did
will be shown in the AST as a deeply nested tree (AST diagram omitted here).
Obtaining table name and related columns here will involve capturing aliases first from the FROM element and then matching with columns used in SELECT_LIST. Information about tables and columns is hidden deep in leaf nodes under repeatedly used elements such as "ANY_ELEMENT".
So, how would I go about using a visitor pattern here? Would I end up with far too many visitors, because there are potentially a lot of element types? Is the visitor pattern even relevant here?
EDIT
After thinking it over for a while, I am nearing the conclusion that the visitor pattern wouldn't make sense in this scenario. Given that the data structure to be visited is a tree, and that there are potentially so many node types (select, update, insert, delete, from, where, into, ...), defining what should happen on visiting each of these node types could result in hundreds of methods per visitor class!

As noted in my last edit, I resolved this by not implementing the visitor pattern, because such a pattern would require me to define every node type, and for PL/SQL there would be far too many.
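For what it's worth, here is a minimal sketch of the alternative I went with: a single recursive walk over the tree that reacts only to the token types of interest. The token type is passed in as a parameter because the concrete constant (something like PLSQLParser.TABLE_REF) depends on the generated porcelli grammar; that name is hypothetical.

import java.util.ArrayList;
import java.util.List;
import org.antlr.runtime.tree.Tree;

// One generic walker instead of a visitor hierarchy: collect every node
// of a given token type, wherever it sits in the AST.
public class NodeCollector {

    public static List<Tree> collect(Tree root, int tokenTypeOfInterest) {
        List<Tree> hits = new ArrayList<Tree>();
        walk(root, tokenTypeOfInterest, hits);
        return hits;
    }

    private static void walk(Tree node, int tokenType, List<Tree> hits) {
        if (node == null) return;
        if (node.getType() == tokenType) {
            hits.add(node); // e.g. a table reference; read the name via getText()
        }
        for (int i = 0; i < node.getChildCount(); i++) {
            walk(node.getChild(i), tokenType, hits);
        }
    }
}

One small walker per extraction task (tables, columns, aliases) stays manageable, unlike one method per node type.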

Related

Use pagination on query on multiple indices in hibernate search

We are implementing a global search that we can use to query all entities in our application, say cars, books, and movies. These entities do not share common fields: cars have a manufacturer, books have an author, and movies have a director, for example.
In a global search field I'd like to search across all my entities with one query, so that I can use pagination. Two approaches come to mind when thinking about how to solve this:
Query one index after another and manually merge the result. This means that I have to implement pagination myself.
Add common fields to each item, like name and creator (or create an interface as shown here: Single return type in Hibernate Search). In this case I can only search for fields in my global search that I map to the common fields.
My question is now: Is there a third (better) way? How would you suggest to implement such a global search on multiple indices?
Query one index after another and manually merge the result. This means that I have to implement pagination myself.
I definitely wouldn't do that, as this will perform very poorly, especially with deep pagination (page 40, etc.).
Add common fields to each item, like name and creator (or create an interface as shown here: Single return type in Hibernate Search). In this case I can only search for fields in my global search that I map to the common fields.
That's the way. You don't even need a common interface since you can just target multiple fields in the same predicate. The common interface would only help to target all relevant types: you can call .search(MyInterface.class) instead of .search(Arrays.asList(Car.class, Book.class, Movie.class)).
You can still apply predicates to fields that are specific to each type; it's just that fields that appear in more than one type must be consistent (same type, etc.). Also, obviously, if you require that the "manufacturer" (and no other field) matches "james", Books and Movies won't match anymore, since they don't have a manufacturer.
Have you tried it? For example, this should work just fine as long as manufacturer, author and director are all text fields with the same analyzer:
import java.util.Arrays;
import java.util.List;
import org.hibernate.search.engine.search.query.SearchResult;

SearchResult<Object> result = searchSession.search( Arrays.asList(
        Car.class, Book.class, Movie.class
) )
        .where( f -> f.simpleQueryString()
                .fields( "manufacturer", "author", "director" )
                .matching( "james" ) )
        .fetch( 20 );
List<Object> hits = result.hits(); // Result is a mix of Car, Book and Movie.
One approach would be to create a SQL view (SearchEntry?) that combines all of the tables you want to search. This allows you to alias your different column names. It won't be very good for performance but you could also just create one big field that is a concatenation of different searchable fields. Finally, include a "type" field that you tie back to your entity.
Now you can query everything in one go and use the type/id to tie back to a specific entity that the "search" data was initially pulled from.
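As a rough sketch of that idea (all names here are hypothetical, and Hibernate's @Subselect stands in for a real database view):

import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.annotations.Immutable;
import org.hibernate.annotations.Subselect;

// Read-only entity over a hypothetical UNION of cars, books and movies.
// Ids from the different tables would need disambiguation in practice
// (e.g. prefixing them with the type).
@Entity
@Immutable
@Subselect("select id, name, creator, type from search_entry")
public class SearchEntry {
    @Id
    private Long id;
    private String name;    // common display name
    private String creator; // manufacturer / author / director, aliased
    private String type;    // "car", "book" or "movie": ties back to the entity
    // getters omitted
}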

How can I get next available node in DOM with schema?

I need to query the names of "available" sub elements of an element node in DOM.
For example, if the schema says "there can be age, name, and occupation elements under a person element", then I want a function like this:
import org.w3c.dom.Element;
Element person_element;
String[] names_of_available_sub_element =
        get_available_sub_element_names(person_element);
which makes
names_of_available_sub_element == {"age", "name", "occupation"}.
How can I implement this function?
This isn't easy, but it can be done if you're prepared to put a lot of work in.
There are a number of approaches to getting information from an XSD schema. You could try and process the XSD source code, but I wouldn't recommend that, because there are so many things you have to take into account (wildcards, substitution groups, types derived by restriction and extension, and so on). A better approach is to use some kind of API that gives you access to the information in digested form. For that, some possible suggestions are:
(a) Xerces gives you a Java API providing programmatic access to the compiled schema (a rough sketch of this follows below).
(b) Saxon gives you two possibilities: (i) the SCM file, which is an XML representation of the compiled schema, and (ii) an XPath API giving programmatic access to the compiled schema using extension functions.
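To give a flavor of option (a), here is a rough sketch against Xerces' XS API. It assumes a no-namespace schema and a plain sequence/choice content model; the complications listed above (wildcards, substitution groups, and so on) are deliberately ignored:

import java.util.ArrayList;
import java.util.List;
import org.apache.xerces.impl.xs.XMLSchemaLoader;
import org.apache.xerces.xs.*;

// Load a compiled schema and list the child element names permitted
// under a given top-level element declaration.
public class SubElementNames {

    public static List<String> childElementNames(String schemaUri, String elementName) {
        List<String> names = new ArrayList<String>();
        XSLoader loader = new XMLSchemaLoader();
        XSModel model = loader.loadURI(schemaUri);
        XSElementDeclaration decl = model.getElementDeclaration(elementName, null);
        if (decl == null) return names;
        XSTypeDefinition type = decl.getTypeDefinition();
        if (type instanceof XSComplexTypeDefinition) {
            XSParticle particle = ((XSComplexTypeDefinition) type).getParticle();
            if (particle != null) collect(particle.getTerm(), names);
        }
        return names;
    }

    private static void collect(XSTerm term, List<String> names) {
        if (term instanceof XSElementDeclaration) {
            names.add(((XSElementDeclaration) term).getName());
        } else if (term instanceof XSModelGroup) { // sequence, choice, all
            XSObjectList particles = ((XSModelGroup) term).getParticles();
            for (int i = 0; i < particles.getLength(); i++) {
                collect(((XSParticle) particles.item(i)).getTerm(), names);
            }
        }
    }
}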
Do remember that knowing you're at a "person" element isn't (in the general case) enough to determine what the permitted children are. That's because there can be global and local elements using the name "person", but with different types. Whether this is a problem in your case depends on what you are trying to achieve, which you haven't really explained in much detail.

Neo4j indexing (with Lucene) - good way to organize node "types"?

This is actually more of a Lucene question, but it's in the context of a Neo4j database.
I have a database that's divided into 50 or so node types (so "collections" or "tables" in other kinds of databases). Each has a subset of properties that need to be indexed; some share the same name, some don't.
When searching, I always want to find nodes of a specific type, never across all nodes.
I can see three ways of organizing this:
One index per type, properties map naturally to index fields: index 'foo', 'id'='1234'.
A single global index, each field maps to a property name, to distinguish the type either include it as part of the value ('id'='foo:1234') or check the nodes once they're returned (I expect duplicates to be very rare).
A single index, type is part of the field name: 'foo.id'='1234'.
Once created, the database is read-only.
Are there any benefits to one of those, in terms of convenience, size/cache efficiency, or performance?
As I understand it, for the first option neo4j will create a separate physical index for each type, which seems suboptimal. For the third, I end up with most lucene docs only having a small subset of the fields, not sure if that affects anything.
I came across this problem recently when I was building an ActiveRecord connection adapter for Neo4j over REST, to be used in a Rails project. Since ActiveRecord and ActiveRelation both have a tight coupling with SQL syntax, it became difficult to fit everything into NoSQL. It might not be the best solution, but here's how I solved it:
Created an index named model_index which indexes nodes under two keys, type and model.
Index lookup with the type key currently happens with just one value, model. This was introduced primarily to achieve SHOW TABLES SQL functionality, which can get me a list of all models present in the graph.
Index lookup with the model key takes place with values corresponding to the different model names in my system. This is primarily for achieving DESC <TABLENAME> functionality.
With each table creation as in CREATE TABLE, a node is created with table definition attributes being stored in node properties.
The created node is indexed under model_index with type:model and model:<model-name>. This makes the newly created model appear in the list of 'tables' and also allows one to reach the model node directly via an index lookup on the model key.
For each record created per model (a type in your case), an outgoing edge labeled instances is created, directed from the model node to the new record: v[123] => [instances] => v[245], where v[123] represents the model node and v[245] represents a record of v[123]'s type.
Now if you want to get all instances of a specified type, you can look up model_index with model:<model-name> to reach the model node, and then fetch all adjacent nodes over outgoing edges labeled instances. Filtered lookups can further be achieved by applying filters and other complex traversals.
The above solution keeps model_index from clogging up, since it contains only two entries per model, and it achieves an effective record lookup via one index lookup and a single-level traversal.
Although in your case nodes of different types are not adjacent to each other, even if you wanted them to be, you could determine the type of any arbitrary node by simply looking up its adjacent node over an incoming edge labeled instances. Furthermore, I'm considering incorporating SpringDataGraph's pattern of storing a __type__ property on each instance node to avoid this adjacent-node lookup.
I'm currently translating AREL to Gremlin scripts for almost everything. You could find the source code for my AR Adapter at https://github.com/yournextleap/activerecord-neo4j-adapter
Hope this helps, Cheers! :)
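For readers following along, the lookup described above might look roughly like this with the embedded Neo4j API of that era (the index name, key, and relationship label are taken from the answer; transaction handling is omitted):

import java.util.ArrayList;
import java.util.List;
import org.neo4j.graphdb.*;
import org.neo4j.graphdb.index.Index;

// Reach the model node via model_index, then collect its records over
// outgoing "instances" edges.
public class ModelLookup {
    private static final RelationshipType INSTANCES =
            DynamicRelationshipType.withName("instances");

    public static List<Node> instancesOf(GraphDatabaseService db, String modelName) {
        Index<Node> modelIndex = db.index().forNodes("model_index");
        Node modelNode = modelIndex.get("model", modelName).getSingle();
        List<Node> records = new ArrayList<Node>();
        if (modelNode != null) {
            for (Relationship r : modelNode.getRelationships(INSTANCES, Direction.OUTGOING)) {
                records.add(r.getEndNode());
            }
        }
        return records;
    }
}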
A single index will be smaller than several little indexes, because some data, such as the term dictionary, will be shared. However, since a term dictionary lookup is an O(lg(n)) operation, a lookup in a bigger term dictionary might be a little slower. (Merging 50 indexes into one would only require about 6 more comparisons per lookup, since 2^6 >= 50; it is likely you won't notice any difference.)
Another advantage of a smaller index is that the OS cache is likely to make queries run faster.
Instead of your options 2 and 3, I would index two different fields id and type and search for (id:ID AND type:TYPE) but I don't know if it is possible with neo4j.
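In raw Lucene terms (3.x era), that combination is a simple BooleanQuery; as far as I know, Neo4j's Lucene-backed index API can accept such a query object via index.query(...), but I haven't verified that:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Build (id:ID AND type:TYPE) programmatically.
public class TypedIdQuery {
    public static BooleanQuery build(String id, String type) {
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("id", id)), BooleanClause.Occur.MUST);
        query.add(new TermQuery(new Term("type", type)), BooleanClause.Occur.MUST);
        return query;
    }
}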
spring-data-neo4j uses the first approach - it creates a different index for each type. So I guess that's a good option for the general scenario. But in your particular case it might be suboptimal, as you say. I'd run some benchmarks to measure the performance.
The other two, by the way, seem a bit artificial. You are possibly indexing completely unrelated information in the same index, which doesn't sound right.

How to search across multiple fields in Lucene using Query Syntax?

I'm searching a Lucene index and I'm building search queries like
field1:"hello" AND field2:"world"
but I'd like to search for a value in any field as well as the values in specific fields in the same query i.e.
field1:"hello" AND anyField:"world"
Can anyone tell me how I can search across all indexed fields in this way?
Based on the answers I got for this question: Impact of repeat value across multiple fields in Lucene...
I can index the same value under multiple fields, and therefore create an "all" field into which I put everything. This way I can create a query like...
field1:"hello" AND all:"world"
This seems to work very nicely, prevents the need for huge search queries, and apparently the performance impact is minimal.
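For illustration, the indexing side of that trick could look like this with the Lucene 3.x API (the field names are just the ones from the question):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Each value is added under its own field and again under a catch-all
// "all" field, which only needs to be indexed, not stored.
public class AllFieldDoc {
    public static Document build(String field1Value, String field2Value) {
        Document doc = new Document();
        doc.add(new Field("field1", field1Value, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("field2", field2Value, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("all", field1Value, Field.Store.NO, Field.Index.ANALYZED));
        doc.add(new Field("all", field2Value, Field.Store.NO, Field.Index.ANALYZED));
        return doc;
    }
}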
Boolean (OR) queries with a clause for each field are used to search multiple fields. The MultiFieldQueryParser will do that as well, but the fields still need to be enumerated. There's no implicit "all" field, but IndexReader.getFieldNames can retrieve the field names for you.
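A sketch of that combination with Lucene 3.x APIs; the parser still enumerates fields explicitly, they are just discovered at runtime:

import java.util.Collection;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

// Expand a query across every indexed field of the index.
public class AllFieldsQuery {
    public static Query build(IndexReader reader, String queryText) throws ParseException {
        Collection<String> fields = reader.getFieldNames(IndexReader.FieldOption.INDEXED);
        MultiFieldQueryParser parser = new MultiFieldQueryParser(
                Version.LUCENE_32,
                fields.toArray(new String[fields.size()]),
                new StandardAnalyzer(Version.LUCENE_32));
        return parser.parse(queryText);
    }
}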
This might not apply to you, but in Azure Search, which is based on Lucene, using Lucene syntax, I use this:
name:plywood^100 OR plywood
Results with "plywood" in the "name" field are boosted.

Lucene custom scoring (Lucene 3.2) involves iterating through all documents in the index - fastest way?

I'm trying to implement a custom scoring formula in Lucene that has nothing to do with tf-idf (so changing just the similarity, for example, will not work).
In order to do this, I need to be able to take my custom Query and generate a score for every document stored in the index - not just the ones that match the terms in the query (since my scoring involves checking what are essentially synonyms, so even if a doc doesn't have the exact Terms, it could still produce a positive score). Is the best way to simply create an IndexReader and call Document d = reader.doc(i) for all docs (as described here), and then generate a score on the spot?
I've been looking around at Lucene's scoring internals, specifically the various Scorer and Collector classes, and it appears that what happens (in Lucene 3.2) is that a Weight provides a Scorer, which along with the Collector loops through all documents that match the query. Can I utilize this structure in some way, but again get a custom Scorer implementation to consider ALL documents?
If you decide to go for a custom scoring scheme, the proper way is to use a subclass of CustomScoreQuery with getCustomScoreProvider overridden to return your subclass of CustomScoreProvider. The CustomScoreQuery constructor requires a subquery. Here you will want to provide a fast native Lucene Query that will narrow down the result set as much as possible before going through your custom score calculation. You can also arrange to store any number of float values with each of your docs and make those accessible to your custom score provider. You will need to provide an appropriate ValueSourceQuery to the constructor of CustomScoreQuery for each such float value. See the Javadocs on these classes, they are well written. Unfortunately I don't have a Java snippet at hand.
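A skeleton of that arrangement with the Lucene 3.2 package layout (the scoring formula itself is just a placeholder to be replaced with your synonym-aware calculation):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.function.CustomScoreProvider;
import org.apache.lucene.search.function.CustomScoreQuery;
import org.apache.lucene.search.function.ValueSourceQuery;

// The subquery narrows the candidate set; the provider computes the score,
// optionally using per-document floats exposed through the ValueSourceQuery.
public class SynonymScoreQuery extends CustomScoreQuery {

    public SynonymScoreQuery(Query subQuery, ValueSourceQuery valSrcQuery) {
        super(subQuery, valSrcQuery);
    }

    @Override
    protected CustomScoreProvider getCustomScoreProvider(IndexReader reader) throws IOException {
        return new CustomScoreProvider(reader) {
            @Override
            public float customScore(int doc, float subQueryScore, float valSrcScore)
                    throws IOException {
                // Placeholder: combine the stored float with your own formula.
                return valSrcScore;
            }
        };
    }
}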
As I understand Lucene, it stores (term, doc) pairs in its index, so that querying is implemented as
Get documents containing the query terms,
score/sort them.
I've never implemented my own scoring, but I'd look at IndexReader.termDocs first; it seems to implement step 1.
With IndexReader.termDocs you can iterate through a term's posting list, that is, all documents that contain that term. You could use this to build your own query processing on top of Lucene, but then you won't be able to use any of Query, Similarity, and so on.
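In Lucene 3.x code, that iteration looks roughly like this (a sketch, with error handling reduced to the essentials):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

// Walk the posting list of a single term, i.e. every document containing it.
public class PostingListScan {
    public static void scan(IndexReader reader, String field, String text) throws IOException {
        TermDocs termDocs = reader.termDocs(new Term(field, text));
        try {
            while (termDocs.next()) {
                int docId = termDocs.doc(); // matching document id
                int freq = termDocs.freq(); // term frequency in that document
                // feed (docId, freq) into the custom scoring logic here
            }
        } finally {
            termDocs.close();
        }
    }
}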
Also, if you are working with synonyms, Lucene has some things in the contrib package. Another possible solution (I don't know if you have tried it) is to inject synonyms into the documents through an Analyzer (or otherwise). That way you could return documents even if they don't contain the query terms.
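The Analyzer-injection idea boils down to a custom TokenFilter along these lines (a sketch modeled on the well-known "Lucene in Action" synonym filter; you supply the synonym map):

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

// Emits each synonym as an extra token at the same position as the
// original word, so queries match either form.
public class SimpleSynonymFilter extends TokenFilter {
    private final Map<String, List<String>> synonyms;
    private final Deque<String> pending = new ArrayDeque<String>();
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncrAtt =
            addAttribute(PositionIncrementAttribute.class);
    private AttributeSource.State savedState;

    public SimpleSynonymFilter(TokenStream input, Map<String, List<String>> synonyms) {
        super(input);
        this.synonyms = synonyms;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!pending.isEmpty()) {
            restoreState(savedState);           // reuse offsets of the original token
            termAtt.setEmpty().append(pending.pop());
            posIncrAtt.setPositionIncrement(0); // stack on the same position
            return true;
        }
        if (!input.incrementToken()) return false;
        List<String> syns = synonyms.get(termAtt.toString());
        if (syns != null && !syns.isEmpty()) {
            pending.addAll(syns);
            savedState = captureState();
        }
        return true;
    }
}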
