Compare graph structure using Java

I am implementing a schema matching algorithm. To perform schema structure matching, I need to represent each schema as an is-a/has-a relationship graph, one graph per schema.
Each node in the relational model represents a table (with an is-a edge for "table") and one has-a relationship for each column (each column having its own is-a).
My question is how best to implement this in Java. Comparing the graphs is pseudo-polynomial in the graph size and may throw an out-of-memory error if we pull in the complete schema. I want to find nodes with similar relationships in both graphs (this will lead to a DFS).
Is there any existing Java implementation that can do this? I have already explored JGraphT and JUNG, but I am not sure which one would be best. Please help.
Thanks in advance!

Whatever graph API you use ought to allow you to do something like this:
boolean equal = graph1.equals(graph2);
where that evaluates to true if the node sets and edge sets are equal. The nodes would need IDs, or else content, so you could establish actual equality as opposed to graph isomorphism.
Is that what you are asking?
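For instance, here is a minimal, library-agnostic sketch of that check, assuming each schema graph is represented as an adjacency map from node IDs to neighbour IDs (both JGraphT and JUNG can expose equivalent vertex/edge views):
import java.util.*;

// each graph: node ID -> set of neighbour IDs (is-a/has-a targets)
Map<String, Set<String>> schema1 = new HashMap<>();
Map<String, Set<String>> schema2 = new HashMap<>();
schema1.put("employee", new HashSet<>(Arrays.asList("table", "id", "name")));
schema2.put("employee", new HashSet<>(Arrays.asList("table", "id", "name")));

// equal node sets and edge sets, because map equality compares keys and values
boolean equal = schema1.equals(schema2); // true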

Related

Increasing elements in a data structure that follows the Composite pattern

I have often read/heard that the Composite pattern is a good solution for representing hierarchical data structures like binary trees, which are great for explaining this pattern because internal nodes are composite objects and leaves are leaf objects. I can appreciate that, using this pattern, it is easy to visit every element in a uniform way.
However, I am not so sure it is the best example if you consider that the tree is filled on demand (every time an insert method is executed), because we have to convert a leaf into a composite object many times (e.g. when a leaf has to add a child). To convert a leaf object, I imagine a tricky way like this (inspired by become: from Smalltalk, I guess):
aComposite = aLeaf.becomeComposite();
aComposite.addChild(newElement);
// destroy aLeaf (bad time performance)
To sum up, is a tree-like structure a good example to illustrate the Composite pattern if this kind of structure is commonly born empty and you then have to add/insert elements?
GoF states the Intent of Composite as follows:
"Compose objects into tree structures to represent part-whole hierarchies. ..... treat individual object and compositions of objects uniformly"
So a tree is not so much a structure to illustrate Composite; rather, a tree is the structure by which Composite is defined and operates. It's also worth remembering that for the purposes of Composite, a tree can be a binary tree (2 children), a linked list (one child) or can be composed of nodes with a variable number of children.
It's quite normal to build a tree from nothing. Consider an arithmetic expression parser building a composite "parse" tree. The parser will start from nothing, creating leaf nodes for terminal symbols (like + - * /, braces, numbers) and composite nodes that combine the terminals and perform the calculations. The parser constructs the tree such that invoking evaluate() on the head node causes a traversal that evaluates the expression.
I use this example to show that a tree can be built bottom-up, never having to "convert a leaf to composite object".
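To make that concrete, here is a minimal sketch of such a bottom-up build, with hypothetical Num (leaf) and Add (composite) classes:
interface Expression { double evaluate(); }

class Num implements Expression {            // leaf: a terminal number
    private final double value;
    Num(double value) { this.value = value; }
    public double evaluate() { return value; }
}

class Add implements Expression {            // composite: combines two children
    private final Expression left, right;
    Add(Expression left, Expression right) { this.left = left; this.right = right; }
    public double evaluate() { return left.evaluate() + right.evaluate(); }
}

// leaves are created first and composites wrap them, so no leaf
// ever has to "become" a composite:
Expression tree = new Add(new Num(1), new Add(new Num(2), new Num(3)));
double result = tree.evaluate(); // 6.0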
If your application builds the tree top-down, or progressively in stages, it's hard to see why that matters, because the build process will consist of creating appropriate nodes and inserting them in a way that makes sense for the application.
If converting leaf nodes to composite nodes is problematic in any specific application, then for sure you need to look at ways to minimise the overhead in that situation. But it's only a valid Composite structure once the tree is built!

In Neo4j, is there any way to restrict nodes and relation types in path while using Java API?

I have a source node and a destination node, and I want to put restrictions on the nodes and relationship types in the path. I am using the Neo4j Java API.
Consider the following toy example:
We have three person nodes, A, B & C.
Source node: A & destination node: B. Many other kinds of paths may exist between them. I want to restrict paths to a specific format, like:
(person) -[worksAt]-> (company) -[CompetitorOf]-> (company) <-[worksAt]- (person)
This can very easily be achieved with a Cypher query, but I want to know whether there is any way to do it using the Java API.
NOTE: Kindly do not suggest putting a restriction on path length; that doesn't solve the problem. I want to restrict the node and relationship types in the path.
The example mentioned above is a toy example. The graph I am working with is more complex, and there are so many possible paths that it is not feasible to traverse and validate each path individually.
It's not really clear from your question what you're actually trying to compute. Do you have A and B and want to find out whether their companies are competitors? Do you have C and want to find who among their friends works at competing companies?
Anyway, if you're using the traversal API (you're talking about paths), you can write a custom PathExpander which uses the last relationship in the Path to determine the next type of relationship to traverse.
If you're just traversing the relationships manually, I don't really see the problem: just call Node.getRelationships(RelationshipType, Direction) with the proper parameters at each step.
Contrary to what you do in Cypher, you don't declare the pattern you're looking for in the path; you just compute the path so that it follows the wanted pattern.
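For illustration, here is a minimal sketch of that manual traversal for the toy pattern above, assuming personA is the source node and using the relationship type names from the question:
// (person)-[worksAt]->(company)-[competitorOf]-(company)<-[worksAt]-(person)
Node company = personA.getSingleRelationship(
        RelationshipType.withName("worksAt"), Direction.OUTGOING).getEndNode();
for (Relationship comp : company.getRelationships(
        RelationshipType.withName("competitorOf"), Direction.BOTH)) {
    Node competitor = comp.getOtherNode(company);
    for (Relationship works : competitor.getRelationships(
            RelationshipType.withName("worksAt"), Direction.INCOMING)) {
        Node person = works.getStartNode(); // a person working at a competitor
    }
}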
After reading the Neo4j Java documentation carefully and experimenting with the code, I got the following solution working.
To filter the paths explored by a PathFinder, create a custom PathExpander using PathExpanderBuilder:
PathExpanderBuilder pathExpanderBuilder = PathExpanderBuilder.empty();
// add(...) returns a new builder rather than mutating, so reassign each time
pathExpanderBuilder = pathExpanderBuilder.add(RelationshipType.withName("worksat"), Direction.OUTGOING);
pathExpanderBuilder = pathExpanderBuilder.add(RelationshipType.withName("competitorof"), Direction.BOTH);
pathExpanderBuilder = pathExpanderBuilder.add(RelationshipType.withName("worksat"), Direction.INCOMING);
PathExpander<Object> pathExpander = pathExpanderBuilder.build();
Once you have created the custom PathExpander, you can use it to create an appropriate PathFinder, which will filter the traversal accordingly.
PathFinder<Path> allPathFinder = GraphAlgoFactory.allSimplePaths(pathExpander, 4);
Iterable<Path> allPaths = allPathFinder.findAllPaths(sourceNode, targetNode);
In our example, sourceNode would be node 'A' and targetNode would be node 'B'.

Data structures in Java to implement joins

Hi, I am trying to implement a simple join algorithm in Java.
I have three relations, i.e. M(ABX), N(ACY) and O(BCZ). These relations are currently in comma-separated files of all integers (e.g. file M will have values like 1,5,6; 2,7,9; ...). I was wondering what the best data structure in Java would be to implement the join MxNxO, i.e. M and N will join on attribute A, producing a schema (ABCXY), which will then join with O on attributes B and C, producing a final result (ABCXYZ) which will have all join results.
Perhaps an embedded database like hsqldb would be the right choice. It's flexible, performant, and easy to use.
There are no specialized data structures that you can readily use for this.
You would have to represent the tables extracted from your CSV files as List<List<Integer>> and then iterate over the lists, comparing the proper attribute representing the column name to create intermediate lists, and so on until you have joined all the relations.
I.e. you would need to implement this logic yourself.
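For instance, here is a minimal hash-join sketch along those lines, assuming the CSV rows have already been parsed into int[] arrays (mRows and nRows are hypothetical):
import java.util.*;

// index M's rows by attribute A (column 0), then probe with N's rows
Map<Integer, List<int[]>> mByA = new HashMap<>();
for (int[] m : mRows) {
    mByA.computeIfAbsent(m[0], k -> new ArrayList<>()).add(m);
}
List<int[]> mJoinN = new ArrayList<>();
for (int[] n : nRows) {
    for (int[] m : mByA.getOrDefault(n[0], Collections.emptyList())) {
        // combined tuple (A, B, C, X, Y); the join with O on (B, C) works the same way
        mJoinN.add(new int[]{m[0], m[1], n[1], m[2], n[2]});
    }
}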
The best way to do this, IMHO, is to follow the answer from @Ernest Friedman-Hill.
Not only will you get this functionality faster, you will get it error-free, as you would not need to test that the join algorithm works correctly over any dataset: the embedded database does this for you.

Neo4j indexing (with Lucene) - good way to organize node "types"?

This is more actually more of a Lucene question, but it's in the context of a neo4j database.
I have a database that's divided into 50 or so node types (so "collections" or "tables" in other types of dbs). Each has a subset of properties that need to be indexed, some share the same name, some don't.
When searching, I always want to find nodes of a specific type, never across all nodes.
I can see three ways of organizing this:
1. One index per type; properties map naturally to index fields: index 'foo', 'id'='1234'.
2. A single global index; each field maps to a property name. To distinguish the type, either include it as part of the value ('id'='foo:1234') or check the nodes once they're returned (I expect duplicates to be very rare).
3. A single index where the type is part of the field name: 'foo.id'='1234'.
Once created, the database is read-only.
Are there any benefits to one of those, in terms of convenience, size/cache efficiency, or performance?
As I understand it, for the first option neo4j will create a separate physical index for each type, which seems suboptimal. For the third, I end up with most lucene docs only having a small subset of the fields, not sure if that affects anything.
I came across this problem recently when I was building an ActiveRecord connection adapter for Neo4j over REST, to be used in a Rails project. Since ActiveRecord and ActiveRelation both have a tight coupling with SQL syntax, it became difficult to fit everything into NoSQL. It might not be the best solution, but here's how I solved it:
I created an index named model_index which indexed nodes under two keys, type and model.
Index lookup with the type key currently happens with just one value, model. This was introduced primarily to achieve SHOW TABLES SQL functionality, which can get me a list of all models present in the graph.
Index lookup with the model key takes place with values corresponding to the different model names in my system. This is primarily for achieving DESC <TABLENAME> functionality.
With each table creation, as in CREATE TABLE, a node is created, with the table definition attributes being stored as node properties.
The created node is indexed under model_index with type:model and model:<model-name>. This includes the newly created model in the list of 'tables' and also allows one to reach the model node directly via an index lookup with the model key.
For each record created per model (type, in your case), an outgoing edge labeled instances is created, directed from the model node to the new record: v[123] :=> [instances] :=> v[245], where v[123] represents the model node and v[245] represents a record of v[123]'s type.
Now, if you want to get all instances of a specified type, you can look up model_index with model:<model-name> to reach the model node and then fetch all adjacent nodes over outgoing edges labeled instances. Filtered lookups can further be achieved by applying filters and other complex traversals.
The above solution prevents model_index from clogging, since it contains only the two keys, and achieves an effective record lookup via one index lookup and a single-level traversal.
Although in your case nodes of different types are not adjacent to each other, even if you wanted them to be, you could determine the type of any arbitrary node by simply looking up its adjacent node over an incoming edge labeled instances. Furthermore, I'm considering incorporating SpringDataGraph's pattern of storing a __type__ property on each instance node to avoid this adjacent-node lookup.
I'm currently translating AREL to Gremlin scripts for almost everything. You could find the source code for my AR Adapter at https://github.com/yournextleap/activerecord-neo4j-adapter
Hope this helps, Cheers! :)
A single index will be smaller than several little indexes, because some data, such as the term dictionary, will be shared. However, since a term dictionary lookup is a O(lg(n)) operation, a lookup in a bigger term dictionary might be a little slower. (If you have 50 indexes, this would only require 6 (2^6>=50) more comparisons, it is likely you won't notice any difference.)
Another advantage of a smaller index is that the OS cache is likely to make queries run faster.
Instead of your options 2 and 3, I would index two different fields, id and type, and search for (id:ID AND type:TYPE), but I don't know whether that is possible with neo4j.
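In raw Lucene, that combined query would look something like this sketch (the classic pre-5.x mutable BooleanQuery API and the field names from the example are assumed):
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

// match documents whose id is "1234" AND whose type is "foo"
BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("id", "1234")), BooleanClause.Occur.MUST);
query.add(new TermQuery(new Term("type", "foo")), BooleanClause.Occur.MUST);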
spring-data-neo4j is using the first approach - it creates a different index for each type. So I guess that's a good option for the general scenario. But in your particular case it might be suboptimal, as you say. I'd run some benchmarks to measure the performance.
The other two, by the way, seem a bit artificial. You are possibly indexing completely unrelated information in the same index, which doesn't sound right.

Structure for holding data in this instance (HashMap/ArrayList etc.)?

The best way to describe this is to explain the situation.
Imagine I have a factory that produces chairs. The factory is split into 5 sections. A chair can be made fully in one area or across a number of areas. The makers of the chairs add the chair's attributes to a chair object. At the end of the day these objects are collected by my imaginary program and added into X datatype (ArrayList etc.).
When a chair is added, the program must check whether the chair already exists; if so, it should not replace the existing chair but append this chair's attributes to it (don't worry about this part, I've got it covered).
So basically I want a structure in which I can easily check whether an object exists: if not, just straight up insert it; otherwise perform the append. So I need to find the chair matching a certain unique ID. Kind of like a set, except it's not matching the same object: if a chair is made in three areas it will be three distinct objects (in real life they all represent the same object), yet I only want one object that holds the entire attribute contents of all the chairs.
Once everything is collected and the update performed across all areas of the factory, the program needs to iterate over each object and add its contents to a DB. Again, don't worry about adding to the DB etc., that's covered.
I just want to know what the best data structure in Java would be to match this spec.
Thank you in advance.
I'd say a HashMap: it lets you quickly check whether an object exists with a given unique ID, and retrieve that object if it does exist in the collection. Then it's simply a matter of performing your merge function to add attributes to the object that is already in the collection.
Unlike most other collections (ArrayList, e.g.), HashMaps are actually optimized for looking something up by a unique ID, and it will be just as fast at doing this regardless of how many objects you have in your collection.
This answer originally made reference to the Hashtable class, but after further research (and some good comments), I discovered that you're always better off using a HashMap. If you need synchronization, you can call Collections.synchronizedMap() on it. See here for more information.
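A minimal sketch of that check-then-merge flow, assuming a hypothetical Chair class with getId() and an addAttributes()-style merge method:
import java.util.*;

Map<String, Chair> chairsById = new HashMap<>();
for (Chair reported : collectedChairs) {            // chairs gathered from all sections
    Chair existing = chairsById.get(reported.getId());
    if (existing == null) {
        chairsById.put(reported.getId(), reported); // first sighting: insert
    } else {
        existing.addAttributes(reported);           // already known: append attributes
    }
}
// later, iterate chairsById.values() to write everything to the DB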
I'd say use an ArrayList. Override the hashCode()/equals() methods on your Chair object to use the unique ID. That way you can just use list.contains(chair) to check whether it exists.
I'd say use an EnumMap. Define an enum of all possible part categories, so you can query the EnumMap for which part is missing:
public enum Category {
    SEAT, REST, LEGS, CUSHION
}
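For example, here is a sketch of checking which parts of a chair are still missing (the part values are just hypothetical strings):
import java.util.EnumMap;

// record which categories have been reported for one chair so far
EnumMap<Category, String> parts = new EnumMap<>(Category.class);
parts.put(Category.SEAT, "oak seat");
parts.put(Category.LEGS, "steel legs");

for (Category c : Category.values()) {
    if (!parts.containsKey(c)) {
        System.out.println("Missing part: " + c); // prints REST and CUSHION
    }
}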
