Creating a Graph Query Language (Node/Edge/HyperEdge)


I'm creating an API that encapsulates JPA objects with additional properties and helpers. I do not want users to access the database directly, so I have to provide certain querying functionality to the consumers of the API.
I have the following:
Node1(w/ attributes) -- > Edge1(w/ attr.) -- > Node2(w/ attr.)
and
Node1(w/ attributes) -- > |
Node2(w/ attributes) -- > | -- > HyperEdge1(w/ attr.)
Node3(w/ attributes) -- > |
Basically a Node can be of a certain type, which would dictate the kind of attributes available. So I need to be able to query these "paths" depending on different types and attributes.
For example: Start from a Node, and find a path typeA > typeB & attr1 > typeC.
So I need to do something simple, and be able to write the query as a string, or maybe a builder pattern style.
What I have so far is a visitor pattern set up to traverse the Nodes/Edges/HyperEdges. This allows for a sort of querying, but it's not very simple, since you have to create a new visitor for each new type of query.
This is my implementation so far:
ConditionImpl hasMass = ConditionFactory.createHasMass( 2.5 );
ConditionImpl noAttributes = ConditionFactory.createNoAttributes();
List<ConditionImpl> conditions = new ArrayList<ConditionImpl>();
conditions.add( hasMass );
conditions.add( noAttributes );
ConditionVisitor conditionVisitor = new ConditionVisitor( conditions );
node.accept( conditionVisitor );
List<Set<Node>> validPaths = conditionVisitor.getValidPaths();
The code above runs a query that checks whether the starting node has a mass of 2.5 and a linked node (child) has no attributes. The visitor calls condition.check( Node ), which returns a boolean.
Where do I start with creating a querying language for a graph that is simpler?
Note: I do not have the option of using an existing graph library, and I will have hundreds of thousands of nodes, plus the edges.

Personally, I like the idea of the visitor pattern; however, it might turn out to be too expensive to visit all nodes.
Query Interface: If users / other developers are using it, I would use a builder style interface, with readable method names:
Visitor v = QueryBuilder
.selectNodes(ConditionFactory.hasMass(2.5))
.withChildren(ConditionFactory.noAttributes())
.buildVisitor();
node.accept(v);
List<Set<Node>> validPaths = v.getValidPaths();
As pointed out above, this is more or less just syntactic sugar for what you already have (but sugar makes all the difference). I would separate the code for "moving on the graph" (like "check whether visited node fulfills condition" or "check whether connected nodes fulfill condition") from the code that actually checks (or is) a condition. Also, use composites on conditions to build and/or:
// Select nodes with mass 2.5, follow edges with both conditions fulfilled and check that the children on these edges have no attributes.
Visitor v = QueryBuilder
.selectNodes(ConditionFactory.hasMass(2.5))
.withEdges(ConditionFactory.and(ConditionFactory.freestyle("att1 > 12"), ConditionFactory.freestyle("att2 > 23")))
.withChildren(ConditionFactory.noAttributes())
.buildVisitor();
(I used "freestyle" for lack of creativity right now, but its intention should be clear.) Note that in general this might be two different interfaces, so that strange queries cannot be built:
public interface QueryBuilder {
    QuerySelector selectNodes(Condition c);
    QuerySelector allNodes();
}
public interface QuerySelector {
    QuerySelector withEdges(Condition c);
    QuerySelector withChildren(Condition c);
    QuerySelector withHyperChildren(Condition c);
    // ...
    QuerySelector and(QuerySelector... selectors);
    QuerySelector or(QuerySelector... selectors);
    Visitor buildVisitor();
}
Using this kind of syntactic sugar makes the queries readable from the source code without forcing you to implement your own data query language. The QuerySelector implementations would then be responsible for "moving" around the visited nodes, whereas the Condition implementations would check whether the conditions match.
The clear downside of this approach is that you need to foresee most of the queries in the interfaces and implement them up front.
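As a rough illustration of the composite idea, conditions can be modelled as predicates so that and/or become ordinary combinators. This is only a sketch under assumptions: SketchNode, SketchCondition, allOf/anyOf and the attribute names are hypothetical stand-ins, not the asker's Node or ConditionFactory.

```java
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in for the asker's Node: just a bag of numeric attributes.
final class SketchNode {
    final Map<String, Double> attributes;

    SketchNode(Map<String, Double> attributes) {
        this.attributes = attributes;
    }
}

// Hypothetical condition type. It only checks a node; it never moves on the graph.
interface SketchCondition extends Predicate<SketchNode> {

    static SketchCondition hasMass(double mass) {
        return n -> {
            Double actual = n.attributes.get("mass");
            return actual != null && actual == mass;
        };
    }

    static SketchCondition noAttributes() {
        return n -> n.attributes.isEmpty();
    }

    static SketchCondition greaterThan(String attribute, double threshold) {
        return n -> {
            Double actual = n.attributes.get(attribute);
            return actual != null && actual > threshold;
        };
    }

    // Composite: true only if every part accepts the node.
    static SketchCondition allOf(SketchCondition... parts) {
        return n -> {
            for (SketchCondition part : parts) {
                if (!part.test(n)) return false;
            }
            return true;
        };
    }

    // Composite: true if at least one part accepts the node.
    static SketchCondition anyOf(SketchCondition... parts) {
        return n -> {
            for (SketchCondition part : parts) {
                if (part.test(n)) return true;
            }
            return false;
        };
    }
}
```

With something like this, the "freestyle" pair above would read SketchCondition.allOf(greaterThan("att1", 12), greaterThan("att2", 23)), and the selector code never needs to know how a condition is composed.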
Scalability with number of nodes: You might need to add some kind of index to speed up finding "interesting" nodes. One idea that comes to mind is to add (for each index) a layer to the graph in which each node models one of the different attribute settings of the "indexed variable". Normal edges could then connect these index nodes with the nodes in the original graph, and the hyper edges on the index could build a network that is smaller to search. Of course, there is still the boring way of storing the index in a map-like structure with an attributeValue -> node mapping, which is probably much more performant than the idea above anyway.
If you have some kind of index, make sure that the index can receive a visitor as well, so that the visitor does not have to visit all nodes in the graph.
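The "boring" map-based index could look roughly like the following. AttributeIndex and its method names are hypothetical; the sketch assumes the API layer is the only place where nodes and indexed attributes change, so it can keep the index in sync.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical map-backed index: attribute value -> nodes carrying that value.
final class AttributeIndex<V, N> {
    private final Map<V, Set<N>> nodesByValue = new HashMap<>();

    // Call whenever a node is added or its indexed attribute is set.
    void put(V value, N node) {
        nodesByValue.computeIfAbsent(value, v -> new LinkedHashSet<>()).add(node);
    }

    // Call before an indexed attribute changes or a node is removed.
    void remove(V value, N node) {
        Set<N> nodes = nodesByValue.get(value);
        if (nodes != null) {
            nodes.remove(node);
            if (nodes.isEmpty()) {
                nodesByValue.remove(value);
            }
        }
    }

    // Candidate start nodes for a query; a visitor then only runs on these.
    Set<N> lookup(V value) {
        return Collections.unmodifiableSet(
                nodesByValue.getOrDefault(value, Collections.emptySet()));
    }
}
```

A query such as "start from nodes with mass 2.5" would then call lookup(2.5) and hand the visitor only to the returned nodes instead of walking the whole graph.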

It sounds like you have all the pieces except some syntactic sugar.
How about an immutable style where you create the whole list above like
Visitor v = Visitor.empty
.hasMass(2.5)
.edge()
.node()
.hasNoAttributes();
You can create any kind of linear query pattern using this style; and if you add some extra state you could even do branching queries, e.g. by calling setName("A") and later .node("A") to return to that point of the query.
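A minimal sketch of that immutable style, assuming a hypothetical PathPattern class that merely records the steps as strings (interpreting them against the graph is left out):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical immutable query pattern: every call returns a new object.
final class PathPattern {
    static final PathPattern EMPTY = new PathPattern(Collections.emptyList());

    private final List<String> steps;

    private PathPattern(List<String> steps) {
        this.steps = steps;
    }

    // Copy-on-append keeps earlier patterns untouched, so prefixes can be reused.
    private PathPattern with(String step) {
        List<String> copy = new ArrayList<>(steps);
        copy.add(step);
        return new PathPattern(Collections.unmodifiableList(copy));
    }

    PathPattern hasMass(double mass) {
        return with("hasMass(" + mass + ")");
    }

    PathPattern edge() {
        return with("edge");
    }

    PathPattern node() {
        return with("node");
    }

    PathPattern hasNoAttributes() {
        return with("hasNoAttributes");
    }

    List<String> steps() {
        return steps;
    }
}
```

Because nothing is mutated, a common prefix such as PathPattern.EMPTY.hasMass(2.5).edge() can be stored once and extended in several directions without the variants affecting each other.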

Related

Correct implementation for property of all objects that are equal

The problem
Consider an implementation of a graph, SampleGraph<N>.
Consider an implementation of the graph nodes, Node extends N, correctly overriding hashCode and equals to mirror logical equality between two nodes.
Now, let's say we want to add some property p to a node. Such a property is bound to logical instances of a node, i.e. for Node n1, n2, n1.equals(n2) implies p(n1) = p(n2)
If I simply add the property as a field of the Node class, this has happened to me:
I define Node n1, n2 such that n1.equals(n2) but n1 != n2
I add n1 and n2 to a graph: n1 when inserting the logical node, and n2 when referring to the node during insertion of edges. The graph stores both instances.
Later, I retrieve the node from the graph (n1 is returned) and set the property p on it to some value. Later, I traverse all the edges of the graph, and retrieve the node from one of them (n2 is returned). The property p is not set, causing a logical error in my model.
To summarize, current behavior:
graph.addNode(n1) // n1 is added
graph.addEdge(n2,nOther) // graph stores n2
graph.queryForNode({query}) // n1 is returned
graph.queryForEdge({query}).sourceNode() // n2 is returned
The question
All the following statements seem reasonable to me. None of them fully convinces me over the others, so I'm looking for best practice guidelines based on software engineering canons.
S1 - The graph implementation is poor. Upon adding a node, the graph should always internally check if it has an instance of the same node (equals evaluates to true) memorized. If so, such instance should always be the only reference used by the graph.
graph.addNode(n1) // n1 is added
graph.addEdge(n2,nOther) // graph internally checks that n2.equals(n1), doesn't store n2
graph.queryForNode({query}) // n1 is returned
graph.queryForEdge({query}).sourceNode() // n1 is returned
S2 - Assuming the graph behaves as in S1 is a mistake. The programmer should take care that always the same instance of a node is passed to the graph.
graph.addNode(n1) // n1 is added
graph.addEdge(n1,nOther) // the programmer uses n1 every time he refers to the node
graph.queryForNode({query}) // n1 is returned
graph.queryForEdge({query}).sourceNode() // n1 is returned
S3 - The property is not implemented the right way. It should be an information which is external to the class Node. A collection, such as a HashMap<N, Property>, would work just fine, treating different instances as the same object based on hashCode.
HashMap<N, Property> properties;
graph.addNode(n1) // n1 is added
graph.addEdge(n2,nOther) // graph stores n2
graph.queryForNode({query}) // n1 is returned
graph.queryForEdge({query}).sourceNode() // n2 is returned
// get the property. Difference in instances does not matter
properties.get(n1)
properties.get(n2) //same property is returned
S4 - Same as S3, but we could hide the implementation inside Node, this way:
class Node {
    private static HashMap<N, Property> properties;
    public Property getProperty() {
        return properties.get(this);
    }
}
Edit: added code snippets for current behavior and tentative solutions following Stephen C's answer. To clarify, the whole example comes from using a real graph data structure from an open source Java project.

It seems that S1 makes the most sense. Some Graph implementations internally use a Set<Node> (or some equivalent) to store the nodes. Using a structure like a Set ensures that there are no duplicate Nodes, where Node n1 and Node n2 are considered duplicates if and only if n1.equals(n2). Of course, the implementation of Node should ensure that all relevant properties are considered when comparing two instances (i.e. when implementing equals() and hashCode()).
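A rough sketch of what S1 could look like inside a graph. CanonicalizingGraph is a hypothetical class, not the open-source graph from the question; it uses a Map<N, N> instead of a Set<N> because a Set cannot hand back the instance it already stores.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical graph that keeps exactly one instance per logical node (S1).
final class CanonicalizingGraph<N> {
    private final Map<N, N> canonical = new HashMap<>();
    private final Map<N, Set<N>> successors = new HashMap<>();

    // Returns the stored instance: the first one seen among equal nodes.
    N addNode(N node) {
        N existing = canonical.putIfAbsent(node, node);
        return existing != null ? existing : node;
    }

    // Both endpoints are canonicalized before the edge is stored.
    void addEdge(N source, N target) {
        N s = addNode(source);
        N t = addNode(target);
        successors.computeIfAbsent(s, k -> new LinkedHashSet<>()).add(t);
    }

    // Looks up the canonical instance for any equal node, or null if unknown.
    N find(N node) {
        return canonical.get(node);
    }

    Set<N> successorsOf(N node) {
        return successors.getOrDefault(node, Collections.emptySet());
    }
}
```

With this, setting a property on whatever find(...) returns is visible no matter which equal instance was used when the edges were added.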
Some of the issues with the other statements:
S2, while perhaps reasonable, yields an implementation in which the burden falls to the client to understand and safeguard against a potential pitfall of the internal Graph implementation, which is a clear sign of a poorly designed API for the Graph object.
S3 and S4 both seem weird, although perhaps I don't quite understand the situation. In general, if a Node holds on to some data, it seems perfectly reasonable to define a member variable inside class Node to reflect that. Why should this extra property be treated any differently?

To my mind, it comes down to choosing between APIs with strong or weak abstraction.
If you choose strong abstraction, the API would hide the fact that Node objects have identity, and would canonicalize them when they are added to the SampleGraph.
If you choose weak abstraction, the API would assume that Node objects have identity, and it would be up to the caller to canonicalize them before adding them to the SampleGraph.
The two approaches lead to different API contracts and require different implementation strategies. The choice is likely to have performance implications ... if that is significant.
Then there are finer details of the API design that may or may not match your specific use-case for the graphs.
The point is that you need to make the choice.
(This is a bit like deciding to use the collections List interface and its clean model, versus implementing your own linked list data structure so that you can efficiently "splice" 2 lists together. Either approach could be correct, depending on the requirements of your application.)
Note that you usually can make a choice, though the choice may be a difficult one. For example, if you are using an API designed by someone else:
You can choose to use it as-is. (Suck it up!)
You can choose to try to influence the design. (Good luck!)
You can choose to switch to a different API; i.e. a different vendor.
You can choose to fork the API and adjust it to your own requirements (or preferences if this is what this is about)
You can choose to design and implement your own API from scratch.
And if you really don't have a choice, then this question is moot. Just use the API.
If this is an open-source API then you probably don't have the choice of getting the designers to change it. Significant API overhauls have a tendency to create a lot of work for other people; i.e. the many other projects that depend on the API. A responsible API designer / design team takes this into account. Or else they find that they lose relevance because their APIs get a reputation for being unstable.
So ... if you are aiming to influence an existing open-source API design ... 'cos you think they are doing it incorrectly (for some definition of incorrect) ... you are probably better off "forking" the API and dealing with the consequences.
And finally, if you are looking for "best practice" advice, be aware that there are no best practices. And this is not just a philosophical issue. This is about why you will get screwed if you go asking for / looking for "best practice" advice, and then follow it.
As a footnote: have you ever wondered why the Java and Android standard class libraries don't offer any general-purpose graph APIs or implementations? And why they took such a long time to appear in 3rd party libraries (Guava version 20.0)?
The answer is that there is no consensus on what such an API should be like. There are just too many conflicting use-cases and requirement sets.

Applying the visitor pattern for detecting cycles in a graph

I need to detect whether a directed graph contains a cycle, something like a topological sort, but I want to use the visitor pattern. Do you have any ideas? I can use an ArrayList of nodes and edges, or other structures (but not arrays).

The visitor pattern really can't achieve such a thing in its purest form.
Remember that a visitor pattern typically has the "visitor" travelling the web of objects, but the web of objects "directing" the visitor. Since the visitor is effectively path-unaware, it prevents certain kinds of breakage.
From the Wikipedia example of the Visitor pattern (in Java):
class Car implements CarElement {
    CarElement[] elements;
    public Car() {
        //create new Array of elements
        this.elements = new CarElement[] { new Wheel("front left"),
            new Wheel("front right"), new Wheel("back left") ,
            new Wheel("back right"), new Body(), new Engine() };
    }
    public void accept(CarElementVisitor visitor) {
        for(CarElement elem : elements) {
            elem.accept(visitor);
        }
        visitor.visit(this);
    }
}
Note the Car accept method. It ensures that all the sub-elements of the car are covered, encapsulating navigation, yet exposing the ability to apply external functions against the entire data structure.
Since your code requires knowledge of how the data structure is wired together, the visitor pattern is poorly suited to the task. If a visitor encounters a circular data structure, the future visitors will get stuck in the same loop, not visiting some of the data, breaking the contract between the Visitor and the VisitAcceptors.
Now you might be able to somewhat achieve the goal, provided you had "possibly circular" links in the visiting path not followed. You'd still have to ensure all nodes of the graph were followed in the visiting path, just by other branches of the visiting path. Then your visitor would basically become a large collection of nodes that could be hit by the non-travelled back links, but by the time you implemented such an odd solution, you'd wonder why you bothered with the visitor part.
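For comparison, here is what cycle detection looks like when the traversal itself owns the path knowledge. This is a plain depth-first search, not the visitor pattern, and the adjacency-map representation and the CycleCheck name are assumptions made for the sketch.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: depth-first cycle detection on a directed graph
// given as node -> list of successor nodes.
final class CycleCheck {

    static <N> boolean hasCycle(Map<N, List<N>> successors) {
        Set<N> finished = new HashSet<>(); // nodes whose subtree is fully explored
        Set<N> onPath = new HashSet<>();   // nodes on the current DFS path
        for (N start : successors.keySet()) {
            if (visit(start, successors, finished, onPath)) {
                return true;
            }
        }
        return false;
    }

    private static <N> boolean visit(N node, Map<N, List<N>> successors,
                                     Set<N> finished, Set<N> onPath) {
        if (onPath.contains(node)) {
            return true;  // reached a node that is still on the path: back edge
        }
        if (finished.contains(node)) {
            return false; // already explored without finding a cycle
        }
        onPath.add(node);
        for (N next : successors.getOrDefault(node, Collections.emptyList())) {
            if (visit(next, successors, finished, onPath)) {
                return true;
            }
        }
        onPath.remove(node);
        finished.add(node);
        return false;
    }
}
```

The two sets are exactly the "path awareness" that a pure visitor lacks. The recursion depth grows with the longest path, so a very deep graph would need an explicit stack instead.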

How to get descendants up to a certain level in a Tree?

Which Tree data structure in Java allows querying for different levels of children? I have looked at TreeNode and JTree, but they don't seem to support multi-level querying.
Given a Tree, for a specific node, I want to get the descendants up to a certain level n. Is there an existing implementation that I can use or should I write my own?
Thanks!

It's not that hard to write a breadth-first traversal and visit all the children up to a specified level. Here is a sketch. Assume you have a new class:
public class NodeWithLevel {
    final Node node;
    final int level;
    NodeWithLevel(Node node, int level) {
        this.node = node;
        this.level = level;
    }
}
This class is only a wrapper used for this algorithm.
Then the "get all nodes up to level N" method would be:
// needs java.util.ArrayDeque and java.util.Queue
Queue<NodeWithLevel> queue = new ArrayDeque<>();
queue.add(new NodeWithLevel(tree.root, 0));
while (!queue.isEmpty()) {
    NodeWithLevel current = queue.remove();
    // do whatever with current
    if (current.level < N) {
        // only expand nodes above the cutoff, so nothing deeper than N is enqueued
        for (Node child : current.node.children) {
            queue.add(new NodeWithLevel(child, current.level + 1));
        }
    }
}

DefaultMutableTreeNode supports several traversals, using any one of them to reach your goal is left (no pun intended, it's by the api :) to the user.
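One way to use those traversals for this question is sketched below. It relies on breadthFirstEnumeration() and getLevel() from DefaultMutableTreeNode; the Descendants helper and its upToLevel method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import javax.swing.tree.DefaultMutableTreeNode;

// Hypothetical helper around DefaultMutableTreeNode's breadth-first traversal.
final class Descendants {

    // Returns all descendants of start that are at most maxDepth levels below it.
    static List<DefaultMutableTreeNode> upToLevel(DefaultMutableTreeNode start, int maxDepth) {
        List<DefaultMutableTreeNode> result = new ArrayList<>();
        Enumeration<?> nodes = start.breadthFirstEnumeration();
        while (nodes.hasMoreElements()) {
            DefaultMutableTreeNode node = (DefaultMutableTreeNode) nodes.nextElement();
            int depth = node.getLevel() - start.getLevel();
            if (depth > maxDepth) {
                break; // breadth-first order: everything that follows is at least as deep
            }
            if (depth > 0) {
                result.add(node); // skip the start node itself
            }
        }
        return result;
    }
}
```

The sketch assumes every node in the subtree is a DefaultMutableTreeNode; getLevel() walks up to the root on each call, which is fine for shallow trees but adds cost on deep ones.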

If you're not afraid of a complex API, the DOM might be what you need. You can query it through XPath, apply events to its nodes, etc...

The only thing that springs to mind is Swing's DefaultTreeModel, but that would still require a bit of coding on your side for the logic to get children up to a certain level.
It shouldn't be too hard to roll your own implementation.

How can I insert a node before another using dom4j?

I have an org.dom4j.Document instance that is a DefaultDocument implementation, to be specific. I would like to insert a new node just before another one. I do not really understand the dom4j API; I am confused by the differences between Element and DOMElement and so on.
org.dom4j.dom.DOMElement.insertBefore is not working for me because the Node I have is not a DOMElement. DOMNodeHelper.insertBefore is no good either, because I have org.dom4j.Node instances and not org.w3c.dom.Node instances. OMG.
Could you give me a little code snippet that does this job for me?
This is what I have now:
// puts lr's to the very end in the xml, but I'd like to put them before 'e'
for(Element lr : loopResult) {
e.getParent().add(lr);
}

It's an "old" question, but the answer may still be relevant. One problem with the DOM4J API is that there are too many ways to do the same thing; too many convenience methods with the effect that you cannot see the forest for the trees. In your case, you should get a List of child elements and insert your element at the desired position: Something like this (untested):
// get a list of e's sibling elements, including e
List elements = e.getParent().elements();
// insert new element at e's position, i.e. before e
elements.add(elements.indexOf(e), lr);
Lists in DOM4J are live lists, i.e. a mutating list operation affects the document tree and vice versa.
As a side note, DOMElement and all the other classes in org.dom4j.dom are a DOM4J implementation that also supports the w3c DOM API. This is rarely needed (I would not have put it and a bunch of the other "esoteric" packages like bean, datatype, jaxb, swing etc. in the same distribution unit). Concentrate on the core org.dom4j, org.dom4j.tree, org.dom4j.io and org.dom4j.xpath packages.

Sharing children among parents in a JTree

I have a custom DefaultMutableTreeNode class that is designed to support robust connections between many types of data attributes (for me those attributes could be strings, user-defined tags, or timestamps).
As I aggregate data, I'd like to give the user a live preview of the stored data we've seen so far. For efficiency reasons, I'd like to only keep one copy of a particular attribute, that may have many connections to other attributes.
Example: The user-defined tag "LOL" occurs at five different times (represented by TimeStamps). So my JTree (the class that is displaying this information) will have five parent nodes (one for each time that tag occurred). Those parent nodes should ALL SHARE ONE INSTANCE of the DefaultMutableTreeNode defined for the "LOL" tag.
Unfortunately, using the add(MutableTreeNode newChild) REMOVES newChild from WHATEVER the current parent node is. That's really too bad, since I want ALL of the parent nodes to have THE SAME child node.
Here is a picture of DOING IT WRONG (Curtis is the author and he should appear FOR ALL THE SHOWS):
How can I accomplish this easily in Java?
Update
I've been looking at the code for DefaultMutableTreeNode.add()... and I'm surprised it works the way it does (comments are mine):
public void add(MutableTreeNode child)
{
    if (! allowsChildren)
        throw new IllegalStateException();
    if (child == null)
        throw new IllegalArgumentException();
    if (isNodeAncestor(child))
        throw new IllegalArgumentException("Cannot add ancestor node.");
    // one of these two lines contains the magic that prevents a single "pointer" from being
    // a child within MANY DefaultMutableTreeNode Vector<MutableTreeNode> children arrays...
    children.add(child); // just adds a pointer to the child to the Vector array?
    child.setParent(this); // this just sets the parent for this particular instance
}

If you want easy, you should probably give up on sharing the actual TreeNodes themselves. The whole model is built on the assumption that each node has only one parent. I'd focus instead on designing your custom TreeNode so that multiple nodes can all read their data from the same place, thereby keeping them synced.
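A small sketch of that approach: one tree node per parent, but all of them wrapping the same mutable user object. SharedTag and SharedTagTree are hypothetical names; the DefaultMutableTreeNode constructor, add() and toString() behave as in the standard Swing class.

```java
import javax.swing.tree.DefaultMutableTreeNode;

// Hypothetical shared data holder; every "LOL" node points at one instance.
final class SharedTag {
    private String label;

    SharedTag(String label) {
        this.label = label;
    }

    void setLabel(String label) {
        this.label = label;
    }

    @Override
    public String toString() {
        return label; // DefaultMutableTreeNode.toString() delegates to the user object
    }
}

final class SharedTagTree {

    // Builds root -> one parent per timestamp -> one child node per parent.
    // The child nodes are distinct, but they all share the same SharedTag.
    static DefaultMutableTreeNode build(SharedTag tag, String... timestamps) {
        DefaultMutableTreeNode root = new DefaultMutableTreeNode("root");
        for (String timestamp : timestamps) {
            DefaultMutableTreeNode parent = new DefaultMutableTreeNode(timestamp);
            parent.add(new DefaultMutableTreeNode(tag));
            root.add(parent);
        }
        return root;
    }
}
```

Only the small node wrappers are duplicated; the attribute data exists once. After changing the shared object, a displayed JTree still has to be told to repaint the affected nodes, for example through DefaultTreeModel.nodeChanged.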

I'm not sure it qualifies as easy, but you might look at Creating a Data Model by implementing TreeModel, which "does not require that nodes be represented by DefaultMutableTreeNode objects, or even that nodes implement the TreeNode interface." In addition to the tutorial example, there's a file system example cited here.

Unfortunately, I believe the answer is no. In order to do what you're talking about, you would need to have DefaultMutableTreeNode's internal userObject be a pointer to some String, so that all the corresponding DefaultMutableTreeNodes could point to and share the same String object.
However, you can't call DefaultMutableTreeNode.setUserObject() with any such String pointer, because Java does not have such a concept on the level that C or C++ does. Check out this outstanding blog article on the confusing misconceptions about pass-by-value and pass-by-reference in Java.
Update: Responding to your comment here in the answer space, so I can include a code example. Yes, it's true that Java works with pointers internally... and sometimes you have to clone an object reference to avoid unwanted changes to the original. However, to make a long story short (read the blog article above), this isn't one of those occasions.
public static void main(String[] args) {
    // This HashMap is a simplification of your hypothetical collection of values,
    // shared by all DefaultMutableTreeNodes
    HashMap<String, String> masterObjectCollection = new HashMap<String, String>();
    masterObjectCollection.put("testString", "The original string");
    // Here's a simplification of some other method elsewhere making changes to
    // an object in the master collection
    modifyString(masterObjectCollection.get("testString"));
    // You're still going to see the original String printed. When you called
    // that method, a reference to your object was passed by value... the ultimate
    // result being that the original object in your master collection does
    // not get changed based on what happens in that other method.
    System.out.println(masterObjectCollection.get("testString"));
}
private static void modifyString(String theString) {
    // += builds a new String and rebinds only the local parameter
    theString += "... with its value modified";
}
You might want to check out the JIDE Swing extensions, of which some components are commercial while others are open source and free as in beer. You might find some kind of component that comes closer to accomplishing exactly what you want.
