Version: 1.3.0
Graph Requirement: A child may have many parents or none.
Data Node: id + a list of parent ids
class Node {
    String id;            // unique node id
    List<String> parents; // ids of this node's parents (may be null or empty)
}
Total dataset: 3500 nodes.
GraphType is selected using: Directed + No Multiple Edges + No Self Loops + No Weights + DefaultEdge
Graph building logic (see the sketch below):
Iterate through the 3500 nodes.
Create the child vertex using the node id.
graph.addVertex(childVertex)
Then check if parents exist.
If they do, iterate through the parents:
Create the parent vertex using the parent id.
graph.addVertex(parentVertex)
graph.addEdge(parentVertex, childVertex)
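For reference, this is the build loop as runnable code (a minimal sketch, assuming JGraphT 1.3.0, the Node class above, and a plain single-threaded loop; buildGraph and nodes are made-up names):
// Assumes org.jgrapht.Graph, org.jgrapht.graph.SimpleDirectedGraph, org.jgrapht.graph.DefaultEdge
static Graph<String, DefaultEdge> buildGraph(List<Node> nodes) {
    // SimpleDirectedGraph = directed, no multiple edges, no self-loops, unweighted
    Graph<String, DefaultEdge> graph = new SimpleDirectedGraph<>(DefaultEdge.class);
    for (Node node : nodes) {
        graph.addVertex(node.id); // returns false (no-op) if the id is already present
        if (node.parents != null) {
            for (String parentId : node.parents) {
                graph.addVertex(parentId);
                graph.addEdge(parentId, node.id);
            }
        }
    }
    return graph;
}
Run single-threaded over the same 3500 unique ids, a loop like this is deterministic: addVertex deduplicates by the id Strings' equals/hashCode, so vertexSet().size() should come out the same on every run.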
However, running the same dataset (3500 nodes) 5 times, I get a different graph size each time from graph.vertexSet().size(). The expectation is 3500 every time, but it is inconsistent.
All 3500 ids are unique, so the graph size should be 3500, but I got something like:
GraphType is SimpleDirectedGraph and the size varies: 3500, 3208, 3283, 2856, 3284.
Any help would be appreciated.
Thanks
I have a bigger problem with batch import in OrientDB when I use Java.
My data is a collection of record IDs and tokens. Each ID has a set of tokens, but a token can occur under several IDs.
Example:
ID Tokens
1 2,3,4
2 3,5,7
3 1,2,4
My graph should have two types of vertices: rIDClass and tokenClass. I want to give each vertex an ID corresponding to the record ID or the token, so the total number of tokenClass vertices should equal the number of unique tokens in the data. (Each token is created only once!)
How can I solve this? I tried the "Custom Batch Insert" from the original documentation, and I tried the "Batch Implementation" method described in the Blueprints documentation.
The problem with the first method is that OrientDB creates a separate vertex with a system-assigned custom ID for each inserted token.
The problem with the second method is that when I try to add a vertex to the BatchGraph I can't set the corresponding vertex class, and additionally I get an exception. This is my code for the second method:
BatchGraph<OrientGraph> bgraph = new BatchGraph<OrientGraph>(graph, VertexIDType.STRING, 1);
Vertex vertex1 = bgraph.addVertex(1);
vertex1.setProperty("uid", 1);
Vertex vertex2 = bgraph.addVertex(2);
vertex2.setProperty("uid", 2);
Edge edge1 = graph.addEdge(12, vertex1 , vertex2, "EdgeConnectClass");
And I get the following Exception:
Exception in thread "main" java.lang.ClassCastException:
com.tinkerpop.blueprints.util.wrappers.batch.BatchGraph$BatchVertex cannot be cast to com.tinkerpop.blueprints.impls.orient.OrientVertex
at com.tinkerpop.blueprints.impls.orient.OrientBaseGraph.addEdge(OrientBaseGraph.java:612)
at App.indexRecords3(App.java:83)
at App.main(App.java:47)
Maybe someone has a solution.
I don't know if I understood correctly, but if you want a schema like this:
try this:
Vertex vertex1 = g.addVertex("class:rIDClass");
vertex1.setProperty("uid", 1);
Vertex token2 = g.addVertex("class:tokenClass");
token2.setProperty("uid", 2);
Edge edge1 = g.addEdge("class:rIDClass", vertex1, token2, "EdgeConnectClass");
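For the uniqueness requirement, one possible approach (just a sketch, assuming g is the OrientGraph from above and that the tokenClass class and its uid property already exist) is to look each token up before creating it:
// Reuse the existing token vertex with this uid, or create it if it is not there yet
Vertex token = null;
for (Vertex v : g.getVertices("tokenClass.uid", 2)) {
    token = v; // found an existing token vertex
    break;
}
if (token == null) {
    token = g.addVertex("class:tokenClass");
    token.setProperty("uid", 2);
}
With a unique index on the property (e.g. CREATE INDEX tokenClass.uid UNIQUE), the lookup becomes an index hit and the database rejects accidental duplicates.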
Hope it helps
Regards
I have written code in Java using Apache Spark for K-Means clustering.
I want to analyze network data. I created a K-Means model using some training data, with k=5 and iterations=50.
Now I want to detect anomalous records using the distance of a record from the center of its cluster: if it is far from the center, it is an anomalous record.
I also want to find out what type of data each cluster stores. For example, in movie clustering, that would mean detecting a common genre or theme among the movies in a cluster.
I am having trouble interpreting the clusters. I am using one bad record and one good record for prediction, but at times both the good and the bad record fall into the same cluster.
A bad record is one whose URI field contains a value like HelloWorld/../../WEB-INF/web.xml.
I get the array of all cluster centers from the K-Means model, but there is no API to get the center of one particular cluster. I am calculating the distance of an input vector/record from all cluster centers, but I am not able to get the center of the specific cluster where that record is placed.
Here is my code:
KMeansModel model = KMeans.train(trainingData.rdd(), numClusters, numIterations);
In a separate file,
model.save(sparkContext, KM_MODEL_PATH);
Vector[] clusterCenters = model.clusterCenters();

// Input for prediction is a Vector: testData
// Predict the cluster of the input record
System.out.println("Test Data cluster ----- " + model.predict(vector) + " k ->> " + model.k());

// Calculate the distance of the input record from each cluster center
for (Vector clusterCenter : clusterCenters) {
    System.out.println(" Distance " + computeDistance(clusterCenter.toArray(), vector.toArray()));
}

// Computes the squared distance between the input record and one cluster center
public double computeDistance(double[] clusterCenter, double[] vector) {
    org.apache.spark.mllib.linalg.DenseVector dV1 = new org.apache.spark.mllib.linalg.DenseVector(clusterCenter);
    org.apache.spark.mllib.linalg.DenseVector dV2 = new org.apache.spark.mllib.linalg.DenseVector(vector);
    return org.apache.spark.mllib.linalg.Vectors.sqdist(dV1, dV2);
}
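On the "center of a particular cluster" part, a sketch based on the standard MLlib API (reusing model, vector and computeDistance from above): predict returns the index of the assigned cluster, and clusterCenters() is indexed the same way, so the center of the cluster a record falls into can be read directly:
// Index of the cluster the input record was assigned to
int clusterIndex = model.predict(vector);
// Center of exactly that cluster
Vector assignedCenter = model.clusterCenters()[clusterIndex];
// Distance of the record from its own cluster's center, usable as an anomaly score
double distanceToOwnCenter = computeDistance(assignedCenter.toArray(), vector.toArray());
Records whose distanceToOwnCenter exceeds some threshold (for example, a high percentile of the training records' distances) could then be flagged as anomalous.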
I am using OrientDB 2.1-rc4 and executed the following commands. My goal is to fetch only the outgoing vertex paths.
1. Correct Result with Simple Graph
create class Depends extends E
create vertex set name="persians"
create vertex set name="vikings"
create vertex set name="teutons"
create vertex set name="mayans"
create vertex set name="aztecs"
select * from v
# |#RID|#CLASS|name
0 |#9:0|V |persians
1 |#9:1|V |vikings
2 |#9:2|V |teutons
3 |#9:3|V |mayans
4 |#9:4|V |aztecs
create edge Depends from #9:0 to #9:2
create edge Depends from #9:1 to #9:2
create edge Depends from #9:2 to #9:3
create edge Depends from #9:2 to #9:4
SELECT #this.toJSON('fetchPlan:in_*:-2 *:-1') FROM #9:2
{"out_Depends":[{"out":"#9:2","in":{"name":"mayans","in_Depends":["#11:2"]}},{"out":"#9:2","in":{"name":"aztecs","in_Depends":["#11:3"]}}],"name":"teutons"}
Only the outgoing nodes are fetched as expected.
2. Incorrect Result
Adding two more vertices:
create vertex set name="britons"
create vertex set name="mongols"
create edge Depends from #9:5 to #9:0
create edge Depends from #9:6 to #9:4
select * from e
----+-----+-------+----+----
# |#RID |#CLASS |out |in
----+-----+-------+----+----
0 |#11:0|Depends|#9:0|#9:2
1 |#11:1|Depends|#9:1|#9:2
2 |#11:2|Depends|#9:2|#9:3
3 |#11:3|Depends|#9:2|#9:4
4 |#11:4|Depends|#9:5|#9:0
5 |#11:5|Depends|#9:6|#9:4
----+-----+-------+----+----
Trying to fetch the out vertices as per http://orientdb.com/docs/last/Fetching-Strategies.html
SELECT #this.toJSON('fetchPlan:in_*:-2') FROM #9:2
{"out_Depends":["#11:2","#11:3"],"name":"teutons"}
Not all out vertices are fetched
SELECT #this.toJSON('fetchPlan:in_*:-2 *:-1') FROM #9:2
{"out_Depends":[{"out":"#9:2","in":{"name":"mayans","in_Depends":["#11:2"]}},{"out":"#9:2","in":{"in_Depends":["#11:3",{"out":{"name":"mongols","out_Depends":["#11:5"]},"in":"#9:4"}],"name":"aztecs"}}],"name":"teutons"}
Extra vertex mongols is fetched, which means the rule has not been applied at the other levels (in_Depends is excluded only at the 0th level).
Adding [*] to apply the exclusion rule on all levels, as per the documentation:
SELECT #this.toJSON('fetchPlan:[*]in_*:-2 *:-1') FROM #9:2
{"out_Depends":[{"out":"#9:2","in":{"name":"mayans","in_Depends":["#11:2"]}},{"out":"#9:2","in":{"in_Depends":["#11:3",{"out":{"name":"mongols","out_Depends":["#11:5"]},"in":"#9:4"}],"name":"aztecs"}}],"in_Depends":[{"out":{"name":"persians","out_Depends":["#11:0"],"in_Depends":[{"out":{"name":"britons","out_Depends":["#11:4"]},"in":"#9:0"}]},"in":"#9:2"},{"out":{"name":"vikings","out_Depends":["#11:1"]},"in":"#9:2"}],"name":"teutons"}
This however fetches the entire tree.
Can someone give a suggestion?
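One possible workaround (just a suggestion, untested beyond this dataset) is to follow only the outgoing edges explicitly with TRAVERSE instead of relying on the fetch plan exclusion:
SELECT FROM (TRAVERSE out('Depends') FROM #9:2)
This should return only #9:2 and the vertices reachable over outgoing Depends edges (mayans and aztecs), no matter how many incoming edges exist.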
I followed your instructions and your query works. Can you post an image of your db's graph, please?
I have a graph in OrientDB (uses Tinkerpop stack), and need to enable very fast lookups of edge values / properties / fields and edge in/out vertices.
So, basically, the user will need to look up edges as follows:
SELECT FROM myEdges WHERE inVertex = {VertexIdentity} AND outVertex = {VertexIdentity} AND property1 = 'xyz'
Essentially it's a composite index on the edge class over 3 properties: inVertex, outVertex & property1.
Basically, if the user already has the VertexIdentity of the 2 vertices (say, in the form #CLUSTER_ID:RECORD_ID) and the property value (in this case 'xyz'), this allows a very fast check of whether that combination already exists in the graph (whether the 2 vertices are linked with property1) without doing a traversal.
So far I have found the following code for composite indexes, but I can't see whether it's possible to include the in/out vertices of a graph edge in one:
https://github.com/orientechnologies/orientdb/blob/master/tests/src/test/java/com/orientechnologies/orient/test/database/auto/SQLSelectCompositeIndexDirectSearchTest.java
Is it possible?
This is working fine for defining edge uniqueness:
OCommandSQL declareIn = new OCommandSQL();
declareIn.setText("CREATE PROPERTY E.in LINK");
OCommandSQL declareOut = new OCommandSQL();
declareOut.setText("CREATE PROPERTY E.out LINK");
OCommandSQL createIndexUniqueEdge = new OCommandSQL();
createIndexUniqueEdge.setText("CREATE INDEX unique_edge ON E (in, out) UNIQUE");
graph.command(declareIn).execute();
graph.command(declareOut).execute();
graph.command(createIndexUniqueEdge).execute();
In your case, just add another property to the Edge class and, consequently, to the index.
You can do this with OrientDB: just create the composite index against the in and out properties too (declare them in the E class first).
This also serves as a constraint to prevent multiple edges from connecting the same vertices.
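Putting the two answers together, a sketch of the extended setup (the STRING type for property1 and the index name unique_edge_prop are assumptions; adjust to your schema):
OCommandSQL declareProp = new OCommandSQL("CREATE PROPERTY E.property1 STRING");
OCommandSQL createCompositeIndex = new OCommandSQL(
        "CREATE INDEX unique_edge_prop ON E (in, out, property1) UNIQUE");
graph.command(declareProp).execute();
graph.command(createCompositeIndex).execute();
A query like SELECT FROM E WHERE in = ? AND out = ? AND property1 = 'xyz' should then be answered from the composite index instead of a traversal.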
I am currently evaluating neo4j in terms of inserting big amounts of nodes/relationships into the graph. It is not about initial inserts, which could be achieved with batch inserts, but about inserts that are processed frequently during runtime in a Java application that uses neo4j in embedded mode (currently version 1.8.1, as shipped with spring-data-neo4j 2.2.2.RELEASE).
These inserts usually follow the star schema: one single node (the root node of the imported dataset) has up to 1000000 (one million!) connected child nodes. The child nodes normally have relationships to other additional nodes, too, but those relationships are not covered by this test so far. The overall goal is to import that amount of data in at most five minutes!
To simulate such kind of inserts I wrote a small junit test that uses the Neo4jTemplate for creating the nodes and relationships. Each inserted leaf has a key associated for later processing:
@Test
@Transactional
@Rollback
public void generateUngroupedNode()
{
    long numberOfLeafs = 1000000;
    Assert.assertTrue(this.template.transactionIsRunning());
    Node root = this.template.createNode(map(NAME, UNGROUPED));
    String groupingKey = null;
    for (long index = 0; index < numberOfLeafs; index++)
    {
        // Just a sample division of leafs into possible groups:
        // creates grouping keys so that each group contains 2 leafs
        if (index % 2 == 0)
        {
            groupingKey = UUID.randomUUID().toString();
        }
        Node leaf = this.template.createNode(map(GROUPING_KEY, groupingKey, NAME, LEAF));
        this.template.createRelationshipBetween(root, leaf, Relationships.LEAF.name(), map());
    }
}
For this test I use the gcr cache to avoid Garbage Collector issues:
cache_type=gcr
node_cache_array_fraction=7
relationship_cache_array_fraction=5
node_cache_size=400M
relationship_cache_size=200M
Additionally I set my MAVEN_OPTS to:
export MAVEN_OPTS="-Xmx4096m -Xms2046m -XX:PermSize=256m -XX:MaxPermSize=512m -XX:+UseConcMarkSweepGC -XX:-UseGCOverheadLimit"
But when running that test I always get a Java heap space error:
java.lang.OutOfMemoryError: Java heap space
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2427)
at java.lang.Class.getMethod0(Class.java:2670)
at java.lang.Class.getMethod(Class.java:1603)
at org.apache.commons.logging.LogFactory.directGetContextClassLoader(LogFactory.java:896)
at org.apache.commons.logging.LogFactory$1.run(LogFactory.java:862)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.commons.logging.LogFactory.getContextClassLoaderInternal(LogFactory.java:859)
at org.apache.commons.logging.LogFactory.getFactory(LogFactory.java:423)
at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:685)
at org.springframework.transaction.support.TransactionTemplate.<init>(TransactionTemplate.java:67)
at org.springframework.data.neo4j.support.Neo4jTemplate.exec(Neo4jTemplate.java:403)
at org.springframework.data.neo4j.support.Neo4jTemplate.createRelationshipBetween(Neo4jTemplate.java:367)
I did some tests with smaller amounts of data, which produced the following results. One node connected to:
50000 leafs: 3035ms
100000 leafs: 4290ms
200000 leafs: 10268ms
400000 leafs: 20913ms
800000 leafs: Java heap space
Here is a screenshot of the system monitor during those operations:
To get a better impression on what exactly is running and is stored in the heap I ran the JProfiler with the last test (800000 leafs). Here are some screenshots:
Heap usage:
CPU usage:
The big question for me is: is neo4j not designed for this kind of huge data volume? Or are there other ways to achieve such inserts (and the later operations)? On the official neo4j website and in various screencasts I found the information that neo4j is able to run with billions of nodes and relationships (e.g. http://docs.neo4j.org/chunked/stable/capabilities-capacity.html). I didn't find any functionality like the flush() and clear() methods available e.g. in JPA to keep the heap clean manually.
It would be great to be able to use neo4j with these amounts of data. With 200000 leafs stored in the graph I already noticed a performance improvement of a factor of 10 and more compared to an embedded classic RDBMS. I don't want to give up the nice way of modeling and querying data that neo4j provides.
Using just the Neo4j core API, it takes between 18 and 26 seconds to create the children on my MacBook Air, without any optimizations:
Output: import of 1000000 children took 26 seconds.
public class CreateManyRelationships {

    public static final int COUNT = 1000 * 1000;
    public static final DynamicRelationshipType CHILD = DynamicRelationshipType.withName("CHILD");
    public static final File DIRECTORY = new File("target/test.db");

    public static void main(String[] args) throws IOException {
        FileUtils.deleteRecursively(DIRECTORY);
        GraphDatabaseService gdb = new GraphDatabaseFactory().newEmbeddedDatabase(DIRECTORY.getAbsolutePath());
        long time = System.currentTimeMillis();
        Transaction tx = gdb.beginTx();
        Node root = gdb.createNode();
        for (int i = 1; i <= COUNT; i++) {
            Node child = gdb.createNode();
            root.createRelationshipTo(child, CHILD);
            // Commit in batches of 50000 to keep the transaction state small
            if (i % 50000 == 0) {
                tx.success(); tx.finish();
                tx = gdb.beginTx();
            }
        }
        tx.success(); tx.finish();
        time = System.currentTimeMillis() - time;
        System.out.println("import of " + COUNT + " children took " + time / 1000 + " seconds.");
        gdb.shutdown();
    }
}
And the Spring Data Neo4j docs state that it is not made for this type of task.
If you are connecting 800K child nodes to one node, you are effectively creating a dense node, a.k.a. a key-value-like structure. Neo4j is currently not optimized to handle these structures effectively, as all connected relationships are loaded into memory upon traversal of a node. This will be addressed in Neo4j 2.1 with configurable optimizations, so that only parts of the relationships are loaded when touching these structures.
For the time being, I would recommend either putting these structures into an index instead and doing a lookup for the connected nodes, or balancing the dense structure along one value (e.g. building a subtree with, say, 100 subcategories along one of the properties on the relationships, e.g. time; see http://docs.neo4j.org/chunked/snapshot/cypher-cookbook-path-tree.html for instance).
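A minimal sketch of the index variant (using the legacy index API as available in Neo4j 1.8; gdb and groupingKey are reused from the code above, and the index name "leafs" is made up):
// org.neo4j.graphdb.index.Index: store leafs in a node index instead of
// attaching all of them to one dense root node
Index<Node> leafIndex = gdb.index().forNodes("leafs");
Transaction tx = gdb.beginTx();
try {
    Node leaf = gdb.createNode();
    leaf.setProperty("groupingKey", groupingKey);
    leafIndex.add(leaf, "groupingKey", groupingKey);
    tx.success();
} finally {
    tx.finish();
}
// Later: fetch the leafs of one group without traversing a dense node
for (Node leaf : leafIndex.get("groupingKey", groupingKey)) {
    // process the leaf
}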
Would that help?