How to index a Titan graph stored in Cassandra into Solr - Java

I have stored a Titan graph in Cassandra. Below is the code.
import org.apache.commons.configuration.BaseConfiguration;
import org.apache.tinkerpop.gremlin.structure.Vertex;

import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;

public class Example1 {
    public static void main(String[] args) {
        // Initialise the graph
        BaseConfiguration baseConfiguration = new BaseConfiguration();
        baseConfiguration.setProperty("storage.backend", "cassandra");
        baseConfiguration.setProperty("storage.hostname", "192.168.3.82");
        baseConfiguration.setProperty("storage.cassandra.keyspace", "mycustomerdata");
        TitanGraph graph = TitanFactory.open(baseConfiguration);

        //---------------- Adding Data -------------------
        // Create some customers
        Vertex alice = graph.addVertex("customer");
        alice.property("name", "Alice Mc Alice");
        alice.property("birthdate", "100000 BC");

        Vertex bob = graph.addVertex("customer");
        bob.property("name", "Bob Mc Bob");
        bob.property("birthdate", "1000 BC");

        // Create some products
        Vertex meat = graph.addVertex("product");
        meat.property("name", "Meat");
        meat.property("description", "Delicious Meat");

        Vertex lettuce = graph.addVertex("product");
        lettuce.property("name", "Lettuce");
        lettuce.property("description", "Delicious Lettuce which is green");

        // Alice bought some meat:
        alice.addEdge("bought", meat);

        // Bob bought some meat and lettuce:
        bob.addEdge("bought", meat);
        bob.addEdge("bought", lettuce);

        //---------------- Querying (aka traversing, which is what you do in graph DBs) Data -------------------
        // Now who has bought meat? (the stored value is "Meat", so match on that)
        graph.traversal().V().has("name", "Meat").in("bought").forEachRemaining(v -> System.out.println(v.value("name")));

        // Who are all our customers?
        /*graph.traversal().V().hasLabel("customer").forEachRemaining(v -> System.out.println(v.value("name")));
        // What products do we have?
        graph.traversal().V().hasLabel("product").forEachRemaining(v -> System.out.println(v.value("name")));*/

        graph.close();
    }
}
I would like to index the same graph in Solr.
How do I do this using Java?
Do I have to query the tables of the keyspace and index them? What is the approach for having the same graph indexed in Solr?

Titan integrates directly with Solr, which means you never have to talk to Solr directly. Rather, you let Titan talk to it for you, and this happens naturally whenever you traverse the graph.
All you have to do is set up your indexing as defined here. I provide an example of using a mixed index backed by Solr/Elasticsearch here.
So in the above example, whenever you execute certain types of traversals, Titan together with Solr will respond quickly.
Just remember that you have to create a mixed index.
In addition to defining the indices, you also have to get Titan running with Solr. Unfortunately this is not so simple. You have to get Solr running and then get Titan talking to Solr, as I have done here.
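To make this concrete, here is a minimal sketch of what defining such a mixed index might look like with Titan's management API. It assumes the graph was opened with the Solr index backend configured; the index name, property key and configuration values below are illustrative, and "search" must match the index name you configured (the [X] in index.[X].backend).
// Example configuration when opening the graph (placeholder values):
//   baseConfiguration.setProperty("index.search.backend", "solr");
//   baseConfiguration.setProperty("index.search.solr.mode", "cloud");
//   baseConfiguration.setProperty("index.search.solr.zookeeper-url", "localhost:2181");
TitanManagement mgmt = graph.openManagement();
PropertyKey name = mgmt.makePropertyKey("name").dataType(String.class).make();
mgmt.buildIndex("namesMixed", Vertex.class)
        .addKey(name, Mapping.TEXT.asParameter())
        .buildMixedIndex("search");
mgmt.commit();
// (If the "name" key already exists in the schema, fetch it with
// mgmt.getPropertyKey("name") instead of making it again.)
Once the mixed index exists, traversals that filter on the indexed key, such as graph.traversal().V().has("name", Text.textContains("Meat")), can be answered through Solr rather than by a full graph scan.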

Related

DFS performance of Spark GraphX vs simple Java DFS implementation

Considering a graph with 14,000 vertices and 14,000 edges, I wonder why GraphX takes much more time than a plain Java implementation of the graph to get all the paths from a vertex to the leaves?
The Java implementation: a few seconds
The GraphX implementation: several minutes
Is Spark GraphX really suitable for this kind of processing?
My system:
i5-7500 @ 3.40GHz,
8GB RAM
The Pregel algorithm:
val sourceId: VertexId = 42 // The ultimate source
// Initialize the graph such that all vertices except the root have canReach = false.
val initialGraph: Graph[Boolean, Double] = graph.mapVertices((id, _) => id == sourceId)
val sssp = initialGraph.pregel(false)(
  (id, canReach, newCanReach) => canReach || newCanReach, // Vertex Program
  triplet => { // Send Message
    if (triplet.srcAttr && !triplet.dstAttr) {
      Iterator((triplet.dstId, true))
    } else {
      Iterator.empty
    }
  },
  (a, b) => a || b // Merge Message
)
This happened to me when implementing some algorithms on GraphX. I believe that GraphX is well adapted to a distributed environment where you have big graphs split across multiple machines.
But since you say that you use one node, have you checked the number of workers used? The number of executors? The amount of memory used by each executor? These configuration parameters definitely play an important role in increasing or decreasing the performance of your GraphX application.
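For reference, these are the kinds of settings the answer refers to; a minimal sketch using Spark's Java API (SparkConf / JavaSparkContext), where the values are placeholders rather than tuned recommendations:
// Illustrative only: check how many worker threads, executors and how much
// executor memory the job actually gets before judging GraphX performance.
SparkConf conf = new SparkConf()
        .setAppName("graphx-reachability")
        .setMaster("local[4]")                  // 4 local worker threads
        .set("spark.executor.memory", "2g")     // memory per executor
        .set("spark.default.parallelism", "8"); // default number of partitions
JavaSparkContext sc = new JavaSparkContext(conf);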

OrientDB Java Batch Import

I have a big problem with batch import in OrientDB when I use Java.
My data is a collection of record IDs and tokens. For each ID there exists a set of tokens, but a token can belong to several IDs.
Example:
ID Tokens
1 2,3,4
2 3,5,7
3 1,2,4
My graph should have two types of vertices: rIDClass and tokenClass. I want to give each vertex an ID corresponding to the record ID or the token. So the total number of tokenClass vertices should be the total number of unique tokens in the data (each token is only created once!).
How can I solve this problem? I tried the "Custom Batch Insert" from the original documentation and I tried the "Batch Implementation" method described in the Blueprints documentation.
The problem with the first method is that OrientDB creates a separate vertex for each inserted token, with a custom ID set by the system itself.
The problem with the second method is that when I try to add a vertex to the BatchGraph I can't set the corresponding vertex class, and additionally I get an exception. This is my code for the second method:
BatchGraph<OrientGraph> bgraph = new BatchGraph<OrientGraph>(graph, VertexIDType.STRING, 1);

Vertex vertex1 = bgraph.addVertex(1);
vertex1.setProperty("uid", 1);

Vertex vertex2 = bgraph.addVertex(2);
vertex2.setProperty("uid", 2);

Edge edge1 = graph.addEdge(12, vertex1, vertex2, "EdgeConnectClass");
And I get the following Exception:
Exception in thread "main" java.lang.ClassCastException:
com.tinkerpop.blueprints.util.wrappers.batch.BatchGraph$BatchVertex cannot be cast to com.tinkerpop.blueprints.impls.orient.OrientVertex
    at com.tinkerpop.blueprints.impls.orient.OrientBaseGraph.addEdge(OrientBaseGraph.java:612)
    at App.indexRecords3(App.java:83)
    at App.main(App.java:47)
Maybe someone has a solution.
I don't know if I understood correctly, but if you want a schema with rIDClass vertices connected to tokenClass vertices, try this:
Vertex vertex1 = g.addVertex("class:rIDClass");
vertex1.setProperty("uid", 1);
Vertex token2 = g.addVertex("class:tokenClass");
token2.setProperty("uid", 2);
Edge edge1 = g.addEdge("class:rIDClass", vertex1, token2, "EdgeConnectClass");
Hope it helps
Regards
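To make sure each token vertex is created only once, one possible approach (a sketch using the plain OrientGraph API rather than BatchGraph; the class and property names follow the question, the database URL and credentials are placeholders, and it assumes the rIDClass, tokenClass and EdgeConnectClass classes already exist) is to look the token up before creating it:
// Database URL and credentials are placeholders.
OrientGraph graph = new OrientGraph("plocal:/tmp/tokens", "admin", "admin");
try {
    Vertex record = graph.addVertex("class:rIDClass");
    record.setProperty("uid", 1);

    for (int tokenId : new int[] { 2, 3, 4 }) {
        // Reuse the token vertex if it already exists, otherwise create it once.
        // (An index on tokenClass.uid makes this lookup fast.)
        Iterator<Vertex> existing = graph.getVertices("tokenClass.uid", tokenId).iterator();
        Vertex token;
        if (existing.hasNext()) {
            token = existing.next();
        } else {
            token = graph.addVertex("class:tokenClass");
            token.setProperty("uid", tokenId);
        }
        graph.addEdge(null, record, token, "EdgeConnectClass");
    }
    graph.commit();
} finally {
    graph.shutdown();
}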

How to interpret K-Means clusters

I have written code in Java using Apache Spark for K-Means clustering.
I want to analyze network data. I created a K-Means model using some training data, with k=5 and 50 iterations.
Now I want to detect anomalous records using the distance of a record from the center of its cluster. If it is far from the center, then it is an anomalous record.
I also want to find out what type of data each cluster stores. To give an example: in the case of movie clustering, detecting the common genre or theme among the movies in a cluster.
I am having trouble interpreting the clusters. I am using one bad record and one good record for prediction, but at times both the good and the bad record fall into the same cluster.
A bad record means the URI field of that record contains a value like HelloWorld/../../WEB-INF/web.xml.
I get the array of all cluster centers from the K-Means model. There is no API to get the cluster center of a particular cluster. I am calculating the distance of an input vector/record from all cluster centers, but I am not able to get the cluster center of the specific cluster the record falls into.
Here is my code:
KMeansModel model = KMeans.train(trainingData.rdd(), numClusters, numIterations);
In a separate file:
model.save(sparkContext, KM_MODEL_PATH);

Vector[] clusterCenters = model.clusterCenters();

// The input for prediction is a Vector called vector
// Predict the cluster for the input record
System.out.println("Test Data cluster ----- "
        + model.predict(vector) + " k ->> " + model.k());

// Calculate the distance of the input record from each cluster center
for (Vector clusterCenter : clusterCenters) {
    System.out.println(" Distance "
            + computeDistance(clusterCenter.toArray(), vector.toArray()));
}

// Function for computing the (squared) distance between an input record and the center of a cluster
public double computeDistance(double[] clusterCenter, double[] vector) {
    org.apache.spark.mllib.linalg.DenseVector dV1 =
            new org.apache.spark.mllib.linalg.DenseVector(clusterCenter);
    org.apache.spark.mllib.linalg.DenseVector dV2 =
            new org.apache.spark.mllib.linalg.DenseVector(vector);
    return org.apache.spark.mllib.linalg.Vectors.sqdist(dV1, dV2);
}
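One way to get the center of the cluster a record actually falls into (a small sketch reusing the model, vector and computeDistance names from the code above): predict() returns the cluster index, which can be used to index into the clusterCenters() array.
// Sketch: the predicted cluster index doubles as the index into clusterCenters().
int clusterIndex = model.predict(vector);
Vector assignedCenter = model.clusterCenters()[clusterIndex];
double sqDistToOwnCenter = computeDistance(assignedCenter.toArray(), vector.toArray());
System.out.println("Record falls into cluster " + clusterIndex
        + ", squared distance to its center: " + sqDistToOwnCenter);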

Composite index of edges & property (tinkerpop / orientDB)

I have a graph in OrientDB (which uses the TinkerPop stack), and I need to enable very fast lookups on edge properties and on an edge's in/out vertices.
So, basically, the user will need to look up something like:
SELECT FROM myEdges WHERE inVertex = {VertexIdentity}, outVertex = {VertexIdentity}, property1 = 'xyz'
Essentially it's a composite index on the edge class over 3 properties: inVertex, outVertex and property1.
Basically, if the user already has the VertexIdentity of the 2 vertices (maybe in the form #CLUSTER_ID:RECORD_ID) and the property value (in this case, 'xyz'), this allows a very fast lookup to see whether the combination already exists in the graph (i.e. whether the 2 vertices are linked with property1) without doing a traversal.
So far I found the following code to help with composite indexes, but I can't see whether it's possible to include the in/out vertices in it (for a graph edge):
https://github.com/orientechnologies/orientdb/blob/master/tests/src/test/java/com/orientechnologies/orient/test/database/auto/SQLSelectCompositeIndexDirectSearchTest.java
Is it possible?
This is working fine for defining edge uniqueness:
OCommandSQL declareIn = new OCommandSQL();
declareIn.setText("CREATE PROPERTY E.in LINK");
OCommandSQL declareOut = new OCommandSQL();
declareOut.setText("CREATE PROPERTY E.out LINK");
OCommandSQL createIndexUniqueEdge = new OCommandSQL();
createIndexUniqueEdge.setText("CREATE INDEX unique_edge ON E (in, out) UNIQUE");
graph.command(declareIn).execute();
graph.command(declareOut).execute();
graph.command(createIndexUniqueEdge).execute();
In your case, just add another property to the Edge class and, consequently, to the index.
You can do it with OrientDB: just create the composite index against the in and out properties too (declare them on the E class first).
This is also used as a constraint to prevent multiple edges from connecting the same vertices.
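For example, a sketch of that suggestion following the same pattern (the property name property1 comes from the question; the index name and example record IDs are illustrative):
OCommandSQL declareProperty1 = new OCommandSQL("CREATE PROPERTY E.property1 STRING");
OCommandSQL createCompositeIndex = new OCommandSQL("CREATE INDEX edge_in_out_prop1 ON E (in, out, property1) UNIQUE");
graph.command(declareProperty1).execute();
graph.command(createCompositeIndex).execute();
// The lookup can then be answered from the index, e.g.:
// SELECT FROM E WHERE in = #12:0 AND out = #13:1 AND property1 = 'xyz'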

Java heap space errors using bigger amounts of data in neo4j

I am currently evaluating Neo4j in terms of inserting big amounts of nodes/relationships into the graph. It is not about initial inserts, which could be achieved with batch inserts. It is about inserts that are processed frequently during runtime in a Java application that uses Neo4j in embedded mode (currently version 1.8.1, as it is shipped with spring-data-neo4j 2.2.2.RELEASE).
These inserts are usually nodes that follow the star schema. One single node (the root node of the imported dataset) has up to 1000000 (one million!) connected child nodes. The child nodes normally have relationships to other additional nodes, too, but those relationships are not covered by this test so far. The overall goal is to import that amount of data in at most five minutes!
To simulate this kind of insert I wrote a small JUnit test that uses the Neo4jTemplate for creating the nodes and relationships. Each inserted leaf has a key associated for later processing:
@Test
@Transactional
@Rollback
public void generateUngroupedNode()
{
    long numberOfLeafs = 1000000;
    Assert.assertTrue(this.template.transactionIsRunning());
    Node root = this.template.createNode(map(NAME, UNGROUPED));
    String groupingKey = null;
    for (long index = 0; index < numberOfLeafs; index++)
    {
        // Just a sample division of leafs to possible groups
        // Creates keys to be grouped by, with groups containing 2 leafs each
        if (index % 2 == 0)
        {
            groupingKey = UUID.randomUUID().toString();
        }
        Node leaf = this.template.createNode(map(GROUPING_KEY, groupingKey, NAME, LEAF));
        this.template.createRelationshipBetween(root, leaf, Relationships.LEAF.name(), map());
    }
}
For this test I use the gcr cache to avoid Garbage Collector issues:
cache_type=gcr
node_cache_array_fraction=7
relationship_cache_array_fraction=5
node_cache_size=400M
relationship_cache_size=200M
Additionally I set my MAVEN_OPTS to:
export MAVEN_OPTS="-Xmx4096m -Xms2046m -XX:PermSize=256m -XX:MaxPermSize=512m -XX:+UseConcMarkSweepGC -XX:-UseGCOverheadLimit"
But anyway, when running that test I always get a Java heap space error:
java.lang.OutOfMemoryError: Java heap space
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2427)
at java.lang.Class.getMethod0(Class.java:2670)
at java.lang.Class.getMethod(Class.java:1603)
at org.apache.commons.logging.LogFactory.directGetContextClassLoader(LogFactory.java:896)
at org.apache.commons.logging.LogFactory$1.run(LogFactory.java:862)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.commons.logging.LogFactory.getContextClassLoaderInternal(LogFactory.java:859)
at org.apache.commons.logging.LogFactory.getFactory(LogFactory.java:423)
at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:685)
at org.springframework.transaction.support.TransactionTemplate.<init>(TransactionTemplate.java:67)
at org.springframework.data.neo4j.support.Neo4jTemplate.exec(Neo4jTemplate.java:403)
at org.springframework.data.neo4j.support.Neo4jTemplate.createRelationshipBetween(Neo4jTemplate.java:367)
I did some tests with smaller amounts of data, which resulted in the following outcomes. One node connected to:
50000 leafs: 3035ms
100000 leafs: 4290ms
200000 leafs: 10268ms
400000 leafs: 20913ms
800000 leafs: Java heap space
Here is a screenshot of the system monitor during those operations (screenshot not included here).
To get a better impression of what exactly is running and stored in the heap, I ran JProfiler with the last test (800000 leafs); the heap usage and CPU usage screenshots are likewise not included here.
The big question for me is: is Neo4j not designed for this kind of huge amount of data? Or are there other ways to achieve these kinds of inserts (and later operations)? On the official Neo4j website and in various screencasts I found the information that Neo4j is able to run with billions of nodes and relationships (e.g. http://docs.neo4j.org/chunked/stable/capabilities-capacity.html). I didn't find any functionality like the flush() and clear() methods that are available e.g. in JPA to keep the heap clean manually.
It would be great to be able to use Neo4j with those amounts of data. Already with 200000 leafs stored in the graph I noticed a performance improvement of a factor of 10 and more compared to an embedded classic RDBMS. I don't want to give up the nice way of modeling and querying data that Neo4j provides.
By just using the Neo4j core API it takes between 18 and 26 seconds to create the children, without any optimizations on my MacBook Air:
Output: import of 1000000 children took 26 seconds.
public class CreateManyRelationships {

    public static final int COUNT = 1000 * 1000;
    public static final DynamicRelationshipType CHILD = DynamicRelationshipType.withName("CHILD");
    public static final File DIRECTORY = new File("target/test.db");

    public static void main(String[] args) throws IOException {
        FileUtils.deleteRecursively(DIRECTORY);
        GraphDatabaseService gdb = new GraphDatabaseFactory().newEmbeddedDatabase(DIRECTORY.getAbsolutePath());
        long time = System.currentTimeMillis();
        Transaction tx = gdb.beginTx();
        Node root = gdb.createNode();
        for (int i = 1; i <= COUNT; i++) {
            Node child = gdb.createNode();
            root.createRelationshipTo(child, CHILD);
            if (i % 50000 == 0) { // commit in batches of 50k to keep transaction state small
                tx.success(); tx.finish();
                tx = gdb.beginTx();
            }
        }
        tx.success(); tx.finish();
        time = System.currentTimeMillis() - time;
        System.out.println("import of " + COUNT + " children took " + time / 1000 + " seconds.");
        gdb.shutdown();
    }
}
And the Spring Data Neo4j docs state that it is not made for this type of task.
If you are connecting 800K child nodes to one node, you are effectively creating a dense node, a.k.a. a key-value-like structure. Neo4j is currently not optimized to handle these structures effectively, as all connected relationships are loaded into memory upon traversal of a node. This will be addressed by Neo4j 2.1 with configurable optimizations, so that only parts of the relationships are loaded when touching these structures.
For the time being, I would recommend either putting these structures into indexes instead and doing a lookup for the connected nodes, or balancing the dense structure along one value (e.g. build a subtree with, say, 100 subcategories along one of the properties on the relationships, e.g. time; see http://docs.neo4j.org/chunked/snapshot/cypher-cookbook-path-tree.html for instance).
Would that help?
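As a rough sketch of the second suggestion (reusing the gdb, COUNT and CHILD names from the example above; the bucket count of 100 and the BUCKET relationship type are arbitrary choices), the leaves can be spread over intermediate bucket nodes instead of hanging directly off the root:
// Sketch of the "balanced subtree" idea: group leaves under intermediate bucket
// nodes so no single node ends up with a million direct relationships.
int buckets = 100;
Node[] bucketNodes = new Node[buckets];
Transaction tx = gdb.beginTx();
Node root = gdb.createNode();
for (int b = 0; b < buckets; b++) {
    bucketNodes[b] = gdb.createNode();
    root.createRelationshipTo(bucketNodes[b], DynamicRelationshipType.withName("BUCKET"));
}
for (int i = 1; i <= COUNT; i++) {
    Node child = gdb.createNode();
    bucketNodes[i % buckets].createRelationshipTo(child, CHILD);
    if (i % 50000 == 0) { // commit in batches, as in the example above
        tx.success(); tx.finish();
        tx = gdb.beginTx();
    }
}
tx.success(); tx.finish();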
