How to interpret K-Means clusters - Java

I have written code in Java using Apache Spark for K-Means clustering.
I want to analyze network data. I created a K-Means model using some training data, with k=5 and iterations=50.
Now I want to detect anomalous records using the distance of a record from the center of its cluster: if a record is far from the center, it is an anomaly.
I also want to find out what type of data each cluster holds. For example, in the case of movie clustering, that would mean detecting a common genre or theme among the movies in each cluster.
I am having trouble interpreting the clusters. I am using one bad record and one good record for prediction, but at times both the good and the bad record fall into the same cluster.
A bad record is one whose URI field contains a value like HelloWorld/../../WEB-INF/web.xml.
I get the array of all cluster centers from the K-Means model, but there is no API that returns the center of one particular cluster directly. I am calculating the distance of an input vector/record from all cluster centers, but I cannot get the center of the specific cluster that the record was assigned to.
Here is my code,
KMeansModel model = KMeans.train(trainingData.rdd(), numClusters, numIterations);
In a separate file:
model.save(sparkContext, KM_MODEL_PATH);
Vector[] clusterCenters = model.clusterCenters();

// Input for prediction is the Vector "vector" built from the test record
// Predict the cluster for the test record
System.out.println("Test Data cluster ----- "
        + model.predict(vector) + " k ->> " + model.k());

// Calculate the distance of the input record from each cluster center
for (Vector clusterCenter : clusterCenters) {
    System.out.println(" Distance "
            + computeDistance(clusterCenter.toArray(), vector.toArray()));
}

// Computes the squared distance between the input record and one cluster center
public double computeDistance(double[] clusterCenter, double[] vector) {
    org.apache.spark.mllib.linalg.DenseVector dV1 =
            new org.apache.spark.mllib.linalg.DenseVector(clusterCenter);
    org.apache.spark.mllib.linalg.DenseVector dV2 =
            new org.apache.spark.mllib.linalg.DenseVector(vector);
    return org.apache.spark.mllib.linalg.Vectors.sqdist(dV1, dV2);
}
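A minimal sketch of the anomaly check I am after, using the cluster index returned by predict() to pick the matching center (the threshold value is only a placeholder assumption, not something the model provides):
// Predict the cluster index for the record, then look up that cluster's center
int clusterIndex = model.predict(vector);
Vector assignedCenter = model.clusterCenters()[clusterIndex];

// Squared distance of the record from the center of its own cluster
double distance = org.apache.spark.mllib.linalg.Vectors.sqdist(assignedCenter, vector);

// Flag the record as anomalous if it is farther from its center than some threshold,
// e.g. a high percentile of the distances observed on the training data.
double threshold = 100.0; // placeholder value, to be derived from the training set
boolean isAnomaly = distance > threshold;
System.out.println("cluster=" + clusterIndex + " distance=" + distance + " anomaly=" + isAnomaly);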

Related

JGraphT: Inconsistent graph size for same dataset

Version: 1.3.0
Graph requirement: a child may have many parents or none.
Data node: id + list of parent ids
class Node {
    String id;
    List<String> parents;
}
Total dataset: 3500 nodes.
GraphType is selected using: Directed + No Multiple Edges + No Self Loops + No Weights + DefaultEdge
Graph building logic (a Java sketch of this loop appears below):
Iterate through the 3500 nodes
Create a vertex using the node ID: Graph.addVertex(childVertex)
Then check whether parents exist
If they do, iterate through the parents
Create a vertex for each parent ID: Graph.addVertex(parentVertex)
Graph.addEdge(parentVertex, childVertex)
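A minimal self-contained version of that building logic (the nodes list stands in for the 3500-element dataset and is an assumption here):
import java.util.List;
import org.jgrapht.Graph;
import org.jgrapht.graph.DefaultEdge;
import org.jgrapht.graph.SimpleDirectedGraph;

static Graph<String, DefaultEdge> buildGraph(List<Node> nodes) {
    // Directed, no multiple edges, no self-loops, no weights
    Graph<String, DefaultEdge> graph = new SimpleDirectedGraph<>(DefaultEdge.class);
    for (Node node : nodes) {
        graph.addVertex(node.id);            // no-op if the vertex is already present
        if (node.parents != null) {
            for (String parentId : node.parents) {
                graph.addVertex(parentId);   // ensure the parent vertex exists before the edge
                graph.addEdge(parentId, node.id);
            }
        }
    }
    return graph;
}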
However, running the same dataset (3500 nodes) 5 times, I get a different graph size each time, as calculated by graph.vertexSet().size(). The expectation is 3500 every time, but it is inconsistent.
All 3500 ids are unique, so the graph size should be 3500, but I get something like:
GraphType is SimpleDirectedGraph and the size varies: 3500, 3208, 3283, 2856, 3284.
Any help would be appreciated.
Thanks

How to index a Titan graph stored in Cassandra into Solr

I have stored a Titan graph in Cassandra. Below is the code.
public class Example1 {
    public static void main(String[] args) {
        // Initialise the graph
        BaseConfiguration baseConfiguration = new BaseConfiguration();
        baseConfiguration.setProperty("storage.backend", "cassandra");
        baseConfiguration.setProperty("storage.hostname", "192.168.3.82");
        baseConfiguration.setProperty("storage.cassandra.keyspace", "mycustomerdata");
        TitanGraph graph = TitanFactory.open(baseConfiguration);

        //---------------- Adding data -------------------
        // Create some customers
        Vertex alice = graph.addVertex("customer");
        alice.property("name", "Alice Mc Alice");
        alice.property("birthdat", "100000 BC");

        Vertex bob = graph.addVertex("customer");
        bob.property("name", "Bob Mc Bob");
        bob.property("birthdat", "1000 BC");

        // Create some products
        Vertex meat = graph.addVertex("product");
        meat.property("name", "Meat");
        meat.property("description", "Delicious Meat");

        Vertex lettuce = graph.addVertex("product");
        lettuce.property("name", "Lettuce");
        lettuce.property("description", "Delicious Lettuce which is green");

        // Alice bought some meat:
        alice.addEdge("bought", meat);

        // Bob bought some lettuce:
        bob.addEdge("bought", lettuce);

        //---------------- Querying (aka traversing, which is what you do in graph DBs) -------------------
        // Now who has bought meat? (the stored value is "Meat")
        graph.traversal().V().has("name", "Meat").in("bought").forEachRemaining(v -> System.out.println(v.value("name")));

        // Who are all our customers?
        /*graph.traversal().V().hasLabel("customer").forEachRemaining(v -> System.out.println(v.value("name")));
        // What products do we have?
        graph.traversal().V().hasLabel("product").forEachRemaining(v -> System.out.println(v.value("name")));*/

        graph.close();
    }
}
I would like to index the same graph in Solr.
How do I do this using Java?
Do I have to query the tables of the keyspace and index them? What is the approach for having the same graph indexed in Solr?
Titan integrates directly with Solr, which means that you never have to talk to Solr directly. Rather, you let Titan talk to it for you, and this happens naturally whenever you traverse the graph.
All you have to do is set up your indexing as defined here. I provide an example of using a mixed index optimised by Solr/Elasticsearch here.
So, in the above example, whenever you execute certain types of traversals, Titan together with Solr will respond quickly.
Just remember that you have to create a mixed index.
In addition to defining the indices, you also have to get Titan running with Solr. Unfortunately this is not so simple: you have to get Solr running and then get Titan talking to Solr, as I have done here.
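For illustration, a rough sketch of defining such a mixed index through Titan's management API; the property key and index names here are examples, and "search" is an assumption that must match the Solr backend name in your configuration (e.g. index.search.backend=solr):
TitanManagement mgmt = graph.openManagement();
// Define the key to be indexed (if "name" already exists, look it up with mgmt.getPropertyKey("name") instead)
PropertyKey name = mgmt.makePropertyKey("name").dataType(String.class).make();
// "search" must match the Solr indexing backend name from the Titan configuration
mgmt.buildIndex("byNameMixed", Vertex.class).addKey(name).buildMixedIndex("search");
mgmt.commit();
Defining the index before inserting data avoids having to reindex existing elements later.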

Create real-time report using ZoomData (visualization problems)

I'm trying to create a simple real-time report using ZoomData.
I created a DataSource (Upload API) in the ZoomData admin interface and added a visualization to it (vertical bars).
I also disabled all other visualizations for this DS.
My DS has 2 fields:
timestamp - ATTRIBUTE
count - INTEGER AVG
In visualization
group by: timestamp
group by sort: count
y axis: count avg
colors: count avg
Every second I send a POST request to the ZoomData server to add data to the DS.
I do it from Java (I also tried sending it from Postman).
My problem is: the data arrives via POST and is added to the DS, but the visualization properties reset to their defaults:
group by sort: volume
y axis: volume
colors: volume
but "group by" stays timestamp.
I can't understand why the visualization properties always change after data arrives in a POST request.
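For context, the once-a-second upload from Java looks roughly like the sketch below; the endpoint URL and the payload shape are placeholders that depend on how the Upload API source was configured, not values taken from the ZoomData documentation:
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class UploadLoop {
    public static void main(String[] args) {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            try {
                // Placeholder endpoint; the real URL comes from the DataSource configuration
                URL url = new URL("https://zoomdata-host:8443/zoomdata/api/upload/MY_DS");
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setRequestMethod("POST");
                conn.setRequestProperty("Content-Type", "application/json");
                conn.setDoOutput(true);
                String body = "[{\"timestamp\":\"" + System.currentTimeMillis() + "\",\"count\":1}]";
                try (OutputStream os = conn.getOutputStream()) {
                    os.write(body.getBytes(StandardCharsets.UTF_8));
                }
                System.out.println("POST status: " + conn.getResponseCode());
                conn.disconnect();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 0, 1, TimeUnit.SECONDS);
    }
}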

Get cluster assignments in Weka

I have a CSV file as follows:
id,at1,at2,at3
1072,0.5,0.2,0.7
1092,0.2,0.5,0.7
...
I've loaded it into Weka for clustering:
DataSource source = new DataSource("test.csv");
Instances data = source.getDataSet();
kmeans.buildClusterer(data);
Question #1: How do I set the first column as an ID? I.e., how do I ignore the first column for clustering purposes?
I then try to print out the assignments:
int[] assignments = kmeans.getAssignments();
int i = 0;
for (int clusterNum : assignments) {
    System.out.printf("Instance %d -> Cluster %d \n", i, clusterNum);
    i++;
}
This prints:
Instance 1 -> Cluster 0
Instance 2 -> Cluster 2
...
Question #2: How do I refer to the ID when printing out the assignments? For example:
Instance 1072 -> Cluster 0
Instance 1092 -> Cluster 2
I realize this is an old question, but I came here looking for an answer as well and was then able to figure it out myself, so I'm putting my solution here for the next person with this problem. In my case, the clustering component is part of a Java application, so I don't have the option of using the Weka workbench. Here is what I did to pull out the ID along with the cluster assignments.
int[] assignments = kmeans.getAssignments();
for (int i = 0; i < assignments.length; i++) {
    int id = (int) data.instance(i).value(0); // cast from double
    System.out.printf("ID %d -> Cluster %d \n", id, assignments[i]);
}
Unlike the OP, I did not build my Instances from DataSource.getDataSet(); I built them manually from a database table. But the ID field was the first one in my case as well, so I think the code above should work. I had a custom distance function that skipped the ID field when computing similarity.
Your life would be much easier if you used the Windows version of Weka with the GUI.
In the Cluster tab there is a button for ignoring attributes such as the ID.
And for the ID-to-cluster assignments: after you are done with the clustering algorithm of your choice, right-click the result on the left of the screen, then visualize the results and save them.
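If you need to do the same thing in code rather than in the GUI (Question #1), one common approach is to cluster a copy of the data with the id attribute removed and keep the original Instances around for the id lookup. A sketch, assuming SimpleKMeans, the Remove filter from weka.filters.unsupervised.attribute, and the CSV layout above:
Instances data = new DataSource("test.csv").getDataSet();

// Build a copy of the data without the first attribute (the id)
Remove remove = new Remove();
remove.setAttributeIndices("1");
remove.setInputFormat(data);
Instances dataNoId = Filter.useFilter(data, remove);

SimpleKMeans kmeans = new SimpleKMeans();
kmeans.setNumClusters(3);                 // example value
kmeans.setPreserveInstancesOrder(true);   // needed so getAssignments() lines up with the data
kmeans.buildClusterer(dataNoId);

// Report assignments against the original data, where the id column is still present
int[] assignments = kmeans.getAssignments();
for (int i = 0; i < assignments.length; i++) {
    int id = (int) data.instance(i).value(0);
    System.out.printf("ID %d -> Cluster %d%n", id, assignments[i]);
}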

Java heap space errors using bigger amounts of data in neo4j

I am currently evaluating neo4j in terms of inserting big amounts of nodes/relationships into the graph. This is not about initial inserts, which could be achieved with batch inserts, but about inserts that are processed frequently during runtime in a Java application that uses neo4j in embedded mode (currently version 1.8.1, as it is shipped with spring-data-neo4j 2.2.2.RELEASE).
These inserts usually follow a star schema: one single node (the root node of the imported dataset) has up to 1000000 (one million!) connected child nodes. The child nodes normally have relationships to other additional nodes too, but those relationships are not covered by this test so far. The overall goal is to import that amount of data in at most five minutes!
To simulate this kind of insert I wrote a small JUnit test that uses the Neo4jTemplate to create the nodes and relationships. Each inserted leaf has a key associated with it for later processing:
@Test
@Transactional
@Rollback
public void generateUngroupedNode()
{
    long numberOfLeafs = 1000000;
    Assert.assertTrue(this.template.transactionIsRunning());
    Node root = this.template.createNode(map(NAME, UNGROUPED));
    String groupingKey = null;
    for (long index = 0; index < numberOfLeafs; index++)
    {
        // Just a sample division of leafs to possible groups
        // Creates keys to be grouped by to groups containing 2 leafs each
        if (index % 2 == 0)
        {
            groupingKey = UUID.randomUUID().toString();
        }
        Node leaf = this.template.createNode(map(GROUPING_KEY, groupingKey, NAME, LEAF));
        this.template.createRelationshipBetween(root, leaf, Relationships.LEAF.name(), map());
    }
}
For this test I use the gcr cache to avoid Garbage Collector issues:
cache_type=gcr
node_cache_array_fraction=7
relationship_cache_array_fraction=5
node_cache_size=400M
relationship_cache_size=200M
Additionally I set my MAVEN_OPTS to:
export MAVEN_OPTS="-Xmx4096m -Xms2046m -XX:PermSize=256m -XX:MaxPermSize=512m -XX:+UseConcMarkSweepGC -XX:-UseGCOverheadLimit"
But in any case, when running that test I always get a Java heap space error:
java.lang.OutOfMemoryError: Java heap space
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2427)
at java.lang.Class.getMethod0(Class.java:2670)
at java.lang.Class.getMethod(Class.java:1603)
at org.apache.commons.logging.LogFactory.directGetContextClassLoader(LogFactory.java:896)
at org.apache.commons.logging.LogFactory$1.run(LogFactory.java:862)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.commons.logging.LogFactory.getContextClassLoaderInternal(LogFactory.java:859)
at org.apache.commons.logging.LogFactory.getFactory(LogFactory.java:423)
at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:685)
at org.springframework.transaction.support.TransactionTemplate.<init>(TransactionTemplate.java:67)
at org.springframework.data.neo4j.support.Neo4jTemplate.exec(Neo4jTemplate.java:403)
at org.springframework.data.neo4j.support.Neo4jTemplate.createRelationshipBetween(Neo4jTemplate.java:367)
I did some tests with smaller amounts of data, with the following results for 1 node connected to:
50000 leafs: 3035ms
100000 leafs: 4290ms
200000 leafs: 10268ms
400000 leafs: 20913ms
800000 leafs: Java heap space
Here is a screenshot of the system monitor during those operations:
To get a better impression of what exactly is running and stored on the heap, I ran JProfiler during the last test (800000 leafs). Here are some screenshots:
Heap usage:
CPU usage:
The big question for me is: is neo4j not designed for this kind of huge data volume? Or are there other ways to achieve these kinds of inserts (and later operations)? On the official neo4j website and in various screencasts I found the information that neo4j is able to run with billions of nodes and relationships (e.g. http://docs.neo4j.org/chunked/stable/capabilities-capacity.html). I didn't find any functionality like the flush() and clean() methods that are available e.g. in JPA to keep the heap clean manually.
It would be great to be able to use neo4j with these amounts of data. Already with 200000 leafs stored in the graph I noticed a performance improvement of a factor of 10 or more compared to an embedded classic RDBMS. I don't want to give up the nice way of modeling and querying data that neo4j provides.
By just using the Neo4j core API it takes between 18 and 26 seconds to create the children, without any optimizations on my MacBook Air:
Output: import of 1000000 children took 26 seconds.
public class CreateManyRelationships {

    public static final int COUNT = 1000 * 1000;
    public static final DynamicRelationshipType CHILD = DynamicRelationshipType.withName("CHILD");
    public static final File DIRECTORY = new File("target/test.db");

    public static void main(String[] args) throws IOException {
        FileUtils.deleteRecursively(DIRECTORY);
        GraphDatabaseService gdb = new GraphDatabaseFactory().newEmbeddedDatabase(DIRECTORY.getAbsolutePath());
        long time = System.currentTimeMillis();
        Transaction tx = gdb.beginTx();
        Node root = gdb.createNode();
        for (int i = 1; i <= COUNT; i++) {
            Node child = gdb.createNode();
            root.createRelationshipTo(child, CHILD);
            // Commit in batches of 50000 so the transaction state does not pile up on the heap
            if (i % 50000 == 0) {
                tx.success(); tx.finish();
                tx = gdb.beginTx();
            }
        }
        tx.success(); tx.finish();
        time = System.currentTimeMillis() - time;
        System.out.println("import of " + COUNT + " children took " + time / 1000 + " seconds.");
        gdb.shutdown();
    }
}
And the Spring Data Neo4j docs state that it is not made for this type of task.
If you are connecting 800K child nodes to one node, you are effectively creating a dense node, a.k.a. a key-value-like structure. Neo4j is currently not optimized to handle these structures effectively, as all connected relationships are loaded into memory upon traversal of a node. This will be addressed in Neo4j 2.1 with configurable optimizations, so that only part of the relationships is loaded when touching these structures.
For the time being, I would recommend either putting these structures into indexes instead and doing a lookup for the connected nodes, or balancing the dense structure along one value (e.g. build a subtree with, say, 100 subcategories along one of the properties on the relationships, such as time); see http://docs.neo4j.org/chunked/snapshot/cypher-cookbook-path-tree.html for instance.
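To illustrate the second suggestion, here is a rough sketch of bucketing the leaves under intermediate nodes instead of attaching each one directly to the root; the bucket count, the BUCKET relationship type, and the round-robin assignment are illustrative assumptions, and the transaction batching from the example above still applies:
// Fragment replacing the inner loop of the example above:
int bucketCount = 100;                        // assumed number of subcategories
DynamicRelationshipType BUCKET = DynamicRelationshipType.withName("BUCKET");
Node[] buckets = new Node[bucketCount];
for (int b = 0; b < bucketCount; b++) {
    buckets[b] = gdb.createNode();
    root.createRelationshipTo(buckets[b], BUCKET);
}
for (int i = 1; i <= COUNT; i++) {
    Node child = gdb.createNode();
    // pick a bucket; in practice a real key such as a time property would decide this
    buckets[i % bucketCount].createRelationshipTo(child, CHILD);
}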
Would that help?
