Using ELKI with MongoDB - Java

Using test cases, I was able to see how ELKI can be used directly from Java, but now I want to read my data from MongoDB and then use ELKI to cluster geographic (long, lat) data.
So far I can only cluster data from a CSV file using ELKI. Is it possible to connect de.lmu.ifi.dbs.elki.database.Database with MongoDB? I can see from the Java debugger that there is a databaseconnection field in de.lmu.ifi.dbs.elki.database.Database.
I query MongoDB, creating a POJO for each row, and now I want to cluster these objects using ELKI.
It is possible to read the data from MongoDB, write it to a CSV file, and then have ELKI read that CSV file, but I would like to know if there is a simpler solution.
---------FINDINGS_1:
From "ELKI - Use List<String> of objects to populate the Database" I found that I need to implement de.lmu.ifi.dbs.elki.datasource.DatabaseConnection and specifically override the loadData() method, which returns an instance of MultipleObjectsBundle.
So I think I should wrap my list of POJOs in a MultipleObjectsBundle. Now I'm looking at MultipleObjectsBundle and it looks like the data should be held in columns. Why is the columns data type List<List<?>>? Shouldn't it be a plain List, just a list of the items you want to cluster?
I'm a little confused. How is ELKI going to know that it should look at the long and lat of the POJO? Where do I tell ELKI to do this? Using de.lmu.ifi.dbs.elki.data.type.SimpleTypeInformation?
---------FINDINGS_2:
I have tried to use ArrayAdapterDatabaseConnection and I have tried implementing DatabaseConnection. Sorry, I need things explained in very simple terms for me to understand.
This is my code for clustering:
int minPts = 3;
double eps = 0.08;
double[][] data1 = {
    {-0.197574246, 51.49960695}, {-0.084605692, 51.52128377}, {-0.120973687, 51.53005939}, {-0.156876, 51.49313},
    {-0.144228881, 51.51811784}, {-0.1680743, 51.53430039}, {-0.170134484, 51.52834133}, {-0.096440751, 51.5073853},
    {-0.092754157, 51.50597426}, {-0.122502346, 51.52395143}, {-0.136039674, 51.51991453}, {-0.123616824, 51.52994371},
    {-0.127854211, 51.51772703}, {-0.125979294, 51.52635795}, {-0.109006325, 51.5216612}, {-0.12221963, 51.51477076},
    {-0.131161087, 51.52505093} };

// ArrayAdapterDatabaseConnection dbcon = new ArrayAdapterDatabaseConnection(data1);
DatabaseConnection dbcon = new MyDBConnection();

// Parameterize the data source, the R*-tree index, and DBSCAN (minPts, epsilon, distance).
ListParameterization params = new ListParameterization();
params.addParameter(de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN.Parameterizer.MINPTS_ID, minPts);
params.addParameter(de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN.Parameterizer.EPSILON_ID, eps);
params.addParameter(DBSCAN.DISTANCE_FUNCTION_ID, EuclideanDistanceFunction.class);
params.addParameter(AbstractDatabase.Parameterizer.DATABASE_CONNECTION_ID, dbcon);
params.addParameter(AbstractDatabase.Parameterizer.INDEX_ID, RStarTreeFactory.class);
params.addParameter(RStarTreeFactory.Parameterizer.BULK_SPLIT_ID, SortTileRecursiveBulkSplit.class);
params.addParameter(AbstractPageFileFactory.Parameterizer.PAGE_SIZE_ID, 1000);

Database db = ClassGenericsUtil.parameterizeOrAbort(StaticArrayDatabase.class, params);
db.initialize();
GeneralizedDBSCAN dbscan = ClassGenericsUtil.parameterizeOrAbort(GeneralizedDBSCAN.class, params);

Relation<DoubleVector> rel = db.getRelation(TypeUtil.DOUBLE_VECTOR_FIELD);
Relation<ExternalID> relID = db.getRelation(TypeUtil.EXTERNALID);
DBIDRange ids = (DBIDRange) rel.getDBIDs();

Clustering<Model> result = dbscan.run(db);
int i = 0;
for (Cluster<Model> clu : result.getAllClusters()) {
  System.out.println("#" + i + ": " + clu.getNameAutomatic());
  System.out.println("Size: " + clu.size());
  System.out.print("Objects: ");
  for (DBIDIter it = clu.getIDs().iter(); it.valid(); it.advance()) {
    DoubleVector v = rel.get(it);
    ExternalID exID = relID.get(it);
    System.out.print("DoubleVec: [" + v + "]");
    System.out.print("ExID: [" + exID + "]");
    final int offset = ids.getOffset(it);
    System.out.print(" " + offset);
  }
  System.out.println();
  ++i;
}
The ArrayAdapterDatabaseConnection produces two clusters; I just had to play around with the value of epsilon. When I set epsilon=0.008, DBSCAN started creating clusters; when I set epsilon=0.04, all the items ended up in one cluster.
I have also tried to implement DatabaseConnection:
@Override
public MultipleObjectsBundle loadData() {
  MultipleObjectsBundle bundle = new MultipleObjectsBundle();
  List<Station> stations = getStations();
  List<DoubleVector> vecs = new ArrayList<DoubleVector>();
  List<ExternalID> ids = new ArrayList<ExternalID>();
  for (Station s : stations) {
    String strID = Integer.toString(s.getId());
    ExternalID i = new ExternalID(strID);
    ids.add(i);
    double[] st = { s.getLongitude(), s.getLatitude() };
    DoubleVector dv = new DoubleVector(st);
    vecs.add(dv);
  }
  SimpleTypeInformation<DoubleVector> type = new VectorFieldTypeInformation<>(DoubleVector.FACTORY, 2, 2, DoubleVector.FACTORY.getDefaultSerializer());
  bundle.appendColumn(type, vecs);
  bundle.appendColumn(TypeUtil.EXTERNALID, ids);
  return bundle;
}
These long/lat values are associated with an ID, and I need to link the clustered values back to that ID. Is using the ID offset (as in the code above) the only way to do that? I have tried to add an ExternalID column, but I don't know how to retrieve the ExternalID for a particular NumberVector.
Also, after seeing "Using ELKI's Distance Function", I tried to use ELKI's longLatDistance, but it doesn't work and I could not find any examples of how to use it.

The interface for data sources is called DatabaseConnection.
JavaDoc of DatabaseConnection
You can implement a MongoDB-based version of this interface to get the data.
It is not a complicated interface; it has a single method.
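For illustration only, here is a minimal sketch of such a connection, assuming the newer MongoDB Java driver (MongoClients) and a collection named stations whose documents have numeric longitude/latitude fields; the database, collection, and field names are placeholders, while the ELKI types mirror the loadData() code shown above:
import java.util.ArrayList;
import java.util.List;
import org.bson.Document;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import de.lmu.ifi.dbs.elki.data.DoubleVector;
import de.lmu.ifi.dbs.elki.data.ExternalID;
import de.lmu.ifi.dbs.elki.data.type.SimpleTypeInformation;
import de.lmu.ifi.dbs.elki.data.type.TypeUtil;
import de.lmu.ifi.dbs.elki.data.type.VectorFieldTypeInformation;
import de.lmu.ifi.dbs.elki.datasource.DatabaseConnection;
import de.lmu.ifi.dbs.elki.datasource.bundle.MultipleObjectsBundle;

public class MongoDBDatabaseConnection implements DatabaseConnection {
  @Override
  public MultipleObjectsBundle loadData() {
    List<DoubleVector> vecs = new ArrayList<>();
    List<ExternalID> ids = new ArrayList<>();
    try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
      MongoCollection<Document> coll = client.getDatabase("mydb").getCollection("stations");
      for (Document doc : coll.find()) {
        // One 2-d vector per document; the ExternalID keeps the link back to the MongoDB _id.
        double[] lngLat = { doc.getDouble("longitude"), doc.getDouble("latitude") };
        vecs.add(new DoubleVector(lngLat));
        ids.add(new ExternalID(doc.get("_id").toString()));
      }
    }
    SimpleTypeInformation<DoubleVector> type = new VectorFieldTypeInformation<>(
        DoubleVector.FACTORY, 2, 2, DoubleVector.FACTORY.getDefaultSerializer());
    MultipleObjectsBundle bundle = new MultipleObjectsBundle();
    bundle.appendColumn(type, vecs);
    bundle.appendColumn(TypeUtil.EXTERNALID, ids);
    return bundle;
  }
}
Passing an instance of this connection where the code above uses new MyDBConnection() leaves the rest of the clustering pipeline unchanged.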

Related

cassandra-spring ingest command doesn't work

I've set up a Cassandra cluster and work with the spring-cassandra framework 1.5.3 (http://docs.spring.io/spring-data/cassandra/docs/1.5.3.RELEASE/reference/html/).
I want to write millions of records into my Cassandra cluster. The solution with executeAsync works well, but the "ingest" command from the Spring framework sounds interesting as well.
The ingest method takes advantage of static PreparedStatements that are only prepared once for performance. Each record in your data set is bound to the same PreparedStatement, then executed asynchronously for high performance.
My code:
List<List<?>> session_time_ingest = new ArrayList<List<?>>();
for (Long tokenid : listTokenID) {
  List<Session_Time_Table> tempListSessionTimeTable = repo_session_time.listFetchAggregationResultMinMaxTime(tokenid);
  session_time_ingest.add(tempListSessionTimeTable);
}
cassandraTemplate.ingest("INSERT into session_time (sessionid, username, eserviceid, contextroot," +
    " application_type, min_processingtime, max_processingtime, min_requesttime, max_requesttime)" +
    " VALUES(?,?,?,?,?,?,?,?,?)", session_time_ingest);
Throws exception:
Exception in thread "main" com.datastax.driver.core.exceptions.CodecNotFoundException: Codec not found for requested operation: [varchar <-> ...tracking.Tables.Session_Time_Table]
    at com.datastax.driver.core.CodecRegistry.notFound(CodecRegistry.java:679)
    at com.datastax.driver.core.CodecRegistry.createCodec(CodecRegistry.java:540)
    at com.datastax.driver.core.CodecRegistry.findCodec(CodecRegistry.java:520)
    at com.datastax.driver.core.CodecRegistry.codecFor(CodecRegistry.java:470)
    at com.datastax.driver.core.AbstractGettableByIndexData.codecFor(AbstractGettableByIndexData.java:77)
    at com.datastax.driver.core.BoundStatement.bind(BoundStatement.java:201)
    at com.datastax.driver.core.DefaultPreparedStatement.bind(DefaultPreparedStatement.java:126)
    at org.springframework.cassandra.core.CqlTemplate.ingest(CqlTemplate.java:1057)
    at org.springframework.cassandra.core.CqlTemplate.ingest(CqlTemplate.java:1077)
    at org.springframework.cassandra.core.CqlTemplate.ingest(CqlTemplate.java:1068)
    at ...tracking.SessionAggregationApplication.main(SessionAggregationApplication.java:68)
I coded it exactly as in the spring-cassandra documentation. I have no idea how to map the values of my object to the values Cassandra expects.
Your Session_Time_Table class is probably a mapped POJO, but ingest methods do not use POJO mapping.
Instead, you need to provide a matrix where each row contains as many arguments as there are bind variables in your prepared statement, something along the lines of:
List<List<?>> rows = new ArrayList<List<?>>();
for (Long tokenid : listTokenID) {
  Session_Time_Table obj = ... // obtain a Session_Time_Table instance
  List<Object> row = new ArrayList<Object>();
  row.add(obj.sessionid);
  row.add(obj.username);
  row.add(obj.eserviceid);
  // etc. for all bound variables
  rows.add(row);
}
cassandraTemplate.ingest(
    "INSERT into session_time (sessionid, username, eserviceid, " +
    "contextroot, application_type, min_processingtime, " +
    "max_processingtime, min_requesttime, max_requesttime) " +
    "VALUES(?,?,?,?,?,?,?,?,?)", rows);

Translating values contained in a javax Response type to a list to be formatted into a JSON array

So my question might be a bit silly to some of you, but I am querying for some data that must be returned as a Response. I then have to use parts of that data in the front end of my application to graph it using AngularJS and NVD3 charts. To format the data correctly for the graphing tool, I must translate it into the correct JSON format. I could find no direct way to pull the numbers I needed out of the returned Response, so I need to take just the values I need and translate them into a list that can then be parsed into a JSON array. The following is my workaround, and it works, giving me the list I am looking for...
if (tableState.getIdentifier().getProperty().equals("backupSize")) {
  Response test4 = timeSeriesQuery.queryData("backup.data.size,", "", "1y-ago", "25", "desc");
  String test5 = test4.getEntity().toString();
  int test6 = test5.indexOf("value");
  int charIndexStart = test6 + 9;
  int charIndexEnd = test5.indexOf(",", test6);
  String test7 = test5.substring(charIndexStart, charIndexEnd);
  int charIndexStart2 = test5.indexOf(",", charIndexEnd);
  int charIndexEnd2 = test5.indexOf(",", charIndexStart2 + 2);
  String test9 = test5.substring(charIndexStart2 + 1, charIndexEnd2);
  long test8 = Long.parseLong(test7);
  long test10 = Long.parseLong(test9);
  List<Long> graphs = new ArrayList<>();
  graphs.add(test8);
  graphs.add(test10);
  List<List<Long>> graphs2 = new ArrayList<List<Long>>();
  graphs2.add(graphs);
  for (int i = 1, charEnd = charIndexEnd2; i < 24; i++) {
    int nextCharStart = test5.indexOf("}", charEnd) + 2;
    int nextCharEnd = test5.indexOf(",", nextCharStart);
    String test11 = test5.substring(nextCharStart + 1, nextCharEnd);
    int nextCharStart2 = test5.indexOf(",", nextCharEnd) + 1;
    int nextCharEnd2 = test5.indexOf(",", nextCharStart2 + 2);
    String test13 = test5.substring(nextCharStart2, nextCharEnd2);
    long test12 = Long.parseLong(test11);
    long test14 = Long.parseLong(test13);
    List<Long> graphs3 = new ArrayList<>();
    graphs3.add(test12);
    graphs3.add(test14);
    graphs2.add(graphs3);
    charEnd = test5.indexOf("}", nextCharEnd2);
  }
  return graphs2;
Here is the result of test5:
xxx.xx.xxxxxx.entity.timeseries.datapoints.queryresponse.DatapointsResponse#2be02a0c[start=, end=, tags={xxx.xx.xxxxxx.entity.timeseries.datapoints.queryresponse.Tag#1600cd19[name=backup.data.size, results={xxx.xx.xxxxxx.entity.timeseries.datapoints.queryresponse.Results#2b8a61bd[groups={xxx.xx.xxxxxx.entity.timeseries.datapoints.queryresponse.Group#61540dbc[name=type, type=number]}, attributes=xxx.xx.xxxxxx.entity.util.map.Map#4b4eebd0[], values={{1487620485896,973956,3},{1487620454999,973806,3},{1487620424690,956617,3},{1487620397181,938677,3},{1487620368825,934494,3},{1487620339219,926125,3},{1487620309050,917753,3},{1487620279239,909384,3},{1487620251381,872864,3},{1487620222724,846518,3},{1487620196441,832150,3},{1487620168141,819563,3},{1487620142079,787264,3},{1487620115827,787264,3},{1487620091991,787264,3},{1487620067230,787264,3},{1487620042333,787264,3},{1487620018508,787264,3},{1487619994967,787264,3},{1487619973549,778740,3},{1487619950069,770205,3},{1487619926850,749106,3},{1487619902486,740729,3},{1487619877298,728184,3},{1487619851449,719666,3}}]}, stats=xxx.xx.xxxxxx.entity.timeseries.datapoints.queryresponse.Stats#5bb68fa5[rawCount=25]]}]
and the returned list:
[[1487620485896, 973956], [1487620454999, 973806], [1487620424690, 956617], [1487620397181, 938677], [1487620368825, 934494], [1487620339219, 926125], [1487620309050, 917753], [1487620279239, 909384], [1487620251381, 872864], [1487620222724, 846518], [1487620196441, 832150], [1487620168141, 819563], [1487620142079, 787264], [1487620115827, 787264], [1487620091991, 787264], [1487620067230, 787264], [1487620042333, 787264], [1487620018508, 787264], [1487619994967, 787264], [1487619973549, 778740], [1487619950069, 770205], [1487619926850, 749106], [1487619902486, 740729], [1487619877298, 728184]]
I can then take this and shove it into a JSON array (at least I think so; I haven't gotten that far). But this code seems ridiculous, brittle, and not the right way to go about this.
Does anyone have a better way of pulling datapoints out of a Response and translating them into a JSON array, or at least a nested list?
Thank you to anyone who read this, and please let me know if I can provide any more information.
When you want just a few values from a query, the best way to retrieve them is to run the query with a ResultSet and use its metadata:
ResultSet rs = stmt.executeQuery("SELECT a, b, c FROM TABLE2");
ResultSetMetaData rsmd = rs.getMetaData();
String name = rsmd.getColumnName(1);
Taken from here
So you take the columns you need by using the metadata properties, and then the best you can do is use a DTO object to store each row; check this to learn a bit more about DTOs.
Basically, the idea is that you build an object from the data you've retrieved (or just the data you need at that moment) from the database, and you can then use the usual getters and setters to access all the fields.
However, when collecting data you will normally use loops, since you need to iterate over the ResultSet rows, ask for the column names, and keep the values you are interested in.
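As a rough sketch of that idea (the DataPointDto class, the datapoints table, and its column names are invented for the example; this assumes a plain JDBC Statement and ResultSet as above):
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical DTO holding one (timestamp, value) pair.
class DataPointDto {
  private final long timestamp;
  private final long value;

  DataPointDto(long timestamp, long value) {
    this.timestamp = timestamp;
    this.value = value;
  }

  long getTimestamp() { return timestamp; }
  long getValue() { return value; }
}

// Given a java.sql.Statement stmt obtained elsewhere:
List<DataPointDto> points = new ArrayList<>();
try (ResultSet rs = stmt.executeQuery("SELECT ts, backup_size FROM datapoints")) {
  while (rs.next()) {
    // Each row becomes one DTO; no string slicing of a toString() dump.
    points.add(new DataPointDto(rs.getLong("ts"), rs.getLong("backup_size")));
  }
}

// Reshape into the nested list expected by the NVD3 chart.
List<List<Long>> graphs = new ArrayList<>();
for (DataPointDto p : points) {
  graphs.add(Arrays.asList(p.getTimestamp(), p.getValue()));
}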
Hope it helps

Evaluation of precomputed clustering using ELKI in Java

I already have computed clusters and want to use the ELKI library only to evaluate this clustering.
So I have data in this form:
0.234 0.923 cluster_1 true_cluster1
0.543 0.874 cluster_2 true_cluster3
...
I tried to:
Create two databases: one with the result labels and one with the reference labels:
double [][] data;
String [] reference_labels, result_labels;
DatabaseConnection dbc1 = new ArrayAdapterDatabaseConnection(data, result_labels);
Database db1 = new StaticArrayDatabase(dbc1, null);
DatabaseConnection dbc2 = new ArrayAdapterDatabaseConnection(data, reference_labels);
Database db2 = new StaticArrayDatabase(dbc2, null);
Perform ByLabel Clustering for each database:
Clustering<Model> clustering1 = new ByLabelClustering().run(db1);
Clustering<Model> clustering2 = new ByLabelClustering().run(db2);
Use ClusterContingencyTable for comparing clusterings and getting measures:
ClusterContingencyTable ct = new ClusterContingencyTable(true, false);
ct.process(clustering1, clustering2);
PairCounting paircount = ct.getPaircount();
The problem is that the measures are not computed.
I looked into the source code of ClusterContingencyTable and PairCounting, and it seems that it won't work if the clusterings come from different databases, and a database can have only one labels relation.
Is there a way to do this in ELKI?
You can modify the ByLabelClustering class easily (or implement your own) to only use the first label, or only use the second label; then you can use only one database.
Or you use the 3-parameter constructor:
DatabaseConnection dbc1 = new ArrayAdapterDatabaseConnection(data, result_labels, 0);
Database db1 = new StaticArrayDatabase(dbc1, null);
DatabaseConnection dbc2 = new ArrayAdapterDatabaseConnection(data, reference_labels, 0);
Database db2 = new StaticArrayDatabase(dbc2, null);
so that the DBIDs are the same. Then ClusterContingencyTable should work.
By default, ELKI would continue enumerating objects, so the first database would have IDs 1..n, and the second n+1..2n. But in order to compare clusterings, they need to contain the same objects, not disjoint sets.
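Putting those pieces together, a minimal sketch (the databases need to be initialized before running the clusterings; the measure methods on PairCounting, such as f1Measure() and randIndex(), are quoted from memory and may differ between ELKI versions):
// Fixed start ID 0 for both connections, so the DBIDs of the two databases line up.
DatabaseConnection dbc1 = new ArrayAdapterDatabaseConnection(data, result_labels, 0);
DatabaseConnection dbc2 = new ArrayAdapterDatabaseConnection(data, reference_labels, 0);
Database db1 = new StaticArrayDatabase(dbc1, null);
Database db2 = new StaticArrayDatabase(dbc2, null);
db1.initialize();
db2.initialize();

// Turn the label columns into clusterings.
Clustering<Model> result = new ByLabelClustering().run(db1);
Clustering<Model> reference = new ByLabelClustering().run(db2);

// Compare the two clusterings via pair-counting measures.
ClusterContingencyTable ct = new ClusterContingencyTable(true, false);
ct.process(result, reference);
PairCounting paircount = ct.getPaircount();
System.out.println("Pair-counting F1: " + paircount.f1Measure());
System.out.println("Rand index: " + paircount.randIndex());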

PowerBuilder DataObject to Java

I am doing a task converting a script written in PowerBuilder to Java.
I am struggling with converting the DataStore object to Java.
I have something like this:
lds_appeal_application = Create DataStore
lds_appeal_application.DataObject = "ds_appeal_application_report"
lds_appeal_application.SetTransObject(SQLCA)
ll_row = lds_appeal_application.retrieve(as_ksdyh, adt_start_date, adt_end_date, as_exam_name, as_subject_code)
for ll_rc = 1 to ll_row
    ldt_update_date = lds_appeal_application.GetItemDatetime(ll_rc, "sqsj")
    ls_caseno = trim(lds_appeal_application.GetItemString(ll_rc, "caseno"))
    ls_candidate_no = trim(lds_appeal_application.GetItemString(ll_rc, "zkzh"))
    ls_subjectcode = trim(lds_appeal_application.GetItemString(ll_rc, "kmcode"))
    ls_papercode = trim(lds_appeal_application.GetItemString(ll_rc, "papercode"))
    ls_name = trim(lds_appeal_application.GetItemString(ll_rc, "mc"))
    ll_ksh = lds_appeal_application.GetItemDecimal(ll_rc, "ks_h")
    ll_kmh = lds_appeal_application.GetItemDecimal(ll_rc, "km_h")
Simply speaking, a DataStore is created and a data object backed by a SQL query (ds_appeal_application_report) is attached to it. Finally, a for loop retrieves the information from the result table.
In the Java way of doing this, I use an EntityManager to create a native query, and the query returns a list of object arrays. However, I just don't know how to retrieve the information the way the PowerBuilder DataStore object does.
Please give me some advice. Thanks.
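For illustration, here is a rough sketch of how the retrieve loop might translate to JPA; the table name, the omitted WHERE clause, and the Java types of the columns are assumptions based on the PowerBuilder snippet above, and a native query returns each row as an Object[] in select-list order:
import java.math.BigDecimal;
import java.sql.Timestamp;
import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.Query;

// Given an EntityManager em obtained elsewhere; in practice the retrieve() arguments
// (dates, exam name, subject code) would be bound as query parameters.
Query q = em.createNativeQuery(
    "SELECT sqsj, caseno, zkzh, kmcode, papercode, mc, ks_h, km_h FROM appeal_application");

@SuppressWarnings("unchecked")
List<Object[]> rows = q.getResultList();
for (Object[] row : rows) {
  // Each column is addressed by its position in the SELECT list,
  // much like GetItemDatetime / GetItemString address columns by name.
  Timestamp updateDate = (Timestamp) row[0];
  String caseNo = ((String) row[1]).trim();
  String candidateNo = ((String) row[2]).trim();
  String subjectCode = ((String) row[3]).trim();
  String paperCode = ((String) row[4]).trim();
  String name = ((String) row[5]).trim();
  BigDecimal ksH = (BigDecimal) row[6];
  BigDecimal kmH = (BigDecimal) row[7];
  // ... use the values ...
}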

Why do I insert double/float columns into Cassandra with Hector and get incorrect values in the database?

I have a question about inserting double/float data into Cassandra with Hector:
new Double("13.45")------->13.468259733915328
new Float("64.13") ------->119.87449
This happens when I insert data into Cassandra with Hector:
TestDouble ch = new TestDouble("talend_bj", "localhost:9160");
String family = "talend_1";
ch.ensureColumnFamily(family);

List values = new ArrayList();
values.add(HFactory.createColumn("id", 2, StringSerializer.get(), IntegerSerializer.get()));
values.add(HFactory.createColumn("name", "zhang", StringSerializer.get(), StringSerializer.get()));
values.add(HFactory.createColumn("salary", 13.45, StringSerializer.get(), DoubleSerializer.get()));
ch.insertSuper("14", values, "user1", family, StringSerializer.get(), StringSerializer.get());

StringSerializer se = StringSerializer.get();
MultigetSuperSliceQuery<String, String, String, String> q = me.prettyprint.hector.api.factory.HFactory
    .createMultigetSuperSliceQuery(ch.getKeyspace(), se, se, se, se);
// q.setSuperColumn("user1").setColumnNames("id","name")
q.setKeys("12", "11", "13", "14");
q.setColumnFamily(family);
q.setRange("z", "z", false, 100);
QueryResult<SuperRows<String, String, String, String>> r = q.setColumnNames("user1", "user").execute();
Iterator iter = r.get().iterator();
while (iter.hasNext()) {
  SuperRow superRow = (SuperRow) iter.next();
  SuperSlice s = superRow.getSuperSlice();
  List<HSuperColumn> superColumns = s.getSuperColumns();
  for (HSuperColumn superColumn : superColumns) {
    List<HColumn> columns = superColumn.getColumns();
    System.out.println(DoubleSerializer.get().fromBytes(
        ((String) superColumn.getSubColumnByName("salary").getValue()).getBytes()));
  }
}
You can see I inserted 13.45, but the column value I get back is 13.468259733915328.
You should break the problem in two. After writing, if you defined part of your schema or use the ASSUME keyword in the command-line CLI, view the data in Cassandra to see if it is correct. PlayOrm has this exact unit test (though PlayOrm sits on top of Astyanax, not Hector) and it works just fine. Notice the comparison in the test of -200.23:
https://github.com/deanhiller/playorm/blob/master/input/javasrc/com/alvazan/test/TestColumnSlice.java
Once done, does your data in Cassandra look correct? If so, the issue is in how you read the value back in code; otherwise, it is in the writes.
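For what it's worth, if the data on disk is fine, one thing to look at in the read code above is that it decodes the salary column as a String and then re-encodes it with getBytes(), which mangles the raw double bytes. Here is a sketch of decoding the raw column bytes with the same serializer used for the write; getValueBytes() and fromByteBuffer() are assumed from Hector's serializer API and should be verified against your Hector version:
// Inside the loop over superColumns, decode the raw bytes with DoubleSerializer
// instead of going through StringSerializer and String.getBytes().
HColumn salaryCol = superColumn.getSubColumnByName("salary");
java.nio.ByteBuffer raw = salaryCol.getValueBytes();        // assumed accessor
double salary = DoubleSerializer.get().fromByteBuffer(raw); // same serializer as the write
System.out.println(salary);                                 // expected: 13.45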
