Evaluation of precomputed clustering using ELKI in Java


I have already computed clusters and want to use the ELKI library only to evaluate this clustering.
My data is in this form:
0.234 0.923 cluster_1 true_cluster1
0.543 0.874 cluster_2 true_cluster3
...
I tried to:
Create 2 databases: with result labels and with reference labels:
double [][] data;
String [] reference_labels, result_labels;
DatabaseConnection dbc1 = new ArrayAdapterDatabaseConnection(data, result_labels);
Database db1 = new StaticArrayDatabase(dbc1, null);
DatabaseConnection dbc2 = new ArrayAdapterDatabaseConnection(data, reference_labels);
Database db2 = new StaticArrayDatabase(dbc2, null);
Perform ByLabel Clustering for each database:
Clustering<Model> clustering1 = new ByLabelClustering().run(db1);
Clustering<Model> clustering2 = new ByLabelClustering().run(db2);
Use ClusterContingencyTable for comparing clusterings and getting measures:
ClusterContingencyTable ct = new ClusterContingencyTable(true, false);
ct.process(clustering1, clustering2);
PairCounting paircount = ct.getPaircount();
The problem is that the measures are not computed.
I looked into the source code of ContingencyTable and PairCounting, and it seems that this won't work if the clusterings come from different databases, and a database can have only one labels relation.
Is there a way to do this in ELKI?

You can modify the ByLabelClustering class easily (or implement your own) to only use the first label, or only use the second label; then you can use only one database.
Or use the 3-parameter constructor:
DatabaseConnection dbc1 = new ArrayAdapterDatabaseConnection(data, result_labels, 0);
Database db1 = new StaticArrayDatabase(dbc1, null);
DatabaseConnection dbc2 = new ArrayAdapterDatabaseConnection(data, reference_labels, 0);
Database db2 = new StaticArrayDatabase(dbc2, null);
so that the DBIDs are the same. Then ClusterContingencyTable should work.
By default, ELKI would continue enumerating objects, so the first database would have IDs 1..n, and the second n+1..2n. But in order to compare clusterings, they need to contain the same objects, not disjoint sets.
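To make the last point concrete, here is a minimal plain-Java sketch of pair counting, the kind of comparison that pair-counting measures are built on. It is not ELKI code: the class and method names are hypothetical, and the array index stands in for the shared object ID. Each pair of objects is looked up in both labelings, which only makes sense if both labelings describe the same objects.

```java
// Plain-Java sketch of pair counting between two labelings of the SAME objects.
// Not ELKI code: index i plays the role of the shared object ID.
final class PairCountingSketch {

  /** Returns the pair counts {both, firstOnly, secondOnly, neither}. */
  static long[] countPairs(String[] first, String[] second) {
    if (first.length != second.length) {
      throw new IllegalArgumentException("Both labelings must cover the same objects");
    }
    long both = 0, firstOnly = 0, secondOnly = 0, neither = 0;
    for (int i = 0; i < first.length; i++) {
      for (int j = i + 1; j < first.length; j++) {
        // The same pair (i, j) is checked in both labelings.
        boolean sameInFirst = first[i].equals(first[j]);
        boolean sameInSecond = second[i].equals(second[j]);
        if (sameInFirst && sameInSecond) {
          both++;
        } else if (sameInFirst) {
          firstOnly++;
        } else if (sameInSecond) {
          secondOnly++;
        } else {
          neither++;
        }
      }
    }
    return new long[] { both, firstOnly, secondOnly, neither };
  }

  /** Rand index: fraction of pairs on which the two labelings agree. */
  static double randIndex(String[] first, String[] second) {
    long[] c = countPairs(first, second);
    long total = c[0] + c[1] + c[2] + c[3];
    return total == 0 ? 1.0 : (double) (c[0] + c[3]) / total;
  }

  public static void main(String[] args) {
    String[] result = { "cluster_1", "cluster_1", "cluster_2", "cluster_2" };
    String[] reference = { "true_1", "true_1", "true_1", "true_2" };
    System.out.println(java.util.Arrays.toString(countPairs(result, reference)));
    System.out.println(randIndex(result, reference));
  }
}
```

If the two labelings referred to disjoint ID ranges (1..n and n+1..2n), no pair would exist in both of them, so there would be nothing to count.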

Related

How can I create a table using Mybatis and SQLite?

I am trying to create a new database and new table using Mybatis and SQLite. I found from previous answers (1, 2, 3) that Mybatis does support using CREATE and ALTER statements, by marking them as "UPDATE" within Mybatis mapper syntax. However, those questions/answers were using Mapper XML whereas I'm using annotations, and also none were using SQLite.
SQLite creates a new database as soon as you open a new connection to it, so it doesn't matter if the DB exists before or not. A new database is created with a size of zero bytes, which is fine (SQLite treats a 0 byte file as an empty database). But after the table creation I would expect the database size to be non-zero as it stores the table structure for that table. After running my code which I think should create the table (I'm checking my syntax against this answer), the database size still reads as 0 bytes, which says to me that the table has not actually been created. What am I doing wrong?
My Java code to test this scenario:
public class Example {
public static void main(String[] args) {
String userHomePath = System.getProperty("user.home");
File exampleDb = new File(userHomePath, "example.sqlite3");
String jdbcConnectionString = "jdbc:sqlite:" + exampleDb.getAbsolutePath();
DataSource dataSource = new PooledDataSource("org.sqlite.JDBC", jdbcConnectionString, null, null);
Environment environment = new Environment("Main", new JdbcTransactionFactory(), dataSource);
Configuration configuration = new Configuration(environment);
configuration.addMapper(GenericMapper.class);
SqlSessionFactoryBuilder builder = new SqlSessionFactoryBuilder();
SqlSessionFactory sessionFactory = builder.build(configuration);
try (SqlSession session = sessionFactory.openSession()) {
GenericMapper genericMapper = session.getMapper(GenericMapper.class);
genericMapper.createExampleTableIfMissing();
}
}
}
My mapper:
public interface GenericMapper {
@Update("CREATE TABLE IF NOT EXISTS extbl (id INTEGER PRIMARY KEY AUTOINCREMENT)")
void createExampleTableIfMissing();
}
Checking the file after this code has run:
C:\Users\me>dir example.sqlite3
Volume in drive C is Windows
Volume Serial Number is D4DE-B46A
Directory of C:\Users\me
12/04/2021 18:14 0 example.sqlite3
1 File(s) 0 bytes
0 Dir(s) 27,326,779,392 bytes free
C:\Users\me>

How to automate function according to Array list size

Sorry, the question title is not entirely accurate, so let me explain my scenario here.
I created a function that merges 4 data sets into one return format, because that is the format the front end needs. This is working fine now.
public ReturnFormat makeThribleLineChart(List<NameCountModel> totalCount, List<NameCountModel> p1Count, List<NameCountModel> p2Count, List<NameCountModel> average) {
ReturnFormat returnFormat = new ReturnFormat(null,null);
try {
String[] totalData = new String[totalCount.size()];
String[] p1Data = new String[p1Count.size()];
String[] p2Data = new String[p2Count.size()];
String[] averageData = new String[average.size()];
String[] lableList = new String[totalCount.size()];
for (int x = 0; x < totalCount.size(); x++) {
totalData[x] = totalCount.get(x).getCount();
p1Data[x] = p1Count.get(x).getCount();
p2Data[x] = p2Count.get(x).getCount();
averageData[x] = average.get(x).getCount();
lableList[x] = totalCount.get(x).getName();
}
FormatHelper<String[]> totalFormatHelper= new FormatHelper<String[]>();
totalFormatHelper.setData(totalData);
totalFormatHelper.setType("line");
totalFormatHelper.setLabel("Uudet");
totalFormatHelper.setyAxisID("y-axis-1");
FormatHelper<String[]> p1FormatHelper= new FormatHelper<String[]>();
p1FormatHelper.setData(p1Data);
p1FormatHelper.setType("line");
p1FormatHelper.setLabel("P1 päivystykseen heti");
FormatHelper<String[]> p2FormatHelper= new FormatHelper<String[]>();
p2FormatHelper.setData(p2Data);
p2FormatHelper.setType("line");
p2FormatHelper.setLabel("P2 päivystykseen muttei yöllä");
FormatHelper<String[]> averageFormatHelper= new FormatHelper<String[]>();
averageFormatHelper.setData(averageData);
averageFormatHelper.setType("line");
averageFormatHelper.setLabel("Jonotusaika keskiarvo");
averageFormatHelper.setyAxisID("y-axis-2");
List<FormatHelper<String[]>> formatHelpObj = new ArrayList<FormatHelper<String[]>>();
formatHelpObj.add(totalFormatHelper);
formatHelpObj.add(p1FormatHelper);
formatHelpObj.add(p2FormatHelper);
formatHelpObj.add(averageFormatHelper);
returnFormat.setData(formatHelpObj);
returnFormat.setLabels(lableList);
returnFormat.setMessage(Messages.Success);
returnFormat.setStatus(ReturnFormat.Status.SUCCESS);
} catch (Exception e) {
returnFormat.setData(null);
returnFormat.setMessage(Messages.InternalServerError);
returnFormat.setStatus(ReturnFormat.Status.ERROR);
}
return returnFormat;
}
As you can see, all the formatting is hardcoded. My question is how to make this code work for any number of lists. Suppose that next time I have to create chart formatting for five datasets; I would have to write another function for it. That is the duplication I want to avoid. I hope you can understand my question.
Thank you.

You're trying to solve the more general problem of composing a result object (in this case ReturnFormat) based on dynamic information. In addition, some metadata is set up along with each dataset: the type, label, and so on. In the example that you've posted, you've hardcoded the relationship between a dataset and this metadata, but you need some way to establish this relationship dynamically if you have a variable number of parameters.
Therefore, you have a couple of options:
Make makeThribleLineChart a varargs method to accept a variable number of parameters representing your data. Now you have the problem of associating metadata with your parameters - best option is probably to wrap the data and metadata together in some new object that is provided as each param of makeThribleLineChart.
So you'll end up with a signature that looks a bit like ReturnFormat makeThribleLineChart(DataMetadataWrapper... allDatasets), where DataMetadataWrapper contains everything required to build one FormatHelper instance.
Use a builder pattern, similar to the collection builders in guava, for example something like so:
class ThribbleLineChartBuilder {
List<FormatHelper<String[]>> formatHelpObj = new ArrayList<>();
ThribbleLineChartBuilder addDataSet(String describeType, String label, String yAxisId, List<NameCountModel> data) {
String[] dataArray = ... ; // build your array of data
FormatHelper<String[]> formatHelper = new FormatHelper<String[]>();
formatHelper.setData(dataArray);
formatHelper.setType(describeType);
... // set any other parameters that the FormatHelper requires here
formatHelpObj.add(formatHelper);
return this;
}
ReturnFormat build() {
ReturnFormat returnFormat = new ReturnFormat(null, null);
returnFormat.setData(this.formatHelpObj);
... // setup any other fields you need in ReturnFormat
return returnFormat;
}
}
// usage:
new ThribbleLineChartBuilder()
.addDataSet("line", "Uudet", "y-axis-1", totalCount)
.addDataSet("line", "P1 päivystykseen heti", null, p1Count)
... // setup your other data sources
.build()
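For comparison, the first option (a varargs method taking one wrapper object per dataset) could be sketched as follows. This is only an illustration under assumptions: Point, DataSet, Series and Chart are simplified, hypothetical stand-ins for the question's NameCountModel, the suggested DataMetadataWrapper, FormatHelper<String[]> and ReturnFormat, whose real definitions are not shown in the question.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the varargs option. Every type below is a simplified, hypothetical
// stand-in for the classes used in the question.
final class VarargsChartSketch {

  /** Stand-in for NameCountModel: one (name, count) entry. */
  static final class Point {
    final String name;
    final String count;
    Point(String name, String count) { this.name = name; this.count = count; }
  }

  /** Wraps one dataset together with the metadata for its series. */
  static final class DataSet {
    final String type;
    final String label;
    final String yAxisId; // may be null, as for the P1/P2 series in the question
    final List<Point> points;
    DataSet(String type, String label, String yAxisId, List<Point> points) {
      this.type = type; this.label = label; this.yAxisId = yAxisId; this.points = points;
    }
  }

  /** Stand-in for FormatHelper<String[]>. */
  static final class Series {
    final String type;
    final String label;
    final String yAxisId;
    final String[] data;
    Series(String type, String label, String yAxisId, String[] data) {
      this.type = type; this.label = label; this.yAxisId = yAxisId; this.data = data;
    }
  }

  /** Stand-in for ReturnFormat. */
  static final class Chart {
    final String[] labels;
    final List<Series> series;
    Chart(String[] labels, List<Series> series) { this.labels = labels; this.series = series; }
  }

  /** Builds one series per dataset, however many are passed. */
  static Chart makeLineChart(DataSet... dataSets) {
    List<Series> series = new ArrayList<>();
    String[] labels = new String[0];
    for (int d = 0; d < dataSets.length; d++) {
      List<Point> points = dataSets[d].points;
      String[] data = new String[points.size()];
      for (int x = 0; x < data.length; x++) {
        data[x] = points.get(x).count;
      }
      if (d == 0) {
        // As in the original method, the axis labels come from the first dataset.
        labels = new String[points.size()];
        for (int x = 0; x < labels.length; x++) {
          labels[x] = points.get(x).name;
        }
      }
      series.add(new Series(dataSets[d].type, dataSets[d].label, dataSets[d].yAxisId, data));
    }
    return new Chart(labels, series);
  }
}
```

A call with five datasets is then just makeLineChart(ds1, ds2, ds3, ds4, ds5), with no new method needed. Unlike the original, each array here is sized from its own list, so datasets of different lengths do not cause an index error.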

Query on my data datastore that mixes a filter StContainsFilter and FilterPredicate

Sorry for my English.
I'm working on an Android application that stores data in Google Cloud Datastore. I want to run a query on my datastore that combines StContainsFilter and FilterPredicate, but it does not work. Here is my code:
DatastoreService service = DatastoreServiceFactory.getDatastoreService();
Query q = new Query("utilisateurs");
Query.Filter filtrage1 = new Query.FilterPredicate("sexe", Query.FilterOperator.EQUAL, "M");
Query.Filter filtrage2 = new Query.FilterPredicate("datenaissance", Query.FilterOperator.LESS_THAN_OR_EQUAL, datemin);
Query.Filter filtrage3 = new Query.FilterPredicate("datenaissance", Query.FilterOperator.GREATER_THAN_OR_EQUAL, datemax);
GeoPt center = new GeoPt(Float.parseFloat(lat), Float.parseFloat(lng));
double radius = km*1000;
Query.Filter filtrage4 = new Query.StContainsFilter("location", new GeoRegion.Circle(center, radius));
Query.Filter present = Query.CompositeFilterOperator.and(filtrage2,filtrage3,filtrage1,filtrage4);
q.setFilter(present);
PreparedQuery pq = service.prepare(q);
List<Entity> results = pq.asList(FetchOptions.Builder.withDefaults());

To mix different filters you can use a CompositeFilter. You can read more about Datastore Queries here. With a CompositeFilter, you can connect multiple filters, which then act as one. However, you still must not set inequality filters on more than one property.
To create a CompositeFilter use this syntax:
CompositeFilter nameOfFilter = CompositeFilterOperator.and(Collection<Filter>);
The argument can be a Collection such as a List, or you can pass the filters directly, separated by commas.
Here is an example of how to create a CompositeFilter:
Filter filter1 = new FilterPredicate("someProperty", FilterOperator.EQUAL, someValue);
Filter filtrage4 = new StContainsFilter("location", new GeoRegion.Circle(center, radius));
Filter filtrage2 = new FilterPredicate("datenaissance", Query.FilterOperator.LESS_THAN_OR_EQUAL, datemin);
CompositeFilter filter = CompositeFilterOperator.or(filter1, filtrage4, filtrage2);
Use CompositeFilterOperator.and if you need all filters to apply, and .or if one matching filter is enough.
Technically your solution should work, because StContainsFilter is a direct subclass of Query.Filter. The cause of your problem is a wrong import. Check your imports and change any that contain "repackaged" (I had the same problem).

Using ELKI with Mongodb

Using test cases I was able to see how ELKI can be used directly from Java but now I want to read my data from MongoDB and then use ELKI to cluster geographic (long, lat) data.
I can only cluster data from a CSV file using ELKI. Is it possible to connect de.lmu.ifi.dbs.elki.database.Database with MongoDB? I can see from the java debugger that there is a databaseconnection field in de.lmu.ifi.dbs.elki.database.Database.
I query MongoDB creating POJO for each row and now I want to cluster these objects using ELKI.
It is possible to read data from MongoDB and write it in a CSV file then use ELKI to read that CSV file but I would like to know if there is a simpler solution.
---------FINDINGS_1:
From ELKI - Use List<String> of objects to populate the Database I found that I need to implement de.lmu.ifi.dbs.elki.datasource.DatabaseConnection and specifically override the loadData() method, which returns an instance of MultipleObjectsBundle.
So I think I should wrap a list of POJOs in a MultipleObjectsBundle. Looking at MultipleObjectsBundle, it seems the data should be held in columns. Why is the columns datatype List<List<?>>? Shouldn't it be List<?>, just a list of the items you want to cluster?
I'm a little confused. How is ELKI going to know that it should look at the long and lat of each POJO? Where do I tell ELKI to do this? Using de.lmu.ifi.dbs.elki.data.type.SimpleTypeInformation?
---------FINDINGS_2:
I have tried to use ArrayAdapterDatabaseConnection and I have tried implementing DatabaseConnection. Sorry, I need things explained in very simple terms to understand them.
This is my code for clustering:
int minPts=3;
double eps=0.08;
double[][] data1 = {{-0.197574246, 51.49960695}, {-0.084605692, 51.52128377}, {-0.120973687, 51.53005939}, {-0.156876, 51.49313},
{-0.144228881, 51.51811784}, {-0.1680743, 51.53430039}, {-0.170134484,51.52834133}, { -0.096440751, 51.5073853},
{-0.092754157, 51.50597426}, {-0.122502346, 51.52395143}, {-0.136039674, 51.51991453}, {-0.123616824, 51.52994371},
{-0.127854211, 51.51772703}, {-0.125979294, 51.52635795}, {-0.109006325, 51.5216612}, {-0.12221963, 51.51477076}, {-0.131161087, 51.52505093} };
// ArrayAdapterDatabaseConnection dbcon = new ArrayAdapterDatabaseConnection(data1);
DatabaseConnection dbcon = new MyDBConnection();
ListParameterization params = new ListParameterization();
params.addParameter(de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN.Parameterizer.MINPTS_ID, minPts);
params.addParameter(de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN.Parameterizer.EPSILON_ID, eps);
params.addParameter(DBSCAN.DISTANCE_FUNCTION_ID, EuclideanDistanceFunction.class);
params.addParameter(AbstractDatabase.Parameterizer.DATABASE_CONNECTION_ID, dbcon);
params.addParameter(AbstractDatabase.Parameterizer.INDEX_ID,
RStarTreeFactory.class);
params.addParameter(RStarTreeFactory.Parameterizer.BULK_SPLIT_ID,
SortTileRecursiveBulkSplit.class);
params.addParameter(AbstractPageFileFactory.Parameterizer.PAGE_SIZE_ID, 1000);
Database db = ClassGenericsUtil.parameterizeOrAbort(StaticArrayDatabase.class, params);
db.initialize();
GeneralizedDBSCAN dbscan = ClassGenericsUtil.parameterizeOrAbort(GeneralizedDBSCAN.class, params);
Relation<DoubleVector> rel = db.getRelation(TypeUtil.DOUBLE_VECTOR_FIELD);
Relation<ExternalID> relID = db.getRelation(TypeUtil.EXTERNALID);
DBIDRange ids = (DBIDRange) rel.getDBIDs();
Clustering<Model> result = dbscan.run(db);
int i =0;
for(Cluster<Model> clu : result.getAllClusters()) {
System.out.println("#" + i + ": " + clu.getNameAutomatic());
System.out.println("Size: " + clu.size());
System.out.print("Objects: ");
for(DBIDIter it = clu.getIDs().iter(); it.valid(); it.advance()) {
DoubleVector v = rel.get(it);
ExternalID exID = relID.get(it);
System.out.print("DoubleVec: ["+v+"]");
System.out.print("ExID: ["+exID+"]");
final int offset = ids.getOffset(it);
System.out.print(" " + offset);
}
System.out.println();
++i;
}
The ArrayAdapterDatabaseConnection produces two clusters; I just had to play around with the value of epsilon. When I set epsilon=0.008, DBSCAN started creating clusters. When I set epsilon=0.04, all the items ended up in one cluster.
I have also tried to implement DatabaseConnection:
@Override
public MultipleObjectsBundle loadData() {
MultipleObjectsBundle bundle = new MultipleObjectsBundle();
List<Station> stations = getStations();
List<DoubleVector> vecs = new ArrayList<DoubleVector>();
List<ExternalID> ids = new ArrayList<ExternalID>();
for (Station s : stations){
String strID = Integer.toString(s.getId());
ExternalID i = new ExternalID(strID);
ids.add(i);
double[] st = {s.getLongitude(), s.getLatitude()};
DoubleVector dv = new DoubleVector(st);
vecs.add(dv);
}
SimpleTypeInformation<DoubleVector> type = new VectorFieldTypeInformation<>(DoubleVector.FACTORY, 2, 2, DoubleVector.FACTORY.getDefaultSerializer());
bundle.appendColumn(type, vecs);
bundle.appendColumn(TypeUtil.EXTERNALID, ids);
return bundle;
}
These long/lat values are associated with an ID, and I need to link the clustered points back to that ID. Is using the ID offset (as in the code above) the only way to do that? I have tried adding an ExternalID column, but I don't know how to retrieve the ExternalID for a particular NumberVector.
Also, after seeing Using ELKI's Distance Function, I tried to use ELKI's longLatDistance, but it doesn't work, and I could not find any examples showing how to use it.

The interface for data sources is called DatabaseConnection.
JavaDoc of DatabaseConnection
You can implement a MongoDB-based version of this interface to get the data.
It is not a complicated interface; it has a single method.
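What that single method has to return is a bundle organized by column rather than by object. The following plain-Java sketch illustrates the idea only; it does not use ELKI's API, and ColumnBundleSketch and Station are hypothetical stand-ins (Station mirrors the POJO in the question's loadData() attempt). Each column is one list, and row i of every column describes the same object, which is why the container is a list of lists and why an ID column can be looked up with the same row as a vector.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of a column-oriented bundle. This is not ELKI's API;
// ColumnBundleSketch and Station are hypothetical stand-ins.
final class ColumnBundleSketch {

  /** Stand-in for the POJO read from MongoDB. */
  static final class Station {
    final int id;
    final double longitude;
    final double latitude;
    Station(int id, double longitude, double latitude) {
      this.id = id; this.longitude = longitude; this.latitude = latitude;
    }
  }

  /** One list per column; row i of every column describes the same object. */
  private final List<List<?>> columns = new ArrayList<>();

  void appendColumn(List<?> column) {
    if (!columns.isEmpty() && columns.get(0).size() != column.size()) {
      throw new IllegalArgumentException("All columns must have the same number of rows");
    }
    columns.add(column);
  }

  Object get(int row, int column) {
    return columns.get(column).get(row);
  }

  /** Splits a list of POJOs into a vector column and an ID column. */
  static ColumnBundleSketch fromStations(List<Station> stations) {
    List<double[]> vectors = new ArrayList<>();
    List<String> ids = new ArrayList<>();
    for (Station s : stations) {
      vectors.add(new double[] { s.longitude, s.latitude });
      ids.add(Integer.toString(s.id));
    }
    ColumnBundleSketch bundle = new ColumnBundleSketch();
    bundle.appendColumn(vectors); // column 0: the data to cluster
    bundle.appendColumn(ids);     // column 1: the external IDs
    return bundle;
  }
}
```

In this layout, picking which fields to cluster on is a matter of which column they are placed in, not something the clustering algorithm has to discover inside the POJO.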

Updating mongodb with java driver takes forever?

Here is the situation: I have a program that takes two large CSV files, finds the diffs, and then sends an array list to a method that is supposed to update MongoDB with the lines from the list. The problem is that the updates are taking forever. A test case with 5000 updates takes 36 minutes. Is this normal?
The update(List<String> changes) method looks something like this:
mongoClient = new MongoClient(ip);
db = mongoClient.getDB("foo");
collection = db.getCollection("bar");
//for each line of change
for (String s : changes) {
//splits the csv-lines on ;
String[] fields = s.split(";");
//identifies which document in the database is to be updated
long id = Long.parseLong(fields[0]);
BasicDBObject sq = new BasicDBObject().append("organizationNumber",id);
//creates a new unit-object, that is converted to JSON and then inserted into the database.
Unit u = new Unit(fields);
Gson gson = new Gson();
String jsonObj = gson.toJson(u);
DBObject objectToUpdate = collection.findOne(sq);
DBObject newObject = (DBObject) JSON.parse(jsonObj);
if(objectToUpdate != null){
objectToUpdate.putAll(newObject);
collection.save(objectToUpdate);
}
}

That's because you are taking extra steps to update.
You don't need to parse JSONs manually and you don't have to do the query-then-update when you can just do an update with a "where" clause in a single step.
Something like this:
BasicDBObject query = new BasicDBObject().append("organizationNumber", id);
Unit unit = new Unit(fields);
BasicDBObject unitDB = new BasicDBObject().append("someField", unit.getSomeField()).append("otherField", unit.getOtherField());
// "$set" changes only the listed fields; passing unitDB directly would replace the whole matched document
collection.update(query, new BasicDBObject("$set", unitDB));
Where query specifies the "where" clause and unitDB specifies the fields that need to be updated.
