ElasticSearch DateHistogram Aggregation Fill Missing Data - java

I'm trying to use Spring Data Elasticsearch for some aggregations.
Here is my query:
final FilteredQueryBuilder filteredQuery = QueryBuilders.filteredQuery(QueryBuilders.matchAllQuery(),
FilterBuilders.andFilter(FilterBuilders.termFilter("gender", "F"),
FilterBuilders.termFilter("place", "Arizona"),
FilterBuilders.rangeFilter("dob").from(from).to(to)));
final MetricsAggregationBuilder<?> aggregateArtifactcount = AggregationBuilders.sum("delivery")
.field("birth");
final AggregationBuilder<?> dailyDateHistogarm =
AggregationBuilders.dateHistogram(AggregationConstants.DAILY).field("dob")
.interval(DateHistogram.Interval.DAY).subAggregation(aggregateArtifactcount);
final SearchQuery query = new NativeSearchQueryBuilder().withIndices(index).withTypes(type)
.withQuery(filteredQuery).addAggregation(dailyDateHistogarm).build();
return elasticsearchTemplate.query(query, new DailyDeliveryAggregation());
And this is my aggregation result extractor:
public class DailyDeliveryAggregation implements ResultsExtractor<List<DailyDeliverySum>> {
@SuppressWarnings("unchecked")
@Override
public List<DailyDeliverySum> extract(final SearchResponse response) {
final List<DailyDeliverySum> dailyDeliverySum = new ArrayList<DailyDeliverySum>();
final Aggregations aggregations = response.getAggregations();
final DateHistogram daily = aggregations.get(AggregationConstants.DAILY);
final List<DateHistogram.Bucket> buckets = (List<DateHistogram.Bucket>) daily.getBuckets();
for (final DateHistogram.Bucket bucket : buckets) {
final Sum sum = (Sum) bucket.getAggregations().getAsMap().get("delivery");
final int deliverySum = (int) sum.getValue();
final int delivery = (int) bucket.getDocCount();
final String dateString = bucket.getKeyAsText().string();
dailyDeliverySum.add(new DailyDeliverySum(deliverySum, delivery, dateString));
}
return dailyDeliverySum;
}
}
It gives me the correct data, but it doesn't satisfy all my needs.
Suppose I query for a time range of 10 days. If there is no data for a date within that range, that date is missing from the date histogram buckets. Instead, I want the aggregation value and the doc count to default to 0 when no data is available.
Is there any way to do that?

Yes, you can use the "minimum document count" feature of the date_histogram aggregation and set it to 0. That way, you'll also get buckets that don't contain any data:
final AggregationBuilder<?> dailyDateHistogarm =
AggregationBuilders.dateHistogram(AggregationConstants.DAILY)
.field("dob")
.minDocCount(0) // <--- add this line
.interval(DateHistogram.Interval.DAY)
.subAggregation(aggregateArtifactcount);
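Note that min_doc_count=0 only produces empty buckets within the range actually covered by the returned data; if you also need empty buckets at the edges of the queried 10-day window, the date_histogram's extended_bounds setting forces the histogram to cover the whole range. A hedged sketch, assuming your client version exposes extendedBounds on the builder (from and to are the same values used in the range filter):
final AggregationBuilder<?> dailyDateHistogarm =
AggregationBuilders.dateHistogram(AggregationConstants.DAILY)
.field("dob")
.minDocCount(0)
.extendedBounds(from, to) // force buckets across the full queried range
.interval(DateHistogram.Interval.DAY)
.subAggregation(aggregateArtifactcount);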

The example from @Val by itself did not work for me (I'm using the high-level API with Elasticsearch 6.2.x). What did work, though, was telling the aggregation that it should handle missing values as 0:
final AggregationBuilder<?> dailyDateHistogarm =
AggregationBuilders.dateHistogram(AggregationConstants.DAILY)
.field("dob")
.minDocCount(0)
.missing(0)
.interval(DateHistogram.Interval.DAY)
.subAggregation(aggregateArtifactcount);

Related

Unable to perform Ignite SQL query over [CustomKey, CustomValue] cache in Scala.

I am trying to set up a distributed cache using Apache Ignite with Scala.
After setting up the cache, I am able to put and get items when I know the key, but SQL queries of any type always return a cursor with a null iterator.
Here is how I set up my cache (please note that this is done before Ignition.start):
def setupTelemetryCache(): CacheConfiguration[TelemetryKey, TelemetryValue] = {
val dataRegionName = "persistent-region"
val cacheName = "telemetry-cache"
// This object is required to perform SQL queries over custom key object
val queryEntity = new QueryEntity("TelemetryKey", "TelemetryValue")
val fields: util.LinkedHashMap[String, String] = new util.LinkedHashMap[String, String]
fields.put("deviceId", classOf[String].getName)
fields.put("metricName", classOf[String].getName)
fields.put("timestamp", classOf[String].getName)
queryEntity.setFields(fields)
val keyFields: util.HashSet[String] = new util.HashSet[String]()
keyFields.add("deviceId")
keyFields.add("metricName")
keyFields.add("timestamp")
queryEntity.setKeyFields(keyFields)
queryEntity.setIndexes(Collections.emptyList[QueryIndex]())
new CacheConfiguration()
.setName(cacheName)
.setDataRegionName(dataRegionName)
.setCacheMode(CacheMode.PARTITIONED) // Data is split among nodes
.setBackups(1) // each partition has 1 backup
.setIndexedTypes(classOf[String], classOf[TelemetryKey]) // Index by ID
.setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_ASYNC) // Faster, clients do not wait for cache
// synchronization. Consistency issues?
.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL) // Allows transactional query
.setQueryEntities(Collections.singletonList(queryEntity))
}
And this is the code of my TelemetryKey:
case class TelemetryKey private (
@(AffinityKeyMapped @field)
@(QuerySqlField @field)(index = true)
deviceId: String,
@(QuerySqlField @field)(index = false)
metricName: String,
@(QuerySqlField @field)(index = true)
timestamp: String) extends Serializable
And TelemetryValue:
class TelemetryValue private(valueType: ValueTypes.Value, doubleValue: Option[Double],
stringValue: Option[String],
longValue: Option[Long]) extends Serializable
A sample SQL query I have to achieve could be "Select * from CACHE where deviceId = 'dev1234'" and I expect to receive all the Cache.Entry[TelemetryKey, TelemetryValue] of the same deviceId
Here is how I perform the query:
private def sqlQuery(query: SqlQuery[TelemetryKey, TelemetryValue]):
QueryCursor[Cache.Entry[TelemetryKey, TelemetryValue]] = {
cache.query(query)
}
def getEntries(ofDeviceId: String):
QueryCursor[Cache.Entry[TelemetryKey, TelemetryValue]] = {
val q = new SqlQuery[TelemetryKey, TelemetryValue](classOf[TelemetryKey], "deviceId = ?")
sqlQuery(q.setArgs(ofDeviceId))
}
Even when changing the body of the query, I receive a cursor object which is empty. I cannot even perform a "Select *" query.
Thanks for the help
There are two ways to configure indexes and queryable fields.
Annotation based configuration
Your key and value classes need to be annotated with @QuerySqlField as follows:
case class TelemetryKey private (
@(AffinityKeyMapped @field)
@(QuerySqlField @field)(index = true)
deviceId: String,
@(QuerySqlField @field)(index = false)
metricName: String,
@(QuerySqlField @field)(index = true)
timestamp: String) extends Serializable
After indexed and queryable fields are defined, they have to be registered in the SQL engine along with the object types they belong to.
new CacheConfiguration()
.setName(cacheName)
.setDataRegionName(dataRegionName)
.setCacheMode(CacheMode.PARTITIONED)
.setBackups(1)
.setIndexedTypes(classOf[TelemetryKey], classOf[TelemetryValue])
.setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_ASYNC)
.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL)
UPD:
One more thing that should be fixed is your SqlQuery - it should be constructed with the value class rather than the key class:
def getEntries(ofDeviceId: String):
QueryCursor[Cache.Entry[TelemetryKey, TelemetryValue]] = {
val q = new SqlQuery[TelemetryKey, TelemetryValue](classOf[TelemetryValue], "deviceId = ?")
sqlQuery(q.setArgs(ofDeviceId))
}
QueryEntity based approach
val queryEntity = new QueryEntity(classOf[TelemetryKey], classOf[TelemetryValue]);
new CacheConfiguration()
.setName(cacheName)
.setDataRegionName(dataRegionName)
.setCacheMode(CacheMode.PARTITIONED)
.setBackups(1)
.setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_ASYNC)
.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL)
.setQueryEntities(Collections.singletonList(queryEntity))
Long story short, you should supply full JVM class names to QueryEntity.
As in:
val queryEntity = new QueryEntity("com.pany.telemetry.TelemetryKey",
"com.pany.telemetry.TelemetryValue") // or e.g. classOf[TelemetryKey].getName
Ignite needs these to distinguish the various types that can be stored in one cache; it's not decorative - there has to be an exact match.
Better yet, use setIndexedTypes() instead of setQueryEntities(). It allows you to pass classes instead of Strings, and it will scan the annotations you already have.

Translating values contained in a javax Response type to a list to be formatted into a json array

So my question might be a bit silly to some of you, but I am querying for some data that must be returned as a Response, and I then have to use parts of that data in the front end of my application to graph it using AngularJS and nvD3 charts. To format the data correctly for the graphing tool, I must translate it into the right JSON format. I could find no direct way to pull the numbers I needed out of the returned Response, so I take just the values I need and translate them into a list that is then parsed into a JSON array. The following is my workaround, and it works, giving me the list I am looking for:
if (tableState.getIdentifier().getProperty().equals("backupSize")){
Response test4 = timeSeriesQuery.queryData("backup.data.size,", "", "1y-ago", "25", "desc");
String test5 = test4.getEntity().toString();
int test6 = test5.indexOf("value");
int charIndexStart = test6 + 9;
int charIndexEnd = test5.indexOf(",", test6);
String test7 = test5.substring(charIndexStart, charIndexEnd);
int charIndexStart2 = test5.indexOf(",", charIndexEnd);
int charIndexEnd2 = test5.indexOf(",", charIndexStart2 + 2);
String test9 = test5.substring(charIndexStart2 + 1, charIndexEnd2);
long test8 = Long.parseLong(test7);
long test10 = Long.parseLong(test9);
List<Long> graphs = new ArrayList<>();
graphs.add(test8);
graphs.add(test10);
List<List<Long>> graphs2 = new ArrayList<List<Long>>();
graphs2.add(graphs);
for(int i=1, charEnd = charIndexEnd2; i<24; i++){
int nextCharStart = test5.indexOf("}", charEnd) + 2;
int nextCharEnd = test5.indexOf(",", nextCharStart);
String test11 = test5.substring(nextCharStart + 1, nextCharEnd);
int nextCharStart2 = test5.indexOf(",", nextCharEnd) + 1;
int nextCharEnd2 = test5.indexOf(",", nextCharStart2 + 2);
String test13 = test5.substring(nextCharStart2, nextCharEnd2);
long test12 = Long.parseLong(test11);
long test14 = Long.parseLong(test13);
List<Long> graphs3 = new ArrayList<>();
graphs3.add(test12);
graphs3.add(test14);
graphs2.add(graphs3);
charEnd = test5.indexOf("}", nextCharEnd2);
} return graphs2;
here is the result of test5:
xxx.xx.xxxxxx.entity.timeseries.datapoints.queryresponse.DatapointsResponse#2be02a0c[start=, end=, tags={xxx.xx.xxxxxx.entity.timeseries.datapoints.queryresponse.Tag#1600cd19[name=backup.data.size, results={xxx.xx.xxxxxx.entity.timeseries.datapoints.queryresponse.Results#2b8a61bd[groups={xxx.xx.xxxxxx.entity.timeseries.datapoints.queryresponse.Group#61540dbc[name=type, type=number]}, attributes=xxx.xx.xxxxxx.entity.util.map.Map#4b4eebd0[], values={{1487620485896,973956,3},{1487620454999,973806,3},{1487620424690,956617,3},{1487620397181,938677,3},{1487620368825,934494,3},{1487620339219,926125,3},{1487620309050,917753,3},{1487620279239,909384,3},{1487620251381,872864,3},{1487620222724,846518,3},{1487620196441,832150,3},{1487620168141,819563,3},{1487620142079,787264,3},{1487620115827,787264,3},{1487620091991,787264,3},{1487620067230,787264,3},{1487620042333,787264,3},{1487620018508,787264,3},{1487619994967,787264,3},{1487619973549,778740,3},{1487619950069,770205,3},{1487619926850,749106,3},{1487619902486,740729,3},{1487619877298,728184,3},{1487619851449,719666,3}}]}, stats=xxx.xx.xxxxxx.entity.timeseries.datapoints.queryresponse.Stats#5bb68fa5[rawCount=25]]}]
and the returned list:
[[1487620485896, 973956], [1487620454999, 973806], [1487620424690, 956617], [1487620397181, 938677], [1487620368825, 934494], [1487620339219, 926125], [1487620309050, 917753], [1487620279239, 909384], [1487620251381, 872864], [1487620222724, 846518], [1487620196441, 832150], [1487620168141, 819563], [1487620142079, 787264], [1487620115827, 787264], [1487620091991, 787264], [1487620067230, 787264], [1487620042333, 787264], [1487620018508, 787264], [1487619994967, 787264], [1487619973549, 778740], [1487619950069, 770205], [1487619926850, 749106], [1487619902486, 740729], [1487619877298, 728184]]
I can then take this and shove it into a JSON array (at least I think so! I haven't gotten that far). But this code seems ridiculous, brittle, and not the right way to go about this.
Does anyone have a better way of pulling datapoints out of a response and translating them into a JSON array, or at least a nested list?
Thank you to anyone who read this, and please let me know if I can provide any more information.
When we want just a few values from a query, the best way to retrieve them is to run the query with a ResultSet and use its powerful metadata:
ResultSet rs = stmt.executeQuery("SELECT a, b, c FROM TABLE2");
ResultSetMetaData rsmd = rs.getMetaData();
String name = rsmd.getColumnName(1);
Taken from here
So you take the columns you need by using the metadata properties, and then the best you can do is use a DTO object to store each row (check this to learn a bit more about DTOs).
So, basically, the idea is that you build an object from the data you've retrieved (or just the data you need at that moment from the database), and you can use the usual getters and setters to access all the fields.
However, when collecting data you're normally going to use loops, as you need to iterate over the ResultSet values, asking for the name of each column and keeping its value if it matches.
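As a rough illustration of that pattern - a sketch only, with made-up table and column names (ts, val) and a hypothetical DataPointDto - the loop below maps each ResultSet row into a small DTO and collects them into a list that can then be turned into a JSON array:
// Assumes the usual imports: java.sql.ResultSet, java.sql.Statement, java.util.ArrayList, java.util.List.

// Hypothetical DTO holding one datapoint: a timestamp and a value.
public class DataPointDto {
    private final long timestamp;
    private final long value;

    public DataPointDto(long timestamp, long value) {
        this.timestamp = timestamp;
        this.value = value;
    }

    public long getTimestamp() { return timestamp; }
    public long getValue() { return value; }
}

// In the data-access code (stmt is an existing java.sql.Statement):
List<DataPointDto> points = new ArrayList<>();
try (ResultSet rs = stmt.executeQuery("SELECT ts, val FROM datapoints")) { // hypothetical table/columns
    while (rs.next()) {
        points.add(new DataPointDto(rs.getLong("ts"), rs.getLong("val")));
    }
}
From there the list can be mapped to the [[timestamp, value], ...] shape the chart expects, or serialized with any JSON library.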
Hope it helps

Date insertion in Google sheet appending ' in cell

To write the data into the Google spreadsheet I am using the following code:
private static void writeValuesInSpreedSheet(Sheets service, String spreadsheetId, int sheetSize) throws IOException {
String range = "A"+(sheetSize+1)+":K"+(sheetSize+1);
List<List<Object>> newData = new ArrayList<>();
List<Object> rowValues = new ArrayList();
rowValues.add(getCurentDateInESTFormat());
rowValues.add("2");
rowValues.add("3");
rowValues.add("4");
rowValues.add("5");
rowValues.add("6");
rowValues.add("7");
rowValues.add("8");
rowValues.add("9");
rowValues.add("10");
rowValues.add("11");
/* List<Object> rowValues1 = new ArrayList();
rowValues1.add("1");
rowValues1.add("2");*/
newData.add(rowValues);
//newData.add(rowValues1);
// response.setValues(newData);
ValueRange oRange = new ValueRange();
oRange.setRange(range); // I NEED THE NUMBER OF THE LAST ROW
oRange.setValues(newData);
List<ValueRange> oList = new ArrayList<>();
oList.add(oRange);
BatchUpdateValuesRequest oRequest = new BatchUpdateValuesRequest();
oRequest.setValueInputOption("RAW");
oRequest.setData(oList);
BatchUpdateValuesResponse oResp1 = service.spreadsheets().values().batchUpdate(spreadsheetId, oRequest).execute();
System.out.println("Response Values " +oResp1.values());
}
private static Object getCurentDateInESTFormat() {
SimpleDateFormat sdfAmerica = new SimpleDateFormat("MM/dd/YYYY");
sdfAmerica.setTimeZone(TimeZone.getTimeZone("America/New_York"));
String sDateInAmerica = sdfAmerica.format(new Date());
return sDateInAmerica;
}
In the sheet we have defined the date and currency types for the respective columns.
I am able to write the data, but it ends up prepending ' to the data, for example - '09/04/2016
Because of this we are not able to treat it as a date. I have attached one screenshot as well.
We are using Google Sheets API V4.
I am asking this question because I did not find any link/solution related to it.
When using the ValueInputOption RAW you need to pass dates in the serial number format. The Apache POI library provides a getExcelDate method that will handle most of this for you, but you'll need to add a day to account for the difference in the epoch used.
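For illustration, a minimal sketch that computes the serial value directly, assuming the spreadsheet uses the default 1899-12-30 epoch and the America/New_York time zone already used in getCurentDateInESTFormat() (Apache POI's getExcelDate is the alternative described above):
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.temporal.ChronoUnit;
import java.util.Date;

private static double toSheetsSerialDate(Date date) {
    // Google Sheets serial dates count days (with a fractional part for the
    // time of day) since 1899-12-30 00:00 in the sheet's time zone (assumption).
    LocalDateTime epoch = LocalDateTime.of(1899, 12, 30, 0, 0);
    LocalDateTime local = LocalDateTime.ofInstant(date.toInstant(), ZoneId.of("America/New_York"));
    return ChronoUnit.SECONDS.between(epoch, local) / 86400.0;
}
With RAW input, rowValues.add(toSheetsSerialDate(new Date())) then writes a number that the date-formatted column renders as a date instead of a quoted string.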

Using ELKI with Mongodb

Using test cases I was able to see how ELKI can be used directly from Java, but now I want to read my data from MongoDB and then use ELKI to cluster geographic (long, lat) data.
So far I can only cluster data from a CSV file using ELKI. Is it possible to connect de.lmu.ifi.dbs.elki.database.Database with MongoDB? I can see from the Java debugger that there is a databaseconnection field in de.lmu.ifi.dbs.elki.database.Database.
I query MongoDB, creating a POJO for each row, and now I want to cluster these objects using ELKI.
It is possible to read the data from MongoDB, write it to a CSV file and then have ELKI read that CSV file, but I would like to know if there is a simpler solution.
---------FINDINGS_1:
From ELKI - Use List<String> of objects to populate the Database I found that I need to implement de.lmu.ifi.dbs.elki.datasource.DatabaseConnection and specifically override the loadData() method, which returns an instance of MultipleObjectsBundle.
So I think I should wrap a list of POJOs in a MultipleObjectsBundle. Now I'm looking at MultipleObjectsBundle and it looks like the data should be held in columns. Why is the columns datatype a list of lists? Shouldn't it be a single list, i.e. just a list of the items you want to cluster?
I'm a little confused. How is ELKI going to know that it should look at the long and lat of the POJO? Where do I tell ELKI to do this? Using de.lmu.ifi.dbs.elki.data.type.SimpleTypeInformation?
---------FINDINGS_2:
I have tried to use ArrayAdapterDatabaseConnection and I have tried implementing DatabaseConnection. Sorry, I need things explained in very simple terms for me to understand.
This is my code for clustering:
int minPts=3;
double eps=0.08;
double[][] data1 = {{-0.197574246, 51.49960695}, {-0.084605692, 51.52128377}, {-0.120973687, 51.53005939}, {-0.156876, 51.49313},
{-0.144228881, 51.51811784}, {-0.1680743, 51.53430039}, {-0.170134484,51.52834133}, { -0.096440751, 51.5073853},
{-0.092754157, 51.50597426}, {-0.122502346, 51.52395143}, {-0.136039674, 51.51991453}, {-0.123616824, 51.52994371},
{-0.127854211, 51.51772703}, {-0.125979294, 51.52635795}, {-0.109006325, 51.5216612}, {-0.12221963, 51.51477076}, {-0.131161087, 51.52505093} };
// ArrayAdapterDatabaseConnection dbcon = new ArrayAdapterDatabaseConnection(data1);
DatabaseConnection dbcon = new MyDBConnection();
ListParameterization params = new ListParameterization();
params.addParameter(de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN.Parameterizer.MINPTS_ID, minPts);
params.addParameter(de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN.Parameterizer.EPSILON_ID, eps);
params.addParameter(DBSCAN.DISTANCE_FUNCTION_ID, EuclideanDistanceFunction.class);
params.addParameter(AbstractDatabase.Parameterizer.DATABASE_CONNECTION_ID, dbcon);
params.addParameter(AbstractDatabase.Parameterizer.INDEX_ID,
RStarTreeFactory.class);
params.addParameter(RStarTreeFactory.Parameterizer.BULK_SPLIT_ID,
SortTileRecursiveBulkSplit.class);
params.addParameter(AbstractPageFileFactory.Parameterizer.PAGE_SIZE_ID, 1000);
Database db = ClassGenericsUtil.parameterizeOrAbort(StaticArrayDatabase.class, params);
db.initialize();
GeneralizedDBSCAN dbscan = ClassGenericsUtil.parameterizeOrAbort(GeneralizedDBSCAN.class, params);
Relation<DoubleVector> rel = db.getRelation(TypeUtil.DOUBLE_VECTOR_FIELD);
Relation<ExternalID> relID = db.getRelation(TypeUtil.EXTERNALID);
DBIDRange ids = (DBIDRange) rel.getDBIDs();
Clustering<Model> result = dbscan.run(db);
int i =0;
for(Cluster<Model> clu : result.getAllClusters()) {
System.out.println("#" + i + ": " + clu.getNameAutomatic());
System.out.println("Size: " + clu.size());
System.out.print("Objects: ");
for(DBIDIter it = clu.getIDs().iter(); it.valid(); it.advance()) {
DoubleVector v = rel.get(it);
ExternalID exID = relID.get(it);
System.out.print("DoubleVec: ["+v+"]");
System.out.print("ExID: ["+exID+"]");
final int offset = ids.getOffset(it);
System.out.print(" " + offset);
}
System.out.println();
++i;
}
The ArrayAdapterDatabaseConnection produces two clusters; I just had to play around with the value of epsilon. When I set epsilon=0.008, DBSCAN started creating clusters; when I set epsilon=0.04, all the items were in one cluster.
I have also tried to implement DatabaseConnection:
@Override
public MultipleObjectsBundle loadData() {
MultipleObjectsBundle bundle = new MultipleObjectsBundle();
List<Station> stations = getStations();
List<DoubleVector> vecs = new ArrayList<DoubleVector>();
List<ExternalID> ids = new ArrayList<ExternalID>();
for (Station s : stations){
String strID = Integer.toString(s.getId());
ExternalID i = new ExternalID(strID);
ids.add(i);
double[] st = {s.getLongitude(), s.getLatitude()};
DoubleVector dv = new DoubleVector(st);
vecs.add(dv);
}
SimpleTypeInformation<DoubleVector> type = new VectorFieldTypeInformation<>(DoubleVector.FACTORY, 2, 2, DoubleVector.FACTORY.getDefaultSerializer());
bundle.appendColumn(type, vecs);
bundle.appendColumn(TypeUtil.EXTERNALID, ids);
return bundle;
}
These long/lat values are associated with an ID, and I need to link them back to that ID. Is using the ID offset (as in the code above) the only way to do that? I have tried to add an ExternalID column, but I don't know how to retrieve the ExternalID for a particular NumberVector.
Also, after seeing Using ELKI's Distance Function, I tried to use ELKI's longLatDistance, but it doesn't work and I could not find any examples of how to use it.
The interface for data sources is called DatabaseConnection.
JavaDoc of DatabaseConnection
You can implement a MongoDB-based interface to get the data.
It is not a complicated interface; it has a single method.
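For illustration, here is a minimal sketch of a MongoDB-backed implementation, assuming the MongoDB Java driver 3.x and a hypothetical stations collection with lon/lat fields (the database, collection and field names are made up; the ELKI types are the same ones used in the loadData() shown in the question):
import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
// plus the ELKI imports already used in the question's loadData()

public class MongoDbConnection implements DatabaseConnection {

    @Override
    public MultipleObjectsBundle loadData() {
        List<DoubleVector> vecs = new ArrayList<>();
        List<ExternalID> ids = new ArrayList<>();
        try (MongoClient client = new MongoClient("localhost", 27017)) { // hypothetical host/port
            MongoCollection<Document> coll =
                    client.getDatabase("geo").getCollection("stations"); // hypothetical names
            for (Document doc : coll.find()) {
                ids.add(new ExternalID(doc.getObjectId("_id").toHexString()));
                double[] lonLat = { doc.getDouble("lon"), doc.getDouble("lat") };
                vecs.add(new DoubleVector(lonLat));
            }
        }
        MultipleObjectsBundle bundle = new MultipleObjectsBundle();
        SimpleTypeInformation<DoubleVector> type = new VectorFieldTypeInformation<>(
                DoubleVector.FACTORY, 2, 2, DoubleVector.FACTORY.getDefaultSerializer());
        bundle.appendColumn(type, vecs);
        bundle.appendColumn(TypeUtil.EXTERNALID, ids);
        return bundle;
    }
}
The resulting bundle can then be fed to a StaticArrayDatabase exactly as with the MyDBConnection used in the clustering code above.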

MongoDB - group by - aggregation - java

I have a doc in my MongoDB that looks like this:
public class AppCheckInRequest {
private String _id;
private String uuid;
private Date checkInDate;
private Double lat;
private Double lon;
private Double altitude;
}
The database will contain multiple documents with the same uuid but different checkInDates
Problem
I would like to run a MongoDB query using Java that gives me one AppCheckInRequest doc (all fields) per uuid whose checkInDate is closest to the current time.
I believe I have to use the aggregation framework, but I can't figure out how to get the results I need. Thanks.
In the mongo shell:
This will give you the whole groupings:
db.items.aggregate({$group : {_id : "$uuid" , value : { $push : "$somevalue"}}} )
And using $first instead of $push will only keep one from each group (which is what you want, I think?):
db.items.aggregate({$group : {_id : "$uuid" , value : { $first : "$somevalue"}}} )
Can you translate this to the Java API? Or I'll try to add that too.
... ok, here's some Java:
Assuming the docs in my collection are {_id : "someid", name: "somename", value: "some value"}
then this code shows them grouped by name:
Mongo client = new Mongo("127.0.0.1");
DBCollection col = client.getDB("ajs").getCollection("items");
AggregationOutput agout = col.aggregate(
new BasicDBObject("$group",
new BasicDBObject("_id", "$name").append("value", new BasicDBObject("$push", "$value"))));
Iterator<DBObject> results = agout.results().iterator();
while(results.hasNext()) {
DBObject obj = results.next();
System.out.println(obj.get("_id")+" "+obj.get("value"));
}
and if you change $push to $first, you'll only get 1 per group. You can then add the rest of the fields once you get this query working.
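Since $first simply takes whatever document comes first in each group, getting the check-in closest to the current time per uuid also needs a $sort stage before the $group. Here is a hedged sketch with the same legacy driver API used above - the field names come from AppCheckInRequest, the collection name is an assumption, and it presumes checkInDate values lie in the past so that "closest to now" means "most recent":
DBCollection checkIns = client.getDB("ajs").getCollection("appCheckInRequests"); // hypothetical names
AggregationOutput out = checkIns.aggregate(
    // Newest check-in first within each uuid...
    new BasicDBObject("$sort", new BasicDBObject("checkInDate", -1)),
    // ...then keep the first (i.e. most recent) document's fields per uuid.
    new BasicDBObject("$group",
        new BasicDBObject("_id", "$uuid")
            .append("checkInDate", new BasicDBObject("$first", "$checkInDate"))
            .append("lat", new BasicDBObject("$first", "$lat"))
            .append("lon", new BasicDBObject("$first", "$lon"))
            .append("altitude", new BasicDBObject("$first", "$altitude"))));
for (DBObject obj : out.results()) {
    System.out.println(obj);
}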
