Cassandra: Java equivalent method to `sstabledump`? - java

I'm writing a test for a Java program which outputs Cassandra SSTable files (e.g. mc-1-big-Data.db). What I have is a correct output file (correct.db). To check whether the program is correct, I need to dump the two files and compare every field except "liveness_info" (which means I cannot directly compare mc-1-big-Data.db with correct.db byte for byte).
I know I can use sstabledump to do this, but that command runs in a shell, and I shouldn't call it directly from Java since this is supposed to be a unit test. Therefore I'd like an equivalent method in Java, but after searching for a long time I still haven't found any. Could anyone give some suggestions? Thanks!
[update]
I've tried the method @JimWartnick mentions. When I used SSTableExport.main(...), I got a java.lang.ExceptionInInitializerError caused by org.apache.cassandra.exceptions.ConfigurationException: Expecting URI in variable: [cassandra.config]. Found[cassandra.yaml]. Please prefix the file with [file:///] for local files and [file://<server>/] for remote files.
This seems to require me to set up a Cassandra server, but I suspect I cannot do that in a unit test. Any suggestions? Thanks!
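For what it's worth, the ConfigurationException only complains about the format of the cassandra.config system property, not about a missing server: the offline tools read a cassandra.yaml, and the property value must be a file: URI. A minimal sketch of the workaround suggested by the error message itself (the cassandra.yaml path and the commented SSTableExport call are assumptions; SSTableExport lives in the cassandra-all artifact):

```java
import java.nio.file.Paths;

public class SSTableDumpHelper {
    public static void main(String[] args) {
        // The exception asks for a URI, so turn the local cassandra.yaml
        // into a file:/// URI before any Cassandra class is initialized.
        String configUri = Paths.get("cassandra.yaml").toAbsolutePath().toUri().toString();
        System.setProperty("cassandra.config", configUri);
        System.out.println(configUri);

        // With the property set, the offline tool class can then be invoked
        // in-process (requires the cassandra-all jar on the test classpath):
        // org.apache.cassandra.tools.SSTableExport.main(new String[] { "mc-1-big-Data.db" });
    }
}
```

The property must be set before the first Cassandra class is loaded, since the config is read in a static initializer (hence the ExceptionInInitializerError).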
[example dump]
Just in case it's helpful, I attach the example dump below.
[
{
"partition" : {
"key" : [ "1" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 33,
"liveness_info" : { "tstamp" : "2019-08-15T23:52:11.715Z" },
"cells" : [
{ "name" : "name", "value" : "Amy" }
]
}
]
},
{
"partition" : {
"key" : [ "2" ],
"position" : 34
},
"rows" : [
{
"type" : "row",
"position" : 67,
"liveness_info" : { "tstamp" : "2019-08-15T23:52:11.738Z" },
"cells" : [
{ "name" : "name", "value" : "Bob" }
]
}
]
},
{
"partition" : {
"key" : [ "4" ],
"position" : 68
},
"rows" : [
{
"type" : "row",
"position" : 103,
"liveness_info" : { "tstamp" : "2019-08-15T23:52:11.738Z" },
"cells" : [
{ "name" : "name", "value" : "Caleb" }
]
}
]
},
{
"partition" : {
"key" : [ "3" ],
"position" : 104
},
"rows" : [
{
"type" : "row",
"position" : 133,
"liveness_info" : { "tstamp" : "2019-08-15T23:52:11.738Z" },
"cells" : [
{ "name" : "name", "value" : "" }
]
}
]
}
]

Related

JsonPath Jayway Java implementation cannot read values

I want to read objects from the main array filtering by refPath using JsonPath Jayway Java implementation.
My input looks like this:
[
{
"2be3d660-cab0-4db8-83b9-1baf212270c5" : {
"refPath" : [
"e0586818-ba2c-4b65-afec-3c48d817b584",
"06c089a6-4de0-43d3-8dc7-181addf4c933",
"d5413a18-ac33-426c-982d-bb25ce4e4bf6"
],
"elementId" : "12c5750e-9753-43f1-8987-9dfc3a830bbe",
"modified" : false
},
"191b1bab-c269-495f-ac4f-8b4d30df95a1" : {
"refPath" : [
"e0586818-ba2c-4b65-afec-3c48d817b584",
"f7df7cff-bf6d-49da-bc44-90d61f233d3b"
],
"elementId" : "04691514-566b-47ef-8f69-e31884bde7b2",
"modified" : false
},
"6a2acd79-135f-4688-9219-158f91d9c6cf" : {
"refPath" : [
"e0586818-ba2c-4b65-afec-3c48d817b584",
"f5177f79-e2f1-4419-b46a-7d4cc1c4fae5"
],
"elementId" : "04691514-566b-47ef-8f69-e31884bde7b2",
"modified" : false
}
}
]
and I want to find all objects containing these two refPath values: "e0586818-ba2c-4b65-afec-3c48d817b584" and "06c089a6-4de0-43d3-8dc7-181addf4c933"
So my expected result from JsonPath looks like:
[
{
"2be3d660-cab0-4db8-83b9-1baf212270c5" : {
"refPath" : [
"e0586818-ba2c-4b65-afec-3c48d817b584",
"06c089a6-4de0-43d3-8dc7-181addf4c933",
"d5413a18-ac33-426c-982d-bb25ce4e4bf6"
],
"elementId" : "12c5750e-9753-43f1-8987-9dfc3a830bbe",
"modified" : false
}
}
]
Even if I only try to find "e0586818-ba2c-4b65-afec-3c48d817b584", I get the error message "Could not determine value type".
Does anybody have an idea what the JsonPath expression for this should look like?
Use the subsetof filter operator.
$[*][*][?(['e0586818-ba2c-4b65-afec-3c48d817b584','06c089a6-4de0-43d3-8dc7-181addf4c933'] subsetof @.refPath)]
Note that the output will not include the wrapping key 2be3d660-cab0-4db8-83b9-1baf212270c5:
[
{
"refPath" : [
"e0586818-ba2c-4b65-afec-3c48d817b584",
"06c089a6-4de0-43d3-8dc7-181addf4c933",
"d5413a18-ac33-426c-982d-bb25ce4e4bf6"
],
"elementId" : "12c5750e-9753-43f1-8987-9dfc3a830bbe",
"modified" : false
}
]
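From Java, the same kind of expression can be evaluated with the Jayway JsonPath API. A minimal sketch using a shortened stand-in document (the abbreviated IDs like "e058" are placeholders for the full UUIDs in the question):

```java
import com.jayway.jsonpath.JsonPath;
import java.util.List;

public class RefPathFilterExample {
    public static void main(String[] args) {
        // Shortened stand-in for the input array from the question.
        String json = "[{"
            + "\"key1\":{\"refPath\":[\"e058\",\"06c0\",\"d541\"],\"modified\":false},"
            + "\"key2\":{\"refPath\":[\"e058\",\"f7df\"],\"modified\":false}"
            + "}]";
        // subsetof matches when the literal list on the left is a subset
        // of the refPath array of the inspected object.
        List<Object> hits = JsonPath.read(json,
            "$[*][*][?(['e058','06c0'] subsetof @.refPath)]");
        System.out.println(hits); // only key1's object carries both values
    }
}
```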

Group by two arrays and get concatenated values in MongoDB aggregation

I have a document in this format:
{
"_id" : ObjectId("59ce3bb32708c95ee2168e2f"),
"document1" : [
{
"value" : "doc1A"
},
{
"value" : "doc1B"
},
{
"value" : "doc1C"
},
{
"value" : "doc1D"
},
{
"value" : "doc1E"
},
{
"value" : "doc1F"
}
],
"document2" : [
{
"value" : "doc2A"
},
{
"value" : "doc2B"
},
{
"value" : "doc2C"
},
{
"value" : "doc2D"
}
],
"metric1" : 0.0,
"metric2" : 0.0
}
I need to group by the concatenation of the document1 and document2 values and perform some calculations on it with the aggregation framework in Java.
I can do group(document1, document2), but then the _id I get is an array. I want it as a concatenation instead, like:
doc1A (doc2A) / doc1A (doc2B) / doc1A (doc2C) ...
Do you have any idea?
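One possible shape for such a pipeline, sketched with the plain org.bson.Document API: $unwind both arrays, $project a $concat of the two values, then $group on that key. The field names come from the question; the " (" and ")" separator literals and the $sum accumulator are my assumptions, standing in for your actual calculations.

```java
import org.bson.Document;
import java.util.Arrays;
import java.util.List;

public class ConcatGroupPipeline {
    public static void main(String[] args) {
        List<Document> pipeline = Arrays.asList(
            // One row per (document1, document2) pair
            new Document("$unwind", "$document1"),
            new Document("$unwind", "$document2"),
            // Build the "doc1A (doc2A)" style key
            new Document("$project", new Document("pair",
                new Document("$concat", Arrays.asList(
                    "$document1.value", " (", "$document2.value", ")")))),
            // Group on the concatenated key; replace the accumulator
            // with whatever calculations you actually need
            new Document("$group", new Document("_id", "$pair")
                .append("count", new Document("$sum", 1)))
        );
        pipeline.forEach(stage -> System.out.println(stage.toJson()));
        // run with: collection.aggregate(pipeline)
    }
}
```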

Elasticsearch geo search strange behavior

A few days ago I ran into strange behavior of geo search in Elasticsearch.
I use AWS-managed ES 5.5, accessed over the REST interface.
Assume we have 200k objects with location info represented as a point only. I use geo search to find the points within multiple polygons; they are shown in the image below, and the coordinates were extracted from the final request to ES.
The request is built using the official Java high-level REST client; the resulting query is attached below.
I want to search for all objects within at least one polygon.
Here is the query (real field names and values were replaced by stubs, except location and locationPoint.coordinates):
{
"size" : 20,
"query" : {
"constant_score" : {
"filter" : {
"bool" : {
"must" : [
{
"terms" : {
"field1" : [
"a",
"b",
"c",
"d",
"e",
"f"
],
"boost" : 1.0
}
},
{
"term" : {
"field2" : {
"value" : "q",
"boost" : 1.0
}
}
},
{
"range" : {
"field3" : {
"from" : "10",
"to" : null,
"include_lower" : true,
"include_upper" : true,
"boost" : 1.0
}
}
},
{
"range" : {
"field4" : {
"from" : "10",
"to" : null,
"include_lower" : true,
"include_upper" : true,
"boost" : 1.0
}
}
},
{
"geo_shape" : {
"location" : {
"shape" : {
"type" : "geometrycollection",
"geometries" : [
{
"type" : "multipolygon",
"orientation" : "right",
"coordinates" : [
[
// coords here
]
]
},
{
"type" : "polygon",
"orientation" : "right",
"coordinates" : [
[
// coords here
]
]
},
{
"type" : "polygon",
"orientation" : "right",
"coordinates" : [
[
// coords here
]
]
},
{
"type" : "polygon",
"orientation" : "right",
"coordinates" : [
[
// coords here
]
]
}
]
},
"relation" : "intersects"
},
"ignore_unmapped" : false,
"boost" : 1.0
}
}
]
}
},
"boost" : 1.0
}
},
"_source" : {
"includes" : [
"field1",
"field2",
"field3",
"field4",
"field8"
],
"excludes" : [ ]
},
"sort" : [
{
"field1" : {
"order" : "desc"
}
}
],
"aggregations" : {
"agg1" : {
"terms" : {
"field" : "field1",
"size" : 10000,
"min_doc_count" : 1,
"shard_min_doc_count" : 0,
"show_term_doc_count_error" : false,
"order" : [
{
"_count" : "desc"
},
{
"_term" : "asc"
}
]
}
},
"agg2" : {
"terms" : {
"field" : "field2",
"size" : 10000,
"min_doc_count" : 1,
"shard_min_doc_count" : 0,
"show_term_doc_count_error" : false,
"order" : [
{
"_count" : "desc"
},
{
"_term" : "asc"
}
]
}
},
"agg3" : {
"terms" : {
"field" : "field3",
"size" : 10000,
"min_doc_count" : 1,
"shard_min_doc_count" : 0,
"show_term_doc_count_error" : false,
"order" : [
{
"_count" : "desc"
},
{
"_term" : "asc"
}
]
}
},
"agg4" : {
"terms" : {
"field" : "field4",
"size" : 10000,
"min_doc_count" : 1,
"shard_min_doc_count" : 0,
"show_term_doc_count_error" : false,
"order" : [
{
"_count" : "desc"
},
{
"_term" : "asc"
}
]
}
},
"agg5" : {
"terms" : {
"field" : "field5",
"size" : 10000,
"min_doc_count" : 1,
"shard_min_doc_count" : 0,
"show_term_doc_count_error" : false,
"order" : [
{
"_count" : "desc"
},
{
"_term" : "asc"
}
]
}
},
"agg6" : {
"terms" : {
"field" : "field6",
"size" : 10000,
"min_doc_count" : 1,
"shard_min_doc_count" : 0,
"show_term_doc_count_error" : false,
"order" : [
{
"_count" : "desc"
},
{
"_term" : "asc"
}
]
}
},
"agg7" : {
"terms" : {
"field" : "field7",
"size" : 10000,
"min_doc_count" : 1,
"shard_min_doc_count" : 0,
"show_term_doc_count_error" : false,
"order" : [
{
"_count" : "desc"
},
{
"_term" : "asc"
}
]
}
},
"agg8" : {
"terms" : {
"field" : "field8",
"size" : 10000,
"min_doc_count" : 1,
"shard_min_doc_count" : 0,
"show_term_doc_count_error" : false,
"order" : [
{
"_count" : "desc"
},
{
"_term" : "asc"
}
]
}
},
"map_center" : {
"geo_centroid" : {
"field" : "locationPoint.coordinates"
}
},
"map_bound" : {
"geo_bounds" : {
"field" : "locationPoint.coordinates",
"wrap_longitude" : true
}
}
}
}
Note that the field location is mapped as geo_shape and the field locationPoint.coordinates is mapped as geo_point.
The problem is as follows. The hit counts of several requests are shown below; only the set of polygons changes.
# Polygons Hits count
1) 1,2,3,4 5565
2) 1 4897
3) 3,4 75
4) 2 9
5) 1,3,4 5543
6) 1,2 5466
7) 2,3,4 84
So if I add the hits for polygon 1 to the hits for polygons 2, 3, 4, I do not get the number from the full request.
For example, #1 != #2 + #7, and #1 != #5 + #4, but #7 == #4 + #3.
I cannot tell whether this is an issue with the request, expected behavior, or even a bug in ES.
Can anyone help me understand the logic of this ES behavior or point me to a solution?
Thanks!
After a short conversation with an Elasticsearch team member, we narrowed it down to AWS.
The build hashes of AWS ES and vanilla ES are not equal, so ES has been modified by the AWS team and we do not know the exact changes; some of them might affect the search in the posted question.
We need to reproduce this behavior on a vanilla ES cluster before we can continue the conversation.

Why do I get no Kafka metrics with jmxtrans, even though I can get JVM heap info?

I use kafka_2.11-0.9.0.1 and have tried two versions of the JSON config file. I can get JVM info like heap memory and GC.
But when I try to get Kafka metrics, nothing comes out, and the jmxtrans log shows nothing.
Here is the first of the two JSON files I use:
{
"servers" : [ {
"port" : "9999",
"host" : "localhost",
"queries" : [ {
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.StdOutWriter",
"settings" : {
}
} ],
"obj" : "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=test",
"attr" : [ "Count"]
},{
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.StdOutWriter",
"settings" : {
}
} ],
"obj" : "kafka.server:type=BrokerTopicMetrics,name=*",
"resultAlias": "Kafka",
"attr" : [ "Count","OneMinuteRate"]
}
],
"numQueryThreads" : 2
} ]
}
The other is:
{
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.KeyOutWriter",
"settings" : {
"outputFile" : "testowo-counts3.txt",
"maxLogFileSize" : "10MB",
"maxLogBackupFiles" : 200,
"delimiter" : "\t",
"debug" : true
}
} ],
"obj": "\"kafka.network\":type=\"RequestMetrics\",name=\"Produce-RequestsPerSec\"",
"resultAlias": "produce",
"attr": [
"Count",
"OneMinuteRate"
]
} ,{
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.KeyOutWriter",
"settings" : {
"outputFile" : "testowo-gc.txt",
"maxLogFileSize" : "10MB",
"maxLogBackupFiles" : 200,
"delimiter" : "\t",
"debug" : true
}
} ],
"obj": "java.lang:type=GarbageCollector,name=*",
"resultAlias": "GC",
"attr": [
"CollectionCount",
"CollectionTime"
]
}
This is a version problem: the quoted MBean names in the second file are the old (pre-0.8.2) Kafka naming scheme, which does not exist on 0.9 brokers. I recommend using JConsole to browse the MBeans tree; it helps a lot.
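To illustrate the naming change: on Kafka 0.8.2+ the quoted bean "kafka.network":type="RequestMetrics",name="Produce-RequestsPerSec" became an unquoted name with a request tag, so the second query would look roughly like the fragment below (verify the exact bean name in JConsole first, since it depends on your broker version):

```json
{
  "obj" : "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=Produce",
  "resultAlias" : "produce",
  "attr" : [ "Count", "OneMinuteRate" ]
}
```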

How to retrieve a document by its subdocument or array?

I have such structure of document:
{
"_id" : "4e76fd1e927e1c9127d1d2e8",
"name" : "***",
"embedPhoneList" : [
{
"type" : "家庭",
"number" : "00000000000"
},
{
"type" : "手机",
"number" : "00000000000"
}
],
"embedAddrList" : [
{
"type" : "家庭",
"addr" : "山东省诸城市***"
},
{
"type" : "工作",
"addr" : "深圳市南山区***"
}
],
"embedEmailList" : [
{
"email" : "********@gmail.com"
},
{
"email" : "********@gmail.com"
},
{
"email" : "********@gmail.com"
},
{
"email" : "********@gmail.com"
}
]
}
What I want to do is find the document by its subdocument, such as an email in the embedEmailList field.
Or, if I have a structure like this:
{
"_id" : "4e76fd1e927e1c9127d1d2e8",
"name" : "***",
"embedEmailList" : [
"123@gmail.com",
"********@gmail.com"
]
}
the embedEmailList is an array; how do I find whether it contains 123@gmail.com?
Thanks.
To search for a specific value in an array, mongodb supports this syntax:
db.your_collection.find({embedEmailList : "foo@bar.com"});
See here for more information.
To search for a value in an embedded object, it supports this syntax:
db.your_collection.find({"embedEmailList.email" : "foo@bar.com"});
See here for more information.
