I have 100,000 documents successfully stored in a table. Currently, I'm using just a primary key (not a hash/sort combo). There's no good way to split these into useful partitions for reads, because customers will primarily pull the entire database at first and then pull only whatever items have been updated since. Additionally, I would like to return results in a paginated fashion using an offset/limit method.
I'm wondering what the best way to do this is. An example item that I have stored in the table is like (id is the primary key):
{
  "id": 11299,
  "name": "plugin1",
  "attributes": {
    "plugin_version": "1.30",
    "exploit_available": false,
    "in_the_news": false,
    "exploited_by_malware": false,
    "exploited_by_nessus": false,
    "risk_factor": "Medium",
    "plugin_type": "remote",
    "exploitability_ease": "No known exploits are available",
    "plugin_publication_date": "2003-03-01T00:00:00Z",
    "plugin_modification_date": "2018-07-16T00:00:00Z",
    "vuln_publication_date": "2003-01-23T00:00:00Z",
    "cvss_temporal_vector": {
      "raw": "E:U/RL:OF/RC:C",
      "ReportConfidence": "Confirmed",
      "Exploitability": "Unproven",
      "RemediationLevel": "Official-fix"
    }
  }
}
I also need to filter on plugin_modification_date, so I'm not sure if it would be helpful to make that a sort key. What's been frustrating while investigating this is that everything seems to rely on the partition key somehow, which is basically useless when you have a solitary primary key that is unique for every item.
I have a DynamoDB table that holds items with the following structure:
{
  "applicationName": { // PARTITION KEY
    "S": "FOO_APP"
  },
  "requestId": { // SORT KEY
    "S": "zzz/yyy/xxx/58C28B6"
  },
  "creationTime": {
    "N": "1636332219136"
  },
  "requestStatus": {
    "S": "DENIED"
  },
  "resolver": {
    "S": "SOMEONE"
  }
}
In DynamoDB, can I query this table to list all items that match the provided values for applicationName, requestStatus, and resolver?
In other words, how can I list all items that match:
applicationName = 'FOO_APP',
requestStatus = 'DENIED', and
resolver = 'SOMEONE'
With this table design, do I need GSIs? Can I do a Query or would it be a Scan?
What is the most cost-effective, efficient way of accomplishing this task?
I'm using Java's DynamoDBMapper.
You can add another attribute that combines the values you're querying for, like this:
GSI1PK: <applicationName>#<requestStatus>#<resolver>
Then you define a Global Secondary Index (GSI1) with the Partition Key as GSI1PK and the sort key like your current sort key requestId.
Whenever you want to find all requests that match these three conditions, you build the composite key value and query the global secondary index:
Query #GSI1
Partition Key = FOO_APP#DENIED#SOMEONE
That will yield all requests that match the combination of criteria. This kind of denormalization is common in NoSQL databases like DynamoDB.
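A hedged DynamoDBMapper sketch of that pattern; the GSI1PK attribute and GSI1 index names follow the answer, while the Request entity class, its accessors, and an initialized mapper are assumptions:

// Write side: populate the composite attribute whenever the item is saved.
request.setGsi1pk(request.getApplicationName() + "#" + request.getRequestStatus() + "#" + request.getResolver());
mapper.save(request);

// Read side: query GSI1 with the composite value as the partition key.
Map<String, AttributeValue> values = Collections.singletonMap(
        ":pk", new AttributeValue().withS("FOO_APP#DENIED#SOMEONE"));
DynamoDBQueryExpression<Request> query = new DynamoDBQueryExpression<Request>()
        .withIndexName("GSI1")
        .withConsistentRead(false) // GSIs only support eventually consistent reads
        .withKeyConditionExpression("GSI1PK = :pk")
        .withExpressionAttributeValues(values);
List<Request> matches = mapper.query(Request.class, query);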
You won't be able to query this schema as-is, because your sort key, requestId, is not part of your criteria, so the query will fail. For a better schema design, choose a sort key that helps narrow down the result set obtained by querying on the partition key alone.
As a solution, you can create a new index as follows:
applicationName -> Partition Key
requestStatus -> Sort Key
resolver
Then you can query with a key condition expression on applicationName and requestStatus, plus a filter expression on resolver, as sketched below.
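A minimal sketch of that query with DynamoDBMapper (which the question uses); the index name appStatusIndex and the Request entity class are assumptions, with the usual com.amazonaws.services.dynamodbv2 imports and an initialized mapper presumed:

Map<String, AttributeValue> values = new HashMap<String, AttributeValue>();
values.put(":app", new AttributeValue().withS("FOO_APP"));
values.put(":status", new AttributeValue().withS("DENIED"));
values.put(":resolver", new AttributeValue().withS("SOMEONE"));

DynamoDBQueryExpression<Request> query = new DynamoDBQueryExpression<Request>()
        .withIndexName("appStatusIndex") // hypothetical GSI: applicationName (hash), requestStatus (range)
        .withConsistentRead(false)
        .withKeyConditionExpression("applicationName = :app and requestStatus = :status")
        .withFilterExpression("resolver = :resolver") // applied after the key condition, so it still consumes read capacity
        .withExpressionAttributeValues(values);

List<Request> matches = mapper.query(Request.class, query);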
I'm using Logstash to input data from my database to Elasticsearch.
For a specific SQL query, I have one column that retrieves values as a CSV, like "role1;role2;role3".
This column is being indexed as a regular string in Elastic.
The problem:
I need to make an Elastic query on that field based on another list of values.
For example: on the Java side I have a collection with the values "role3", "role4" and "role5", and based on that I should get all the records in Elastic that match "role3", "role4" or "role5".
In this specific case, my elastic data is like this:
"_source": {
"userName": "user1",
"roles": "role1;role2;role3"
}
"_source": {
"userName": "user2",
"roles": "role7;role8;role9"
}
In this case it should return the record for "user1", as it gets a match for the "role3".
Question:
What is the best way to do that ?
I can make a query using something like the LIKE operator for all items of my Java list:
// javaList collection has 3 items: "role3", "role4" and "role5"
for (String role : javaList) {
    BoolQueryBuilder query = QueryBuilders.boolQuery();
    query.should(QueryBuilders.wildcardQuery("roles", "*" + role + "*"));
    SearchResponse response = client.prepareSearch(indexName).setQuery(query).setTypes(type).execute().actionGet();
    SearchHits hits = response.getHits();
}
And then iterate over each hit, but this sounds like a very bad approach, because the javaList can have more than 20 items, which would mean 20 queries to Elasticsearch.
I need a way to tell this to Elastic:
This is my list of roles, query internally and retrieve
only the records that match at least one of those roles.
In order to do that, I understand I can't index that data as a plain string, right? Ideally it would be an array or something like it...
How can I do that in the most performant way?
You should definitely not use a wildcard query in a loop; that approach will perform poorly.
Since roles is a regular text field, Elasticsearch splits the value "role1;role2;role3" into the individual tokens "role1", "role2" and "role3". The same analysis is applied to the search query, so you can use a simple match query with the query string "role3;role4;role5" and get a hit thanks to the "role3" token match.
You can also index the roles field as an array of strings, and the same match query will still work.
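A minimal sketch using the same transport-client style as the question; indexName and type are assumed to be defined as in the question's snippet:

// One match query instead of N wildcard queries: the standard analyzer
// tokenizes "role3;role4;role5" on the semicolons, so any overlapping
// token (here "role3") produces a hit.
QueryBuilder query = QueryBuilders.matchQuery("roles", "role3;role4;role5");
SearchResponse response = client.prepareSearch(indexName)
        .setQuery(query)
        .setTypes(type)
        .execute().actionGet();
for (SearchHit hit : response.getHits()) {
    System.out.println(hit.getSourceAsString()); // e.g. user1's record
}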
I'm using KairosDB as my primary db. Now I want to integrate Elasticsearch functionality with my data inside KairosDB. As stated in the docs, I have to duplicate all entries of my primary db inside the Elasticsearch database.
Update
What I mean is that if I want to index something inside Elasticsearch, I have to, for example:
Retrieve data from KairosDB, for example the JSON {"name": "hi","value": "6","tags"}
and then put it inside Elasticsearch:
curl -XPUT 'http://localhost:9200/firstIndex/test/1' -d '{"name": "hi","value": "6","tags"}'
If I want to search I have to do this:
curl 'http://localhost:9200/_search?q=name:hi&pretty=true'
I'm wondering if it is possible to avoid duplicating my data inside Elasticsearch, in a way that lets me:
get data from KairosDB
index it using Elasticsearch without duplicating the data.
How can I go about that?
It sounds like you're hoping to use Elasticsearch as a secondary (and external) fulltext index for your primary datastore (KairosDB).
Since KairosDB remains your primary datastore, each record you load into Elasticsearch needs two pieces of information (at minimum):
The primary key field(s) for locating the corresponding KairosDB record(s). In the mapping, make sure to set "store": true, "index": "not_analyzed"
Any fields which you wish to be searchable (in your example, only name is searched) "store": false, "index": "analyzed"
If you want to reduce your index size further, consider disabling the _source field
Then your search workflow becomes a two-step process:
Query Elasticsearch for name:hi and retrieve the KairosDB primary key field(s) for each of the matching record(s).
Query/return KairosDB time-series data using key fields returned from Elasticsearch.
To be clear: you don't need an exact duplicate of each KairosDB record loaded into Elasticsearch, just the searchable fields, along with a means to locate the original record in KairosDB.
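A hedged sketch of the two-step lookup, reusing the transport-client style from the earlier Elasticsearch example; the index name, the kairos_key field, and the fetchFromKairosDb helper are all hypothetical:

// Step 1: full-text search in Elasticsearch, which stores only the
// searchable fields plus the KairosDB key.
SearchResponse response = client.prepareSearch("kairosdb-index")
        .setQuery(QueryBuilders.matchQuery("name", "hi"))
        .execute().actionGet();

// Step 2: resolve each hit back to the full record in KairosDB.
for (SearchHit hit : response.getHits()) {
    String kairosKey = (String) hit.getSource().get("kairos_key"); // hypothetical stored key field
    fetchFromKairosDb(kairosKey); // hypothetical helper that queries KairosDB's REST API
}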
I'm trying to query a DynamoDB table using a global secondary index, and I'm getting java.lang.IllegalArgumentException: Illegal query expression: No hash key condition is found in the query. All I'm trying to do is get all items that have a timestamp greater than a value, without considering the key. The timestamp is not part of a hash or range key, so I created a global secondary index for it.
Does anyone have a clue what I might be missing?
Table Definition:
{
  AttributeDefinitions: [
    {
      AttributeName: timestamp,
      AttributeType: N
    },
    {
      AttributeName: url,
      AttributeType: S
    }
  ],
  TableName: SitePageIndexed,
  KeySchema: [
    {
      AttributeName: url,
      KeyType: HASH
    }
  ],
  TableStatus: ACTIVE,
  CreationDateTime: Mon May 12 18:45:57 EDT 2014,
  ProvisionedThroughput: {
    NumberOfDecreasesToday: 0,
    ReadCapacityUnits: 8,
    WriteCapacityUnits: 4
  },
  TableSizeBytes: 0,
  ItemCount: 0,
  GlobalSecondaryIndexes: [
    {
      IndexName: TimestampIndex,
      KeySchema: [
        {
          AttributeName: timestamp,
          KeyType: HASH
        }
      ],
      Projection: {
        ProjectionType: ALL
      },
      IndexStatus: ACTIVE,
      ProvisionedThroughput: {
        NumberOfDecreasesToday: 0,
        ReadCapacityUnits: 8,
        WriteCapacityUnits: 4
      },
      IndexSizeBytes: 0,
      ItemCount: 0
    }
  ]
}
Code
Condition condition1 = new Condition()
        .withComparisonOperator(ComparisonOperator.GE)
        .withAttributeValueList(new AttributeValue().withN(Long.toString(start)));
DynamoDBQueryExpression<SitePageIndexed> exp = new DynamoDBQueryExpression<SitePageIndexed>()
        .withRangeKeyCondition("timestamp", condition1); // "timestamp" is actually the GSI's hash key, which is what triggers the exception
exp.setScanIndexForward(true);
exp.setLimit(100);
exp.setIndexName("TimestampIndex");
PaginatedQueryList<SitePageIndexed> queryList = client.query(SitePageIndexed.class, exp);
All I'm trying to do is to get all items that have a timestamp greater than a value without considering the key.
This is not how Global Secondary Indexes (GSI) on Amazon DynamoDB work. To query a GSI you must specify a value for its hash key and then you may filter/sort by the range key -- just like you'd do with the primary key. This is exactly what the exception is trying to tell you, and also what you will find on the documentation page for the Query API:
A Query operation directly accesses items from a table using the table primary key, or from an index using the index key. You must provide a specific hash key value.
Think of a GSI as just another key that behaves almost exactly like the primary key (the main differences being that it is updated asynchronously, and you can only perform eventually consistent reads on GSIs).
Please refer to the Amazon DynamoDB Global Secondary Index documentation page for guidelines and best practices when creating GSIs: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html
One possible way to achieve what you want is to add a dummy attribute constrained to a finite, small set of possible values, and create a GSI with its hash key on that dummy attribute and its range key on your timestamp. When querying, you would issue one Query API call for each possible value of the dummy attribute and consolidate the results in your application. By constraining the dummy attribute to a singleton (i.e., a set with a single element, i.e., a constant value), you can send only one Query API call and get your result dataset directly, but keep in mind that this creates a hot partition and you might have performance issues! Again, refer to the document linked above to learn the best practices and some patterns; a sketch of the scatter-gather variant follows.
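A hedged DynamoDBMapper sketch of the dummy-attribute pattern; the shard attribute, its value range, the ShardTimestampIndex name, and an initialized mapper are all assumptions:

// Assume each item carries a numeric "shard" attribute in 0..3 and a GSI
// "ShardTimestampIndex" with hash key "shard" and range key "timestamp".
List<SitePageIndexed> results = new ArrayList<SitePageIndexed>();
for (int shard = 0; shard < 4; shard++) {
    Map<String, AttributeValue> values = new HashMap<String, AttributeValue>();
    values.put(":shard", new AttributeValue().withN(Integer.toString(shard)));
    values.put(":start", new AttributeValue().withN(Long.toString(start)));

    DynamoDBQueryExpression<SitePageIndexed> query = new DynamoDBQueryExpression<SitePageIndexed>()
            .withIndexName("ShardTimestampIndex")
            .withConsistentRead(false) // GSIs only support eventually consistent reads
            .withKeyConditionExpression("shard = :shard and #ts >= :start")
            .withExpressionAttributeNames(Collections.singletonMap("#ts", "timestamp")) // "timestamp" is a DynamoDB reserved word
            .withExpressionAttributeValues(values);

    results.addAll(mapper.query(SitePageIndexed.class, query)); // consolidate in the application
}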
It is possible to query DynamoDB with only the GSI; this can be confirmed in the web interface under Query/Index.
Programmatically, it is done as follows:
DynamoDB dynamoDB = new DynamoDB(new AmazonDynamoDBClient(
new ProfileCredentialsProvider()));
Table table = dynamoDB.getTable("WeatherData");
Index index = table.getIndex("PrecipIndex");
QuerySpec spec = new QuerySpec()
.withKeyConditionExpression("#d = :v_date and Precipitation = :v_precip")
.withNameMap(new NameMap()
.with("#d", "Date"))
.withValueMap(new ValueMap()
.withString(":v_date","2013-08-10")
.withNumber(":v_precip",0));
ItemCollection<QueryOutcome> items = index.query(spec);
Iterator<Item> iter = items.iterator();
while (iter.hasNext()) {
System.out.println(iter.next().toJSONPretty());
}
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSIJavaDocumentAPI.html#GSIJavaDocumentAPI.QueryAnIndex
For doing it with DynamoDBMapper see: How to query a Dynamo DB having a GSI with only hashKeys using DynamoDBMapper
Here is how you can query in Java with only a GSI:
Map<String, AttributeValue> eav = new HashMap<String, AttributeValue>();
eav.put(":val1", new AttributeValue().withS("PROCESSED"));

DynamoDBQueryExpression<Package> queryExpression = new DynamoDBQueryExpression<Package>()
        .withIndexName("<your global secondary index name>")
        .withKeyConditionExpression("your_gsi_column_name = :val1")
        .withExpressionAttributeValues(eav)
        .withConsistentRead(false)
        .withLimit(2);

QueryResultPage<Package> scanPage = dbMapper.queryPage(Package.class, queryExpression);
While this is not the correct answer per se, could you possibly accomplish this with a scan instead of a query? It's much more expensive, but it could be a solution, as sketched below.
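A hedged sketch of that scan with DynamoDBMapper, filtering on the timestamp attribute; the mapper variable and the start value are assumed as in the question:

// Note: the filter runs after items are read, so the scan still consumes
// read capacity for every item in the table.
DynamoDBScanExpression scan = new DynamoDBScanExpression()
        .withFilterExpression("#ts >= :start")
        .withExpressionAttributeNames(Collections.singletonMap("#ts", "timestamp"))
        .withExpressionAttributeValues(Collections.singletonMap(
                ":start", new AttributeValue().withN(Long.toString(start))));

List<SitePageIndexed> matches = mapper.scan(SitePageIndexed.class, scan);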
I'm just getting started with MongoDb and I've noticed that I get a lot of duplicate records for entries that I meant to be unique. I would like to know how to use a composite key for my data and I'm looking for information on how to create them. Lastly, I am using Java to access mongo and morphia as my ORM layer so including those in your answers would be awesome.
Morphia: http://code.google.com/p/morphia/
You can use objects for the _id field as well. The _id field is always unique. That way you kind of get a composite primary key:
{ _id : { a : 1, b: 1} }
Just be careful when creating these ids: the order of the keys (a and b in the example) matters; if you swap them around, it is considered a different object.
The other possibility is to leave _id alone and create a unique compound index.
db.things.ensureIndex({firstname: 1, lastname: 1}, {unique: true});
//Deprecated since version 3.0.0, is now an alias for db.things.createIndex()
https://docs.mongodb.org/v3.0/reference/method/db.collection.ensureIndex/
You can create unique indexes on the fields of the document that you'd want to test uniqueness on. They can be composite as well (called compound indexes in MongoDB land), as you can see from the documentation. Morphia has an @Indexed annotation to support indexing at the field level. In addition, with Morphia you can define compound keys at the class level with the @Indexes annotation, as sketched below.
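A hedged sketch of a class-level compound unique index in classic Morphia (the entity and field names are assumptions):

@Entity
@Indexes({ @Index(value = "firstname, lastname", unique = true) })
public class Person {
    @Id
    private ObjectId id;       // regular generated _id
    private String firstname;  // uniqueness is enforced on the
    private String lastname;   // (firstname, lastname) pair instead
}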
I just noticed that the question is marked as "java", so you'd want to do something like:
final BasicDBObject id = new BasicDBObject("a", aVal)
.append("b", bVal)
.append("c", cVal);
results = coll.find(new BasicDBObject("_id", id));
I use Morphia too, but have found that (while it works) it generates lots of errors as it tries to marshal the composite key. I use the above when querying to avoid those errors.
My original code (which also works):
final ProbId key = new ProbId(srcText, srcLang, destLang);
final QueryImpl<Probabilities> query = ds.createQuery(Probabilities.class)
.field("id").equal(key);
Probabilities probs = (Probabilities) query.get();
My ProbId class is annotated as #Entity(noClassnameStored = true) and inside the Probabilities class, the id field is #Id ProbId id;
I will try to explain with an example:
Create a table Music
Add Artist as the partition key.
Since an artist may have many songs, we have to pick a sort key, say SongTitle.
The combination of both is a composite key.
Meaning, Artist + SongTitle will be unique.
Something like this:
{
  "Artist": {"S": "David Bowie"},
  "SongTitle": {"S": "changes"},
  "AlbumTitle": {"S": "Hunky"},
  "Genre": {"S": "Rock"}
}
The Artist key above is the partition key; the SongTitle key is the sort key.
The combination of the two is always unique (or should be); the rest are attributes which may vary per record.
Once you have this data structure in place, you can easily append items and run your custom queries, as sketched below.
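A hedged sketch of querying this table with the DynamoDB Document API, in the same style as the QuerySpec example earlier; the dynamoDB client setup is assumed:

// Fetch songs by one artist; the sort key also lets us narrow by title prefix.
Table music = dynamoDB.getTable("Music");
QuerySpec spec = new QuerySpec()
        .withKeyConditionExpression("Artist = :a and begins_with(SongTitle, :t)")
        .withValueMap(new ValueMap()
                .withString(":a", "David Bowie")
                .withString(":t", "ch"));
for (Item item : music.query(spec)) {
    System.out.println(item.toJSONPretty());
}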
Sample mongo shell commands for reference (these take JSON documents, not file paths):
db.products.insert({ "name": "example" })   // insert a document
db.collection.drop()                        // drop a collection (takes no arguments)
db.users.find({ "name": "example" })        // query with a filter document