Using com.netflix.astyanax, I add entries for a given row as follows:
final ColumnListMutation<String> columnList = m.withRow(columnFamily, key);
columnList.putEmptyColumn(columnName);
Later I retrieve all my columns with:
final OperationResult<ColumnList<String>> operationResult = keyspace
.prepareQuery(columnFamily).getKey(key).execute();
operationResult.getResult().getColumnNames();
The code above correctly returns all the columns I have added, but the columns are not ordered according to when they were inserted into the database. Since each column has a timestamp associated with it, there ought to be a way to do exactly this, but I don't see it. Is there?
Note: If there isn't, I can always change the code above to:
columnList.putColumn(ip,new Date());
and then retrieve the column values, order them accordingly, but that seems cumbersome, inefficient, and silly since each column already has a timestamp.
I know from PlayOrm that if you do column slices, it returns them in order. In fact, PlayOrm uses that to enable S-SQL in partitions: it basically batches the column slicing, which comes back in order or reverse order depending on how it is requested. You may want to do a column slice from 0 to Long.MAX_VALUE.
I am not sure about getting the row, though. I haven't tried that.
Oh, and PlayOrm is just a mapping layer on top of Astyanax, though not really relational and more noSQL'ish really, as demonstrated by its patterns page:
http://buffalosw.com/wiki/Patterns-Page/
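For reference, a column-slice read in raw Astyanax might look roughly like the sketch below; it reuses the keyspace, columnFamily, and key from the question, and the limit is arbitrary. Note that a slice comes back in comparator order, not insertion order.
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.util.RangeBuilder;

// Read a slice of the row's columns; they come back in comparator order,
// or reverse comparator order when setReversed(true) is used.
final ColumnList<String> columns = keyspace
        .prepareQuery(columnFamily)
        .getKey(key)
        .withColumnRange(new RangeBuilder()
                .setReversed(false)  // true to get the "highest" columns first
                .setLimit(1000)      // arbitrary page size
                .build())
        .execute()
        .getResult();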
Cassandra will never order your columns in "insertion order".
Columns are always ordered lowest first. It also depends on how Cassandra interprets your column names; you define that interpretation with the comparator you set when defining your column family.
From what you gave, it looks like you use String timestamp values. If you simply serialize your timestamps as e.g. "123141" and "231", be aware that with a UTF8Type comparator "231" > "123141".
Better approach: Use Time-based UUIDs as column names, as many examples for Time-series data in Cassandra propose. Then you can use the UUIDType comparator.
CREATE COLUMN FAMILY timeseries_data
WITH comparator = UUIDType
AND key_validation_class=UTF8Type;
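With Astyanax, writing such columns could look roughly like the sketch below. The timeSeriesCf handle and the use of TimeUUIDUtils are assumptions on my part, so adapt the serializers and names to your setup:
import java.util.UUID;

import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.serializers.TimeUUIDSerializer;
import com.netflix.astyanax.util.TimeUUIDUtils;

// Column family keyed by a String row key, with time-based UUID column names
ColumnFamily<String, UUID> timeSeriesCf = new ColumnFamily<String, UUID>(
        "timeseries_data", StringSerializer.get(), TimeUUIDSerializer.get());

MutationBatch m = keyspace.prepareMutationBatch();
UUID columnName = TimeUUIDUtils.getUniqueTimeUUIDinMicros(); // time-based (version 1) UUID
m.withRow(timeSeriesCf, key).putEmptyColumn(columnName);
m.execute(); // throws ConnectionException
Because UUIDType sorts version 1 UUIDs by their embedded timestamp, reading the row back then returns the columns in the order they were created.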
How can I select all items within a given date range?
SELECT * FROM GameScores where createdAt >= start_date && createdAt <= end_date
I want to make a query like this. Do I need to create a global secondary index or not?
I've tried this
import java.text.SimpleDateFormat;
import java.util.Date;

import com.amazonaws.services.dynamodbv2.document.ItemCollection;
import com.amazonaws.services.dynamodbv2.document.ScanOutcome;
import com.amazonaws.services.dynamodbv2.document.spec.ScanSpec;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;

public void getItemsByDate(Date start, Date end) {
    SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
    String stringStart = df.format(start);
    String stringEnd = df.format(end);

    ScanSpec scanSpec = new ScanSpec()
            .withFilterExpression("CreatedAt BETWEEN :from AND :to")
            .withValueMap(new ValueMap()
                    .withString(":from", stringStart)
                    .withString(":to", stringEnd));

    ItemCollection<ScanOutcome> items = gamesScoresTable.scan(scanSpec);
}
But it doesn't work; I'm getting fewer results than expected.
I can answer your questions, but to suggest any real solution, I would need to see the general shape of your data, as well as what your GameScore's primary key is.
TLDR;
Set up your table so that you can retrieve data with queries rather than scans and filters, and then create indexes to support less frequent access patterns and improve querying flexibility. Reads that provide the full primary key (or, although not as fast, a partial one), i.e. queries, are very fast, so DynamoDB is at its best when the table structure is driven by the application's access patterns.
When designing your tables, keep in mind NoSQL design best practices, as well as best practices for querying and scanning and it will pay dividends in the long run.
Explanations
Question 1
How can I select all items within a given date range?
To answer this, I'd like to break that question down a little more. Let's start with: How can I select all items?
This, you have already accomplished. A scan is a great way to retrieve all items in your table, and unless you have all your items within one partition, it is the only way to retrieve all the items in your table. Scans can be helpful when you have to access data by unknown keys.
Scans, however, have limitations, and as your table grows in size they'll cost you in both performance and dollars. A single scan request can only retrieve a maximum of 1 MB of data, from a single partition, and is capped at that partition's read capacity. When a scan tops out at either limit, subsequent requests continue sequentially from where the last one stopped, meaning a scan of a large table can take multiple round trips.
On top of that, a scan consumes read capacity based on the size of every item it reads, no matter how much (or little) data is returned. Even if you only request a small number of attributes in your ProjectionExpression and your FilterExpression eliminates 90% of the items in your table, you still paid to read the entire table.
You can optimize performance of scans using Parallel Scans, but if you require an entire table scan for an access pattern that happens frequently for your application, you should consider restructuring your table. More about scans.
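As an illustration, a segmented scan with the Document API could look roughly like this; in a real parallel scan each segment would run on its own thread, and the segment count here is arbitrary (gamesScoresTable is the handle from your question):
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.ItemCollection;
import com.amazonaws.services.dynamodbv2.document.ScanOutcome;
import com.amazonaws.services.dynamodbv2.document.spec.ScanSpec;

int totalSegments = 4; // arbitrary; tune to table size and worker count
for (int segment = 0; segment < totalSegments; segment++) {
    // Each segment covers a disjoint slice of the table and can be scanned independently.
    ScanSpec segmentSpec = new ScanSpec()
            .withTotalSegments(totalSegments)
            .withSegment(segment);
    ItemCollection<ScanOutcome> items = gamesScoresTable.scan(segmentSpec);
    for (Item item : items) {
        // process item
    }
}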
Let's now look at: How can I select all items, based on some criteria?
The ideal way to accomplish retrieving data based on some criteria (in your case SELECT * FROM GameScores where createdAt >= start_date && createAt <=end_date) would be to query the base table (or index). To do so, per the documentation:
You must provide the name of the partition key attribute and a single value for that attribute. Query returns all items with that partition key value.
Like the documentation says, querying a partition will return all of its values. If your GameScores table has a partition key of GameName, then a query for GameName = PacMan will return all Items with that partition key. Other GameName partitions, however, will not be captured in this query.
If you need more depth in your query:
Optionally, you can provide a sort key attribute and use a comparison operator to refine the search results.
Here's a list of all the possible comparison operators you can use with your sort key. This is where you can leverage a between comparison operator in the KeyConditionExpression of your query operation. Something like: GameName = PacMan AND createdAt BETWEEN time1 AND time2 will work, if createdAt is the sort key of the table or index that you are querying.
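In Document API code that could look roughly like the following, assuming GameName is the partition key and CreatedAt the sort key of whatever table or index you query (and reusing the ISO-8601 strings from your question):
import com.amazonaws.services.dynamodbv2.document.ItemCollection;
import com.amazonaws.services.dynamodbv2.document.QueryOutcome;
import com.amazonaws.services.dynamodbv2.document.spec.QuerySpec;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;

QuerySpec querySpec = new QuerySpec()
        .withKeyConditionExpression("GameName = :game AND CreatedAt BETWEEN :from AND :to")
        .withValueMap(new ValueMap()
                .withString(":game", "PacMan")
                .withString(":from", stringStart)
                .withString(":to", stringEnd));

ItemCollection<QueryOutcome> items = gamesScoresTable.query(querySpec);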
If it is not the sort key, you might have the answer to your second question.
Question 2
Do I need to create a Global Secondary Index?
Let's start with: Do I need to create an index?
If your base table's data structure does not fit some of your application's access patterns, you might need to. However, in DynamoDB, denormalizing data can also support more access patterns. I would recommend watching this video on how to structure your data.
Moving onto: Do I need to create a GSI?
GSIs do not support strongly consistent reads, so if you need those, you'll have to go with a Local Secondary Index (LSI). However, if you've already created your base table, you won't be able to add an LSI. Another difference between the two is the primary key: a GSI can have a different partition key and sort key than the base table, while an LSI can only differ in the sort key. More about indexes.
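If you do end up with a GSI, it is queried just like the base table, only through the index handle. A minimal sketch, assuming a hypothetical index named "CreatedAtIndex" and the same QuerySpec shape as above:
import com.amazonaws.services.dynamodbv2.document.Index;
import com.amazonaws.services.dynamodbv2.document.ItemCollection;
import com.amazonaws.services.dynamodbv2.document.QueryOutcome;

// "CreatedAtIndex" is a made-up name; use whatever you call your GSI.
Index createdAtIndex = gamesScoresTable.getIndex("CreatedAtIndex");
ItemCollection<QueryOutcome> results = createdAtIndex.query(querySpec);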
I am not able to find any satisfying solution so asking here.
I need to compare the data of two large tables (~50M rows each) with the same schema definition in Java.
I cannot use an ORDER BY clause when getting the ResultSet object, and the records might not be in the same order in the two tables.
Can anyone suggest the right way to do this?
You could extract the data of the first DB table into a text file, then loop over the ResultSet for the second table and, as you iterate through it, search/verify each row against the text file. This solution works if memory is a concern for you.
If not, just use a HashMap to hold the data of the first table, then loop over the second table and look its records up in the HashMap.
This really depends on what you mean by 'compare'. Are you trying to see if they both contain the exact same data? Find rows in one that are not in the other? Find rows with the same primary keys but differing values?
Also, why do you have to do this in Java? Regardless of what exactly you are trying to do, it's probably easier to do with SQL.
In Java, you'll want to create a class that represents the primary key of the tables, and a second class that represents the rest of the data and also references the primary key class. If you only have a single column as the primary key, this is easier.
We'll call P the primary key class, and D the rest.
Map<P, D> map = new HashMap<>();
Select all of the rows from the first table, and insert them into the hash map.
Query all of the rows in the second table.
For each row, create a P object.
Use it to look up what data the first table had for the same key.
Now you know whether both tables contained the same row, and you can compare the non-key values from both.
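Putting those steps together, a rough sketch could look like the following. It assumes a single-column numeric primary key (with a multi-column key you would use a small P class with proper equals/hashCode as the map key); the table and column names are placeholders:
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

public class TableDiff {

    // D = the non-key columns of a row; adjust the fields to your schema.
    static class D {
        final String col1;
        final int col2;

        D(String col1, int col2) {
            this.col1 = col1;
            this.col2 = col2;
        }

        boolean sameAs(D other) {
            return col1.equals(other.col1) && col2 == other.col2;
        }
    }

    public static void compare(Connection conn) throws SQLException {
        Map<Long, D> first = new HashMap<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, col1, col2 FROM table_a")) {
            while (rs.next()) {
                first.put(rs.getLong("id"), new D(rs.getString("col1"), rs.getInt("col2")));
            }
        }
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, col1, col2 FROM table_b")) {
            while (rs.next()) {
                long id = rs.getLong("id");
                D other = new D(rs.getString("col1"), rs.getInt("col2"));
                D mine = first.remove(id);
                if (mine == null) {
                    System.out.println("Only in table_b: " + id);
                } else if (!mine.sameAs(other)) {
                    System.out.println("Differs for id " + id);
                }
            }
        }
        // Whatever is left in the map existed only in table_a.
        for (Long id : first.keySet()) {
            System.out.println("Only in table_a: " + id);
        }
    }
}
Keep in mind that holding ~50M keys and values in a HashMap needs a lot of heap; if that is a problem, the text-file approach from the other answer (or doing the comparison in SQL) is the fallback.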
Like I said, this is much, much easier to do in straight SQL.
You basically do a full outer join between the two tables. How exactly that join looks depends on exactly what you are trying to do.
I am modelling a Cassandra schema to get a bit more familiar with the subject and was wondering what the best practice is regarding creating indexes.
For example:
create table emailtogroup(email text, groupid int, primary key(email));
select * from emailtogroup where email='joop';
create index on emailtogroup(groupid);
select * from emailtogroup where groupid=2 ;
Or I can create an entirely new table:
create table grouptoemail(groupid int, email text, primary key(groupid, email));
select * from grouptoemail where groupid=2;
They both do the job.
I would expect creating a new table to be faster because groupid then becomes the partition key, but I'm not sure what "magic" happens when creating an index and whether that magic has a downside.
In my opinion your first approach is correct.
create table emailtogroup(email text, groupid int, primary key(email));
because 1) in your case email is more or less unique, a good candidate for the primary key, and 2) multiple emails can belong to the same group, a good candidate for a secondary index. Please refer to this post - Cassandra: choosing a Partition Key
The partitioning key is used to distribute data across different nodes, and if you want your nodes to be balanced (i.e. well distributed data across each node) then you want your partitioning key to be as random as possible.
The second form of table creation is useful for range scans. For example, if you have a use case like
i) List all the email groups which the user has joined from 1st Jan 2010 to 1st Jan 2013.
In that case you may have to design a table like
create table grouptoemail(email text, ts timestamp, groupid int, primary key(email, ts));
In this case all the email groups which the user joined will be clustered on disk (stored together on disk).
It depends on the cardinality of groupid. From the Cassandra docs:
When not to use an index
Do not use an index to query a huge volume of records for a small number of results. For example, if you create an index on a high-cardinality column, which has many distinct values, a query between the fields will incur many seeks for very few results. In the table with a billion users, looking up users by their email address (a value that is typically unique for each user) instead of by their state, is likely to be very inefficient. It would probably be more efficient to manually maintain the table as a form of an index instead of using the Cassandra built-in index. For columns containing unique data, it is sometimes fine performance-wise to use an index for convenience, as long as the query volume to the table having an indexed column is moderate and not under constant load.

Naturally, there is no support for counter columns, in which every value is distinct.

Conversely, creating an index on an extremely low-cardinality column, such as a boolean column, does not make sense. Each value in the index becomes a single row in the index, resulting in a huge row for all the false values, for example. Indexing a multitude of indexed columns having foo = true and foo = false is not useful.
So basically, if you are going to be dealing with a large dataset, and groupid won't return a lot of rows, a secondary index may not be the best idea.
Week #4 of DataStax Academy's Java Development with Apache Cassandra class talks about how to model these problems efficiently. Check it out if you get a chance.
I have recently started taking much interest in CQL as I am thinking of using the DataStax Java driver. Previously, I was using column families instead of tables, with the Astyanax driver. I need to clarify something here.
I am using the column family definition below in my production cluster, and I can insert arbitrary columns (with their values) on the fly without actually modifying the column family schema.
create column family FAMILY_DATA
with key_validation_class = 'UTF8Type'
and comparator = 'UTF8Type'
and default_validation_class = 'BytesType'
and gc_grace = 86400;
But after going through this post, it looks like I would need to alter the schema every time I get a new column to insert, which is not what I want to do, as I believe CQL3 requires column metadata to exist.
Is there any other way I can still add arbitrary columns and their values if I go with the DataStax Java driver?
Any code samples/examples will help me understand better. Thanks.
I believe in CQL you solve this problem using collections.
You can define the data type of a field to be a map, and then insert an arbitrary number of key-value pairs into the map; that should mostly behave the way dynamic columns did in traditional Thrift.
Something like:
CREATE TABLE data ( data_id int PRIMARY KEY, data_time bigint, data_values map<text, float> );
INSERT INTO data (data_id, data_time, data_values) VALUES (1, 21341324, {'sum': 2134, 'avg': 44.5});
Here is more information.
Additionally, you can find the mapping between the CQL3 types and the Java types used by the DataStax driver here.
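As a rough sketch of how that could look from the DataStax Java driver (the contact point and keyspace name are placeholders, and this assumes a driver version that supports simple statements with bind values, i.e. 2.0+):
import java.util.HashMap;
import java.util.Map;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect("my_keyspace"); // placeholder keyspace

// A CQL map column binds to a java.util.Map
Map<String, Float> values = new HashMap<String, Float>();
values.put("sum", 2134f);
values.put("avg", 44.5f);

session.execute(
        "INSERT INTO data (data_id, data_time, data_values) VALUES (?, ?, ?)",
        1, 21341324L, values);

cluster.close();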
If you enable compact storage for that table, it will be backwards compatible with Thrift and CQL 2.0, both of which allow you to enter dynamic column names.
You can have as many columns of whatever name you want with this approach. The primary key is composed of two parts: the first element, which is the row key, and the remaining elements, which combined form a single column name.
See the tweets example here
Though you've said this is in production already, it may not be possible to alter a table with existing data to use compact storage.
I want to have a table-like representation of data with multiple columns, e.g. consider the following sample:
---------------------------------------------------------------
col1 col2 col3 col4 col5(numeric) col6(numeric)
---------------------------------------------------------------
val01 val02 val03 val04 05 06
val11 val12 val13 val14 15 16
val21 val22 val23 val24 25 26
val31 val32 val33 val34 35 36
.
.
.
---------------------------------------------------------------
I'd like to query this table by a value in a given column, e.g. search for the value val32 in column col2, which should return all rows that match the query, in the same tabular format.
For some columns, say col5 and col6, I'd like to perform mathematical operations/queries like getMax(), getMin(), getSum(), divideAll(), etc.
For such a requirement, can anybody suggest a data structure (or combination of data structures) that could best solve my purpose, considering efficient operations (like the mathematical examples above) and querying?
Let me know if anybody need more information.
Edit: Additional requirement
This should be efficient enough to handle hundreds of millions of rows and also easy and efficient to persist.
What you need is a three-part approach:
A Row class that contains fields for each column
A List<Row> to store the rows and provide sequential access
One or more Map<String,Row> or Map<Integer,Row> to provide fast lookup of the rows by various column values. If the column values are not unique then you need a MultiMap<...> implementation (there are several available on the Internet) to allow multiple values for a given key.
The Row objects are first placed in the list, and then you build the index(es) after you have loaded all the rows.
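A minimal sketch of that approach (the Row fields and the choice of col2 as an indexed column simply mirror the sample table above):
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Row {
    String col1, col2, col3, col4;
    long col5, col6;

    Row(String col1, String col2, String col3, String col4, long col5, long col6) {
        this.col1 = col1; this.col2 = col2; this.col3 = col3; this.col4 = col4;
        this.col5 = col5; this.col6 = col6;
    }
}

class Table {
    private final List<Row> rows = new ArrayList<Row>();
    // One index per column you need to search on; lists handle non-unique values.
    private final Map<String, List<Row>> col2Index = new HashMap<String, List<Row>>();

    void add(Row r) {
        rows.add(r);
    }

    // Build (or rebuild) the index after all rows are loaded.
    void buildIndexes() {
        col2Index.clear();
        for (Row r : rows) {
            List<Row> bucket = col2Index.get(r.col2);
            if (bucket == null) {
                bucket = new ArrayList<Row>();
                col2Index.put(r.col2, bucket);
            }
            bucket.add(r);
        }
    }

    List<Row> findByCol2(String value) {
        List<Row> result = col2Index.get(value);
        return result != null ? result : Collections.<Row>emptyList();
    }

    long sumCol5() {
        long sum = 0;
        for (Row r : rows) {
            sum += r.col5;
        }
        return sum;
    }
}
For hundreds of millions of rows (the edited requirement) an in-memory structure like this will not fit on a typical heap, so some form of persistent store would still be needed underneath.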
I think the below should help:
Map<String,List<Object>>
Search for "val32" in "col2", i.e. search(col2, val32):
get the list of objects associated with col2 (map.get("col2")) and iterate over it to find whether the value exists or not.
getSum(String columnName):
Again, just get the list, iterate over it and add the values up, then return the final sum.
Since you are adding Lists of Objects, you might want to throw ClassCastException from these APIs.
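A small sketch of that column-map idea (the column names and the Number cast are illustrative only):
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

Map<String, List<Object>> columns = new HashMap<String, List<Object>>();
columns.put("col2", new ArrayList<Object>(Arrays.asList("val02", "val12", "val32")));
columns.put("col5", new ArrayList<Object>(Arrays.asList(5, 15, 35)));

// search(col, value): scan the column's list for the value
boolean found = columns.get("col2").contains("val32");

// getSum(col): iterate and add; a ClassCastException is thrown if a value is not numeric
long sum = 0;
for (Object o : columns.get("col5")) {
    sum += ((Number) o).longValue();
}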
Finally, I planned to use a Mongo database instead of going through all these basic and complicated implementations.
I hope this will solve my problem. Or is there any other DB better than this in terms of speed, storage, and availability of the required operations (as mentioned in the question)?