I am retrieving data from Cassandra and mapping it to a class using the built-in object mapping API in the Java driver. After I process the data I want to delete it. My clustering key is a timestamp and it is mapped to a Date object. When I try to delete a partition it does not get deleted. I suspect that this is because of the mapping to the Date object and that some data is lost there. Have you encountered a similar problem?
The Accessor:
@Query("SELECT * FROM my_table WHERE id = ? AND event_time < ?")
Result<MyObject> getAllObjectsByTime(UUID id, Date eventToTime);
The retrieval of the objects:
MappingManager manager = new MappingManager(_cassandraDatabaseManager.getSession());
CassandraAccessor cassandraAccessor = manager.createAccessor(CassandraAccessor.class);
Result<MyObject> myObjectResult = cassandraAccessor.getAllObjectsByTime(id, eventToTime);
MyObject:
@Table(keyspace = "myKeyspace", name = "my_table")
public class MyObject
{
    @PartitionKey
    @Column(name = "id")
    private UUID id;

    @Column(name = "event_time")
    private Date eventTime;
}
The delete logic:
PreparedStatement statement = session
.prepare("DELETE FROM my_table WHERE id = ? AND event_time = ?;");
BatchStatement batch = new BatchStatement();
for (MyObject myObject : myObjects)
{
    batch.add(statement.bind(myObject.getStoreId(), myObject.getEventTime()));
}
session.execute(batch);
EDIT
After a lot of debugging I figured that maybe the Date is not the problem. It appears that the delete is working, but not for all of the partitions. When I debug the Java application I get the following CQL statement:
DELETE FROM my_table WHERE id=86a2f31d-5e6e-448b-b16c-052fe92a87c9 AND event_time=1442491082128;
When it is executed through the Cassandra Java Driver the partition is not deleted. If I execute it in the CQLSH console the partition is deleted. I have no idea what is happening. I am starting to suspect that there is a problem with the Cassandra Java Driver. Any ideas?
Edit 2
This is the table:
CREATE TABLE my_table(
id uuid,
event_time timestamp,
event_data text,
PRIMARY KEY (id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC)
I'd need to see more of your code to understand how you are issuing the delete, but perhaps you aren't specifying the timestamp to the correct precision on the delete.
Internally timestamp fields are epoch time in milliseconds. When you look at a timestamp in cqlsh, it shows the timestamp rounded down to the nearest second like this:
SELECT * from t12 where a=1 and b>'2015-09-16 12:51:49+0000';
a | b
---+--------------------------
1 | 2015-09-16 12:51:49+0000
So if you try to delete using that date string, it won't be an exact match since the real value is something like 2015-09-16 12:51:49.123+0000
If you show the timestamp as an epoch time in milliseconds, then you can delete it with that:
SELECT a, blobAsBigint(timestampAsBlob(b)) from t12;
a | system.blobasbigint(system.timestampasblob(b))
---+------------------------------------------------
1 | 1442407909964
DELETE from t12 where a=1 and b=1442407909964;
See this.
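For what it's worth, the Java driver binds a java.util.Date with its full millisecond value, so logging getTime() on the value you are about to bind lets you compare it against the blobAsBigint output above. A tiny sketch (myObject follows the mapped class from the question):
import java.util.Date;

// A Date carries epoch millis; getTime() is exactly the value the driver binds,
// so it can be compared with blobAsBigint(timestampAsBlob(event_time)) in cqlsh.
Date eventTime = myObject.getEventTime();
System.out.println("binding event_time = " + eventTime.getTime()); // e.g. 1442407909964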
I have seen problems with batched statements being dropped or timing out. How many deletes are you trying to execute per batch? Try either lowering your batch size or removing batching altogether.
Remember, batch statements in Cassandra were designed to apply an update atomically to several different tables. They really weren't intended to be used to slam a few thousand updates into one table.
For a good description of how batch statements work, watch the video from (DataStax MVP) Chris Batey's webinar on Avoiding Cassandra Anti-Patterns. At the 16:00 mark he discusses what (exactly) happens in your cluster when it applies a batch statement.
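If you do want to keep batching for convenience, one option is to cap the batch size and flush periodically. A rough sketch reusing the question's prepared statement and objects (the UNLOGGED type and the limit of 50 are illustrative choices, since these deletes hit different partitions):
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import java.util.List;

void deleteInSmallBatches(Session session, PreparedStatement statement, List<MyObject> myObjects)
{
    final int MAX_BATCH_SIZE = 50; // illustrative, tune for your cluster
    BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
    for (MyObject myObject : myObjects)
    {
        batch.add(statement.bind(myObject.getStoreId(), myObject.getEventTime()));
        if (batch.size() >= MAX_BATCH_SIZE)
        {
            session.execute(batch); // flush this chunk
            batch.clear();
        }
    }
    if (batch.size() > 0)
    {
        session.execute(batch); // flush the remainder
    }
}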
Related
I'm working with Cassandra using their Java API for interacting with it. I have entity classes that my mapper object uses to perform CRUD operations. I needed a custom query to retrieve all my Purchase objects from a specific timespan. However, when I run my query below, I never get anything in return. Fun fact though: after more extensive testing, the query does work on my colleague's Mac running Cassandra 3.11.2. My machine is running Windows and Cassandra 3.9.0.
String query = String.format("SELECT * FROM purchase WHERE timestamp >= %s AND timestamp <= %s ALLOW FILTERING;", startTimestamp, endTimestamp);
purchases = session.execute(query);
I have also tried using the IN operation; however, I can't find any information on what it actually does, and though it doesn't throw any exception, it won't find any purchases at all:
String query = String.format("SELECT * FROM purchase WHERE timestamp IN (%s , %s);", startTimestamp, endTimestamp);
I finally managed to solve it. Turns out, if you store something with the Cassandra timestamp data type, you cannot select the item with a long anymore. You have to use a date-formatted string. I solved it like this:
startDate = simpleDateFormat.format(Long.valueOf(startTimestamp));
endDate = simpleDateFormat.format(Long.valueOf(endTimestamp));
query = String.format("SELECT * FROM purchase WHERE timestamp >= '%s' AND timestamp <= '%s' ALLOW FILTERING;", startDate, endDate);
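For reference, a sketch of how the formatter might be set up; the exact pattern and the UTC timezone are assumptions, so match them to how your data was written (note that a pattern without .SSS drops the millisecond part, which is fine for a >= / <= range but not for exact matches):
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Assumed pattern for a CQL timestamp literal such as '2017-12-12 13:40:32+0000';
// use "yyyy-MM-dd HH:mm:ss.SSSZ" if you need millisecond precision.
SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ssZ");
simpleDateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));

String startDate = simpleDateFormat.format(Long.valueOf(startTimestamp));
String endDate = simpleDateFormat.format(Long.valueOf(endTimestamp));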
The TL;DR is that I am not able to delete a row previously created with an upsert using Java.
Basically I have a table like this:
CREATE TABLE transactions (
key text PRIMARY KEY,
created_at timestamp
);
Then I execute:
String sql = "update transactions set created_at = toTimestamp(now()) where key = 'test' if created_at = null";
session.execute(sql)
As expected the row is created:
cqlsh:thingleme> SELECT * FROM transactions ;
key | created_at
------+---------------------------------
test | 2018-01-30 16:35:16.663000+0000
But (this is what is making me crazy) if I execute:
sql = "delete from transactions where key = 'test'";
ResultSet resultSet = session.execute(sql);
Nothing happens. I mean: no exception is thrown and the row is still there!
Some other weird stuff:
if I replace the upsert with a plain insert, then the delete works
if I directly run the sql code (update and delete) by using cqlsh, it works
If I run this code against an EmbeddedCassandraService, it works (this is very bad, because my integration tests are just green!)
My environment:
cassandra: 3.11.1
datastax java driver: 3.4.0
docker image: cassandra:3.11.1
Any idea/suggestion on how to tackle this problem is really appreciated ;-)
I think the issue you are encountering might be explained by the mixing of lightweight transactions (LWTs) (update transactions set created_at = toTimestamp(now()) where key = 'test' if created_at = null) and non-LWTs (delete from transactions where key = 'test').
Cassandra uses timestamps to determine which mutations (deletes, updates) are the most recently applied. When using LWTs, the timestamp assignment is different than when not using LWTs:
Lightweight transactions will block other lightweight transactions from occurring, but will not stop normal read and write operations from occurring. Lightweight transactions use a timestamping mechanism different than for normal operations and mixing LWTs and normal operations can result in errors. If lightweight transactions are used to write to a row within a partition, only lightweight transactions for both read and write operations should be used.
Source: How do I accomplish lightweight transactions with linearizable consistency?
Further complicating things is that by default the Java driver uses client timestamps, meaning the write timestamp is determined by the client rather than the coordinating Cassandra node. However, when you use LWTs, the client timestamp is bypassed. In your case, unless you disable client timestamps, your non-LWT queries are using client timestamps, while your LWT queries are using a timestamp assigned by the Paxos logic in Cassandra. In any case, even if the driver weren't assigning client timestamps, this still might be a problem because the timestamp assignment logic is different on the C* side for LWT and non-LWT as well.
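(As an aside, the 3.x driver can be configured to let the coordinator assign write timestamps instead of the client, as in the sketch below; as explained above, the LWT path still uses its own timestamping, so this alone may not resolve the mismatch.)
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ServerSideTimestampGenerator;

// Let the coordinating node assign write timestamps instead of the client.
Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1") // placeholder contact point
        .withTimestampGenerator(ServerSideTimestampGenerator.INSTANCE)
        .build();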
To fix this, you could alter your delete statement to include IF EXISTS, i.e.:
delete from transactions where key = 'test' if exists
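Through the Java driver that might look like the following sketch (same session and table as in the question); IF EXISTS turns the delete into a lightweight transaction, so it goes through the same Paxos timestamping path as the conditional update that created the row:
import com.datastax.driver.core.ResultSet;

ResultSet rs = session.execute("DELETE FROM transactions WHERE key = 'test' IF EXISTS");
boolean deleted = rs.wasApplied(); // true if the row existed and was removed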
Similar issue from the java driver mailing list
I prefer to use the timestamp as one of the columns in Cassandra (which I decided to use as the clustering key). Which is the right way to store the column as a timestamp in Cassandra?
(i.e.) Is it fine to use the milliseconds (example: 1513078338560) directly, like below?
INSERT INTO testdata (nodeIp, totalCapacity, physicalUsage, readIOPS, readBW, writeIOPS, writeBW, writeLatency, flashMode, timestamp) VALUES('172.30.56.60',1, 1,1,1,1,1,1,'yes',1513078338560);
Or to use dateof(now()), like below?
INSERT INTO testdata (nodeIp, totalCapacity, physicalUsage, readIOPS, readBW, writeIOPS, writeBW, writeLatency, flashMode, timestamp) VALUES('172.30.56.60',1, 1,1,1,1,1,1,'yes',dateof(now()));
Which is the faster and recommended way to use for timestamp-based queries in Cassandra?
NOTE: I know internally it is stored as milliseconds; I used 'SELECT timestamp, blobAsBigint(timestampAsBlob(timestamp)) FROM'
Thanks,
Harry
The dateof function is deprecated in Cassandra >= 2.2... Instead it's better to use the toTimestamp function, like this: toTimestamp(now()). When selecting, you can also use the toUnixTimestamp function if you want to get the timestamp as a long:
cqlsh:test> CREATE TABLE test_times (a int, b timestamp, PRIMARY KEY (a,b));
cqlsh:test> INSERT INTO test_times (a,b) VALUES (1, toTimestamp(now()));
cqlsh:test> SELECT toUnixTimestamp(b) FROM test_times;
system.tounixtimestamp(b)
---------------------------
1513086032267
(1 rows)
cqlsh:test> SELECT b FROM test_times;
b
---------------------------------
2017-12-12 13:40:32.267000+0000
(1 rows)
Regarding performance, there are different considerations:
If you already have the timestamp as a number, then you can use it instead of calling a function.
It's better to use prepared statements instead of "raw" inserts - in this case Cassandra doesn't need to transfer the full query, only the data, and doesn't need to parse the statement every time.
The pseudo-code will look like the following (Java-like):
// prepare the statement once and reuse it for every insert
PreparedStatement prepared = session.prepare(
        "insert into your_table (field1, field2) values (?, ?)");
while (true) {
    // only the bound values are sent over the wire
    session.execute(prepared.bind(value1, value2));
}
I have an update query which I am trying to execute through the batchUpdate method of Spring's JdbcTemplate. This update query can potentially match thousands of rows in the EVENT_DYNAMIC_ATTRIBUTE table which need to be updated. Will updating thousands of rows in a table cause any issue in the production database apart from a timeout? For example, will it crash the database or slow down the performance of the entire database engine for other connections, etc.?
Is there a better way to achieve this instead of firing a single update query through the Spring JdbcTemplate or JPA? I have the following settings for the JdbcTemplate:
this.jdbc = new JdbcTemplate(ds);
jdbc.setFetchSize(1000);
jdbc.setQueryTimeout(0); // zero means there is no limit
The update query:
UPDATE EVENT_DYNAMIC_ATTRIBUTE eda
SET eda.ATTRIBUTE_VALUE = 'claim',
eda.LAST_UPDATED_DATE = SYSDATE,
eda.LAST_UPDATED_BY = 'superUsers'
WHERE eda.DYNAMIC_ATTRIBUTE_NAME_ID = 4002
AND eda.EVENT_ID IN
(WITH category_data
AS ( SELECT c.CATEGORY_ID
FROM CATEGORY c
START WITH CATEGORY_ID = 495984
CONNECT BY PARENT_ID = PRIOR CATEGORY_ID)
SELECT event_id
FROM event e
WHERE EXISTS
(SELECT 't'
FROM category_data cd
WHERE cd.CATEGORY_ID = e.PRIMARY_CATEGORY_ID))
If it is a one-time thing, I normally first select the records that need to be updated and put them in a temporary table or a CSV, making sure I save the primary key of those records. Then I read the records in batches from the temporary table or CSV and update the table using the primary key. This way tables are not locked for a long time, each batch contains a fixed set of records that need updating, and the updates are done by primary key, so they are very fast. If any update fails, you know which records failed by logging the failed records' primary keys to a log file or an error table. I have followed this approach many times for updating millions of records in a PROD database, as it is a very safe approach.
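A rough sketch of that approach with Spring's JdbcTemplate (the chunk size is illustrative, EDA_ID is a placeholder for the table's primary key column, and the list of IDs is assumed to have been selected beforehand as described above):
import java.util.ArrayList;
import java.util.List;
import org.springframework.jdbc.core.JdbcTemplate;

public class AttributeUpdater {

    private static final int CHUNK_SIZE = 1000; // illustrative batch size

    // Updates by primary key in fixed-size chunks so no single statement
    // locks thousands of rows for a long time.
    public void updateInChunks(JdbcTemplate jdbc, List<Long> idsToUpdate) {
        String sql = "UPDATE EVENT_DYNAMIC_ATTRIBUTE SET ATTRIBUTE_VALUE = 'claim', "
                + "LAST_UPDATED_DATE = SYSDATE, LAST_UPDATED_BY = 'superUsers' "
                + "WHERE EDA_ID = ?"; // EDA_ID is a placeholder primary key column
        List<Object[]> batchArgs = new ArrayList<>();
        for (Long id : idsToUpdate) {
            batchArgs.add(new Object[] { id });
            if (batchArgs.size() == CHUNK_SIZE) {
                jdbc.batchUpdate(sql, batchArgs); // one manageable chunk
                batchArgs.clear();
            }
        }
        if (!batchArgs.isEmpty()) {
            jdbc.batchUpdate(sql, batchArgs); // remaining rows
        }
    }
}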
I am trying to perform a SELECT on a cassandra database, using the datastax driver on a Java App.
I have already developed simple queries such as:
@Repository
public class CustomTaskRepository
        extends AbstractCassandraRepository<CustomTask> {

    @Accessor
    interface ProfileAccessor {
        @Query("SELECT * FROM tasks where status = :status")
        Result<CustomTask> getByStatus(@Param("status") String status);
    }

    public List<CustomTask> getByStatus(String status) {
        ProfileAccessor accessor = this.mappingManager.createAccessor(ProfileAccessor.class);
        Result<CustomTask> tasks = accessor.getByStatus(status);
        return tasks.all();
    }
}
That works great.
The problem I have now is that I want to execute a SELECT statement for more than one status. For example, I would like to execute the query for one, two ... or more status codes (Pending, Working, Finalized, etc.).
How could I create a @Query statement with the flexibility of accepting one or more status codes?
Thanks in advance!!!
EDIT: The table create statement is:
CREATE TABLE tasks(
"reservation_id" varchar,
"task_id" UUID,
"status" varchar,
"asignee" varchar,
PRIMARY KEY((reservation_id),task_id)
)
WITH compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'} ;
CREATE INDEX taskid_index ON tasks( task_id );
CREATE INDEX asignee_index ON tasks( asignee );
Try using IN instead of =. If this is the partition key you will get the rows that you need. Also note that it might cause performance degradation if there are a lot of statuses in the IN clause.
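With the driver's mapper, the accessor can take a collection parameter that is bound into the IN clause. A rough sketch (the method and parameter names are made up; whether Cassandra accepts the IN restriction depends on your data model, since status is not a key column here, so an index or ALLOW FILTERING may be needed just as for the existing = query):
import java.util.Arrays;
import java.util.List;
import com.datastax.driver.mapping.Result;
import com.datastax.driver.mapping.annotations.Accessor;
import com.datastax.driver.mapping.annotations.Param;
import com.datastax.driver.mapping.annotations.Query;

@Accessor
interface ProfileAccessor {
    // The bound List is expanded into the IN clause when the query executes.
    @Query("SELECT * FROM tasks WHERE status IN :statuses")
    Result<CustomTask> getByStatuses(@Param("statuses") List<String> statuses);
}

// Usage:
// List<CustomTask> tasks =
//         mappingManager.createAccessor(ProfileAccessor.class)
//                       .getByStatuses(Arrays.asList("Pending", "Working"))
//                       .all();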