We have a table that will contain a huge amount of time series data; we will probably have to store several entries per millisecond. To meet these requirements, the table looks like this:
CREATE TABLE statistic (
    name text,
    id uuid,
    start timestamp,
    other_data ...,
    PRIMARY KEY (name, start, id)
) WITH CLUSTERING ORDER BY (start DESC);
As you can see, the table has two clustering columns: start stores the time the data arrives, and id prevents data from being overwritten when several entries arrive at the same time.
So far this is fine; we can run range queries like
SELECT * FROM statistic WHERE name ='foo' AND start >= 1453730078182
AND start <= 1453730078251;
But we also need to be able to add further search parameters to the query, like
SELECT * FROM statistic WHERE name = 'foo'
AND start >= 1453730078182 AND start <= 1453730078251 AND other_data = 'bar';
Of course this does not work, because other_data is not part of the primary key. If we add it to the primary key, we get the following error:
InvalidRequest: code=2200 [Invalid query] message="PRIMARY KEY column "other_data" cannot be restricted (preceding column "start" is restricted by a non-EQ relation)"
That is also OK; that is simply not the way Cassandra works (I think).
Our approach to the problem is to select the needed time series data with the first range query above and then filter it in our Java application, i.e. we walk through the list and discard everything we don't need. A single entry does not hold much data, but in the worst case we may be talking about several million rows.
Now I have two questions:
Is that the right approach to solve the problem?
Is Cassandra capable of handling that amount of data?
This is a sweet spot for a secondary index on the column other_data. In your case the index will scale, because you always provide the partition key (name), so Cassandra will not hit all nodes in the cluster.
With a secondary index on other_data, your second SELECT statement will be possible.
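For illustration, here is a rough sketch of what that could look like from the Java side using the DataStax Java driver; the contact point, keyspace and index name are placeholders, and depending on the Cassandra version the range query may additionally need ALLOW FILTERING appended.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SecondaryIndexExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build()) {
            Session session = cluster.connect("my_keyspace");

            // Create the secondary index once (a no-op if it already exists).
            session.execute("CREATE INDEX IF NOT EXISTS statistic_other_data_idx "
                          + "ON statistic (other_data)");

            // The partition key (name) is still restricted, so the query stays local
            // to the replicas owning that partition. Some Cassandra versions require
            // ALLOW FILTERING for this combination of restrictions.
            ResultSet rs = session.execute(
                "SELECT * FROM statistic WHERE name = 'foo' "
              + "AND start >= 1453730078182 AND start <= 1453730078251 "
              + "AND other_data = 'bar'");
            for (Row row : rs) {
                System.out.println(row);
            }
        }
    }
}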
Now there is another issue with your data model: partition size. If you are inserting several entries per millisecond per name, this will not scale, because the partition for each name will grow very fast.
If the inserts are distributed over different partition keys (different name values), then it's fine.
Related
I'm trying to implement a counter with Java, Spring, Hibernate and Oracle SQL. Each record represents a count for a given timestamp. Let's say each record is uniquely identified by the minute, and each record holds a count column. The service should expect to receive a ton of concurrent requests that may update the counter column of the same record.
In my table, if the record does not exist, I just insert it and set its count to 1. Otherwise, I find the record by timestamp and increase its existing counter column by 1.
To ensure we maintain data consistency and integrity, I'm using pessimistic locking. For example, if 20 counts come in at the same time, not necessarily from the same user, it's possible that we overwrite the record based on a stale read before updating. With locking, I'm ensuring that if 20 counts come in, the net effect on the database represents all 20.
So locking is fine, but the problem is that if the record never existed in the first place, and two or more concurrent requests come in trying to update the not-yet-existent record, I've observed that a duplicate record gets inserted, because we cannot lock on a record that doesn't exist yet. How can we ensure that no duplicates get created in the table? Should this be controlled in Oracle, or can I manage it in my app with Hibernate?
Thank you.
One way to avoid this sort of problem altogether would be to generate the count at the time you actually query the data. Oracle has an analytic function, ROW_NUMBER(), which can assign a row number to each record in the result set of a query. As a rough example, consider the following query:
SELECT
    ts,
    ROW_NUMBER() OVER (ORDER BY ts) rn
FROM yourTable
The count you want would be in the rn column, representing the number of records appearing since the first entry in the table. Of course, you could further restrict the query.
This approach is robust to removing records, as the count would always start with 1. One drawback is that row number functionality is not supported by Hibernate. You would have to run this either as a native query or a stored proc.
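For instance, a minimal sketch of running this as a Hibernate native query (using the same yourTable from the example; HQL itself has no window-function support):

import java.util.List;
import org.hibernate.Session;

public class RowNumberCountExample {
    @SuppressWarnings("unchecked")
    public static List<Object[]> countsByTimestamp(Session session) {
        // Each Object[] holds {ts, rn}; rn is the running count described above.
        return session.createSQLQuery(
                "SELECT ts, ROW_NUMBER() OVER (ORDER BY ts) rn FROM yourTable")
            .list();
    }
}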
We have a "audit" table that we create lots of rows in. Our persistence layer queries the audit table sequence to create a new row in the audit table. With millions of rows being created daily the select statement to get the next value from the sequence is one of our top ten most executed queries. We would like to reduce the number of database roundtrips just to get the sequence next value (primary key) before inserting a new row in the audit table. We know you can't batch select statements from JDBC. Are there any common techniques for reducing database roundtrips to get a sequence next value?
Get a batch (e.g. 1000) of sequence values in advance with a single select:
select your_sequence.nextval
from dual
connect by level <= 1000
Cache the obtained sequence values and use them for the next 1000 audit inserts.
Repeat this when you have run out of cached sequence values.
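A rough sketch of that caching in plain JDBC (the sequence name audit_seq is just an assumption):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayDeque;
import java.util.Deque;

public class CachedSequence {
    private final Deque<Long> cache = new ArrayDeque<>();
    private final int blockSize;

    public CachedSequence(int blockSize) {
        this.blockSize = blockSize;
    }

    public synchronized long next(Connection conn) throws SQLException {
        if (cache.isEmpty()) {
            refill(conn);   // one round trip per blockSize inserts
        }
        return cache.pop();
    }

    private void refill(Connection conn) throws SQLException {
        String sql = "select audit_seq.nextval from dual connect by level <= " + blockSize;
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                cache.add(rs.getLong(1));
            }
        }
    }
}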
Skip the select statement for the sequence and generate the sequence value in the insert statement itself.
insert into your_table (id, ..) values (my_sequence.nextval, ..)
No need for an extra select. If you need the sequence value, get it by adding a returning clause:
insert into your_table (id, ..) values (my_sequence.nextval, ..) returning id into ..
Save some extra time by specifying a cache value for the sequence.
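For illustration, a hedged JDBC sketch of that single-round-trip insert (the audit table's column names are assumed; it relies on the Oracle driver returning the named column through getGeneratedKeys, which it implements with a returning clause):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class InsertReturningId {
    public static long insertAudit(Connection conn, String message) throws SQLException {
        String sql = "insert into audit (id, message) values (my_sequence.nextval, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql, new String[] {"ID"})) {
            ps.setString(1, message);
            ps.executeUpdate();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                keys.next();
                return keys.getLong(1);   // the value my_sequence assigned to id
            }
        }
    }
}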
I suggest you change the INCREMENT BY option of the sequence and set it to a number like 100 (you have to decide what step size your sequence should take; 100 is just an example).
Then implement a class called SequenceGenerator. This class holds a property containing the next value, and only every 100 calls does it call sequence.nextval to keep the database sequence up to date.
This way you go to the database for the sequence nextval only once every 100 inserts.
Every time the application starts, you have to initialize the SequenceGenerator class with sequence.nextval.
The only downside of this approach is that if your application stops for any reason, you will lose some sequence values and there will be gaps in your ids. That should not be a logical problem as long as you don't put any business logic on the id values.
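A possible shape for that SequenceGenerator (a sketch only; the sequence name is assumed, and INCREMENT must match the sequence's INCREMENT BY setting):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SequenceGenerator {
    private static final int INCREMENT = 100;  // must match the sequence's INCREMENT BY
    private long nextValue;
    private long remaining = 0;

    public synchronized long next(Connection conn) throws SQLException {
        if (remaining == 0) {
            // One round trip per 100 ids: fetch the start of the next block.
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("select audit_seq.nextval from dual")) {
                rs.next();
                nextValue = rs.getLong(1);
                remaining = INCREMENT;
            }
        }
        remaining--;
        return nextValue++;
    }
}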
I am trying to retrieve a set of records from a table. The query I am using is:
select * from EmployeeUpdates eu where eu.updateid>0 and eu.department = 'EEE'
The table EmployeeUpdates has around 20 million records. 'updateid' is the primary key, and there are currently no records in the table with the department 'EEE'. But the query is taking a lot of time, which is causing the web-service call to time out.
Currently we have an index only on the column 'updateid'. 'department' is a newly added column for which we are expecting 'EEE' records.
What changes can I make to retrieve the results faster?
I'm guessing that all the update IDs are positive and, since updateid is the primary key, unique, so I suspect eu.updateid > 0 matches every row. This means it's not technically a table-space scan but an index-based scan, although if that scan still matches all 20 million rows after using the index, you might as well have a table-space scan. The only thing you can really do is add an index on the department field. Depending on what this data is, you could also keep departments in a separate table with a numeric primary key and store that as a foreign key on the eu table. You would then scan through the departments and fetch the updates associated with them, rather than searching every single update for a specific department.
I think you should look into using a table-per-subclass mapping (more here: http://docs.jboss.org/hibernate/orm/3.3/reference/en-US/html/inheritance.html#inheritance-tablepersubclass-discriminator). You can make the department the discriminator, and then you'd have EEEEmployeeUpdates and ECEmployeeUpdates classes. Your query could then just query EEEEmployeeUpdates.
Just a quick question about locking tables in a Postgres database using JDBC. I have a table I want to add a new record to, and for the primary key I use an increasing integer value.
I want to be able to retrieve the max value of this column in Java and store it as a variable to be used as a new primary key when adding a new row.
This gives me a small problem: as this is going to be a multi-user system, what happens when two locations request the same max value? That will of course create a problem when both try to add the same primary key.
I realise that I should be using an EXCLUSIVE lock on the table to prevent reading or writing while getting the key and adding a new row. However, I can't seem to find any way to deal with table locking in JDBC, just standard transactions.
Pseudocode:
primaryKey = "SELECT MAX(id) FROM table1;";
primaryKey++;
// id retrieved again from 2nd source
"INSERT INTO table1 VALUES (primaryKey, value1, value2);"
You're absolutely right, if two locations request at around the same time, you'll run into a race condition.
The way to handle this is to create a sequence in postgres and select the nextval as the primary key.
I don't know exactly what direction you're heading or how you handle your data, but you could also declare the column as serial and not include it in your insert query at all; the column will then auto-increment automatically.
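A small JDBC sketch of that idea (assuming id is a serial column on table1, so PostgreSQL maintains the backing sequence for you, and value1/value2 are the remaining columns):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SerialInsert {
    public static long insertRow(Connection conn, String value1, String value2) throws SQLException {
        // The id column is omitted entirely; the database assigns it.
        String sql = "INSERT INTO table1 (value1, value2) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS)) {
            ps.setString(1, value1);
            ps.setString(2, value2);
            ps.executeUpdate();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                keys.next();
                return keys.getLong("id");   // the id the database actually assigned
            }
        }
    }
}

Because the database hands out the id, two concurrent inserts can never pick the same value, so no table lock is needed.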
I have a web service in Java that receives a list of information to be inserted or updated in a database. I don't know in advance which entries need an insert and which need an update.
Which approach gives the best performance:
Iterate over the list (an object list carrying the table PK), try to insert each entry into the database and, if the insert fails, run an update.
Try to load the entry from the database; if it is found, update it, otherwise insert it.
Another option? Tell me about it :)
In the first calls I believe most of the entries will be new database entries, but there will be a saturation point after which most of them will be updates.
I'm talking about a DB table that could reach over 100 million entries in its mature form.
What will be your approach? Performance is my most important goal.
If your database supports MERGE, I would have thought that was most efficient (and treats all the data as a single set).
See:
http://www.oracle.com/technology/products/oracle9i/daily/Aug24.html
https://web.archive.org/web/1/http://blogs.techrepublic%2ecom%2ecom/datacenter/?p=194
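As a rough illustration of the MERGE idea (Oracle-flavoured syntax via JDBC; the table and column names are made up), a single statement decides per row whether to update or insert:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class MergeUpsert {
    public static void upsert(Connection conn, long id, String value) throws SQLException {
        String sql =
            "MERGE INTO my_table t " +
            "USING (SELECT ? AS id, ? AS val FROM dual) s ON (t.id = s.id) " +
            "WHEN MATCHED THEN UPDATE SET t.val = s.val " +
            "WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, id);
            ps.setString(2, value);
            ps.executeUpdate();
        }
    }
}

In practice you would usually MERGE from a staging table rather than row by row, so the whole incoming list is treated as one set.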
If performance is your goal, then first get rid of the word "iterate" from your vocabulary! Learn to do things in sets.
If you need to update or insert, always do the update first; otherwise it is easy to find yourself accidentally updating the record you just inserted. When doing this it helps to have an identifier you can check to see whether the record exists: if the identifier exists, do the update, otherwise do the insert.
The important thing is to understand the balance, or ratio, between the number of inserts and the number of updates in the list you receive. IMHO you should implement an abstract strategy that says "persist this to the database". Then create concrete strategies that, for example:
check for the primary key and, if zero records are found, do the insert, else do the update;
do the update and, if it fails, do the insert;
others.
Then pull the strategy to use (its fully qualified class name, for example) from a configuration file. This way you can switch from one strategy to another easily. If it is feasible, which may depend on your domain, you can add a heuristic that selects the best strategy based on the input entities in the set.
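As a very rough sketch of that idea (all class and property names here are invented):

import java.util.Properties;

// Hypothetical strategy contract; MyEntity stands in for whatever the web service receives.
interface PersistStrategy {
    void persist(MyEntity entity);
}

class MyEntity { /* fields omitted */ }

// Concrete strategies would implement PersistStrategy, e.g. a "select then persist"
// strategy and an "update first, insert on zero rows affected" strategy, each
// delegating to your DAO or plain JDBC code.

class StrategyLoader {
    static PersistStrategy load(Properties config) throws Exception {
        // The fully qualified class name comes from configuration, so the strategy
        // can be swapped without touching the calling code.
        String className = config.getProperty("persist.strategy");
        return (PersistStrategy) Class.forName(className)
                .getDeclaredConstructor().newInstance();
    }
}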
MySQL supports this:
INSERT INTO foo
SET bar='baz', howmanybars=1
ON DUPLICATE KEY UPDATE howmanybars=howmanybars+1
Option 2 is not going to be the most efficient. The database will already make this check for you when you do the actual insert or update, in order to enforce the primary key. By making this check yourself you incur the overhead of the table lookup twice, as well as an extra round trip from your Java code. Choose whichever case is the most likely and code optimistically.
Expanding on option 1, you can use a stored procedure to handle the insert/update. This example with PostgreSQL syntax assumes the insert is the normal case.
CREATE FUNCTION insert_or_update(_id INTEGER, _col1 INTEGER) RETURNS void
AS $$
BEGIN
    INSERT INTO my_table (id, col1)
    SELECT _id, _col1;
EXCEPTION WHEN unique_violation THEN
    UPDATE my_table
    SET col1 = _col1
    WHERE id = _id;
END;
$$
LANGUAGE plpgsql;
You could also make the update the normal case and then check the number of rows affected by the update statement to determine if the row is actually new and you need to do an insert.
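For example, a small JDBC sketch of that update-first variant (table and column names follow the function above):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class UpdateFirstUpsert {
    public static void upsert(Connection conn, int id, int col1) throws SQLException {
        try (PreparedStatement upd = conn.prepareStatement(
                "UPDATE my_table SET col1 = ? WHERE id = ?")) {
            upd.setInt(1, col1);
            upd.setInt(2, id);
            if (upd.executeUpdate() == 0) {
                // No existing row was touched, so this id is new: insert it.
                try (PreparedStatement ins = conn.prepareStatement(
                        "INSERT INTO my_table (id, col1) VALUES (?, ?)")) {
                    ins.setInt(1, id);
                    ins.setInt(2, col1);
                    ins.executeUpdate();
                }
            }
        }
    }
}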
As alluded to in some other answers, the most efficient way to handle this operation is in one batch:
Take all of the rows passed to the web service and bulk insert them into a temporary table
Update rows in the master table from the temp table
Insert new rows in the master table from the temp table
Dispose of the temp table
The type of temporary table to use and most efficient way to manage it will depend on the database you are using.
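As a very rough outline of that flow in JDBC (the DDL and the update/insert statements use PostgreSQL-flavoured syntax with invented names; adapt them to your database):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

public class BatchUpsert {
    public static void upsertAll(Connection conn, List<int[]> rows) throws SQLException {
        try (Statement ddl = conn.createStatement()) {
            ddl.execute("CREATE TEMPORARY TABLE staging (id INTEGER, col1 INTEGER)");
        }
        // 1. Bulk insert everything the web service received into the temp table.
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO staging (id, col1) VALUES (?, ?)")) {
            for (int[] row : rows) {
                ps.setInt(1, row[0]);
                ps.setInt(2, row[1]);
                ps.addBatch();
            }
            ps.executeBatch();
        }
        try (Statement stmt = conn.createStatement()) {
            // 2. Update master rows that already exist.
            stmt.executeUpdate(
                "UPDATE my_table m SET col1 = s.col1 FROM staging s WHERE m.id = s.id");
            // 3. Insert the rows that are new.
            stmt.executeUpdate(
                "INSERT INTO my_table (id, col1) "
              + "SELECT s.id, s.col1 FROM staging s "
              + "WHERE NOT EXISTS (SELECT 1 FROM my_table m WHERE m.id = s.id)");
            // 4. Dispose of the temp table.
            stmt.execute("DROP TABLE staging");
        }
    }
}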