I am trying to retrieve a set of records from a table. The query I am using is:
select * from EmployeeUpdates eu where eu.updateid>0 and eu.department = 'EEE'
The table EmployeeUpdates has around 20 million records. 'updateid' is the primary key, and there are currently no records in the table with department 'EEE'. The query takes so long that the web-service call times out.
Currently we have an index only on the column 'updateid'. 'department' is a newly added column for which we are expecting 'EEE' records.
What changes can I make to retrieve the results faster?
First off, double-check that your SQL is valid; it's easy to miss the 'and' between the two conditions.
I'm guessing that all the update IDs are positive, and as it's the primary key they're unique, so I suspect eu.updateid > 0 matches every row. This means it's not technically a table scan but an index-based scan, although if that scan still matches all 20 million rows against the index, you might as well have a full table scan. The only thing you can really do is add an index to the department field.
Depending on what this data is, you could also keep the departments in a separate table with a numeric primary key, and store that as a foreign key on the eu table. This would mean you scanned through all the departments and then got the updates associated with them, rather than searching every single update for a specific department.
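A minimal sketch of the department index mentioned above (exact syntax varies slightly by RDBMS; the index name here is made up):
CREATE INDEX ix_employeeupdates_department
    ON EmployeeUpdates (department);
With that index in place, the lookup for department = 'EEE' becomes an index seek, which returns immediately when no matching rows exist.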
I think you should look into using a table-per-subclass mapping (more here: http://docs.jboss.org/hibernate/orm/3.3/reference/en-US/html/inheritance.html#inheritance-tablepersubclass-discriminator). You could make the department the discriminator, and then you'd have EEEEmployeeUpdates and ECEmployeeUpdates classes. Your query could then just query EEEEmployeeUpdates.
I'm using Amazon DynamoDB to download a number of records to Android.
I have two tables:
Table 1 contains a Set of Strings containing IDs.
Table 2 has records, each with an individual ID.
I want to download 10 records from Table 2 only if the record ID does not appear in the Set of strings in Table 1.
I can do this by downloading all the records in Table 2 and then not saving/displaying the ones that appear in the String Set in Table 1. However, is there a way to only download the ones that don't appear in the String Set?
Any ideas would be appreciated.
Many Thanks
In order to query a DynamoDB table, the attribute needs to be either a range key or a partition key, which in turn must be scalar, so you cannot directly query what you want. If I understand your requirement, your best option is the scan operation: scan the whole of Table 2 and use a filter expression to drop the results whose IDs appear in the set you read from Table 1 (note that the filter is applied after the read, so the scan still touches every record). This also requires you to make the 'Set of Strings' either the partition or the range key in your first table.
We have a table that will contain a huge amount of time-series data; we probably have to store several entries per millisecond. To fulfill these requirements, the table looks like this:
CREATE TABLE statistic (
    name text,
    id uuid,
    start timestamp,
    other_data ...,
    PRIMARY KEY (name, start, id)
) WITH CLUSTERING ORDER BY (start DESC);
As you can see, the table has two clustering keys: start stores the time the data arrived, and id exists to prevent data from being overwritten when several entries arrive at the same time.
Now this is OK; we can make range queries like
SELECT * FROM statistic WHERE name ='foo' AND start >= 1453730078182
AND start <= 1453730078251;
But we also need the capability to have additional search parameters in the query like
SELECT * FROM statistic WHERE name = 'foo'
AND start >= 1453730078182 AND start <= 1453730078251 AND other_data = 'bar';
This does not work of course because other_data is not part of the primary key. If we add it to the primary key, we get the following error
InvalidRequest: code=2200 [Invalid query] message="PRIMARY KEY column "other_data" cannot be restricted (preceding column "start" is restricted by a non-EQ relation)"
That is also OK; that is just not the way Cassandra works (I think).
Our approach to solving the problem is to select the needed (time-series) data with the first range query above and then filter it in our Java application. That means we go through the list and drop all the data we don't need. A single entry doesn't hold much data, but in the worst case we could be talking about some millions of rows.
Now I have two questions:
Is that the right approach to solve the problem?
Is Cassandra capable of handling that amount of data?
This does not work of course because other_data is not part of the primary key. If we add it to the primary key, we get the following error
This is a sweet spot for a secondary index on the column other_data. In your case this index will scale, because you always provide the partition key (name), so Cassandra will not have to hit all the nodes in the cluster.
With a secondary index on other_data, your second SELECT statement will be possible.
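A minimal sketch of the index creation (standard CQL; Cassandra generates an index name if you omit one):
CREATE INDEX ON statistic (other_data);
The index stays local to each node, and because your query always names the partition key, only the replicas for that name value are consulted.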
Now there is another issue with your data model: partition size. If you are inserting several entries per millisecond per name, this will not scale, because the partition for each name will grow very fast ...
If the inserts are distributed across different partition keys (different name values), then it's fine.
Just a quick question about locking tables in a Postgres database using JDBC. I have a table that I want to add a new record to; for the primary key, I use an increasing integer value.
I want to be able to retrieve the max value of this column in Java and store it as a variable to be used as a new primary key when adding a new row.
This gives me a small problem: as this is going to be modelled as a multi-user system, what happens when two locations request the same max value? That would of course create a problem when both try to add the same primary key.
I realise that I should be using an EXCLUSIVE lock on the table to prevent reading or writing while getting the key and adding a new row. However, I can't seem to find any way to deal with table locking in JDBC, just standard transactions.
Pseudocode, something like:
primaryKey = query("SELECT MAX(id) FROM table1");
primaryKey++;
// a 2nd source may retrieve the same max id in the meantime
execute("INSERT INTO table1 VALUES (primaryKey, value1, value2)");
You're absolutely right: if two locations make the request at around the same time, you'll run into a race condition.
The way to handle this is to create a sequence in Postgres and select its nextval as the primary key.
I don't know exactly what direction you're heading in or how you handle your data, but you could also declare the column as serial and leave it out of your insert query entirely; the column will then auto-increment on its own.
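A sketch of both options in PostgreSQL syntax (the sequence, table and column names are illustrative):
-- Option 1: an explicit sequence, fetched at insert time
CREATE SEQUENCE table1_id_seq;

INSERT INTO table1 (id, value1, value2)
VALUES (nextval('table1_id_seq'), 'a', 'b');

-- Option 2: a serial column; the id is generated automatically
CREATE TABLE table1 (
    id SERIAL PRIMARY KEY,
    value1 text,
    value2 text
);

INSERT INTO table1 (value1, value2) VALUES ('a', 'b');
Either way the database hands out each id exactly once, so two concurrent clients can never obtain the same value, and no table lock is needed.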
There's a DB that contains approximately 300-400 records. I can make a simple query for fetching 30 records like:
SELECT * FROM table
WHERE isValidated = false
LIMIT 30
Some more words about the content of the DB table. There's a column named isValidated that can (as you correctly guessed) take one of two values: true or false. After a query, some of the records (roughly 5-6 out of each bunch of 30) are validated (isValidated=true). Consequently, each subsequent query will fetch the still-unvalidated records (isValidated=false) from the previous batch again. In fact, I'll never get to the end of the table with such an approach.
The validation process is done with Java + Hibernate. I'm new to Hibernate, so I use a Criteria query for this.
Are there any best practices for such a task? The variant of adding a flag field (marking records that have already been fetched) is inappropriate (over-engineering for this DB).
Maybe there's a way to create some virtual table where records that have already been processed are stored, or something like this. BTW, after all the records are processed, the plan is to start processing them again (it's possible that some of them will need to be validated again).
Thank you for your help in advance.
I can imagine several solutions:
store everything in memory. You only have 400 records, and it could be a perfectly fine solution given this small number
use an order by clause (which you should do anyway) on a unique column (the PK, for example), store the ID of the last loaded record, and make sure the next query uses where ID > :lastId
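Applied to the query from the question, the second option looks roughly like this (a sketch; id stands for whatever your unique/PK column is):
SELECT * FROM table
WHERE isValidated = false AND id > :lastId
ORDER BY id
LIMIT 30
Start with lastId = 0 (or below your smallest id), remember the largest id of each batch, and once a query comes back empty, reset lastId to start over from the beginning.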
I have a web service in Java that receives a list of information to be inserted or updated in a database. I don't know which entries are inserts and which are updates.
Which approach gives the best performance:
Iterate over the list (an object list, with the table PK in it) and try to insert each entry into the database; if the insert fails, run an update.
Try to load the entry from the database; if it is found, update it, otherwise insert it.
Another option? Tell me about it :)
In the first calls, I believe most of the entries will be new DB entries, but there will be a saturation point after which most of the entries will be updates.
I'm talking about a DB table that could reach over 100 million entries in its mature form.
What will be your approach? Performance is my most important goal.
If your database supports MERGE, I would have thought that was most efficient (and treats all the data as a single set).
See:
http://www.oracle.com/technology/products/oracle9i/daily/Aug24.html
https://web.archive.org/web/1/http://blogs.techrepublic%2ecom%2ecom/datacenter/?p=194
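A generic sketch of the statement (ANSI/Oracle-style MERGE; the table and column names are illustrative):
MERGE INTO target t
USING incoming i
ON (t.id = i.id)
WHEN MATCHED THEN
    UPDATE SET t.col1 = i.col1
WHEN NOT MATCHED THEN
    INSERT (id, col1)
    VALUES (i.id, i.col1);
The database decides per row whether to update or insert, in a single statement and a single pass over the data.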
If performance is your goal, then first get rid of the word 'iterate' from your vocabulary! Learn to do things in sets.
If you need to update or insert, always do the update first; otherwise it is easy to end up accidentally updating a record you just inserted. It helps to have an identifier you can check to see whether the record exists: if it does, do the update, otherwise do the insert.
The important thing is to understand the balance, or ratio, between the number of inserts and the number of updates in the list you receive. IMHO you should implement an abstract strategy that says "persist this to the database", then create concrete strategies that (for example):
check for the primary key: if zero records are found, do the insert, otherwise update
do the update and, if it fails, do the insert
others
Then pull the strategy to use (the fully qualified class name, for example) from a configuration file, so you can switch from one strategy to another easily. If it is feasible (that depends on your domain), you could add a heuristic that selects the best strategy based on the input entities in the set.
MySQL supports this:
INSERT INTO foo
SET bar='baz', howmanybars=1
ON DUPLICATE KEY UPDATE howmanybars=howmanybars+1
Option 2 is not going to be the most efficient. The database already makes this check for you when you do the actual insert or update, in order to enforce the primary key; by making the check yourself you incur the overhead of the table lookup twice, plus an extra round trip from your Java code. Choose whichever case is the most likely and code optimistically.
Expanding on option 1, you can use a stored procedure to handle the insert/update. This example with PostgreSQL syntax assumes the insert is the normal case.
CREATE FUNCTION insert_or_update(_id INTEGER, _col1 INTEGER) RETURNS void
AS $$
BEGIN
    -- Try the insert first; this is the cheap path when the row is new.
    INSERT INTO my_table (id, col1)
    VALUES (_id, _col1);
EXCEPTION WHEN unique_violation THEN
    -- The key already exists, so update the existing row instead.
    UPDATE my_table
    SET col1 = _col1
    WHERE id = _id;
END;
$$
LANGUAGE plpgsql;
You could also make the update the normal case, and then check the number of rows affected by the UPDATE statement to determine whether the row is actually new and an insert is needed.
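A sketch of that variant in the same PostgreSQL style (the function name is made up; note that, unlike the exception-based version above, two concurrent calls can still race between the UPDATE and the INSERT):
CREATE FUNCTION update_or_insert(_id INTEGER, _col1 INTEGER) RETURNS void
AS $$
BEGIN
    -- Try the update first; FOUND tells us whether any row matched.
    UPDATE my_table
    SET col1 = _col1
    WHERE id = _id;
    IF NOT FOUND THEN
        INSERT INTO my_table (id, col1)
        VALUES (_id, _col1);
    END IF;
END;
$$
LANGUAGE plpgsql;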
As alluded to in some other answers, the most efficient way to handle this operation is in one batch:
Take all of the rows passed to the web service and bulk insert them into a temporary table
Update existing rows in the master table from the temp table
Insert new rows in the master table from the temp table
Dispose of the temp table
The type of temporary table to use and most efficient way to manage it will depend on the database you are using.
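As a sketch of those steps in PostgreSQL syntax (the staging table, master table and column names are illustrative):
-- 1. Stage the incoming rows (bulk-load them with COPY or batched INSERTs)
CREATE TEMP TABLE staging (id INTEGER PRIMARY KEY, col1 INTEGER);

-- 2. Update existing rows in the master table from the staging table
UPDATE my_table m
SET col1 = s.col1
FROM staging s
WHERE m.id = s.id;

-- 3. Insert the rows that do not exist yet
INSERT INTO my_table (id, col1)
SELECT s.id, s.col1
FROM staging s
WHERE NOT EXISTS (SELECT 1 FROM my_table m WHERE m.id = s.id);

-- 4. Dispose of the temp table (it is also dropped automatically at session end)
DROP TABLE staging;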