sql query performance: finding max - java

I have a table with group and permission column. I want to find the max permission from a list of group. I am using java and oracle database, I thought of two ways to do this:
Way 1:
in java loop through the group list
result = select permission from table where group = currentgroup
if result > max, max = result
Way 2:
max = select max(permission) from table where group in (group list)
I thought way 2 would be faster, but then group list can be very long and I dont know if it is a good idea to have long list in a single sql query.

From the information you've given, the second approach is by far the best. Databases are optimised directly for these kinds of tasks, so within reason, its always best to narrow the data down with the database. The first approach means the database needs to return all values anyway, increasing processing time, bandwidth and using up memory within your java application.

Related

How to select items in date range in DynamoDB

How can I select all items within a given date range?
SELECT * FROM GameScores where createdAt >= start_date && createAt <=end_date
I want to make a query like this. Do I need to crate a global secondary index or not?
I've tried this
public void getItemsByDate(Date start, Date end) {
SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
String stringStart = df.format(start);
String stringEnd = df.format(end);
ScanSpec scanSpec = new ScanSpec();
scanSpec.withFilterExpression("CreatedAt BETWEEN :from AND :to")
.withValueMap(
new ValueMap()
.withString(":from", stringStart)
.withString(":to", stringEnd));
ItemCollection<ScanOutcome> items = null;
items = gamesScoresTable.scan(scanSpec);
}
But it doesn't work, I'm getting less results than expected.
I can answer your questions, but to suggest any real solution, I would need to see the general shape of your data, as well as what your GameScore's primary key is.
TLDR;
Setup your table so that you can retrieve data with queries, rather than scans and filters, and then create indexes to support lesser used access patterns and improve querying flexibility. Because of how fast reads are when providing the full (or, although not as fast, partial) primary key, i.e. using queries, DynamoDB is optimal when table structure is driven by the application's access patterns.
When designing your tables, keep in mind NoSQL design best practices, as well as best practices for querying and scanning and it will pay dividends in the long run.
Explanations
Question 1
How can I select all items within a given date range?
To answer this, I'd like to break that question down a little more. Let's start with: How can I select all items?
This, you have already accomplished. A scan is a great way to retrieve all items in your table, and unless you have all your items within one partition, it is the only way to retrieve all the items in your table. Scans can be helpful when you have to access data by unknown keys.
Scans, however, have limitations, and as your table grows in size they'll cost you in both performance and dollars. A single scan can only retrieve a maximum of 1MB of data, of a single partition, and is capped at that partition's read capacity. When a scan tops out at either limitation, consecutive scans will happen sequentially. Meaning a scan on a large table could take multiple round trips.
On top of that, with scans you consume read capacity based on the size of the item, no matter how much (or little) data is returned. If you only request a small amount of attributes in your ProjectionExpression, and your FilterExpression eliminates 90% of the items in your table, you still paid to read the entire table.
You can optimize performance of scans using Parallel Scans, but if you require an entire table scan for an access pattern that happens frequently for your application, you should consider restructuring your table. More about scans.
Let's now look at: How can I select all items, based on some criteria?
The ideal way to accomplish retrieving data based on some criteria (in your case SELECT * FROM GameScores where createdAt >= start_date && createAt <=end_date) would be to query the base table (or index). To do so, per the documentation:
You must provide the name of the partition key attribute and a single value for that attribute. Query returns all items with that partition key value.
Like the documentation says, querying a partition will return all of its values. If your GameScores table has a partition key of GameName, then a query for GameName = PacMan will return all Items with that partition key. Other GameName partitions, however, will not be captured in this query.
If you need more depth in your query:
Optionally, you can provide a sort key attribute and use a comparison operator to refine the search results.
Here's a list of all the possible comparison operators you can use with your sort key. This is where you can leverage a between comparison operator in the KeyConditionExpression of your query operation. Something like: GameName = PacMan AND createdAt BETWEEN time1 AND time2 will work, if createdAt is the sort key of the table or index that you are querying.
If it is not the sort key, you might have the answer to your second question.
Question 2
Do I need to create a Global Secondary Index?
Let's start with: Do I need to create an index?
If your base table data structure does not fit some amount of access patterns for your application, you might need to. However, in DynamoDB, the denormalization of data also support more access patterns. I would recommend watching this video on how to structure your data.
Moving onto: Do I need to create a GSI?
GSIs do not support strong read consistency, so if you need that, you'll need to go with a Local Secondary Index (LSI). However, if you've already created your base table, you won't be able to create an LSI. Another difference between the two is the primary key: a GSI can have a different partition and sort key as the base table, while an LSI will only be able to differ in sort key. More about indexes.

Database performance for filter by integer or boolean?

I will be having a database table with a few million entries, eg products of an online shop.
If one is out of stock, I want to mark it somehow, and I want to exclude it from any findAll() sql fetches.
Therefore I though one of the following options:
each product already has an integer count of availability. I anyhow have to set that = 0. select * from products where availcount > 0
or I could introduce a boolean available = 'true' field that I set to false if out of stock, and the query would then be ...where available = 'true'
Question: will this make any difference? Are there reasons one of these options should be preferred?
I would stick with the stock levels (int availcount). Bit fields are typically very difficult to index, unless there is a massive skew in the data such that there are of the order of 1% or less products out of stock (and since you will likely be searching for in-stock products only, any index on the flag will be unused).
Since it seems you already store the actual stock level in any event, not storing available in stock indicator will save you headaches on trying to keep the two columns in synch.
Finally, many RDBMS's allow you to add COMPUTED columns (or failing which, add the available indicator to a VIEW), which will allow you the logical derivation of available indicator from the actual availcount, without any storage overhead.
Edit
As per the comments below, note that an index on availcount (for queries WHERE availcount = 0 and availcount > 0) will be equally un-SARGable as an index on a bit field, although an index may not be needed if the products are generally searched by other criteria.
In addition to deriving is available in stock ? in the database, this determination can also be taken in code, e.g. an additional bool isAvailable() { return availcount > 0 ;} method on your entity class.
If you already have the availcount column anyway, there is no reason to add a new one, your availcount > 0 will do.
If you do not need the count for other reasons, and are just trying to decide between having a count or a boolean, consider how hard it is going to be to update that column rather than filtering.
If you only have a boolean, you'll only need to touch when the product goes out of stock (or comes back in). Having a count is more complex: you'd need to update it every time a sale is made or the item is restocked. This is more complicated, has possible performance implications, and a bunch or corner cases to care about. So, unless you need the count for other purposes, it's, probably, a better idea to stick with the boolean.
I think the two options would be equally efficient on SELECT as long as there's an index in the column in question.
Indexing availcount will have a small penalty on any update of this column (and I guess this column will be updated often). On the other hand, having an available column will add redundancy to your database (i.e. it will not be normalized) which you may want to avoid.

Better to query once, then organize objects based on returned column value, or query twice with different conditions?

I have a table which I need to query, then organize the returned objects into two different lists based on a column value. I can either query the table once, retrieving the column by which I would differentiate the objects and arrange them by looping through the result set, or I can query twice with two different conditions and avoid the sorting process. Which method is generally better practice?
MY_TABLE
NAME AGE TYPE
John 25 A
Sarah 30 B
Rick 22 A
Susan 43 B
Either SELECT * FROM MY_TABLE, then sort in code based on returned types, or
SELECT NAME, AGE FROM MY_TABLE WHERE TYPE = 'A' followed by
SELECT NAME, AGE FROM MY_TABLE WHERE TYPE = 'B'
Logically, a DB query from a Java code will be more expensive than a loop within the code because querying the DB involves several steps such as connecting to DB, creating the SQL query, firing the query and getting the results back.
Besides, something can go wrong between firing the first and second query.
With an optimized single query and looping with the code, you can save a lot of time than firing two queries.
In your case, you can sort in the query itself if it helps:
SELECT * FROM MY_TABLE ORDER BY TYPE
In future if there are more types added to your table, you need not fire an additional query to retrieve it.
It is heavily dependant on the context. If each list is really huge, I would let the database to the hard part of the job with 2 queries. At the opposite, in a web application using a farm of application servers and a central database I would use one single query.
For the general use case, IMHO, I will save database resource because it is a current point of congestion and use only only query.
The only objective argument I can find is that the splitting of the list occurs in memory with a hyper simple algorithm and in a single JVM, where each query requires a bit of initialization and may involve disk access or loading of index pages.
In general, one query performs better.
Also, with issuing two queries you can potentially get inconsistent results (which may be fixed with higher transaction isolation level though ).
In any case I believe you still need to iterate through resultset (either directly or by using framework's methods that return collections).
From the database point of view, you optimally have exactly one statement that fetches exactly everything you need and nothing else. Therefore, your first option is better. But don't generalize that answer in way that makes you query more data than needed. It's a common mistake for beginners to select all rows from a table (no where clause) and do the filtering in code instead of letting the database do its job.
It also depends on your dataset volume, for instance if you have a large data set, doing a select * without any condition might take some time, but if you have an index on your 'TYPE' column, then adding a where clause will reduce the time taken to execute the query. If you are dealing with a small data set, then doing a select * followed with your logic in the java code is a better approach
There are four main bottlenecks involved in querying a database.
The query itself - how long the query takes to execute on the server depends on indexes, table sizes etc.
The data volume of the results - there could be hundreds of columns or huge fields and all this data must be serialised and transported across the network to your client.
The processing of the data - java must walk the query results gathering the data it wants.
Maintaining the query - it takes manpower to maintain queries, simple ones cost little but complex ones can be a nightmare.
By careful consideration it should be possible to work out a balance between all four of these factors - it is unlikely that you will get the right answer without doing so.
You can query by two conditions:
SELECT * FROM MY_TABLE WHERE TYPE = 'A' OR TYPE = 'B'
This will do both for you at once, and if you want them sorted, you could do the same, but just add an order by keyword:
SELECT * FROM MY_TABLE WHERE TYPE = 'A' OR TYPE = 'B' ORDER BY TYPE ASC
This will sort the results by type, in ascending order.
EDIT:
I didn't notice that originally you wanted two different lists. In that case, you could just do this query, and then find the index where the type changes from 'A' to 'B' and copy the data into two arrays.

Efficient solution for grouping same values in a large dataset

At my job I was to develop and implement a solution for the following problem:
Given a dataset of 30M records extract (key, value) tuples from the particular dataset field, group them by key and value storing the number of same values for each key. Write top 5000 most frequent values for each key to a database. Each dataset row contains up to 100 (key, value) tuples in a form of serialized XML.
I came up with the solution like this (using Spring-Batch):
Batch job steps:
Step 1. Iterate over the dataset rows and extract (key, value) tuples. Upon getting some fixed number of tuples dump them on disk. Each tuple goes to a file with the name pattern '/chunk-', thus all values for a specified key are stored in one directory. Within one file values are stored sorted.
Step 2. Iterate over all '' directories and merge their chunk files into one grouping same values. Since the values are stored sorted, it is trivial to merge them for O(n * log k) complexity, where 'n' is the number of values in a chunk file and 'k' is the initial number of chunks.
Step 3. For each merged file (in other words for each key) sequentially read its values using PriorityQueue to maintain top 5000 values without loading all the values into memory. Write queue content to the database.
I spent about a week on this task, mainly because I have not worked with Spring-Batch previously and because I tried to make emphasis on scalability that requires accurate implementation of the multi-threading part.
The problem is that my manager consider this task way too easy to spend that much time on it.
And the question is - do you know more efficient solution or may be less efficient that would be easier to implement? And how much time would you need to implement my solution?
I am aware about MapReduce-like frameworks, but I can't use them because the application is supposed to be run on a simple PC with 3 cores and 1GB for Java heap.
Thank you in advance!
UPD: I think I did not stated my question clearly. Let me ask in other way:
Given the problem and being the project manager or at least the task reviewer would you accept my solution? And how much time would you dedicate to this task?
Are you sure this approach is faster than doing a pre-scan of the XML-file to extract all keys, and then parse the XML-file over and over for each key? You are doing a lot of file management tasks in this solution, which is definitely not for free.
As you have three Cores, you could parse three keys at the same time (as long as the file system can handle the load).
You solution seems reasonable and efficient, however I'd probably use SQL.
While parsing the Key/Value pairs I'd insert/update into a SQL table.
I'd then query the table for the top records.
Here's an example using only T-SQL (SQL 2008, but the concept should be workable in most any mordern rdbms)
The SQL between / START / and / END / would be the statements you need to execute in your code.
BEGIN
-- database table
DECLARE #tbl TABLE (
k INT -- key
, v INT -- value
, c INT -- count
, UNIQUE CLUSTERED (k, v)
)
-- insertion loop (for testing)
DECLARE #x INT
SET #x = 0
SET NOCOUNT OFF
WHILE (#x < 1000000)
BEGIN
--
SET #x = #x + 1
DECLARE #k INT
DECLARE #v INT
SET #k = CAST(RAND() * 10 as INT)
SET #v = CAST(RAND() * 100 as INT)
-- the INSERT / UPDATE code
/* START this is the sql you'd run for each row */
UPDATE #tbl SET c = c + 1 WHERE k = #k AND v = #v
IF ##ROWCOUNT = 0
INSERT INTO #tbl VALUES (#k, #v, 1)
/* END */
--
END
SET NOCOUNT ON
-- final select
DECLARE #topN INT
SET #topN = 50
/* START this is the sql you'd run once at the end */
SELECT
a.k
, a.v
FROM (
SELECT
ROW_NUMBER() OVER (PARTITION BY k ORDER BY k ASC, c DESC) [rid]
, k
, v
FROM #tbl
) a
WHERE a.rid < #topN
/* END */
END
Gee, it doesn't seem like much work to try the old fashioned way of just doing it in-memory.
I would try just doing it first, then if you run out of memory, try one key per run (as per #Storstamp's answer).
If using the "simple" solution is not an option due to the size of the data, my next choice would be to use an SQL database. However, as most of these require quite much memory (and coming down to a crawl when heavily overloaded in RAM), maybe you should redirect your search into something like a NoSQL database such as MongoDB that can be quite efficient even when mostly disk-based. (Which your environment basically requires, having only 1GB of heap available).
The NoSQL database will do all the basic bookkeeping for you (storing the data, keeping track of all indexes, sorting it), and may probably do it a bit more efficient than your solution, due to the fact that all data may be sorted and indexed already when inserted, removing the extra steps of sorting the lines in the /chunk- files, merging them etc.
You will end up with a solution that is probably much easier to administrate, and it will also allow you to set up different kind of queries, instead of being optimized only for this specific case.
As a project manager I would not oppose your current solution. It is already fast and solves the problem. As an architect however, I would object due to the solution being a bit hard to maintain, and for not using proven technologies that basically does partially the same thing as you have coded on your own. It is hard to beat the tree and hash implementations of modern databases.

Implementing result paging in hibernate (getting total number of rows)

How do I implement paging in Hibernate? The Query objects has methods called setMaxResults and setFirstResult which are certainly helpful. But where can I get the total number of results, so that I can show link to last page of results, and print things such as results 200 to 250 of xxx?
You can use Query.setMaxResults(int results) and Query.setFirstResult(int offset).
Editing too: There's no way to know how many results you'll get. So, first you must query with "select count(*)...". A little ugly, IMHO.
You must do a separate query to get the max results...and in the case where between time A of the first time the client issues a paging request to time B when another request is issued, if new records are added or some records now fit the criteria then you have to query the max again to reflect such. I usually do this in HQL like this
Integer count = (Integer) session.createQuery("select count(*) from ....").uniqueResult();
for Criteria queries I usually push my data into a DTO like this
ScrollableResults scrollable = criteria.scroll(ScrollMode.SCROLL_INSENSITIVE);
if(scrollable.last()){//returns true if there is a resultset
genericDTO.setTotalCount(scrollable.getRowNumber() + 1);
criteria.setFirstResult(command.getStart())
.setMaxResults(command.getLimit());
genericDTO.setLineItems(Collections.unmodifiableList(criteria.list()));
}
scrollable.close();
return genericDTO;
you could perform two queries - a count(*) type query, which should be cheap if you are not joining too many tables together, and a second query that has the limits set. Then you know how many items exists but only grab the ones being viewed.
You can do one thing. just prepare Criteria query as per your busness requirement with all Predicates , sorting , searching etc.
and then do as below :-
CriteriaBuilder criteriaBuilder = em.getCriteriaBuilder();
CriteriaQuery<Feedback> criteriaQuery = criteriaBuilder.createQuery(Feedback.class);
//Just Prepare your all Predicates as per your business need.
//eg :-
yourPredicateAsPerYourBusnessNeed = criteriaBuilder.equal(Root.get("applicationName"), applicationName);
criteriaQuery.where(yourPredicateAsPerYourBusnessNeed).distinct(true);
TypedQuery<Feedback> criteriaQueryWithPredicate = em.createQuery(criteriaQuery);
//Getting total Count Here
Long totalCount = criteriaQueryWithPredicate.getResultStream().distinct().count();
Now we have our actual data with us as above with total count , right.
So now we can apply pagination on the data we have in our hand above , as below :-
List<Feedback> feedbackList = criteriaQueryWithPredicate.setFirstResult(offset).setMaxResults(pageSize).getResultList();
Now You can prepare a wrapper with your List return by DB along with the totalCount , startingPageNo that is offset here in this case, page Size etc and can return to your service / controller class.
I am 101 % sure , this will solve your problem, Because I was facing same problem and sorted it out same way.
Thanks- Sunil Kumar Mali
You can just setMaxResults to the maximum number of rows you want returned. There is no harm in setting this value greater than the number of actual rows available. The problem the other solutions is they assume the ordering of records remains the same each repeat of the query, and there are no changes going on between commands.
To avoid that if you really want to scroll through results, it is best to use the ScrollableResults. Don't throw this object away between paging, but use it to keep the records in the same order. To find out the number of records from the ScrollableResults, you can simply move to the last() position, and then get the row number. Remember to add 1 to this value, since row numbers start counting at 0.
I personally think you should handle the paging in the front-end. I know this isn't that efficiƫnt but at least it would be less error prone.
If you would use the count(*) thing what would happen if records get deleted from the table in between requests for a certain page? Lots of things could go wrong this way.

Categories