GAE Search API: How to prevent the 2000 bytes query limit? - java

We have been using the GAE Search API for quite some time, but recently hit the query length limit of 2000 bytes:
java.lang.IllegalArgumentException: query string must not be longer
than 2000 bytes, was 2384
Basically, our documents are saved with a secondary id set as an atom field. Within our query we do some sorting and distance calculations, and we also exclude docs whose secondary ids match a list of ids using a NOT statement like the following:
... AND NOT sec_id:(x AND y AND ...)
With a certain number of excluded ids we inevitably hit the query length limit. I could split the query into several queries that share the same base and each use a different subset of excluded ids, but then sorting across the combined results becomes problematic.
So I am wondering if there is another way to implement this kind of query, preferably with both a blacklist and a whitelist within one query (AND NOT sec_id:(..) combined with AND sec_id:(..)).
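One workaround sketch for the splitting approach (hypothetical helper code, not a GAE API): run the same base query once per chunk of excluded ids (each chunk small enough to stay under the 2000-byte limit), then merge the individually sorted result lists on the client with a k-way merge, so the combined order is preserved without re-sorting everything:

```java
import java.util.*;

public class SortedMerge {
    // Merge several individually sorted result lists into one sorted list,
    // e.g. the results of the same base query run with different
    // excluded-id chunks. Runs in O(n log k) for k lists of n total items.
    static <T> List<T> merge(List<List<T>> sortedLists, Comparator<T> cmp) {
        // Priority queue of {listIndex, positionInList}, ordered by the
        // element each pair currently points at.
        PriorityQueue<int[]> pq = new PriorityQueue<>(
                (a, b) -> cmp.compare(sortedLists.get(a[0]).get(a[1]),
                                      sortedLists.get(b[0]).get(b[1])));
        for (int i = 0; i < sortedLists.size(); i++) {
            if (!sortedLists.get(i).isEmpty()) pq.add(new int[]{i, 0});
        }
        List<T> out = new ArrayList<>();
        while (!pq.isEmpty()) {
            int[] head = pq.poll();
            List<T> src = sortedLists.get(head[0]);
            out.add(src.get(head[1]));
            // Advance the cursor of the list we just consumed from.
            if (head[1] + 1 < src.size()) pq.add(new int[]{head[0], head[1] + 1});
        }
        return out;
    }
}
```

Caveat: a doc excluded in one chunk's NOT list can still match the other chunks' queries, so before (or while) merging, the results must also be filtered on the client against the full exclusion list.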

Related

scanindexforward() not working as expected while doing pagination in dynamodb

I am using DynamoDbEnhancedAsyncClient to query DynamoDB using a GSI and pagination. Below is the code I am using. I am trying to limit the number of items per page and the number of pages sent to the subscriber of the Mono. I need the records in each page sorted in descending order by timestamp, which is the sort key in my GSI; for this I am using scanIndexForward(false) below. However, I am not getting any records in the page, even though there are 4 records in total in DynamoDB.
SdkPublisher<Page<Customer>> query = secindex.query(QueryEnhancedRequest.builder()
        .queryConditional(queryconditional)
        .scanIndexForward(false)
        .limit(2)
        .build());
Mono.from(PagePublisher.create(query.limit(1)));
secindex is the DynamoDbAsyncIndex, which is the GSI. According to the above code, 1 page with 2 records should be returned to the client, yet none are returned. Also, if I remove scanIndexForward(false), the result is as expected but sorted in ascending order. How do I make it return a limited number of records in descending order? Does pagination work differently when scanIndexForward() is supplied?
Without knowing 100% what your filters are on your Dynamo call, I can only guess, but I've seen this sort of thing many times.
Correction: the limit is applied before the query result is returned, not after. This was stated incorrectly below. However, because additional filters are applied after the results come back, this can indeed result in 2 items being returned that are then filtered out, for an ultimate return of 0.
A DynamoDB Query does not perform any attribute filtering on the data before returning it. The only thing a standard query to Dynamo can do is check the Hash Key/Range Key, with some basic Range Key filtering (gt, lt, between, begins_with, etc.); all other filters on attributes that are not Hash/Range are applied by the SDK you're using after a set of responses comes back.
1 - Query Dynamo with a Hash Key/Range Key combination and any filtering on the Range Key.
2 - All items that match this are sent back, up to 1 MB of data. Anything more than that needs additional calls.
3 - The limit is applied to these results (correction: this is actually applied before the results are returned).
4 - The filter is applied to what has been limited.
5 - Whatever is left is returned.
This means that when you use filter conditions on a Dynamo query, you may not actually get back what you expect, because either the matching items are on the next page, or nothing on the current page matches the filter, so you get back 0.
Since you are also using Limit, the data is sorted in descending Sort Key order (as scan index forward is false), and if the first two values don't match your other filters, you get 0 items back.
I would recommend you try querying all the items without any filters beyond just Hash/Range key - no other attribute filters.
Then filter the response manually on your side.
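As an illustration of that recommendation, here is a minimal sketch in plain Java (the Customer shape and names are assumptions for the example, not taken from the question): the key-only query returns rows, and the attribute filter plus the page limit are applied on the client:

```java
import java.util.*;
import java.util.stream.*;

public class ClientSideFilter {
    // Hypothetical item shape; in practice this would be your mapped bean.
    record Customer(String id, long timestamp, String status) {}

    // After querying by partition/sort key only (no attribute filters),
    // apply the attribute filter and the page limit on the client side,
    // so the limit counts *matching* rows rather than scanned rows.
    static List<Customer> filterAndLimit(List<Customer> queried,
                                         String wantedStatus, int limit) {
        return queried.stream()
                .filter(c -> c.status().equals(wantedStatus))
                .limit(limit)
                .collect(Collectors.toList());
    }
}
```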
You should also be aware of the internal pagination of the SDKs for Dynamo: a single call will only return up to 1 MB of data from DynamoDB. Anything beyond that requires a second call that includes the LastEvaluatedKey returned with the first page of results. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.Pagination.html has more information.
If your system cannot afford to do the filtering itself after the query is called, then you need to re-evaluate your HashKey/SortKey combinations. Dynamo is best aligned in an Access Pattern schema - that is to say, I have X data and I will need Y data, so I will cause X to be a Hash Key, and the Y values to be different Range Keys under that X.
As an example, take user data. You might have a HashKey of "user_id".
Then you have several different patterns for Range Keys:
meta# (with attributes of email, username, hashed and salted passwords, ect)
post#1
post#2
post#3
avatar#
So if you query on just the Hash Key of the user id, you get all the info. Or, if you have a page with just their posts, you can query on the hash key of the user id and a range key of begins_with(post#).
This is the most important aspect of a good Dynamo schema: the ability to query anything you need with just a Hash Key, or a Hash Key and Range Key.
With a well understood set of access patterns that you will need, and a dynamo that is set up appropriately, then you should need no filters or limits, because your combination of Hash/Range key queries will be enough.
(This does sometimes mean duplication of data! You may have the same information in a post# item as in the meta# item, e.g. they both contain usernames. This is OK: when you query for a post you need the username, just as you do when you query for the password/username to see if they match. Dynamo, as a NoSQL store, handles this very well and very fast; a given Hash/Range key combination is basically considered its own table in terms of access, making queries against it VERY fast.)

How pagination works and how it helps in reducing response time when dealing with millions of records?

I am working on a trading application which is deployed on Weblogic and has a limitation that any request that takes more than one minute in processing is killed automatically. The restriction is set at the kernel level and is not application dependent, and is not in our control to change.
There is a search functionality in my application which fails when queried for more than 100 records during a given time frame, and I was assigned a task to see the possible solutions.
The first approach that I suggested was to use pagination instead of querying for all records at the same time. I was told that it won't help, as on the database side it would anyway fetch all records at the same time. This was new to me, as my understanding until now was that this is handled on the database side: the query fetches only the given number of records per page, and each previous/next request handles its own page, reducing the overall response time.
I searched a lot before posting this question on how pagination works and how it helps reduce the response time, but did not get a concrete answer. So it would be really great if somebody could help explain this. Thanks in advance!!!
The first approach that I suggested was to use pagination instead of querying for all records at the same time. I was told that it won't help as on the database side it would any ways fetch all records at the same time
This is true if you are using LIMIT and OFFSET clauses in your query for pagination. In this case, the database loads the matched records (matched by the WHERE clause) from disk and then applies the OFFSET and LIMIT clauses. Since databases use a B-tree for indexing, they cannot jump directly to the OFFSET-th record without loading the preceding matched records into memory.
To load only a page's worth of records, you need to use key-based pagination. In this approach we avoid the OFFSET clause; instead we use the key of the last seen record together with the LIMIT clause.
Example for key-based pagination:
Let's say you want to paginate the users
Request for first 10 records:
select * from user where userid > 0 order by userid asc limit 10
Let's say the last userid in the above query's result is 10.
Request for next 10 records:
select * from user where userid > 10 order by userid asc limit 10
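The keyset loop can be sketched in plain Java; the in-memory NavigableSet below simply stands in for the userid index so the logic of "where userid > :lastId limit :pageSize" is visible (names are illustrative, not from the question):

```java
import java.util.*;
import java.util.stream.*;

public class KeysetPagination {
    // Simulates "select userid from user where userid > :lastId
    // order by userid asc limit :pageSize" against an in-memory
    // sorted set standing in for the userid index.
    static List<Integer> fetchPage(NavigableSet<Integer> userIds,
                                   int lastId, int pageSize) {
        return userIds.tailSet(lastId, false) // strictly greater than lastId
                .stream()
                .limit(pageSize)              // page size, as in LIMIT
                .collect(Collectors.toList());
    }
}
```

The caller starts with lastId = 0 and, for each subsequent page, passes the last id of the previous page, which is exactly the pattern in the two SQL statements above.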

Fetching records one by one from PostgreSql DB

There's a DB table that contains approximately 300-400 records. I can make a simple query to fetch 30 records like:
SELECT * FROM table
WHERE isValidated = false
LIMIT 30
Some more words about the content of the DB table. There's a column named isValidated that can (as you correctly guessed) take one of two values: true or false. After a query, some of the records should be marked validated (isValidated=true), approximately 5-6 records from each batch of 30. Consequently, each following query will again fetch some of the still-unvalidated (isValidated=false) records from the previous query. In fact, I'll never get to the end of the table with such an approach.
The validation process is made with Java + Hibernate. I'm new to Hibernate, so I use Criterion for making this simple query.
Are there any best practices for such a task? The variant of adding a flag field (marking records that were already fetched) is inappropriate (over-engineering for this DB).
Maybe there's an opportunity to create some virtual table where records that have already been processed are stored, or something like this. BTW, after all the records are processed, it is planned to start processing them again (it is possible that some of them will need to be validated again).
Thank you for your help in advance.
I can imagine several solutions:
store everything in memory. You only have 400 records, and it could be a perfectly fine solution given this small number
use an order by clause (which you should do anyway) on a unique column (the PK, for example), store the ID of the last loaded record, and make sure the next query uses where ID > :lastId

Does using Limit in query using JDBC, have any effect in performance?

If we use the Limit clause in a query which also has ORDER BY clause and execute the query in JDBC, will there be any effect in performance? (using MySQL database)
Example:
SELECT modelName from Cars ORDER BY manuDate DESC Limit 1
I read in one of the threads on this forum that, by default, a certain number of rows is fetched at a time. How can I find the default fetch size?
I want only one record. Originally, I was using as follows:
SQL Query:
SELECT modelName from Cars ORDER BY manuDate DESC
In the JAVA code, I was extracting as follows:
if (resultSet.next()) {
    // do something here.
}
Definitely the LIMIT 1 will have a positive effect on performance. Instead of the entire data set of matches (well, depending on the default fetch size) being returned from the DB server to the Java code, only one row will be returned. This saves a lot of network bandwidth and Java memory usage.
Always delegate as many constraints as possible (LIMIT, ORDER BY, WHERE, etc.) to the SQL side instead of doing it on the Java side. The DB will do it much better than your Java code ever can (if the table is properly indexed, of course). You should try to write the SQL query so that it returns exactly the information you need.
The only disadvantage of writing DB-specific SQL queries is that the SQL language is not entirely portable among different DB servers, which would require you to change the SQL queries every time you switch DB servers. But in the real world it is rare anyway to switch to a completely different DB vendor. Externalizing SQL strings to XML or properties files helps a lot in any case.
There are two ways the LIMIT could speed things up:
by producing less data, which means less data gets sent over the wire and processed by the JDBC client
by potentially having MySQL itself look at fewer rows
The second one of those depends on how MySQL can produce the ordering. If you don't have an index on manuDate, MySQL will have to fetch all the rows from Cars, then order them, then give you the first one. But if there's an index on manuDate, MySQL can just look at the first entry in that index, fetch the appropriate row, and that's it. (If the index also contains modelName, MySQL doesn't even need to fetch the row after it looks at the index -- it's a covering index.)
With all that said, watch out! If manuDate isn't unique, the ordering is only partially deterministic (the order for all rows with the same manuDate is undefined), and your LIMIT 1 therefore doesn't have a single correct answer. For instance, if you switch storage engines, you might start getting different results.

How to implement several threads in Java for downloading a single table data?

How can I implement several threads with multiple connections (or a shared connection), so that a single large table's data can be downloaded quickly?
Actually, in my application I am downloading a table of 12 lacs (1 lac = 100,000) records, which takes at least 4 hours at normal connection speed and more with a slow connection.
So I need to implement several threads in Java for downloading a single table's data with multiple/shared connection objects, but I have no idea how to do this.
How do I position a record pointer in several threads, and then how do I combine all the threads' records into a single large file?
Thanks in Advance
First of all, it is not advisable to fetch and download such huge data to the client. If you need the data for display purposes, then you don't need more records than fit on your screen; you can paginate the data and fetch one page at a time. If you are fetching it and processing it in memory, then you would surely run out of memory on your client.
If at all you need to do this irrespective of the suggestion, then you can spawn multiple threads with separate connections to the database, where each thread pulls a fraction of the data (1 to many pages). If you have, say, 100K records and 100 threads available, then each thread can pull 1K records. It is, again, not advisable to have 100 threads with 100 open connections to the DB; this is just an example. Limit the number of threads to some optimal value, and also limit the number of records each thread pulls. You can limit the number of records pulled from the DB on the basis of rownum.
As Vikas pointed out, if you're downloading gigabytes of data to the client side, you're doing something really, really wrong; as he said, you should never need to download more records than can fit on your screen. If, however, you only need to do this occasionally for database duplication or backup purposes, just use the database export functionality of your DBMS and download the exported file with DAP (or your favorite download accelerator).
It seems that there are multiple ways to "multi thread read from a full table."
Zeroth way: if your problem is just "I run out of RAM reading that whole table into memory," then you could try processing one row (or one batch of rows) at a time, then the next batch, and so on, thus avoiding loading the entire table into memory (but still single-threaded, so possibly slow).
First way: have a single thread query the entire table, putting individual rows onto a queue that feeds multiple worker threads [NB that setting fetch size for your JDBC connection might be helpful here if you want this first thread to go as fast as possible]. Drawback: only one thread is querying the initial DB at a time, which may not "max out" your DB itself. Pro: you're not re-running queries so sort order shouldn't change on you half way through (for instance if your query is select * from table_name, the return order is somewhat random, but if you return it all from the same resultset/query, you won't get duplicates). You won't have accidental duplicates or anything like that. Here's a tutorial doing it this way.
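A minimal sketch of this first way, assuming a single producer (standing in for the one JDBC ResultSet reader) feeding a bounded BlockingQueue, with one poison-pill marker per worker to shut the pool down (the "processing" here is just a counter; names are illustrative):

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.*;

public class QueueFanOut {
    private static final String POISON = "__done__"; // end-of-stream marker

    // One producer feeds rows to several workers through a bounded queue.
    // Returns how many rows the workers processed in total.
    static int process(List<String> rows, int workers) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16); // bounded: applies backpressure
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                while (true) {
                    String row = queue.take();           // blocks until a row is available
                    if (row.equals(POISON)) return null; // one pill per worker, then exit
                    processed.incrementAndGet();         // stand-in for real row processing
                }
            });
        }
        for (String row : rows) queue.put(row);          // the single "reader" thread
        for (int i = 0; i < workers; i++) queue.put(POISON);
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return processed.get();
    }
}
```

The bounded queue is the point of the design: the reader can never get more than 16 rows ahead of the workers, so memory stays flat no matter how large the table is.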
Second way: pagination. Basically, every thread somehow knows which chunk it should select (XXX in this example), so it knows "I should query the table like select * from table_name order by something offset XXX limit 10". Then each thread basically processes (in this instance) 10 rows at a time [XXX is a shared variable among threads, incremented by the calling thread].
The problem is the "order by something": it means that for each query the DB has to order the entire table, which may or may not be feasible and can be expensive, especially near the end of the table. If the column is indexed, this should not be a problem. The caveat here is that if there are "gaps" in the data, you'll be doing some useless queries, but they'll probably still be fast. If you have an ID column and it's mostly contiguous, you might be able to "chunk" based on ID, for instance.
If you have some other column that you can key off of, for instance a date column with a known "quantity" per date, and it is indexed, then you may be able to avoid the "order by" by instead chunking by date, for example select * from table_name where date < XXX and date > YYY (also no limit clause, though you could have a thread use limit clauses to work through a particular unique date range, updating as it goes or sorting and chunking since it's a smaller range, less pain).
Third way: you execute a query to "reserve" rows from the table, like update table_name set lock_column = my_thread_unique_key where lock_column is nil limit 10, followed by a query select * from table_name where lock_column = my_thread_unique_key. Disadvantage: are you sure your database executes this as one atomic operation? If not, it's possible that two setter queries will collide, causing duplicates or partial batches. Be careful: maybe synchronize your process around the "select and update" queries, or lock the table and/or rows appropriately, to avoid possible collisions (Postgres, for instance, requires the special SERIALIZABLE option).
Fourth way: (related to third) mostly useful if you have large gaps and want to avoid "useless" queries: create a new table that "numbers" your initial table, with an incrementing ID [basically a temp table]. Then you can divide that table up by chunks of contiguous ID's and use it to reference the rows in the first. Or if you have a column already in the table (or can add one) to use just for batching purposes, you may be able to assign batch ID's to rows, like update table_name set batch_number = rownum % 20000 then each row has a batch number assigned to itself, threads can be assigned batches (or assigned "every 9th batch" or what not). Or similarly update table_name set row_counter_column=rownum (Oracle examples, but you get the drift). Then you'd have a contiguous set of numbers to batch off of.
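Once you have a contiguous (or renumbered) ID column, the chunk assignment itself is simple; a hedged sketch (range boundaries and names are illustrative), where each resulting [start, end] pair becomes one thread's "select * from table_name where id >= start and id <= end" query:

```java
import java.util.*;

public class IdRangeChunker {
    // Split a contiguous id range [minId, maxId] into roughly equal
    // inclusive [start, end] chunks, one per worker thread.
    static List<long[]> chunks(long minId, long maxId, int parts) {
        List<long[]> out = new ArrayList<>();
        long total = maxId - minId + 1;
        long size = (total + parts - 1) / parts; // ceiling division
        for (long start = minId; start <= maxId; start += size) {
            // Clamp the last chunk so it never runs past maxId.
            out.add(new long[]{start, Math.min(start + size - 1, maxId)});
        }
        return out;
    }
}
```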
Fifth way: (not sure if I really recommend this, but) assign each row a "random" float at insert time. Then, given that you know the approximate size of the database, you can peel off a fraction of it; for instance, if you want 100 batches, use predicates like "where x >= 0.01 and x < 0.02". (Idea inspired by how Wikipedia is able to get a "random" page: it assigns each row a random float at insert time.)
The thing you really want to avoid is some kind of change in sort order halfway through. For instance, if you don't specify a sort order and just query like select * from table_name offset XXX limit 10 from multiple threads, it's conceivably possible that the database will [since no sort order is specified] change the order in which it returns rows halfway through [for instance, if new data is added], meaning you may skip or duplicate rows.
Using Hibernate's ScrollableResults to slowly read 90 million records also has some related ideas (esp. for hibernate users).
Another option: if you know some column (like "id") is mostly contiguous, you can just iterate through it "by chunks" (get the max, then iterate numerically over chunks of it). Or use some other column that is "chunkable," as it were.
I just felt compelled to answer on this old posting.
Note that this is a typical scenario for Big Data: not only acquiring the data in multiple threads, but also further processing it in multiple threads. Such approaches do not always require all the data to be accumulated in memory; it can be processed in groups and/or sliding windows, and you may only need to accumulate a result or pass the data on (to other permanent storage).
To process the data in parallel, a partitioning or splitting scheme is typically applied to the source data. If the data is raw text, this could be a random-sized cut somewhere in the middle. For databases, the partitioning scheme is nothing but an extra WHERE condition applied to your query to allow paging. This could be something like:
Driver program: split my data into 4 parts and start 4 workers
4 x (Worker Program): Give me part 1..4 of 4 of the data
This could translate into a (pseudo) sql like:
SELECT ...
FROM (... Subquery ...)
WHERE date = SYSDATE - days(:partition)
In the end it is all pretty conventional, nothing super advanced.
