Some of my App Engine Search API queries fail with a 'java.util.concurrent.CancellationException: Task was cancelled' exception. The error is reproducible.
I have multiple indexes. On some indexes, those queries run, on others they fail.
The query is very basic. If I run it from the admin console (https://console.cloud.google.com/appengine/search/index), it gives no problem.
There is nothing special about the query.
The query filters on 2 atom fields: isReliable = "1" AND markedForDelete = "0", and sorts on a number field.
There seems to be nothing wrong with the code, as it runs many such queries with no problem, some far more complex than the failing ones.
I've seen such exceptions caused by timeout limits. Check in the logs whether you get them after approximately the same execution time (e.g. 59-60 seconds).
If this is not a user-facing request, you can move it into a task, which has a 10-minute execution limit. If it is a user-facing request, some changes to the data model might be necessary. For example, you may combine some fields into a single flag for frequently used queries, e.g. isReliable = "1" AND markedForDelete = "0" becomes code = "10" or reliableToDelete = "true".
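For illustration, a minimal sketch of that combined-flag idea with the App Engine Search API in Java; the index name "activities", the field names "code" and "score", and the helper class are assumptions for this sketch, not part of the original question:

import com.google.appengine.api.search.Document;
import com.google.appengine.api.search.Field;
import com.google.appengine.api.search.Index;
import com.google.appengine.api.search.IndexSpec;
import com.google.appengine.api.search.Query;
import com.google.appengine.api.search.QueryOptions;
import com.google.appengine.api.search.Results;
import com.google.appengine.api.search.ScoredDocument;
import com.google.appengine.api.search.SearchServiceFactory;
import com.google.appengine.api.search.SortExpression;
import com.google.appengine.api.search.SortOptions;

public class CombinedFlagExample {

    // Index and field names are assumptions for illustration only
    private static final Index INDEX = SearchServiceFactory.getSearchService()
            .getIndex(IndexSpec.newBuilder().setName("activities").build());

    // When writing a document, collapse the two atoms into one combined flag
    static Document toDocument(String id, boolean reliable, boolean markedForDelete, double score) {
        String code = (reliable ? "1" : "0") + (markedForDelete ? "1" : "0");
        return Document.newBuilder()
                .setId(id)
                .addField(Field.newBuilder().setName("code").setAtom(code))
                .addField(Field.newBuilder().setName("score").setNumber(score))
                .build();
    }

    // The query then filters on a single atom field instead of two
    static Results<ScoredDocument> findReliableNotDeleted() {
        SortOptions sort = SortOptions.newBuilder()
                .addSortExpression(SortExpression.newBuilder()
                        .setExpression("score")
                        .setDirection(SortExpression.SortDirection.DESCENDING)
                        .setDefaultValueNumeric(0))
                .build();
        Query query = Query.newBuilder()
                .setOptions(QueryOptions.newBuilder().setSortOptions(sort).build())
                .build("code = \"10\"");
        return INDEX.search(query);
    }
}

Writing the combined atom at indexing time keeps the query down to a single equality filter plus the sort, which is the cheapest shape for the Search API.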
I have a requirement in my application: to identify expensive elasticsearch queries in the application.
I only know there's the Query DSL for Elasticsearch (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html).
I need to identify each Elasticsearch query in the reverse proxy for Elasticsearch (the reverse proxy is developed in Java, just to throttle the requests to ES and collect some user statistics). If a query is expensive, only limited users can perform it, at a specific rate limit.
What is difficult for me is how to identify the expensive queries. I know that Elasticsearch has a switch that can disable/enable expensive queries by setting a parameter. I read the Elasticsearch source code, but I cannot find how Elasticsearch identifies the different kinds of expensive queries.
If you know:
Is there any Elasticsearch API (from the Elasticsearch client SDK) that can identify expensive queries? Then I could invoke that API directly in my application.
If not, do you know an effective way to identify expensive queries by analyzing the query body? By some AST (Abstract Syntax Tree) resolver? Or by searching for specific keywords in the query body?
I'd really appreciate some help on this!
There isn't a good 'native' way to do it in Elasticsearch, but you do have some options that might help.
Setting timeout or terminate_after
This option looks at your requirement from a different perspective.
From Elasticsearch docs: search-your-data
You could record how long each query performed by a user took by looking at the took field returned in the result.
{
"took": 5,
"timed_out": false,
...
}
This way you have a record of how many queries a user performed in a time window that were 'expensive' (took more than X).
For that user, you can start adding the timeout or terminate_after params that will try to limit the query execution. This won't prevent the user from issuing an expensive query, but it will try to cancel long-running queries once the timeout has expired, returning a partial or empty result back to the user.
GET /my-index-000001/_search
{
"timeout": "2s",
"query": {
"match": {
"user.id": "kimchy"
}
}
}
This limits the effect on the cluster of the expensive queries performed by that user.
A side note: this Stack Overflow answer states that certain queries can still bypass the timeout/terminate_after flags, such as script queries.
terminate_after limits the number of documents searched on each of the shards; it is an alternative option, or even a backup if timeout is too high or gets ignored for some reason.
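Since the proxy is in Java, one hedged way to enforce those limits is to inject them into the request body before forwarding it to ES, e.g. with Jackson (the class name and the threshold values below are assumptions for illustration):

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

// Proxy-side tweak: force timeout / terminate_after onto requests coming from
// users who recently ran slow queries, before forwarding the body to ES.
public class QueryLimiter {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public String applyLimits(String requestBody) throws Exception {
        // search bodies are JSON objects, so the cast is safe for valid requests
        ObjectNode root = (ObjectNode) MAPPER.readTree(requestBody);
        root.put("timeout", "2s");           // best-effort cancel after 2 seconds (assumed value)
        root.put("terminate_after", 10000);  // stop after 10k docs per shard (assumed cap)
        return MAPPER.writeValueAsString(root);
    }
}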
Long term analytics
This answer probably requires a lot more work, but you could save statistics on the queries performed and the amount of time they took.
You would probably use the JSON representation of the query DSL in this case, save it in an Elasticsearch index along with the time that query took, and keep aggregates of the average time similar queries take.
You could possibly use the rollup feature to pre-aggregate all the averages and check a query against this index to decide whether it is a "possibly expensive query".
The problem here is deciding which part of the query to save and which queries are "similar" enough to be considered for this aggregation.
Searching for keywords in the query
You stated this as an option as well. The DSL query in the end translates to a REST call with a JSON body, so using a JsonNode you could look for specific sub-elements that you 'think' will make the query expensive, and even limit things like the number of buckets, etc.
Using ObjectMapper you could write the query into a string and just look for keywords; this would be the easiest solution (a rough sketch follows after the examples below).
There are specific features that we know require a lot of resources from Elasticsearch and can potentially take a long time to finish, so these could be limited through this answer as a "first defense".
Examples:
Highlighting
Scripts
search_analyzers
etc...
So although this is the most naive approach, it could be a quick win while you work on a long-term solution that requires analytics.
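A rough sketch of that keyword scan for the Java reverse proxy, using Jackson; the class name, the list of 'expensive' markers, and the aggregation-size threshold are assumptions to be tuned, not an official list:

import java.util.Arrays;
import java.util.List;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ExpensiveQueryDetector {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Clause names that typically indicate resource-hungry requests (assumed list)
    private static final List<String> EXPENSIVE_MARKERS = Arrays.asList(
            "script", "script_score", "wildcard", "regexp", "fuzzy",
            "percolate", "highlight");

    public boolean isExpensive(String requestBody) throws Exception {
        JsonNode root = MAPPER.readTree(requestBody);

        // findValue walks the whole tree, so nested occurrences are caught too
        for (String marker : EXPENSIVE_MARKERS) {
            if (root.findValue(marker) != null) {
                return true;
            }
        }

        // Also treat very large aggregations as expensive (assumes the "aggs" key
        // and a 10k bucket threshold; adjust to your own rules)
        JsonNode size = root.path("aggs").findValue("size");
        return size != null && size.asInt(0) > 10_000;
    }
}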
In addition to the answer by Dima with some good pointers, here is a list of usual suspects for expensive / slow queries: https://blog.bigdataboutique.com/2022/10/expensive-queries-in-elasticsearch-and-opensearch-a83194
In general we'd split the discussion into three:
Is it the query itself that is slow? See the list above for the usual suspects. Some of them, by the way, can be disabled by setting search.allow_expensive_queries to false in the cluster settings.
Or is it an aggregations request?
Maybe it's the cluster that is overwhelmed, which makes queries slow, and not the actual queries.
The only way to figure this out is to look at cluster metrics over time, and correlate with the slow queries. You can also collect all your queries and analyze them for suspected culprits, and correlate with their latency. Usually that highlights a few things that can be improved (e.g. better use of caches, etc).
I need help in understanding why the below code is taking 3 to 4 seconds.
UPDATE: The use case for my application is to get the activity feed of a person since their last login. This feed could contain updates from friends or some new items outside of their network that they may find interesting. The Activity table stores all such activities, and when a user logs in, I run a query on the GAE Datastore to return the above activities. My application supports infinite scrolling too, hence I need the cursor feature of GAE. At a given time I get around 32 items, but the Activity table could have millions of rows (as it contains data from all the users).
Currently the Activity table is small and contains only 25 records, and the Java code below reads only 3 records from that table.
Each record in the Activity table has 4 UUID fields.
I cannot imagine how the query would behave if the table contained millions of rows and result contained 100s of rows.
Is there something wrong with the code I have below?
(I am using Objectify and app-engine cursors)
Filter filter = new FilterPredicate("creatorID", FilterOperator.EQUAL, userId);
Query<Activity> query = ofy().load().type(Activity.class).filter(filter);
query = query.startAt(Cursor.fromWebSafeString(previousCursorString));
QueryResultIterator<Activity> itr = query.iterator();
while (itr.hasNext())
{
Activity a = itr.next();
System.out.println (a);
}
I have gone through "Google App Engine Application Extremely slow" and verified that the response time improves if I keep refreshing my page (which calls the above code). However, the improvement is only ~30%.
Compare this with any other database and the response time for such tiny data is in milliseconds, not even 100s of milliseconds.
Am I wrong in expecting a regular database kind of performance from the GAE DataStore?
I do not want to turn on memcache just yet as I want to improve this layer without caching first.
Not exactly sure what your query is supposed to do, but it doesn't look like it requires a cursor query. In my humble opinion, the only valid use case for cursor queries is a paginated query for data with a limited count of result rows. Since your query does not have a limit, I don't see why you would want to use a cursor at all.
When you need millions of results you're probably doing ad-hoc analysis of the data (no human could ever interpret millions of raw data rows), so you might be better off using BigQuery instead of the App Engine datastore. I'm just guessing here, but for normal front-end apps you rarely need millions of rows in a result, only a few (maybe hundreds at times) which you filter from the total available rows.
Another thing:
Are you sure that it is the query that takes long? It might just as well be the wrapper around the query. Since you are using cursors, you have to re-issue the query until there are no more results. The handling of this could be costly.
Lastly:
Are you testing on App Engine itself or on the local development server? The devserver obviously cannot simulate a cloud and thus could be slower (or faster) than the real thing at times. The devserver also does not know about instance warm-up times when your query spawns new instances.
Speaking of cloud: the point of cloud databases is not that they have the best performance for very little data, but that they scale and perform consistently with a couple of hundred and a couple of billion rows.
Edit:
After performing a retrieval operation, the application can obtain a cursor, which is an opaque base64-encoded string marking the index position of the last result retrieved.
[...]
The cursor's position is defined as the location in the result list after the last result returned. A cursor is not a relative position in the list (it's not an offset); it's a marker to which the Datastore can jump when starting an index scan for results. If the results for a query change between uses of a cursor, the query notices only changes that occur in results after the cursor. If a new result appears before the cursor's position for the query, it will not be returned when the results after the cursor are fetched.
(Datastore Queries)
These two statements make me believe that query performance should be consistent with or without cursors.
Here are some more things you might want to check:
How do you register your entity classes with objectify?
What does your actual test code look like? I'd like to see how and where you measure.
Can you share a comparison between cursor query and query without cursors?
Improvement with multiple requests could be the result of Objectify's integrated caching. You might want to disable caching for datastore performance tests (see the sketch below).
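A minimal timing sketch for such a test, using Objectify's cache(boolean) switch and the Activity entity from the question (the class name, page size, and timing harness are illustrative assumptions only):

import static com.googlecode.objectify.ObjectifyService.ofy;

import java.util.List;

public class DatastoreTimingTest {

    /** Times a single page of the activity query with Objectify caching disabled. */
    static long timeActivityQuery(String userId) {
        long start = System.currentTimeMillis();

        List<Activity> activities = ofy()
                .cache(false)                 // bypass Objectify's integrated caching
                .load()
                .type(Activity.class)
                .filter("creatorID", userId)  // same filter as in the question
                .limit(32)                    // the page size mentioned in the question
                .list();

        long elapsed = System.currentTimeMillis() - start;
        System.out.println("Fetched " + activities.size() + " activities in " + elapsed + " ms");
        return elapsed;
    }
}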
I have a CSV which is... 34 million lines long. Yes, no joking.
This is a CSV file produced by a parser tracer which is then imported into the corresponding debugging program.
And the problem is in the latter.
Right now I import all rows one by one:
private void insertNodes(final DSLContext jooq)
throws IOException
{
try (
final Stream<String> lines = Files.lines(nodesPath, UTF8);
) {
lines.map(csvToNode)
.peek(ignored -> status.incrementProcessedNodes())
.forEach(r -> jooq.insertInto(NODES).set(r).execute());
}
}
csvToNode is simply a mapper which will turn a String (a line of a CSV) into a NodesRecord for insertion.
Now, the line:
.peek(ignored -> status.incrementProcessedNodes())
well... The method name tells pretty much everything; it increments a counter in status which reflects the number of rows processed so far.
What happens is that this status object is queried every second to get information about the status of the loading process (we are talking about 34 million rows here; they take about 15 minutes to load).
But now jooq has this (taken from their documentation) which can load directly from a CSV:
create.loadInto(AUTHOR)
.loadCSV(inputstream)
.fields(ID, AUTHOR_ID, TITLE)
.execute();
(though personally I'd never use THAT .loadCSV() overload since it doesn't take the CSV encoding into account).
And of course JooQ will manage to turn that into a suitable construct so that for this or that DB engine the throughput is maximized.
The problem however is that I lose the "by second" information I get from the current code... And if I replace the query by a select count(*) from the_victim_table, that kind of defeats the point, not to mention that this MAY take a long time.
So, how do I get "the best of both worlds"? That is, is there a way to use an "optimized CSV load" and query, quickly enough and at any time, how many rows have been inserted so far?
(note: should that matter, I currently use H2; a PostgreSQL version is also planned)
There are a number of ways to optimise this.
Custom load partitioning
One way to optimise query execution at your side is to collect sets of values into:
Bulk statements (as in INSERT INTO t VALUES(1), (2), (3), (4))
Batch statements (as in JDBC batch)
Commit segments (commit after N statements)
... instead of executing them one by one. This is what the Loader API also does (see below). All of these measures can heavily increase load speed.
This is the only way you can currently "listen" to loading progress.
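A rough sketch of that idea against the method from the question (NODES, csvToNode, status, nodesPath, and UTF8 are the names from your snippet; the batch size and the manual batching are assumptions, not the Loader API):

// Additional imports needed: java.util.ArrayList, java.util.List, org.jooq.Query
// Collect rows into JDBC batches, so the per-row counter keeps working while
// most of the per-statement overhead disappears.
private void insertNodesBatched(final DSLContext jooq)
    throws IOException
{
    final int batchSize = 1000;                    // assumed chunk size; tune for your DB
    final List<Query> batch = new ArrayList<>(batchSize);

    try (
        final Stream<String> lines = Files.lines(nodesPath, UTF8);
    ) {
        lines.map(csvToNode)
             .peek(ignored -> status.incrementProcessedNodes())  // same counter as before
             .forEach(r -> {
                 batch.add(jooq.insertInto(NODES).set(r));
                 if (batch.size() == batchSize) {
                     jooq.batch(batch).execute();  // one JDBC batch round-trip
                     batch.clear();
                 }
             });
        if (!batch.isEmpty()) {                    // flush the remainder
            jooq.batch(batch).execute();
            batch.clear();
        }
    }
}

The counter then counts rows handed to the current batch rather than rows already written, which is usually close enough for a progress display.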
Load partitioning using jOOQ 3.6+
(this hasn't been released yet, but it will be, soon)
jOOQ natively implements the above three partitioning measures in jOOQ 3.6.
Using vendor-specific CSV loading mechanisms
jOOQ will always need to pass through JDBC and might thus not present you with the fastest option. Most databases have their own loading APIs, e.g. the ones you've mentioned:
H2: http://www.h2database.com/html/tutorial.html#csv
PostgreSQL: http://www.postgresql.org/docs/current/static/sql-copy.html
This will be more low-level, but certainly faster than anything else.
General remarks
What happens is that this status object is queried every second to get information about the status of the loading process (we are talking about 34 million rows here; they take about 15 minutes to load).
That's a very interesting idea. I will register this as a feature request for the Loader API: Using JooQ to "batch insert" from a CSV _and_ keep track of inserted records at the same time?
though personally I'd never use THAT .loadCSV() overload since it doesn't take the CSV encoding into account
We've fixed that for jOOQ 3.6, thanks to your remarks: https://github.com/jOOQ/jOOQ/issues/4141
And of course JooQ will manage to turn that into a suitable construct so that for this or that DB engine the throughput is maximized.
No, jOOQ doesn't make any assumptions about maximising throughput. This is extremely difficult and depends on many other factors than your DB vendor, e.g.:
Constraints on the table
Indexes on the table
Logging turned on/off
etc.
jOOQ offers you help in maximising throughput yourself. For instance, in jOOQ 3.5+, you can:
Set the commit rate (e.g. commit every 1000 rows) to avoid long UNDO / REDO logs in case you're inserting with logging turned on. This can be done via the commitXXX() methods.
In jOOQ 3.6+, you can also:
Set the bulk statement rate (e.g. combine 10 rows in a single statement) to drastically speed up execution. This can be done via the bulkXXX() methods.
Set the batch statement rate (e.g. combine 10 statements in a single JDBC batch) to drastically speed up execution (see this blog post for details). This can be done via the batchXXX() methods.
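Put together, the Loader options described above might look roughly like this, sketched against the AUTHOR example from the question (method availability depends on the jOOQ 3.5/3.6 versions mentioned, so treat this as an outline rather than a guaranteed API):

create.loadInto(AUTHOR)
      .bulkAfter(10)        // jOOQ 3.6+: combine 10 rows per bulk INSERT
      .batchAfter(10)       // jOOQ 3.6+: combine 10 statements per JDBC batch
      .commitAfter(1000)    // jOOQ 3.5+: commit every 1000 rows
      .loadCSV(inputstream) // an encoding-aware overload arrives with issue #4141
      .fields(ID, AUTHOR_ID, TITLE)
      .execute();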
I am a java programmer and I want to know how many database calls/trips are done by my application. We use Oracle as our relational database.
With Oracle, I got to know about a way to alter the session statistics and generate trace files. Below are the statements to be run:
ALTER SESSION SET TIMED_STATISTICS = TRUE;
ALTER SESSION SET SQL_TRACE = TRUE;
After the trace files are generated, they could be read using the TKProf utility. But this approach cannot be used because:
My application uses the Hibernate and Spring frameworks, and hence the application does not have a handle to the session.
Even if we get the trace files, I need to know whether a set of queries was fired in one go (in a batch) or separately. I am not sure if the TKProf output can help to understand this.
Does anyone have any better suggestions?
In TkProf, you can basically read the number of round trips from the number of "calls" (although there are exceptions where fewer round trips are required, e.g. parse/execute/fetch of a single-row select is, theoretically, possible in a single round trip, the so-called "exact fetch" feature of Oracle). As an estimate, however, the TkProf figures are good enough.
If you trace wait events, you should directly see the 'SQL*Net message from/to client' wait events in the raw trace, but I think TkProf does not show them (not sure, give it a try).
Another way is to look into the session statistics:
select value
from v$mystat ms, v$statname sn
where ms.value > 0
and ms.statistic#=sn.statistic#
and sn.name IN ('SQL*Net roundtrips to/from client')
However, if you do that in your app, you will slow down your app, and the figures you receive will include the round trips for that select itself.
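Since the application is in Java, here is a small sketch of that approach through plain JDBC; the helper class is hypothetical, the session needs SELECT access to v$mystat and v$statname, and the final read itself adds one round trip to the figure:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

final class RoundTripCounter {

    private static final String STAT_SQL =
            "select ms.value " +
            "from v$mystat ms join v$statname sn on ms.statistic# = sn.statistic# " +
            "where sn.name = 'SQL*Net roundtrips to/from client'";

    /** Reads the current session's round-trip counter. */
    static long roundTrips(Connection connection) throws SQLException {
        try (PreparedStatement ps = connection.prepareStatement(STAT_SQL);
             ResultSet rs = ps.executeQuery()) {
            return rs.next() ? rs.getLong(1) : 0L;
        }
    }
}

// Usage: unwrap the JDBC Connection used by the Hibernate Session, then
//   long before = RoundTripCounter.roundTrips(connection);
//   ... run the code under test ...
//   long after = RoundTripCounter.roundTrips(connection);
//   System.out.println("round trips: " + (after - before - 1)); // -1 for the final read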
I wrote a few articles about round-trip optimization:
http://blog.fatalmind.com/2009/12/22/latency-security-vs-performance/
http://blog.fatalmind.com/2010/01/29/oracle-jdbc-prefetch-portability/
Firstly, use a dedicated database (or timeframe) for this test, so it doesn't get easily confused with other sessions.
Secondly, look at the view v$session to identify the session(s) for hibernate. The USERNAME, OSUSER, TERMINAL, MACHINE should make this obvious. The SID and SERIAL# columns uniquely identify the session. Actually the SID is unique at any time. The SERIAL# is only needed if you have sessions disconnecting and reconnecting.
Thirdly, use v$sessstat (filtered on the SID,SERIAL# from the v$session) and v$statname (as shown by Markus) to pull out the number of round trips. You can take a snapshot before the test, run the test, then look at the values again and determine the work done.
That said, I'm not sure it is a particularly useful measure in itself. The TKPROF will be more detailed and is much more focussed on time (which is a more useful measure).
Best would be to get a dedicated event 10046 level 12 trace file of the running session. You will find all the information there in detail. This means you can see how many fetches the application does per executed command and the related wait events/elapsed time. The result can be analyzed using tools from Oracle like TKPROF or the Oracle Trace Analyzer, or third-party tools like [QueryAdvisor][1].
By the way, you can ask your DBA to define a database logon trigger that activates the Oracle trace automatically after login, so capturing the file should not be a problem.
[1]: http://www.queryadvisor.com/ "TKPROF Oracle tracefile analysis with QueryAdvisor"
I have some queries that run for a quite long (20-30 minutes). If a lot of queries are started simultaneously, connection pool is drained quickly.
Is it possible to wrap the long-running query into a statement (procedure) that will store the result of a generic query in a temp table, terminating the connection, and fetching (polling) the results later on demand?
EDIT: the queries and data structures are optimized, and tips like 'check your indices and execution plan' don't work for me. I'm looking for a way to store [maybe] a byte representation of a generic result set, for later retrieval.
First of all, 20-30 minutes is an extremely long time for a query - are you sure you aren't missing any indexes for the query? Do check your execution plan - you could get a huge performance gain from a well-placed index.
In MySQL, you could do
INSERT INTO `cached_result_table` (
SELECT your_query_here
)
(of course, cached_result_table needs to have the exact same column structure as your SELECT returns, otherwise you'll get an error).
Then, you could query these cached results (instead of the original tables), and only run the above query from time to time - to update the cached_result_table.
Of course, the query will need to run at least once initially, which will take the 20-30 minutes you mentioned. I suggest pre-populating the cached table before the data are requested, and keeping some locking mechanism to prevent the update query from running several times simultaneously. Pseudocode:
init:
    insert select your_big_query

work:
    if your_big_query cached table is empty or nearing expiration:
        refresh in the background:
            check flag to see if there's another "refresh" process running
            if yes:
                end  // don't run two your_big_queries at the same time
            else:
                set flag
                re-run your_big_query, save to cached table
                clear flag
    serve data to clients always from cached table
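A minimal Java sketch of that flag-based refresh, assuming a hypothetical refreshCachedTable() runnable that re-runs the big query and writes into the cached table:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

public class CachedQueryRefresher {

    private final AtomicBoolean refreshRunning = new AtomicBoolean(false);
    private final ExecutorService background = Executors.newSingleThreadExecutor();

    /** Kicks off a background refresh unless one is already in progress. */
    public void refreshIfNeeded(Runnable refreshCachedTable) {
        // compareAndSet acts as the "flag": only one refresh may run at a time
        if (refreshRunning.compareAndSet(false, true)) {
            background.submit(() -> {
                try {
                    refreshCachedTable.run();   // re-run your_big_query, save to cached table
                } finally {
                    refreshRunning.set(false);  // clear flag
                }
            });
        }
        // callers always read from cached_result_table, never the big query
    }
}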
An easy way to do that in Oracle is "CREATE TABLE sometempname AS SELECT...". That will create a new table using the result columns from the select.
Not quite sure what you are requesting.
Currently you have 50 database sessions. Say you get 40 running long-running queries, that leaves 10 to service the rest.
What you seem to be asking for is that those 40 queries run asynchronously (in the background), not clogging up the connection pool of 50. The question is, do you want those 40 running concurrently with (potentially) another 50 queries from the connection pool, or do you want them queued up in some way?
Queuing can be done (look into DBMS_SCHEDULER and DBMS_JOB). But you will need to deliver those results into some other table and know how to deliver that result set. The old fashioned way is simply to generate reports on request that get delivered to a directory on a shared drive or by email. Could be PDF or CSV or Excel.
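For illustration, a hedged sketch of submitting such a background job from Java through JDBC; the job name and the result table are made up, and the PL/SQL block inside job_action would hold your long-running query:

import java.sql.CallableStatement;
import java.sql.Connection;

public class BackgroundQuerySubmitter {

    /** Submits a one-off DBMS_SCHEDULER job that materializes the big query into a result table. */
    static void submitLongRunningQuery(Connection connection) throws Exception {
        String plsql =
            "begin " +
            "  dbms_scheduler.create_job(" +
            "    job_name   => 'LONG_QUERY_JOB', " +   // made-up job name
            "    job_type   => 'PLSQL_BLOCK', " +
            "    job_action => 'begin insert into query_results select /* your long query */ 1 from dual; end;', " +
            "    enabled    => true); " +
            "end;";
        try (CallableStatement cs = connection.prepareCall(plsql)) {
            cs.execute();   // returns immediately; the job runs in the background
        }
        // The caller's connection goes back to the pool; results are polled later
        // from query_results (or however you choose to deliver them).
    }
}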
If you want the 40 running concurrently alongside the 50 'connection pool' settings, then you may be best off setting up a separate connection pool for the long-running queries.
You can look into Resource Manager for terminating calls that take too long or too many resources. That way the quickie pool can't get bogged down in long running requests.
The most generic approach in Oracle I can think of is creating a stored procedure that will convert a result set into XML, and store it as CLOB XMLType in a table with the results of your long-running queries.
You can find more on generating XML from a generic result set here.
select dbms_xmlgen.getxml('select employee_id, first_name,
       last_name, phone_number from employees where rownum < 6') xml
from dual;
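For illustration, a hedged sketch of how the Java side might store such a result: it binds a generic query text into dbms_xmlgen.getxml and writes the returned CLOB into an assumed query_results_xml table (request_id, result_xml CLOB) for later polling:

import java.sql.Connection;
import java.sql.PreparedStatement;

public class XmlResultStore {

    /** Materializes the XML for a generic query text, so it can be fetched later
     *  without holding a pooled connection open for the whole wait. */
    static void storeResultAsXml(Connection connection, long requestId, String queryText) throws Exception {
        String sql = "insert into query_results_xml (request_id, result_xml) " +
                     "values (?, dbms_xmlgen.getxml(?))";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setLong(1, requestId);
            ps.setString(2, queryText);   // the generic query to capture
            ps.executeUpdate();
        }
    }
}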