How to identify expensive queries by the Query DSL? - java

I have a requirement in my application: identify expensive Elasticsearch queries.
I only know there's the Query DSL for Elasticsearch (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html).
I need to inspect each Elasticsearch query in the reverse proxy that sits in front of Elasticsearch (the proxy is written in Java, just to throttle requests to ES and collect some user statistics). If a query is expensive, only certain users should be allowed to run it, and only at a specific rate limit.
What is difficult for me is how to identify the expensive queries. I know Elasticsearch has a switch that can enable/disable expensive queries by setting a parameter. I read the Elasticsearch source code, but I cannot find how Elasticsearch identifies the different kinds of expensive queries.
If you know:
Is there any Elasticsearch API (in the Elasticsearch client SDK) that can identify expensive queries? Then I could invoke that API directly from my application.
If not, what is an effective way to identify expensive queries by analyzing the query body? With some AST (Abstract Syntax Tree) resolver? Or by searching for specific keywords in the query body?
I'd really appreciate some help on this!

There isn't a good 'native' way to do it in Elasticsearch, but you do have some options that might help.
Setting timeout or terminate_after
This option looks at your requirement from a different perspective.
From Elasticsearch docs: search-your-data
You could keep a record of how long each query performed by a user took, by looking at the took field returned in the result.
{
  "took": 5,
  "timed_out": false,
  ...
}
This way you have a record of how many of the queries a user performed in a given time window were 'expensive' (took more than X).
For that user, you can start adding the timeout or terminate_after params, which try to limit query execution. This won't prevent the user from submitting an expensive query, but it will try to cancel long-running queries once the 'timeout' has expired, returning a partial or empty result back to the user.
GET /my-index-000001/_search
{
  "timeout": "2s",
  "query": {
    "match": {
      "user.id": "kimchy"
    }
  }
}
This will limit the effect that user's expensive queries have on the cluster.
A side note: this Stack Overflow answer states that certain queries, such as script queries, can still bypass the timeout/terminate_after flags.
terminate_after limits the number of documents searched through on each of the shards; it might be an alternative option, or another backstop if timeout is too high or gets ignored for some reason.
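This isn't an Elasticsearch API, but as a hedged illustration, the bookkeeping in the Java proxy could look roughly like this (Jackson is assumed to be available; the class, threshold and method names are made up):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper for the proxy: record how long each user's queries took,
// based on the "took" field of the Elasticsearch response body.
public class SlowQueryTracker {

    private static final ObjectMapper MAPPER = new ObjectMapper();
    private static final long EXPENSIVE_THRESHOLD_MS = 2_000; // illustrative threshold

    // userId -> number of "expensive" queries observed in the current window
    private final Map<String, AtomicLong> expensiveCounts = new ConcurrentHashMap<>();

    public void recordResponse(String userId, String responseBody) throws Exception {
        JsonNode root = MAPPER.readTree(responseBody);
        long tookMs = root.path("took").asLong(-1);
        if (tookMs >= EXPENSIVE_THRESHOLD_MS) {
            expensiveCounts.computeIfAbsent(userId, k -> new AtomicLong()).incrementAndGet();
        }
    }

    public boolean shouldThrottle(String userId, long maxExpensivePerWindow) {
        AtomicLong count = expensiveCounts.get(userId);
        return count != null && count.get() >= maxExpensivePerWindow;
    }
}

Users flagged this way are the ones for whom you would start injecting timeout or terminate_after.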
Long term analytics
This option probably requires a lot more work, but you could save statistics on the queries performed and how long they took.
You should probably use the JSON representation of the Query DSL in this case, save it in an Elasticsearch index along with the time that query took, and keep aggregates of the average time similar queries take.
You could possibly use the rollup feature to pre-aggregate all the averages, and check an incoming query against this index to decide whether it is a "possibly expensive query".
The problem here is deciding which part of the query to save and which queries are "similar" enough to be considered for this aggregation.
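As a hedged illustration, recording those statistics from the Java proxy could look roughly like this (the query-stats index name, the field layout and the 7.x high-level REST client are all assumptions):

import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import java.time.Instant;
import java.util.Map;

// Hypothetical: store each executed query together with its "took" time in a stats index.
public class QueryStatsRecorder {

    private final RestHighLevelClient client;

    public QueryStatsRecorder(RestHighLevelClient client) {
        this.client = client;
    }

    public void record(String userId, String queryJson, long tookMs) throws Exception {
        IndexRequest request = new IndexRequest("query-stats") // assumed index name
                .source(Map.of(
                        "user", userId,
                        "query", queryJson,
                        "took_ms", tookMs,
                        "timestamp", Instant.now().toString()));
        client.index(request, RequestOptions.DEFAULT);
    }
}

Aggregating took_ms per normalized query (or per user) then gives you the averages described above.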
Searching for keywords in the query
You stated this as an option as well. The DSL query in the end translates to a REST call with a JSON body, so using a JsonNode you could look for specific sub-elements that you 'think' will make the query expensive, and even limit things like the 'number of buckets', etc.
Using an ObjectMapper you could serialize the query to a string and just look for keywords; this would be the easiest solution.
There are specific features that we know require a lot of resources from Elasticsearch and can potentially take a long time to finish, so these could be limited through this approach as a 'first line of defense'.
Examples:
Highlighting
Scripts
search_analyzers
etc...
So although this answer is the most naive, it could be a fast win while you work on a long term solution that requires analytics.
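To make that concrete, here is a minimal sketch of such a check in the Java proxy using Jackson; the class and the list of 'suspicious' keys are assumptions and would need tuning for your workload:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Hypothetical "first defense": walk the query JSON and flag keys commonly
// associated with expensive requests (the key list is an assumption).
public class ExpensiveQueryDetector {

    private static final ObjectMapper MAPPER = new ObjectMapper();
    private static final Set<String> SUSPICIOUS_KEYS = Set.of(
            "script", "script_score", "wildcard", "regexp", "fuzzy", "highlight");

    public boolean looksExpensive(String queryBody) throws Exception {
        return containsSuspiciousKey(MAPPER.readTree(queryBody));
    }

    private boolean containsSuspiciousKey(JsonNode node) {
        if (node.isObject()) {
            for (Iterator<Map.Entry<String, JsonNode>> it = node.fields(); it.hasNext(); ) {
                Map.Entry<String, JsonNode> field = it.next();
                if (SUSPICIOUS_KEYS.contains(field.getKey())
                        || containsSuspiciousKey(field.getValue())) {
                    return true;
                }
            }
        } else if (node.isArray()) {
            for (JsonNode child : node) {
                if (containsSuspiciousKey(child)) {
                    return true;
                }
            }
        }
        return false;
    }
}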

In addition to the answer by Dima with some good pointers, here is a list of usual suspects for expensive / slow queries: https://blog.bigdataboutique.com/2022/10/expensive-queries-in-elasticsearch-and-opensearch-a83194
In general we'd split the discussion into three:
Is it the query itself that is slow? See the list above for the usual suspects. Some of them, by the way, can be disabled by setting search.allow_expensive_queries to false in the cluster settings.
Or is it an aggregations request?
Maybe it's an overwhelmed cluster that is making queries slow, and not the actual queries.
The only way to figure this out is to look at cluster metrics over time, and correlate with the slow queries. You can also collect all your queries and analyze them for suspected culprits, and correlate with their latency. Usually that highlights a few things that can be improved (e.g. better use of caches, etc).
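If you do end up relying on that switch, here is a minimal sketch of flipping it from Java with the high-level REST client (whether to set it persistently or transiently, and whether to do this from the proxy at all, is your call):

import org.elasticsearch.action.admin.cluster.settings.ClusterUpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;

// Sketch: disable expensive queries cluster-wide via the cluster settings API.
public class ExpensiveQuerySwitch {

    public static void disableExpensiveQueries(RestHighLevelClient client) throws Exception {
        ClusterUpdateSettingsRequest request = new ClusterUpdateSettingsRequest();
        request.persistentSettings(Settings.builder()
                .put("search.allow_expensive_queries", false)
                .build());
        client.cluster().putSettings(request, RequestOptions.DEFAULT);
    }
}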

Related

Set pagination off while searching documents in marklogic for a specified criteria

I am doing a search in MarkLogic using JsonDocumentManager, providing a StructuredQueryDefinition. As a result I am getting a DocumentPage that defaults to 50 records (the page length defaulted in JsonDocumentManager). But how can I retrieve all the documents in one go?
I can see two options to solve this: either increase the page length to a limit that cannot be exceeded for the criteria I am supplying, or provide the page offset to jsonDocumentManager.search(queryDefinition, pageOffset) in a loop until documentPage.isLastPage returns true.
Could someone please let me know of further options, if any? Is there a pagination parameter I can switch to false so MarkLogic does not do a paginated search?
As stated by @grtjn, it's always best to paginate, and it's even faster if you can run requests in parallel. For that reason, the Java API doesn't have a flag to get all results. Nor do the layers it builds on: the REST API and the search:search API.
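For reference, a minimal sketch of the paginated loop the question describes, using the MarkLogic Java Client API (the page length is illustrative, and docMgr and query are assumed to be set up elsewhere):

import com.marklogic.client.document.DocumentPage;
import com.marklogic.client.document.DocumentRecord;
import com.marklogic.client.document.JSONDocumentManager;
import com.marklogic.client.query.StructuredQueryDefinition;

// Sketch: page through all results rather than fetching them in one go.
public class PaginatedSearch {

    public static void fetchAll(JSONDocumentManager docMgr, StructuredQueryDefinition query) {
        docMgr.setPageLength(100); // illustrative page length
        long start = 1;
        DocumentPage page;
        do {
            page = docMgr.search(query, start);
            for (DocumentRecord record : page) {
                // process each document here
            }
            start += page.getPageSize();
        } while (!page.isLastPage());
    }
}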
The layer those build on, cts:search, uses server-side lazy evaluation to paginate efficiently under the hood while appearing to get all results. With that said, if you must have another option besides those you already know about, consider creating a Resource extension and having it call the cts:search API directly.
For what it's worth, in MarkLogic 9 we'll be providing the Data Movement SDK which will do all the pagination and parallelization for you under the hood on the client side. It is specifically designed for long-running data movement applications that need to export or manipulate large datasets. If that's of interest, please consider joining the early access program and you can try it out.

Can JaVers return a count of filtered entries?

JaVers is a great library. I wonder, though, whether it offers a way of determining the number of results that would be returned by a JQL query.
This would be extremely helpful in our use case, which would be to:
1) Use JaVers’ skip() and limit() functions to return just a page of requested data.
2) Determine and return additional pagination-related data, such as whether the returned page is the last one and how many pages exist in the system (which depends on the user specified page size... along with the maximum number of possible results in the system - which I'm unsure of how best to retrieve).
I realize that we can simply load all results returned by JaVers' findChanges method into memory to get the full count. Is there a more efficient alternative, though?
Kind thanks,
Ben
There is no count() in JQL. It could be added in the future but the performance gain is not obvious.
In many cases, database count() queries are not significantly faster than select queries.
What could be saved are the network and application CPU resources used for fetching snapshots from the DB and parsing JSON.
Some tests should be created to estimate the performance gain by comparing count(*) queries with findChanges() queries.
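In the meantime, a hedged workaround for the pagination use case is to request one extra element per page and use it to detect whether another page exists; a rough sketch (the entity class and page size are placeholders):

import org.javers.core.Javers;
import org.javers.core.diff.Change;
import org.javers.repository.jql.QueryBuilder;
import java.util.List;

// Sketch: page through changes with skip()/limit(); no total count is available,
// but fetching pageSize + 1 elements tells you whether a next page exists.
public class ChangePager {

    public static List<Change> fetchPage(Javers javers, int page, int pageSize) {
        List<Change> changes = javers.findChanges(QueryBuilder.byClass(Object.class) // placeholder entity
                .skip(page * pageSize)
                .limit(pageSize + 1)   // one extra element to detect a next page
                .build());
        boolean hasNextPage = changes.size() > pageSize;
        // expose hasNextPage alongside the trimmed page (e.g. changes.subList(0, pageSize)) in your paging DTO
        return changes;
    }
}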

Using JooQ to "batch insert" from a CSV _and_ keep track of inserted records at the same time?

I have a CSV which is... 34 million lines long. Yes, no joking.
This is a CSV file produced by a parser tracer which is then imported into the corresponding debugging program.
And the problem is in the latter.
Right now I import all rows one by one:
private void insertNodes(final DSLContext jooq)
    throws IOException
{
    try (
        final Stream<String> lines = Files.lines(nodesPath, UTF8);
    ) {
        lines.map(csvToNode)
            .peek(ignored -> status.incrementProcessedNodes())
            .forEach(r -> jooq.insertInto(NODES).set(r).execute());
    }
}
csvToNode is simply a mapper which will turn a String (a line of a CSV) into a NodesRecord for insertion.
Now, the line:
.peek(ignored -> status.incrementProcessedNodes())
well... The method name tells pretty much everything; it increments a counter in status which reflects the number of rows processed so far.
What happens is that this status object is queried every second to get information about the status of the loading process (we are talking about 34 million rows here; they take about 15 minutes to load).
But now jooq has this (taken from their documentation) which can load directly from a CSV:
create.loadInto(AUTHOR)
      .loadCSV(inputstream)
      .fields(ID, AUTHOR_ID, TITLE)
      .execute();
(though personally I'd never use THAT .loadCSV() overload since it doesn't take the CSV encoding into account).
And of course JooQ will manage to turn that into a suitable construct so that for this or that DB engine the throughput is maximized.
The problem however is that I lose the "by second" information I get from the current code... And if I replace the query by a select count(*) from the_victim_table, that kind of defeats the point, not to mention that this MAY take a long time.
So, how do I get "the best of both worlds"? That is, is there a way to use an "optimized CSV load" and query, quickly enough and at any time, how many rows have been inserted so far?
(note: should that matter, I currently use H2; a PostgreSQL version is also planned)
There are a number of ways to optimise this.
Custom load partitioning
One way to optimise query execution at your side is to collect sets of values into:
Bulk statements (as in INSERT INTO t VALUES(1), (2), (3), (4))
Batch statements (as in JDBC batch)
Commit segments (commit after N statements)
... instead of executing them one by one. This is what the Loader API also does (see below). All of these measures can heavily increase load speed.
This is the only way you can currently "listen" to loading progress.
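For illustration, a rough sketch of that manual partitioning combined with the progress counter from the question (it reuses nodesPath, UTF8, csvToNode, status, NODES and NodesRecord from the original code; the batch size and the use of batchInsert() are just one way to do it):

// Sketch: buffer parsed rows and execute them as JDBC batches, so the status
// counter can still be updated as the load progresses (batch size is illustrative).
private void insertNodesInBatches(final DSLContext jooq)
    throws IOException
{
    final int batchSize = 1000;
    final List<NodesRecord> buffer = new ArrayList<>(batchSize);

    try (
        final Stream<String> lines = Files.lines(nodesPath, UTF8);
    ) {
        lines.map(csvToNode).forEach(record -> {
            buffer.add(record);
            if (buffer.size() == batchSize)
                flush(jooq, buffer);
        });
        flush(jooq, buffer);
    }
}

private void flush(final DSLContext jooq, final List<NodesRecord> buffer)
{
    if (buffer.isEmpty())
        return;
    jooq.batchInsert(buffer).execute();                     // one JDBC batch instead of N single statements
    buffer.forEach(r -> status.incrementProcessedNodes());  // same counter as before, bumped per row
    buffer.clear();
}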
Load partitioning using jOOQ 3.6+
(this hasn't been released yet, but it will be, soon)
jOOQ natively implements the above three partitioning measures in jOOQ 3.6.
Using vendor-specific CSV loading mechanisms
jOOQ will always need to pass through JDBC and might thus not present you with the fastest option. Most databases have their own loading APIs, e.g. the ones you've mentioned:
H2: http://www.h2database.com/html/tutorial.html#csv
PostgreSQL: http://www.postgresql.org/docs/current/static/sql-copy.html
This will be more low-level, but certainly faster than anything else.
General remarks
What happens is that this status object is queried every second to get information about the status of the loading process (we are talking about 34 million rows here; they take about 15 minutes to load).
That's a very interesting idea. Will register this as a feature request for the Loader API: Using JooQ to "batch insert" from a CSV _and_ keep track of inserted records at the same time?
though personally I'd never use THAT .loadCSV() overload since it doesn't take the CSV encoding into account
We've fixed that for jOOQ 3.6, thanks to your remarks: https://github.com/jOOQ/jOOQ/issues/4141
And of course JooQ will manage to turn that into a suitable construct so that for this or that DB engine the throughput is maximized.
No, jOOQ doesn't make any assumptions about maximising throughput. This is extremely difficult and depends on many other factors than your DB vendor, e.g.:
Constraints on the table
Indexes on the table
Logging turned on/off
etc.
jOOQ offers you help in maximising throughput yourself. For instance, in jOOQ 3.5+, you can:
Set the commit rate (e.g. commit every 1000 rows) to avoid long UNDO / REDO logs in case you're inserting with logging turned on. This can be done via the commitXXX() methods.
In jOOQ 3.6+, you can also:
Set the bulk statement rate (e.g. combine 10 rows in a single statement) to drastically speed up execution. This can be done via the bulkXXX() methods.
Set the batch statement rate (e.g. combine 10 statements in a single JDBC batch) to drastically speed up execution (see this blog post for details). This can be done via the batchXXX() methods.
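Putting these options together, a hedged sketch of what such a Loader call could look like for the question's table (it reuses create, NODES, nodesPath and UTF8 from the question; the rates are illustrative, and the charset-aware loadCSV(InputStream, Charset) overload is the one added for issue #4141 in 3.6):

// Sketch: jOOQ 3.6+ Loader API with bulk / batch / commit partitioning.
create.loadInto(NODES)
      .bulkAfter(32)      // combine 32 rows into one bulk INSERT .. VALUES (..), (..) statement
      .batchAfter(4)      // combine 4 such statements into one JDBC batch
      .commitAfter(1000)  // commit every 1000 rows, as suggested above
      .loadCSV(Files.newInputStream(nodesPath), UTF8)
      .fields(NODES.fields())
      .execute();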

How to use Bulk API with WHERE clause in Salesforce

I want to use Bulk API of Salesforce to run queries of this format.
Select Id from Object where field='<value>'.
I have thousands of such field values and want to retrieve Id of those objects. AFAIK, Bulk query of Salesforce supports only one SOQL statement as input.
One option could be to form a query like
Select Id,field where field in (<all field values>)
but the problem is that SOQL has a 10,000-character limit.
Any suggestions here?
Thanks
It seems like you are attempting to perform some kind of search query. If so you might look into using a SOSL query as opposed to SOQL as long as the fields you are searching are indexed by SFDC.
Otherwise, I agree with Born2BeMild. Your second approach is better and breaking up your list of values into batches would help get around the limits.
It would also help if you described your use case in a bit more detail. Queries on a dynamic set of fields and values typically don't yield the best performance, even with the Bulk API. You are almost better off downloading the data to a local database and exploring the data that way.
You could break those down into batches of 200 or so values and iteratively query Salesforce to build up a result set in memory or process subsets of the data.
You would have to check the governor limits for the maximum number of SOQL queries though. You should be able to track your usage via the API at runtime to avoid going over the maximum.
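To make that concrete, here is a rough Java sketch of the batching approach; the object and field names and the executeQuery() call are placeholders for whichever object and API (Bulk API job, SOAP/REST query) you actually use:

import java.util.List;
import java.util.stream.Collectors;

// Sketch: split the field values into batches and build one SOQL IN query per batch.
public class BatchedSoqlQueries {

    private static final int BATCH_SIZE = 200; // illustrative, per the suggestion above

    public static void queryInBatches(List<String> fieldValues) {
        for (int i = 0; i < fieldValues.size(); i += BATCH_SIZE) {
            List<String> batch = fieldValues.subList(i, Math.min(i + BATCH_SIZE, fieldValues.size()));
            String inClause = batch.stream()
                    .map(v -> "'" + v.replace("'", "\\'") + "'") // quote and escape each value
                    .collect(Collectors.joining(","));
            String soql = "SELECT Id, Field__c FROM Object__c WHERE Field__c IN (" + inClause + ")";
            // executeQuery(soql); // placeholder: submit via Bulk API / SOAP / REST and collect the Ids
        }
    }
}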
The problem is that you are hitting the governor limits. Salesforce can only process 200 records at a time if they are coming from the database. Therefore, to be able to work with all of these records, you first need to add them to a list, for example:
List<Account> accounts = [SELECT Id, Name FROM Account];
Then you can work with the accounts list, do everything you need to do with it, and when you are done you can update the database using:
update accounts;
this link might be helpful:
https://help.salesforce.com/apex/HTViewSolution?id=000004410&language=en_US

Strategy for locale sensitive sort with pagination

I work on an application that is deployed on the web. Part of the app is search functions where the result is presented in a sorted list. The application targets users in several countries using different locales (= sorting rules). I need to find a solution for sorting correctly for all users.
I currently sort with ORDER BY in my SQL query, so the sorting is done according to the locale (LC_COLLATE) set for the database. These rules are incorrect for users with a locale different from the one set for the database.
Also, to further complicate the issue, I use pagination in the application, so when I query the database I ask for rows 1 - 15, 16 - 30, etc. depending on the page I need. However, since the sorting is wrong, each page contains entries that are incorrectly sorted. In a worst-case scenario, the entire result set for a given page could be out of order, depending on the locale/sorting rules of the current user.
If I were to sort in (server-side) code, I would need to retrieve all rows from the database and then sort them. This results in a tremendous performance hit given the amount of data, so I would like to avoid it.
Does anyone have a strategy (or even technical solution) for attacking this problem that will result in correctly sorted lists without having to take the performance hit of loading all data?
Tech details: The database is PostgreSQL 8.3, the application an EJB3 app using EJB QL for data query, running on JBoss 4.5.
Are you willing to develop a small Postgres custom function module in C? (Probably only a few days for an experienced C coder.)
strxfrm() is the function that transforms the language-dependent text string based on the current LC_COLLATE setting (more or less the current language) into a transformed string that results in proper collation order in that language if sorted as a binary byte sequence (e.g. strcmp()).
If you implement this for Postgres, say it takes a string and a collation order, then you will be able to order by strxfrm(textfield, collation_order). I think you can then even create multiple functional indexes on your text column (say one per language) using that function to store the results of the strxfrm() so that the optimizer will use the index.
Alternatively, you could join the Postgres developers in implementing this in mainstream Postgres. Here are the wiki pages about these issues: Collation, ICU (which is also used by Java, as far as I know).
Alternatively, as a less sophisticated solution if data input is only through Java, you could compute these strxfrm() values in Java (Java will probably have a different name for this concept) when you add the data to the database, and then let Postgres index and order by these precomputed values.
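For reference, the Java counterpart of strxfrm() is java.text.Collator together with CollationKey; a minimal sketch of precomputing a locale-aware binary sort key (the locale and strength are illustrative):

import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

// Sketch: compute a binary sort key that collates correctly for the given locale;
// store it alongside the text column and ORDER BY / index the key column instead.
public class SortKeys {

    public static byte[] sortKey(String text, Locale locale) {
        Collator collator = Collator.getInstance(locale); // e.g. new Locale("sv", "SE")
        collator.setStrength(Collator.TERTIARY);          // case- and accent-sensitive
        CollationKey key = collator.getCollationKey(text);
        return key.toByteArray();                         // e.g. store in a bytea column
    }
}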
How tied are you to PostgreSQL? The documentation isn't promising:
The nature of some locale categories is that their value has to be fixed for the lifetime of a database cluster. That is, once initdb has run, you cannot change them anymore. LC_COLLATE and LC_CTYPE are those categories. They affect the sort order of indexes, so they must be kept fixed, or indexes on text columns will become corrupt. PostgreSQL enforces this by recording the values of LC_COLLATE and LC_CTYPE that are seen by initdb. The server automatically adopts those two values when it is started.
(Collation rules define how text is sorted.)
Google turns up a patch under discussion:
PostgreSQL currently only supports one collation at a time, as fixed by the LC_COLLATE variable at the time the database cluster is initialised.
I'm not sure I'd want to manage this outside the database, though I'd be interested in reading about how it can be done. (Anyone wanting a good technical overview of the issues should check out Sorting Your Linguistic Data inside the Oracle Database on the Oracle globalization site.)
I don't know of any way to switch the collation used by the database's ORDER BY per request. Therefore, one has to consider other solutions.
If the number of results is really big (hundreds of thousands?), I have no solution except showing only the number of results and asking the user to make a more precise request. Otherwise, handling it on the server side could work, depending on the precise conditions.
In particular, using a cache could improve things tremendously. The first (unlimited) request to the database would not be much slower than a query limited in the number of results, and subsequent requests would be much faster. Paging and re-sorting often lead to several requests, so the cache would work well (even with a duration of a few minutes).
I use EhCache as a technical solution.
Sorting and paging go together: sort first, then page.
The raw results could be kept in the cache.
To reduce the performance hit, some hints:
you can run the query once to get the result set size, and warn the user if there are too many results (either ask them to confirm a slow query, or to add some selection criteria)
only request the columns you need and drop all other columns (often some data is not shown immediately for all results but only displayed on mouse-over, for example; this data can be requested lazily, only as needed, thereby reducing the columns requested for all results)
if you have computed values, cache whichever is smaller: the database columns or the computed values
if you have values repeated across multiple results, you can request that data/those columns separately (so you retrieve it from the database once and cache it only once), and retrieve only a key (typically an id) in the main request.
You might want to check out this package: http://www.fi.muni.cz/~adelton/l10n/postgresql-nls-string/. It hasn't been updated in a long time, and may not work anymore, but it seems like a reasonable starting point if you want to build a function that can do this for you.
This module is broken for Postgres 8.4.3. I fixed it - you can download the fixed version from http://www.itreport.eu/__cw_files/.01/.17/.ee7844ba6716aa36b19abbd582a31701/nls_string.c and you'll have to compile and install it by hand (as described in the README and INSTALL from the original module), but even so, sorting still works incorrectly. I tried it on FreeBSD 8.0 with LC_COLLATE set to cs_CZ.UTF-8.
