Context
Our container cluster is located in us-east1-c
We are using the following Java library: google-cloud-bigquery, 0.9.2-beta
Our dataset has around 26M rows and represents ~10 GB
All of our queries return less than 100 rows as we are always grouping on a specific column
Question
We analyzed the last 100 queries executed in BigQuery; they were all executed in about 2-3 seconds (we measured this by calling bq --format=prettyjson show -j JOBID and computing end time - creation time).
In our Java logs, though, most of the calls to bigquery.query block for 5-6 seconds (and 10 seconds is not out of the ordinary). What could explain the systematic gap between the query finishing in the BigQuery cluster and the results being available in Java? I know 5-6 seconds isn't astronomical, but I am curious whether this is normal behaviour when using the Java BigQuery cloud library.
I haven't dug deep enough to analyze the outbound calls with Wireshark. All our tests were executed in our container cluster (Kubernetes).
Code
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.QueryRequest;
import com.google.cloud.bigquery.QueryResponse;

QueryRequest request = QueryRequest.newBuilder(sql)
        .setMaxWaitTime(30000L)   // block up to 30 s waiting for the job to complete
        .setUseLegacySql(false)
        .setUseQueryCache(false)
        .build();
QueryResponse response = bigquery.query(request);
Thank you
Just looking at the code briefly here:
https://github.com/GoogleCloudPlatform/google-cloud-java/blob/master/google-cloud-bigquery/src/main/java/com/google/cloud/bigquery/BigQueryImpl.java
It appears that there are multiple potential sources of delay:
Getting query results
Restarting (there are some automatic restarts in there that can explain the delay spikes)
The frequency of checking for new results
It sounds like looking at Wireshark would give you a precise answer of what is happening.
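If you want to take the library's internal waiting out of the equation, one option is to issue the query and poll for completion yourself, so the check interval is under your control rather than the library's. A minimal sketch against the 0.9.x beta API used in the question; the 250 ms interval is an arbitrary choice:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.QueryRequest;
import com.google.cloud.bigquery.QueryResponse;

QueryResponse runAndWait(BigQuery bigquery, QueryRequest request) throws InterruptedException {
    QueryResponse response = bigquery.query(request);
    // query() can return before the job has finished; poll until it has,
    // at an interval we choose instead of the library's default.
    while (!response.jobCompleted()) {
        Thread.sleep(250L);
        response = bigquery.getQueryResults(response.getJobId());
    }
    return response;
}

Timing the query() call separately from the polling loop should also tell you which part of the 5-6 seconds is spent where.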
Related
UPDATE: Added Read/Write Throughput, IOPS, and Queue-Depth graphs and marked each graph at the point in time where the errors I describe started.
NOTE: I am just looking for suggestions about what could possibly be causing this issue from experienced DBAs or database developers (or anyone with relevant knowledge). Some of the logs/data I have are sensitive, so I cannot repost them here, but I did my best to provide screenshots and data from my debugging so people can help. Thank you.
Hello, I have a Postgres RDS database (engine version 12.7) hosted on Amazon (AWS). This database is "hit" or called by an API client (Spring Boot/Web/Hibernate/JPA Java API) thousands of times per hour. It only executes one Hibernate SQL query on the backend, which runs against a Postgres view across 5 tables. The DB instance (class = db.m5.2xlarge) specs are:
8 vCPU
32 GB RAM
Provisioned IOPS SSD Storage Type
800 GiB Storage
15000 Provisioned IOPS
The issue I am seeing is that on Saturdays I wake up to many logged JDBCConnectionExceptions, and my API Docker containers (defined as a Service/Task on ECS), which are hosted on AWS Elastic Container Service (ECS), start failing and returning an HTTP 503 error, e.g.
org.springframework.dao.DataAccessResourceFailureException: Unable to acquire JDBC Connection; nested exception is org.hibernate.exception.JDBCConnectionException: Unable to acquire JDBC Connection
Upon checking the AWS RDS DB status, I can also see the sessions/connections increase dramatically, as seen in the image below with ~600 connections. The count keeps increasing and doesn't seem to stop.
Upon checking the postgres database pg_locks and pg_stat_activity tables when I started getting all these JDBCConnectionExceptions and the DB Connections jumped to around ~400 (at this specific time), I did indeed see many of my API queries logged with interesting statuses. I exported the data to CSV and have included an excerpt below:
wait_event_type  wait_event        state   count in pg_stat_activity  query
---------------  ----------------  ------  -------------------------  ------------------------------------------------------
IO               DataFileRead      active  480                        SELECT * ... FROM ... query from API on Postgres view
IO               DataFileRead      idle    13                         SELECT * ... FROM ... query from API on Postgres view
IO               DataFilePrefetch  active  57                         SELECT * ... FROM ... query from API on Postgres view
IO               DataFilePrefetch  idle    2                          SELECT * ... FROM ... query from API on Postgres view
Client           ClientRead        idle    196                        SELECT * ... FROM ... query from API on Postgres view
Client           ClientRead        active  10                         SELECT * ... FROM ... query from API on Postgres view
LWLock           BufferIO          idle    1                          SELECT * ... FROM ... query from API on Postgres view
LWLock           BufferIO          active  7                          SELECT * ... FROM ... query from API on Postgres view
If I look at pg_stat_activity when my API and DB are running and stable, the majority of the rows from the API query are simply in the Client / ClientRead / idle state, so I feel something is wrong here.
You can see the "performance metrics" below for the DB at the time this happened (roughly 19:55 UTC or 2:55PM CST): DataFileRead and DataFilePrefetch are astronomically high and keep increasing, which backs up the pg_stat_activity data I posted above. Also, as I stated above, during normal, stable DB use the API queries simply sit in the Client / ClientRead / idle state in pg_stat_activity, so the numerous DataFileRead/DataFilePrefetch IO waits and ExclusiveLocks confuse me.
I don't expect anyone to debug this for me, but I would appreciate it if a DBA or someone who has experienced something similar could help narrow down the issue. I honestly wasn't sure whether it was an API query taking too long (which wouldn't make sense, because the API has been running stably for years), something running on the Postgres DB without my knowledge on Saturdays (I really suspect something like this is going on), or a bad PostgreSQL query coming into the DB that locks up resources and causes a deadlock (which doesn't completely make sense to me either, as I've read that Postgres resolves deadlocks on its own). Also, as I stated before, all the API calls that run an SQL query on the backend are just doing SELECT ... FROM ... on a Postgres view, and from what I understand, concurrent SELECTs can coexist with ExclusiveLocks.
I would appreciate any advice or suggestions about possible causes of this issue.
Read Throughput (the first JDBCConnectionException occurred around 2:58PM CST or 14:58, so I marked the graph where read throughput starts to drop, since the DB queries are timing out and the API containers are failing)
Write Throughput (the API only reads, so I'm assuming the spikes here are from writes to the RDS replica to keep it in sync)
Total IOPS (IOPS gradually increase from the morning, i.e. 8AM, which is expected as API calls were increasing; however, the total API call counts match other days with zero issues, so this doesn't really point to the cause)
Queue Depth (you can see where I marked the graph; the spike is exactly around 14:58 or 2:58PM, when the first JDBCConnectionExceptions start occurring, API queries start timing out, and DB connections start to increase exponentially)
EBS IO Balance (the burst balance basically dropped to 0 at this time as well)
Performance Insights (DataFileRead, DataFilePrefetch, buffer_io, etc.)
This just looks like your app server is getting more and more demanding and the database can't keep up. Most of the rest of your observations are just a natural consequence of that. Why it is happening is probably best investigated from the app server, not from the database server. Either it is making more and more requests, or each one takes more IO to fulfill. (You could maybe fix this on the database side by making the query more efficient, e.g. by adding a missing index, but that would require you to share the query and/or its execution plan.)
It looks like your app server is configured to maintain 200 connections at all times, even if almost all of them are idle. So, that is what it does.
And that is what the ClientRead wait_event means: the connection is just sitting there idle, trying to read the next request from the client but not getting one. There are probably a handful of other connections that are actively receiving and processing requests, doing all the real work while occupying only a small fraction of pg_stat_activity. All of those extra idle connections aren't doing any good, but they probably aren't doing any real harm either, other than making pg_stat_activity look untidy and confusing you.
But once the app server starts generating requests faster than they can be serviced, the in-flight requests start piling up, and the app server is configured to keep adding more and more connections. You can't bully the disk drives into delivering more throughput just by opening more connections (at least not once you have passed the point where they are fully saturated). So the more active connections you have, the more ways the same amount of IO gets divided, and the slower each one gets. Having these 700 extra connections all waiting isn't going to make the data arrive faster. Having more connections isn't doing any good, and is probably doing some harm, as it creates contention, and dealing with contention is itself a resource drain.
The ExclusiveLocks you mention are probably the locks each active session has on its own transaction ID. They wouldn't be a cause of problems, just an indication you have a lot of active sessions.
The BufferIO is what you get when two sessions want the exact same data at the same time. One asks for the data (DataFileRead) and the other asks to be notified when the first one is done (BufferIO).
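If the pool really is holding ~200 connections open at all times, capping it far below that usually helps more than it hurts once the storage is saturated, since fewer active sessions are dividing the same IO. A rough sketch, assuming the app uses HikariCP (Spring Boot's default pool); the JDBC URL, credentials, and numbers are placeholders to adjust:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:postgresql://your-rds-endpoint:5432/yourdb"); // placeholder endpoint
config.setUsername("app_user");                                       // placeholder credentials
config.setPassword("secret");
config.setMaximumPoolSize(20);        // hard cap, far below the ~200 idle connections observed
config.setMinimumIdle(5);             // let the pool shrink when traffic is quiet
config.setConnectionTimeout(30_000);  // fail fast instead of letting waiters pile up
HikariDataSource dataSource = new HikariDataSource(config);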
Some things to investigate:
Query performance can degrade over time. The amount of data being requested can increase, especially for queries with date predicates. In Performance Insights you can see how many blocks are read (from disk/IO) versus hit (from the buffer cache); you want as many hits as possible. The loss of burst balance is a real indicator that something like this is happening. It's not an issue during the week because you have fewer requests.
The actual amount of shared_buffers you have to service these queries: the default is 25% of RAM, and you could tweak this higher; some say 40%. It's a dark art and you are unlikely to find an answer beyond tweak and test.
Vacuuming and analyzing your tables. The data comes from somewhere, right? With updates, deletes, and inserts, tables grow and fill up with dead tuples. At a certain point the autovacuum processes aren't enough at their default settings. You can tweak them to be more aggressive, fire them manually at night, etc.
Index management, same as above.
Autovacuum docs
Resource Consumption
Based on what you've shared I would guess your connections are not being properly closed.
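If connections are being leaked, the usual Java-side fix is to make sure every connection is closed on every code path; try-with-resources does that automatically. A generic sketch (the view name, dataSource, and id are placeholders, not taken from the question):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.sql.DataSource;

// Connection, statement, and result set are all closed automatically,
// even if executeQuery() or the row mapping throws.
try (Connection conn = dataSource.getConnection();
     PreparedStatement ps = conn.prepareStatement("SELECT * FROM my_view WHERE id = ?")) {
    ps.setLong(1, id);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            // map the row
        }
    }
}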
I am seeing some slow queries in the production environment, and I configured the slow log to capture information about them, like this (a query_string query with a 500ms timeout):
[2021-06-21T10:43:33,930][DEBUG][index.search.slowlog.query.xxxx] [xxx][g][2] took[1s], took_millis[1043], total_hits[424690], types[top], stats[], search_type[QUERY_THEN_FETCH], total_shards[6], source[{query_string}]
In this case, the query timeout is 500ms, and the took_millis in response is 1043ms.
As far as I know, the timeout only applies to the query phase itself, and the took value represents the execution time inside ES, excluding some external phases (see Query timing: 'took' value and what I'm measuring). I have two questions:
Firstly, why is there a 543ms gap (1043 - 500 = 543) between the timeout and took_millis?
Secondly, how can I find out in detail where the time between the timeout and the took_millis value was spent?
Thanks a lot!
Setting a timeout on a query doesn't ensure that the query is actually cancelled when its execution time surpasses that timeout. The Elasticsearch documentation states:
"By default, a running search only checks if it is cancelled or not on
segment boundaries, therefore the cancellation can be delayed by large
segments."
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search.html#global-search-timeout
Check issues 3627, 4586 and 2929
This can explain the 543ms gap between the timeout and took_millis: your query simply takes that long, and it is not cancelled in time.
To analyze the query execution and see what might be causing these long delays, you can rerun the query with the profile API. Note that if the slow execution of your query cannot be reproduced, this won't help you solve the issue. If your query runs fine most of the time, try to correlate these slow-running queries with external factors such as server load.
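For reference, here is a sketch of running a profiled search from Java, assuming the high-level REST client; the index name, query string, and client setup are assumptions, not taken from the question:

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

SearchSourceBuilder source = new SearchSourceBuilder()
        .query(QueryBuilders.queryStringQuery("your query string"))  // same query_string as the slow-log entry
        .profile(true);                                              // ask ES for a per-shard timing breakdown
SearchRequest request = new SearchRequest("your-index").source(source);
SearchResponse response = client.search(request, RequestOptions.DEFAULT);
// Each shard entry shows how long the individual query and collector phases took.
response.getProfileResults().forEach((shard, result) -> System.out.println(shard + " -> " + result));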
I am doing API load testing with JMeter. I have a Macbook Air (client) connected with ethernet to a machine being tested with the load (server).
I wanted to do a simple test: hit the server with 5 requests per second (RPS). I create a Concurrency Thread Group with 60 threads, a Throughput Shaping Timer set to 5 RPS for one minute, and my HTTP request, then hit the play button and run the test.
I expect my Hits per Second listener to show a flat line of 5 hits per second; instead I see a variable rate, starting at 5, then dropping to 2, then later rising to 4... Sometimes there are more than the specified 5 RPS (e.g. 6 RPS); the point is that it's not a constant 5. The rate is far too variable - it's all over the place. And I don't get any errors.
My server takes between 500ms and 3s to return an answer depending on how much load is present - this is what I am testing. What I want to verify with this test is that responses come back in roughly 500ms under load, and I am not getting that. I have to start wondering if it's JMeter's fault in some way, but that's a topic for another day.
When I replace my HTTP sample request with a dummy sampler, I get the RPS I desire.
I thought I had a problem with JMeter resources, so I changed the heap size to 1GB, added the -XX:+DisableExplicitGC and -d64 flags, and ran in CLI mode. I never got any errors, neither before setting the flags nor after. Also, I believe 5 RPS is a small number, so I don't expect resources to be the problem.
Something worth noting is that sometimes the threads start executing towards the end of the test rather than at the start; I find this very odd behaviour.
What's next? Time to move to a new tool?
My AppEngine project retrieves XML data from a particular link using the GAE URL fetch API. I have used the sample from here except that it looks like this:
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

URLConnection connection = new URL(url).openConnection();
connection.setConnectTimeout(0);  // 0 = no connect timeout
connection.setReadTimeout(0);     // 0 = no read timeout
InputStream stream = connection.getInputStream();
This takes more than 60 seconds (max allowed by the API) and hence causes a DeadlineExceededException. Using TaskQueues for the purpose is also not an option as mentioned here.
Is there any other way someone might have achieved this until now?
Thanks!
Task Queues can stay active longer than the App Engine automatic-scaling request deadline of 1 minute. Under automatic scaling, a task can run for 10 minutes; under basic or manual scaling, it can run for 24 hours. See the docs here. (Note that although the linked docs are for Python, the same limits apply to Java on GAE, Go, and PHP as well.)
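For example, handing the slow fetch off to the default push queue can look roughly like this; the handler URL and parameter name are made up for illustration:

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

// The servlet mapped to /fetch-xml performs the URL fetch and runs under the task
// deadline (10 minutes on automatic scaling) instead of the 60-second request deadline.
Queue queue = QueueFactory.getDefaultQueue();
queue.add(TaskOptions.Builder.withUrl("/fetch-xml").param("source", url));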
Finally, I have to echo what was said by the other users - the latency is almost certainly caused by the endpoint of your URL fetch, not by the network or app engine. You can also check this for sure by looking at your App Engine log lines for the failing requests. The cpu_millis field tells you how long the actual process GAE-side worked on the request, while the millis field will be the total time for the request. If the total time is much higher than the cpu time, it means the cost was elsewhere in the network.
It might be related to bandwidth exhaustion of multiple connections relative to the endpoint's limited resources. If the endpoint is muc2014.communitymashup.net/x3/mashup as you added in a comment, it might help to know that at the time I posted this comment, approx 1424738921 in unix time, the average latency (including full response, not just time to response beginning) on that endpoint was ~6 seconds, although that could feasibly go up to >60s given heavy load if no scaling system is set up for the endpoint. The observed latency is already quite high, but it might vary according to what kind of work needs to be done server-side, what volume of requests/data is being handled, etc.
The problem lay in the stream being consumed by a function from the EMF library, which took a lot of time (this wasn't the case previously).
Instead, loading the contents from the URL into a StringBuilder, converting that to a separate InputStream, and passing it to the function worked, with all of this being done in a cron job.
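A minimal sketch of that buffering approach, assuming the content is UTF-8 text; the charset and the downstream parser call are assumptions:

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

URLConnection connection = new URL(url).openConnection();
StringBuilder sb = new StringBuilder();
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        sb.append(line).append('\n');
    }
}
// Hand the fully buffered content to the parsing function as a fresh, in-memory stream.
InputStream buffered = new ByteArrayInputStream(sb.toString().getBytes(StandardCharsets.UTF_8));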
I am developing a Java application which will query tables that may hold over 1,000,000 records. I have tried everything I could to be as efficient as possible, but I am only able to achieve on average about 5,000 records a minute, with a maximum of 10,000 at one point. I have tried reverse engineering the Data Loader, and my code seems very similar, but still no luck.
Is threading a viable solution here? I have tried this but with very minimal results.
I have been reading and have applied every thing possible it seems (compressing requests/responses, threads etc.) but I cannot achieve data loader like speeds.
It seems that the queryMore method is the bottleneck.
Does anyone have any code samples or experiences they can share to steer me in the right direction?
Thanks
An approach I've used in the past is to query just for the IDs that you want (which makes the queries significantly faster). You can then parallelize the retrieve() calls across several threads.
That looks something like this:
[query thread] -> BlockingQueue -> [thread pool doing retrieve()] -> BlockingQueue
The first thread does query() and queryMore() as fast as it can, writing all the ids it gets into the BlockingQueue. As far as I know, queryMore() isn't something you should call concurrently, so there's no way to parallelize that step. You may wish to package the ids into bundles of a few hundred to reduce lock contention if that becomes an issue. A thread pool can then make concurrent retrieve() calls on those ids to get all the fields for the SObjects and put them in a queue for the rest of your app to deal with.
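A rough sketch of that pipeline using the WSC Partner API; the object name, field list, and worker count are illustrative assumptions, and each worker gets its own PartnerConnection since sharing one across threads isn't safe:

import com.sforce.soap.partner.PartnerConnection;
import com.sforce.soap.partner.QueryResult;
import com.sforce.soap.partner.sobject.SObject;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

void queryThenRetrieve(PartnerConnection queryConn,
                       List<PartnerConnection> workerConns,
                       BlockingQueue<SObject[]> results) throws Exception {
    BlockingQueue<List<String>> idBatches = new LinkedBlockingQueue<>();
    ExecutorService pool = Executors.newFixedThreadPool(workerConns.size());

    // Workers: each one owns its own connection and turns a batch of ids into full records.
    for (PartnerConnection conn : workerConns) {
        pool.submit(() -> {
            try {
                List<String> ids;
                while (!(ids = idBatches.take()).isEmpty()) {   // an empty batch means "stop"
                    results.put(conn.retrieve("Id, Name, Industry", "Account",
                                              ids.toArray(new String[0])));
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
    }

    // Producer: a single thread walks query()/queryMore(), asking for Ids only.
    QueryResult qr = queryConn.query("SELECT Id FROM Account");
    while (true) {
        List<String> ids = new ArrayList<>();
        for (SObject record : qr.getRecords()) {
            ids.add(record.getId());
        }
        idBatches.put(ids);
        if (qr.isDone()) {
            break;
        }
        qr = queryConn.queryMore(qr.getQueryLocator());
    }
    for (int i = 0; i < workerConns.size(); i++) {
        idBatches.put(new ArrayList<>());                       // one poison pill per worker
    }
    pool.shutdown();
}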
I wrote a Java library for using the SF API that may be useful. http://blog.teamlazerbeez.com/2011/03/03/a-new-java-salesforce-api-library/
With the Salesforce API, the batch size limit is what can really slow you down. When you use the query/queryMore methods, the maximum batch size is 2000. However, even though you may specify 2000 as the batch size in your SOAP header, Salesforce may be sending smaller batches in response. Their batch size decision is based on server activity as well as the output of your original query.
I have noticed that if I submit a query that includes any "text" fields, the batch size is limited to 50.
My suggestion would be to make sure your queries are only pulling the data that you need. I know a lot of Salesforce tables end up with a lot of custom fields that may not be needed for every integration.
Salesforce documentation on this subject
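With the WSC partner client, that batch-size hint in the SOAP header is set via the query options, along these lines (keeping in mind Salesforce treats it as an upper bound and may still return smaller batches):

import com.sforce.soap.partner.PartnerConnection;
import com.sforce.soap.partner.QueryResult;

connection.setQueryOptions(2000);   // QueryOptions SOAP header: requested batch size
QueryResult qr = connection.query("SELECT Id, Name FROM Account");
System.out.println("records in first batch: " + qr.getRecords().length);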
We have about 14,000 records in our Account object, and it takes quite some time to get all the records. I perform a query that takes about a minute, but SF only returns batches of no more than 500, even though I set the batch size to 2000. Each queryMore operation also takes from 45 seconds to a minute. This limitation is quite frustrating when you need to get bulk data.
Make use of the Bulk API to query any number of records from Java. I'm using it and it performs very effectively - you get the result within seconds. The string returned is comma separated. You can also keep batches of up to 10k records and get the results either as CSV (using OpenCSV) or directly as a String.
Let me know if you need help with the code.
Latency is going to be a killer in this type of situation, and the solution will be either multiple threads or asynchronous operations (using NIO). I would start by running 10 worker threads in parallel and see what difference it makes (assuming the back end supports simultaneous gets).
I don't have any concrete code or anything I can provide here, sorry - just painful experience with API calls going over high latency networks.