Apache Drill "limit 0" query while using Spring datasource - java

TL;DR
I have a Spring Boot application that makes use of parquet files stored on the file system. To access them we are using Apache Drill.
Since I have multiple users that might access them, I've set up a connection pool in Spring.
When I'm using the connection pool, Drill somehow executes a "limit 0" query before executing my actual query, and this affects performance. The same "limit 0" query is NOT executed when I run my queries through a simple Statement obtained from a direct Connection.
This seems to be related to the fact that Spring JdbcTemplate makes use of PreparedStatements instead of simple Statements.
Is there a way to get rid of those "limit 0" queries?
-- Details --
The connection pool in the Spring configuration class looks like this:
@Bean
@ConfigurationProperties(prefix = "datasource.parquet")
@Qualifier("parquetDataSource")
public DataSource parquetDataSource() {
    return DataSourceBuilder.create().build();
}
And the corresponding properties in the development profile YML file are:
datasource:
  parquet:
    url: jdbc:drill:drillbit=localhost:31010
    jdbcUrl: jdbc:drill:drillbit=localhost:31010
    jndiName: jdbc/app_parquet
    driverClassName: org.apache.drill.jdbc.Driver
    maximumPoolSize: 5
    initialSize: 1
    maxIdle: 10
    maxActive: 20
    validation-query: SELECT 1 FROM sys.version
    test-on-borrow: true
When I execute a query using the JdbcTemplate created with the mentioned Drill DataSource, 3 different queries might be executed:
the validation query SELECT 1 FROM sys.version;
a "limit 0" query that looks like SELECT * FROM (<my actual query>) LIMIT 0;
my actual query.
Here's the execution code (parquetJdbcTemplate is an instance of a class that extends org.springframework.jdbc.core.JdbcTemplate):
parquetJdbcTemplate.query(sqlQuery, namedParameters,
    resultSet -> {
        MyResultSet result = new MyResultSet();
        while (resultSet.next()) {
            // populate the "result" object
        }
        return result;
    });
Here's a screenshot from the Profile page of my Drill monitor:
The bottom query is the "limit 0" one, then in the middle you have the validation query and on top (even if the query is not shown) the actual query that returns the data I want.
As you can see, the "limit 0" query takes more than 1/3 of the entire execution time to run. The validation query is fine, since the execution time is negligible and it's needed to check the connection.
The fact is, when I execute the same query using a Connection through the Drill driver (thus, with no pool), I only see my actual query in the UI monitor:
public void executeQuery(String myQuery) throws SQLException, ClassNotFoundException {
    Class.forName("org.apache.drill.jdbc.Driver");
    try (Connection connection = DriverManager.getConnection("jdbc:drill:drillbit=localhost:31010");
         Statement st = connection.createStatement();
         ResultSet resultSet = st.executeQuery(myQuery)) {
        while (resultSet.next()) {
            // do stuff
        }
    }
}
As you can see, the total execution time improves by a lot (~14 seconds instead of ~26), just because the "limit 0" query is not executed.
As far as I know, those "limit 0" queries are executed to validate and get information about the underlying schema of the parquet files. Is there a way to disable them while using the connection pool? I would ideally like to keep using PreparedStatements rather than simple Statements, but I could switch to simple Statements if needed, because I have full control over those queries (so no SQL injection should be possible unless someone hacks the deployed artifacts).

You are right: Drill executes "limit 0" prior to prepared statements to get information about the schema. I don't think there is a way to disable this behavior. However, I can recommend enabling the planner.enable_limit0_optimization option, which is false by default; this may speed up "limit 0" query execution. Another way to speed up "limit 0" queries is to indicate the schema explicitly using casts, either through a view or directly in the queries.
Regarding the query not being shown, I think this was fixed in the latest Drill version.
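For reference, Drill options like this are toggled with an ALTER SESSION (or ALTER SYSTEM) statement; a sketch of enabling the option mentioned above, run on the same connection before the actual queries:

```sql
ALTER SESSION SET `planner.enable_limit0_optimization` = true;
```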

Related

How to get postgres schema from running query in Java

assume you do a query on the pg_stat_activity table and you get an example result with these columns:
datid, datname, pid, usesysid, usename, application_name, client_addr, client_hostname, client_port, backend_start, xact_start, query_start, state_change, wait_event_type, wait_event, state, backend_xid, backend_xmin, query, backend_type
Besides a few rdsadmin / "PostgreSQL JDBC Driver" background rows (values omitted here), the row of interest looks like:
datid: 16409, datname: c-t-s, pid: 21143, usesysid: 16410, usename: c-t-s, application_name: c-t-s, client_addr: 10.10.3.1, client_port: 48037, backend_start: 2021-01-18 13:19:03, query_start: 2021-01-18 13:31:23, state_change: 2021-01-18 13:31:23, wait_event_type: Client, wait_event: ClientRead, state: idle, query: COMMIT, backend_type: client backend
I would like to know on which schema the query COMMIT was executed.
My case is that I have schema-based multitenancy and I would like to distinguish between schemas (tenants). We always make single-schema queries, so we don't mix them. To achieve that we set search_path on each getConnection method invocation. The code is written in Java and we don't use schema names in queries, as the schema is always dynamic: taken from the current request context and set in the getConnection method.
With the current result I don't know which tenant (schema) is causing slow / long queries.
I have tried to select from pg_class by the ids taken from pg_stat_activity, but without luck.
So far the comments did not answer my problem. Is this possible at all?
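Since pg_stat_activity does expose application_name, one hedged workaround (a sketch, not from the original post; the class and method names are hypothetical) is to tag each borrowed connection with the tenant in application_name at the same point where search_path is already set, so slow queries can be attributed to a tenant:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class TenantConnectionTagger {

    // Builds the statements we run right after borrowing a connection.
    // Identifiers are quoted naively; assumes tenant names are validated upstream.
    public static String searchPathSql(String tenant) {
        return "SET search_path TO \"" + tenant + "\"";
    }

    public static String applicationNameSql(String tenant) {
        return "SET application_name = 'app:" + tenant + "'";
    }

    // Hypothetical hook: call this wherever getConnection() already sets search_path.
    public static void tagConnection(Connection conn, String tenant) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute(searchPathSql(tenant));
            st.execute(applicationNameSql(tenant));
        }
    }

    public static void main(String[] args) {
        System.out.println(searchPathSql("tenant_a"));
        System.out.println(applicationNameSql("tenant_a"));
    }
}
```

With this in place, the application_name column in pg_stat_activity would read "app:tenant_a" for that tenant's sessions.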

How to use fetch size in JdbcTemplate to process 20+ million rows?

I have a table with 20+ million rows and I can't select all rows using a single query because of an OutOfMemoryError. I read about the fetchSize attribute, and it looks like it might help to resolve my issue, since it is common advice.
But I have a question about how to apply it.
I have the following code:
private final JdbcTemplate jdbcTemplate;
...
jdbcTemplate.setFetchSize(1000);
List<MyTable> myList = this.jdbcTemplate.query(
    "SELECT * FROM my_table",
    new Object[]{},
    MyTableMapper.INSTANCE
);
myList.forEach(obj -> processAndSave(obj));
Looks like the JDBC driver will select 1000 rows per request. But what should I do to process all 20+ million rows?
Should I invoke jdbcTemplate.query several times?
You can use "Getting results based on a cursor" if you are using PostgreSQL 11.
Requirements:
setFetchSize > 0
make sure autocommit is off
For your example, with the standard configuration:
spring.datasource.hikari.auto-commit=false
spring.jdbc.template.fetch-size=50
Also, if you want to process the result outside of a repository class, you can combine this with queryForStream:
Stream<MyTable> stream = jdbcTemplate.queryForStream(SQL, MyTableMapper.INSTANCE);
PostgreSQL Documentation:
Issuing a Query and Processing the Result
https://jdbc.postgresql.org/documentation/head/query.html

Consistency level ALL used while statement has consistency level TWO defined

We use the Java DataStax Cassandra driver 2.1.2. The Cassandra version we use is 2.0.9.
We have a statement which we build with QueryBuilder, and we set the consistency level on the statement to TWO explicitly.
Select selectStatement = QueryBuilder.select().from(ARTICLES);
selectStatement.where(eq(ORGANIZATION_ID, organizationId));
selectStatement.setConsistencyLevel(ConsistencyLevel.TWO);
final ResultSet rs = session.execute(selectStatement);
// call to all() will be removed since it is enough to iterate over the result set,
// and then you get pagination for free instead of loading everything into memory
List<Row> rows = rs.all();
for (final Row row : rows) {
    // do something with Row, convert to POJO
}
We get an exception like this:
com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency ALL (3 responses were required but only 2 replica responded)
com.datastax.driver.core.exceptions.ReadTimeoutException.copy (ReadTimeoutException.java:69)
com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException (DefaultResultSetFuture.java:259)
com.datastax.driver.core.ArrayBackedResultSet$MultiPage.prepareNextRow (ArrayBackedResultSet.java:279)
com.datastax.driver.core.ArrayBackedResultSet$MultiPage.isExhausted (ArrayBackedResultSet.java:239)
com.datastax.driver.core.ArrayBackedResultSet$1.hasNext (ArrayBackedResultSet.java:122)
com.datastax.driver.core.ArrayBackedResultSet.all (ArrayBackedResultSet.java:111)
I know that calling all() on the ResultSet loads all articles for the organization into memory, which creates load on Cassandra. This will be removed, as noted in the comments. It can cause a read timeout, but I am still puzzled why the exception message says ALL.
The question is why the exception reports consistency level ALL when we set it to TWO on the original statement. Is all() internally doing something with the query and using CL ALL by default?
Your problem is almost certainly https://issues.apache.org/jira/browse/CASSANDRA-7947. You are seeing an error message from a failed read repair; it is unrelated to your original consistency level. This is fixed in 2.1.3+.

how to print Statement (CallableStatement) in Java?

How can I print this OracleCallableStatement?
ocstmt = (OracleCallableStatement) connection.prepareCall("{?= call package.method(id => ?, name=>?)}");
ocstmt.registerOutParameter(1, OracleTypes.CURSOR);
ocstmt.setInt(2, obj.getId());
ocstmt.setString(3, obj.getName());
ocstmt.execute();
resultSet = ocstmt.getCursor(1);
What I mean is: how can I know what query goes to the database, and how can I print it? Sometimes it gives me an error like "Wrong type", which is why I want to look at the query.
Are you using log4j?
If so, add loggers for sql like below.
log4j.logger.java.sql.Connection=DEBUG
log4j.logger.java.sql.Statement=DEBUG
log4j.logger.java.sql.PreparedStatement=DEBUG
log4j.logger.java.sql.ResultSet=DEBUG
If you are using a ORM framework such as ibatis, you could add additional logger like below.
log4j.logger.com.ibatis.sqlmap.engine.impl.SqlMapClientDelegate=DEBUG
Yes, you can do this. You can either wrap your callable statement in a proxy that can substitute the actual values when you print it (and show the SQL), or hunt around for a driver that has a meaningful toString (see the JavaWorld article). There is also p6spy, and others.
Stored procedures are harder, but still doable.
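A minimal sketch of the proxy idea mentioned above (names are hypothetical; the no-op delegate stands in for a real driver statement, for demo purposes only): every setXxx(index, value) call is recorded, and toString() renders the SQL together with the bound values.

```java
import java.lang.reflect.Proxy;
import java.sql.CallableStatement;
import java.util.Map;
import java.util.TreeMap;

public class LoggingStatementProxy {

    // Demo-only stand-in for a real driver statement; every call is a no-op.
    public static CallableStatement noOpStatement() {
        return (CallableStatement) Proxy.newProxyInstance(
                CallableStatement.class.getClassLoader(),
                new Class<?>[] { CallableStatement.class },
                (proxy, method, args) -> null);
    }

    // Wraps a CallableStatement so parameter bindings are captured;
    // all other calls are forwarded to the wrapped statement.
    public static CallableStatement wrap(CallableStatement target, String sql) {
        Map<Integer, Object> params = new TreeMap<>();
        return (CallableStatement) Proxy.newProxyInstance(
                CallableStatement.class.getClassLoader(),
                new Class<?>[] { CallableStatement.class },
                (proxy, method, args) -> {
                    if (method.getName().equals("toString")) {
                        return sql + " -- params=" + params;
                    }
                    if (method.getName().startsWith("set") && args != null
                            && args.length >= 2 && args[0] instanceof Integer) {
                        params.put((Integer) args[0], args[1]);
                    }
                    return method.invoke(target, args);
                });
    }

    public static void main(String[] args) throws Exception {
        CallableStatement logged = wrap(noOpStatement(),
                "{?= call package.method(id => ?, name => ?)}");
        logged.setInt(2, 42);
        logged.setString(3, "john");
        System.out.println(logged);
        // prints: {?= call package.method(id => ?, name => ?)} -- params={2=42, 3=john}
    }
}
```

In real code you would pass the statement returned by connection.prepareCall(sql) as the target instead of the no-op stand-in.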
You can't get the SQL by printing the Statement.
Is the example you posted one of the "sometimes" that triggers the error?
Why do you have to case this to an OracleCallableStatement? What part of the call is not the standard CallableStatement?
In general, use myObject.toString() to see what it prints, though you may or may not be able to see the full query. If you can't get it to work, the first thing I would look at is the API documentation (the Javadocs for the Oracle library or driver you're using).
I'm not sure if I understand the question, but it seems like you want to see this:
String sql = "{?= call package.method(id => ?, name=>?)}";
System.out.println(sql);
ocstmt = (OracleCallableStatement) connection.prepareCall(sql);
...
It's possible to use a proxy JDBC driver to log all JDBC database actions. Such a driver can print all statements with their values, and all results.
My solution is to use ProxyDataSourceBuilder (I use it in a Spring Boot project).
@Bean
public DataSource dataSource() {
    SLF4JQueryLoggingListener loggingListener = new SLF4JQueryLoggingListener();
    return ProxyDataSourceBuilder
            .create(datasource)
            .listener(loggingListener)
            .build();
}
...
@Autowired
private DataSource dataSource;

Connection connection = dataSource.getConnection();
ocstmt = (OracleCallableStatement) connection.prepareCall("{?= call package.method(id => ?, name=>?)}");
And just turn on logging in application.yml:
logging:
  level:
    net.ttddyy.dsproxy.listener.logging: debug
Try
> java -classpath ojdbc8.jar oracle.jdbc.driver.OracleSql false false "<your sql here>"
That will print the SQL that the driver sends to the database, among other things. This is not documented nor supported, but it's been around forever.
This might be useful if you are checking whether the executed statement returned a result set or not:
System.out.println("value : " + ocstmt.execute());
i.e. the "false" returned by CallableStatement.execute() means that the JDBC statement didn't read any rows (so there's no ResultSet to read). This is what you'd expect, as stored procedures don't directly return ResultSets; they only return values for any OUT or INOUT parameters.

What does Statement.setFetchSize(nSize) method really do in SQL Server JDBC driver?

I have this really big table with some millions of records added every day, and at the end of every day I extract all the records of the previous day. I am doing this like:
String SQL = "select col1, col2, coln from mytable where timecol = yesterday";
Statement.executeQuery(SQL);
The problem is that this program takes about 2GB of memory, because it loads all the results into memory and then processes them.
I tried setting Statement.setFetchSize(10), but it takes exactly the same amount of memory from the OS; it makes no difference. I am using the Microsoft SQL Server 2005 JDBC Driver.
Is there any way to read the results in small chunks like the Oracle database driver does when the query is executed to show only a few rows and as you scroll down more results are shown?
In JDBC, the setFetchSize(int) method is very important to performance and memory-management within the JVM as it controls the number of network calls from the JVM to the database and correspondingly the amount of RAM used for ResultSet processing.
Inherently if setFetchSize(10) is being called and the driver is ignoring it, there are probably only two options:
Try a different JDBC driver that will honor the fetch-size hint.
Look at driver-specific properties on the Connection (URL and/or property map when creating the Connection instance).
The RESULT-SET is the number of rows marshalled on the DB in response to the query.
The ROW-SET is the chunk of rows that are fetched out of the RESULT-SET per call from the JVM to the DB.
The number of these calls and resulting RAM required for processing is dependent on the fetch-size setting.
So if the RESULT-SET has 100 rows and the fetch-size is 10,
there will be 10 network calls to retrieve all of the data, using roughly 10*{row-content-size} RAM at any given time.
The default fetch-size is 10, which is rather small.
In the case posted, it would appear the driver is ignoring the fetch-size setting, retrieving all data in one call (large RAM requirement, optimum minimal network calls).
What happens underneath ResultSet.next() is that it doesn't actually fetch one row at a time from the RESULT-SET. It fetches that from the (local) ROW-SET and fetches the next ROW-SET (invisibly) from the server as it becomes exhausted on the local client.
All of this depends on the driver as the setting is just a 'hint' but in practice I have found this is how it works for many drivers and databases (verified in many versions of Oracle, DB2 and MySQL).
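The round-trip arithmetic described above can be sketched as plain ceiling division (hypothetical helper, just to make the numbers concrete):

```java
public class FetchMath {
    // Number of JVM->DB round trips needed to drain a result set of
    // resultRows rows with a given fetch size (ceiling division).
    static long networkCalls(long resultRows, int fetchSize) {
        return (resultRows + fetchSize - 1) / fetchSize;
    }

    public static void main(String[] args) {
        // 100-row RESULT-SET with fetch-size 10, as in the example above
        System.out.println(networkCalls(100, 10)); // prints 10
    }
}
```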
The fetchSize parameter is a hint to the JDBC driver as to many rows to fetch in one go from the database. But the driver is free to ignore this and do what it sees fit. Some drivers, like the Oracle one, fetch rows in chunks, so you can read very large result sets without needing lots of memory. Other drivers just read in the whole result set in one go, and I'm guessing that's what your driver is doing.
You can try upgrading your driver to the SQL Server 2008 version (which might be better), or the open-source jTDS driver.
You need to ensure that auto-commit on the Connection is turned off, or setFetchSize will have no effect.
dbConnection.setAutoCommit(false);
Edit: Remembered that when I used this fix it was Postgres-specific, but hopefully it will still work for SQL Server.
Statement interface doc:
SUMMARY: void setFetchSize(int rows)
Gives the JDBC driver a hint as to the number of rows that should be fetched from the database when more rows are needed.
Read the ebook "J2EE and Beyond" by Art Taylor.
Sounds like the MSSQL JDBC driver is buffering the entire result set for you. You can add a connection string parameter saying selectMethod=cursor or responseBuffering=adaptive. If you are on version 2.0+ of the 2005 MSSQL JDBC driver, then response buffering should default to adaptive.
http://msdn.microsoft.com/en-us/library/bb879937.aspx
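As a sketch (host, port and database name are placeholders; the property names are taken from the Microsoft driver's documented connection properties), the connection string might look like:

```
jdbc:sqlserver://localhost:1433;databaseName=mydb;selectMethod=cursor;responseBuffering=adaptive
```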
It sounds to me that you really want to limit the rows being returned by your query and page through the results. If so, you can do something like this (Oracle ROWNUM syntax; on SQL Server you would use ROW_NUMBER() OVER (...) instead):
select * from (select rownum myrow, a.* from TEST1 a)
where myrow between 5 and 10;
You just have to determine your boundaries.
Try this:
String SQL = "select col1, col2, coln from mytable where timecol = yesterday";
connection.setAutoCommit(false);
PreparedStatement stmt = connection.prepareStatement(SQL, SQLServerResultSet.TYPE_SS_SERVER_CURSOR_FORWARD_ONLY, SQLServerResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(2000);
stmt.set....
stmt.execute();
ResultSet rset = stmt.getResultSet();
while (rset.next()) {
    // ......
}
I had the exact same problem in a project. The issue is that even though the fetch size might be small enough, the JdbcTemplate reads the entire result of your query and maps it into a huge list, which might blow your memory. I ended up extending NamedParameterJdbcTemplate to create a function which returns a Stream of objects. That Stream is based on the ResultSet normally returned by JDBC, but will pull data from the ResultSet only as the Stream requires it. This will work if you don't keep a reference to all the objects this Stream spits out. I took a lot of inspiration from the implementation of org.springframework.jdbc.core.JdbcTemplate#execute(org.springframework.jdbc.core.ConnectionCallback). The only real difference has to do with what to do with the ResultSet. I ended up writing this function to wrap up the ResultSet:
private <T> Stream<T> wrapIntoStream(ResultSet rs, RowMapper<T> mapper) {
    CustomSpliterator<T> spliterator = new CustomSpliterator<>(rs, mapper, Long.MAX_VALUE,
            Spliterator.NONNULL | Spliterator.IMMUTABLE | Spliterator.ORDERED);
    return StreamSupport.stream(spliterator, false);
}

private static class CustomSpliterator<T> extends Spliterators.AbstractSpliterator<T> {
    // constructor and fields (rs, mapper, rowNumber) omitted here;
    // the idea is to pull from the ResultSet and feed into the Stream
    @Override
    public boolean tryAdvance(Consumer<? super T> action) {
        try {
            // you can add some logic to close the Stream/ResultSet automatically
            if (rs.next()) {
                T mapped = mapper.mapRow(rs, rowNumber++);
                action.accept(mapped);
                return true;
            } else {
                return false;
            }
        } catch (SQLException e) {
            // do something with this exception
            throw new RuntimeException(e);
        }
    }
}
You can add some logic to make that Stream "auto-closeable"; otherwise, don't forget to close it when you are done.
