I have a program in Scala that connects to an Oracle database using ojdbc, queries a table, and tries to insert records from the java.sql.ResultSet into another table over a separate JDBC connection.
//conn1 to Oracle: java.sql.Connection = oracle.jdbc.driver.T4CConnection@698122b2
//conn2 to another non-Oracle database: java.sql.Connection = com.snowflake.client.jdbc.SnowflakeConnectionV1@6e4566f1
My attempt at capturing results from an Oracle table:
val stmt1 = conn1.createStatement()
stmt1.setFetchSize(3000)
val sql1 = "select userid from nex.users"
val result = stmt1.executeQuery(sql1)
and the code attempting to insert the records from result into a table in a separate database via JDBC:
val insert_sql = "insert into test.users (userid) values (?)"
val ps = conn2.prepareStatement(insert_sql)
val batchSize = 3000
var count = 0
while (result.next) {
  ps.setInt(1, result.getInt(1))
  ps.addBatch()
  count += 1
  if (count % batchSize == 0) ps.executeBatch()
}
What's stumping me is that this is almost exactly the same syntax as in many JDBC examples, yet in my second table I'm seeing 4x the original number of rows from the first table:
select userid, count(*) from test.users group by userid

USERID  COUNT(*)
1       4
2       4
3       4
4       4
5       4
6       4
...
Yes, clearBatch is missing.
executeBatch() calls clearBatch() at the end, but there is no guarantee that every implementation behaves exactly the same way.
Also, if needed, I am making a minor but subtle addition to tchoedak's answer :)
ps.executeBatch();
conn2.commit();
ps.clearBatch();
The issue was that I needed to execute ps.clearBatch() after every executeBatch(), otherwise the next batch would get piled on top of the previous one. When trying this on a large table that needed to call executeBatch() more often, the number of duplicate rows was many times higher. The final code looks similar, but with ps.clearBatch():
val ps = conn2.prepareStatement(insert_sql)
val batchSize = 3000
var count = 0
while (result.next) {
  ps.setInt(1, result.getInt(1))
  ps.addBatch()
  count += 1
  if (count % batchSize == 0) {
    ps.executeBatch()
    ps.clearBatch()
  }
}
ps.executeBatch() // flush the rows left over after the last full batch
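For completeness, here is the same end-to-end copy as a minimal Java sketch (a hedged translation of the Scala above, reusing the question's table and column names). Note the final executeBatch() after the loop, which flushes the rows left over after the last full batch, and the commit, which assumes auto-commit is disabled on the target connection:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class BatchCopy {
    // Copies userid values from the Oracle source to the target over JDBC.
    static void copyUserIds(Connection source, Connection target) throws SQLException {
        final int batchSize = 3000;
        try (Statement stmt = source.createStatement()) {
            stmt.setFetchSize(batchSize); // stream rows instead of buffering the whole table
            try (ResultSet rs = stmt.executeQuery("select userid from nex.users");
                 PreparedStatement ps = target.prepareStatement(
                         "insert into test.users (userid) values (?)")) {
                int count = 0;
                while (rs.next()) {
                    ps.setInt(1, rs.getInt(1));
                    ps.addBatch();
                    count += 1;
                    if (count % batchSize == 0) {
                        ps.executeBatch();
                        ps.clearBatch(); // defensive: not every driver clears the batch on execute
                    }
                }
                ps.executeBatch(); // flush the final, partially filled batch
                target.commit();   // assumes auto-commit is disabled on the target
            }
        }
    }
}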
I have performance problems when querying CLOB and LONG columns of big Oracle database tables.
So far, I have written the following unit tests with cx_Oracle (Python) and JDBC (Java):
Python code using cx_Oracle:
import time
import cx_Oracle
from unittest import TestCase


class CXOraclePerformanceTest(TestCase):

    def test_cx_oracle_performance_with_clob(self):
        self.execute_cx_oracle_performance("CREATE TABLE my_table (my_text CLOB)")

    def test_cx_oracle_performance_with_long(self):
        self.execute_cx_oracle_performance("CREATE TABLE my_table (my_text LONG)")

    def execute_cx_oracle_performance(self, create_table_statement):
        # prepare test data
        current_milli_time = lambda: int(round(time.time() * 1000))
        db = cx_Oracle.connect(CONNECT_STRING)
        db.cursor().execute(create_table_statement)
        db.cursor().execute("INSERT INTO my_table (my_text) VALUES ('abc')")
        for i in range(13):
            db.cursor().execute("INSERT INTO my_table (my_text) SELECT 'abc' FROM my_table")
        row_count = db.cursor().execute("SELECT count(*) FROM my_table").fetchall()[0][0]
        self.assertEqual(8192, row_count)
        # execute query with big result set
        timer = current_milli_time()
        rows = db.cursor().execute("SELECT * FROM my_table")
        for row in rows:
            self.assertEqual("abc", str(row[0]))
        timer = current_milli_time() - timer
        print("{} -> duration: {} ms".format(create_table_statement, timer))
        # clean-up
        db.cursor().execute("DROP TABLE my_table")
        db.close()
Java code using ojdbc7.jar:
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Date;

import oracle.jdbc.OracleConnection;
import org.junit.Assert;
import org.junit.Test;

public class OJDBCPerformanceTest {

    // connectionString is defined elsewhere in the test setup
    @Test public void testOJDBCPerformanceWithCLob() throws Exception {
        testOJDBCPerformance("CREATE TABLE my_table (my_text CLOB)");
    }

    @Test public void testOJDBCPerformanceWithLong() throws Exception {
        testOJDBCPerformance("CREATE TABLE my_table (my_text LONG)");
    }

    private void testOJDBCPerformance(String createTableStmt) throws Exception {
        // prepare connection
        OracleConnection connection = (OracleConnection) DriverManager.getConnection(connectionString);
        connection.setAutoCommit(false);
        connection.setDefaultRowPrefetch(512);
        // prepare test data
        Statement stmt = connection.createStatement();
        stmt.execute(createTableStmt);
        stmt.execute("INSERT INTO my_table (my_text) VALUES ('abc')");
        for (int i = 0; i < 13; i++)
            stmt.execute("INSERT INTO my_table (my_text) SELECT 'abc' FROM my_table");
        ResultSet resultSet = stmt.executeQuery("SELECT count(*) FROM my_table");
        resultSet.next();
        Assert.assertEquals(8192, resultSet.getInt(1));
        // execute query with big result set
        long timer = new Date().getTime();
        stmt = connection.createStatement();
        resultSet = stmt.executeQuery("SELECT * FROM my_table");
        while (resultSet.next())
            Assert.assertEquals("abc", resultSet.getString(1));
        timer = new Date().getTime() - timer;
        System.out.println(String.format("%s -> duration: %d ms", createTableStmt, timer));
        // clean-up
        stmt = connection.createStatement();
        stmt.execute("DROP TABLE my_table");
    }
}
Python test output:
CREATE TABLE my_table (my_text CLOB) -> duration: 31186 ms
CREATE TABLE my_table (my_text LONG) -> duration: 218 ms
Java test output:
CREATE TABLE my_table (my_text CLOB) -> duration: 359 ms
CREATE TABLE my_table (my_text LONG) -> duration: 14174 ms
Why is the difference between both durations so high?
What can I do to improve the performance in one or both programs?
Is there any Oracle specific option or parameter which I can use to improve the query performance?
To get the same performance as LONG, you need to tell cx_Oracle to fetch the CLOBs in that fashion. You can look at this sample:
https://github.com/oracle/python-cx_Oracle/blob/master/samples/ReturnLongs.py.
In your code, I added this method:
def output_type_handler(self, cursor, name, defaultType, size, precision, scale):
    if defaultType == cx_Oracle.CLOB:
        return cursor.var(cx_Oracle.LONG_STRING, arraysize=cursor.arraysize)
Then, after the connection to the database has been created, I added this code:
db.outputtypehandler = self.output_type_handler
With those changes, the performance is virtually identical.
Note that behind the scenes, cx_Oracle is using dynamic fetching and allocation. This method works very well for small CLOBs (where small generally means a few megabytes or less). In that case, the database can send the data directly, whereas when LOBs are used, just the locator is returned to the client and then another round trip to the database is required to fetch the data. As you can imagine, that significantly slows down the operation, particularly if the database and client are separated on the network!
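On the JDBC side, the same extra round trip can be avoided for small CLOBs through the connection property oracle.jdbc.defaultLobPrefetchSize, which tells the driver to send a chunk of LOB data along with the locator. A minimal sketch, assuming ojdbc7 supports it and using a placeholder connection string:

import java.sql.DriverManager;
import java.util.Properties;

import oracle.jdbc.OracleConnection;

public class LobPrefetchExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Prefetch up to 1 MB of LOB data together with each locator,
        // saving the extra round trip for CLOBs below that size.
        props.setProperty("oracle.jdbc.defaultLobPrefetchSize", "1048576");
        OracleConnection connection = (OracleConnection) DriverManager
                .getConnection("jdbc:oracle:thin:@//dbhost:1521/service", props); // placeholder URL
        // ... run the CLOB query as in the test above ...
        connection.close();
    }
}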
After some research I can partially answer my own question.
I managed to improve the OJDBC performance. The OJDBC API provides the property useFetchSizeWithLongColumn, with which you can query LONG columns very fast.
New query duration:
CREATE TABLE my_table (my_text LONG) -> duration: 134 ms
Oracle documentation:
THIS IS A THIN ONLY PROPERTY. IT SHOULD NOT BE USED WITH ANY OTHER DRIVERS.
If set to "true", the performance when retrieving data in a 'SELECT' will be improved but the default behavior for handling LONG columns will be changed to fetch multiple rows (prefetch size). It means that enough memory will be allocated to read this data. So if you want to use this property, make sure that the LONG columns you are retrieving are not too big or you may run out of memory. This property can also be set as a java property :
java -Doracle.jdbc.useFetchSizeWithLongColumn=true myApplication
Or via the API:
Properties props = new Properties();
props.setProperty("useFetchSizeWithLongColumn", "true");
OracleConnection connection = (OracleConnection) DriverManager.getConnection(connectionString, props);
http://docs.oracle.com/cd/E11882_01/appdev.112/e13995/oracle/jdbc/OracleDriver.html
I still have no solution for cx_Oracle. That's why I opened a github issue:
https://github.com/oracle/python-cx_Oracle/issues/63
I have a table with millions of records in it. In order to make the system faster, I need to implement pagination in my Java code. I need to fetch just 1000 records at a time, process them, then pick the next 1000 records, and so on. I have already tried a few things and none of them works. Some of the things I tried are listed below:
1) String query = "select * from TABLENAME" + " WHERE ROWNUM BETWEEN %d AND %d";
sql = String.format(query, firstrow, firstrow + rowcount);
In the above example, the query SELECT * from TABLENAME Where ROWNUM BETWEEN 0 and 10 gives me a result, but SELECT * from TABLENAME Where ROWNUM BETWEEN 10 and 20 returns an empty result set. I even tried running it directly in the DB; it returns an empty result set (not sure why!).
2) preparedStatement.setFetchSize(100); I have that in my Java code, but it still fetches all the records from the table. Adding this statement didn't affect my code in any way.
Please help!
It sounds like you don't actually need to paginate the results, just process them in batches. If that's the case, all you need to do is set the fetch size to 1000 using setFetchSize and iterate over the ResultSet as usual (using resultSet.next()), processing the rows as you go; see the sketch after the links below. There are many resources describing setFetchSize and what it does:
What does Statement.setFetchSize(nSize) method really do in SQL Server JDBC driver?
How JDBC Statement.SetFetchsize exaclty works
What and when should I specify setFetchSize()?
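A minimal Java sketch of that batch-processing approach (the table name is a placeholder, and the fetch size is a hint to the driver, not a hard limit):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class BatchProcessing {
    // Processes the whole table in one streamed query; no pagination involved.
    static void processAll(Connection con) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement("select * from TABLENAME")) {
            ps.setFetchSize(1000); // ask the driver to fetch 1000 rows per round trip
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // process one row here
                }
            }
        }
    }
}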
For Oracle pagination there are a lot of resources describing how to do this; just do a web search. Here are a couple that describe it:
http://www.databasestar.com/limit-the-number-of-rows-in-oracle/
http://ocptechnology.com/how-to-use-row-limiting-clause/
Pagination is not very useful unless you define a consistent ordering (an ORDER BY clause), since otherwise you cannot rely on the order in which rows are returned.
This answer explains why your BETWEEN statement is not working: https://stackoverflow.com/a/10318244/908961
From that answer: if you are using an Oracle version older than 12c, you need a subselect to get your results. Something like:
SELECT c.*
FROM (SELECT c.*, ROWNUM as rnum
FROM (SELECT * FROM TABLENAME ORDER BY id) c) c
WHERE c.rnum BETWEEN %d AND %d
If you are using Oracle 12c or greater I would recommend using the newer OFFSET FETCH syntax instead of fiddling with rownum. See the first link above or
http://www.toadworld.com/platforms/oracle/b/weblog/archive/2016/01/23/oracle-12c-enhanced-syntax-for-row-limiting-a-k-a-top-n-queries
So your query would be something like
String query = "select * from TABLENAME OFFSET %d ROWS FETCH NEXT 1000 ONLY";
String.format(query, firstrow);
or using prepared statements
PreparedStatement statement = con.prepareStatement("select * from TABLENAME OFFSET ? ROWS FETCH NEXT 1000 ROWS ONLY");
statement.setInt(1, firstrow);
ResultSet rs = statement.executeQuery();
Alternatively, you can use the LIMIT keyword as described at http://docs.oracle.com/javadb/10.10.1.2/ref/rrefjdbclimit.html (note that this JDBC escape syntax is documented for JavaDB/Derby, so check that your driver supports it), and your query would be something like:
String query = "select * from TABLENAME { LIMIT 1000 OFFSET %d }";
String.format(query, firstrow);
The normal way to implement pagination in Oracle is to use an analytic windowing function, e.g. row_number together with an ORDER BY clause that defines the row ordering. The query with the analytic function is then wrapped into an inline view (or a "window"), from which you can query the row numbers you need. Here's an example that queries the first 1000 rows from my_table (ordering by column_to_sort_by):
select rs.* from
(select t.*,
row_number() over (order by column_to_sort_by) as row_num
from my_table t
) rs
where rs.row_num >= 1 and rs.row_num < 1001
order by rs.row_num
A JDBC implementation could then look like the following:
public void queryWithPagination() throws SQLException {
    String query = "select rs.* from"
            + " (select t.*,"
            + " row_number() over (order by column_to_sort_by) as row_num"
            + " from my_table t"
            + " ) rs"
            + " where rs.row_num >= ? and rs.row_num < ?"
            + " order by rs.row_num";
    final int pageSize = 1000;
    int rowIndex = 1;
    try (PreparedStatement ps = myConnection.prepareStatement(query)) {
        do {
            ps.setInt(1, rowIndex);
            ps.setInt(2, rowIndex + pageSize);
            rowIndex += pageSize;
        } while (handleResultSet(ps, pageSize));
    }
}

private boolean handleResultSet(PreparedStatement ps, final int pageSize)
        throws SQLException {
    int rows = 0;
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            /*
             * handle rows here
             */
            rows++;
        }
    }
    return rows == pageSize;
}
Note that the table should remain unchanged while you're reading it so that the pagination works correctly across different query executions.
If there are so many rows in the table that you're running out of memory, you probably need to purge/serialize your list after some pages have been read.
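If it helps, here is one shape that purge step could take; MyRow and persistBuffer() are hypothetical placeholders, not part of the code above:

import java.util.ArrayList;
import java.util.List;

class PagedBuffer {
    private static final int MAX_BUFFERED_ROWS = 10_000;
    private final List<MyRow> buffer = new ArrayList<>();

    void add(MyRow row) {
        buffer.add(row);
        if (buffer.size() >= MAX_BUFFERED_ROWS) {
            persistBuffer(); // serialize or process the buffered rows
            buffer.clear();  // purge, so memory use stays bounded
        }
    }

    private void persistBuffer() {
        // e.g. write the rows to disk or hand them off for processing
    }
}

class MyRow {
    // placeholder for whatever one fetched row maps to
}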
EDIT:
If the ordering of rows doesn't matter to you at all, then, as @bdrx mentions in his answer, you probably don't need pagination, and the fastest solution is to query the table without a WHERE condition in the SELECT. As suggested, you can adjust the fetch size of the statement to a larger value to improve throughput.
Hi StackOverflow community :)
I come to you to share one of my problems...
I have to extract a list of every table in each database of a SQL Server instance. I found this query:
EXEC sp_msforeachdb 'Use ?; SELECT DB_NAME() AS DB, * FROM sys.tables'
It works perfectly in Microsoft SQL Server Management Studio, but when I try to execute it from my Java program (which includes the JDBC driver for SQL Server), it says that it doesn't return any result.
My Java code is the following:
this.statement = this.connect.createStatement(); // Create the statement
this.resultats = this.statement.executeQuery("EXEC sp_msforeachdb 'Use ?; SELECT DB_NAME() AS DB, * FROM sys.tables'"); // Execute the query and store results in a ResultSet
this.sortie.ecrireResultats(this.statement.getResultSet()); // Write the ResultSet to a file
Thanks to anybody who will try to help me. Have a nice day :)
EDIT 1:
I'm not sure that the JDBC driver for SQL Server supports my query, so I'll try to reach my goal another way.
What I'm trying to get is a list of all the tables in each database of a SQL Server instance; the output format will be the following:
+-----------+--------+
| Databases | Tables |
+-----------+--------+
So now I'm asking: can someone help me get to that solution using SQL queries through Java's JDBC driver for SQL Server?
I also want to thank Tim Lehner and Mark Rotteveel for their very quick answers.
If a statement can return no result set or multiple result sets, you should not use executeQuery but execute() instead. This method returns a boolean indicating the type of the first result:
true: the result is a ResultSet
false: the result is an update count
If the result is true, then you use getResultSet() to retrieve the ResultSet, otherwise getUpdateCount() to retrieve the update count. If the update count is -1 it means there are no more results. Note that the update count will also be -1 when the current result is a ResultSet. It is also good to know that getResultSet() should return null if there are no more results or if the result is an update count.
Now if you want to retrieve more results, you call getMoreResults() (or its brother accepting an int parameter). The boolean return value has the same meaning as that of execute(), so false does not mean there are no more results!
There are no more results only if getMoreResults() returns false and getUpdateCount() returns -1 (as also documented in the Javadoc).
Essentially this means that if you want to correctly process all results you need to do something like below:
boolean result = stmt.execute(...);
while (true) {
    if (result) {
        ResultSet rs = stmt.getResultSet();
        // Do something with the ResultSet ...
    } else {
        int updateCount = stmt.getUpdateCount();
        if (updateCount == -1) {
            // no more results
            break;
        }
        // Do something with the update count ...
    }
    result = stmt.getMoreResults();
}
NOTE: Part of this answer is based on my answer to Java SQL: Statement.hasResultSet()?
If you're not getting an error, one issue might be that sp_msforeachdb returns a separate result set for each database rather than one set with all records. That being the case, you might try a bit of dynamic SQL to union up all of your rows:
-- Use sys.tables
declare @sql nvarchar(max)
select @sql = coalesce(@sql + ' union all ', '') + 'select ''' + quotename(name) + ''' as database_name, * from ' + quotename(name) + '.sys.tables'
from sys.databases
select @sql = @sql + ' order by database_name, name'
exec sp_executesql @sql
I still sometimes use INFORMATION_SCHEMA views as well, since it's easier to see the schema name, among other things:
-- Use INFORMATION_SCHEMA.TABLES to easily get schema name
declare @sql nvarchar(max)
select @sql = coalesce(@sql + ' union all ', '') + 'select * from ' + quotename(name) + '.INFORMATION_SCHEMA.TABLES where TABLE_TYPE = ''BASE TABLE'''
from sys.databases
select @sql = @sql + ' order by TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME'
exec sp_executesql @sql
Be aware that this method of string concatenation (select @sql = foo from bar) may not work as you intend through a linked server (it will only grab the last record). Just a small caveat.
UPDATE
I've found the solution!
After reading an article about sp_spaceused being used with Java, I figured out that I was in the same situation.
My final code is the following:
this.instances = instances;
for(int i = 0 ; i < this.instances.size() ; i++)
{
try
{
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver");
this.connect = DriverManager.getConnection("jdbc:sqlserver://" + this.instances.get(i), "tluser", "result");
this.statement = this.connect.prepareCall("{call sp_msforeachdb(?)}");
this.statement.setString(1, "Use ?; SELECT DB_NAME() AS DB, name FROM sys.tables WHERE DB_NAME() NOT IN('master', 'model', 'msdb', 'tempdb')");
this.statement.execute(); // the boolean result is handled below via getUpdateCount()/getResultSet()
while(true)
{
int rowCount = this.statement.getUpdateCount();
if(rowCount >= 0) // the current result is an update count: skip it and move on
{
this.statement.getMoreResults();
continue;
}
ResultSet rs = this.statement.getResultSet();
if(rs != null)
{
while (rs.next())
{
this.sortie.ecrireResultats(rs); // Write the results to a file
}
rs.close();
this.statement.getMoreResults();
continue;
}
break;
}
this.statement.close();
}
catch(Exception e)
{
e.printStackTrace();
}
}
I tried it out and my file has everything I want in it.
Thank you all for your help ! :)
I am fetching records from a MySQL database using Java (JDBC). I have these tables:
Stop_Times with 1.5 million records and
Stops with 100,000 records.
I am using the following code:
List<String> stop_id = new ArrayList<>(); // holds the IDs returned by the first query
ResultSet rs = stm.executeQuery("select distinct(stop_id) from Stop_Times force index (idx_stop_times) where agency_id = '" + agency_id + "' and route_type = " + route_type + " order by stop_id");
while (rs.next())
{
    stop_id.add(rs.getString("stop_id"));
}
JSONArray jsonResult = new JSONArray();
String sql = "select * from Stops force index (idx_Stops) where stop_id = ? and agency_id = ? and location_type = 0 order by stop_name";
PreparedStatement pstm = con.prepareStatement(sql);
int rid = 0;
for (int r = 0; r < stop_id.size(); r++)
{
    pstm.setString(1, stop_id.get(r).toString());
    pstm.setString(2, agency_id);
    rs = pstm.executeQuery();
    if (rs.next())
    {
        JSONObject jsonStop = new JSONObject();
        jsonStop.put("str_station_id", rs.getString("stop_id"));
        jsonStop.put("str_station_name", rs.getString("stop_name") + "_" + rs.getString("stop_id"));
        jsonStop.put("str_station_code", rs.getString("stop_code"));
        jsonStop.put("str_station_desc", rs.getString("stop_desc"));
        jsonStop.put("str_station_lat", rs.getDouble("stop_lat"));
        jsonStop.put("str_station_lon", rs.getDouble("stop_lon"));
        jsonStop.put("str_station_url", rs.getString("stop_url"));
        jsonStop.put("str_location_type", rs.getString("location_type"));
        jsonStop.put("str_zone_id", rs.getString("zone_id"));
        jsonResult.put((rid++), jsonStop);
    }
}
The first query returns 6871 records, but the whole thing is taking too much time: 8-10 seconds on the server side and 40-45 seconds on the client side.
I want to reduce these times to 300-500 milliseconds on the server side and around 10 seconds on the client side.
Can anybody please help me with how to do this?
Your strategy is to use a first query to get IDs, and then to loop over these IDs, executing another query for each ID found by the first one. You're in fact doing a "manual" join instead of letting the database do it for you. You could rewrite everything as a single query (see the JDBC sketch after the query for how to bind its parameters):
select distinct stops.*
from Stops stops
inner join Stop_Times stopTimes on stopTimes.stop_id = stops.stop_id
where stops.agency_id = ?
and stops.location_type = 0
and stopTimes.agency_id = ?
and stopTimes.route_type = ?
order by stops.stop_name
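A hedged sketch of running the combined query over JDBC; the parameter order follows the SQL above, and the names (con, agency_id, route_type) come from your own code:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class StopsQuery {
    static void queryStops(Connection con, String agency_id, int route_type) throws SQLException {
        String sql = "select distinct stops.* from Stops stops"
                + " inner join Stop_Times stopTimes on stopTimes.stop_id = stops.stop_id"
                + " where stops.agency_id = ? and stops.location_type = 0"
                + " and stopTimes.agency_id = ? and stopTimes.route_type = ?"
                + " order by stops.stop_name";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, agency_id);
            ps.setString(2, agency_id);
            ps.setInt(3, route_type);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // build the JSON objects here, exactly as in the original loop
                }
            }
        }
    }
}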
Try to get the explain plan for your query (cf. http://dev.mysql.com/doc/refman/5.0/en/using-explain.html); avoid full table scans (type ALL in the EXPLAIN output). Then add the relevant indexes, and retry.
Here is the problem: at my company we have a large database in which we want to perform some automated operations. To test them, we got a small sample of that data: six CSV files of about 10 MB each. We want to use H2 to test the results of our program against them. H2 seemed to work fine with our previous CSVs, though they were at most 1000 entries long. With any of our 10 MB files, the command
insert into myschema.mytable (select * from csvread('mycsvfile.csv'));
reports a failure because one of the records is supposedly duplicated and violates our primary key constraint.
Unique index or primary key violation: "PRIMARY_KEY_6 ON MYSCHEMA.MYTABLE(DATETIME, LARGENUMBER, KIND)"; SQL statement:
insert into myschema.mytable (select * from csvread('src/test/resources/h2/data/mycsvfile.csv')) [23001-148] 23001/23001
Breaking mycsvfile.csv into smaller pieces, I was able to see that the problem starts to appear after about 10000 inserted rows (though the number varies depending on the data I used). I could, however, insert more than 10000 rows if I broke the file into pieces and ran the command on each piece individually. But even if I manage to insert all that data manually, I need an automated method to fill the database.
Since running the command would not tell me which row was causing the problem, I guessed that the problem could be some cache in the csvread routine.
Then I created a small Java program that inserts the data into the H2 database manually. No matter whether I batched the commands or closed and reopened the connection every 1000 rows, H2 reported that I was trying to insert a duplicate entry into the database.
org.h2.jdbc.JdbcSQLException: Unique index or primary key violation: "PRIMARY_KEY_6 ON MYSCHEMA.MYTABLE(DATETIME, LARGENUMBER, KIND)"; SQL statement:
INSERT INTO myschema.mytable VALUES ( '1997-10-06 01:00:00.0',25485116,1.600,0,18 ) [23001-148]
Doing a normal search for that record using Emacs, I can see that it is not duplicated: the datetime column is unique in the whole dataset.
I cannot give you that data to test with, since the company sells that information. But here is what my table definition looks like:
create table myschema.mytable (
datetime timestamp,
largenumber numeric(8,0) references myschema.largenumber(largecode),
value numeric(8,3) not null,
flag numeric(1,0) references myschema.flag(flagcode),
kind smallint references myschema.kind(kindcode),
primary key (datetime, largenumber, kind)
);
This is what our CSV looks like:
datetime,largenumber,value,flag,kind
1997-06-11 16:45:00.0,25485116,0.710,0,18
1997-06-11 17:00:00.0,25485116,0.000,0,18
1997-06-11 17:15:00.0,25485116,0.000,0,18
1997-06-11 17:30:00.0,25485116,0.000,0,18
And here is the Java code that fills our test database (forgive my ugly code, I got desperate :)
private static void insertFile(MyFile file) throws SQLException {
    int updateCount = 0;
    ResultSet rs = Csv.getInstance().read(file.toString(), null, null);
    ResultSetMetaData meta = rs.getMetaData();
    Connection conn = DriverManager.getConnection(
            "jdbc:h2:tcp://localhost/mytestdatabase", "sa", "pass");
    rs.next(); // note: Csv.read() already consumed the header line, so this skips the first data row
    while (rs.next()) {
        Statement stmt = conn.createStatement();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < meta.getColumnCount(); i++) {
            if (i == 0)
                sb.append("'" + rs.getString(i + 1) + "'"); // quote the timestamp column
            else
                sb.append(rs.getString(i + 1));
            sb.append(',');
        }
        updateCount++;
        if (sb.length() > 0)
            sb.deleteCharAt(sb.length() - 1); // drop the trailing comma
        stmt.execute(String.format(
                "INSERT INTO myschema.mytable VALUES ( %s ) ",
                sb.toString()));
        stmt.close();
        if (updateCount == 1000) {
            // reconnect every 1000 rows
            conn.close();
            conn = DriverManager.getConnection(
                    "jdbc:h2:tcp://localhost/mytestdatabase", "sa", "pass");
            updateCount = 0;
        }
    }
    if (!conn.isClosed()) {
        conn.close();
    }
    rs.close();
}
I'll be glad to provide more information if requested.
EDIT
@Randy: I always check that the database is clean before running the command, and in my Java program I have a routine to delete all data from a file that fails to be inserted.
select * from myschema.mytable where largenumber = 25485116;
DATETIME LARGENUMBER VALUE FLAG KIND
(no rows, 8 ms)
The only thing I can think of is that there is a trigger on the table that sets the timestamp to "now". Although that would not explain why you are successful with a few rows, it would explain why the primary key is being violated.
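One way to check that hypothesis is to list the triggers H2 knows about. A small sketch, assuming H2 exposes them through its INFORMATION_SCHEMA.TRIGGERS view (true for the 1.x series used here) and reusing the connection URL from the question:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TriggerCheck {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:h2:tcp://localhost/mytestdatabase", "sa", "pass");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT TRIGGER_NAME, TABLE_NAME FROM INFORMATION_SCHEMA.TRIGGERS")) {
            while (rs.next()) {
                // print every trigger and the table it is attached to
                System.out.println(rs.getString("TRIGGER_NAME") + " on " + rs.getString("TABLE_NAME"));
            }
        }
    }
}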