I'm a junior Java programmer and I've finally made my first program all by myself, outside of school :).
The basics are: you can store data in it and retrieve it anytime. The main thing is, I want to be able to run this program on another computer (as a runnable .jar file).
Therefore I had to install the JRE and the Microsoft Access 2010 drivers (both 32-bit), and the program works perfectly, but there is one small problem.
It takes ages (literally, 17 seconds) to store or delete something from the database.
What is the cause of this? Can I change it?
Edit:
Here's the code to insert an object of the class Woord into the database.
public static void ToevoegenWoord(Woord woord) {
    try (Connection conn = DriverManager.getConnection("jdbc:odbc:DatabaseSenne")) {
        PreparedStatement addWoord =
                conn.prepareStatement("INSERT INTO Woorden VALUES (?)");
        addWoord.setString(1, woord.getWoord());
        addWoord.executeUpdate();
    } catch (SQLException ex) {
        for (Throwable t : ex) {
            // "The word could not be added to the database."
            System.out.println("Het woord kon niet worden toegevoegd aan de databank.");
            t.printStackTrace();
        }
    }
}
Most likely creating the Connection every time is the slow operation in your case (especially through the JDBC-ODBC bridge). To confirm this, try putting print statements with timestamps before and after the line that gets the Connection from the DriverManager. If that's the case, consider not opening a connection on every request but opening it once and reusing it; better yet, use some sort of connection pooling, for which there are plenty of options available.
If that's not the case, then the actual insert could be slow as well. Again, simple profiling with print statements should help you discover where your code is spending most of its time.
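For illustration, a minimal sketch of the reuse idea, keeping the jdbc:odbc:DatabaseSenne URL from the question (the class name is made up; a real connection pool would additionally handle timeouts, broken connections and concurrency):

public class Db {
    private static Connection conn;

    // Open the connection once and hand out the same instance afterwards.
    public static synchronized Connection getConnection() throws SQLException {
        if (conn == null || conn.isClosed()) {
            conn = DriverManager.getConnection("jdbc:odbc:DatabaseSenne");
        }
        return conn;
    }
}

ToevoegenWoord would then call Db.getConnection() instead of DriverManager.getConnection(...), and would no longer close the shared connection in a try-with-resources header.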
First of all, congrats on your first independent foray. To answer your question / elaborate on maximdim's answer, the concern is that calling:
try (Connection conn = DriverManager.getConnection("jdbc:odbc:DatabaseSenne")) {
every time you use this function may be a major bottleneck (or perhaps another section of your code is). Most importantly, you will want to understand the concept of using logging, or even standard print statements, to help diagnose where you are seeing an issue. Wrapping individual lines of code like so:
System.out.println("Before Connection retrieval: " + new Date().getTime());
try (Connection conn = DriverManager.getConnection("jdbc:odbc:DatabaseSenne")) {
System.out.println("AFTER Connection retrieval: " + new Date().getTime());
...to see how many milliseconds pass for each call can help you determine exactly where your bottleneck lies.
Advice: use another database, like Derby or HSQLDB. They are not so different from MS Access (both can use a file-based DB), but they perform better than the JDBC-ODBC bridge, and they can even be embedded in the application (without a separate installation of the DB).
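As a hedged illustration of the embedded approach (Derby here; the database name is made up, ";create=true" creates the database on first connect, and the Woorden table is assumed to have been created beforehand):

// Embedded Derby: no server and no separate installation needed,
// just derby.jar on the classpath.
Connection conn = DriverManager.getConnection("jdbc:derby:senneDB;create=true");
PreparedStatement add = conn.prepareStatement("INSERT INTO Woorden VALUES (?)");
add.setString(1, "voorbeeld");
add.executeUpdate();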
I need another set of eyes on this.
I've written out a zip file of hundreds of gigabytes with this exact code, with no modifications, locally on Mac OS X.
With 100% unchanged code, just deployed to an AWS instance running Ubuntu, this same code runs into out-of-memory issues (heap space).
Here's the code that's being run, streaming MyBatis to a CSV file on disk:
File directory = new File(feedDirectory);
File file;
try {
    file = File.createTempFile(("feed-" + providerCode + "-"), ".csv", directory);
} catch (IOException e) {
    throw new RuntimeException("Unable to create file to write feed to disk: " + e.getMessage(), e);
}
String filePath = file.getAbsolutePath();
log.info(String.format("File name for %s feed is %s", providerCode, filePath));
// output file
try (FileOutputStream out = new FileOutputStream(file)) {
    streamData(out, providerCode, startDate, endDate);
} catch (IOException e) {
    throw new RuntimeException("Unable to write feed to file: " + e.getMessage(), e);
}
public void streamData(OutputStream outputStream, String providerCode, Date startDate, Date endDate) throws IOException {
    try (CSVPrinter printer = CsvUtil.openPrinter(outputStream)) {
        StreamingHandler<FStay> handler = stayPrintingHandler(printer);
        warehouse.doForAllStaysByProvider(providerCode, startDate, endDate, handler);
    }
}
private StreamingHandler<FStay> stayPrintingHandler(CSVPrinter printer) {
    StreamingHandler<FStay> handler = new StreamingHandler<>();
    handler.setHandler((stay) -> {
        try {
            EXPORTER.writeStay(printer, stay);
        } catch (IOException e) {
            log.error("Issue with writing output: " + e.getMessage(), e);
        }
    });
    return handler;
}
// The EXPORTER method
import org.apache.commons.csv.CSVPrinter;

public void writeStay(CSVPrinter printer, FStay stay) throws IOException {
    List<Object> list = asList(stay);
    printer.printRecord(list);
}

List<Object> asList(FStay stay) {
    List<Object> list = new ArrayList<>(46);
    list.add(stay.getUid());
    list.add(stay.getProviderCode());
    //....
    return list;
}
Here's a graph of the JVM heap space (from jvisualvm) when I run this locally. I've run this consistently with Java 8 (jdk1.8.0_51 and 1.8.0_112) locally and have gotten great results, even writing out a terabyte of data.
^ In the above, the max heap space is set to 4 gigs, and the most it ever increases to is 1.5 gigs, before going back down to around 500 MB, while streaming data to the CSV file as it's supposed to.
However, when I run this on Ubuntu with jdk 1.8.0_111, the exact same operation will not complete, running out of heap space (java.lang.OutOfMemoryError: Java heap space)
I've upped the Xmx value from 8 gigs to 16 to 25 gigs, and still run out of heap space. Meanwhile... the total size of the file is only 10 Gigs in total... which really perplexes me.
Here's what the JVisualVm graph looks like on the Ubuntu box:
I've no doubt it's the exact same code running in both environments, with the same operation being performed in each (same database server providing the same data)
The only differences I can think of at this point are:
Operating system - Ubuntu vs Mac OS X
Hosted VM in AWS vs hard metal laptop
Network speed is faster in AWS between database and Ubuntu server
JDK version is 1.8.0_111 in Ubuntu, tried 1.8.0_51 and 1.8.0_112 locally
Can anyone help shed any light on this problem?
Update
I've tried replacing all the 'try-with-resources' statements with explicit flush/close statements and no luck.
What's more, I tried to force a garbage collection on the Ubuntu box as soon as I started to see the data come in, and it had no effect-- there is something definitely stopping the heap from being collected on the Ubuntu machine... while running the exact same code on OS X let me write the full enchilada again no problem.
Update 2
In addition to the differences in the environments above, the only other difference I can think of is if the connection between the servers in AWS is so fast that it streams the data faster than it can flush the data to disk... but that still doesn't explain the issue where I only have 10 gigs of data total, and it blows up a JVM with 20 Gigs of heap space.
Is there any likelihood of there being a bug at the Ubuntu/Java level for this?
Update 3
Tried replacing the output of the CSVPrinter to use an entirely separate library (OpenCSV's CSVWriter in lieu of Apache's CSV library) and the same result occurs.
As soon as this code starts receiving data from the database, the heap starts blowing up and the garbage collector fails to reclaim any memory... but only on Ubuntu. On OS X, everything is reclaimed immediately and the heap never grows.
I've also tried flushing the stream after every write, but had no luck with that as well.
Update 4
Got the heap dump to print out, and according to it I should be looking at the database driver, specifically the InboundDataHandler in Amazon's Redshift driver.
I'm using myBatis with a custom result handler. I tried setting the result handler to effectively do nothing when it gets a result (new ResultHandler<>() { // method overridden to do literally nothing}) and I know I'm not holding on to any references there.
Since it's the InboundDataHandler defined by AWS/Redshift... it makes me think it may be lower than the myBatis level... either:
Error in the SqlSessionFactory I'm setting up
Bug in the Redshift driver that only pops up in Ubuntu / AWS
Bug in the result handler I have overwritten
Here's the heap dump screenshot:
Here's where I'm setting up my SqlSessionFactoryBean:
@Bean
public javax.sql.DataSource redshiftDataSource() throws ClassNotFoundException {
    log.info("Got to datasource config");
    // Dynamically load driver at runtime.
    Class.forName(dataWarehouseDriver);
    DataSource dataSource = new DataSource();
    dataSource.setURL(dataWarehouseUrl);
    dataSource.setUserID(dataWarehouseUsername);
    dataSource.setPassword(dataWarehousePassword);
    return dataSource;
}
@Bean
public SqlSessionFactoryBean sqlSessionFactory() throws ClassNotFoundException {
    SqlSessionFactoryBean factoryBean = new SqlSessionFactoryBean();
    factoryBean.setDataSource(redshiftDataSource());
    return factoryBean;
}
Here's the myBatis code I'm running as a test to verify that it's not me holding on to records in my ResultHandler:
warehouse.doForAllStaysByProvider(providerCode, startDate, endDate, new ResultHandler<FStay>() {
    @Override
    public void handleResult(ResultContext<? extends FStay> resultContext) {
        // do nothing
    }
});
Is there a way I can force the SQL connection to not hang on to records, or something? I'll reiterate that on my local machine there is no issue with this memory leak; it only surfaces when running the code in the hosted AWS environment. And in both cases, the database driver and server are the same.
Update 6
I think it's finally fixed. Thanks to all who pointed me in the direction of the heap dump. That helped narrow it down to the offending class in a huge way.
After that, I did some research on the AWS redshift driver, and it explicitly says that your clients should specify a limit for any operations on large data. So I found out how to do that in my myBatis configuration:
<select id="doForAllStaysByProvider" fetchSize="1000" resultMap="FStayResultMap">
select distinct
f_stay.uid,
And this did the trick.
Mind you, this isn't necessary even when handling much larger data sets downloaded remotely from AWS (database in AWS, code executing on a laptop at home), and it shouldn't be necessary at all, since I'm overriding the MyBatis ResultHandler<>, which handles each row individually and never holds on to any objects.
Yet something funky happens with the AWS Redshift JDBC driver only when it's run in AWS (database in AWS, code executing in an AWS instance) which causes this InboundDataHandler to never release its resources, unless a fetchSize is specified.
Here's the heap of the server running now, getting much further than it ever has before in AWS, with the heap space never moving above 500 MB; after I hit 'force gc' in jvisualvm, it shows the 'used' heap at less than 100 MB:
Thanks again in a huge way to all those who helped guide this!
Finally figured out a solution.
The heap dump was the biggest aid; it indicated that the InboundDataHandler class of Amazon's Redshift/Postgres JDBC driver was the prime culprit.
The code to set up the SqlSession appeared legit, so traveling over to Amazon's documentation landed this gem:
To avoid client-side out-of-memory errors when retrieving large data sets using JDBC, you can enable your client to fetch data in batches by setting the JDBC fetch size parameter.
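In plain JDBC terms (outside MyBatis), that parameter is the standard Statement fetch size; here is a minimal sketch with an illustrative query:

// Hint to the driver to stream rows in batches of 1000 instead of
// buffering the entire result set client-side.
PreparedStatement ps = connection.prepareStatement("select * from f_stay");
ps.setFetchSize(1000);
ResultSet rs = ps.executeQuery();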
We hadn't run into this before, as we stream results with custom ResultHandlers in MyBatis... but there seems to be something different when the AWS Redshift JDBC driver is running on AWS itself vs outside AWS connecting in.
Taking the guidance from the documentation, we added a 'fetchSize' to our MyBatis select query:
<select id="doForAllStaysByProvider" fetchSize="1000" resultMap="FStayResultMap">
select distinct
f_stay.uid,
And voila! Everything worked swimmingly. This is the only change we made and the heap never went above a couple hundred MBs.
You can see in one of the graphs above where the heap goes off the charts: as soon as the data starts to be received on Amazon, the heap marches right up linearly and never reclaims an ounce of space once it starts.
My guess is the Redshift JDBC driver is doing something different when it's in Amazon's environment for some kind of optimization... that's all I can think of to explain the behavior.
Clearly Amazon knows what's going on since they documented it up front. I may not know the full 'why' of what's happening, but at least everything is resolved in what appears to be a satisfactory way.
Thanks to all those who helped.
I have been browsing this website for quite some time now. The community has helped me quite a lot, even though I never registered. Thanks for that.
However, this time I can't find a solution to my problem just by browsing, so I decided to register here and state my question so that someone might be able to help me.
First of all I'd like to say that I'm basically at the beginning of my studies, so I'm not very knowledgeable yet. At the moment I'm studying for an upcoming exam and I need to know about databases. The database of choice is PostgreSQL. But to really understand something like that, you need to try it out, not just read about it.
I'm not exactly sure what information is required. I'm using Windows 7, pgAdmin, and Eclipse, all of which should be up to date. If you need further information, please ask.
I used a tutorial to get it up and running; however, I'm not able to establish a connection to the database from Eclipse. The DriverManager.getConnection() call causes a UTF-8 error that I neither understand nor am able to fix.
org.postgresql.util.PSQLException: The connection attempt failed.
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:280)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:67)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:212)
at org.postgresql.Driver.makeConnection(Driver.java:407)
at org.postgresql.Driver.connect(Driver.java:275)
at java.sql.DriverManager.getConnection(Unknown Source)
at java.sql.DriverManager.getConnection(Unknown Source)
at com.hywy.TestConnection.main(TestConnection.java:23)
Caused by: java.io.IOException: Illegal UTF-8 sequence: initial byte is 11111xxx: 252
at org.postgresql.core.UTF8Encoding.decode(UTF8Encoding.java:131)
at org.postgresql.core.PGStream.ReceiveString(PGStream.java:331)
at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:447)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:223)
... 7 more
The Code I'm using is:
import java.sql.*;
import java.util.Properties;

//import com.hywy.db.DbContract;

public class TestConnection {

    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.setProperty("user", "postgre");
        properties.setProperty("password", "random_password");
        //properties.setProperty("ssl", "true");

        String url = "jdbc:postgresql://localhost:1112/people";

        try {
            Class.forName("org.postgresql.Driver");
            // Connection c = DriverManager.getConnection(
            //         DbContract.HOST + DbContract.DB_NAME,
            //         DbContract.USERNAME,
            //         DbContract.PASSWORD);
            Connection con = DriverManager.getConnection(url, properties);
            // Connection c = DriverManager.getConnection("jdbc:postgresql://localhost:1112/people", "postgre", "random_password");
            System.out.println("DB connected");
        } catch (ClassNotFoundException | SQLException e) {
            e.printStackTrace();
        }
    }
}
The code is a bit of a mess, since I googled around for some time and tried various things. I wasn't sure whether I should paste the commented-out code along with the rest, but at least you can see that I already tried some things.
I also tried a different port; at first I was using the standard port (5432), but it made no difference.
I also linked the up-to-date JDBC jar file in my build path.
I hope someone knows (about) this problem and is willing to help me.
Thanks in advance.
Excuse my English, it's not my native tongue.
Ok guys, I just wanted to say that I fixed the problem. I also wanted to thank everyone who tried to help me.
It's only logical that no one was really able to help me, because it was only due to my stupidity.
The mistake happened at
"jdbc:postgresql://localhost:1112/people"
Only when I did the whole thing on another computer, to see if it was my system, did I realize that I'm not supposed to write the actual table name at the end of the URL, but rather the name of the database. So I just changed "people" to "postgres" and it worked.
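For anyone else hitting this, a minimal sketch of the corrected URL (the last path segment is the database name, not a table; the standard port and the credentials from the question are assumed):

// jdbc:postgresql://<host>:<port>/<database>
String url = "jdbc:postgresql://localhost:5432/postgres";
Connection con = DriverManager.getConnection(url, "postgre", "random_password");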
Sorry for the stupid question, I hope my next one will be better :)
*Topic can be deleted or closed or whatever you want to do with it.
Thanks anyone!
I have a problem with cursor leakage in my Java project.
Typical example:
private void doSomething() throws Exception {
    String sql1 = "some sql statement";
    String sql2 = "some other sql statement";
    PreparedStatement ps = null;
    ResultSet rs = null;
    try {
        Connection con = getConnection();
        ps = con.prepareStatement(sql1);
        rs = ps.executeQuery();
        // do something with the ResultSet rs
        // [Need to call ps.close() here. Otherwise I risk getting ORA-01000.]
        ps = con.prepareStatement(sql2);
        ps.executeQuery();
    } catch (Exception e) {
    } finally {
        ps.close();
        rs.close();
    }
}
Since I have a rather large code base I would like to be able to find all methods
having two or more variables named sql.
Alternatively, finding methods with two (or more) calls to prepareStatement without a call to ps.close() between the two.
I am using Eclipse, and the file search has a regex option. Maybe that is the way to go?
If so, what would it look like?
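To illustrate what such a regex might look like (Eclipse file search with the "Regular expression" option; (?s) lets . match newlines, and the tempered dot forbids an intervening .close(); expect both false positives and false negatives, so treat hits only as candidates):

(?s)prepareStatement(?:(?!\.close\(\)).)*?prepareStatement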
Recent versions of Eclipse (I used Juno [4.2]) will show a warning on those lines:
You can enable this warning in the Eclipse preferences:
Even for larger size code bases, you can filter the problems view for this warning in order to find these places in the code.
You can avoid having to close resources manually (if you are using at least JDK 7) by using try-with-resources:
try (Connection con = getConnection()) {
    try (PreparedStatement ps = con.prepareStatement(sql)) {
        try (ResultSet rs = ps.executeQuery()) {
            ...
        }
    }
}
Anything that implements AutoCloseable can be used in a try-with-resources statement, and the resource gets closed automatically for you. No need to implement complicated resource-closing logic anymore.
See the Try-with-Resources Tutorial
What you want to do is a kind of static program analysis. While there are specialized tools for that, you could also write your own tool for this specific task.
In that way, you could count lines containing rs=ps.executeQuery and rs.close():
for(File file:sourceFiles) {
int openedResultSets = count("rs=ps.executeQuery");
int closedResultSets = count("rs.close()");
if (openedResultSets > closedResultSets) {
log.error(file.getName());
}
}
In practice it would need to be more complex, because this is probably not the only such pattern in your project, so I suppose you'd have to write some real code, not just a single regexp.
Specialized tools are expensive in most cases, but a trial version will probably be enough for you.
There are FindBugs rules to find this, such as
ODR_OPEN_DATABASE_RESOURCE
ODR_OPEN_DATABASE_RESOURCE_EXCEPTION_PATH
OBL_UNSATISFIED_OBLIGATION
more if you look for the keyword 'resource' in the bug description
If you are not specifically looking for resource leaks, but rather for two or more variables named sql*, you can write a Checkstyle check to find them, possibly as a subclass of LocalVariableNameCheck. This is not a difficult kind of check, but requires coding and deployment work.
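A rough sketch of such a check, using AbstractCheck directly rather than subclassing LocalVariableNameCheck (the class name and message are made up, and the API shown is from recent Checkstyle releases, so details may differ in older ones):

import com.puppycrawl.tools.checkstyle.api.AbstractCheck;
import com.puppycrawl.tools.checkstyle.api.DetailAST;
import com.puppycrawl.tools.checkstyle.api.TokenTypes;

public class MultipleSqlVariablesCheck extends AbstractCheck {

    @Override
    public int[] getDefaultTokens() {
        return new int[] { TokenTypes.METHOD_DEF };
    }

    @Override
    public int[] getAcceptableTokens() {
        return getDefaultTokens();
    }

    @Override
    public int[] getRequiredTokens() {
        return getDefaultTokens();
    }

    @Override
    public void visitToken(DetailAST methodDef) {
        int sqlVariables = countSqlVariables(methodDef);
        if (sqlVariables >= 2) {
            log(methodDef, "Method declares " + sqlVariables
                    + " 'sql' variables; check for unclosed statements.");
        }
    }

    // Recursively count variable definitions whose name starts with "sql".
    private int countSqlVariables(DetailAST node) {
        int count = 0;
        for (DetailAST child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getType() == TokenTypes.VARIABLE_DEF) {
                DetailAST ident = child.findFirstToken(TokenTypes.IDENT);
                if (ident != null && ident.getText().startsWith("sql")) {
                    count++;
                }
            }
            count += countSqlVariables(child);
        }
        return count;
    }
}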
I have been given the task of creating a SQL database and a GUI in Java to access it. I pretty much have it working, but I have a question about threads. Before today I did not use any threads in my program, and as a result, just to pull 150 records from the database I had to wait around 5-10 seconds. This was very inconvenient and I wasn't sure whether I could fix it. Today I looked on the internet at threading in programs similar to mine, and I decided to use just one thread, in this method:
public Vector VectorizeView(final String viewName) {
    final Vector table = new Vector();
    int cCount = 0;
    try {
        cCount = getColumnCount(viewName);
    } catch (SQLException e1) {
        e1.printStackTrace();
    }
    final int viewNameCount = cCount;
    Thread runner = new Thread() {
        public void run() {
            try {
                Connection connection = DriverManager.getConnection(getUrl(),
                        getUser(), getPassword());
                Statement statement = connection.createStatement();
                ResultSet result = statement.executeQuery("Select * FROM "
                        + viewName);
                while (result.next()) {
                    Vector row = new Vector();
                    for (int i = 1; i <= viewNameCount; i++) {
                        String resultString = result.getString(i);
                        if (result.wasNull()) {
                            resultString = "NULL";
                        } else {
                            resultString = result.getString(i);
                        }
                        row.addElement(resultString);
                    }
                    table.addElement(row);
                }
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    };
    runner.start();
    return table;
}
The only thing I really changed was adding the thread 'runner', and the performance increased dramatically. Pulling 500 records happens almost instantly this way.
The method looked like this before:
public Vector VectorizeTable(String tableName) {
    Vector<Vector> table = new Vector<Vector>();
    try {
        Connection connection = DriverManager.getConnection(getUrl(),
                getUser(), getPassword());
        Statement statement = connection.createStatement();
        ResultSet result = statement.executeQuery("Select * FROM "
                + tableName);
        while (result.next()) {
            Vector row = new Vector();
            for (int i = 1; i <= this.getColumnCount(tableName); i++) {
                String resultString = result.getString(i);
                if (result.wasNull()) {
                    resultString = "NULL";
                } else {
                    resultString = result.getString(i);
                }
                row.addElement(resultString);
            }
            table.addElement(row);
        }
    } catch (SQLException e) {
        e.printStackTrace();
    }
    return table;
}
My question is why is the method with the thread so much faster than the one without? I don't use multiple threads anywhere in my program. I have looked online but nothing seems to answer my question.
Any information anyone could give would be greatly appreciated. I'm a noob on threads XO
If you need any other additional information to help understand what is going on let me know!
Answer:
Look at Aaron's answer; this wasn't an issue with threads at all. I feel very noobish right now :(. THANKS @Aaron!
I think that what you are doing is appearing to make the database load faster because the VectorizeView method is returning before the data has been loaded. The load is then proceeding in the background, and completing in (probably) the same time as before.
You could test this theory by adding a runner.join() call after the runner.start() call.
If this is what is going on, you probably need to do something to stop other parts of your application from accessing the table object before loading has completed. Otherwise your application is liable to behave incorrectly if the user does something too soon after launch.
FWIW, loading 100 or 500 records from a database should be quick, unless the query itself is expensive for the database. That shouldn't be the case for a simple select from a table ... unless you are actually selecting from a view rather than the table, and the view is poorly designed. Either way, you probably would be better off focussing on why such a simple query is taking so long, rather than trying to run it in a separate thread.
In your follow-up you say that the version with the join after the start is just as fast as the version without it.
My first reaction is to say: "Leave the join there. You've fixed the problem."
But this doesn't explain what is actually going on. And I'm now completely baffled. The best I can think of is that something your application is doing before this on the current thread is the cause of this.
Maybe you should investigate what the application is doing in the period in which this is occurring. See if you can figure out where all the time is being spent.
Take a thread dump and look at the threads.
Run it under the debugger to see where the "pause" is occurring.
Profile it.
Set the application logging to a high level and see if there are any clues.
Check the database logs.
Etcetera
It looks like you kick off (i.e. start) a background thread to perform the query, but you don't join to wait for the computation to complete. When you return table, it won't be filled in with the results of the query yet -- the other thread will fill it in over time, after your method returns. The method returns almost instantly, because it's doing no real work.
If you want to ensure that the data is loaded before the method returns, you'll need to call runner.join(). If you do so, you'll see that loading the data is taking just as long as it did before. The only difference with the new code is that the work is performed in a separate thread of execution, allowing the rest of your code to get on with other work that it needs to perform. Note that failing to call join could lead to errors if code in your main thread tries to use the data in the Vector before it's actually filled in by the background thread.
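For illustration, a minimal sketch of the join-based variant (names taken from the question; the run() body is unchanged):

runner.start();
try {
    runner.join(); // block until the background thread has filled `table`
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}
return table; // now fully populated, and the timing will match the old code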
Update: I just noticed that you're also precomputing getColumnCount in the multi-threaded version, while in the single-threaded version you're computing it for each iteration of the inner loop. Depending on the complexity of that method, that might explain part of the speedup (if there is any).
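For clarity, a sketch of that hoisting applied to the single-threaded version (names from the question):

int columnCount = this.getColumnCount(tableName); // computed once, not once per column per row
while (result.next()) {
    Vector row = new Vector();
    for (int i = 1; i <= columnCount; i++) {
        String resultString = result.getString(i);
        row.addElement(result.wasNull() ? "NULL" : resultString);
    }
    table.addElement(row);
}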
Are you sure that it is faster? Since you start a separate thread, you return table immediately. But are you sure that you measure the time after it's fully populated with data?
Update
To measure the time correctly, save the runner object somewhere and call runner.join(). You can even do it in the same method for testing.
Ok, I think that if you examine table at the end of this method you will find it's empty. That's because start starts running the thread in the background, and you immediately return table without the background thread having a chance to populate it. So it appears to be going faster but actually isn't.
I'm debugging a Java app which connects to an Oracle DB via the thin client.
The code looks as follows (I'm trying to simplify the use case here, so pardon me if it does not actually compile):
Connection conn = myEnv.getDbConnection();
CallableStatement call = conn.prepareCall(
        "{ ? = call SomePackage.SomeFunction (?)}");
call.registerOutParameter(1, OracleTypes.CURSOR);
for (int runCount = 0; runCount <= 1; runCount++) {
    currency = getCurrency(runCount); // NOTE: [0]=CAD, [1]=USD
    call.setString(2, currency);
    try {
        call.execute();
    } catch (SQLException e) {
        // BREAKS HERE!
    }
    ResultSet rset = (ResultSet) call.getObject(1);
    // ... more code that I think is irrelevant, as it does not use/affect "call"
}
When I run this code, the following happens:
On the first iteration of the loop, currency is set to "CAD".
The entire body of the loop runs perfectly fine.
On the second iteration of the loop, currency is set to "USD".
The "execute()" call throws SQLException, as follows:
ORA-01008: not all variables bound
Why?
My initial suspicion was that it was somehow related to the registerOutParameter call before the loop, which doesn't get repeated on the second iteration. But moving that call inside the loop does not fix the problem. It seems that the execute() call un-binds something, but having both bindings inside the loop does not help.
What am I missing?
If it's something obvious, please be gentle; I know very little about Oracle and the thin client, and googling with a myriad of fancy queries returned no love.
One additional clue: this design seemed to work before, when the app was on Oracle 9 with OCI drivers. The reason I'm debugging it is that someone "upgraded" it to the Oracle 10.2 thin client and it broke.
My next step should probably be to bring the entire CallableStatement into the loop, but that kind of defeats the whole idea of why prepared statements are used in the first place, no?
Have you tried adding call.clearParameters() inside the loop? Perhaps it would reset whatever internal state the object needs in order to execute again.
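A hedged sketch of that suggestion, using the names from the question (whether clearParameters actually resets the relevant driver state is exactly what this would test):

for (int runCount = 0; runCount <= 1; runCount++) {
    call.registerOutParameter(1, OracleTypes.CURSOR); // re-register each pass
    call.setString(2, getCurrency(runCount));
    call.execute();
    ResultSet rset = (ResultSet) call.getObject(1);
    // ... process the cursor ...
    rset.close();
    call.clearParameters(); // drop parameter bindings before the next iteration
}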
The explanation obtained via an Oracle Support call was that this version of Java (1.3) was not compatible with the new Oracle driver. Java 1.4 fixed the issue.