I am using Sybase ASE and have a table in which I save results calculated in Java. The table has 10 columns: one INT column (not an ID column) and nine VARCHAR(50) columns.
There is no index or trigger on this table (it is completely standalone). I need to insert around 160K rows into it, so I split the work into batches of 10,000 insertions each. I tried two different approaches: Spring's JdbcTemplate.batchUpdate and the native JDBC PreparedStatement.executeBatch API.
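For reference, the plain-JDBC variant of the batch loop looks roughly like this (just a sketch: conn, Row, rows and the table/column names are placeholders, not my real code):
// sketch of one 10K-row batch via PreparedStatement.executeBatch (names are illustrative)
String sql = "insert into result_table (int_col, col1, col2) values (?, ?, ?)";
conn.setAutoCommit(false);
try (PreparedStatement ps = conn.prepareStatement(sql)) {
    for (Row row : rows) {              // 'rows' holds the current chunk of up to 10,000 results
        ps.setInt(1, row.getIntValue());
        ps.setString(2, row.getValue1());
        ps.setString(3, row.getValue2());
        ps.addBatch();
    }
    ps.executeBatch();
    conn.commit();
}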
Either way, there was no clear winner on performance: both take around 25 to 30 seconds per 10K insertions.
Then I thought it could be related to the JDBC driver, so I tried two different drivers: jConnect and jTDS. No real impact on insertion performance.
Finally I decided to compare Sybase with another database, PostgreSQL, using the same Java code. Surprisingly, PostgreSQL takes only 0.3 seconds for every 10K insertions, while Sybase takes 25 to 30 seconds (roughly 80 to 100 times longer).
The DBA support team explains the difference by the fact that PostgreSQL is installed on my local machine while Sybase is installed on our enterprise server. However, I am not convinced by this explanation at all.
Does anyone know if there's a configuration in Sybase which could considerably impact the insertion speed? Or are there any other possible causes for my above scenario?
The delay you see on the Sybase end can come from a number of factors that need to be checked, and comparing it with a different database, one that is on your local machine at that, is not a fair comparison.
To start with, we need to check the network latency and the storage used by the Sybase database. We also need to check the Sybase server configuration, the page size, and the locking scheme of the table you are inserting into, and do a basic health check of the server while you are inserting the data. Since you have used two different ways to insert the data, it is also important to check that both are up to date with respect to the Sybase client installed on your system.
To sum up, it may be something as simple as blocking on the Sybase instance, or it could be related to storage that cannot write quickly enough. If Sybase is configured properly, insert performance should be very good.
Whether the DB server is local or not may indeed make a significant difference. Until you cut out this factor, comparison with a local DB makes little sense.
But that aside, there are many aspects that affect insert performance in ASE. First off, make sure the overall memory configuration (e.g. data cache and procedure cache) is not too small -- leaving it at the installation defaults guarantees disappointing results. Then there is the network packet size, which can play a role, as can the batch size (number of rows per commit) and the table's lock scheme.
Trying to use minimally logged inserts will also help (this requires config setting changes), especially since the table has no indexes (and no UNIQUE or PK constraints either?).
The ASE server page size (which you choose when you create the server) also makes a difference: bigger is basically better for inserts.
Set the ENABLE_BULK_LOAD parameter to True. It will speed it up.
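If I remember right, with jConnect this is a connection property, so in Java it would be passed when the connection is created -- something along these lines (the property name comes from the answer above; the URL and credentials are placeholders, and the exact accepted values depend on the driver version):
java.util.Properties props = new java.util.Properties();
props.put("user", "myuser");
props.put("password", "secret");
props.put("ENABLE_BULK_LOAD", "true");   // the bulk-load setting mentioned above
Connection conn = DriverManager.getConnection("jdbc:sybase:Tds:dbhost:5000/mydb", props);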
I built a Java web application, and it works fine with SQL Server 2008 when the queried table contains about 100 records. But when I increased that to 1.3 million records, a single query takes about 4-8 minutes to execute. My application uses Hibernate.
I have deployed this application on a 6 GB RAM server and on a 12 GB RAM server, and have also increased my Java heap size to 4 GB and 8 GB respectively, but I still encounter the same problem.
Please what can I do to improve performance?
UPDATE:
This is one of the SQL queries that is really slow on SQL Server but runs fast on PostgreSQL:
select distinct c.company from Affiliates c where c.portalUser.userId = 'user.getUserId()' and lower(c.company.classification.name) = lower('" + companyClass + "') order by c.company.dateOfReservation desc";
I think the most painful part of your example query is the part:
lower(c.company.classification.name) = lower('" + companyClass + "')
Here you are forcing a table scan, because for each row the names have to be lowercased and compared. If your database is configured for case-insensitive comparison you might be able to omit the lower() calls. If not, you could consider adding an extra column holding a lowercase copy of the string and using that copy for the query.
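As a sketch of that second option (the column, accessor and property names below are made up): keep a lowercase copy maintained by the application when it writes the row, index that column, and compare against it without lower():
// when writing the row, also store a lowercase copy in a hypothetical name_lower column
ps.setString(1, company.getClassificationName());
ps.setString(2, company.getClassificationName().toLowerCase(Locale.ENGLISH));
// the query then compares against the indexed name_lower column, with no lower() call at all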
How many companyClasses are there? Could you create a separate table with all the company classes and refer to it by index? This would probably speed up this query a lot since you would not have to do any varchar comparisons anymore.
Just some ideas.
You haven't provided any code or sufficient specific details but the usual causes are:
Emitting millions of ORM queries. Use SQL Profiler (or turn on Hibernate's SQL logging, as in the snippet below) to determine whether you are emitting many selects from code when a single stored procedure to do all the work would be more appropriate. This SO answer shows you how to enable Hibernate SQL logging.
Poorly indexed tables causing lots of large table scans. See How Can I Log and Find the Most Expensive Queries?. Determine your expensive queries and create indexes to improve their performance.
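For the Hibernate logging mentioned in the first point, a minimal example when configuring Hibernate programmatically (the same properties can also go in hibernate.cfg.xml or persistence.xml):
org.hibernate.cfg.Configuration cfg = new org.hibernate.cfg.Configuration();
cfg.setProperty("hibernate.show_sql", "true");     // print every SQL statement Hibernate emits
cfg.setProperty("hibernate.format_sql", "true");   // pretty-print the SQL for readability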
Note: @Adriaan Koster pointed out the lower case conversion performed in your query. By default, SQL Server is case insensitive (I've only come across one that wasn't in 20 years, and that was set up mistakenly), so you can almost certainly drop the conversions to lowercase. This would allow the query to use an appropriate index if one exists.
What's the fastest way to get a large volume of data from an Oracle database into Java objects?
Are there any Oracle tricks as to the way the data should be organised?
I was thinking of using plain JDBC rather than any Hibernate style libraries?
Would it be better to get Oracle to produce a file and then read from the file, even though this has to be done programmatically?
All thoughts appreciated.
I am not a Java or JDBC expert, but if you plan on pulling a lot of rows down from a database, you will likely benefit by increasing the prefetch rows on the connection.
// assumes the Oracle JDBC driver (ojdbc) is on the classpath; the URL below is a placeholder
Connection conn = DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
// Set the default row prefetch setting for this connection
((oracle.jdbc.OracleConnection) conn).setDefaultRowPrefetch(100);
I believe the Oracle JDBC driver's default is to prefetch only 10 rows per round trip, so for a large result set you're still paying for a great many round trips to the database. Setting prefetch to a larger number fetches more rows per round trip; speed increases can be dramatic depending on the number of rows and the performance of your network.
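A driver-neutral way to get a similar effect (the Oracle driver also honours this as a hint) is to set the fetch size on the statement itself; the table and column names below are just for illustration:
PreparedStatement ps = conn.prepareStatement("select id, payload from big_table");
ps.setFetchSize(500);                  // rows fetched per round trip (a hint to the driver)
ResultSet rs = ps.executeQuery();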
Depending on how far you want to go with this, I'd imagine that dropping JDBC and writing a custom application residing on the same machine as the DB, using the Oracle Call Interface (OCI) via JNI, would be the fastest...
It's probably much simpler to just use a plain prepared statement via JDBC, and then, if that's not enough (and depending on where the bottleneck is), try a stored procedure. The caching done by ORMs like Hibernate should not be discounted though, so I guess you'd have to do some benchmarks. Also, if the bottleneck is the database and you write a stored procedure that improves read performance, you could still use Hibernate to marshal the data into Java objects. See Using stored procedures for querying.
Whatever you wind up doing, design for and implement "lazy initialization" [this really only applies to complex object hierarchies/networks; you said Java objects (plural), so I'm imagining something more than just a single table that maps to a single object]. Basically, you only read in the objects that are needed at the time; when you call a getter, it makes additional DB calls for just that data.
Another trick sometimes overlooked in the Java world: if you have complex SQL coming from the code, you can instead create a view on the Oracle side that embeds that complexity, then map your object to the view. If you can flatten your object to match the view, you're in business.
I have a SELECT query with a lot of IF conditions, which I can evaluate either in the query itself (using the DB machine's CPU) or in my Java code (using the app server machine's CPU).
Is there a preferred approach here (conditions in the DB vs. in the mid-tier)?
UPDATE: My query is a join over more than two tables. I am using a LEFT JOIN to combine them, and some rows have a corresponding row in the second table while others do not. I need a default value for those columns when there is no corresponding row in the second table.
SELECT CASE WHEN t2.col1 IS NULL THEN 'default' ELSE t2.col1 END
FROM table1 t1
LEFT JOIN table2 t2 ON t1.id = t2.id
If it's really something that the DB cannot do any faster than the app server, and which actually reduces the load on the DB server if moved to the app server, then I'd move it to the app server.
The reason: if you reach the limits of your hardware, it's much easier to have multiple app servers than to have a clustered database.
However, the second condition above should be tested thoroughly: many things will not reduce (or even increase) the DB load if moved away from the DB.
Update: For the kind of thing you need, I doubt whether the first condition is satisfied - have you tested it? A simple CASE is completely insignificant, unless the condition or the branches contain some very expensive calculations.
Yes, though I would suggest another approach, one that adds no load to the app server and minimal load to the DBMS. It's a little hard to answer the question since you haven't provided a concrete example but I'll give it a shot.
My preferred solution is to get rid of the if conditions totally if you can. At a bare minimum, you can re-jig your database schema to move the cost of calculation away from the select (which happens a lot) and into the insert/update (which happens less often).
That's the normal case; I have seen databases that write more frequently than they read, but they're the exception rather than the rule.
By way of example, let's say you store person information and you want to get a list of people whose first name is more than 5 characters long. Don't ask why, I'm the customer, you have to give me what I want :-)
Rather than a monstrous select statement to (possibly) split apart the name and count the characters in it, do that as an insert/update trigger when the data enters the table - that's the only time when the value can change after all.
Put that calculation in another (indexed) column and use that column in your select. The cost of the calculation is amortised over all the selects, which will be blindingly fast.
It will take up more storage space but, if you compare the number of database "how can I make this faster?" questions against the number of "how can I use less space?" questions, you'll find the former greatly outweigh the latter.
And, yes, it does mean you store redundant data but the triggers mitigate the possibility of losing ACID properties. It's okay to bend rules if you know the possible consequences and how best to avoid them.
Based on your update, you should put the workload on to the machine where it causes the least impact. That may be the DBMS, it may be the app server, it may even be on the client side (of the app server) itself since that would distribute the cost across a lot of machines rather than concentrating it at a single point.
You should measure, not guess! Set up realistic performance test systems along with realistic production-quality data, then try the different approaches. That's the only real way to be certain.
I work on an application that is deployed on the web. Part of the app is search functions where the result is presented in a sorted list. The application targets users in several countries using different locales (= sorting rules). I need to find a solution for sorting correctly for all users.
I currently sort with ORDER BY in my SQL query, so the sorting is done according to the locale (LC_COLLATE) set for the database. These rules are incorrect for users whose locale differs from the one set for the database.
Also, to further complicate the issue, I use pagination in the application, so when I query the database I ask for rows 1 - 15, 16 - 30, etc. depending on the page I need. However, since the sorting is wrong, each page contains entries that are incorrectly sorted. In a worst case scenario, the entire result set for a given page could be out of order, depending on the locale/sorting rules of the current user.
If I were to sort in (server side) code, I need to retrieve all rows from the database and then sort. This results in a tremendous performance hit given the amount of data. Thus I would like to avoid this.
Does anyone have a strategy (or even technical solution) for attacking this problem that will result in correctly sorted lists without having to take the performance hit of loading all data?
Tech details: The database is PostgreSQL 8.3, the application an EJB3 app using EJB QL for data query, running on JBoss 4.5.
Are you willing to develop a small Postgres custom function module in C? (Probably only a few days for an experienced C coder.)
strxfrm() is the function that transforms the language-dependent text string based on the current LC_COLLATE setting (more or less the current language) into a transformed string that results in proper collation order in that language if sorted as a binary byte sequence (e.g. strcmp()).
If you implement this for Postgres, say it takes a string and a collation order, then you will be able to order by strxfrm(textfield, collation_order). I think you can then even create multiple functional indexes on your text column (say one per language) using that function to store the results of the strxfrm() so that the optimizer will use the index.
Alternatively, you could join the Postgres developers in implementing this in mainstream Postgres. Here are the wiki pages about this issue: Collation, ICU (which, as far as I know, is also used by Java).
Alternatively, as a less sophisticated solution if data input is only through Java, you could compute these strxfrm() values in Java (Java will probably have a different name for this concept) when you add the data to the database, and then let Postgres index and order by these precomputed values.
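In Java the concept is called a collation key: java.text.Collator can produce a byte array that sorts correctly as plain binary data, which you could store in an extra column per supported locale and ORDER BY instead of the text itself. A sketch (the variable and column names are made up):
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

Collator collator = Collator.getInstance(new Locale("sv", "SE"));   // one Collator per supported locale
CollationKey key = collator.getCollationKey(displayName);           // displayName is the text you sort on
byte[] sortKeySv = key.toByteArray();   // store e.g. in a bytea column name_sort_sv and ORDER BY it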
How tied are you to PostgreSQL? The documentation isn't promising:
The nature of some locale categories is that their value has to be fixed for the lifetime of a database cluster. That is, once initdb has run, you cannot change them anymore. LC_COLLATE and LC_CTYPE are those categories. They affect the sort order of indexes, so they must be kept fixed, or indexes on text columns will become corrupt. PostgreSQL enforces this by recording the values of LC_COLLATE and LC_CTYPE that are seen by initdb. The server automatically adopts those two values when it is started.
(Collation rules define how text is sorted.)
Google throws up a patch under discussion:
PostgreSQL currently only supports one collation at a time, as fixed by the LC_COLLATE variable at the time the database cluster is initialised.
I'm not sure I'd want to manage this outside the database, though I'd be interested in reading about how it can be done. (Anyone wanting a good technical overview of the issues should check out Sorting Your Linguistic Data inside the Oracle Database on the Oracle globalization site.)
I don't know of any way to switch the collation the database uses for ORDER BY on a per-query basis. Therefore, one has to consider other solutions.
If the number of results is really big (hundreds of thousands?), I have no solution other than showing only the number of results and asking the user to make a more precise request. Otherwise, doing it on the server side can work, depending on the precise conditions...
In particular, using a cache could improve things tremendously. The first (unlimited) request to the database would not be much slower than a query limited in the number of results, and subsequent requests would be much faster. Paging and re-sorting often lead to several requests, so the cache would work well (even with a lifetime of only a few minutes).
I use EhCache as the technical solution.
Sorting and paging go together: sort first, then page.
The raw results can be kept in the cache.
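A rough sketch with the EhCache 2.x API (the cache name, key format and the Result/runFullQuery types are made up):
net.sf.ehcache.CacheManager manager = net.sf.ehcache.CacheManager.create();
net.sf.ehcache.Cache cache = manager.getCache("searchResults");   // configured in ehcache.xml with a short time-to-live

String key = userId + ":" + queryKey;            // one cache entry per user/query combination
net.sf.ehcache.Element entry = cache.get(key);
if (entry == null) {
    List<Result> rows = runFullQuery();          // the expensive, unlimited database query
    entry = new net.sf.ehcache.Element(key, rows);
    cache.put(entry);
}
List<Result> all = (List<Result>) entry.getObjectValue();
// sort 'all' with a locale-aware Collator, then slice out the requested page (rows 1-15, 16-30, ...)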
To reduce the performance hit, some hints:
you can run the query once just to get the result set size, and warn the user if there are too many results (either asking for confirmation of a slow query, or asking for additional selection criteria)
only request the columns you need and drop all the others (often some data is not shown immediately for all results but only, say, on mouse-over; such data can be requested lazily, as needed, reducing the columns requested for all results)
if you have computed values, cache whichever is smaller: the database columns or the computed values
if the same values repeat across many results, you can request that data/those columns separately (so you retrieve them from the database once and cache them only once), and retrieve only a key (typically an id) in the main request.
You might want to check out this package: http://www.fi.muni.cz/~adelton/l10n/postgresql-nls-string/. It hasn't been updated in a long time and may not work anymore, but it seems like a reasonable starting point if you want to build a function that can do this for you.
This module is broken for Postgres 8.4.3. I fixed it - you can download the fixed version from http://www.itreport.eu/__cw_files/.01/.17/.ee7844ba6716aa36b19abbd582a31701/nls_string.c - and you'll have to compile and install it by hand (as described in the README and INSTALL of the original module), but even then the sorting works incorrectly. I tried it on FreeBSD 8.0 with LC_COLLATE set to cs_CZ.UTF-8.
I’ve been experiencing a performance problem with deleting blobs in derby, and was wondering if anyone could offer any advice.
This is primarily with 10.4.2.0 under windows and solaris, although I’ve also tested with the new 10.5.1.1 release candidate (as it has many lob changes), but this makes no significant difference.
The problem is that with a table containing many large blobs, deleting a single row can take a long time (often over a minute).
I’ve reproduced this with a small test that creates a table, inserts a few rows with blobs of differing sizes, then deletes them.
The table schema is simple, just:
create table blobtest( id integer generated BY DEFAULT as identity, b blob )
and I’ve then created 7 rows with the following blob sizes : 1024 bytes, 1Mb, 10Mb, 25Mb, 50Mb, 75Mb, 100Mb.
I’ve read the blobs back, to check they have been created properly and are the correct size.
They have then been deleted using the sql statement ( “delete from blobtest where id = X” ).
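(For reference, the insert side of the test is essentially the following, with the array size varied per row:)
PreparedStatement ins = conn.prepareStatement("insert into blobtest (b) values (?)");
byte[] data = new byte[25 * 1024 * 1024];                         // e.g. the 25 MB row
ins.setBinaryStream(1, new ByteArrayInputStream(data), data.length);
ins.executeUpdate();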
If I delete the rows in the order I created them, average timings to delete a single row are:
1024 bytes: 19.5 seconds
1 MB: 16 seconds
10 MB: 18 seconds
25 MB: 15 seconds
50 MB: 17 seconds
75 MB: 10 seconds
100 MB: 1.5 seconds
If I delete them in reverse order, the average timings to delete a single row are:
100 MB: 20 seconds
75 MB: 10 seconds
50 MB: 4 seconds
25 MB: 0.3 seconds
10 MB: 0.25 seconds
1 MB: 0.02 seconds
1024 bytes: 0.005 seconds
If I create seven small blobs, delete times are all instantaneous.
It thus appears that the delete time is related more to the overall size of the rows in the table than to the size of the blob being removed.
I’ve run the tests a few times, and the results seem reproducible.
So, does anyone have any explanation for the performance, and any suggestions on how to work around it or fix it? It does make using large blobs quite problematic in a production environment…
I have exactly the same issue you have.
I found that when I do a DELETE, Derby actually "reads through" the large segment file completely. I used Filemon.exe to observe how it runs.
My file size is 940 MB, and it takes 90 seconds to delete just a single row.
I believe that Derby stores the table data in a single file, and somehow a design/implementation bug causes it to read everything rather than work from a proper index.
I do batch deletes instead to work around this problem.
I rewrote a part of my program. It used to be "where id = ?" in auto-commit mode.
I then rewrote many things, and it is now "where ID IN (?, ..., ?)" enclosed in a transaction.
The total time was reduced to about 1/1000 of what it was before.
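In code, the rewritten delete is roughly the following (a sketch; 'ids' is one chunk of primary keys to remove):
// build "delete from blobtest where id in (?,?, ... ,?)" for this chunk of ids
StringBuilder sql = new StringBuilder("delete from blobtest where id in (");
for (int i = 0; i < ids.size(); i++) {
    sql.append(i == 0 ? "?" : ",?");
}
sql.append(")");

conn.setAutoCommit(false);                        // one transaction for the whole chunk
try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
    for (int i = 0; i < ids.size(); i++) {
        ps.setInt(i + 1, ids.get(i));
    }
    ps.executeUpdate();
    conn.commit();
}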
I suggest you add a "mark as deleted" column, with a scheduled job that does the actual deletion in batches.
As far as I can tell, Derby will only store BLOBs inline with the other database data, so you end up with the BLOB split up over a ton of separate DB page files. This BLOB storage mechanism is good for ACID, and good for smaller BLOBs (say, image thumbnails), but breaks down with larger objects. According to the Derby docs, turning autocommit off when manipulating BLOBs may also improve performance, but this will only go so far.
I strongly suggest you migrate to H2 or another DBMS if good performance on large BLOBs is important, and the BLOBs must stay within the DB. You can use the SQuirrel SQL client and its DBCopy plugin to directly migrate between DBMSes (you just need to point it to the Derby/JavaDB JDBC driver and the H2 driver). I'd be glad to help with this part, since I just did it myself, and haven't been happier.
Failing this, you can move the BLOBs out of the database and into the filesystem. To do this, you would replace the BLOB column in the database with a BLOB size (if desired) and location (a URI or platform-dependent file string). When creating a new blob, you create a corresponding file in the filesystem. The location could be based off of a given directory, with the primary key appended. For example, your DB is in "DBFolder/DBName" and your blobs go in "DBFolder/DBName/Blob" and have filename "BLOB_PRIMARYKEY.bin" or somesuch. To edit or read the BLOBs, you query the DB for the location, and then do read/write to the file directly. Then you log the new file size to the DB if it changed.
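A sketch of that approach (the column names blob_size and blob_location are hypothetical; the directory layout follows the example above):
// write the payload next to the database and record only its size and location in the row
File blobFile = new File("DBFolder/DBName/Blob", "BLOB_" + primaryKey + ".bin");
try (FileOutputStream out = new FileOutputStream(blobFile)) {
    out.write(payload);                                   // payload: the byte[] that used to go into the BLOB column
}
PreparedStatement ps = conn.prepareStatement(
        "update blobtest set blob_size = ?, blob_location = ? where id = ?");
ps.setLong(1, blobFile.length());
ps.setString(2, blobFile.toURI().toString());
ps.setInt(3, primaryKey);
ps.executeUpdate();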
I'm sure this isn't the answer you want, but for a production environment with throughput requirements I wouldn't use Java DB. MySQL is just as free and will handle your requirements a lot better. I think you are really just beating your head against a limitation of the solution you've chosen.
I generally only use Derby as a test case, and especially only when my entire DB can fit easily into memory. YMMV.
Have you tried increasing the page size of your database?
There's information about this and more in the Tuning Java DB manual which you may find useful.
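For what it's worth, Derby's page size is controlled by the derby.storage.pageSize property, which has to be in effect before the table is created (it does not change existing tables). A quick way to experiment, assuming an embedded database named blobdb:
// set before CREATE TABLE; 32768 is the largest page size Derby supports, as far as I know
System.setProperty("derby.storage.pageSize", "32768");
Connection conn = DriverManager.getConnection("jdbc:derby:blobdb;create=true");
conn.createStatement().executeUpdate(
        "create table blobtest( id integer generated BY DEFAULT as identity, b blob )");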