I have a small web application configured with Guice, Jersey and EclipseLink, and run this application on jetty (8.0.0.M1) during development. There are about 10 (small) JPA managed classes (entities and embeddables), and about 20 classes total.
The initial startup takes 15 seconds + 5 seconds for the first requests. It seems like JPA is working on the first request, since I have the table generation strategy "create" enabled and see some JPA output from Maven on the first request.
A reload takes about 10 seconds and the first request after reloading takes about 3 to 4 seconds.
You may think that this startup time is not so bad, but I'm wondering whether I could speed it up so that development feels as fluid as it does with Django. Any ideas for startup tuning?
I'm afraid that if you are not prepared to remove the table creation strategy, you will have to tolerate such loading times. In essence, every time you start up your application, it will drop/create/verify the tables and issue the DDL statements needed to make them match the entities in your package.
Assuming that you're done defining your entities and you are working on some business-logic code, you can create the database once, and just re-use your initial setup.
I imagine you are using Jetty for rapid application development (RAD) and you want to see and test out any changes as quickly as possible. If there is no actual "persistent" requirement on your RAD environment's database, you could try moving to an in-memory DB engine. DB engines like HSQL allow you to spin up new tables (and other structures) very rapidly compared to production-quality DB engines. This would require that you use an ORM, because HSQL's SQL is very different from that of most other databases, but it sounds like you are already using JPA, so this shouldn't be difficult.
The only alternative I see is using a database which has its schema already created appropriately and not dropping it every time.
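Reusing an already-created schema, as suggested above, mostly comes down to changing one provider setting. The following is a minimal sketch, assuming EclipseLink and its `eclipselink.ddl-generation` property; the `DdlSettings` helper and its `schemaAlreadyExists` flag are hypothetical names introduced only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: builds JPA property overrides for EclipseLink.
class DdlSettings {

    // "none" leaves the existing schema alone; "create-tables" makes
    // EclipseLink issue CREATE TABLE statements at startup.
    static Map<String, String> forStartup(boolean schemaAlreadyExists) {
        Map<String, String> props = new HashMap<>();
        props.put("eclipselink.ddl-generation",
                  schemaAlreadyExists ? "none" : "create-tables");
        return props;
    }

    public static void main(String[] args) {
        // In a real application this map would be passed as the second argument of
        // javax.persistence.Persistence.createEntityManagerFactory(unitName, props).
        System.out.println(forStartup(true));
    }
}
```

Whether this removes most of the startup delay depends on how much of the time is actually spent on DDL, so it is worth measuring before and after.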
I am trying to understand when to use schema.sql db creation technique and when to rely on Spring boot's creation based on my entity classes. How to decide?
Let's leave schema.sql aside for the moment.
ORM automatic schema creation (create, update or create-drop) is normally useful during development of the application. It is still useful even during release candidates and QA reviews, because changes the development team makes in response to issues can go live much faster.
When the application reaches a critical phase and is mature in production, then any changes happening automatically from ORM could be considered dangerous.
At this stage, big companies would normally roll out to the production database only SQL scripts that have been reviewed and tested before the rollout happens. A rollback SQL script might also be necessary.
So normally the ORM affects the database schema only during the early stages of an application, and not once it has matured in production.
Let's now come back to schema.sql. This file can be used just once to create the database from some SQL commands. It would also not be expected to run every time the application starts, at least not in the majority of applications.
I think a logical approach would be to use the ORM during the initial stages of development and QA. Then, when you are about to reach a mature phase, you inspect the database the ORM has created and review it manually to be sure of everything and to apply any optimizations that make sense. At this stage you can create your own permanent schema.sql from the schema the ORM has already generated.
The obvious reason for the above is that with schema.sql you have 100% control of how the database is created. Using an ORM, you depend on the ORM provider to build it for you. A new version of the provider's library might change how the schema is generated, among many other things. The result is that with an ORM you don't have 100% control over the database that gets created.
I'm investigating the possibility of using neo4j to handle some of the queries of our java web application that simply take too long to run on MSSQL as they require so many joins on large tables, even with indexes implemented.
I am, however, concerned about how long the ETL might take to complete, and therefore how outdated the information may be when queried.
Can someone advise on either a production strategy or toolkit / library that can assist in reading a production sql-server database (using deltas if possible to optimise) and updating a running instance of a neo4j database? I imagine that there will have to be some kind of mapping configuration but the idea is to have this run in an automated manner, updating the neo4j database with one or more sql-server table or view contents.
The direct way to connect a MS SQL database to a Neo4j database would be using the apoc.load.jdbc procedure.
For an initial load you can use Neo4j ETL (https://neo4j.com/blog/rdbms-neo4j-etl-tool/).
There is, however, no way around the fact that some planning and work will be involved if you want to keep two databases continuously in sync (and if the logic involved goes beyond a few simple queries). You might want to offload a delta every so often (monthly, daily, hourly, ...) into CSV files and load those with LOAD CSV, using Cypher syntax to determine what needs to be added, removed, changed or connected.
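The delta offload described above could look roughly like the sketch below. It assumes the source table carries a last-modified timestamp and that the previous sync time is stored somewhere; `PersonRow`, `DeltaExport` and the column names are hypothetical. In a real job the filter would normally sit in the SQL `WHERE` clause rather than in Java.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical row shape: assumes a last-modified timestamp on the source table.
class PersonRow {
    final long id;
    final String name;
    final long modifiedAtMillis;

    PersonRow(long id, String name, long modifiedAtMillis) {
        this.id = id;
        this.name = name;
        this.modifiedAtMillis = modifiedAtMillis;
    }
}

class DeltaExport {

    // Returns CSV lines (header first) for rows changed after the previous sync.
    static List<String> toCsv(List<PersonRow> rows, long lastSyncMillis) {
        List<String> lines = new ArrayList<>();
        lines.add("id,name");
        for (PersonRow row : rows) {
            if (row.modifiedAtMillis > lastSyncMillis) {
                lines.add(row.id + "," + quote(row.name));
            }
        }
        return lines;
    }

    // Minimal CSV quoting: wrap in double quotes and double any embedded quotes.
    static String quote(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        List<PersonRow> rows = new ArrayList<>();
        rows.add(new PersonRow(1, "Ann", 100L));
        rows.add(new PersonRow(2, "Bob", 250L));
        for (String line : toCsv(rows, 200L)) {
            System.out.println(line);
        }
    }
}
```

A timestamp filter like this only sees inserts and updates; rows deleted at the source need separate handling, for example a tombstone table.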
Sadly enough there's no such thing as a free lunch.
Hope this helps.
I have a Java EE application that runs on JBoss 7. In order to do TDD, I'm setting up embedded tests that use Arquillian with embedded Weld and an embedded H2 database.
This works fine, but the initial startup of Hibernate takes a considerable amount of time (5-10 seconds) and I haven't included all JPA entities yet. I have tried to use a persisted Oracle DB instead to avoid table creation, but it doesn't make much of a difference.
The main problem seems to be that Hibernate goes through all the entities and validates and prepares all the CRUD methods, named queries and so on.
Is there any way to tell Hibernate to do this lazily when needed (or not at all)? Most of the time, there will only be a subset of entities and queries involved in a test case, so I'd happily trade in execution time for start-up time while implementing.
Any ideas?
I know I could just use a subset of the entities, but it's sometimes difficult as they often have relations to other entities not needed in a test context. Or is there an easy way to 'deactivate' such relations to generate subsets of the database?
It seems like it's not clear what my problem is, so I'll try to clarify:
I have set up a testing environment with Arquillian (embedded Weld) that sets up an embedded database (H2) to do JPA-enabled testing.
I would like to use this approach to do Test Driven Development (TDD), which means I will have the following workflow on my local developing machine:
1. Create test case
2. Run test case
3. If test case fails, implement necessary changes and go back to 2.
Normally, one will perform steps 2 and 3 a couple of times before finishing a feature which means that I will often run a single test from my IDE that has to set up the entire testing JVM with Arquillian, Weld, embedded DB and whatever to run just A SINGLE TEST.
So much for the scenario. Now I've noticed that running that single test takes around 10 seconds, which is not the end of the world, but rather long to do TDD. And when I further investigated, I've noticed that most of this time goes to Hibernate setting up its internal structures (it's not Weld, Arquillian, Schema creation or whatever, but Hibernate getting ready to provide an EntityManager).
So my question is: Is there a way to speed up hibernate initialization so I can drop these 10 seconds to maybe 1-2 seconds? I wouldn't care if it's sort of a hack (like keeping the testing JVM with hibernate alive during multiple manual test runs or deactivating some validations or optimizations of Hibernate). My only issue is the start up time for a single test. Consecutive tests run fine and fast, so I don't have a problem with full regression testing or with testing on my build server.
Hope that makes my case a bit clearer...
Let me guess: you are using the JUnit library?
Database schema creation usually isn't the most time-consuming operation (although that depends on the number of entities).
Personally, if I were you, I would run all your JUnit tests with TestNG (yes, TestNG can run all JUnit tests out of the box) and would take a look at <your-module>/test-output/index.html (particularly its sections Chronological view and Times). Then you would know what operations are the most time-consuming. And with such information, you can go further.
Furthermore, a few pieces of advice:
Arquillian tests with a remote (or managed) container are usually faster, because you start the server only once.
An embedded H2 database isn't a bad choice, really. Typically you don't have to abandon it (typically, because sometimes your application may use features of the target database that are not present in H2).
There is the Arquillian Suite Extension, which lets you deploy only once and reuse the deployment across test classes. With many tests, this extension can speed up test execution significantly.
You can test your entities outside any container (so-called standalone JPA with transaction-type="RESOURCE_LOCAL"). It is the fastest way, but it is well suited only to testing entity annotations, relations, queries and so on.
There are a number of ways to do your task much better. Just a couple of points:
You don't have to re-create the database every time. Even embedded databases (H2, HSQLDB, Derby) have a server mode in which they outlive the JVM.
If Hibernate initialization bothers you, then do it once across all the tests (I mean keeping the EntityManagerFactory between unit tests).
If you want to avoid database creation, just use H2 in server mode and set Hibernate to make no schema changes (<property name="hibernate.hbm2ddl.auto" value="none" />)
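Keeping the EntityManagerFactory between tests amounts to running the expensive initializer at most once per JVM. Here is a minimal sketch of that idea using only the standard library; `SharedResource` and `SharedResourceDemo` are hypothetical names, and the string returned by the initializer is a stand-in for a real call such as `Persistence.createEntityManagerFactory(...)`.

```java
import java.util.function.Supplier;

// Generic holder that runs an expensive initializer at most once per JVM.
final class SharedResource<T> {
    private final Supplier<T> initializer;
    private T instance;

    SharedResource(Supplier<T> initializer) {
        this.initializer = initializer;
    }

    // Synchronized so that concurrently running tests still trigger one initialization.
    synchronized T get() {
        if (instance == null) {
            instance = initializer.get();
        }
        return instance;
    }
}

class SharedResourceDemo {
    static int initCount = 0;

    // Stand-in for the slow factory creation; a real test suite would build
    // the EntityManagerFactory inside this initializer instead.
    static final SharedResource<String> FACTORY = new SharedResource<>(() -> {
        initCount++;
        return "expensive-factory";
    });

    public static void main(String[] args) {
        FACTORY.get();
        FACTORY.get();
        System.out.println("initialized " + initCount + " time(s)");
    }
}
```

This only pays off when several test classes run in the same JVM. It does not shorten the first initialization when a single test is launched in a fresh JVM from the IDE.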
I am using Spring 2.5 and the Hibernate that goes with it. I'm running against an Oracle 11g database.
I have created my DAOs, which extend HibernateTemplate. Now I want to write a loader that inserts 5 million rows into my person table. I have written this in a simple-minded fashion: read a row from a CSV file, turn it into a person, and save it into the table. Keep doing this until the CSV file is empty.
The problem is that I run out of heap space at about 450000 rows. So I double the size of memory from 1024m to 2048m and now I run out of memory after about 900000 rows.
So I've read some things about turning off the query cache for Hibernate, but I'm not using an L2 cache, so I don't think this is the issue.
I've read some things about JDBC2 batching, but I don't think that applies to Hibernate.
So, I'm wondering if maybe there's a fundamental thing about Hibernate that I'm missing.
To be honest, I wouldn't be using Hibernate for that. ORMs are not designed to load millions of rows into DBs. I'm not saying that you can't, but it's a bit like digging a swimming pool with an electric drill; you'd use an excavator for that, not a drill.
In your case, I'd load the CSV directly into the DB with the loader application that comes with the database. If you don't want to do that, then yes, batch inserts will be way more efficient. I don't think Hibernate lets you do that easily, though. If I were you I'd just use plain JDBC, or at most Spring JDBC.
If you have complicated business logic in the entities and absolutely have to use Hibernate, you could flush every N records as Richard suggests. However, I'd consider that a pretty bad hack.
In my experience with EclipseLink, holding a single transaction open while inserting/updating many records results in the symptoms you've experienced.
You are working with an EntityManager (of some sort, JPA or Hibernate specific; it's still managing entities). It tries to keep the working set in memory for the life of the transaction.
A general solution was to commit and then restart the transaction after every N inserts; a typical N for me was 1000.
As a footnote, with some version (undefined, it's been a few years) of EclipseLink, a session flush/clear didn't solve the problem.
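The commit-and-restart pattern described above can be sketched without a database as follows. `RowSink`, `ChunkedLoader` and `CountingSink` are hypothetical names; in a real loader `begin()` and `commitAndClear()` would map onto the transaction and session calls of the persistence provider in use (for example committing the transaction and clearing the EntityManager).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical abstraction over the persistence layer so the sketch runs standalone.
interface RowSink<T> {
    void begin();
    void persist(T row);
    void commitAndClear();
}

class ChunkedLoader {

    // Persists all rows, committing after every batchSize rows.
    // Returns the number of commits performed.
    static <T> int load(Iterator<T> rows, RowSink<T> sink, int batchSize) {
        int commits = 0;
        int inBatch = 0;
        while (rows.hasNext()) {
            if (inBatch == 0) {
                sink.begin(); // start a fresh transaction for each chunk
            }
            sink.persist(rows.next());
            inBatch++;
            if (inBatch == batchSize) {
                sink.commitAndClear();
                commits++;
                inBatch = 0;
            }
        }
        if (inBatch > 0) { // commit the final, partial chunk
            sink.commitAndClear();
            commits++;
        }
        return commits;
    }
}

// In-memory stand-in that records how many rows were ever pending at once.
class CountingSink implements RowSink<Integer> {
    int pending = 0;
    int maxPending = 0;
    int persisted = 0;

    public void begin() { pending = 0; }

    public void persist(Integer row) {
        pending++;
        persisted++;
        maxPending = Math.max(maxPending, pending);
    }

    public void commitAndClear() { pending = 0; }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            rows.add(i);
        }
        CountingSink sink = new CountingSink();
        int commits = ChunkedLoader.load(rows.iterator(), sink, 1000);
        System.out.println(commits + " commits, max pending " + sink.maxPending);
    }
}
```

The point of the design is that the number of rows held between commits never exceeds the batch size, so memory use stays flat regardless of the file size.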
It sounds like you are running out of space due to your first-level cache (the Hibernate session). You can flush the Hibernate session periodically to keep memory usage down, and break up the work into chunks by committing every few thousand rows, keeping the database's transaction log from getting too big.
But using Hibernate for a load task like that will be slow, because row-by-row inserts over JDBC are slow. If you have a good idea of what the environment will be like, you have a cap on the amount of data, and you have a big enough window for processing, then you can manage. But if you want it to work at multiple different client sites and you want to minimize the time spent figuring out why some client site's load job is not working, then you should go with the database's bulk-copy tool.
The bulk-copy approach means the database suspends all constraint checking, index-building and transaction logging, and instead concentrates on slurping the data in as fast as possible. Because JDBC doesn't get anything like this level of cooperation from the database, it can't compete. At a previous job we replaced a JDBC loader task that took over 8 hours to run with a SQL*Loader task that took 20 minutes.
You do sacrifice database independence, but all databases have a bulk-copy tool (because DBAs rely on them), so you will have a very similar process for each database; only the executable you invoke and the way the file format is specified should change. And this way you make the best use of your processing window.
I've got an Oracle database that has two schemas in it which are identical. One is essentially the "on" schema, and the other is the "off" schema. We update data in the off schema and then switch the schemas behind an alias which our production servers use. Not a great solution, but it's what I've been given to work with.
My problem is that there is a separate application that will now be streaming data to the database (also handed to me) which is currently only updating the alias, which means it is only updating the "on" schema at any given time. That means that when the schemas get switched, all the data from this separate application vanishes from production (the schema it is in is now the "off" schema).
This application is using Hibernate 3.3.2 to update the database. There's Spring 3.0.6 in the mix as well, but not for the database updates. Finally, we're running on Java 1.6.
Can anyone point me in a direction to updating both "on" and "off" schemas simultaneously that does not involve rewriting the whole DAO layer using Spring JDBC to load two separate connection pools? I have not been able to find anything about getting hibernate to do this. Thanks in advance!
You shouldn't be updating two separate databases this way, especially from the application's point of view. All it should know or care about is whether or not the data is there; it should not have to mess with two separate databases.
Frankly, this sounds like you may need to purchase an ETL tool. Even if you can't get it to update the 'on' schema from the 'off' one (fast enough to be practical), you will likely be able to use it to keep the two in sync (mirror changes from 'on' to 'off').
HA-JDBC is a replicating JDBC Driver we investigated for a short while. It will automatically replicate all inserts and updates, and distribute all selects. There are other database specific master-slave solutions as well.
On the other hand, I wouldn't recommend doing this for 4-8 hour procedures. It is better to lock the database first, update one database, then backup-restore a copy, and then unlock again.