Is there a tool or a way to export the table data that was inserted on a given day into a file, and to have this job run every day?
If you are targeting specific tables that grow rapidly every day, I would suggest implementing daily partitions on those tables using interval partitioning on the column that determines the record's date. That way each day's data can easily be archived with an exchange partition, or backed up. You also need to make sure the chosen partitioning column is used in queries across the application, so that SQL statements benefit from partition pruning instead of scanning all partitions.
If you are targeting all the application tables in your database, my suggestion does not apply. Thanks.
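As a rough sketch of that suggestion, assuming an Oracle database and hypothetical table and column names (sales, created_dt), the daily interval partitioning and the exchange-partition archive step could look like this:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DailyPartitionSetup {
    public static void main(String[] args) throws Exception {
        // Connection details are placeholders.
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "app_user", "secret");
             Statement st = con.createStatement()) {

            // Interval-partitioned table: Oracle creates a new partition
            // automatically for each day's data (hypothetical table/columns).
            st.execute(
                "CREATE TABLE sales (" +
                "  id         NUMBER," +
                "  payload    VARCHAR2(4000)," +
                "  created_dt DATE NOT NULL" +
                ") PARTITION BY RANGE (created_dt)" +
                "  INTERVAL (NUMTODSINTERVAL(1, 'DAY'))" +
                "  (PARTITION p_initial VALUES LESS THAN (DATE '2024-01-01'))");

            // Archiving one day's partition: swap it with an empty staging table
            // of the same structure, then export or back up the staging table.
            st.execute("CREATE TABLE sales_archive_stage AS SELECT * FROM sales WHERE 1 = 0");
            st.execute(
                "ALTER TABLE sales EXCHANGE PARTITION FOR (DATE '2024-01-01')" +
                "  WITH TABLE sales_archive_stage");
        }
    }
}
```

The archive step could then be scheduled to run every day, for example with DBMS_SCHEDULER.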
I have a huge Cassandra table with about 13 million entries. This table serves as a kind of lookup table, which means there are no writes, only reads. I use DataStax Enterprise 4.8 (including Cassandra 2.1).
So the content is very static, but from time to time (every few months) there is an update of the content. The problem is that the old data can become outdated while new data appears, yet the old data is not overwritten (it stays in the table). It is necessary to remove the old data to keep the database clean.
I have one requirement: the database must be available during the update. It is okay to have a short period (a few minutes) where old and new data exist side by side.
I already thought about the following solutions:
Write the new table directly as an SSTable and exchange it with the old one
Do the update as a batch, with a truncate of the old data at the beginning
Create a new table (with a new name) and switch the table used by the program (while it is running)
Add a version column, add the new data with a new version, and delete the old data (with the old version) afterwards
Which of these solutions is the best one? Or, even better, is there a solution that solves my problem more elegantly?
Okay, after a lot of testing, here are my findings. All the mentioned measurements are based on 13 million datasets.
Write own SSTable
I have written a small Java tool that creates SSTables. Here you can find a good example of how to do this with the CQLSSTableWriter. After creating the SSTable, I used the sstableloader command-line tool (which comes with Cassandra) to import it into Cassandra.
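For reference, a minimal sketch of the SSTable-writing part, assuming cassandra-all 2.1.x on the classpath and a hypothetical lookup_ks.lookup table (the exact builder options can differ between versions):

```java
import org.apache.cassandra.io.sstable.CQLSSTableWriter;

public class SSTableGenerator {
    public static void main(String[] args) throws Exception {
        String schema = "CREATE TABLE lookup_ks.lookup (key text PRIMARY KEY, value text)";
        String insert = "INSERT INTO lookup_ks.lookup (key, value) VALUES (?, ?)";

        // Writes SSTable files into the given directory; the keyspace/table
        // directory layout is what sstableloader expects.
        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory("/tmp/sstables/lookup_ks/lookup")
                .forTable(schema)
                .using(insert)
                .build();

        // In the real tool the rows come from the export source; here just one row.
        writer.addRow("some-key", "some-value");
        writer.close();
    }
}
```

The generated directory can then be streamed into the cluster with sstableloader /tmp/sstables/lookup_ks/lookup.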
Conclusion
the creation of the SSTable goes pretty quick (~ 10 minutes)
the import of the SSTable is very slow (~ 6 hours)
you have to take care to use exactly the same Java library version (cassandra-all.jar) as your Cassandra version; otherwise the created SSTable can be incompatible with Cassandra
Import with CQL and version column
I have written a small Java tool that executes CQL commands to insert the datasets into Cassandra. Additionally, I added a version column, so that after the import I can remove the old data. The downside is that my only partition key is the version itself (chosen so that old datasets can be removed easily). To work around this, I indexed the table with Solr and use Solr queries to search in that table. The fact that the data is not distributed across individual nodes is okay for us; the search still works like a charm. At least the data is replicated across several nodes.
Conclusion
the duration of the import is ok (~ 1.5 hours)
the load on the Cassandra nodes goes up heavily; I still have to investigate how this affects the experience of the "normal users" (but a quick check shows that this is still fine)
Result
I will use the second solution because it is faster and you don't have to worry about matching library versions. All my tools use threading, so that is also a big knob for finding the best balance between concurrency and threading overhead. In the end I use a low number of threads in my tool (~8) combined with the executeAsync method of the DataStax Java driver.
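A rough sketch of that second approach, assuming a hypothetical lookup table whose partition key is the version column and the DataStax Java driver 2.1; capping the number of in-flight executeAsync requests is just one simple way to avoid flooding the nodes:

```java
import java.util.ArrayList;
import java.util.List;

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;

public class VersionedImport {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("lookup_ks");

        // Hypothetical schema: PRIMARY KEY (version, key) -> version is the partition key.
        PreparedStatement insert = session.prepare(
                "INSERT INTO lookup (version, key, value) VALUES (?, ?, ?)");

        int newVersion = 2;
        List<ResultSetFuture> inFlight = new ArrayList<>();

        for (String[] row : loadRows()) {                 // loadRows() is a placeholder
            BoundStatement bound = insert.bind(newVersion, row[0], row[1]);
            inFlight.add(session.executeAsync(bound));

            // Cap the number of outstanding async requests.
            if (inFlight.size() >= 256) {
                for (ResultSetFuture f : inFlight) {
                    f.getUninterruptibly();
                }
                inFlight.clear();
            }
        }
        for (ResultSetFuture f : inFlight) {
            f.getUninterruptibly();
        }

        // Once the new version is complete and verified, drop the old partition.
        session.execute("DELETE FROM lookup WHERE version = 1");

        cluster.close();
    }

    private static List<String[]> loadRows() {
        return new ArrayList<>();                          // placeholder data source
    }
}
```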
I am working on a Spring MVC application whose database is growing large. The space is consumed mostly by chat message history and other data, such as old notifications, that is not that useful.
Because of this, we thought of moving that data to text/XML files to give the DB some room to breathe and thereby improve query performance. Indexes are not that useful because of the large number of insertions.
I wanted to know if there is any way PostgreSQL or Hibernate supports such a task, where data is moved out of the DB and saved in plain files that can still be accessed, resulting in at least reasonable performance gains.
I have only started looking up some stuff, so I don't have much in hand to show. Kindly let me know if there are any questions you guys have.
Thanks a lot.
I would use the PostgreSQL JSON storage and have two databases:
the current operations DB, the one you are moving data away from to keep it slim
the archive database where old data is aggregated to save storage
This way you can move data from the current database into the archive database without compromising ACID properties, and you can aggregate the old data to simplify retrieval by grouping various related entities under some common root entity, which you'll then use to access your old data.
The current operations database remains small enough, while the archive database can be sharded. This makes it easier to configure the current database for high performance and the archive one for scalability.
Anyway, Hibernate doesn't support this out of the box, but you can implement it using custom Hibernate types and JTA transactions.
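As a rough illustration of the move step only, here is a plain-JDBC sketch. The table and column names (chat_message, created_at) and the archive location are hypothetical, and a real two-database setup would need two connections under JTA or postgres_fdw/dblink rather than a single statement; for simplicity the sketch assumes an archive schema in the same database and a json payload column:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ChatArchiver {
    public static void main(String[] args) throws Exception {
        // Connection details and all table/column names are placeholders.
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/app_db", "app", "secret")) {
            con.setAutoCommit(false);
            try (Statement st = con.createStatement()) {
                // Aggregate messages older than 90 days into the archive as JSON...
                st.executeUpdate(
                    "INSERT INTO archive.chat_message_archive (id, payload) " +
                    "SELECT m.id, row_to_json(m) " +
                    "FROM chat_message m " +
                    "WHERE m.created_at < now() - interval '90 days'");
                // ...then remove them from the hot table.
                st.executeUpdate(
                    "DELETE FROM chat_message " +
                    "WHERE created_at < now() - interval '90 days'");
            }
            con.commit();
        }
    }
}
```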
I'm using Spring, connecting to SQL Server 2008 R2 via JDBC.
All I need is to insert a large amount of data to a table in the database as fast as possible. I'm wondering which way is better:
Use Spring's batch insert, mentioned here
Create a stored procedure in the database and call it from the Java side
Which one is better?
It depends on two things: a stored procedure takes up time on the database side, whereas a batch insert takes up time on the program side. So, depending on which you are more concerned about, it is really up to you. I would prefer the batch insert, to keep the database's time free and reduce the errors that might occur. Hope this helps!
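If you go with the batch insert, the Spring side is essentially JdbcTemplate.batchUpdate; a minimal sketch with a hypothetical person table:

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

import org.springframework.jdbc.core.BatchPreparedStatementSetter;
import org.springframework.jdbc.core.JdbcTemplate;

public class PersonBatchDao {

    private final JdbcTemplate jdbcTemplate;

    public PersonBatchDao(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    /** Inserts all rows as a single JDBC batch instead of one statement per row. */
    public void insertAll(final List<Person> people) {
        jdbcTemplate.batchUpdate(
            "INSERT INTO person (first_name, last_name) VALUES (?, ?)",
            new BatchPreparedStatementSetter() {
                @Override
                public void setValues(PreparedStatement ps, int i) throws SQLException {
                    ps.setString(1, people.get(i).getFirstName());
                    ps.setString(2, people.get(i).getLastName());
                }
                @Override
                public int getBatchSize() {
                    return people.size();
                }
            });
    }

    public static class Person {
        private final String firstName;
        private final String lastName;
        public Person(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }
        public String getFirstName() { return firstName; }
        public String getLastName()  { return lastName; }
    }
}
```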
Spring Batch is an excellent framework and can be used as an ETL (Extract, Transform, Load) tool with respect to the database.
Spring Batch divides any import job into 3 steps:
1. Read: read data from any source. It can be another database, a file (XML, CSV or any other format) or anything else
2. Process: process the input data, validate it and possibly convert it to your required objects
3. Save: save the data into the database or any custom file format
Spring batch is useful when you need long running jobs with restart/resume capabilities.
It is also a lot slower than any direct DB import tool like impdp for Oracle. Spring Batch saves its state in the database, which is an overhead and takes a long time. You can hack Spring Batch so that it does not save its state in the DB, but that costs you the restart/resume capabilities.
So if speed is your prime requirement, you should choose some database-specific option.
But if you need to do some validation and/or processing, Spring Batch is an excellent option; you just need to configure it properly. Spring Batch also provides scalability and database independence.
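For illustration, a minimal sketch of a chunk-oriented Spring Batch step (read, process, save), assuming Spring Batch 4-style Java configuration; the PersonInput/Person types and the concrete reader, processor and writer beans are placeholders:

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class ImportStepConfig {

    @Bean
    public Step importStep(StepBuilderFactory steps,
                           ItemReader<PersonInput> reader,                // e.g. a FlatFileItemReader
                           ItemProcessor<PersonInput, Person> processor,  // validation/conversion
                           ItemWriter<Person> writer) {                   // e.g. a JdbcBatchItemWriter
        // Read -> process -> write in chunks of 1000; Spring Batch keeps the
        // restart/resume state in its metadata tables.
        return steps.get("importStep")
                .<PersonInput, Person>chunk(1000)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }

    public static class PersonInput { /* raw input record */ }
    public static class Person { /* validated domain object */ }
}
```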
I'm developing a single-user application that needs a database. Most tables will have a reasonable amount of data, but there are a few that may grow to a few million rows. None of my queries will return a large result set.
Anyone know if HSQLDB can handle such a large number of rows?
From the official HSQLDB page:
The latest version 2.2.9, released in August, supports up to 270 billion rows of data in a single database.
That said, it all depends on how you configure the server: by default HSQLDB uses memory tables, which will not fit in a standard-sized heap at that scale, so you'll have to use cached tables.
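As a small sketch (file-mode database, hypothetical lookup table), the table type is chosen in the DDL, and you can also make CACHED the default for subsequently created tables:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HsqldbSetup {
    public static void main(String[] args) throws Exception {
        // File-mode database; path and credentials are placeholders.
        try (Connection con = DriverManager.getConnection(
                "jdbc:hsqldb:file:data/appdb", "SA", "");
             Statement st = con.createStatement()) {

            // CACHED tables keep rows on disk, so they are not limited by heap size.
            st.execute("CREATE CACHED TABLE lookup (" +
                       "  id BIGINT PRIMARY KEY," +
                       "  value VARCHAR(200))");

            // Optionally make CACHED the default type for plain CREATE TABLE statements.
            st.execute("SET DATABASE DEFAULT TABLE TYPE CACHED");
        }
    }
}
```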
HSQLDB can handle millions of rows. You can try some of the test classes which can create large databases. For example:
http://hsqldb.org/web/hsqlPerformanceTests.html
Or here:
https://sourceforge.net/p/hsqldb/svn/HEAD/tree/base/trunk/src/org/hsqldb/test/
Check the TestCacheSize and TestStressInsert classes.
You should use the built-in backup capability and regularly back up the database.
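The built-in backup can be triggered with a plain SQL statement, for example from a scheduled job; a minimal sketch assuming the same file-mode database as above:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HsqldbBackup {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:hsqldb:file:data/appdb", "SA", "");
             Statement st = con.createStatement()) {
            // Writes a consistent, compressed copy of the database files into the
            // given directory while the database stays online.
            st.execute("BACKUP DATABASE TO 'backup/' BLOCKING");
        }
    }
}
```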
We are developing a SaaS-based application. One of the requirements is to record every change in the database tables, i.e. create a date/time-based version of the data. The client should be able to revert back to any version of the data.
I have almost 30 tables in the database, and the data insertion frequency is 80,000 records added/updated per day through bulk import. However, the client can also use the GUI to insert data through forms (in addition to bulk import).
Before creating any strategy to implement this requirement, I would love to have your comments/suggestions on how to implement this.
On a side note, I have reviewed this blog post and found it a very good starting point, but I still have doubts about how to restore past data.
A database snapshot is a promising solution, but as I said earlier, this is a SaaS-based application and we store multiple clients' data in a single database, so restoring a snapshot would restore data for other clients as well.
Please suggest any strategy/plan on how to execute this requirement.
If you plan on using JPA/Hibernate to fetch your data, you can give Envers a shot.
Envers is a JBoss open-source project for maintaining versions of database entities. You can mark individual columns or the entire entity with the @Audited annotation to start tracking audit history. By default it stores the audit data in tables with an _AUD suffix. It also provides an API to query historical data.
For details, please go through http://www.jboss.org/envers
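A brief sketch with a hypothetical ChatMessage entity: @Audited turns on versioning (history goes into a ChatMessage_AUD table by default) and AuditReader reads an entity as it was at a given revision:

```java
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.Id;

import org.hibernate.envers.AuditReader;
import org.hibernate.envers.AuditReaderFactory;
import org.hibernate.envers.Audited;

@Entity
@Audited   // every insert/update/delete is recorded in ChatMessage_AUD
public class ChatMessage {

    @Id
    private Long id;

    private String body;

    // getters/setters omitted

    /** Loads the state of a message as it was at the given Envers revision. */
    public static ChatMessage atRevision(EntityManager em, Long id, Number revision) {
        AuditReader reader = AuditReaderFactory.get(em);
        return reader.find(ChatMessage.class, id, revision);
    }
}
```

Because the history is kept row by row in ordinary tables, a revert can be scoped to a single client's rows, which sidesteps the database-snapshot concern raised in the question.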