I'm developing a single-user application that needs a database. Most tables will hold a reasonable amount of data, but a few may grow to a few million rows. None of my queries will return a large result set.
Anyone know if HSQLDB can handle such a large number of rows?
From the official HSQLDB page:
The latest version 2.2.9, released in August, supports up to 270 billion rows of data in a single database.
That said, it all depends on how you configure the database: by default HSQLDB uses MEMORY tables, which won't fit in a standard heap at that scale, so you'll have to use CACHED tables.
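As a quick illustration (a minimal sketch; the database path and table are hypothetical), you can either make CACHED the default table type via the connection URL or create individual tables as CACHED:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HsqlCachedTables {
    public static void main(String[] args) throws Exception {
        // hsqldb.default_table_type=cached makes plain CREATE TABLE statements produce CACHED tables,
        // which are stored on disk and only partially cached in memory
        try (Connection con = DriverManager.getConnection(
                "jdbc:hsqldb:file:data/bigdb;hsqldb.default_table_type=cached", "SA", "");
             Statement st = con.createStatement()) {

            // Or be explicit per table:
            st.execute("CREATE CACHED TABLE measurements (id BIGINT PRIMARY KEY, reading DOUBLE)");
        }
    }
}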
HSQLDB can handle millions of rows. You can try some of the test classes which can create large databases. For example:
http://hsqldb.org/web/hsqlPerformanceTests.html
Or here:
https://sourceforge.net/p/hsqldb/svn/HEAD/tree/base/trunk/src/org/hsqldb/test/
Check the TestCacheSize and TestStressInsert classes.
You should use the built-in backup capability and regularly back up the database.
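For example, a minimal sketch of HSQLDB's online backup statement (the target path is hypothetical and must end with '/'); the database stays available while the backup runs:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HsqlBackup {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:hsqldb:file:data/bigdb", "SA", "");
             Statement st = con.createStatement()) {
            // Writes a timestamped, compressed archive of the database files to the backup directory
            st.execute("BACKUP DATABASE TO 'backup/' BLOCKING");
        }
    }
}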
Related
I have a third-party system that generates a large amount of data each day (CSV files stored on FTP). There are 3 types of files being generated:
every 15 minutes (2 files). These files are pretty small (~2 MB)
every day at 5 PM (~200-300 MB)
every midnight (this CSV file is about 1 GB)
Overall, the size of the 4 CSVs is about 1.5 GB, but we should take into account that some of the files are generated every 15 minutes. This data also needs to be aggregated (not a hard process, but it will definitely take time). I need fast responses.
I am thinking about how to store this data and about the overall implementation.
We have a Java stack. The database is MS SQL Standard. From my measurements, MS SQL Standard shared with other applications won't handle such a load. What comes to my mind:
An upgrade to MS SQL Enterprise on a separate server.
Using PostgreSQL on a separate server. Right now I'm working on a PoC for this approach.
What would you recommend here? Probably there are better alternatives.
Edit #1
Those large files contain new data for each day.
Okay. After spending some time with this problem (reading, consulting, experimenting, doing several PoCs), I came up with the following solution.
Tl;dr
Database: PostgreSQL, as it handles CSV well and is free and open source.
Tool: Apache Spark is a good fit for this type of task and offers good performance.
DB
The database is an important decision: what to pick and how it will work in the future with this amount of data. It should definitely be a separate server instance, so it doesn't generate additional load on the main database instance or block other applications.
NoSQL
I thought about using Cassandra here, but that solution would be too complex right now. Cassandra does not have ad-hoc queries; its storage layer is basically a key-value store, which means you must model your data around the queries you need rather than around the structure of the data itself.
RDBMS
I didn't want to overengineer here, so I stopped my choice at an RDBMS.
MS SQL Server
It is a way to go, but the big downside is pricing: the Enterprise edition costs a lot of money given our hardware. Regarding pricing, you could read this policy document.
Another drawback was CSV support, and CSV will be our main data source here. MS SQL Server cannot reliably import or export CSV; typical problems are:
MS SQL Server silently truncating a text field.
MS SQL Server's text encoding handling going wrong.
MS SQL Server throwing an error message because it doesn't understand quoting or escaping.
More on that comparison could be found in the article PostgreSQL vs. MS SQL Server.
PostgreSQL
This database is a mature, well battle-tested product. I have heard a lot of positive feedback about it from others (of course, there are some trade-offs too). It has more classic SQL syntax, good CSV support, and on top of that it is open source.
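For illustration, here is a minimal plain-JDBC sketch of PostgreSQL's CSV support using the pgjdbc driver's COPY API; the table name, file path, and credentials are hypothetical:

import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

import java.io.FileReader;
import java.io.Reader;
import java.sql.Connection;
import java.sql.DriverManager;

public class CsvCopyLoad {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/warehouse", "etl", "secret");
             Reader csv = new FileReader("/data/incoming/daily.csv")) {

            CopyManager copy = new CopyManager((BaseConnection) con);
            // Server-side COPY is by far the fastest way to bulk-load CSV into PostgreSQL
            long rows = copy.copyIn(
                    "COPY raw_events FROM STDIN WITH (FORMAT csv, HEADER true)", csv);
            System.out.println("Loaded " + rows + " rows");
        }
    }
}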
It is worth mentioning that SSMS is way better than pgAdmin. SSMS has autocomplete and multiple result sets (when you run several queries you get all the results at once, whereas in pgAdmin you only get the last one).
Anyway, right now I'm using DataGrip from JetBrains.
Processing Tool
I've looked at Spring Batch and Apache Spark. Spring Batch is a bit too low-level for this task, and Apache Spark scales more easily if that becomes necessary in the future. That said, Spring Batch could do this work too.
Regarding an Apache Spark example, the code can be found in the learning-spark project.
My choice is Apache Spark for now.
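To give an idea of what the job looks like, here is a minimal sketch using Spark's Java API (2.x) that reads the daily CSV drop, does a trivial aggregation, and appends the result into PostgreSQL over JDBC. The paths, column names, table name, and credentials are all hypothetical, and the PostgreSQL JDBC driver must be on the classpath:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.Properties;

public class CsvToPostgres {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-to-postgres")
                .master("local[*]") // single node is enough for now
                .getOrCreate();

        // Read all CSV files from the daily drop directory (hypothetical path)
        Dataset<Row> csv = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("/data/incoming/*.csv");

        // Example aggregation before loading (hypothetical column name)
        Dataset<Row> aggregated = csv.groupBy("some_key").count();

        Properties props = new Properties();
        props.setProperty("user", "etl");
        props.setProperty("password", "secret");

        // Append the result into a PostgreSQL table over JDBC
        aggregated.write()
                .mode("append")
                .jdbc("jdbc:postgresql://dbhost:5432/warehouse", "daily_aggregates", props);

        spark.stop();
    }
}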
You might consider looking into the Apache Spark project. After validating and curating the data, you could use Presto to run queries.
You could use uniVocity-parsers to process the CSV as fast as possible, as this library comes with the fastest CSV parser around. I'm the author of this library; it is open source and free (Apache 2.0 license).
Now, for loading the data into a database, you could try the uniVocity framework (commercial). We use it to load massive amounts of data into databases such as SQL Server and PostgreSQL very quickly: from 25K to 200K rows/second, depending on the database and its configuration.
Here's a simple example of how the code to migrate your CSV data might look:
public static void main(String... args) {
    // Configure the CSV input directory
    CsvDataStoreConfiguration csv = new CsvDataStoreConfiguration("csv");
    csv.addEntitiesFromDirectory(new File("/path/to/csv/dir/"), "ISO-8859-1");

    // Grab column names from the CSV headers
    csv.getDefaultEntityConfiguration().setHeaderExtractionEnabled(true);

    javax.sql.DataSource dataSource = connectToDatabaseAndGetDataSource(); // specific to your environment

    // Configure the target database
    JdbcDataStoreConfiguration database = new JdbcDataStoreConfiguration("database", dataSource);

    // Use only for PostgreSQL - its JDBC driver requires converting the input Strings
    // from the CSV to the correct column types.
    database.getDefaultEntityConfiguration().setParameterConversionEnabled(true);

    DataIntegrationEngine engine = Univocity.getEngine(new EngineConfiguration(csv, database));

    // Create a mapping between the "csv" and "database" data stores
    DataStoreMapping mapping = engine.map(csv, database);

    // If the names of the CSV files and their columns match the database tables and
    // their columns, the mappings can be detected automatically
    mapping.autodetectMappings();

    // Load the database
    engine.executeCycle();
}
To improve performance, the framework lets you manage the database schema and perform operations such as dropping constraints and indexes, loading the data, and recreating them afterwards. Data and schema transformations are also very well supported if you need them.
Hope this helps.
Pentaho Data Integration (or a similar ETL tool) can handle importing the data into a SQL database and can do aggregation on the fly. PDI has a community edition and can be run stand-alone or via a Java API.
I have a huge Cassandra table with about 13 million entries. This table serves as a kind of lookup table, which means there are no writes, only reads. I use DataStax Enterprise 4.8 (including Cassandra 2.1).
So the content is very static, but from time to time (every few months) the content gets updated. The problem is that old data becomes outdated while new data appears, yet the old data is not overwritten (it stays in the table). The old data has to be removed to keep the database clean.
I have one requirement: the database must stay available during the update. It is okay to have a short period (a few minutes) where old and new data exist side by side.
I already thought about the following solutions:
Write the new table directly as an SSTable and exchange it with the old one
Do the update as a batch with a truncate of the old data at the beginning
Create a new table (with a new name) and switch the table used by the program (while it is running)
Add a version column, insert the new data with a new version, and delete the old data (with the old version) afterwards
Which of these solutions is the best one? Or even better, is there a solution that solves my problem more elegantly?
Okay, after a lot of testing, here are my findings. All the measurements mentioned below are based on 13 million datasets.
Write own SSTable
I have written a small Java tool that creates SSTables. Here you can find a good example of how to do this with CQLSSTableWriter. After creating the SSTable, I used the sstableloader command-line tool (which comes with Cassandra) to import it into Cassandra.
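As a rough illustration, here is a minimal sketch of the writer part using cassandra-all's CQLSSTableWriter; the keyspace, table, columns, and output path are hypothetical, and the partitioner has to match the cluster's:

import org.apache.cassandra.dht.Murmur3Partitioner;
import org.apache.cassandra.io.sstable.CQLSSTableWriter;

import java.io.File;

public class SSTableExport {
    private static final String SCHEMA =
            "CREATE TABLE lookup.entries (id text PRIMARY KEY, content text)";
    private static final String INSERT =
            "INSERT INTO lookup.entries (id, content) VALUES (?, ?)";

    public static void main(String[] args) throws Exception {
        // The output directory should follow the keyspace/table layout expected by sstableloader
        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory(new File("/tmp/sstables/lookup/entries"))
                .withPartitioner(new Murmur3Partitioner()) // must match the cluster's partitioner
                .forTable(SCHEMA)
                .using(INSERT)
                .build();

        // In the real tool the rows come from the exported new dataset
        for (int i = 0; i < 1000; i++) {
            writer.addRow("key-" + i, "content-" + i);
        }
        writer.close();
        // Then import it with: sstableloader -d <node> /tmp/sstables/lookup/entries
    }
}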
Conclusion
the creation of the SSTable is pretty quick (~10 minutes)
the import of the SSTable is very slow (~6 hours)
you have to take care to use the exact same Java library version (cassandra-all.jar) as your Cassandra version, otherwise the created SSTable can be incompatible with Cassandra
Import with CQL and version column
I have written a small Java tool that executes CQL commands to insert the datasets into Cassandra. Additionally, I added a version column, so after the import I can remove the old data. The downside is that my only partition key is the version itself (chosen so I can remove old datasets easily), which means I cannot query by the actual lookup columns directly. To work around this, I indexed the table with Solr and use Solr queries to search in that table. The fact that the data is not distributed between single nodes is okay for us; the search still works like a charm. At least the data is replicated between several nodes.
Conclusion
the duration of the import is okay (~1.5 hours)
the load on the Cassandra nodes goes up heavily; I still have to investigate how this affects the experience of the "normal users" (but a quick check shows that this is still fine)
Result
I will use the second solution because it is faster and you don't have to take care of the correct library versions. In all my tools I use threading, so here I also have a big knob to tune to find the best balance between concurrency and threading overhead. In the end I use a low number of threads in my tool (~8) combined with the executeAsync method of the DataStax Java driver.
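For reference, a minimal sketch of that second approach with the DataStax Java driver 2.x and executeAsync; the keyspace, table, column names, and version numbers are hypothetical:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;

import java.util.ArrayList;
import java.util.List;

public class VersionedImport {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("lookup")) {

            int newVersion = 42; // hypothetical new version number
            PreparedStatement insert = session.prepare(
                    "INSERT INTO entries (version, id, content) VALUES (?, ?, ?)");

            // Fire inserts asynchronously; the real tool feeds this loop from a small thread pool (~8 threads)
            List<ResultSetFuture> inFlight = new ArrayList<>();
            for (int i = 0; i < 1000; i++) {
                inFlight.add(session.executeAsync(insert.bind(newVersion, "key-" + i, "content-" + i)));
                if (inFlight.size() >= 256) { // simple back-pressure
                    for (ResultSetFuture f : inFlight) {
                        f.getUninterruptibly();
                    }
                    inFlight.clear();
                }
            }
            for (ResultSetFuture f : inFlight) {
                f.getUninterruptibly();
            }

            // Once the new version is fully loaded, drop the old partition
            session.execute("DELETE FROM entries WHERE version = ?", newVersion - 1);
        }
    }
}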
I am working on a Spring-MVC application where we are seeing the database grow large. The space is mostly consumed by chat message history and other things like old notifications, which are not that useful.
Because of this, we thought of moving that data out to text/XML files to give the DB some room to breathe and thereby improve query performance. Indexes are not that useful because there are too many insertions.
I wanted to know if PostgreSQL or Hibernate has support for such a task, where data is picked out of the DB and saved in plain files that can still be accessed, resulting in at least decent performance gains.
I have only started looking up some stuff, so I don't have much in hand to show. Kindly let me know if there are any questions you guys have.
Thanks a lot.
I would use the PostgreSQL JSON storage and have two databases:
the current operations DB, the one you are moving data away from to slim it down
the archive database where old data is aggregated to save storage
This way you can move data from the current database into the archive database without compromising ACID properties, and you can aggregate the old data to simplify retrieval by grouping related entities under some common root entity, which you'll then use to access your old data.
The current operations database stays small enough, while the archive database can be shared. It also becomes easier to configure the current operations database for high performance and the archive one for scalability.
Anyway, Hibernate doesn't support this out of the box, but you can implement it using custom Hibernate types and JTA transactions.
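To make the idea concrete, here is a plain-JDBC sketch (not the custom-Hibernate-type approach) that aggregates old chat messages per conversation into a JSON document in an archive database and then deletes them from the operational one. The table names, columns, retention window, and connection details are all hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ChatArchiver {
    public static void main(String[] args) throws Exception {
        try (Connection current = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/appdb", "app", "secret");
             Connection archive = DriverManager.getConnection(
                     "jdbc:postgresql://archivehost:5432/archivedb", "app", "secret")) {

            // Aggregate messages older than 90 days into one JSON document per conversation
            String select =
                    "SELECT conversation_id, " +
                    "       json_agg(json_build_object('sender', sender, 'body', body, 'sent_at', sent_at))::text " +
                    "FROM chat_message WHERE sent_at < now() - interval '90 days' " +
                    "GROUP BY conversation_id";

            try (PreparedStatement read = current.prepareStatement(select);
                 PreparedStatement write = archive.prepareStatement(
                         "INSERT INTO chat_archive (root_id, payload) VALUES (?, ?::jsonb)");
                 ResultSet rs = read.executeQuery()) {
                while (rs.next()) {
                    write.setLong(1, rs.getLong(1));
                    write.setString(2, rs.getString(2));
                    write.executeUpdate();
                }
            }

            // Only after archiving succeeded, remove the old rows from the operational DB
            try (PreparedStatement delete = current.prepareStatement(
                    "DELETE FROM chat_message WHERE sent_at < now() - interval '90 days'")) {
                delete.executeUpdate();
            }
        }
    }
}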
We are working on a solution that crunches log files generated by our systems and performs various analysis operations on these logs to produce different views that help triage issues, e.g. building a sequence of error messages that repeat across the logs.
Currently we load the log data into Java collections and do all operations by iterating/searching through these collections, which is hurting performance. We are thinking of loading the data into a database instead and firing queries against it to get optimized search results. For this we are considering an in-memory DB, which should perform better than a persistent store since disk reads/writes are minimized.
The amount of data to be analyzed at a time may go up to a few GB (2-4 GB) and hence may exceed the RAM size of the machine.
Question:
What options can be considered for such an in-memory DB? Is GridGain a good option for this?
Most of our solutions will be deployed on a single node, so distributed capabilities are not a priority. What other in-memory DBs can be recommended for this purpose?
You could try a column-store in-memory database. Column stores usually achieve a better compression ratio than row-store databases and are designed for analytical workloads. Examples are MonetDB (open source), Vertica, InfiniDB, and so on.
I am totally new to Hibernate, so this question might seem naive.
I am developing an application that requires in-memory tables, writing them to disk only at periodic intervals to reduce write operations. I could have done this with some complex data structures, but since my data is ultimately stored in a database on disk, I am looking for an in-memory database feature in Java. Does Hibernate allow me to do this?
It sounds like you are looking for an HSQLDB (Hypersonic SQL) in-memory database. If you use it, you can set a write delay property that delays the write to disk, like this:
<property name="connection.writedelay">100</property>
Have a look here.
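For completeness, here is a minimal plain-JDBC sketch of an embedded HSQLDB database with a write delay, using HSQLDB 2.x's SET FILES WRITE DELAY statement; the database path and table are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HsqlWriteDelayDemo {
    public static void main(String[] args) throws Exception {
        // "file:" keeps data on disk; "mem:" would be purely in-memory with no persistence
        try (Connection con = DriverManager.getConnection(
                "jdbc:hsqldb:file:data/appdb", "SA", "");
             Statement st = con.createStatement()) {

            // Flush changes to disk at most every 10 seconds
            st.execute("SET FILES WRITE DELAY 10");

            st.execute("CREATE CACHED TABLE events (id INT PRIMARY KEY, msg VARCHAR(200))");
            st.execute("INSERT INTO events VALUES (1, 'hello')");
        }
    }
}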
Sure, you can use Hypersonic SQL (HSQLDB) in in-memory mode.