I have a very large Cassandra table with about 13 million entries. It serves as a lookup table, which means there are no writes, only reads. I use DataStax Enterprise 4.8 (which includes Cassandra 2.1).
The content is mostly static, but every few months it gets updated. The problem is that some old data becomes outdated and new data appears, yet the outdated data is not overwritten; it stays in the table. The old data has to be removed to keep the database clean.
I have one requirement: the database must be available during the update. It is okay to have a short period (a few minutes) where old and new data exist side by side.
I already thought about the following solutions:
Write the new table directly as an SSTable and exchange it with the old one
Do the update as a batch, with a truncate of the old data at the beginning
Create a new table (with new name) and change the used table in the program (while running)
Add a version column, add new data with new version and delete old data (with old version) afterwards
Which of these solutions is the best one? Or even better, is there a solution that solves my problem more elegantly?
Okay, after a lot of testing, here are my findings. All the mentioned measurements are based on 13 million datasets.
Write own SSTable
I have written a small Java tool that creates SSTables. Here you can find a good example of how to do this with the CQLSSTableWriter. After creating the SSTable I used the sstableloader command line tool (ships with Cassandra) to import it into Cassandra.
Conclusion
the creation of the SSTable goes pretty quick (~ 10 minutes)
the import of the SSTable is very slow (~ 6 hours)
you have to make sure that you use exactly the same version of the Java library (cassandra-all.jar) as your Cassandra version, otherwise the created SSTable may be incompatible with Cassandra
Import with CQL and version column
I have written a small Java tool that executes CQL commands to insert the datasets into Cassandra. Additionally, I added a version column so that I can remove the old data after the import. The downside is that the version is my only partition key (chosen so that I can remove old datasets easily). To work around this, I indexed the table with Solr and use Solr queries to search it. The fact that the data is not distributed across nodes is okay for us; the search still works like a charm. At least the data is replicated between several nodes.
Conclusion
the duration of the import is ok (~ 1.5 hours)
the load of the Cassandra nodes goes up heavily, I still have to investigate how this influences the experience of the "normal users" (but a quick check shows that this is still fine)
Result
I will use the second solution because it is faster and you don't have to take care of the correct library versions. All my tools use threading, so this gives me a big adjusting screw for finding the best balance between concurrency and threading overhead. In the end I use a low number of threads in my tool (~8) together with the executeAsync method of the DataStax Java driver.
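The trade-off between a few threads and many in-flight asynchronous requests can be sketched roughly like this. It is a minimal illustration in plain Java, not the tool described above: `insertAsync` is a hypothetical stand-in for a driver call such as `executeAsync`, and the limit of 128 in-flight requests is an arbitrary assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Sketch: cap the number of in-flight asynchronous writes with a semaphore. */
class ThrottledWriter {

    /**
     * Sends every row through insertAsync (hypothetical stand-in for the
     * driver call) while keeping at most maxInFlight requests pending.
     * Returns the number of writes that completed successfully.
     */
    static <T> int writeAll(List<T> rows,
                            Function<T, CompletableFuture<Void>> insertAsync,
                            int maxInFlight) {
        Semaphore permits = new Semaphore(maxInFlight);
        AtomicInteger succeeded = new AtomicInteger();

        for (T row : rows) {
            permits.acquireUninterruptibly();        // blocks while the limit is reached
            insertAsync.apply(row).whenComplete((ok, error) -> {
                if (error == null) {
                    succeeded.incrementAndGet();
                }
                permits.release();                   // free the slot on success or failure
            });
        }
        // Taking every permit back means no request is still in flight.
        permits.acquireUninterruptibly(maxInFlight);
        permits.release(maxInFlight);
        return succeeded.get();
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add(i);
        }
        // The lambda only pretends to write; a real tool would call the driver here.
        int written = writeAll(rows,
                row -> CompletableFuture.runAsync(() -> { }), 128);
        System.out.println("written: " + written);
    }
}
```

With this shape the number of submitting threads can stay small, because the semaphore rather than the thread count decides how much load reaches the cluster.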
I do UI vs DB validation for every step. Around 30 to 40 queries are in use now. Earlier we stored them in an Excel sheet, which caused performance issues. Later we switched to creating a class .txt file for each query. Please suggest the best approach.
The fastest way is to create stored procedures for these queries. If you run the same queries frequently, they are compiled on the first run and not recompiled afterwards, which improves performance.
I have a third party system that generates a large amount of data each day (those are CSV files that are stored on FTP). There are 3 types of files that are being generated:
every 15 minutes (2 files). These files are pretty small (~2 MB)
every day at 5 PM (~200 - 300 MB)
every midnight (this CSV file is about 1 GB)
Overall, the size of the 4 CSVs is 1.5 GB. But we should take into account that some of the files are generated every 15 minutes. The data also has to be aggregated (not a hard process, but it will definitely take time). I need fast responses.
I am thinking about how to store this data and about the overall implementation.
We have a Java stack. The database is MS SQL Standard. From my measurements, MS SQL Standard shared with other applications won't handle such a load. What comes to my mind:
This could be an upgrade to MS SQL Enterprise with the separate server.
Usage of PostgreSQL on a separate server. Right now I'm working on PoC for this approach.
What would you recommend here? Probably there are better alternatives.
Edit #1
Those large files are new data for each day.
Okay. After spending some time on this problem (reading, consulting, experimenting, and doing several PoCs), I came up with the following solution.
Tl;dr
Database: PostgreSQL as it is good for CSV, free and open source.
Tool: Apache Spark is a good fit for such type of tasks. Good performance.
DB
The database is an important decision: what to pick and how it will cope with this amount of data in the future. It should definitely be a separate server instance, so that it neither puts additional load on the main database instance nor blocks other applications.
NoSQL
I thought about using Cassandra here, but this solution would be too complex right now. Cassandra does not support ad-hoc queries. Cassandra's data storage layer is basically a key-value storage system, which means that you must "model" your data around the queries you need, rather than around the structure of the data itself.
RDBMS
I didn't want to overengineer here, so I settled on a relational database.
MS SQL Server
It is a possible way to go, but the big downside is pricing. It is pretty expensive: the Enterprise edition costs a lot of money given our hardware. Regarding pricing, you can read this policy document.
Another drawback was the support for CSV files, which will be our main data source. MS SQL Server handles CSV import and export poorly. Typical problems:
MS SQL Server silently truncating a text field.
MS SQL Server's text encoding handling going wrong.
MS SQL Server throwing an error message because it doesn't understand quoting or escaping.
More on that comparison can be found in the article PostgreSQL vs. MS SQL Server.
PostgreSQL
This database is a mature, battle-tested product. I have heard a lot of positive feedback on it from others (of course, there are some tradeoffs too). It has a more classic SQL syntax and good CSV support, and it is open source.
It is worth mentioning that SSMS is way better than pgAdmin. SSMS has autocomplete and shows multiple results (when you run several queries you get all the results at once, whereas in pgAdmin you get only the last one).
Anyway, right now I'm using DataGrip from JetBrains.
Processing Tool
I've looked at Spring Batch and Apache Spark. Spring Batch is a bit too low-level for this task, and Apache Spark makes it easier to scale if that is needed in the future. That said, Spring Batch could do this work too.
For an Apache Spark example, the code can be found in the learning-spark project.
My choice is Apache Spark for now.
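Whichever tool runs it, the aggregation step itself is small. A minimal sketch in plain Java (deliberately not Spark), assuming a hypothetical two-column `key,amount` layout with a header row, since the real file layout is not described here:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: stream a CSV line by line and sum a numeric column per key. */
class CsvAggregator {

    /** Assumes a header row and the hypothetical columns "key,amount". */
    static Map<String, Long> sumByKey(Reader input) {
        Map<String, Long> totals = new TreeMap<>();
        try (BufferedReader reader = new BufferedReader(input)) {
            String line = reader.readLine();             // skip the header row
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue;                            // ignore blank lines
                }
                // Naive split: does not handle quoted or escaped fields.
                String[] cols = line.split(",", -1);
                totals.merge(cols[0].trim(), Long.parseLong(cols[1].trim()), Long::sum);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return totals;
    }

    public static void main(String[] args) {
        String csv = "key,amount\nsensor-a,10\nsensor-b,5\nsensor-a,7\n";
        System.out.println(sumByKey(new StringReader(csv)));
    }
}
```

Only the running totals stay in memory, so the 1 GB file does not need to fit in the heap. Real input with quoted fields would need a proper CSV parser instead of `split`.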
You might consider looking into the Apache Spark project. After validating and curating the data maybe use Presto to run queries.
You could use uniVocity-parsers to process the CSV as fast as possible, as this library comes with the fastest CSV parser around. I'm the author of this library, and it is open source and free (Apache V2 License).
For loading the data into a database, you could try the univocity framework (commercial). We use it to load massive amounts of data into databases such as SQL Server and PostgreSQL very quickly: from 25K to 200K rows/second, depending on the database and its config.
Here's a simple example of what the code to migrate from your CSV would look like:
    public static void main(String ... args){
        //Configure CSV input directory
        CsvDataStoreConfiguration csv = new CsvDataStoreConfiguration("csv");
        csv.addEntitiesFromDirectory(new File("/path/to/csv/dir/"), "ISO-8859-1");

        //should grab column names from CSV files
        csv.getDefaultEntityConfiguration().setHeaderExtractionEnabled(true);

        javax.sql.DataSource dataSource = connectToDatabaseAndGetDataSource(); //specific to your environment

        //Configures the target database
        JdbcDataStoreConfiguration database = new JdbcDataStoreConfiguration("database", dataSource);

        //Use only for postgres - their JDBC driver requires us to convert the input Strings from the CSV to the correct column types.
        database.getDefaultEntityConfiguration().setParameterConversionEnabled(true);

        DataIntegrationEngine engine = Univocity.getEngine(new EngineConfiguration(csv, database));

        //Creates a mapping between data stores "csv" and "database"
        DataStoreMapping mapping = engine.map(csv, database);

        // if names of CSV files and their columns match database tables and their columns
        // we can detect the mappings from one to the other automatically
        mapping.autodetectMappings();

        //loads the database.
        engine.executeCycle();
    }
To improve performance, the framework allows you to manage the database schema and perform operations such as dropping constraints and indexes, loading the data, and recreating them. Data and schema transformations are also very well supported if you need them.
Hope this helps.
Pentaho Data Integration (or a similar ETL tool) can handle importing the data into a SQL database and can do aggregation on the fly. PDI has a community edition and can be run stand-alone or via a Java API.
The majority of the examples and tutorials I have seen suggest executing SQL in the onUpgrade method to delete the table if it exists.
Why would you want to delete the table, given that this removes all the data? Would it not be better to just replace the old DB version with the new version?
This is something I could not understand, and nowhere online explained the reason.
Thanks
Sam
For simplicity. Dropping the old version and recreating a new version is simple and straightforward, though also destructive. In many cases, data loss like this is not a concern at development time.
Writing proper data migration code would be a topic for another example/tutorial, as it inherently involves at least two versions of the database schema and thus of the database helper. Including a migration example in a simple do-this-to-get-started tutorial would just add unnecessary complexity.
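For comparison, this is roughly what non-destructive migration code tends to look like. It is a minimal sketch in plain Java rather than the Android API: each schema version contributes one upgrade step, and an upgrade runs every step between the old and the new version in order. The table and column names are hypothetical; in an Android app the returned statements would be run with `db.execSQL(...)` inside `onUpgrade(db, oldVersion, newVersion)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: one upgrade step per schema version, applied in order. */
class SchemaMigrations {

    // Key = the version a step upgrades TO. All table/column names are hypothetical.
    private static final Map<Integer, String> STEPS = new TreeMap<>();
    static {
        STEPS.put(2, "ALTER TABLE notes ADD COLUMN created_at INTEGER");
        STEPS.put(3, "CREATE INDEX idx_notes_created_at ON notes(created_at)");
        STEPS.put(4, "ALTER TABLE notes ADD COLUMN archived INTEGER NOT NULL DEFAULT 0");
    }

    /** Returns the statements needed to go from oldVersion to newVersion. */
    static List<String> statementsFor(int oldVersion, int newVersion) {
        List<String> statements = new ArrayList<>();
        for (int version = oldVersion + 1; version <= newVersion; version++) {
            String sql = STEPS.get(version);
            if (sql == null) {
                throw new IllegalStateException("No migration step to version " + version);
            }
            statements.add(sql);
        }
        return statements;
    }

    public static void main(String[] args) {
        // A user who skipped version 2 still gets both steps, in order.
        for (String sql : statementsFor(1, 3)) {
            System.out.println(sql);
        }
    }
}
```

Existing rows survive because every step only alters the schema. The price is that each released version needs its own step, which is the extra complexity the tutorials avoid.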
Because they are tutorials: they assume you have no valuable data in the tables (or no data at all). In that case, the easiest way to upgrade a schema is to remove the old tables and create the new ones.
Don't worry about it; you are right.
We are using AppEngine and the datastore for our application where we have a moderately large table of information containing a list with entries.
I would like to summarize the list of entries in a report specifying how many times each one appears e.g. normally in SQL I would just use a select distinct for a column, then loop over every entry and just use select count(x) where value = valueOfEntry.
While the count portion is easily done, the distinct problem is "a problem". The only solution I could find remotely close to this is MapReduce and most of the samples are based on Python. There is this blog entry which is very helpful but somewhat outdated since it predated the reduce portion. Then there is the video here and a few more resources I was able to find.
However, it's really hard for me to understand how to build the summary table if I can't write to a separate entity and I don't have a reduce stage.
This seems like something trivial and simple to accomplish, yet it requires jumping through so many hoops. Is there no sample or existing reporting engine I can just plug in to AppEngine without all the friction?
I saw BigQuery, but it seems like a huge hassle to move the data out of app engine and into that store. I tried downloading the data as CSV but ran into many issues with that as well. It doesn't seem like a practical solution in the long run either.
There is a document explaining some of the concepts of MapReduce for Java. Although it is incomplete, it shares most of its architecture with the Python version. That document also has a pointer to a complete Java sample MapReduce app that reads from the datastore.
For writing the results, you specify an Output class. To write the results to a new datastore entity you would need to create your own Output Class. But you could also use the blobstore (see BlobFileOutput.java).
Another alternative is that whenever you write one of your entities, you also write/update another entry in an EntityDistinct data model.
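A minimal sketch of this write-time approach, with an in-memory map standing in for the datastore. `EntityDistinct` is the name used above; the class and method names are hypothetical, and in a real app the counter update would have to be a transactional datastore read-modify-write.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch: keep one counter per distinct value, updated on every write. */
class DistinctCounter {

    // Stand-in for the EntityDistinct kind: one record per distinct value.
    private final Map<String, Long> entityDistinct = new ConcurrentHashMap<>();

    /** Call this wherever an entry is written to the main entity. */
    void onEntryWritten(String value) {
        entityDistinct.merge(value, 1L, Long::sum);   // atomic increment per key
    }

    /** The report is a plain read of the counters, sorted by value. */
    Map<String, Long> report() {
        return new TreeMap<>(entityDistinct);
    }

    public static void main(String[] args) {
        DistinctCounter counter = new DistinctCounter();
        for (String value : new String[] {"red", "blue", "red", "green", "red"}) {
            counter.onEntryWritten(value);
        }
        System.out.println(counter.report());
    }
}
```

The report then needs neither a distinct query nor a reduce stage, at the cost of one extra write per entry.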
If you plan on running complex reports and you can anticipate all your needs now, I would suggest you look again at BigQuery. BigQuery is really powerful and works well on very massive datasets. You can inspect http://code.google.com/p/log2bq/ which is a Python project that loads the logs into BigQuery using MapReduce. Or you could have a cron job that every once in a while fetches all new entities and moves them into BigQuery.
As for the friction, remember that this is a NoSQL database, and as such it has some advantages, but some things are inherently different from SQL. Remember that you can always use Google Cloud SQL, given that your dataset is of limited size, but you would lose the replication and fault-tolerance capabilities.
I think this could help you: http://jjmpsj.blogspot.ro/2008/05/appengine-output-tricks-reporting.html?m=1
I'm developing a single-user application that needs a database. Most tables will have a reasonable amount of data, but a few may grow to a few million rows. None of my queries will return a large result set.
Does anyone know if HSQLDB can handle such a large number of rows?
From the official HSQLDB page:
The latest version 2.2.9, released in August, supports up to 270 billion rows of data in a single database.
That said, it all depends on how you configure the server: by default it uses memory tables, which will not fit in the standard heap memory at that size, so you'll have to use cached tables.
HSQLDB can handle millions of rows. You can try some of the test classes which can create large databases. For example:
http://hsqldb.org/web/hsqlPerformanceTests.html
Or here:
https://sourceforge.net/p/hsqldb/svn/HEAD/tree/base/trunk/src/org/hsqldb/test/
Check the TestCacheSize and TestStressInsert classes.
You should use the built-in backup capability and back up the database regularly.