I want to migrate some data from an existing database to Cassandra DB.
Post migration, I want to verify whether all the data were migrated successfully or not.
I was wondering whether the Cassandra driver for Java provides any built-in feature to verify this, so that I can reduce the unnecessary overhead of extra interaction with Cassandra?
It all depends on the type of database you are migrating from. You can check row by row in your previous database and then query Cassandra to see whether the rows are there. That would be the safest approach, imho.
Then you can do some very complex stuff, like Spark jobs that do the comparisons.
Or you can iterate over all the rows in Cassandra and check each one against the original database: fetch every row from Cassandra and look it up in the source (see the sketch below).
The list could go on and on. For details you would have to tell us more about the originating database, the data model in Cassandra, and what it means for a row to be "verified" ... other than that it's there.
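A minimal sketch of that last option, assuming the DataStax Java driver 4.x, a JDBC source database, and hypothetical table/column names:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Objects;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;

public class MigrationVerifier {
    public static void main(String[] args) throws Exception {
        try (CqlSession cassandra = CqlSession.builder().build();            // contact points come from application.conf
             Connection source = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/sourcedb", "user", "pass");   // hypothetical source DB
             PreparedStatement lookup = source.prepareStatement(
                     "SELECT name FROM customers WHERE id = ?")) {

            long mismatches = 0;
            // Iterate over every row in the Cassandra table and look it up in the source DB.
            for (Row row : cassandra.execute("SELECT id, name FROM ks.customers")) {
                lookup.setLong(1, row.getLong("id"));
                try (ResultSet rs = lookup.executeQuery()) {
                    if (!rs.next() || !Objects.equals(rs.getString("name"), row.getString("name"))) {
                        mismatches++;                                         // row missing or values differ
                    }
                }
            }
            System.out.println("Rows missing or mismatched: " + mismatches);
        }
    }
}

The reverse check (iterate the source and query Cassandra per key, as in the first option) follows the same pattern with the two sides swapped.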
I'm currently building a Spring Boot service with an H2 in-memory database.
This database acts as a cache for part of the data on a central DB2 database with a different schema.
When the Spring Boot service starts, it needs to populate the H2 database with the latest data from the central database.
How can I do this in the best way performance-wise?
I'm currently looking at creating a different data source in my service to first fetch the data and then save it to H2.
This doesn't feel like a good solution and it would take quite a long time to populate the database.
If you want to use H2 instead of your DB2 database ... and if you don't want to re-create the database each time you run your app ...
... then consider using an H2 file, instead of in-memory:
http://www.h2database.com/html/features.html
jdbc:h2:[file:][<path>]<databaseName>
jdbc:h2:~/test
jdbc:h2:file:/data/sample
jdbc:h2:file:C:/data/sample (Windows only)
You can "initialize" the file whenever you want (perhaps just once).
Performance should be excellent.
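For example, a minimal sketch of wiring a file-based H2 data source in a Spring Boot service (the file path and pool settings below are just examples; the same URL could equally go into spring.datasource.url):

import javax.sql.DataSource;
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class CacheDataSourceConfig {

    // File-based H2: the cache database survives restarts, so it only needs
    // to be (re)populated from DB2 when the data is stale, not on every boot.
    @Bean
    public DataSource h2FileDataSource() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:h2:file:./data/cache");   // example path
        config.setUsername("sa");
        config.setPassword("");
        return new HikariDataSource(config);
    }
}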
Per your update:
I still need to access the central db to get the latest data in the fastest way possible. The central db needs to stay for other services also accessing this
The "fastest" way to get the very latest data ... is to query the central db directly. Period - no ifs/ands/buts.
But, if for whatever reason, you want to "cache" a subset of "recent" data ... then H2 is an excellent choice.
And if you don't want to "rebuild" your H2 database each time the app starts, then save H2 to a file instead of making it in-memory.
The performance difference between H2:mem and H2:file is small, compared to the network overhead of querying your central db.
'Hope that helps...
There is a 6 GB table (Oracle 12) which changes approximately every second (deletes, inserts, updates).
I have a RESTful application which uses Hibernate to query this table with some filters. There are cases when a user wants 1 GB of data from the table as a CSV or XLSX report, and selecting and fetching that data from the database takes too long.
So I thought maybe it's better to load this whole DB table into a Java object (into the heap) and do the searching and exporting directly from that object.
But there is a problem:
my Java object has to stay synchronized with the Oracle DB table.
What is the best way to listen for / track Oracle table changes from Java?
One way I know of is the Oracle package UTL_HTTP, which can call a web service directly from Oracle and could be used to inform a service about data changes.
Can Hibernate also listen to Oracle table changes that are not initiated from the same service? I mean, is it possible for APP1 to listen to some Oracle table_1, and when APP2 changes some data in table_1, for APP1 to know this automatically? If it can, please provide some examples.
Also, I wonder what you think about importing the whole table's data into a service and filtering on that object. Will it be lightweight? (I know that ~6 GB of memory will always be in use.)
I would like to use Apache Ignite as failover read-only storage so my application will be able to access the most sensitive data if main storage (Oracle) is down.
So I need to
Start nodes
Create schema (execute DDL queries)
Load data from Oracle to Ignite
Seems like it's not the same as database caching and I don't need to use Cache. However, this page says that I need to implement a store to load a large amount of data from 3rd parties.
So, my questions are:
How to effectively transfer data from Oracle to Ignite? Data Streamers?
Who should initiate this transfer? The first started node? How do I do that? (Tutorials explain how to achieve this via clients; should I follow that advice?)
Actually, I think using a cache store without read/write-through would be a suitable option here. You can configure a CacheJdbcPojoStore, for example, and call IgniteCache#loadCache(...) on your cache once the cluster is up. More on this topic: https://apacheignite.readme.io/docs/3rd-party-store
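A rough sketch of that approach, assuming Ignite 2.x started from a Spring XML config that defines an Oracle DataSource bean named "oracleDataSource", plus a hypothetical PERSONS table; this is not a drop-in configuration:

import java.sql.Types;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.store.jdbc.CacheJdbcPojoStoreFactory;
import org.apache.ignite.cache.store.jdbc.JdbcType;
import org.apache.ignite.cache.store.jdbc.JdbcTypeField;
import org.apache.ignite.configuration.CacheConfiguration;

public class IgnitePreload {
    public static void main(String[] args) {
        CacheJdbcPojoStoreFactory<Long, Person> storeFactory = new CacheJdbcPojoStoreFactory<>();
        storeFactory.setDataSourceBean("oracleDataSource");      // hypothetical Spring bean name
        storeFactory.setTypes(new JdbcType()
                .setCacheName("persons")
                .setDatabaseTable("PERSONS")
                .setKeyType(Long.class)
                .setValueType(Person.class)
                .setKeyFields(new JdbcTypeField(Types.NUMERIC, "ID", Long.class, "id"))
                .setValueFields(new JdbcTypeField(Types.VARCHAR, "NAME", String.class, "name")));

        CacheConfiguration<Long, Person> ccfg = new CacheConfiguration<>("persons");
        ccfg.setCacheStoreFactory(storeFactory);
        ccfg.setReadThrough(false);       // failover copy only: no read/write-through
        ccfg.setWriteThrough(false);

        try (Ignite ignite = Ignition.start("ignite-config.xml")) {   // config file defines the data source bean
            IgniteCache<Long, Person> cache = ignite.getOrCreateCache(ccfg);
            cache.loadCache(null);        // pulls all rows from Oracle through the store
        }
    }

    /** Minimal value POJO matching the hypothetical PERSONS table. */
    public static class Person {
        private Long id;
        private String name;
        public Long getId() { return id; }
        public void setId(Long id) { this.id = id; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }
}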
If you don't want to use a cache store, then IgniteDataStreamer could be a good choice. This is the fastest way to upload a large amount of data to the cluster. Data loading is usually performed from a client node, once all server nodes are up and running.
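A minimal sketch of that path (cache name, table, and connection details are hypothetical; the "accounts" cache is assumed to already exist on the servers):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class OracleToIgniteLoader {
    public static void main(String[] args) throws Exception {
        Ignition.setClientMode(true);                            // load from a client node
        try (Ignite ignite = Ignition.start();
             IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("accounts");
             Connection con = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//db-host:1521/ORCL", "user", "pass");
             PreparedStatement ps = con.prepareStatement("SELECT id, name FROM accounts");
             ResultSet rs = ps.executeQuery()) {

            // Stream every Oracle row into the cluster; the streamer batches and
            // distributes entries to the right server nodes automatically.
            while (rs.next()) {
                streamer.addData(rs.getLong("id"), rs.getString("name"));
            }
            streamer.flush();                                    // push any buffered entries
        }
    }
}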
We have the following requirement:
We are storing data in Cassandra, and then we will be indexing the same data (or part of it) in Elasticsearch.
The issue is that if something goes wrong while inserting into Elasticsearch, the data inserted into Cassandra should be rolled back.
Basically, we want to have transactions over multiple NoSQL databases. Is there a way to do it in Java (Spring)?
There is no standard way of doing transactions across multiple NoSQL databases, nor is it supported by Spring.
Two approaches come to mind:
Use only one database to achieve it.
For example, in your case, you can let Cassandra handle the write (at the node, data-center, or all-data-centers level) and send the data to Elasticsearch asynchronously after the Cassandra write succeeds.
But if you must do it, I would recommend using Redis to acquire a global distributed transaction lock. Note that this is a very expensive operation and you need to handle failures and rollback yourself.
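A rough sketch of the lock-plus-compensation idea, assuming the Redisson client (lock names and timeouts are illustrative, and the rollback step still has to be written by hand):

import java.util.concurrent.TimeUnit;
import org.redisson.Redisson;
import org.redisson.api.RLock;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class DualWriteExample {
    public static void main(String[] args) throws Exception {
        Config config = new Config();
        config.useSingleServer().setAddress("redis://localhost:6379");
        RedissonClient redisson = Redisson.create(config);

        RLock lock = redisson.getLock("orders:42");                 // key of the record being written
        if (lock.tryLock(5, 30, TimeUnit.SECONDS)) {                // wait up to 5s, auto-release after 30s
            try {
                // 1. write the row to Cassandra
                // 2. index the document in Elasticsearch
                // 3. if step 2 fails, delete the Cassandra row (manual compensation)
            } finally {
                lock.unlock();
            }
        }
        redisson.shutdown();
    }
}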
What are the options for indexing large amounts of data from an Oracle DB into an Elasticsearch cluster? The requirement is to index 300 million records one time into multiple indexes, plus incremental updates of approximately 1 million changes every day.
I have tried the JDBC plugin for Elasticsearch in river and feeder modes; both seem to run inside, or require, a locally running Elasticsearch instance. Please let me know if there is a better option for running the Elasticsearch indexer as a standalone job (probably Java based). Any suggestions will be very helpful.
Thanks.
We use ES as a reporting db and when new records are written to SQL we take the following action to get them into ES:
Write the primary key into a queue (we use RabbitMQ)
A consumer picks up the primary key (when it has time), queries the relational DB to get the info it needs, and then writes the data into ES
This process works great because it handles both new data and old data. For old data just write a quick script to write 300M primary keys into rabbit and you're done!
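A hedged sketch of that queue-driven indexer; the queue, table, and index names are hypothetical, and it assumes the RabbitMQ Java client, a JDBC driver, and the Elasticsearch high-level REST client on the classpath:

import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.Map;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class PrimaryKeyIndexer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        com.rabbitmq.client.Connection mq = factory.newConnection();
        Channel channel = mq.createChannel();
        channel.queueDeclare("to-index", true, false, false, null);

        Connection db = DriverManager.getConnection(
                "jdbc:oracle:thin:@//db-host:1521/ORCL", "user", "pass");
        RestHighLevelClient es = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));

        // For each primary key pulled off the queue: read the row from Oracle and index it into ES.
        DeliverCallback onMessage = (tag, delivery) -> {
            String pk = new String(delivery.getBody(), StandardCharsets.UTF_8);
            try (PreparedStatement ps = db.prepareStatement("SELECT id, name FROM customers WHERE id = ?")) {
                ps.setString(1, pk);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        Map<String, Object> doc = new HashMap<>();
                        doc.put("name", rs.getString("name"));
                        es.index(new IndexRequest("customers").id(pk).source(doc), RequestOptions.DEFAULT);
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();   // in real code: nack and requeue the message
            }
        };
        channel.basicConsume("to-index", true, onMessage, tag -> { });
    }
}

The one-time backfill is then just the "quick script" above: push the 300M primary keys onto the same queue and let the consumer work through them.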
There are many integration options - I've listed a few to give you some ideas; the solution is really going to depend on your specific resources and requirements though.
Oracle GoldenGate can look at the Oracle DB transaction logs and feed changes to ES in real time.
ETL, for example Oracle Data Integrator, could run on a schedule and pull data from your DB, transform it, and send it to ES.
Create triggers in the Oracle DB so that data updates can be written to ES using a stored procedure. Or use the trigger to write flags to a "changes" table that some external process (e.g. a Java application) monitors and uses to extract data from the Oracle DB (see the sketch after this list).
Get the application that writes to the Oracle DB to also feed ES. Ideally your application and Oracle DB should be loosely coupled - do you have an integration platform that can feed the messages to both ES and Oracle?
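A small sketch of the "changes table" polling approach from the trigger option above, assuming a hypothetical CHANGES table with ID, TABLE_NAME, ROW_ID and PROCESSED columns; the actual extraction and indexing into ES is left as a comment:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ChangesTablePoller {
    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                "jdbc:oracle:thin:@//db-host:1521/ORCL", "user", "pass")) {

            while (true) {
                try (PreparedStatement select = db.prepareStatement(
                        "SELECT id, table_name, row_id FROM changes WHERE processed = 0");
                     ResultSet rs = select.executeQuery()) {

                    while (rs.next()) {
                        long changeId = rs.getLong("id");
                        String table = rs.getString("table_name");
                        String rowId = rs.getString("row_id");

                        // extract the changed row from `table` and index it into ES here

                        try (PreparedStatement mark = db.prepareStatement(
                                "UPDATE changes SET processed = 1 WHERE id = ?")) {
                            mark.setLong(1, changeId);
                            mark.executeUpdate();   // mark the change as handled
                        }
                    }
                }
                Thread.sleep(5_000);                // poll every few seconds
            }
        }
    }
}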