Java Recon Job - Fastest and generic solution

Currently, I am running my application on RDS and am in the process of moving to MongoDB. For now, we have a sync job that copies data from Oracle to Mongo whenever a row is added, modified, or deleted.
Writes happen only on Oracle.
I am planning a recon job that compares the source and target data. I am aiming for a full recon, which fetches all the data from Oracle and then compares it with MongoDB to find the discrepancies.
I am planning the approach below.
Note: the Oracle DB size could be in terabytes.
1) Get the first thousand rows from Oracle table A. (Simple JDBC ResultSet approach)
2) For each entry, create a map of key values. (Map)
3) Get the corresponding data from MongoDB and convert it to the Oracle format.
4) For each entry, create a map of key values.
5) Compare these two maps to find out whether they are the same. (Oracle Map equals MongoDB Map)
6) Repeat the same for the next rows ....
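Steps 2 to 5 above can be sketched roughly as follows. This is only an illustration of the map comparison, not the actual job: the class name `RowComparator` and the method `diff` are made up for this example, and it assumes that both rows have already been converted to maps with the same column names and comparable value types. A column that is missing on one side is treated like a null value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeSet;

/** Hypothetical helper: compares one Oracle row with its MongoDB counterpart. */
final class RowComparator {

    /**
     * Returns the names of the columns whose values differ between the two rows.
     * A column absent from one map is compared as null.
     */
    static List<String> diff(Map<String, Object> oracleRow, Map<String, Object> mongoRow) {
        // Union of the column names, sorted so that the report is stable.
        TreeSet<String> columns = new TreeSet<>(oracleRow.keySet());
        columns.addAll(mongoRow.keySet());

        List<String> mismatches = new ArrayList<>();
        for (String column : columns) {
            if (!Objects.equals(oracleRow.get(column), mongoRow.get(column))) {
                mismatches.add(column);
            }
        }
        return mismatches;
    }
}
```

An empty result means the two rows match; otherwise the list names the columns to report as discrepancies.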
But this approach takes a long time even when I use multithreading. I do not know much about Big Data, but I am open to new ideas.
Is there any other approach or technology that can be used here for parallel processing?
Note: some tables are mapped straightforwardly between Oracle and Mongo. A few tables are in denormalized form in Mongo.
Thanks,

Related

How to optimize one big insert with hibernate

For my website, I'm creating a book database. I have a catalog with a root node; each node has subnodes, each subnode has documents, each document has versions, and each version is made of several paragraphs.
In order to create this database as fast as possible, I first create the entire tree model in memory, and then I call session.save(rootNode).
This single save populates my entire database (at the end, when I run mysqldump on the database, it weighs 1 GB).
The save costs a lot (more than an hour), and since the database grows with new books and new versions of existing books, it costs more and more. I would like to optimize this save.
I've tried to increase the batch_size, but it changes nothing since it's a single save. When I mysqldump a script and insert it back into MySQL, the operation takes 2 minutes or less.
And when I run "htop" on the Ubuntu machine, I can see that MySQL is only using 2 or 3 % CPU, which means that it's Hibernate that is slow.
If someone could give me possible techniques that I could try, or possible leads, it would be great... I already know some of the reasons why it takes time. If someone wants to discuss it with me, thanks for the help.
Here are some of my problems (I think): for example, I have self-assigned ids for most of my entities. Because of that, Hibernate checks each time whether the row exists before it saves it. I don't need this because the batch I'm executing runs only once, when I create the database from scratch. The best would be to tell Hibernate to ignore the primary key rules (like mysqldump does) and re-enable the key checking once the database has been created. It's just a one-shot batch to initialize my database.
The second problem is again about the foreign keys. Hibernate inserts rows with null values, then makes an update in order to make the foreign keys work.
About using another technology: I would like to make this batch work with Hibernate because my whole website works very well with Hibernate, and if it's Hibernate that creates the database, I'm sure the naming rules and all the foreign keys will be created correctly.
Finally, it's a read-only database. (I have a user database, which uses InnoDB, where I do updates and inserts while my website is running, but the document database is read-only and MyISAM.)
Here is an example of what I'm doing:
TreeNode rootNode = new TreeNode();
recursiveLoadSubNodes(rootNode); // This method creates my big tree, in memory only.
hibernateSession.beginTransaction();
hibernateSession.save(rootNode); // takes more than an hour to save 1 GB of data: hundreds of sub-TreeNodes, thousands of documents, tens of thousands of paragraphs.
hibernateSession.getTransaction().commit();
It's a little hard to guess what could be the problem here but I could think of 3 things:
Increasing batch_size alone might not help because, depending on your model, inserts might be interleaved (i.e. A B A B ...). You can allow Hibernate to reorder inserts and updates so that they can be batched (i.e. A A ... B B ...). Depending on your model this might not work because the inserts might not be batchable. The necessary properties would be hibernate.order_inserts and hibernate.order_updates, and a blog post that describes the situation can be found here: https://vladmihalcea.com/how-to-batch-insert-and-update-statements-with-hibernate/
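As a rough sketch of what those settings look like when assembled in code: the property names are Hibernate's own, while the class `BatchSettings`, the method name, and the batch size you pass in are only assumptions for illustration. The resulting Properties object would still have to be handed to however you bootstrap Hibernate (or the same keys put into your configuration file).

```java
import java.util.Properties;

/** Hypothetical helper that collects the batching-related Hibernate settings. */
final class BatchSettings {

    /** Builds the properties that enable JDBC batching plus statement reordering. */
    static Properties batchingProperties(int batchSize) {
        Properties props = new Properties();
        // Number of statements Hibernate groups into one JDBC batch.
        props.setProperty("hibernate.jdbc.batch_size", Integer.toString(batchSize));
        // Let Hibernate sort statements by entity type so that A B A B becomes A A B B.
        props.setProperty("hibernate.order_inserts", "true");
        props.setProperty("hibernate.order_updates", "true");
        return props;
    }
}
```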
If the entities don't already exist (which seems to be the case) then the problem might be the first level cache. This cache will cause Hibernate to get slower and slower because each time it wants to flush changes it will check all entries in the cache by iterating over them and calling equals() (or something similar). As you can see, that will take longer with each new entity that's created. To fix that you could either try to disable the first level cache (I'd have to look up whether that's possible for write operations and how this is done, or you do that :) ) or try to keep the cache small, e.g. by inserting the books yourself and evicting each book from the first level cache after the insert (you could also go deeper and do that on the document or paragraph level).
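A minimal sketch of the "keep the cache small" idea, assuming the books can be saved one by one instead of through a single cascading save of the root. To keep the example runnable without Hibernate on the classpath, `PersistenceSession` is a made-up stand-in that mirrors only three methods of Hibernate's Session (save, flush, clear); `ChunkedSaver` and the chunk size are likewise hypothetical.

```java
import java.util.List;

/** Hypothetical stand-in for the Hibernate Session methods used below. */
interface PersistenceSession {
    void save(Object entity);
    void flush();
    void clear();
}

final class ChunkedSaver {

    /**
     * Saves the entities one at a time and flushes and clears the session after
     * every chunkSize saves, so the first level cache never grows beyond one chunk.
     * Returns the number of flush/clear cycles that were performed.
     */
    static int saveInChunks(PersistenceSession session, List<?> entities, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        int cycles = 0;
        int pending = 0;
        for (Object entity : entities) {
            session.save(entity);
            pending++;
            if (pending == chunkSize) {
                session.flush(); // push the pending inserts to the database
                session.clear(); // drop the saved entities from the first level cache
                pending = 0;
                cycles++;
            }
        }
        if (pending > 0) { // last, partially filled chunk
            session.flush();
            session.clear();
            cycles++;
        }
        return cycles;
    }
}
```

Note that clearing the session detaches everything saved so far, so this only fits if later books do not need the earlier ones as managed entities.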
It might not actually be Hibernate (or at least not alone) but your DB as well. Note that restoring dumps often removes/disables constraint checks and indices along with other optimizations so comparing that with Hibernate isn't that useful. What you'd need to do is create a bunch of insert statements and then just execute those - ideally via a JDBC batch - on an empty database but with all constraints and indices enabled. That would provide a more accurate benchmark.
Assuming that comparison shows that the plain SQL insert isn't that much faster then you could decide to either keep what you have so far or refactor your batch insert to temporarily disable (or remove and re-create) constraints and indices.
Alternatively you could try not to use Hibernate at all or change your model - if that's possible given your requirements which I don't know. That means you could try to generate and execute the SQL queries yourself, use a NoSQL database or NoSQL storage in a SQL database that supports it - like Postgres.
We're doing something similar, i.e. we have Hibernate entities that contain some complex data which is stored in a JSONB column. Hibernate can read and write that column via a custom usertype but it can't filter (Postgres would support that but we didn't manage to enable the necessary syntax in Hibernate).

How to compare Hive and Cassandra data in Java when there are around 1 million records

I am using Hive and Cassandra, table structure and data is the same in both Hive and Cassandra. There will be almost 1 million records. My requirement is that I need to check if each and every row has the same data in both Cassandra and Hive.
Can I compare two resultset objects directly? (one resultset with Cassandra data and another from Hive)
If we are iterating over resultset object, can resultset object hold 1 million records at a time? Will there be any performance issue?
What do we need to take care of when dealing with such huge data?
Well, some of the initial conditions seem strange to me.
First, 1M records is not a big deal for a modern RDBMS, especially when we don't need real-time query responses.
Second, the fact that the Hive and Cassandra table structures are the same. Cassandra's paradigm is query-first modeling, and it is good for scenarios other than the ones Hive suits.
However, to answer your questions:
1. Yes. You can write a Java program (as I saw Java in the tag list) that connects to both Hive and Cassandra via JDBC and compares the resultset items one by one.
But you need to be sure that the order of items is the same for Hive and Cassandra. That could be done via the Hive queries, as there are not too many ways to do ordering in Cassandra.
2. A resultset is just a cursor. It doesn't gather the whole data in memory, just some batch of records (the batch size is configurable).
3. 1M records is not huge data; billions of records would be. But I cannot provide you with a silver bullet that answers all questions about dealing with huge data, as each case is specific.
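The one-by-one comparison from point 1 could look roughly like the sketch below. It is not tied to any driver: each Iterator stands in for a cursor (for instance a ResultSet wrapped in an iterator) that yields key/value pairs already sorted by the same key on both sides, and values are assumed to be non-null. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

final class OrderedCompare {

    /**
     * Walks two key-sorted streams in lockstep and returns the keys that are
     * missing on either side or that carry different values.
     */
    static List<String> mismatchedKeys(Iterator<Map.Entry<String, String>> left,
                                       Iterator<Map.Entry<String, String>> right) {
        List<String> mismatches = new ArrayList<>();
        Map.Entry<String, String> l = next(left);
        Map.Entry<String, String> r = next(right);
        while (l != null || r != null) {
            // An exhausted side compares as "greater" so the other side drains.
            int cmp = (l == null) ? 1 : (r == null) ? -1 : l.getKey().compareTo(r.getKey());
            if (cmp == 0) {
                if (!l.getValue().equals(r.getValue())) {
                    mismatches.add(l.getKey()); // same key, different data
                }
                l = next(left);
                r = next(right);
            } else if (cmp < 0) {
                mismatches.add(l.getKey());     // present only on the left
                l = next(left);
            } else {
                mismatches.add(r.getKey());     // present only on the right
                r = next(right);
            }
        }
        return mismatches;
    }

    private static <T> T next(Iterator<T> it) {
        return it.hasNext() ? it.next() : null;
    }
}
```

Because only the current row of each side is held at any time, memory use stays flat no matter how many records are compared.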
Anyway, for your case, I have some concerns:
I have no details of the latest Cassandra JDBC driver's features and limitations.
You have not provided details of the table structure or of future data growth and complexity. I mean that now you have 1M rows with 10 columns in a single database, but later you could have 100M rows in a cluster of 10 Cassandra nodes.
If that's not a problem, then you can try your solution. Otherwise, for simplicity of comparison, I'd suggest doing the following:
1. Export Cassandra's data to Hive.
2. Compare data in two Hive tables.
I believe that would be straightforward and more robust.
But all of the above doesn't address the selection of the tools (Hive and Cassandra) for your task. You could read more about typical Cassandra use cases to be sure you've made the right choice.

Informix, MySQL and Oracle blob contains

We have an application that runs with any of IBM Informix, MySQL and Oracle, and we are using Java with Hibernate to connect to the database. We will store XML, CSV and other text-based files inside the database (clob column). The entities in Java are byte[] objects.
One feature request to the application is now to "grep" content inside the data. So I need to find all files with a specific content.
On regular char/varchar fields I can use like '%xyz%', but this is not working on byte[] / blobs.
The first approach was to load each entity, convert the byte[] into a string and use the contains method in Java. If the user enters any filter parameters on other (non-clob) columns, I will apply those filters before testing the clob in order to reduce the number of blobs I have to scan.
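In code, that first approach is roughly the following. The class name `ContentFilter` is made up, and UTF-8 is an assumption about how the files were encoded when they were stored.

```java
import java.nio.charset.StandardCharsets;

final class ContentFilter {

    /** Decodes the stored bytes (assumed to be UTF-8 text) and looks for the search term. */
    static boolean contains(byte[] content, String needle) {
        return new String(content, StandardCharsets.UTF_8).contains(needle);
    }
}
```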
That worked quite well for 100 files (clobs) and as long as the application and database are on the same server. But I think it will get really slow if I have 1.000.000 files inside the database and the database is not always in the same network. So I think that is not a good idea.
My next thought was creating a database procedure. But I am not sure whether this is possible for all of Informix, MySQL and Oracle.
The last, but not favored, method is to store the content of the data somewhere other than a clob. Maybe I can use a different datatype for that?
Does anyone have a good idea how to implement that? I need a solution for all three DBMS. The application knows which kind of DBMS it is connected to, so it would be okay if I have three different solutions (one for each DBMS).
I am completely open to changing what kind of datatype I use (BLOB, CLOB ...) — I can modify that as I want.
Note: the clobs will range from about 5 KiB to about 500 KiB, with a maximum of 1 MiB.
Look into Apache Lucene or another text indexing library.
https://en.wikipedia.org/wiki/Lucene
http://en.wikipedia.org/wiki/Full_text_search
If you go with a DB-specific solution like Oracle Text Search you will have to implement a custom solution for each database. I know from experience that Oracle Text Search takes significant time to learn and involves a lot of tweaking to get just right.
Also, if you use a DB solution you would receive different results in each DB even if the data sets were the same (each DB would have its own methods of indexing and retrieving the data).
By going with a third-party solution like Lucene, you only have to learn one solution, and results will be consistent regardless of the DB.

Best way to sort the data : DB Query or in Application Code

I have a MySQL table with some data (more than a million rows). I have a requirement to sort the data based on the criteria below:
1) Newest
2) Oldest
3) Top rated
4) Least rated
What is the recommended solution for developing the sort functionality?
1) For every sort request, execute a DB query with the required joins and ORDER BY conditions and return the sorted data.
2) Get all the data (unsorted) from the table and put it in a cache. Write custom comparators (Java) to sort the data.
I am leaning towards #2, as the load on the DB occurs only once. Moreover, application code is better than a DB query.
Please share your thoughts....
Thanks,
Karthik
Do as much in the database as you can. Note that if you have 1,000,000 rows, returning all million is nearly useless. Are you going to display this on a web site? I think not. Do you really care about the 500,000th least popular post? Again, I think not.
So do the sorts in the database and return the top 100, 500, or 1000 rows.
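As a sketch of what "sort in the database and return the top rows" can look like from the Java side: the four sort options map to four ORDER BY clauses, and the row limit is bound as a statement parameter. The table name `posts` and the columns `created_at` and `rating` are invented for this example and need to be replaced with the real schema.

```java
/** The four sort options from the question, each mapped to an ORDER BY clause. */
enum SortOrder {
    NEWEST("created_at DESC"),
    OLDEST("created_at ASC"),
    TOP_RATED("rating DESC"),
    LEAST_RATED("rating ASC");

    private final String orderBy;

    SortOrder(String orderBy) {
        this.orderBy = orderBy;
    }

    /**
     * Builds the query text for a PreparedStatement. The ORDER BY clause comes
     * from this fixed enum (never from user input); the limit is the "?" parameter.
     */
    String toQuery() {
        return "SELECT id, title, rating, created_at FROM posts ORDER BY "
                + orderBy + " LIMIT ?";
    }
}
```

With an index on each sort column, the database can often return the first rows without sorting the whole table.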
It's much faster to do it in the database:
1) the database is optimized for I/O operations, and can use indices and other DB optimizations to improve the response time
2) taking the data from the database to the application will load all of the data into memory. The app will have to go through all the data to reorder it, without optimized algorithms
3) the database only takes the minimum necessary data into memory, which can be much less than all the data which has to be moved to Java
4) you can always create extra indices on the database to improve the query performance.
I would say that the operation on the DB will always be faster. You should ensure that caching on the DB is ON and working properly. Ensure that you are not using now() in your query because it will disable the MySQL cache. Take a look at how the MySQL query cache works. Basically, a query is cached based on its string, so if the query string differs every time you fetch, no cache is used.
AFAIK usually it should run faster if you let the DB sort your data.
And regarding code on application level vs db level I would agree in the case of stored procedures but sorting in SELECTs is fine IMHO.
If you want to show the data to the user also consider paging (in which case you're better off with sorting on the db level anyway).
Fetching a million rows from the database sounds like a terrible idea. It will generate a lot of network traffic and require quite some time to transfer all the data, not to mention the amount of memory you would need to allocate in your application for storing a million objects.
So if you can fetch only a subset with a query, do that. Overall, do as much filtering as you can in the database.
And I do not see any problem with ordering in a single query. You can always use UNION if you can't do it as one SELECT.
You do not have four tasks, you have two:
sort newest IS EQUAL TO sort oldest
AND
sort top rated IS EQUAL TO sort least rated.
So you need to make two calls to the DB. Yes, sort in the DB. Then, instead of calling to sort every time, do this:
1] track the timestamp of the latest record in the db
2] before calling to sort and retrieve entire list, check if date has changed
3] if date has not changed, use the list you have in memory
4] if date has changed, update the list
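A small sketch of steps 1] to 4], with the two database calls abstracted as suppliers so that the example stays self-contained. `TimestampCache` and both supplier names are hypothetical; in practice the first supplier might run something like a MAX() query on a timestamp column and the second one the sorted SELECT.

```java
import java.util.List;
import java.util.Objects;
import java.util.function.Supplier;

/** Keeps a sorted list in memory and reloads it only when the latest timestamp changes. */
final class TimestampCache<T> {
    private final Supplier<Long> latestTimestamp; // cheap call: timestamp of the newest record
    private final Supplier<List<T>> loader;       // expensive call: sorted query
    private Long cachedStamp;
    private List<T> cachedRows;

    TimestampCache(Supplier<Long> latestTimestamp, Supplier<List<T>> loader) {
        this.latestTimestamp = latestTimestamp;
        this.loader = loader;
    }

    synchronized List<T> get() {
        Long current = latestTimestamp.get();
        if (cachedRows == null || !Objects.equals(current, cachedStamp)) {
            cachedRows = loader.get(); // timestamp changed (or first call): refresh the list
            cachedStamp = current;
        }
        return cachedRows;             // otherwise reuse the list held in memory
    }
}
```

One limitation of this scheme is that it only notices changes that move the latest timestamp, so deletes or rating updates need to touch that timestamp as well.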
I know this is an old thread, but it comes up in my search, so I'd like to post my opinion.
I'm a bit old school, but for that many rows, I would consider dumping the data from your database (each RDBMS has its own method; for MySQL it looks like the mysqldump command).
You can then process this with sorting algorithms or tools that are available in your Java libraries or operating system.
Be careful about the work you're asking your database to do. Remember that it has to be available to service other requests. Don't "bring it to its knees" servicing only one request, unless it's a nightly batch cycle type of scenario and you're certain it won't be asked to do anything else.

Distributed multimap based on HBase and Hadoop MapReduce

I'm sorry that I haven't deeply understood HBase and Hadoop MapReduce, but I think you can help me find a way of using them, or maybe you could propose the frameworks I need.
Part I
There is a 1st stream of records that I have to store somewhere. They should be accessible by keys derived from them. Several records can have the same key. There are quite a lot of them. I have to delete old records after a timeout.
There is also a 2nd stream of records, which is very intensive too. For each record (argument-record) I need to: get all records from the 1st stream with that argument-record's key, find the first corresponding record, delete it from the 1st stream's storage, and return the result (res1) of merging these two records.
Part II
The 3rd stream of records is like the 1st. Records should be accessible by keys (different from those of Part I). Several records, as usual, will have the same key. There are not as many of them as in the 1st stream. I have to delete old records after a timeout.
For each res1 (argument-record) I have to: get all records from the 3rd stream with that record's other key, map these records with res1 as a parameter, and reduce them into a result. The 3rd stream's records should stay unmodified in storage.
Records with the same key should preferably be stored on the same node, and procedures that get records by key and perform some actions based on the given argument-record should preferably run on the node where those records are.
Are HBase and Hadoop MapReduce applicable in my case? And what should such an app look like (basic idea)? If the answer is no, are there frameworks to build such an app?
Please ask questions if you couldn't get what I want.
I will address the storage backend technologies. The front end accepting records can be stateless and therefore trivially scalable.
We have streams of records and we want to join them on the fly. Some of the records should be persisted while some (as far as I understood, the 1st stream) are transient.
If we take scalability and persistence out of the equation, it can be implemented in a single Java process using a HashMap for randomly accessed data and a TreeMap for data we want to store sorted.
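To make the single-process idea concrete, here is a rough sketch of the 1st stream's storage as a HashMap-based multimap with a timeout. Everything in it is an assumption for illustration: the class name, the string keys, and passing the current time in explicitly (which keeps the behavior deterministic). It is not thread-safe and expires records lazily, only when their key is accessed.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

/** Several records per key, oldest first; records older than ttlMillis are dropped. */
final class ExpiringMultimap<V> {

    private static final class Timed<T> {
        final T value;
        final long storedAt;
        Timed(T value, long storedAt) {
            this.value = value;
            this.storedAt = storedAt;
        }
    }

    private final Map<String, Deque<Timed<V>>> byKey = new HashMap<>();
    private final long ttlMillis;

    ExpiringMultimap(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Stores a record from the 1st stream under its key. */
    void put(String key, V value, long now) {
        byKey.computeIfAbsent(key, k -> new ArrayDeque<>()).addLast(new Timed<>(value, now));
    }

    /** Removes and returns the oldest non-expired record under key that satisfies matcher. */
    Optional<V> takeFirst(String key, Predicate<V> matcher, long now) {
        Deque<Timed<V>> records = byKey.get(key);
        if (records == null) {
            return Optional.empty();
        }
        Optional<V> found = Optional.empty();
        for (Iterator<Timed<V>> it = records.iterator(); it.hasNext(); ) {
            Timed<V> candidate = it.next();
            if (now - candidate.storedAt >= ttlMillis) {
                it.remove();               // timed out: delete the old record
            } else if (matcher.test(candidate.value)) {
                it.remove();               // matched: delete it from the storage
                found = Optional.of(candidate.value);
                break;
            }
        }
        if (records.isEmpty()) {
            byKey.remove(key);
        }
        return found;
    }
}
```

A record from the 2nd stream would call takeFirst with its own key and then merge itself with the returned record to produce res1.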
Now let's see how this can be mapped onto NoSQL technologies to gain the scalability and performance we need.
HBase is a distributed sorted map, so it can be a good candidate for stream 2. If we use our key as the HBase table key, we will gain data locality for the records with the same key.
MapReduce on top of HBase is also available.
Stream 1 looks like transient, randomly accessed data. I think it does not make sense to pay the price of persistence for those records, so a distributed in-memory hashtable should do. For example: http://memcached.org/ The stored element there would probably be a list of records with the same key.
I am still not 100% sure about the 3rd stream's requirements, but the need for a secondary index (if it is known beforehand) can be met at the application level with another distributed map.
In a nutshell, my suggestion is to pick HBase for the data you want to persist and store sorted, and to consider some more lightweight solutions for the transient (but still considerably big) data.
