Write to Hive in Java MapReduce Job

I am currently working on a Java MapReduce job that should output data to a bucketed Hive table.
I can think of two approaches:
The first is to write directly to Hive via HCatalog. The problem is that this approach does not support writing to a bucketed Hive table, so with a bucketed table I would first have to write to a non-bucketed table and then copy the data into the bucketed one.
The second option is to write the output to a text file and load this data into Hive afterwards.
What is the best practice here?
Which approach performs better with a huge amount of data (with respect to memory and time taken)?
Which approach would be the better one if I could also use non-bucketed Hive tables?
Thanks a lot!

For non-bucketed tables, you can store your MapReduce output in the table storage location. Then you'd only need to run MSCK REPAIR TABLE to update the metadata with the new partitions.
Hive's load command actually just copies the data to the table storage location.
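To make that concrete, here is a minimal sketch of the non-bucketed case: the job writes its output under a partition directory of the table's location, and the driver then registers the partition. The warehouse path, database/table names and HiveServer2 URL below are placeholders, not anything from your setup.

// Minimal sketch: write job output under a partition directory of a
// non-bucketed table, then let the metastore pick up the new partition.
// All paths, names and URLs are made-up examples.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WriteToTableLocation {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "write-to-hive-location");
        job.setJarByClass(WriteToTableLocation.class);
        // ... set mapper/reducer, key/value classes and the input path here;
        // the output format and delimiter must match the table definition ...

        // Write directly into a new partition directory under the table's location
        FileOutputFormat.setOutputPath(job,
                new Path("hdfs:///user/hive/warehouse/mydb.db/my_table/dt=2016-01-01"));

        if (!job.waitForCompletion(true)) {
            System.exit(1);
        }

        // Register the new partition in the metastore
        try (Connection hive = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/mydb", "user", "");
             Statement stmt = hive.createStatement()) {
            stmt.execute("MSCK REPAIR TABLE my_table");
        }
    }
}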
Also, from the Hive documentation:
The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table – only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query.
So you'd need to tweak your MapReduce output to fit these constraints.
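Roughly, that means one reducer per bucket and a partitioner that routes records by the bucket column. A sketch is below; note that the generic (hashCode & Integer.MAX_VALUE) % numBuckets rule shown here is only an approximation of Hive's bucket hashing, which depends on the column type and Hive version (Hive 3 bucketing v2 uses Murmur3), so verify it against your table before relying on it.

// Sketch of a bucket-style partitioner: all records with the same bucket
// column value go to the same reducer, and reducer N writes bucket N.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class BucketPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text bucketColumnValue, Text value, int numReduceTasks) {
        // Approximation of Hive's bucket function; confirm for your Hive version
        return (bucketColumnValue.toString().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// In the driver, match the reducer count to the table's bucket count:
//   int numBuckets = 32;                      // CLUSTERED BY (...) INTO 32 BUCKETS
//   job.setNumReduceTasks(numBuckets);
//   job.setPartitionerClass(BucketPartitioner.class);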

Related

Comparing all table data of two different keyspaces (Database: Cassandra) with java code

1) I have to compare data between tables from 2 different Cassandra keyspaces.
The tables hold a huge amount of data.
How do I make the connections and compare each row?
2) I have to write a script (in Java) which will run automatically at every EOD and compare the data of both schemas.
Would an Excel export and comparison work, or do I have to use Apache Spark?
Kindly suggest approaches, or a sample program if anybody has tried this.
The output should give us the rows which do not match.

How to compare Hive and Cassandra data in Java when there are around 1 million records

I am using Hive and Cassandra; the table structure and data are the same in both. There will be almost 1 million records. My requirement is to check that each and every row has the same data in both Cassandra and Hive.
Can I compare two ResultSet objects directly (one ResultSet with the Cassandra data and another from Hive)?
If we iterate over a ResultSet object, can it hold 1 million records at a time? Will there be any performance issues?
What do we need to take care of when dealing with such huge data?
Well, some of the initial conditions seem strange to me.
First, 1M records is not a big deal for a modern RDBMS, especially when we don't need real-time query responses.
Second, the fact that the Hive and Cassandra table structures are the same: Cassandra's paradigm is query-first modeling, and it is a good fit for scenarios other than Hive's.
However, to your questions:
1. Yes. You can write a Java program (as I saw Java in the tag list) that connects to both Hive and Cassandra via JDBC and compares the ResultSet items one by one (a rough sketch follows after this list).
But you need to be sure that the order of the items is the same for Hive and Cassandra. That should be arranged on the Hive side, as there are not too many ways to control ordering in Cassandra.
2. A ResultSet is just a cursor. It doesn't gather the whole dataset in memory, just a batch of records (the batch size is configurable).
3. 1M records is not huge data; billions of records would be. But I can't give you a silver bullet that answers every question about huge data, as each case is specific.
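A rough sketch of points 1 and 2: stream both result sets in the same order and compare row by row. The URLs, table and columns are placeholders; it assumes some working JDBC driver for the Cassandra side (the URL format depends on the driver you pick) and that both queries return rows in the same, fully deterministic order.

// Sketch: row-by-row comparison of two JDBC result sets with a bounded fetch size
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RowByRowCompare {
    public static void main(String[] args) throws Exception {
        String query = "SELECT id, col1, col2 FROM my_table ORDER BY id";

        try (Connection hive = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/mydb");
             Connection cass = DriverManager.getConnection("jdbc:cassandra://cassandrahost:9042/myks");
             Statement hs = hive.createStatement();
             Statement cs = cass.createStatement()) {

            hs.setFetchSize(10000);  // keep only a batch of rows in memory at a time
            cs.setFetchSize(10000);

            try (ResultSet hr = hs.executeQuery(query);
                 ResultSet cr = cs.executeQuery(query)) {
                long row = 0;
                // Non-short-circuit & so both cursors advance together; rows left
                // over on either side after the loop should also be reported.
                while (hr.next() & cr.next()) {
                    row++;
                    for (int i = 1; i <= 3; i++) {
                        String hv = hr.getString(i);
                        String cv = cr.getString(i);
                        if (hv == null ? cv != null : !hv.equals(cv)) {
                            System.out.println("Mismatch at row " + row + ", column " + i);
                        }
                    }
                }
            }
        }
    }
}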
Anyway, for your case, I have some concerns:
I don't have the details of the latest Cassandra JDBC drivers' features and limitations.
You have not provided details of the table structure or of future data growth and complexity. I mean that now you have 1M rows with 10 columns in a single database, but later you could have 100M rows across a cluster of 10 Cassandra nodes.
If that's not a problem, then you can try your solution. Otherwise, for simplicity of comparison, I'd suggest the following:
1. Export Cassandra's data to Hive.
2. Compare the data of the two Hive tables (a sketch of such a comparison query follows below).
I believe that would be more straightforward and more robust.
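For step 2, once both datasets sit in Hive, the comparison can be a single query run over Hive JDBC. A rough sketch, with invented table and column names:

// Sketch: find non-matching rows between two Hive tables with a FULL OUTER JOIN
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveTableDiff {
    public static void main(String[] args) throws Exception {
        // Add NULL-safe checks for nullable non-key columns as needed
        String diffQuery =
            "SELECT h.id AS hive_id, c.id AS cass_id "
          + "FROM hive_copy h FULL OUTER JOIN cassandra_copy c ON h.id = c.id "
          + "WHERE h.id IS NULL OR c.id IS NULL "
          + "   OR h.col1 <> c.col1 OR h.col2 <> c.col2";

        try (Connection hive = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/mydb");
             Statement stmt = hive.createStatement();
             ResultSet rs = stmt.executeQuery(diffQuery)) {
            while (rs.next()) {
                System.out.println("Mismatch: hive_id=" + rs.getString(1)
                        + ", cass_id=" + rs.getString(2));
            }
        }
    }
}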
But all of the above doesn't address the choice of tools (Hive and Cassandra) for your task. You can find more about typical Cassandra use cases here to be sure you've made the right choice.

Java Hadoop inserting and query large data in JSON Format

I have a requirement to build a system with Java and Hadoop to handle large-scale data processing (in JSON format). The system I'm going to create includes inserting data into file storage (whether HDFS or a database) and querying the processed data.
I have a big picture of using Hadoop MapReduce to query the data that the user wants.
But one thing that confuses me is how I should insert the data. Should I use HDFS and insert the files using Java with the Hadoop API? Or is it better to use other tools (e.g. HBase, a relational database, a NoSQL database) to insert the data, so that Hadoop MapReduce takes its input from them?
Please advise.
Thank you very much.
I would suggest the HDFS/Hive/JsonSerDe approach.
The solution outline would look like this:
Store your JSON data on HDFS.
Create external tables using Hive and use a JsonSerDe to map the JSON data to the columns of your table.
Query your data using HiveQL.
In the above solution, since Hive is schema-on-read, your JSON data will be parsed every time you query the tables.
But if you want to parse the data only once and your data arrives in batches (weekly, monthly), it would be good to parse it once into a staging table, which can then be used for frequent querying and avoids repetitive parsing by the SerDe.
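A rough sketch of that outline, driven from Java over Hive JDBC. The HDFS path, table and column names are placeholders; the SerDe class shown is the one bundled with HCatalog (other JSON SerDes exist and may need an ADD JAR first).

// Sketch: external table over raw JSON plus a parsed staging table
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JsonOnHive {
    public static void main(String[] args) throws Exception {
        try (Connection hive = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default");
             Statement stmt = hive.createStatement()) {

            // External table mapping the raw JSON files already stored on HDFS
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS raw_events (id STRING, payload STRING, ts BIGINT) "
              + "ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' "
              + "STORED AS TEXTFILE LOCATION '/data/raw/events'");

            // Staging table: parse the JSON once per batch into a columnar format
            // (use INSERT INTO events_staged SELECT ... for subsequent batches)
            stmt.execute(
                "CREATE TABLE events_staged STORED AS ORC "
              + "AS SELECT id, payload, ts FROM raw_events");

            // Frequent queries then hit the staged table and skip the SerDe parsing
            try (ResultSet rs = stmt.executeQuery("SELECT count(*) FROM events_staged")) {
                if (rs.next()) {
                    System.out.println("Staged rows: " + rs.getLong(1));
                }
            }
        }
    }
}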
I have an example created at: Hadoopgig

Download complete DB as CSV

I have a complicated database with many tables. Some of them are related, as in the following table:
The requirement is to collect all of the corresponding rows in all of the tables, starting from one identifiable field in the entry-level table, and download them as a CSV.
What comes to my mind is a simple iterative strategy that stores the relevant data as it goes. But this seems inefficient, since the queries take too long and I have to iterate several times to get everything I need.
Is there a better approach to this problem? I'm using JSP, Java, Spring and MySQL.
I would suggest using the mysql command line or the mysqldump utility (the fastest approach); you can also use a DB tool like Toad or MySQL Workbench. See if these posts help (a Java-only alternative is sketched after the links):
mysqlworkbench
mysqldump
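If you'd rather stay inside the Java/Spring application instead of shelling out to mysqldump, a single JOIN streamed straight to CSV avoids the iterative per-table queries. A rough sketch; the table, column and key names are invented for illustration, and the CSV escaping is deliberately naive:

// Sketch: one JOIN query for all related rows, streamed to a CSV file
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;

public class ExportRelatedRowsAsCsv {
    public static void main(String[] args) throws Exception {
        String sql = "SELECT e.*, a.*, b.* FROM entry e "
                   + "JOIN table_a a ON a.entry_id = e.id "
                   + "JOIN table_b b ON b.entry_id = e.id "
                   + "WHERE e.identifier = ?";

        try (Connection con = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/mydb", "user", "password");
             PreparedStatement ps = con.prepareStatement(sql);
             PrintWriter out = new PrintWriter("export.csv", "UTF-8")) {

            ps.setString(1, "SOME-ID");
            try (ResultSet rs = ps.executeQuery()) {
                ResultSetMetaData md = rs.getMetaData();
                int cols = md.getColumnCount();
                while (rs.next()) {
                    StringBuilder line = new StringBuilder();
                    for (int i = 1; i <= cols; i++) {
                        if (i > 1) line.append(',');
                        String v = rs.getString(i);
                        line.append(v == null ? "" : v.replace(",", "\\,")); // naive CSV escaping
                    }
                    out.println(line);
                }
            }
        }
    }
}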

Best way to have a fast access key-value storage for huge dataset (5 GB)

There is a dataset of ~5 GB in size. This big dataset has just one key-value pair per line.
Now the values for these keys need to be read some billion times.
I have already tried the disk-based approach of MapDB, but it throws a ConcurrentModificationException and isn't mature enough to be used in a production environment yet.
I also don't want to put it in a DB and make the call a billion times (though a certain level of in-memory caching could be done here).
Basically, I need to access this key-value dataset in the mapper/reducer of a Hadoop job step.
So after trying out a bunch of things, we are now using SQLite.
The following is what we did (a rough sketch of the Mapper side follows after the list):
We load all the key-value pair data into a pre-defined database file (indexed on the key column; although this increased the file size, it was worth it).
We store this file (key-value.db) in S3.
The file is then passed to the Hadoop jobs via the distributed cache.
In the configure() of the Mapper/Reducer, a connection is opened to the db file (it takes around 50 ms).
In the map/reduce method, this db is queried with the key (it took negligible time; I didn't even need to profile it, it was that negligible!).
The connection is closed in the cleanup() method of the Mapper/Reducer.
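Here is a rough sketch of those steps using the new MapReduce API (setup/cleanup instead of configure). The S3 path, symlink name, table and column names are placeholders, and it assumes the sqlite-jdbc driver jar is shipped with the job.

// Sketch: look up values from a SQLite file distributed via the cache
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SqliteLookupMapper extends Mapper<Object, Text, Text, Text> {
    private Connection db;
    private PreparedStatement lookup;

    @Override
    protected void setup(Context context) throws IOException {
        try {
            // The db file was added in the driver with something like:
            //   job.addCacheFile(new URI("s3://my-bucket/key-value.db#keyvalue.db"));
            // so it is available in the task's working directory via the symlink.
            Class.forName("org.sqlite.JDBC");
            db = DriverManager.getConnection("jdbc:sqlite:keyvalue.db");
            lookup = db.prepareStatement("SELECT value FROM kv WHERE key = ?");
        } catch (Exception e) {
            throw new IOException("Could not open cached SQLite db", e);
        }
    }

    @Override
    protected void map(Object key, Text line, Context context)
            throws IOException, InterruptedException {
        try {
            lookup.setString(1, line.toString());
            try (ResultSet rs = lookup.executeQuery()) {
                if (rs.next()) {
                    context.write(line, new Text(rs.getString(1)));
                }
            }
        } catch (Exception e) {
            throw new IOException("Lookup failed", e);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        try {
            if (db != null) db.close();
        } catch (Exception e) {
            throw new IOException(e);
        }
    }
}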
Try Redis. It seems this is exactly what you need.
I would try Oracle Berkeley DB Java Edition. It supports Maps and is both mature and scalable.
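For what that could look like, here is a minimal read-only lookup sketch with Berkeley DB Java Edition (com.sleepycat.je); the environment directory, database name and key are placeholders, and the store is assumed to have been created and loaded beforehand.

// Sketch: read-only key lookup against an existing Berkeley DB JE store
import java.io.File;
import java.nio.charset.StandardCharsets;

import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

public class BdbLookup {
    public static void main(String[] args) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setReadOnly(true);
        Environment env = new Environment(new File("/data/kv-store"), envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setReadOnly(true);
        Database db = env.openDatabase(null, "kv", dbConfig);

        DatabaseEntry key = new DatabaseEntry("some-key".getBytes(StandardCharsets.UTF_8));
        DatabaseEntry value = new DatabaseEntry();
        if (db.get(null, key, value, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
            System.out.println(new String(value.getData(), StandardCharsets.UTF_8));
        }

        db.close();
        env.close();
    }
}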
I noticed you tagged this with elastic-map-reduce... if you're running on AWS, maybe DynamoDB would be appropriate.
Also, I'd like to clarify: is this dataset going to be the input to your MapReduce job, or is it a supplementary dataset that will be accessed randomly during the MapReduce job?
