I am writing a Spark Streaming job that consumes data from Kafka and writes it to an RDBMS. I am currently stuck because I do not know which would be the most efficient way to store this streaming data in an RDBMS.
On searching, I found a few methods -
Using DataFrame
Using JdbcRDD
Creating a connection and PreparedStatement inside foreachPartition() of the RDD and inserting with PreparedStatement.addBatch()/executeBatch() (a sketch of this approach is shown below)
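For context, a minimal sketch of that third option, assuming a JavaDStream built from the Kafka source; the JDBC URL, table name, and the Record type/fields are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Iterator;
import org.apache.spark.streaming.api.java.JavaDStream;

// A sketch only: URL, credentials, table, and the Record type are placeholders.
public class JdbcSinkSketch {

    public static void writeToRdbms(JavaDStream<Record> stream) {
        stream.foreachRDD(rdd ->
            rdd.foreachPartition((Iterator<Record> records) -> {
                // One connection and one PreparedStatement per partition, reused for all rows.
                try (Connection conn = DriverManager.getConnection(
                         "jdbc:postgresql://localhost:5432/mydb", "user", "password");
                     PreparedStatement ps = conn.prepareStatement(
                         "INSERT INTO events (id, payload) VALUES (?, ?)")) {
                    conn.setAutoCommit(false);
                    while (records.hasNext()) {
                        Record r = records.next();
                        ps.setString(1, r.getId());
                        ps.setString(2, r.getPayload());
                        ps.addBatch();          // accumulate rows client-side
                    }
                    ps.executeBatch();          // flush the whole partition in one batch
                    conn.commit();
                }
            }));
    }

    // Placeholder type standing in for whatever is deserialized from Kafka.
    public static class Record implements java.io.Serializable {
        private final String id;
        private final String payload;
        public Record(String id, String payload) { this.id = id; this.payload = payload; }
        public String getId() { return id; }
        public String getPayload() { return payload; }
    }
}
```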
I cannot figure out which one would be the most efficient way to achieve my goal.
The same question applies to storing and retrieving data from HBase.
Can anyone help me with this?
In my problem I need to query a database and join the query results with a Kafka data stream in Flink. Currently this is done by storing the query results in a file and then using Flink's readFile functionality to create a DataStream of the query results. What would be a better approach that bypasses the intermediate step of writing to a file and creates a DataStream directly from the query results?
My current understanding is that I would need to write a custom SourceFunction as suggested here. Is this the right and only way, or are there any alternatives?
Are there any good resources for writing custom SourceFunctions, or should I just look at existing implementations for reference and customise them for my needs?
One straightforward solution would be to use a lookup join, perhaps with caching enabled.
Other possible solutions include Kafka Connect, or using something like Debezium to mirror the database table into Flink. Here's an example: https://github.com/ververica/flink-sql-CDC.
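For illustration, a lookup join in the Table API might look roughly like this; the connection URL, topic, table, and column names are all placeholders, it assumes the JDBC and Kafka connector dependencies are on the classpath, and the exact cache option names vary between Flink versions:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

// A sketch only: names, URLs, and credentials below are placeholders.
public class LookupJoinSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // JDBC-backed dimension table; the lookup.cache options enable caching of lookups.
        tEnv.executeSql(
            "CREATE TABLE customers (" +
            "  id INT, name STRING, PRIMARY KEY (id) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'jdbc'," +
            "  'url' = 'jdbc:postgresql://localhost:5432/mydb'," +
            "  'table-name' = 'customers'," +
            "  'username' = 'user'," +
            "  'password' = 'password'," +
            "  'lookup.cache.max-rows' = '10000'," +
            "  'lookup.cache.ttl' = '10min')");

        // Kafka-backed stream; proc_time is required for the FOR SYSTEM_TIME AS OF join.
        tEnv.executeSql(
            "CREATE TABLE orders (" +
            "  customer_id INT, amount DOUBLE, proc_time AS PROCTIME()" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'orders'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'scan.startup.mode' = 'earliest-offset'," +
            "  'format' = 'json')");

        // The lookup join: each incoming order triggers a (cached) query against the JDBC table.
        tEnv.executeSql(
            "SELECT o.customer_id, o.amount, c.name " +
            "FROM orders AS o " +
            "JOIN customers FOR SYSTEM_TIME AS OF o.proc_time AS c " +
            "ON o.customer_id = c.id").print();
    }
}
```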
I have a Spring Boot project which will pull a large amount of data from one database, do some kind of transformation on it, and then insert it into a table in a PostgreSQL database. This process will run over a few billion records, so performance is key.
I've been researching the best way to do this, such as using an ORM or a JdbcTemplate, for example. One thing I keep seeing constantly regarding bulk inserts into PostgreSQL is the COPY command: https://www.postgresql.org/docs/current/populate.html
I'm confused because using COPY seems to require the data to be written into a file, and while I've seen people recommend it, I've yet to come across a case where someone mentions how to get the data into the file. Isn't writing to a file slow? And if writing to a file is slow, don't the performance gains of COPY get cancelled out, so that there is effectively no gain at all?
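For reference, COPY does not strictly require an intermediate file: the PostgreSQL JDBC driver exposes it through CopyManager, which can stream rows from any Reader. A minimal sketch, where the table name and CSV layout are placeholders:

```java
import java.io.StringReader;
import java.sql.Connection;
import java.sql.DriverManager;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

// A sketch only: URL, credentials, table, and columns are placeholders.
public class CopyInSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password")) {

            // In practice these rows would be produced by the transformation step
            // and buffered in memory (or piped through a Reader) rather than a file.
            String rows = "1,alice\n2,bob\n";

            CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
            long inserted = copy.copyIn(
                "COPY people (id, name) FROM STDIN WITH (FORMAT csv)",
                new StringReader(rows));
            System.out.println("Rows copied: " + inserted);
        }
    }
}
```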
This kind of data migration and conversion is better handled in stored procedures. Assuming the source data is already loaded into Postgres (if not, use a Postgres utility to load the raw data into a flat staging table), write a series of stored procedures to transform the data and insert it into the destination table.
I have done some complex data migrations and used this approach. If you have to do a lot of complex data conversion, write a Python script (which is usually faster than a Spring Boot/Spring Data setup), insert the partially converted data, and then use stored procedures to do the final conversion.
It is better to keep the business logic that converts/massages the data close to the data source (in stored procedures) instead of pulling the data to the app server and reinserting it.
Hope it helps.
I have a requirement to build a system with Java and Hadoop to handle large-scale data processing (in JSON format). The system I'm going to create includes inserting data into file storage (whether that is HDFS or a database) and querying the processed data.
I have a big-picture idea of using Hadoop MapReduce to query the data that the user wants.
But one thing that confuses me is how I should insert the data. Should I use HDFS and insert the files using Java with the Hadoop API? Or is it better to use other tools (e.g. HBase, a relational database, a NoSQL database) to insert the data, so that Hadoop MapReduce takes its input from those tools?
Please advise.
Thank you very much
I would suggest the HDFS/Hive/JsonSerDe approach.
The solution outline would look like this:
Store your JSON data on HDFS.
Create external tables using Hive and use a JsonSerDe to map the JSON data to the columns of your table.
Query your data using HiveQL.
In the above solution, since Hive is schema-on-read, your JSON data will be parsed every time you query the tables.
But if you want to parse the data only once and your data arrives in batches (weekly, monthly), it would be good to parse it once and create a staging table, which can then be used for frequent querying and avoids repetitive parsing by the SerDe.
I have an example created at: Hadoopgig
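To make the outline concrete, a rough sketch of the external-table and query steps through Hive JDBC (host, HDFS path, table, and columns are hypothetical, and it assumes hive-jdbc plus the hive-hcatalog-core jar for the JsonSerDe are available):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// A sketch only: host, credentials, path, and schema are placeholders.
public class HiveJsonSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // External table mapping JSON fields to columns via JsonSerDe (schema-on-read).
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS events (" +
                "  id STRING, ts BIGINT, payload STRING) " +
                "ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' " +
                "LOCATION '/data/events_json'");

            // Query the JSON-backed table with plain HiveQL.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT id, ts FROM events LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
}
```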
I want to write a Java program which does a MapReduce job (e.g. word count). The input comes from Redis. How can I write the Map class to retrieve records one by one from Redis and do some processing in the Map class, as I did before when reading from HDFS?
There is no OOTB feature that allows us to do that. But you might find things like Jedis helpful. Jedis is a Java client with which you can read/write data to/from Redis. See this for an example.
If you are not strongly coupled to Java, you might also find R3 useful. R3 is a MapReduce engine written in Python using a Redis backend.
HTH
Obviously, you need to customize your InputFormat.
Please read this tutorial to learn how to write your own custom InputFormat and RecordReader.
Put your keys in HDFS. In map(), just query Redis based on the input key.
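A minimal sketch of that idea, assuming each input line in HDFS is a Redis key and using Jedis for the lookup (host, port, and the emitted output are placeholders):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import redis.clients.jedis.Jedis;

// A sketch only: connection details and the key/value layout are placeholders.
public class RedisLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Jedis jedis;

    @Override
    protected void setup(Context context) {
        // One connection per mapper task.
        jedis = new Jedis("localhost", 6379);
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String key = line.toString().trim();
        String value = jedis.get(key);                      // fetch the record from Redis
        if (value != null) {
            context.write(new Text(key), new Text(value)); // process/emit as needed
        }
    }

    @Override
    protected void cleanup(Context context) {
        jedis.close();
    }
}
```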
Try Redisson; it's a Redis-based in-memory data grid for Java. It allows you to execute MapReduce over data stored in Redis.
More documentation here.
I am creating reports on MongoDB using Java, and I need to use map-reduce to create the reports. I have 3 replicas in production. For report queries I do not want to make requests to the primary MongoDB database; I want to query only a secondary replica. However, if we use map-reduce it will create a temporary collection.
1) Is there any problem if I set the read preference to secondary for reports using map-reduce?
2) Will it create the temporary collection on the secondary replica?
3) Is there any other way to use the secondary replica for reporting purposes, since I do not want to create traffic on the primary database?
4) Will I get the correct results, given that I have a huge amount of data?
Probably the easiest way to do this is to just connect to the secondary directly, instead of connecting to the replica set with ReadPreference.SECONDARY_ONLY. In that case, it will definitely create the temporary collection on the secondary and you should get the correct results (of course!).
I would also advise you to look at the Aggregation Framework, though, as it's a lot faster and often easier to use and debug than map-reduce jobs. It's not as powerful, but I have yet to find a situation where I couldn't use the Aggregation Framework for my aggregation and reporting needs.
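If you do connect through the replica set rather than directly, a rough sketch of the Aggregation Framework approach with a secondary read preference might look like this (database, collection, and field names are placeholders, using the modern Java driver):

```java
import java.util.Arrays;
import org.bson.Document;
import com.mongodb.ReadPreference;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

// A sketch only: hosts, database, collection, and fields are placeholders.
public class SecondaryAggregationSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create(
                "mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0")) {

            MongoCollection<Document> events = client
                    .getDatabase("reports")
                    .getCollection("events")
                    .withReadPreference(ReadPreference.secondary()); // route reads to a secondary

            // Equivalent of a simple "count per status" report, run as an aggregation pipeline.
            events.aggregate(Arrays.asList(
                    new Document("$group", new Document("_id", "$status")
                            .append("count", new Document("$sum", 1)))))
                  .forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}
```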