MapReduce read input from Redis - java

I want to write a Java program that does a MapReduce job (e.g. word count), but the input comes from Redis. How can I write the Map class so that it retrieves records one by one from Redis and processes them in the Map class, the same way I did before when reading from HDFS?

There is no OOTB feature that allows us to do that. But you might find something like Jedis helpful. Jedis is a Java client with which you can read/write data to/from Redis. See this for an example.
If you are not strongly coupled to Java, you might also find R3 useful. R3 is a map-reduce engine written in Python with a Redis backend.
HTH
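To give a feel for the Jedis API, here is a minimal round trip (the key name and Redis address are placeholders); in a MapReduce job you would typically open the connection once in setup() and close it in cleanup():

import redis.clients.jedis.Jedis;

public class JedisExample {
    public static void main(String[] args) {
        // Connect to a local Redis instance (adjust host/port for your setup).
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.set("doc:1", "hello hadoop hello redis"); // write a value
            String text = jedis.get("doc:1");               // read it back
            System.out.println(text);
        }
    }
}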

Obviously, you need to customize your InputFormat.
Please read this tutorial to learn how to write your own custom InputFormat and RecordReader.
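As a rough illustration (not the tutorial's code), a Redis-backed InputFormat could look something like the skeleton below. The class names, the key pattern doc:*, the hard-coded Redis address and the single-split strategy are all assumptions made for brevity; a real implementation would shard the keyspace across several splits and avoid the blocking KEYS command:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

import redis.clients.jedis.Jedis;

public class RedisInputFormat extends InputFormat<Text, Text> {

    // A single split covering the whole keyspace; a real implementation would shard it.
    public static class RedisSplit extends InputSplit implements Writable {
        @Override public long getLength() { return 0; }
        @Override public String[] getLocations() { return new String[0]; }
        @Override public void write(DataOutput out) throws IOException { }
        @Override public void readFields(DataInput in) throws IOException { }
    }

    @Override
    public List<InputSplit> getSplits(JobContext context) {
        return Collections.singletonList(new RedisSplit());
    }

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
        return new RedisRecordReader();
    }

    public static class RedisRecordReader extends RecordReader<Text, Text> {
        private Jedis jedis;
        private Iterator<String> keys;
        private final Text currentKey = new Text();
        private final Text currentValue = new Text();
        private int total;
        private int read;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext ctx) {
            jedis = new Jedis("localhost", 6379);       // assumed Redis location
            Set<String> all = jedis.keys("doc:*");      // assumed key pattern
            total = all.size();
            keys = all.iterator();
        }

        @Override
        public boolean nextKeyValue() {
            if (!keys.hasNext()) {
                return false;
            }
            String k = keys.next();
            currentKey.set(k);
            currentValue.set(jedis.get(k));             // one Redis value per record
            read++;
            return true;
        }

        @Override public Text getCurrentKey() { return currentKey; }
        @Override public Text getCurrentValue() { return currentValue; }
        @Override public float getProgress() { return total == 0 ? 1f : (float) read / total; }
        @Override public void close() { if (jedis != null) jedis.close(); }
    }
}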

Put your keys in HDFS. In map(), just query Redis based on the input key.
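For example, a mapper along these lines (a sketch only; the Redis address and the word-count logic on the fetched value are assumptions) would read one key per input line and pull the corresponding value from Redis:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import redis.clients.jedis.Jedis;

public class RedisLookupMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private Jedis jedis;
    private final Text word = new Text();

    @Override
    protected void setup(Context context) {
        jedis = new Jedis("localhost", 6379);            // assumed Redis location
    }

    @Override
    protected void map(LongWritable offset, Text redisKey, Context context)
            throws IOException, InterruptedException {
        String value = jedis.get(redisKey.toString());   // fetch the document for this key
        if (value == null) {
            return;
        }
        for (String token : value.split("\\s+")) {       // classic word count on the value
            word.set(token);
            context.write(word, ONE);
        }
    }

    @Override
    protected void cleanup(Context context) {
        if (jedis != null) {
            jedis.close();
        }
    }
}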

Try Redisson, a Redis-based In-Memory Data Grid for Java. It allows you to execute MapReduce over data stored in Redis.
More documentation here.
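A rough word-count sketch with Redisson's MapReduce API might look like the following; the map name "documents" is an assumption, and the exact interfaces can vary between Redisson versions, so check the documentation linked above:

import java.io.Serializable;
import java.util.Iterator;
import java.util.Map;

import org.redisson.Redisson;
import org.redisson.api.RMap;
import org.redisson.api.RedissonClient;
import org.redisson.api.mapreduce.RCollector;
import org.redisson.api.mapreduce.RMapReduce;
import org.redisson.api.mapreduce.RMapper;
import org.redisson.api.mapreduce.RReducer;
import org.redisson.config.Config;

public class RedissonWordCount {

    public static class WordMapper implements RMapper<String, String, String, Integer>, Serializable {
        @Override
        public void map(String key, String value, RCollector<String, Integer> collector) {
            for (String word : value.split("\\s+")) {
                collector.emit(word, 1);                 // emit (word, 1) for each token
            }
        }
    }

    public static class WordReducer implements RReducer<String, Integer>, Serializable {
        @Override
        public Integer reduce(String reducedKey, Iterator<Integer> counts) {
            int sum = 0;
            while (counts.hasNext()) {
                sum += counts.next();                    // sum the partial counts per word
            }
            return sum;
        }
    }

    public static void main(String[] args) {
        Config config = new Config();
        config.useSingleServer().setAddress("redis://127.0.0.1:6379");
        RedissonClient redisson = Redisson.create(config);

        // "documents" is an assumed map name: document id -> document text
        RMap<String, String> documents = redisson.getMap("documents");
        RMapReduce<String, String, String, Integer> mapReduce =
                documents.<String, Integer>mapReduce()
                        .mapper(new WordMapper())
                        .reducer(new WordReducer());

        Map<String, Integer> wordCounts = mapReduce.execute();
        System.out.println(wordCounts);
        redisson.shutdown();
    }
}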

Related

Creating a Flink DataStream from database query results

In my problem I need to query a database and join the query results with a Kafka data stream in Flink. Currently this is done by storing the query results in a file and then using Flink's readFile functionality to create a DataStream of query results. What could be a better approach that bypasses the intermediary step of writing to a file and creates a DataStream directly from the query results?
My current understanding is that I would need to write a custom SourceFunction as suggested here. Is this the right and only way or are there any alternatives?
Are there any good resources for writing custom SourceFunctions, or should I just look at current implementations for reference and customise them for my needs?
One straightforward solution would be to use a lookup join, perhaps with caching enabled.
Other possible solutions include Kafka Connect, or using something like Debezium to mirror the database table into Flink. Here's an example: https://github.com/ververica/flink-sql-CDC.
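If you do go down the custom SourceFunction route, a minimal bounded, non-parallel sketch could look like this. The JDBC URL, query and Tuple2 record type are placeholders, and note that newer Flink versions deprecate SourceFunction in favour of the unified Source API; for pure enrichment the lookup join mentioned above is usually the better fit:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

public class JdbcQuerySource extends RichSourceFunction<Tuple2<Long, String>> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Tuple2<Long, String>> ctx) throws Exception {
        // Placeholder connection details and query; swap in your own database.
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://db:5432/app", "user", "pass");
             PreparedStatement stmt = conn.prepareStatement("SELECT id, name FROM customers");
             ResultSet rs = stmt.executeQuery()) {
            while (running && rs.next()) {
                // Emit each row; a POJO or Row would work just as well as Tuple2.
                ctx.collect(Tuple2.of(rs.getLong("id"), rs.getString("name")));
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Tuple2<Long, String>> queryResults = env.addSource(new JdbcQuerySource());
        queryResults.print();                     // downstream you would join this with the Kafka stream
        env.execute("jdbc-query-source");
    }
}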

Spark: sparkSession read from the result of an http response

Small question regarding Spark and how to read from the result of an HTTP response, please.
It is well known that Spark can take a database, a CSV file, etc. as a data source:
sparkSession.read().format("csv").load("path/to/people.csv");
sparkSession.read().format("org.apache.spark.sql.cassandra").options(properties).load()
May I ask how to read directly from the result of an HTTP call, please?
Without having to dump the data into another intermediate CSV file or intermediate database table.
For instance, the CSV file or database would contain millions of rows, and once read, the job needs to perform some kind of map-reduce operation.
Now the exact same data comes from the result of an HTTP call. It is small enough for the network layer, but the information contained inside the payload is big, so I would like to apply the same map-reduce.
How can I read from the response of an HTTP call?
Thank you
You have two options for reading data in Spark:
Read directly to the driver and distribute to the executors (not scalable, as everything passes through the driver)
Read directly from the executors
The built-in data sources like CSV, Parquet, etc. all implement reading from the executors so the job can scale with the data. They define how each partition of the data should be read - e.g. if we have 10 executors, how do you cut the data source into 10 sections so each executor can read one section directly.
If you want to load from an HTTP request, you will either have to read through the driver and distribute (which may be OK if you know the data is going to be less than ~10 MB; a minimal driver-side sketch is shown after this answer), or you will need to implement a custom data source so the executors can each read a partition. You can read more here: https://aamargajbhiye.medium.com/speed-up-apache-spark-job-execution-using-a-custom-data-source-fd791a0fa4b0
I will finish by saying that this second option is almost definitely an anti-pattern. You will likely be much better off providing an intermediate staging environment (e.g. S3/GCS), calling the server to load the data to the intermediate store and then reading it into Spark on completion. In scenario 2, you will likely end up putting too much load on the server, amongst other issues.
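For the driver-side option, a minimal sketch might look like this, assuming the endpoint returns newline-delimited JSON that comfortably fits on the driver (the URL and the country column are placeholders):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HttpToDataset {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("http-to-dataset").getOrCreate();

        // Fetch the payload on the driver (fine only while it stays small). Requires Java 11+.
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/api/people")).build();
        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

        // Distribute the lines across the cluster and let Spark parse them as JSON.
        List<String> lines = Arrays.asList(body.split("\n"));
        Dataset<String> raw = spark.createDataset(lines, Encoders.STRING());
        Dataset<Row> people = spark.read().json(raw);

        // Any map/reduce-style aggregation from here; "country" is an assumed column.
        people.groupBy("country").count().show();
    }
}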
In previous lifetimes, I created a custom datasource. It is not the most trivial thing to do, but this GitHub repo explains it: https://github.com/jgperrin/net.jgp.books.spark.ch09.
When it comes to reading from a network stream, make sure that only one executor does it.

example retrieving redis values as stream in java

I have a Redis key/value store holding blobs (tens of MB in size), and the Jedis client I am using in my Java application returns a byte array from the Jedis connection's get method. Currently, I have to wrap the result in a stream in order to process the bytes. Are there any alternatives that would allow me to stream the result directly? Other clients, or other ways to use Jedis? Thanks for any advice.
If you don't find an existing driver that does what you want, you can call Redis directly from your Java code.
The protocol used by a Redis server, RESP (REdis Serialization Protocol), is very simple. I studied it and implemented a complete Java driver in less than half a day, just to test my abilities. Here is the link to the RESP specification. You can, for example, start from an existing driver and add a custom feature that streams the data the way you like.
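To illustrate how simple RESP is, here is a toy GET issued over a raw socket (no pipelining or error handling; the key name is a placeholder). The reply arrives as a bulk string, a $<length> header followed by the raw bytes, which you could consume incrementally from the socket's InputStream:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class RawRespGet {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("localhost", 6379)) {
            OutputStream out = socket.getOutputStream();
            // RESP array of two bulk strings: GET mykey
            out.write("*2\r\n$3\r\nGET\r\n$5\r\nmykey\r\n".getBytes(StandardCharsets.UTF_8));
            out.flush();

            // Read the beginning of the reply; a real client would parse the "$<length>\r\n"
            // header and then stream exactly that many bytes from the socket.
            InputStream in = socket.getInputStream();
            byte[] buffer = new byte[4096];
            int n = in.read(buffer);
            System.out.println(new String(buffer, 0, n, StandardCharsets.UTF_8));
        }
    }
}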
Redis has no dedicated type for storing large objects.
But you can store the blob as a string and, for buffered streaming, use Redis's GETRANGE command, which returns the part of the string in the given byte range.
Get the length of the data using STRLEN.
Use a Redis pipeline to issue a series of GETRANGE commands that read different pages of the data. Similarly, to set data you can use the SETRANGE command.
Ref: redis commands
Please find the specific implementation of these Redis commands in your Redis Java client.
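As a sketch of that idea with Jedis (assuming your Jedis version exposes the byte[] overloads of strlen and getrange; the key name and chunk size are placeholders), you could read the blob page by page and feed each page to your own streaming logic:

import java.nio.charset.StandardCharsets;

import redis.clients.jedis.Jedis;

public class RedisChunkedReader {

    private static final int CHUNK_SIZE = 1 << 20; // read 1 MB per GETRANGE call

    public static void main(String[] args) {
        byte[] key = "my:blob".getBytes(StandardCharsets.UTF_8);   // assumed key
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            long length = jedis.strlen(key);                       // STRLEN: total value size
            for (long offset = 0; offset < length; offset += CHUNK_SIZE) {
                long end = Math.min(offset + CHUNK_SIZE, length) - 1; // GETRANGE end is inclusive
                byte[] chunk = jedis.getrange(key, offset, end);
                process(chunk);                                    // incremental processing
            }
        }
    }

    private static void process(byte[] chunk) {
        // placeholder for whatever streaming/processing you need per chunk
    }
}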

reading data from secondary mongo database

I am creating reports on MongoDB using Java, and I need to use map-reduce to build the reports. I have 3 replicas in production. For the report queries I do not want to make requests to the primary MongoDB database; I want to query only a secondary replica. However, map-reduce creates a temporary collection.
1) Is there any problem if I set the read preference to secondary for reports that use map-reduce?
2) Will the temporary collection be created on the secondary replica?
3) Is there any other way to use the secondary replica for reporting, since I do not want to create traffic on the primary database?
4) Will I get correct results, given that I have huge data?
Probably the easiest way to do this is to just connect to the secondary directly, instead of connecting to the replica set with ReadPreference.SECONDARY_ONLY. In that case, it will definitely create the temporary collection on the secondary and you should get the correct results (of course!).
I would also advise you to look at the Aggregation Framework, though, as it's a lot faster and often easier to use and debug than Map Reduce jobs. It's not as powerful, but I have yet to find a situation where I couldn't use the Aggregation Framework for my aggregation and reporting needs.
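For completeness, a report run purely with the Aggregation Framework and a secondary read preference might look roughly like this with the Java driver (the database, collection and field names are made up for illustration):

import java.util.Arrays;

import org.bson.Document;

import com.mongodb.ReadPreference;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;

public class SecondaryReport {
    public static void main(String[] args) {
        // Connect to the replica set; reads on this collection will prefer secondaries.
        try (MongoClient client = MongoClients.create(
                "mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0")) {

            MongoCollection<Document> orders = client.getDatabase("shop")
                    .getCollection("orders")
                    .withReadPreference(ReadPreference.secondaryPreferred());

            // Report: total amount per customer, the kind of thing map-reduce is often used for.
            for (Document row : orders.aggregate(Arrays.asList(
                    Aggregates.group("$customerId", Accumulators.sum("total", "$amount"))))) {
                System.out.println(row.toJson());
            }
        }
    }
}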

Adding/Viewing/Deleting Data from HBase using PHP and Mapreduce in Java?

Hi friends,
I am building a web crawler and would like to know a few things about that:
1) Can I use MapReduce to fetch the data from the net?
2) Can I save the fetched data to HBase?
3) Can I write an app in PHP to fetch the data from HBase? If yes, can you give me a code snippet? How can I add/view/delete data in HBase using PHP?
For your questions, yes, it can all be done. How you approach it depends on what exactly you want to achieve.
1) Your main control would need to partition the task. You would likely maintain some kind of list of addresses to crawl, possibly running sequential MapReduce jobs that each read the list in and split it between mappers, which do the crawling and write directly to HBase or another intermediary. The mappers would also output newly discovered URLs to crawl next, which would be filtered down to uniques in the reduce phase, with the reducer outputting the list of things to crawl next. You'd need to maintain a list of recently crawled pages and filter that out too, but that's not specific to MR/HBase.
2) You can use TableOutputFormat to send the outputs to HBase. You can also just make an HBase connection (HTable in the old API) and write directly from your mapper (see the sketch below).
3) As TheDeveloper said, yes, with thrift. His link is good.
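Here is a rough sketch of point 2, a mapper that writes each fetched page straight into HBase with the client API; the table name, column family, row-key scheme and the fetch() placeholder are assumptions, and in the old API you would use HTable instead of Connection/Table:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CrawlToHBaseMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
    private Connection connection;
    private Table table;

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = HBaseConfiguration.create(context.getConfiguration());
        connection = ConnectionFactory.createConnection(conf);
        table = connection.getTable(TableName.valueOf("webpages"));  // assumed table name
    }

    @Override
    protected void map(LongWritable offset, Text url, Context context) throws IOException {
        String html = fetch(url.toString());                         // your crawler logic here
        Put put = new Put(Bytes.toBytes(url.toString()));            // row key = URL
        put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("html"), Bytes.toBytes(html));
        table.put(put);                                              // write the page into HBase
    }

    private String fetch(String url) {
        return "";  // placeholder: the real implementation would do the HTTP fetch
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        table.close();
        connection.close();
    }
}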
For question number 3, you can interact with HBase from PHP, but you need to do it via the Thrift interface. See this blog post for more info. Hope this helps
Can be done easily via REST using Stargate.
