I have a Redis key/value store holding blobs (tens of MB each), and the Jedis client I am using in my Java application returns a byte array from the connection's get method. Currently I have to wrap the result in a stream in order to process the bytes. Are there any alternatives that would let me stream the result directly? Other clients, or other ways to use Jedis? Thanks for any advice.
If you can't find an existing driver that does what you need, you can talk to Redis directly from your Java code.
The protocol used by a Redis server, RESP (REdis Serialization Protocol), is very simple. I studied it and implemented a complete Java driver in less than half a day, just to test my abilities. Here is the link to the RESP specification. You could, for example, start from an existing driver and add a custom feature that streams the data the way you want.
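For illustration only, here is a minimal sketch of what streaming a GET reply straight off the socket could look like. It assumes a plain RESP2 connection (no AUTH, no TLS, no error-reply handling) and copies the bulk-string payload to an OutputStream in small chunks instead of materialising the whole blob:

import java.io.BufferedInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class RespStreamingGet {

    // Sends GET <key> and streams the bulk-string reply into `sink` in 8 KB chunks.
    public static void streamGet(Socket socket, String key, OutputStream sink) throws IOException {
        OutputStream out = socket.getOutputStream();
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        // RESP encoding of the command: *2\r\n$3\r\nGET\r\n$<len>\r\n<key>\r\n
        out.write(("*2\r\n$3\r\nGET\r\n$" + keyBytes.length + "\r\n").getBytes(StandardCharsets.US_ASCII));
        out.write(keyBytes);
        out.write("\r\n".getBytes(StandardCharsets.US_ASCII));
        out.flush();

        InputStream in = new BufferedInputStream(socket.getInputStream());
        String header = readLine(in);                    // e.g. "$10485760" for a 10 MB value
        if (!header.startsWith("$")) {
            throw new IOException("Unexpected reply: " + header);
        }
        long remaining = Long.parseLong(header.substring(1));
        if (remaining < 0) {
            return;                                      // "$-1" means the key does not exist
        }
        byte[] buf = new byte[8192];
        while (remaining > 0) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n < 0) throw new EOFException("Connection closed mid-reply");
            sink.write(buf, 0, n);
            remaining -= n;
        }
        in.read();                                       // trailing \r
        in.read();                                       // trailing \n
    }

    // Reads a single CRLF-terminated RESP line.
    private static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1 && b != '\r') {
            sb.append((char) b);
        }
        in.read();                                       // consume the '\n'
        return sb.toString();
    }
}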
Redis has no data type specifically meant for storing large objects.
But you can store the blob as a string value, and for buffered streaming you can use the GETRANGE command, which returns the substring for a given byte range.
Get the length of the data using STRLEN.
Use a Redis pipeline to issue a series of GETRANGE commands that read different pages of the data (see the sketch below). Similarly, to write data in pieces you can use the SETRANGE command.
Ref: redis commands
Please look up the specific methods for these Redis commands in your Redis Java client.
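As a rough illustration of the GETRANGE approach, assuming the Jedis client and its binary strlen/getrange overloads, a chunked read could look something like this (the 1 MB chunk size is arbitrary, and error handling is omitted):

import java.io.IOException;
import java.io.OutputStream;
import redis.clients.jedis.Jedis;

public class ChunkedRedisRead {

    private static final int CHUNK_SIZE = 1024 * 1024;    // read 1 MB at a time

    // Streams the value stored at `key` into `sink` without holding the whole blob in memory.
    public static void copyTo(Jedis jedis, byte[] key, OutputStream sink) throws IOException {
        long length = jedis.strlen(key);                   // STRLEN
        for (long offset = 0; offset < length; offset += CHUNK_SIZE) {
            long end = Math.min(offset + CHUNK_SIZE, length) - 1;
            byte[] chunk = jedis.getrange(key, offset, end);   // GETRANGE is inclusive on both ends
            sink.write(chunk);
        }
    }
}

Pipelining the GETRANGE calls, as suggested above, would cut down on round trips further.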
I'm using Hazelcast change data capture (CDC) in my application. (The reason I'm using CDC is that loading the data into the cache via JDBC or other alternatives takes too much time.) So CDC keeps the data in sync between the database and Hazelcast Jet.
StreamSource<ChangeRecord> source = PostgresCdcSources.postgres("source")
.setCustomProperty("plugin.name", "pgoutput").setDatabaseAddress("127.0.0.1").setDatabasePort(5432)
.setDatabaseUser("postgres").setDatabasePassword("root").setDatabaseName("postgres")
.setTableWhitelist("tblName").build();
Here are the steps I have:
Pipeline pipeline = Pipeline.create();
// filter records based on deleted false
StreamStage<ChangeRecord> deletedFlagRecords = pipeline.readFrom(source).withoutTimestamps()
.filter(deletedFalse);
deletedFlagRecords.filter(idBasedFetch).writeTo(Sinks.logger());
Here I'm using the StreamSource<ChangeRecord> source object as the input for my pipeline. As you know, the source is a streaming type. But in my case the pipeline's processing depends on user input (some metadata). If I do any update or delete in the DB, Jet updates all the stream instances. Since my processing depends on the user data, I don't want a streaming stage after the first step. Only the first StreamSource<ChangeRecord> source needs to be a stream; from the next step on I just want batch processing. So how can I use the source for batch processing?
pipeLine.readFrom(source) always returns a stream stage, so how do I convert it into a batch stage? I tried one more approach: read from the source and sink everything to a map.
pipeLine.readFrom(source).writeTo(Sinks.map("dbStreamedData", e -> e.key(), e -> e.value()));
Then construct a pipeline that reads from that map:
pipeline.readFrom(Sources.map("dbStreamedData")).writeTo(Sinks.logger());
But this just returns null data, so any suggestions would be helpful.
Pipeline.readFrom returns either a StreamStage or a BatchStage, depending on the source. Sources.map() is a batch source: it reads the map once and completes. PostgresCdcSources.postgres() is a streaming source: it connects to the DB and keeps returning events as they happen, until cancelled.
You need to pick a source depending on your use case, if this is your question.
Using a CDC source only makes sense if you need your data to be continuously updated. E.g. react to each update in the database, or possibly load data into a Map and then run a batch job repeatedly at some time interval on an in-memory snapshot.
In this case, you likely want that first batch run to happen only after the CDC source is up to date - after it has read all the current state from the database and is only receiving updates as they are made. Unfortunately, at the moment (Hazelcast 5.0) there is no way to tell when this happens using the Jet API.
You might be able to use some domain-specific signal instead - e.g. a timestamp field you can query for, or checking that the last inserted record is present in the map.
If you want to run a single batch job on data from a database table, you should use a JDBC source (see the sketch at the end of this answer).
(The reason I'm using CDC is that loading the data into the cache via JDBC or other alternatives takes too much time.)
Using CDC has its overhead, and this is not something we usually see: a plain SQL query like SELECT * FROM table with the JDBC source is normally faster than the CDC source. Maybe you are not measuring the time it takes to process the whole current state? If it really takes more time to load the data via JDBC than via CDC, please file an issue with a reproducer.
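For reference, a minimal batch pipeline using Jet's JDBC source might look roughly like this (connection URL, query and row mapping are placeholders, and this assumes the Sources.jdbc(url, query, mapOutputFn) convenience overload):

import com.hazelcast.jet.pipeline.BatchSource;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class JdbcBatchJob {

    public static Pipeline build() {
        // Reads the whole table once and completes - a BatchStage, not a StreamStage.
        BatchSource<String> source = Sources.jdbc(
                "jdbc:postgresql://127.0.0.1:5432/postgres?user=postgres&password=root",
                "SELECT id, name FROM tblName",
                rs -> rs.getLong("id") + "," + rs.getString("name"));   // placeholder row mapping

        Pipeline pipeline = Pipeline.create();
        pipeline.readFrom(source)
                .writeTo(Sinks.logger());
        return pipeline;
    }
}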
A small question regarding Spark and how to read from the result of an HTTP response, please.
It is well known that Spark can take a database, a CSV file, etc. as its data source:
sparkSession.read().format("csv").load("path/to/people.csv");
sparkSession.read().format("org.apache.spark.sql.cassandra").options(properties).load()
May I ask how to read directly from the result of an HTTP call, please?
Without having to dump the data into another intermediate CSV file or database table.
For instance, the CSV file or database would contain millions of rows, and once read, the job needs to perform some kind of map-reduce operation.
Now, the exact same data comes from the result of an HTTP call. The response fits through the network layer without issue, but the information contained in the payload is big, so I would like to apply the same map-reduce.
How can I read from the response of an HTTP call, please?
Thank you
You have two options for reading data in Spark:
Read directly into the driver and distribute to the executors (not scalable, as everything passes through the driver)
Read directly from the executors
The built-in data sources like CSV, Parquet, etc. all implement reading from the executors, so the job can scale with the data. They define how each partition of the data should be read - e.g. if we have 10 executors, how do you cut the data source into 10 sections so that each executor can read one section directly?
If you want to load from an HTTP request, you will either have to read through the driver and distribute (which may be OK if you know the data is going to be less than ~10 MB; see the sketch below), or you will need to implement a custom data source so that the executors can each read a partition. You can read more here: https://aamargajbhiye.medium.com/speed-up-apache-spark-job-execution-using-a-custom-data-source-fd791a0fa4b0
I will finish by saying that this second option is almost certainly an anti-pattern. You would likely be much better off providing an intermediate staging area (e.g. S3/GCS), calling the server to load the data into the intermediate store, and then reading it into Spark on completion. With scenario 2, you will likely end up putting too much load on the server, amongst other issues.
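For completeness, a minimal sketch of the first option (fetch on the driver, then distribute), assuming the endpoint returns CSV text and that the payload is small enough to hold on the driver:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HttpToSpark {

    public static Dataset<Row> readCsvOverHttp(SparkSession spark, String url) throws Exception {
        // Fetch the whole payload on the driver (fine only for small responses).
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

        // Distribute the lines to the executors and parse them as CSV.
        Dataset<String> lines = spark.createDataset(Arrays.asList(body.split("\n")), Encoders.STRING());
        return spark.read().option("header", "true").csv(lines);
    }
}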
In previous lifetimes, I created a custom datasource. It is not the most trivial thing to do, but this GitHub repo explains it: https://github.com/jgperrin/net.jgp.books.spark.ch09.
When it comes to reading from a network stream, make sure that only one executor does it.
I have a requirement to read a large data set from a Postgres database, which needs to be accessible via a REST API endpoint. The client consuming the data will then need to transform it into CSV format (we might need to support JSON and XML later on).
On the server side we are using Spring Boot v2.1.6.RELEASE and spring-jdbc v5.1.8.RELEASE.
I tried using paging, looping through all the pages, storing the results in a list and returning the list, but this resulted in an OutOfMemoryError because the data set does not fit into memory.
Streaming the large data set looks like a good way to handle memory limits.
Is there any way I can just return a Stream of all the database entities and have the REST API return the same to the client? How would the client deserialize this stream?
Are there any alternatives other than this?
If your data is so huge that it doesn't fit into memory - I'm thinking gigabytes or more - then it's probably too big to reasonably provide as a single HTTP response. You would hold the connection open for a very long time, and if anything goes wrong mid-way through, the client has to start all over from the beginning, potentially losing minutes of progress.
A more user-friendly API would introduce pagination. Your caller could specify a page size and the index of the page to fetch as part of their request.
For example
/my-api/some-collection?size=100&page=50
This would represent fetching 100 items starting from the 5000th, i.e. items 5000-5099 (assuming zero-based page numbering).
Perhaps you could place some reasonable constraints on the page size based on what you are able to load into memory at one time.
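A bare-bones sketch of such a paginated endpoint with spring-jdbc (the table and column names are made up; LIMIT/OFFSET paging also assumes a stable ORDER BY):

import java.util.List;
import java.util.Map;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class SomeCollectionController {

    private final JdbcTemplate jdbcTemplate;

    public SomeCollectionController(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // GET /my-api/some-collection?size=100&page=50
    @GetMapping("/my-api/some-collection")
    public List<Map<String, Object>> getPage(
            @RequestParam(defaultValue = "100") int size,
            @RequestParam(defaultValue = "0") int page) {
        int offset = page * size;
        // Only one page is ever held in memory at a time.
        return jdbcTemplate.queryForList(
                "SELECT * FROM some_table ORDER BY id LIMIT ? OFFSET ?", size, offset);
    }
}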
I want to write a Java program that does a MapReduce job (e.g. word count), where the input comes from Redis. How can I write the Map class so it retrieves records one by one from Redis and processes them, like I did before when reading from HDFS?
There is no out-of-the-box feature that allows us to do that, but you might find something like Jedis helpful. Jedis is a Java client you can use to read/write data to/from Redis. See this for an example.
If you are not strongly coupled to Java, you might also find R3 useful. R3 is a MapReduce engine written in Python using a Redis backend.
HTH
Obviously, you need to customize your InputFormat.
Please read this tutorial to learn how to write your own custom InputFormat and RecordReader.
Put your keys in HDFS. In map(), just query Redis based on the input key (see the sketch below).
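A rough sketch of that idea, assuming the keys sit one per line in an HDFS text file and that a Redis instance is reachable at localhost:6379 (both are assumptions, adjust to your setup):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import redis.clients.jedis.Jedis;

public class RedisLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Jedis jedis;

    @Override
    protected void setup(Context context) {
        jedis = new Jedis("localhost", 6379);        // one connection per mapper
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String redisKey = line.toString().trim();    // each input line is a Redis key
        String value = jedis.get(redisKey);          // fetch the value for this key
        if (value != null) {
            context.write(new Text(redisKey), new Text(value));
        }
    }

    @Override
    protected void cleanup(Context context) {
        jedis.close();
    }
}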
Try Redisson - it's a Redis-based in-memory data grid for Java. It allows you to execute MapReduce over data stored in Redis.
More documentation here.
Hi friends
I am building a web crawler and would like to know a few things:
1) Can I use MapReduce to fetch the data from the net?
2) Can I save the fetched data to HBase?
3) Can I write an app in PHP to fetch the data from HBase? If yes, can you give me a code snippet? How can I add/view/delete data in HBase using PHP?
For your questions, yes, it can all be done. How you approach it depends on what exactly you want to achieve.
1) Your main control would need to partition the task. You would likely maintain some kind of list of addresses to crawl, possibly running sequential MapReduce jobs. Each job would read the list in, split it between the mappers (which do the crawling), and write directly to HBase or another intermediary. The mappers would also output any newly discovered URLs to crawl next, which would be filtered down to unique entries in the reduce phase, with the reducer outputting the list of things to crawl next. You'd also need to maintain a list of recently crawled pages and filter those out too, but that's not specific to MR/HBase.
2) You can use TableOutputFormat to send the output to HBase (a sketch follows after this list). You can also just open an HBase connection with HTable and write directly from your mapper.
3) As TheDeveloper said, yes, with Thrift. His link is good.
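Regarding point 2, here is a rough sketch of the TableOutputFormat route as a map-only job writing Puts to a pre-existing table (the table name, column family and mapper logic are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CrawlerJob {

    // Hypothetical mapper: one URL per input line; the actual fetching is omitted,
    // the URL is simply recorded as a row in HBase.
    public static class CrawlerMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws java.io.IOException, InterruptedException {
            String url = line.toString().trim();
            Put put = new Put(Bytes.toBytes(url));                        // row key = URL
            put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("status"), Bytes.toBytes("fetched"));
            context.write(new ImmutableBytesWritable(Bytes.toBytes(url)), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "crawler");
        job.setJarByClass(CrawlerJob.class);
        job.setMapperClass(CrawlerMapper.class);
        job.setNumReduceTasks(0);                                         // map-only: Puts go straight to HBase
        FileInputFormat.addInputPath(job, new Path(args[0]));             // HDFS file with URLs to crawl
        // Wires up TableOutputFormat for the (pre-existing) output table; null reducer since this is map-only.
        TableMapReduceUtil.initTableReducerJob("crawl_results", null, job);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}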
For question number 3: you can interact with HBase from PHP, but you need to do it via the Thrift interface. See this blog post for more info. Hope this helps.
It can also be done easily via REST using Stargate.