Spark: sparkSession read from the result of an http response - java

A small question regarding Spark and how to read from the result of an HTTP response, please.
It is well known that Spark can take a database, a CSV file, etc. as a data source:
sparkSession.read().format("csv").load("path/to/people.csv");
sparkSession.read().format("org.apache.spark.sql.cassandra").options(properties).load();
May I ask how to read from the result of an HTTP call directly, without having to dump the data into an intermediate CSV file or an intermediate database table first?
For instance, the CSV file and the database would contain millions of rows, and once they are read, the job needs to perform some kind of map-reduce operation.
Now, the exact same data comes from the result of an HTTP call. It is small enough for the network layer, but the information contained inside the payload is big, so I would like to apply the same map-reduce operation.
How to read from the response of an http call please?
Thank you

You have two options for reading data in Spark:
1. Read directly to the driver and distribute to the executors (not scalable, as everything passes through the driver).
2. Read directly from the executors.
The built-in data sources like CSV, Parquet, etc. all implement reading from the executors, so the job can scale with the data. They define how each partition of the data should be read, e.g. if we have 10 executors, how to cut up the data source into 10 sections so that each executor can directly read one section.
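To make the "cut the source into sections" idea concrete, here is a minimal plain-Java sketch, not Spark's actual partitioning code. The class name `PartitionPlanner` and the assumption that the source can be addressed by row offsets are mine, for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: cut a source of totalRows rows into contiguous, near-equal sections. */
final class PartitionPlanner {

    /**
     * Returns one {start, end} pair per partition, where each pair is the
     * half-open row range [start, end) that a single executor would read.
     */
    static List<long[]> plan(long totalRows, int partitions) {
        if (partitions <= 0) {
            throw new IllegalArgumentException("partitions must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        long base = totalRows / partitions;      // minimum rows per partition
        long remainder = totalRows % partitions; // first 'remainder' partitions get one extra row
        long start = 0;
        for (int i = 0; i < partitions; i++) {
            long size = base + (i < remainder ? 1 : 0);
            ranges.add(new long[] {start, start + size});
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] range : plan(25, 10)) {
            System.out.println("[" + range[0] + ", " + range[1] + ")");
        }
    }
}
```

A plain HTTP response has no such addressable sections unless the server supports something like paging or range requests, which is why it does not split across executors for free.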
If you want to load from an HTTP request, you will either have to read through the driver and distribute, which may be OK if you know the data is going to be less than ~10 MB, or implement a custom data source that allows each executor to read its own partition. You can read more here: https://aamargajbhiye.medium.com/speed-up-apache-spark-job-execution-using-a-custom-data-source-fd791a0fa4b0
I will finish by saying that this second option is almost definitely an anti-pattern. You will likely be much better off providing an intermediate staging environment (e.g. S3/GCS), calling the server to load the data into the intermediate store, and then reading it into Spark on completion. In scenario 2, you will likely end up putting too much load on the server, amongst other issues.
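For the first option (reading through the driver), a minimal sketch of the driver-side part could look like this, assuming Java 11+ and a payload with one record per line (CSV or JSON lines). The class and method names are made up for illustration, and the Spark hand-off is only described afterwards, not exercised.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.stream.Collectors;

/** Sketch of option 1: pull the whole HTTP payload onto the driver as lines. */
final class DriverSideHttpRead {

    /** Fetches the body of a GET request; the URL is whatever endpoint you call. */
    static String fetchBody(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    /** Splits a line-oriented payload into its non-blank records. */
    static List<String> toRecords(String body) {
        return body.lines()
                .filter(line -> !line.isBlank())
                .collect(Collectors.toList());
    }
}
```

On the Spark side, the resulting list could then be turned into a `Dataset<String>` with `sparkSession.createDataset(records, Encoders.STRING())` and parsed with `sparkSession.read().csv(...)` or `.json(...)` on that dataset. Everything still passes through the driver's memory, so this only suits small payloads.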

In previous lifetimes, I created a custom datasource. It is not the most trivial thing to do, but this GitHub repo explains it: https://github.com/jgperrin/net.jgp.books.spark.ch09.
When it comes to reading from a network stream, make sure that only one executor does it.

Related

Java - Batch Processing

I'm trying to generate a CSV file based on a list of objects returned by a web service method.
The problem is that I want to retrieve all of the objects available, but the call will 'fail' if I try to get more than 100 entries (the method has 2 parameters which give me the possibility to specify the interval of objects I want to retrieve, ex: from 10 to 50, from 45 to 120, etc.).
I thought of making sequential calls while incrementing the two indexes which represent the interval, but someone suggested that I should use batch processing for this. As far as I searched the internet I only found examples on how to export database data or xml files into csv, using Spring Batch.
Could someone explain to me how I should handle this situation? Or at least point me to an example/tutorial similar to what I need? Thank you very much!!
If you try to load all the data from the web service in a single request, you risk a memory or timeout exception because the response is too large. You should instead make several calls to your web service, something like a paginated request, and after each response insert the returned data into your local database.
When all the calls are done, run a process that builds your CSV file from it.
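A minimal sketch of that paginated loop, under some assumptions: `EntryService` is a hypothetical stand-in for the web service method with its two index parameters, a short or empty page is taken to mean the end of the data, and the rows are collected in memory instead of a local database to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: walk a web service 100 entries at a time and build CSV lines. */
final class PagedExport {

    /** Hypothetical stand-in for the web service method with two index parameters. */
    interface EntryService {
        /** Returns the entries in [from, to); fewer than requested means no more data. */
        List<String[]> fetch(int from, int to);
    }

    /** The service in the question fails above 100 entries per call. */
    static final int PAGE_SIZE = 100;

    static List<String> exportCsv(EntryService service) {
        List<String> csvLines = new ArrayList<>();
        int from = 0;
        while (true) {
            List<String[]> page = service.fetch(from, from + PAGE_SIZE);
            for (String[] entry : page) {
                // No quoting or escaping here; real CSV output needs both.
                csvLines.add(String.join(",", entry));
            }
            if (page.size() < PAGE_SIZE) {
                break; // short or empty page: the end was reached
            }
            from += PAGE_SIZE;
        }
        return csvLines;
    }
}
```

With very large result sets, writing each page to the database or straight to the file as it arrives keeps memory use flat, which is the point of the staging step described above.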
regards.

Hash based partitioning

I want to use Spring Batch to process CSV files. Each CSV file contains one record per line. For a given file, some records may be interrelated, i.e. processing of such records MUST follow the order in which they appear in the file. Using the regular sequential approach (i.e. a single thread for the entire file) gives me poor performance, so I want to use the partitioning feature. Due to my processing requirement, interrelated records MUST be in the same partition (and in the order in which they appear in the file). I thought about using a hash-based partitioning algorithm with a carefully chosen hash function (so that nearly equally sized partitions are created).
Any idea if this is possible with Spring Batch?
How should the Partitioner be implemented for such case? According to one of the Spring Batch author/developer, the master does not send the actual data, only the information required for the slave to obtain the data it is supposed to process. In my case, I guess this information would be the hash value. Therefore, does the FlatFileItemReader of each slave need to read the entire file line by line skipping the lines with a different hash?
Thanks,
Mickael
What you're describing is something normally seen in batch processing. You have a couple options here:
Split the file by sequence and partition based on the created files - In this case, you'd iterate through the file once to divide it up into each of the list of records that needs to be processed in sequence. From there, you can use the MultiResourcePartitioner to process each file in parallel.
Load the file into a staging table - This is the easier method IMHO. Load the file into a staging table. From there, you can partition the processing based on any number of factors.
In either case, the result allows you to scale out the process as wide as you need to go to obtain the performance you need to achieve.
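The single pass in the first option could be sketched roughly as follows. This is plain Java rather than Spring Batch code: `HashSplitter` and the `keyOf` function (which extracts whatever value ties related records together) are assumptions for illustration, and each resulting bucket would be written to its own file before being handed to the partitioner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Sketch: one pass over the file that groups related records by key hash. */
final class HashSplitter {

    /**
     * Assigns each line to a bucket by the hash of its key. Lines with the same
     * key always land in the same bucket, in their original file order.
     */
    static List<List<String>> split(List<String> lines,
                                    int partitions,
                                    Function<String, String> keyOf) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String line : lines) {
            // floorMod keeps the index non-negative even for negative hash codes.
            int bucket = Math.floorMod(keyOf.apply(line).hashCode(), partitions);
            buckets.get(bucket).add(line);
        }
        return buckets;
    }
}
```

How evenly the buckets fill depends on the key distribution, so a few very frequent keys can still produce lopsided partitions.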
FlatFileItemReader is not thread safe, so you cannot simply use it in parallel processing.
There is more info in the docs:
Spring Batch provides some implementations of ItemWriter and ItemReader. Usually they say in the Javadocs if they are thread safe or not, or what you have to do to avoid problems in a concurrent environment. If there is no information in Javadocs, you can check the implementation to see if there is any state. If a reader is not thread safe, it may still be efficient to use it in your own synchronizing delegator. You can synchronize the call to read() and as long as the processing and writing is the most expensive part of the chunk your step may still complete much faster than in a single threaded configuration.
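The "synchronizing delegator" mentioned in that passage could be sketched like this. `SimpleReader` is a hypothetical stand-in with the same "return null when exhausted" contract, not Spring Batch's own `ItemReader` interface, so treat it as an illustration of the idea only.

```java
import java.util.Iterator;

/** Sketch of a synchronizing delegator around a reader that is not thread safe. */
final class SynchronizedReaderSketch {

    /** Hypothetical reader contract: returns null when there is nothing left. */
    interface SimpleReader<T> {
        T read() throws Exception;
    }

    /** Serializes calls to read() so only one thread touches the delegate at a time. */
    static final class SynchronizedReader<T> implements SimpleReader<T> {
        private final SimpleReader<T> delegate;

        SynchronizedReader(SimpleReader<T> delegate) {
            this.delegate = delegate;
        }

        @Override
        public synchronized T read() throws Exception {
            return delegate.read();
        }
    }

    /** A deliberately stateful, non-thread-safe reader over an iterator. */
    static <T> SimpleReader<T> over(Iterator<T> items) {
        return () -> items.hasNext() ? items.next() : null;
    }
}
```

Note that with several threads pulling from one reader, items are no longer processed in file order, so this alone does not satisfy the ordering requirement from the question.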
I think your question is somewhat of a duplicate of this one: multithreaded item reader

Are batchlets the correct way of implementing ETL steps in JavaEE Batch?

I am studying the Java EE Batch API (JSR-352) in order to test the feasibility of replacing our current ETL tool with our own solution using this technology.
My goal is to build a job in which I:
get some (dummy) data from a datasource in step1,
some other data from other data-source in step2 and
merge them in step3.
I would like to process each item and not write to a file, but send it to the next step. And also store the information for further use. I could do that using batchlets and jobContext.setTransientUserData().
I think I am not getting the concepts right: as far as I understood, JSR-352 is meant for this kind of ETL tasks, but it has 2 types of steps: chunk and batchlets. Chunks are "3-phase-steps", in which one reads, processes and writes the data. Batchlets are tasks that are not performed on each item on the data, but once (as calculating totals, sending email and others).
My problem is that my solution is not correct if I consider the definition of batchlets.
How could one implement this kind of job using the Java EE Batch API?
I think you had better use chunks rather than batchlets to implement ETL. Typical chunk processing with a data source looks something like the following:
ItemReader#open(): open a cursor (create Connection, Statement and ResultSet) and save them as instance variables of the ItemReader.
ItemReader#readItem(): create and return an object that contains the data of one row, using the ResultSet.
ItemReader#close(): close the JDBC resources.
ItemProcessor#processItem(): do the calculation, then create and return an object that contains the result.
ItemWriter#writeItems(): save the calculated data to the database: open a Connection and Statement, invoke executeUpdate() and close them.
As for your situation, I think you have to choose the data source that can be considered the primary one and open a cursor for it in ItemReader#open(), then fetch the other one in ItemProcessor#processItem() for each item.
I also recommend reading these useful examples of chunk processing:
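To show how the pieces fit together, here is a plain-Java imitation of such a chunk step. It borrows the method names readItem, processItem and writeItems but does not use the real javax.batch interfaces; the in-memory list and map stand in for the two data sources, and all names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Plain-Java imitation of a chunk step; these are not the javax.batch interfaces. */
final class ChunkMergeSketch {

    /** Secondary source, looked up per item by the processor. */
    private final Map<Integer, String> secondary;
    /** Cursor over the primary source, as ItemReader#open() would create it. */
    private final Iterator<Integer> primaryCursor;
    /** Stands in for the target table the writer saves to. */
    final List<String> written = new ArrayList<>();
    int chunksCommitted = 0;

    ChunkMergeSketch(List<Integer> primaryIds, Map<Integer, String> secondary) {
        this.primaryCursor = primaryIds.iterator();
        this.secondary = secondary;
    }

    /** Like readItem(): the next primary row, or null when the cursor is exhausted. */
    Integer readItem() {
        return primaryCursor.hasNext() ? primaryCursor.next() : null;
    }

    /** Like processItem(): fetch the matching secondary row and merge the two. */
    String processItem(Integer id) {
        return id + ":" + secondary.getOrDefault(id, "<missing>");
    }

    /** Like writeItems(): persist one whole chunk at once. */
    void writeItems(List<String> chunk) {
        written.addAll(chunk);
        chunksCommitted++;
    }

    /** Drives read -> process -> write in chunks of itemCount items. */
    void run(int itemCount) {
        List<String> chunk = new ArrayList<>();
        Integer item;
        while ((item = readItem()) != null) {
            chunk.add(processItem(item));
            if (chunk.size() == itemCount) {
                writeItems(chunk);
                chunk = new ArrayList<>();
            }
        }
        if (!chunk.isEmpty()) {
            writeItems(chunk); // final partial chunk
        }
    }
}
```

In a real JSR-352 job the container drives this loop and wraps each chunk in a transaction; the sketch only mirrors the order of the calls.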
http://www.radcortez.com/java-ee-7-batch-processing-and-world-of-warcraft-part-1/
http://www.radcortez.com/java-ee-7-batch-processing-and-world-of-warcraft-part-2/
My blog entries about JBatch and chunk processing:
http://www.nailedtothex.org/roller/kyle/category/JBatch

is spring-batch for me, even though I don't have a usage for itemReader and itemWriter?

spring-batch newbie: I have a series of batches that
read all new records (since the last execution) from some sql tables
upload all the new records to hadoop
run a series of map-reduce (pig) jobs on all the data (old and new)
download all the output to local and run some other local processing on all the output
point is, I don't have any obvious "item" - I don't want to relate to the specific lines of text in my data, I work with all of it as one big chunk and don't want any commit intervals and such...
however, I do want to keep all these steps loosely coupled - as in, step a+b+c might succeed for several days and accumulate processed stuff while step d keeps failing, and then when it finally succeeds it will read and process all of the output of its previous steps.
SO: is my "item" a fictive "working-item" which will signify the entire new data? do I maintain a series of queues myself and pass this fictive working-items between them?
thanks!
people always assume that the only use of spring batch is really only for the chunk processing. that is a huge feature, but what's overlooked is the visibility of the processing and job control.
give 5 people the same task with no spring batch and they're going to implement flow control and visibility their own way. give 5 people the same task and spring batch and you may end up with custom tasklets all done differently, but getting access to the job metadata and starting and stopping jobs is going to be consistent. from my perspective it's a great tool for job management. if you already have your jobs written, you can implement them as custom tasklets if you don't want to rewrite them to conform to the 'item' paradigm. you'll still see benefits.
I don't see the problem. Your scenario seems like a classic application of Spring Batch to me.
"read all new records (since the last execution) from some sql tables"
Here, an item is a record.
"upload all the new records to hadoop"
Same here.
"run a series of map-reduce (pig) jobs on all the data (old and new)"
Sounds like a StepListener or ChunkListener.
"download all the output to local and run some other local processing on all the output"
That's the next step.
The only problem I see is if you don't have Domain Objects for your records. But even then, you can work with maps or arrays, while still using ItemReaders and ItemWriters.

Adding/Viewing/Deleting Data from HBase using PHP and Mapreduce in Java?

Hi friends,
I am building a web crawler and would like to know a few things about it:
1) Can I use MapReduce to fetch the data from the net?
2) Can I save the fetched data to HBase?
3) Can I write an app in PHP to fetch the data from HBase? If yes, can you give me a code snippet? How can I add/view/delete data in HBase using PHP?
For your questions, yes, it can all be done. How you approach it depends on what exactly you want to achieve.
1) Your main control would need to partition the task. You would likely maintain some kind of list of addresses to crawl, possibly running sequential MapReduce tasks that each read the list in, split it between mappers that do the crawling, and write directly to HBase or another intermediary. The mappers would also probably output newly discovered URLs to crawl next, which in turn would be filtered down to unique values in the reduce phase, with the reduce outputting the list of things to crawl next. You'd need to maintain a list of recently crawled URLs and filter those out too, but that's not specific to MR/HBase.
2) You can use table output format to send the outputs to hbase. You can also just make HBase connections with HTable and write directly in your mapper.
3) As TheDeveloper said, yes, with thrift. His link is good.
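The crawl round described in point 1 can be sketched in plain Java, without Hadoop or HBase, to show the data flow. The class name and the `fetchLinks` function (which would download a page and extract its links) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Sketch of one crawl round: "map" fetches pages, "reduce" builds the next list. */
final class CrawlRoundSketch {

    /**
     * Crawls every URL in toCrawl, records it in alreadyCrawled (which is
     * modified in place), and returns the unique, not yet crawled URLs found.
     */
    static Set<String> nextFrontier(List<String> toCrawl,
                                    Set<String> alreadyCrawled,
                                    Function<String, List<String>> fetchLinks) {
        // Map phase: in real MapReduce this list would be split between mappers.
        List<String> discovered = new ArrayList<>();
        for (String url : toCrawl) {
            discovered.addAll(fetchLinks.apply(url));
            alreadyCrawled.add(url);
        }
        // Reduce phase: de-duplicate and drop anything that was already crawled.
        Set<String> next = new LinkedHashSet<>(discovered);
        next.removeAll(alreadyCrawled);
        return next;
    }
}
```

The main control would call this repeatedly, feeding each returned set in as the next round's list until it comes back empty or a depth limit is reached.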
For question 3, you can interact with HBase from PHP, but you need to do it via the Thrift interface. See this blog post for more info. Hope this helps.
Can be done easily via REST using Stargate.
