How to parallelize processing from S3 to S3 - java

I have a process that identifies an object on an S3 bucket that must be converted using our (fairly simplistic) custom Java code. The output of this conversion is written to a different prefix on the S3 bucket. So it's a simple, isolated job:
Read the input stream of the S3 object
Convert the object
Write the output as a new S3 object or objects
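In code, each job is essentially the following (a sketch using the AWS SDK for Java v1; the bucket, keys, and convert() are placeholders for our own names and logic):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.S3Object;
import java.io.ByteArrayInputStream;
import java.io.InputStream;

AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

// 1. Read the input stream of the S3 object
S3Object source = s3.getObject("my-bucket", "input/data-0001.txt");
byte[] converted;
try (InputStream in = source.getObjectContent()) {
    // 2. Convert the object (our custom code, placeholder here)
    converted = convert(in);
}

// 3. Write the output as a new S3 object under a different prefix
ObjectMetadata meta = new ObjectMetadata();
meta.setContentLength(converted.length);
s3.putObject("my-bucket", "output/data-0001.txt",
        new ByteArrayInputStream(converted), meta);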
This process is probably only a few thousand lines of data per S3 object, but hundreds (maybe thousands) of objects. What is a good approach to running this process on several machines? It appears that I could use Kinesis, EMR, SWF, or something I cook up myself. Each approach has quite a learning curve. Where should I start?

Given that it is a batch process and the volume will grow (for 'only' 100GB it may be overkill), Amazon Elastic MapReduce (EMR) seems like a nice tool for the job. Using EMR, you can process the data with Hadoop MapReduce jobs, Hive queries, or Pig scripts (among others), reading the data directly from S3. Also, you can use S3DistCp to transfer and compress data in parallel to and from the cluster, if necessary.
There is a free online introductory course to EMR and Hadoop at http://aws.amazon.com/training/course-descriptions/bigdata-fundamentals/
Also, you can take a free lab at https://run.qwiklabs.com/focuses/preview/1055?locale=en

You can try Amazon SQS to queue each job and then process them in parallel on different machines (it has a much easier learning curve than Amazon EMR / SWF).
Keep in mind though that with SQS you might receive the same message twice and thus process the same file twice if your code doesn't account for this (as opposed to SWF which guarantees that an activity is only performed once).
Also, if your processing code is not utilizing all the resources of the machine it's running on, you can download & process multiple files in parallel on the same machine as S3 will probably handle the load just fine (with multiple concurrent requests).
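A minimal worker loop for this might look like the following (a sketch with the AWS SDK for Java v1; the queue URL is a placeholder, and the message body is assumed to carry the S3 key of the file to process):

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/conversion-jobs"; // placeholder

while (true) {
    for (Message msg : sqs.receiveMessage(queueUrl).getMessages()) {
        String s3Key = msg.getBody();          // the S3 key enqueued by the producer
        processObject(s3Key);                  // your download / convert / upload code (make it idempotent)
        sqs.deleteMessage(queueUrl, msg.getReceiptHandle()); // delete only after successful processing
    }
}

Deleting the message only after processing succeeds, and keeping processObject() idempotent, covers the at-least-once delivery caveat mentioned above.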

Related

What is the right way to create/write a large file in Java that is generated by a user?

I have looked at examples describing best practices for file write/create operations, but I have not seen one that takes my requirements into consideration. I have to create a class which reads the contents of one file, does some data transformation, writes the transformed contents to a different file, and then sends that file to a web service. Both files can ultimately be quite large (up to 20 MB), and it is unpredictable when they will be created because they are generated by users. There could be two minutes between occurrences of this process, or several could happen in the same second. The load is not extreme in the sense of hundreds of these operations per second, but there could be several.
My instinct says to solve it by:
Creating a separate thread when the process begins.
Read the first file.
Do the data transformation.
Write the contents to the new file.
Send the file to the service.
Delete the created file.
Am I missing something? Is there a best practice to tackle this kind of issue?
The first question you should ask is whether you need to write the file to disk in the first place. Even if you are supposed to send a file to a consumer at the end of your processing phase, you could keep the file contents in memory and send that. The consumer doesn't care whether the file is stored on disk or not, since it only receives an array of bytes with the file contents.
The only scenario in which it would make sense to store the file on disk would be if your processes communicated via disk files (i.e. your producer writes a file to disk, sends a notification to your consumer, and the consumer then reads the file from disk, for example based on a file name it receives in the notification).
Regarding I/O best practices, make sure you use buffers to read (and potentially write) files. This could greatly reduce the memory overhead (since you would end up keeping only a chunk instead of the whole 20 MB file in memory at a given moment).
Regarding adding multiple threads, you should test whether that improves your application's performance or not. If your application is already I/O intensive, adding more threads will add even more contention on your I/O streams, which would result in performance degradation.
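For example, a buffered, line-by-line transform keeps only a small chunk in memory at a time (a sketch; transform() stands in for your data transformation, and the file names are placeholders):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;

try (BufferedReader reader = new BufferedReader(new FileReader("input.txt"));
     BufferedWriter writer = new BufferedWriter(new FileWriter("output.txt"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        writer.write(transform(line)); // placeholder for your data transformation
        writer.newLine();
    }
}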
Without the full details of the situation, a problem like this may be better solved with existing software such as Apache NiFi:
An easy to use, powerful, and reliable system to process and distribute data.
It's very good at picking up files, transforming them, and putting them somewhere else (and sending emails, and generating analytics, and...). NiFi is a very powerful tool, but it may be overkill if your needs are just a couple of files, given the additional set-up.
Given the description you have given, I think you should perform the operations for each file on one thread; i.e. one thread will download the file, process it, and then upload the results.
If you need parallelism, then implement the download / process / upload as a Runnable and submit the tasks to an ExecutorService with a bounded thread pool. And tune the size of the thread pool. (That's easy if you expose the thread pool size as a config property.)
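A sketch of that structure (the pool size default and the download / process / upload helpers are placeholders for your own code):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

int poolSize = Integer.getInteger("worker.poolSize", 4); // thread pool size exposed as a config property
ExecutorService pool = Executors.newFixedThreadPool(poolSize);

for (String key : keysToProcess) {          // keysToProcess: whatever lists your input files
    pool.submit(() -> {
        byte[] data = download(key);        // placeholder: fetch the file
        byte[] result = process(data);      // placeholder: your transformation
        upload(key, result);                // placeholder: send to the service / destination
    });
}
pool.shutdown();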
Why this way?
It is simple. Minimal synchronization is required.
One of the three subtasks is likely to be your performance bottleneck. So by combining all three into a single task, you avoid the situation where the non-bottleneck tasks get too far ahead. And if you get too far ahead on some of the subtasks you risk running out of (local) disk space.
I'm going to contradict what Alex Rolea said about buffering. Yes, it may help. But on a modern (e.g. Linux) operating system on a typical modern machine, memory-to-disk I/O is unlikely to be the main bottleneck. It is more likely that the bottleneck will be network I/O or server-side I/O performance (especially if the server is serving other clients at the same time).
So, I would not prematurely tune the buffering. Get the system working, benchmark it, profile / analyze it, and based on those results figure out where the real bottlenecks are and how best to address them.
Part of the solution may be to not use disk at all. (I know you think you need to, but unless your server and its protocols are really strange, you should be able to stream the data to the server out of memory on the client side.)

What Java library can buffer incoming data in memory and later write to a destination?

I'm looking for a Java Library to help buffer objects in memory before writing them out to a destination.
Example: pull records from a messaging queue and keep them in memory until total records size is equal to 100MB or 15 minutes have passed. Once the threshold is met, perform an operation on these records such as writing them to file OR making API calls to upload data to an AWS S3 bucket.
Such functionality is nearly identical to AWS Firehose (assuming you want to write to an S3 bucket). Unfortunately I'm not able to use AWS and am looking for alternatives. It doesn't sound too difficult to implement a custom solution, but I'm trying to avoid reinventing the wheel if possible.
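Conceptually, the custom solution would be something like this (a hand-rolled sketch; flush() is a placeholder for writing to a file or uploading to S3):

import java.util.ArrayList;
import java.util.List;

class RecordBuffer {
    private static final long MAX_BYTES = 100L * 1024 * 1024; // 100 MB threshold
    private static final long MAX_AGE_MS = 15 * 60 * 1000;    // 15 minute threshold

    private final List<byte[]> records = new ArrayList<>();
    private long bufferedBytes = 0;
    private long firstRecordTime = System.currentTimeMillis();

    synchronized void add(byte[] record) {
        if (records.isEmpty()) {
            firstRecordTime = System.currentTimeMillis();
        }
        records.add(record);
        bufferedBytes += record.length;
        if (bufferedBytes >= MAX_BYTES
                || System.currentTimeMillis() - firstRecordTime >= MAX_AGE_MS) {
            flush();
        }
    }

    private void flush() {
        // placeholder: write 'records' to a file or upload to S3, then reset
        records.clear();
        bufferedBytes = 0;
    }
}

Note that in this sketch the 15-minute condition is only checked when a new record arrives, so a separate scheduled flush would be needed to cover idle periods — exactly the kind of detail an existing library would handle for me.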

Apache Beam / Google Dataflow Final step to run only once

I have a pipeline where I download thousands of files, then transform them and store them as CSV on google cloud storage, before running a load job on bigquery.
This works fine, but as I run thousands of load jobs (one per downloaded file), I reached the quota for imports.
I've changed my code so it lists all the files in a bucket and runs one job with all the files as parameters of the job.
So basically I need the final step to be run only once, when all the data has been processed. I guess I could use a groupBy transform to make sure all the data has been processed, but I'm wondering whether there is a better / more standard approach to it.
If I understood your question correctly, we might have had similar problem in one of our dataflows - we were hitting 'Load jobs per table per day' BigQuery limit due to the fact that the dataflow execution was triggered for each file in GCS separately and we had 1000+ files in the bucket.
In the end, the solution to our problem was quite simple - we modified our TextIO.read transform to use wildcards instead of individual file names
i.e TextIO.read().from("gs://<BUCKET_NAME>/<FOLDER_NAME>/**")
In this way only one dataflow job was executed and as a consequence all the data written to BigQuery was considered as a single load job, despite the fact that there were multiple sources.
Not sure if you can apply the same approach, though.
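For reference, the read ends up looking roughly like this in the pipeline (a sketch with the Beam 2.x Java SDK; only the read and run are shown, the CSV transform and the single BigQuery write are elided):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
PCollection<String> lines =
        p.apply(TextIO.read().from("gs://<BUCKET_NAME>/<FOLDER_NAME>/**")); // wildcard read
// ... your transforms and a single BigQueryIO write go here ...
p.run().waitUntilFinish();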

How does Hadoop run in "real-time" against non-stale data?

My abysmally-rudimentary understanding of Hadoop and its "data ingest" tools (such as Flume or Sqoop) is that Hadoop must always run its MR jobs against data that is stored in structured files on its HDFS. And, that these tools (again, Flume, Sqoop, etc.) are responsible for essentially importing data from disparate systems (RDBMS, NoSQL, etc.) into HDFS.
To me, this means that Hadoop will always be running on "stale" (for lack of a better word) data that is minutes/hours/etc. old. Because, to import big data from these disparate systems onto HDFS takes time. By the time MR can even run, the data is stale and may no longer be relevant.
Say we have an app that has real-time constraints of making a decision within 500ms of something occurring. Say we have a massive stream of data that is being imported into HDFS, and because the data is so big it takes, say, 3 seconds to even get the data on to HDFS. Then say that the MR job that is responsible for making the decision takes 200ms. Because the loading of the data takes so long, we've already blown our time constraint, even though the MR job processing the data would be able to finish inside the given window.
Is there a solution for this kind of big data problem?
You can use tools such as the Apache Spark Streaming API or Apache Storm for real-time stream processing.
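For example, with Spark Streaming you can set the micro-batch interval to a few hundred milliseconds and process data as it arrives, without first landing it on HDFS (a sketch; the input source and the decision logic are placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf conf = new SparkConf().setAppName("realtime-decisions");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.milliseconds(500));

// placeholder source; in practice this could be Kafka, Flume, etc.
JavaReceiverInputDStream<String> events = jssc.socketTextStream("localhost", 9999);
// ... apply transformations to 'events' and make the decision here ...

jssc.start();
jssc.awaitTermination();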

Hadoop HDFS java client usage

I have a java application which needs to read and write files to HDFS. I do use
FileSystem fs = FileSystem.get(configuration);
And it works well.
Now the question is : should I keep this reference and use it as a singleton or should I use it only once and get a new one each time?
If it matters, I need to say that the application targets a quite high traffic.
Thanks
I think the answer depends on the relation between two numbers: the network bandwidth (between the HDFS client and the HDFS cluster) and the amount of data per second you can feed to the HDFS client. If the first is higher, then having a few connections open at the same time makes sense.
Usually 2-3 concurrent connections are optimal.
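If you do decide to reuse a single instance, a simple holder is enough (a sketch; note also that FileSystem.get() caches instances per scheme/authority by default, so repeated calls typically return the same object anyway):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsClientHolder {
    private static final FileSystem FS;
    static {
        try {
            FS = FileSystem.get(new Configuration()); // one shared client for the application
        } catch (IOException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
    public static FileSystem get() {
        return FS;
    }
}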
