I'm doing some POC work with Redshift, loading data via S3 json files, using the copy command from a Java program. This POC is testing an initial data migration that we'd do to seed Redshift, not daily use. My data is split into about 7500 subfolders in S3, and I'd like to be able to insert the subfolders in parallel. Each subfolder contains about 250 json files, with about 3000 rows each to insert.
The single threaded version of my class loads files from one of my s3 subfolders in about 20 seconds (via a copy command). However, when I introduce a second thread (each thread gets a redshift db connection from a BoneCP connection pool), each copy command, except for the 1st one, takes about 40 seconds. When I run a query in Redshift to show all running queries, Redshift says that it's running two queries at the same time (as expected). However, it's as if the 2nd query is really waiting for the 1st to complete before it starts work. I expected that each copy command would still take only 20 seconds each. The Redshift console shows that I only get up to 60% CPU usage running single or double threaded.
Could this be because I only have 1 node in my Redshift cluster? Or is Redshift unable to open multiple connections to S3 to get the data? I'd appreciate any tips for how to get some performance gains by running multi-threaded copy commands.
Amazon Redshift loads data from Amazon S3 in parallel, utilising all nodes. From your test results, it would appear that running multiple COPY commands does not improve performance, since all nodes are already involved in the copy process.
For each table, always load as many files as possible in a single COPY command, rather than appending later. If you are loading multiple tables, it is likely best to do them sequentially (but your testing might find that loading multiple smaller tables can be done in parallel).
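For illustration, here is a minimal Java/JDBC sketch of that pattern, issuing one COPY over a common S3 prefix instead of one COPY per subfolder; the cluster endpoint, credentials, table name, bucket prefix and IAM role below are placeholders, not your actual setup:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class SingleCopyExample {
        public static void main(String[] args) throws Exception {
            // Placeholder endpoint and credentials -- substitute your own.
            String url = "jdbc:redshift://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/mydb";
            try (Connection conn = DriverManager.getConnection(url, "dbuser", "dbpassword");
                 Statement stmt = conn.createStatement()) {
                // One COPY over the parent prefix lets Redshift parallelize across
                // all files and all slices itself, instead of one COPY per subfolder.
                stmt.execute(
                    "COPY my_table "
                    + "FROM 's3://my-bucket/data/' "  // parent prefix covering all subfolders
                    + "CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopyRole' "
                    + "FORMAT AS JSON 'auto'");
            }
        }
    }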
Some references:
Use a Single COPY Command to Load from Multiple Files
Using the COPY Command to Load from Amazon S3
Quora: What are the ways to improve copy performance on Redshift?
Related
I have a pipeline where I download thousands of files, then transform them and store them as CSV on Google Cloud Storage, before running a load job on BigQuery.
This works fine, but as I run thousands of load jobs (one per downloaded file), I reached the quota for imports.
I've changed my code so it lists all the files in a bucket and runs one job with all the files as parameters of the job.
So basically I need the final step to be run only once, when all the data has been processed. I guess I could use a groupBy transform to make sure all the data has been processed, but I'm wondering whether there is a better / more standard approach to it.
If I understood your question correctly, we might have had a similar problem in one of our dataflows - we were hitting the 'Load jobs per table per day' BigQuery limit because the dataflow execution was triggered for each file in GCS separately, and we had 1000+ files in the bucket.
In the end, the solution to our problem was quite simple - we modified our TextIO.read transform to use wildcards instead of individual file names
i.e. TextIO.read().from("gs://<BUCKET_NAME>/<FOLDER_NAME>/**")
In this way only one dataflow job was executed and as a consequence all the data written to BigQuery was considered as a single load job, despite the fact that there were multiple sources.
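For reference, a minimal Beam-style sketch (Java) of that wildcard read; the bucket and folder names are placeholders and the downstream transforms are elided:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class WildcardLoad {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

            // One read over the whole folder instead of one pipeline per file, so
            // everything downstream runs as a single Dataflow job (and ends up as
            // a single BigQuery load job when written with batch loads).
            p.apply("ReadAllFiles",
                    TextIO.read().from("gs://<BUCKET_NAME>/<FOLDER_NAME>/**"));
            // ... transforms and a BigQuery write would follow here ...

            p.run();
        }
    }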
Not sure if you can apply the same approach, though.
I am new to Apache Spark.
I have a requirement to read millions (~5 million) of records from an Oracle database, do some processing on these records, and write the processed records to a file.
At present, this is done in Java, and in this process:
- the records in the DB are categorized into different subsets based on some data criteria
- in the Java process, 4 threads run in parallel
- each thread reads a subset of records, processes them, and writes the processed records to a new file
- finally, it merges all these files into a single file.
Still, it takes around half an hour to complete the whole process.
So I would like to know if Apache Spark could make this process faster: read millions of records from an Oracle DB, process them, and write to a file?
If Spark can make this process faster, what is the best approach to implement this in my process? Also, will it be effective in a non-clustered environment too?
Appreciate the help.
Yeah, you can do that using Spark; it's built for distributed processing! http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
You should use a well-configured Spark cluster to achieve this. Performance is something you need to fine-tune by adding more worker nodes as required.
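To make that concrete, here is a rough Java sketch of the partitioned JDBC read linked above; the connection details, table, partition column and bounds are placeholders, and the per-record processing is elided:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class OracleToFile {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("oracle-export")
                    .getOrCreate();

            // Partitioned JDBC read: Spark opens numPartitions connections in
            // parallel, each reading a slice of RECORD_ID between the bounds.
            Dataset<Row> records = spark.read()
                    .format("jdbc")
                    .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")
                    .option("dbtable", "MY_SCHEMA.MY_RECORDS")
                    .option("user", "dbuser")
                    .option("password", "dbpassword")
                    .option("partitionColumn", "RECORD_ID")
                    .option("lowerBound", "1")
                    .option("upperBound", "5000000")
                    .option("numPartitions", "8")
                    .load();

            // ... per-record processing would go here ...

            // coalesce(1) produces a single output file, at the cost of a
            // single-threaded final write.
            records.coalesce(1).write().csv("/tmp/processed-records");
        }
    }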
I have a process that identifies an object on an S3 bucket that must be converted using our (fairly simplistic) custom Java code. The output of this conversion is written to a different prefix on the S3 bucket. So it's a simple, isolated job:
Read the input stream of the S3 object
Convert the object
Write the output as a new S3 object or objects
This process is probably only a few thousand lines of data on the S3 object, but hundreds (maybe thousands) of objects. What is a good approach to running this process on several machines? It appears that I could use Kinesis, EMR, SWF, or something I cook up myself. Each approach has quite a learning curve. Where should I start?
Given that it is a batch process and the volume will grow (for 'only' 100 GB it can be overkill), Amazon Elastic MapReduce (EMR) seems like a nice tool for the job. Using EMR, you can process the data in your Hadoop MapReduce jobs, Hive queries or Pig scripts (and others), reading the data directly from S3. Also, you can use S3DistCp to transfer and compress data in parallel to and from the cluster, if necessary.
There is a free online introductory course to EMR and Hadoop at http://aws.amazon.com/training/course-descriptions/bigdata-fundamentals/
Also, you can take a free lab at https://run.qwiklabs.com/focuses/preview/1055?locale=en
You can try Amazon SQS to queue each job and then process them in parallel on different machines (it has a much easier learning curve than Amazon EMR / SWF).
Keep in mind though that with SQS you might receive the same message twice and thus process the same file twice if your code doesn't account for this (as opposed to SWF which guarantees that an activity is only performed once).
Also, if your processing code is not utilizing all the resources of the machine it's running on, you can download & process multiple files in parallel on the same machine as S3 will probably handle the load just fine (with multiple concurrent requests).
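A rough worker sketch with the AWS SDK for Java (v1), assuming each SQS message body carries the key of one S3 object to convert; the queue URL, bucket and conversion logic are placeholders:

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.sqs.AmazonSQS;
    import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
    import com.amazonaws.services.sqs.model.Message;

    public class SqsConversionWorker {
        public static void main(String[] args) {
            String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/conversion-jobs";
            String bucket = "my-bucket";

            AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

            while (true) {
                for (Message msg : sqs.receiveMessage(queueUrl).getMessages()) {
                    String key = msg.getBody();

                    // Read, convert and write the result under a different prefix.
                    String converted = convert(s3.getObjectAsString(bucket, key));
                    s3.putObject(bucket, "converted/" + key, converted);

                    // Delete only after a successful write; SQS is at-least-once,
                    // so convert() should be safe to run twice on the same object.
                    sqs.deleteMessage(queueUrl, msg.getReceiptHandle());
                }
            }
        }

        private static String convert(String input) {
            return input.toUpperCase(); // stand-in for the real conversion logic
        }
    }

You can run as many copies of this worker as you have machines; the queue distributes the objects between them.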
I have set up single-node and multi-node (1 master and 1 slave) clusters. When I try to run my application, it takes the same time for both, i.e. single-node and multi-node. In my application, I am copying data from HDFS to the local file system and then performing processing on it. Is this because I have the files stored locally and they are not accessible to other nodes in the cluster? I am providing a file which is actually divided into 3 chunks, so logically it should be processed faster on multi-node.
Any idea?
Thanks!
When I try to run my application, it takes the same time for both, i.e. single-node and multi-node.
Well, the difference in time taken will vary depending on the type of operation performed and the amount of load generated by your application. For example, copying a few MB of data will take almost the same time on both single- and multi-node clusters. A single-node cluster might even show better results for a small data set than a multi-node cluster. The real power of Hadoop lies in processing colossal volumes of data by using multi-node clusters for parallel processing.
In my application, I am copying data from HDFS to local file system
and then performing processing on it.
I do not see any sense in copying data to the local file system for processing in a multi-node environment. By doing so you are preventing yourself from using the power of distributed computing.
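As a small illustration of that point, data can be read straight from HDFS with the FileSystem API instead of being copied to the local file system first (the path below is a placeholder); for real gains the processing itself should then run as a distributed job rather than in a single reader:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadFromHdfsDirectly {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Stream the file straight from HDFS; no local copy is created.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/data/input.txt"))))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // ... process the line here ...
                }
            }
        }
    }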
I have a Hadoop cluster consisting of 3 Nodes. I want to load a 180 GB file into HDFS as fast as possible. I know neither -put nor -copyFromLocal are going to help me in this as they are single threaded.
I'm thinking in terms of Map/Reduce. Is there a way to distribute the loading process to the nodes themselves, so each node loads a part of the file, say 60 GB each? I don't want to do this manually from each node (that defeats the purpose). If there is a way to do this using Java and Map/Reduce, I would love to read about it. I know Hadoop can process wildcard input files. Say each 60 GB chunk is named like this: file_1, file_2, file_3. I can then use file_* for my next MR jobs. The trouble I'm having is understanding how to efficiently load the file into Hadoop first in a fast, multi-threaded way.
Thanks in advance!
Edit:
distcp seems to do parallel copying into HDFS, but only between clusters, not within a cluster. I wonder why they didn't think of that, and if they did, what the limitations or bottlenecks around it are.
Also http://blog.syncsort.com/2012/06/moving-data-into-hadoop-faster/ seems to document benchmarks around this topic but they're using DMExpress (commercial tool) to do the loading. It would be great to have an Open Source alternative.
With your configuration, I don't know whether parallelizing writes will improve your performance, because you want to write one file.
Suppose we have the default configuration. The default replication factor is 3, so your file is considered written when each block of your file has been written to 3 machines of your cluster (in your case, to all machines of your cluster).
If you have more than one disk per machine, splitting your file into smaller parts (as many parts as there are disks used by HDFS on one machine) can help improve write performance, but only if your application is the only one using the cluster and you are not limited by your network. In that case your bottleneck is your disks.
If you can manage to split the file on your clients, a simple way to be sure all parts of your file have been copied to HDFS is to create a directory whose name is your file name concatenated with a suffix indicating that the file is being copied. This directory contains all parts of your file. When all copying threads have finished, you rename the directory to drop the suffix. Your clients can access the parts of the file only once the suffix is removed. The rename is a metadata operation on the NameNode, so it is much faster than copying the file.
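Here is a minimal sketch of that staging-directory pattern with the Hadoop FileSystem API; the paths and part names are placeholders, and the copies are shown sequentially where a real loader would run one thread or client per part:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ParallelHdfsUpload {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Staging directory carries a suffix while the copy is in progress.
            Path staging = new Path("/data/bigfile.copying");
            Path finalDir = new Path("/data/bigfile");
            fs.mkdirs(staging);

            // Copy each part into the staging directory (sequential here; in
            // practice each part would be uploaded by its own thread or client).
            String[] localParts = {"/local/file_1", "/local/file_2", "/local/file_3"};
            for (String part : localParts) {
                fs.copyFromLocalFile(new Path(part), staging);
            }

            // A single metadata operation on the NameNode publishes the file:
            // readers only ever see the complete set of parts.
            fs.rename(staging, finalDir);
        }
    }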
Other solutions:
Using a marker file is not the best option because you lose an HDFS block (by default the block size is 128 MB).
Recreating the file from its parts amounts to rewriting the data, so it is inefficient.