I have set up a single-node cluster and a multi-node (1 master and 1 slave) cluster. When I run my application, it takes the same time on both, i.e. on the single-node and the multi-node cluster. In my application, I copy data from HDFS to the local file system and then process it there. Is this because the files are stored locally and are not accessible to the other nodes in the cluster? I am providing a file which is actually divided into 3 chunks, so logically it should be processed faster on the multi-node cluster.
Any idea?
Thanks!
When I run my application, it takes the same time on both, i.e.
on the single-node and the multi-node cluster.
Well, the difference in time taken will vary depending on the type of operation performed and the amount of load generated by your application. For example, copying a few MB of data will take almost the same time on a single-node and a multi-node cluster. A single-node cluster may even show better results for a small data set than a multi-node cluster. The real power of Hadoop lies in processing colossal volumes of data by using multi-node clusters for parallel processing.
In my application, I copy data from HDFS to the local file system
and then process it there.
I do not see the point of copying data to the local file system for processing in a multi-node environment. By doing that you are preventing yourself from using the power of distributed computing.
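For illustration only, here is a minimal sketch of streaming the input straight from HDFS instead of copying it to the local disk first; the path is a placeholder and the per-line processing is left empty.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadDirectlyFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical input path: stream the data where it lives instead of
        // running "hadoop fs -copyToLocal" and reading a local copy.
        Path input = new Path("/user/hadoop/input/data.txt");
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(input)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // placeholder for the actual per-line processing
            }
        }
    }
}
Of course, to actually benefit from the second node, the processing itself should run as a MapReduce (or similar) job over the HDFS input paths rather than inside a single client process like this one.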
I'm doing some POC work with Redshift, loading data from S3 JSON files via the COPY command from a Java program. This POC is testing an initial data migration that we'd do to seed Redshift, not daily use. My data is split into about 7500 subfolders in S3, and I'd like to be able to insert the subfolders in parallel. Each subfolder contains about 250 JSON files, with about 3000 rows each to insert.
The single-threaded version of my class loads the files from one of my S3 subfolders in about 20 seconds (via a COPY command). However, when I introduce a second thread (each thread gets a Redshift DB connection from a BoneCP connection pool), each COPY command, except for the first one, takes about 40 seconds. When I run a query in Redshift to show all running queries, Redshift says that it is running two queries at the same time (as expected). However, it's as if the second query is really waiting for the first to complete before it starts work. I expected each COPY command to still take only 20 seconds. The Redshift console shows that I only get up to 60% CPU usage running single- or double-threaded.
Could this be because I only have 1 node in my Redshift cluster? Or is Redshift unable to open multiple connections to S3 to get the data? I'd appreciate any tips for how to get some performance gains by running multi-threaded copy commands.
Amazon Redshift loads data from Amazon S3 in parallel, utilising all nodes. From your test results, it would appear that running multiple COPY commands does not improve performance, since all nodes are already involved in the copy process.
For each table, always load as many files as possible in a single COPY command, rather than appending later. If you are loading multiple tables, it is likely best to do them sequentially (but your testing might find that loading multiple smaller tables can be done in parallel).
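As an illustration only, a single COPY issued over a common S3 prefix might look like the sketch below; the JDBC URL, credentials, table name, bucket, and IAM role are all placeholders, and the Redshift JDBC driver is assumed to be on the classpath.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SingleCopyExample {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and credentials.
        String url = "jdbc:redshift://example.xxxxxxxx.us-east-1.redshift.amazonaws.com:5439/dev";
        try (Connection conn = DriverManager.getConnection(url, "dbuser", "dbpassword");
                Statement stmt = conn.createStatement()) {
            // One COPY over the common prefix lets Redshift spread the files from
            // the ~7500 subfolders across all slices itself, instead of issuing
            // one COPY per subfolder from multiple threads.
            stmt.execute(
                "COPY my_table FROM 's3://my-bucket/data/' "
                + "IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole' "
                + "FORMAT AS JSON 'auto'");
        }
    }
}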
Some references:
Use a Single COPY Command to Load from Multiple Files
Using the COPY Command to Load from Amazon S3
Quora: What are the ways to improve copy performance on Redshift?
My abysmally rudimentary understanding of Hadoop and its "data ingest" tools (such as Flume or Sqoop) is that Hadoop must always run its MR jobs against data stored as files in HDFS, and that these tools (again, Flume, Sqoop, etc.) are essentially responsible for importing data from disparate systems (RDBMS, NoSQL, etc.) into HDFS.
To me, this means that Hadoop will always be running on "stale" (for lack of a better word) data that is minutes or hours old, because importing big data from these disparate systems into HDFS takes time. By the time MR can even run, the data is stale and may no longer be relevant.
Say we have an app that has real-time constraints of making a decision within 500ms of something occurring. Say we have a massive stream of data that is being imported into HDFS, and because the data is so big it takes, say, 3 seconds to even get the data on to HDFS. Then say that the MR job that is responsible for making the decision takes 200ms. Because the loading of the data takes so long, we've already blown our time constraint, even though the MR job processing the data would be able to finish inside the given window.
Is there a solution for this kind of big data problem?
You can use tools such as the Apache Spark Streaming API, or Apache Storm, for real-time stream processing.
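For example, a minimal Spark Streaming sketch in Java might look like the following; the socket source, host, port, batch interval, and the "ALERT" filter are placeholders for your real stream and decision logic.
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class LowLatencySketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("low-latency-sketch").setMaster("local[2]");
        // 500 ms micro-batches: decisions are made on the data as it arrives,
        // not after a bulk import into HDFS has completed.
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.milliseconds(500));

        // Placeholder source: lines arriving on a socket instead of files landing in HDFS.
        JavaDStream<String> events = ssc.socketTextStream("localhost", 9999);
        events.filter(e -> e.contains("ALERT"))
              .foreachRDD(rdd -> rdd.foreach(event -> System.out.println("decision for: " + event)));

        ssc.start();
        ssc.awaitTermination();
    }
}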
I have a Hadoop cluster consisting of 3 nodes. I want to load a 180 GB file into HDFS as fast as possible. I know neither -put nor -copyFromLocal is going to help me here, as they are single-threaded.
I'm thinking in terms of Map/Reduce. Is there a way to distribute the loading process to the nodes themselves, so that each node loads a part of the file, say 60 GB each? I don't want to do this manually from each node (that defeats the purpose). If there is a way to do this using Java and Map/Reduce I would love to read about it. I know Hadoop can process wildcard input files. Say each 60 GB chunk is named like this: file_1, file_2, file_3. I can then use file_* for my next MR jobs. The trouble I'm having is understanding how to efficiently load the file into Hadoop in a fast, multi-threaded way.
Thanks in advance!
Edit:
distcp seems to do parallel copying into HDFS, but only between clusters, not within a cluster. I wonder why they didn't think of that, and if they did, what the limitations or bottlenecks around this are.
Also http://blog.syncsort.com/2012/06/moving-data-into-hadoop-faster/ seems to document benchmarks around this topic but they're using DMExpress (commercial tool) to do the loading. It would be great to have an Open Source alternative.
With your configuration, I am not sure that parallelizing the writes will improve performance, because you want to write a single file.
Suppose we have the default configuration. The default replication factor is 3, so your file is considered written once each of its blocks has been written to 3 machines of your cluster (in your case, to all machines of your cluster).
If you have more than one disk per machine, splitting your file into smaller parts (one part per disk used by HDFS on a machine) can improve write performance, but only if your application is the only one using the cluster and you are not limited by the network. In that case your bottleneck is your disks.
If you can manage the split file on your clients, a simple way to be sure all parts of your file have been copied to HDFS is to create a directory named after your file plus a suffix indicating that the file is being copied. This directory contains all the parts of your file. When all the copying threads have finished, you rename the directory to drop the suffix. Your clients can access the parts of the file only once the suffix has been removed. The rename is a metadata operation on the NameNode, so it is very fast compared to copying the file.
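A rough sketch of that pattern with the Hadoop FileSystem API follows; the directory names and the ._COPYING_ suffix are arbitrary placeholders, and the actual part uploads are elided.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PublishBySuffixRename {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Upload all parts (file_1, file_2, ...) into a suffixed staging directory.
        Path staging = new Path("/data/bigfile._COPYING_");
        Path published = new Path("/data/bigfile");
        fs.mkdirs(staging);

        // ... copy threads write their parts into "staging" here ...

        // When every thread has finished, one rename publishes the whole directory.
        // A rename is only a NameNode metadata operation, so it is cheap compared to copying data.
        if (!fs.rename(staging, published)) {
            throw new IOException("rename failed: " + staging + " -> " + published);
        }
    }
}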
Other solutions:
Using a marker file is not the best option because it wastes an HDFS block (the default block size is 128 MB).
Recreating the file from its parts amounts to rewriting the data, so it is inefficient.
I'm dealing with kind of a bizarre use case where I need to make sure that File A is local to Machine A, File B is local to Machine B, etc. When copying a file to HDFS, is there a way to control which machines that file will reside on? I know that any given file will be replicated across three machines, but I need to be able to say "File A will DEFINITELY exist on Machine A". I don't really care about the other two machines -- they could be any machines on my cluster.
Thank you.
I don't think so, because in general, when a file is larger than 64 MB (the block size), its blocks will be spread across multiple servers.
HDFS is a distributed file system. It is cluster-specific (whether the cluster has one machine or many), and once a file is in HDFS you lose the notion of which machine or machines hold it underneath; that abstraction is what makes it so useful. If the file is larger than the block size, it is cut into blocks, and based on the replication factor those blocks are copied to other machines in your cluster. Where those blocks end up is decided by HDFS's replica placement policy, not by you.
In your case, if you have a 3-node cluster (plus one NameNode), your source file is 1 MB, the block size is 64 MB, and the replication factor is 3, then the single block holding your 1 MB file will have a copy on all 3 nodes; from HDFS's perspective, however, you still have only 1 file. Once the file has been copied to HDFS, you really don't think in terms of machines anymore, because at the machine level there is no file, only file blocks.
If for whatever reason you really want this guarantee, what you can do is set the replication factor to 1 and run a 1-node cluster, which will trivially satisfy your bizarre requirement.
Finally, you can always use tools such as hdfs fsck (or the offline fsimage viewer) on your Hadoop cluster to see how a file is split into blocks and where those blocks are located.
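You can also ask the FileSystem API directly from Java; here is a small sketch where the path is a placeholder, printing which hosts hold each block of the file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrintBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder path: list the DataNodes that hold each block of the file.
        Path file = new Path("/user/hadoop/fileA");
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset()
                + " -> hosts " + String.join(",", block.getHosts()));
        }
    }
}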
I found this recently, which may address what you are looking to do: Controlling HDFS Block Placement
I have a Java application which needs to read and write files to HDFS. I use
FileSystem fs = FileSystem.get(configuration);
And it works well.
Now the question is: should I keep this reference and use it as a singleton, or should I get a new one each time I need it?
If it matters, I should mention that the application has to handle quite high traffic.
Thanks
I think the answer depends on the relationship between two numbers: the network bandwidth between the HDFS client and the HDFS cluster, and the amount of data per second you can feed to the HDFS client. If the first is higher, then having a few connections open at the same time makes sense.
Usually 2-3 concurrent connections are optimal.
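If I remember correctly, FileSystem.get() already caches instances per scheme/authority/user unless fs.<scheme>.impl.disable.cache is set, so repeated calls are usually cheap anyway. If you prefer an explicit singleton, a sketch (the holder class name and the double-checked locking are just one illustrative choice) could look like this:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Hypothetical holder: one shared FileSystem for the whole application.
public final class HdfsClientHolder {
    private static volatile FileSystem fs;

    private HdfsClientHolder() {}

    public static FileSystem get(Configuration conf) throws IOException {
        if (fs == null) {
            synchronized (HdfsClientHolder.class) {
                if (fs == null) {
                    // FileSystem instances are designed to be shared across threads,
                    // so high-traffic request handlers can reuse this single one.
                    fs = FileSystem.get(conf);
                }
            }
        }
        return fs;
    }
}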