I am trying to transfer lots of files from the local filesystem to Hadoop HDFS.
In my Java code, I have just one connection to Hadoop, but I call
fileSystem.transferFromLocal
simultaneously from 50 threads.
I think this might not be a good approach, because it is really slow.
Could anyone please give me some suggestion about this? Thank you very much.
You need to figure out which bottleneck is causing the slow transfer; it could be any of several. Just increasing the number of threads won't increase HDFS write throughput proportionally. Without details about your Hadoop cluster, it's difficult to diagnose the problem.
Here are some of the things to consider:
Check the network bandwidth between the local machine and the Hadoop cluster.
Local disk I/O could also be the bottleneck.
Try increasing the number of data nodes. Note that the data is streamed directly from the client to the first data node in the pipeline. The first data node forwards it to the second, which forwards it to the next data node.
Check the configuration parameters available to fine-tune HDFS.
Check the Architecture Guide for more details about HDFS.
Related
I have looked at examples describing best practices for file write/create operations, but I have not seen one that takes my requirements into consideration. I have to create a class that reads the contents of one file, does some data transformation, writes the transformed contents to a different file, and then sends that file to a web service. Both files can ultimately be quite large, up to 20 MB, and it is unpredictable when they will be created because they are generated by users. There could be two minutes between occurrences of this process, or several could happen in the same second. The system is not extreme in the sense of hundreds of these operations per second, but there could be several at once.
My instinct says to solve it by:
Creating a separate thread when the process begins.
Read the first file.
Do the data transformation.
Write the contents to the new file.
Send the file to the service.
Delete the created file.
Am I missing something? Is there a best practice to tackle this kind of issue?
The first question you should ask is whether you need to write the file to disk in the first place. Even if you are supposed to send a file to a consumer at the end of your processing phase, you could keep the file contents in memory and send that. The consumer doesn't care whether the file is stored on disk or not, since it only receives an array of bytes with the file contents.
The only scenario in which it would make sense to store the file on disk is if your processes communicate via disk files (i.e. your producer writes a file to disk, sends a notification to your consumer, and the consumer then reads the file from disk, for example based on a file name received in the notification).
Regarding I/O best practices, make sure you use buffers to read (and potentially write) files. This can greatly reduce memory overhead, since you end up keeping only a chunk in memory at any given moment instead of the whole 20 MB file.
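For illustration, a chunked, buffered read-transform-write loop might look like the sketch below; the file names and the transform step are placeholders:

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

public class BufferedTransform {
    public static void main(String[] args) throws IOException {
        // Only one 8 KB chunk is held in memory at a time, not the whole 20 MB file.
        try (InputStream in = new BufferedInputStream(new FileInputStream("input.dat"));
             OutputStream out = new BufferedOutputStream(new FileOutputStream("output.dat"))) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(transform(buffer, read));
            }
        }
    }

    // Placeholder transformation; replace with the real logic.
    private static byte[] transform(byte[] chunk, int length) {
        return Arrays.copyOf(chunk, length);
    }
}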
Regarding adding multiple threads, you should test whether that actually improves your application's performance. If your application is already I/O-intensive, adding more threads will add even more contention on your I/O streams, which can result in performance degradation.
Without the full details of the situation, a problem like this may be better solved with existing software such as Apache NiFi:
An easy to use, powerful, and reliable system to process and distribute data.
It's very good at picking up files, transforming them, and putting them somewhere else (and sending emails, and generating analytics, and...). NiFi is a very powerful tool, but it may be overkill if your needs are just a couple of files, given the additional set-up.
Given the description you have given, I think you should perform the operations for each file on one thread; i.e. one thread will download the file, process it and then upload the results.
If you need parallelism, then implement the download / process / upload as a Runnable and submit the tasks to an ExecutorService with a bounded thread pool. And tune the size of the thread pool. (That's easy if you expose the thread pool size as a config property.)
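A minimal sketch of that structure; download, process and upload stand in for your own three steps, and the pool size would come from configuration:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TransferService {
    // Bounded pool; expose the size as a config property so it can be tuned.
    private final ExecutorService pool;

    public TransferService(int poolSize) {
        this.pool = Executors.newFixedThreadPool(poolSize);
    }

    public void submit(String fileName) {
        pool.submit(() -> {
            byte[] data = download(fileName);   // read the source file
            byte[] result = process(data);      // data transformation
            upload(fileName, result);           // send to the web service
        });
    }

    // Placeholders for the three sub-steps described above.
    private byte[] download(String fileName) { return new byte[0]; }
    private byte[] process(byte[] data)      { return data; }
    private void upload(String fileName, byte[] data) { }
}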
Why this way?
It is simple. Minimal synchronization is required.
One of the three subtasks is likely to be your performance bottleneck. So by combining all three into a single task, you avoid the situation where the non-bottleneck tasks get too far ahead. And if some of the subtasks get too far ahead, you risk running out of (local) disk space.
I'm going to contradict what Alex Rolea said about buffering. Yes, it may help. But on a modern (e.g. Linux) operating system running on typical modern hardware, memory <-> disk I/O is unlikely to be the main bottleneck. It is more likely that the bottleneck will be network I/O or server-side I/O performance (especially if the server is serving other clients at the same time).
So, I would not prematurely tune the buffering. Get the system working, benchmark it, profile / analyze it, and based on those results figure out where the real bottlenecks are and how best to address them.
Part of the solution may be to not use disk at all. (I know you think you need to, but unless your server and its protocols are really strange, you should be able to stream the data to the server out of memory on the client side.)
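For example, if the service accepts a plain HTTP POST (an assumption; adapt it to your real protocol), you could stream straight from memory with a sketch like this:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class StreamingUpload {
    // Streams the (already transformed) content to the service without touching disk.
    public static void upload(InputStream transformed, String serviceUrl) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(serviceUrl).openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setChunkedStreamingMode(8192);   // avoid buffering the whole body in memory
        try (OutputStream out = conn.getOutputStream()) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = transformed.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
        int status = conn.getResponseCode();  // forces the request to complete
        System.out.println("Upload finished with HTTP status " + status);
    }
}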
I need to pull files concurrently from a remote server using a single SFTP connection in Java code.
I've already found a few links showing how to pull files one by one over a single connection.
Like:
Use sftpChannel.ls("Path to dir"), which returns the list of files in the given path as a Vector; you then iterate over the Vector and download each file with sftpChannel.get().
But I want to pull multiple files concurrently, e.g. 2 files at a time, over a single connection.
Thank You!
The ChannelSftp.get method returns an InputStream.
So you can call get multiple times, acquiring a stream for each download, and then keep polling the streams until they all reach end-of-file.
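A rough sketch of that idea, assuming an already-connected ChannelSftp; the remote and local file names are placeholders, and how efficiently the reads interleave depends on the request pipelining underneath:

import com.jcraft.jsch.ChannelSftp;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

public class ParallelSftpGet {
    public static void download(ChannelSftp sftpChannel) throws Exception {
        String[] remoteFiles = { "/remote/dir/file1.dat", "/remote/dir/file2.dat" };
        InputStream[] in = new InputStream[remoteFiles.length];
        OutputStream[] out = new OutputStream[remoteFiles.length];

        // One get() per file gives one stream per download on the same channel.
        for (int i = 0; i < remoteFiles.length; i++) {
            in[i] = sftpChannel.get(remoteFiles[i]);
            out[i] = new FileOutputStream("local_" + i + ".dat");
        }

        // Keep draining the streams in turn until every one reaches end-of-file.
        byte[] buffer = new byte[32 * 1024];
        boolean anyOpen = true;
        while (anyOpen) {
            anyOpen = false;
            for (int i = 0; i < in.length; i++) {
                if (in[i] == null) continue;
                int read = in[i].read(buffer);
                if (read < 0) {                  // this download is finished
                    in[i].close();
                    out[i].close();
                    in[i] = null;
                } else {
                    out[i].write(buffer, 0, read);
                    anyOpen = true;
                }
            }
        }
    }
}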
Though I do not see what advantage this gives you over a sequential download.
If you want to improve performance, you first need to know what the bottleneck is.
The typical bottlenecks are:
Network speed: If you are already saturating the network, you cannot improve anything.
Network latency: If latency is the bottleneck, increasing the size of the SFTP request queue may help. Use the ChannelSftp.setBulkRequests method (the default is 16, so use a higher number; see the snippet after this list).
CPU: If the CPU is the bottleneck, you either have to improve the efficiency of the encryption implementation or spread the load across CPU cores. Spreading the encryption load of a single session/connection is tricky and would have to be supported at a low level of the SSH implementation. I do not think JSch or any other implementation supports that.
Disk: If a disk drive (local or remote) is the bottleneck (unlikely), parallel transfers as shown above may help, even over a single connection, provided each parallel transfer uses a different disk drive.
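For example (the host, the credentials and the value 128 are placeholders to experiment with):

import com.jcraft.jsch.ChannelSftp;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

public class BulkRequestsExample {
    public static void main(String[] args) throws Exception {
        Session session = new JSch().getSession("user", "sftp.example.com", 22);
        session.setPassword("password");
        session.setConfig("StrictHostKeyChecking", "no"); // for the sketch only
        session.connect();

        ChannelSftp sftpChannel = (ChannelSftp) session.openChannel("sftp");
        sftpChannel.connect();
        // More outstanding requests in flight helps on high-latency links.
        // 16 is the JSch default; 128 is just a value to experiment with.
        sftpChannel.setBulkRequests(128);
    }
}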
For more in-depth information, see my answers to:
Why is FileZilla SFTP file transfer max capped at 1.3MiB/sec instead of saturating available bandwidth? rsync and WinSCP are even slower
Why is FileZilla so much faster than PSFTP?
I have a Hadoop cluster consisting of 3 nodes. I want to load a 180 GB file into HDFS as fast as possible. I know that neither -put nor -copyFromLocal will help me here, as they are single-threaded.
I'm thinking in terms of Map/Reduce. Is there a way to distribute the loading process across the nodes themselves, so that each node loads a part of the file, say 60 GB each? I don't want to do this manually from each node (that defeats the purpose). If there is a way to do this using Java and Map/Reduce, I would love to read about it. I know Hadoop can process wildcard input files. Say each 60 GB chunk is named file_1, file_2, file_3... I can then use file_* for my next MR jobs. The trouble I'm having is understanding how to load the file into Hadoop efficiently in a fast, multi-threaded way.
Thanks in advance!
Edit:
distcp seems to do parallel copying into HDFS, but only between clusters and not within a cluster. I wonder why they didn't think of that, and if they did, what the limitations or bottlenecks around it are.
Also, http://blog.syncsort.com/2012/06/moving-data-into-hadoop-faster/ seems to document benchmarks around this topic, but they use DMExpress (a commercial tool) to do the loading. It would be great to have an open-source alternative.
With your configuration, I don't know whether parallelizing the writes will improve your performance, because you want to write a single file.
Suppose we have the default configuration. The default replication factor is 3, so your file is considered written once each of its blocks has been written to 3 machines of your cluster (in your case, to all machines of your cluster).
If you have more than one disk per machine, splitting your file into smaller parts (as many parts as there are disks used by HDFS on one machine) can help improve write performance, but only if your application is the only one using the cluster and you are not limited by your network. In that case your bottleneck is your disks.
If you can manage the split parts on your clients, a simple way to be sure all parts of your file have been copied to HDFS is to create a directory whose name is the name of your file concatenated with a suffix indicating that the file is being copied. This directory contains all the parts of your file. When all the copy threads have finished, you rename the directory to drop the suffix. Your clients can only access the parts of the file once the suffix is removed. The rename is a metadata operation on the NameNode, so it is very fast compared with a file copy.
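A minimal sketch of that pattern; the paths, the "._COPYING" suffix and the part names are just illustrative conventions, and error handling is omitted:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SuffixedUpload {
    // Copies the local parts into "/data/bigfile._COPYING" and renames the
    // directory once every part has arrived, so readers never see a partial file.
    public static void upload(FileSystem fs, String[] localParts) throws Exception {
        Path temp = new Path("/data/bigfile._COPYING");
        Path done = new Path("/data/bigfile");
        fs.mkdirs(temp);

        // In practice each part would be copied from its own thread.
        for (int i = 0; i < localParts.length; i++) {
            fs.copyFromLocalFile(new Path(localParts[i]), new Path(temp, "part_" + i));
        }

        // The rename is a NameNode metadata operation, cheap compared with the copies.
        fs.rename(temp, done);
    }
}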
Other solutions:
Using a marker file is not the best option because you lose an HDFS block (the default block size is 128 MB).
Recreating the file from its parts amounts to rewriting the data, so it is inefficient.
I have a Java application which needs to read and write files to HDFS. I use
FileSystem fs = FileSystem.get(configuration);
And it works well.
Now the question is: should I keep this reference and use it as a singleton, or should I use it only once and get a new one each time?
If it matters, I should add that the application targets quite high traffic.
Thanks
I think the answer depends on the relation between two numbers: the network bandwidth (between the HDFS client and the HDFS cluster) and the amount of data per second you can feed to the HDFS client. If the first is higher, then having a few connections open at the same time makes sense.
Usually 2-3 concurrent connections are optimal.
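If you do keep a single handle, a minimal sketch could look like the one below. Note that, as far as I know, FileSystem.get already caches instances per URI and configuration unless caching is disabled, so measure before adding a layer of your own:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import java.io.IOException;

public final class HdfsClient {
    // One FileSystem handle for the whole application, shared by a small,
    // fixed number of worker threads (2-3, per the suggestion above).
    private static volatile FileSystem fs;

    private HdfsClient() {}

    public static FileSystem get(Configuration conf) throws IOException {
        if (fs == null) {
            synchronized (HdfsClient.class) {
                if (fs == null) {
                    fs = FileSystem.get(conf);
                }
            }
        }
        return fs;
    }
}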
I am writing code to transfer files to Hadoop HDFS in parallel, so I have many threads calling filesystem.copyFromLocalFile.
I think the cost of opening a FileSystem is not small, so I have just one FileSystem open in my project. I thought there might be a problem when so many threads call it at the same time, but so far it works fine with no problems.
Could anyone please give me some information about this copy method?
Thank you very much, and have a great weekend.
I see the following design points to consider:
a) Where will the bottleneck of the process be? I think that with 2-3 parallel copy operations the local disk or 1 Gbit Ethernet will become the bottleneck. You can do it as a multithreaded application or you can run a few processes. In any case, I do not think you need a high level of parallelism.
b) Error handling. The failure of one thread should not stop the whole process, and at the same time no file should be lost. What I usually do in such cases is accept that, in the worst case, a file may be copied twice. If that is acceptable, the system can work with a simple "copy then delete" scenario (see the sketch after these points).
c) If you copy from one of the cluster nodes, HDFS will become unbalanced, since one replica is always stored on the host you copy from. You will need to rebalance constantly.
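A rough sketch of points a) and b) combined; the pool size of 3, the logging, and the delete-after-copy policy are assumptions, not anything prescribed by the API:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CopyThenDelete {
    // 2-3 workers are usually enough before the local disk or the network saturates.
    private final ExecutorService pool = Executors.newFixedThreadPool(3);

    public void copy(FileSystem fs, File localFile, Path hdfsDir) {
        pool.submit(() -> {
            try {
                Path src = new Path(localFile.getAbsolutePath());
                fs.copyFromLocalFile(src, new Path(hdfsDir, localFile.getName()));
                // Delete the local file only after the copy succeeded; if the process
                // dies in between, the worst case is that the file is copied twice.
                if (!localFile.delete()) {
                    System.err.println("Could not delete " + localFile);
                }
            } catch (Exception e) {
                // A failure affects only this file; the pool keeps running.
                System.err.println("Copy failed for " + localFile + ": " + e.getMessage());
            }
        });
    }
}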
Can you tell me what more information you want about copyFromLocalFile()?
I'm not sure, but I guess that in your case the threads share the same resource among themselves. Since you have only one instance of FileSystem, each thread will probably share this object on a time-sharing basis.