loading data into hdfs in parallel - java

I have a Hadoop cluster consisting of 3 nodes. I want to load a 180 GB file into HDFS as fast as possible. I know neither -put nor -copyFromLocal is going to help me here, as they are single threaded.
I'm thinking in terms of Map/Reduce. Is there a way to distribute the loading process to the nodes themselves, so that each node loads a part of the file, say 60 GB each? I don't want to do this manually from each node (that defeats the purpose). If there is a way to do this using Java and Map/Reduce, I would love to read about it. I know Hadoop can process wildcard input files. Say each 60 GB chunk is named like this: file_1, file_2, file_3... I can then use file_* for my next MR jobs. The trouble I'm having is understanding how to load the file into Hadoop in a fast, multi-threaded way in the first place.
Thanks in advance!
Edit:
distcp seems to do parallel copying into HDFS, but only between clusters and not within a cluster. I wonder why they didn't think of that, and if they did, what the limitations or bottlenecks around it are.
Also http://blog.syncsort.com/2012/06/moving-data-into-hadoop-faster/ seems to document benchmarks around this topic, but they're using DMExpress (a commercial tool) to do the loading. It would be great to have an open-source alternative.

With your configuration, I don't know whether parallelizing the writes will improve performance, because you want to end up with a single file.
Suppose we have the default configuration. The default replication factor is 3, so your file is considered written when each of its blocks has been written to 3 machines of your cluster (in your case, to every machine in the cluster).
If you have more than one disk per machine, dividing your file into smaller parts (as many parts as there are disks used by HDFS on one machine) can help improve write performance, but only if your application is the only one using the cluster and you are not limited by your network. In that case your bottleneck is your disks.
If you can manage the split file on your clients, a simple way to be sure all parts of your file are copied to HDFS is to create a directory whose name is the name of your file concatenated with a suffix showing that the file is being copied. This directory contains all the parts of your file. When all copying threads are finished, you rename the directory to drop the suffix. Your clients can access the parts of the file only once the suffix is removed. The rename is a metadata operation on the NameNode, so it is very fast compared to the file copy itself.
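Here is a minimal sketch of that approach using the Hadoop FileSystem API, assuming the 180 GB file has already been split into parts on the client. The local and HDFS paths, the "._COPYING_" suffix, and the thread-safety of sharing one FileSystem instance across the copy threads are all my assumptions for illustration, not something prescribed above.

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelHdfsUpload {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical local parts of the big file, already split on the client.
        List<Path> localParts = Arrays.asList(
                new Path("file:///data/bigfile/part_1"),
                new Path("file:///data/bigfile/part_2"),
                new Path("file:///data/bigfile/part_3"));

        // Staging directory carries a suffix while the copy is in progress.
        Path staging = new Path("/user/me/bigfile._COPYING_");
        Path finalDir = new Path("/user/me/bigfile");
        fs.mkdirs(staging);

        ExecutorService pool = Executors.newFixedThreadPool(localParts.size());
        for (Path part : localParts) {
            pool.submit(() -> {
                // Each part is copied by its own thread.
                fs.copyFromLocalFile(part, new Path(staging, part.getName()));
                return null;
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);

        // Metadata-only operation on the NameNode: readers only ever see the
        // final directory once every part is in place.
        fs.rename(staging, finalDir);
    }
}

After the rename, the parts can be consumed with a wildcard such as /user/me/bigfile/part_* by the next MR job, as described in the question.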
Other solutions:
Using a marker file is not the best option, because you lose an HDFS block (the default block size is 128 MB).
Recreating the file from its parts amounts to rewriting the data, so it is inefficient.

Related

Hadoop process WARC files

I have a general question about Hadoop file splitting and multiple mappers. I am new to Hadoop and am trying to get a handle on how to set up for optimal performance. My project is currently processing WARC files which are gzipped.
Using the current InputFileFormat, the file is sent to one mapper and is not split. I understand this is the correct behavior for a compressed file. Would there be a performance benefit to decompressing the file as an intermediate step before running the job, to allow the input to be split and thus use more mappers?
Would that be possible? Does having more mappers create more overhead in latency or is it better to have one mapper? Thanks for your help.
Although WARC files are gzipped, they are splittable (cf. Best splittable compression for Hadoop input = bz2?) because every record has its own deflate block. But the record offsets must be known in advance.
But is this really necessary? The Common Crawl WARC files are all about 1 GB in size, and each one should normally be processed within 15 minutes at most. Given the overhead of launching a map task, that's a reasonable running time for a mapper. Possibly a mapper could also process a few WARC files, but it's important that you have enough splits of the input WARC file list so that all nodes are running tasks. Running a whole Hadoop job to process a single WARC file would mean a lot of unnecessary overhead.
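One way to give each mapper a few whole WARC files is to feed the job a plain text file listing the WARC paths and split that list with NLineInputFormat. The sketch below only echoes the paths it receives; a real mapper would open each path and stream the gzipped records itself, and the class names and argument layout are illustrative, not part of the answer above.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WarcListDriver {

    // Stub mapper: the value is one WARC path from the list file. A real
    // implementation would open it with FileSystem.open() and stream through
    // the gzipped records; here it just echoes the path.
    public static class WarcListMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text warcPath, Context context)
                throws IOException, InterruptedException {
            context.write(warcPath, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "warc-by-file-list");
        job.setJarByClass(WarcListDriver.class);

        // args[0]: a text file with one WARC path per line (hypothetical layout).
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        // Each split, and therefore each mapper, gets 3 WARC paths.
        NLineInputFormat.setNumLinesPerSplit(job, 3);

        job.setMapperClass(WarcListMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        TextOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Tuning setNumLinesPerSplit is how you make sure there are enough splits of the file list to keep all nodes busy.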

Creating a large temporary file in a platform-agnostic way

What's the best way of creating a large temporary file in Java, and being sure that it's on disk, not in RAM somewhere?
If I use
Path tempFile = Files.createTempFile("temp-file-name", ".tmp");
then it works fine for small files, but on my Linux machine, it ends up being stored in /tmp. On many Linux boxes, that's a tmpfs filesystem, backed by RAM, which will cause trouble if the file is large. The appropriate way of doing this on such a box is to put it in /var/tmp, but hard-coding that path doesn't seem very cross-platform to me.
Is there a good cross-platform way of creating a temporary file in Java and being sure that it's backed by disk and not by RAM?
There is no platform-independent way to determine free disk space. Actually, there is not even a good platform-dependent way; you run into things like ZFS filesystems (which may be compressing your data on the fly), directories that are being filled by other applications, or network shares that simply lie to you.
I know of these options:
Assume that it is an operational concern. That is, whoever uses the software should have an administrator who is aware of how much space is left on which device, and who expects to be able to explicitly configure the partition that should hold the data. I'd start considering this at several tens of GB, and prefer it at a few hundred GB.
Assume it's really a temporary file. Document that the application needs xxx GB of temporary space (whatever rough estimate you can give them - my application says "needs ca. 100 GB for every automatic update that you keep on disk").
Abuse the user cache for the file. The XDG standard has $XDG_CACHE_HOME for the cache; the cache directory is supposed to be nice and big (take a look at the ~/.cache/ of anybody using a Linux machine). On Windows, you'd simply use %TEMP% but that's okay because %TEMP% is supposed to be big anyway.
This gives the following strategy: Try environment variables, first XDG_CACHE_HOME (if it's nonempty, it's a Posix system with XDG conventions), then TMP (if it's nonempty, it's a Posix system and you don't have a better option than /tmp anyway), finally TEMP in case it's Windows.
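A rough sketch of that lookup order; the final fallback to java.io.tmpdir and the prefix/suffix names are my own additions for illustration, not part of the strategy above.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DiskBackedTempFile {

    // Picks a base directory following the strategy described above.
    static Path chooseBaseDir() {
        String xdgCache = System.getenv("XDG_CACHE_HOME");
        if (xdgCache != null && !xdgCache.isEmpty()) {
            return Paths.get(xdgCache);   // Posix box with XDG conventions
        }
        String tmp = System.getenv("TMP");
        if (tmp != null && !tmp.isEmpty()) {
            return Paths.get(tmp);        // Posix: no better option than /tmp anyway
        }
        String temp = System.getenv("TEMP");
        if (temp != null && !temp.isEmpty()) {
            return Paths.get(temp);       // Windows: %TEMP% is expected to be big
        }
        // Fallback: whatever the JVM considers the temp dir (my assumption).
        return Paths.get(System.getProperty("java.io.tmpdir"));
    }

    public static Path createLargeTempFile() throws IOException {
        Path base = chooseBaseDir();
        Files.createDirectories(base);
        return Files.createTempFile(base, "temp-file-name", ".tmp");
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Created " + createLargeTempFile());
    }
}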

When copying a file to HDFS, how to control what nodes that file will reside on?

I'm dealing with kind of a bizarre use case where I need to make sure that File A is local to Machine A, File B is local to Machine B, etc. When copying a file to HDFS, is there a way to control which machines that file will reside on? I know that any given file will be replicated across three machines, but I need to be able to say "File A will DEFINITELY exist on Machine A". I don't really care about the other two machines -- they could be any machines on my cluster.
Thank you.
I don't think so, because in general when a file is larger than 64 MB (the block size), the primary replicas of its blocks will reside on multiple servers.
HDFS is a distributed file system, specific to a cluster (one machine or many machines), and once a file is in HDFS you lose the notion of the machine or machines underneath; that abstraction is what makes it so useful. If the file size is bigger than the block size, the file will be cut into blocks, and based on the replication factor those blocks will be copied to other machines in your cluster. Where those blocks end up is decided by HDFS's block placement policy.
In your case, if you have a 3-node cluster (+1 NameNode), your source file is 1 MB, your block size is 64 MB, and the replication factor is 3, then you will have 3 copies of the block making up your 1 MB file, one on each of the 3 nodes; from the HDFS perspective, however, you still have only 1 file. Once the file has been copied to HDFS, you really don't think in terms of machines, because at the machine level there is no file, only file blocks.
If, for whatever reason, you really want to make sure, what you can do is set the replication factor to 1 and run a 1-node cluster, which will guarantee your bizarre requirement.
Finally, you can always use the FSImage viewer tools in your Hadoop cluster to see where the file blocks are located. More details are located here.
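Alternatively, the HDFS Java API can report block locations directly. A small sketch (the default path is made up and error handling is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical file; pass your own path as the first argument instead.
        Path file = new Path(args.length > 0 ? args[0] : "/user/me/fileA");
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per block, with the hosts holding its replicas.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}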
I found this recently that may address what you are looking to do: Controlling HDFS Block Placement

Hadoop HDFS java client usage

I have a java application which needs to read and write files to HDFS. I do use
FileSystem fs = FileSystem.get(configuration);
And it works well.
Now the question is: should I keep this reference and use it as a singleton, or should I use it only once and get a new one each time?
If it matters, I should say that the application handles quite high traffic.
Thanks
I think the answer depends on the relation between two numbers: the network bandwidth (between the HDFS client and the HDFS cluster) and the amount of data per second you can feed to the HDFS client. If the first is higher, then having a few connections open at the same time makes sense.
Usually 2-3 concurrent connections are optimal.
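If you do want a handful of independent connections rather than the single instance that FileSystem.get() returns (which is cached per scheme, authority and user by default), FileSystem.newInstance() bypasses that cache. A rough sketch of a small round-robin pool; the pool size and the wrapper class are illustrative, not a recommendation from the answer above.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsClientPool {

    private final List<FileSystem> instances = new ArrayList<>();
    private final AtomicInteger next = new AtomicInteger();

    public HdfsClientPool(Configuration conf, int size) throws IOException {
        // newInstance() bypasses FileSystem.get()'s internal cache, so each
        // entry here is an independent client instance.
        for (int i = 0; i < size; i++) {
            instances.add(FileSystem.newInstance(conf));
        }
    }

    // Hands out the pooled instances round-robin.
    public FileSystem get() {
        return instances.get(Math.floorMod(next.getAndIncrement(), instances.size()));
    }

    public void close() throws IOException {
        for (FileSystem fs : instances) {
            fs.close();
        }
    }

    public static void main(String[] args) throws IOException {
        HdfsClientPool pool = new HdfsClientPool(new Configuration(), 3);
        System.out.println(pool.get().getUri());
        pool.close();
    }
}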

how to find out the size of file and directory in java without creating the object?

First, please don't skip over this because you think it's a common question; it is not. I know how to find out the size of a file and a directory using File.length() and Apache FileUtils.sizeOfDirectory().
My problem is that in my case the files and directories are very big (hundreds of MB). When I try to find the size using the code above (i.e. by creating File objects), my program becomes very resource hungry and performance slows down.
Is there any way to know the size of a file without creating an object for it?
I am using
For files: File file1 = new File(fileName); long size = file1.length();
and for directories: File dir1 = new File(dirPath); long size = FileUtils.sizeOfDirectory(dir1);
I have one parameter which enables size computation. If the parameter is false then everything runs smoothly; if it is true then the program lags or hangs. I am calculating the size of 4 directories and 2 database files.
File objects are very lightweight. Either there is something wrong with your code, or the problem is not with the file objects but with the HD access necessary for getting the file size. If you do that for a large number of files (say, tens of thousands), then the harddisk will do a lot of seeks, which is pretty much the slowest operation possible on a modern PC (by several orders of magnitude).
A File is just a wrapper for the file path. It doesn't matter how big the file is, only its file name.
When you want to get the size of all the files in a directory, the OS needs to read the directory and then look up each file to get its size. Each access takes about 10 ms (because that's a typical seek time for a hard drive), so if you have 100,000 files it will take you about 17 minutes to get all their sizes.
The only way to speed this up is to get a faster drive, e.g. solid state drives have an average seek time of 0.1 ms, but it would still take 10 seconds or more to get the sizes of 100K files.
BTW: The size of each file doesn't matter, because it doesn't actually read the file, only the file entry, which has its size.
EDIT: For example, if I try to get the size of a large directory, it is slow at first but much faster once the data is cached.
$ time du -s /usr
2911000 /usr
real 0m33.532s
user 0m0.880s
sys 0m5.190s
$ time du -s /usr
2911000 /usr
real 0m1.181s
user 0m0.300s
sys 0m0.840s
$ find /usr | wc -l
259934
The reason the lookup is still reasonably fast the first time is that the files were all installed at once and most of the information is laid out contiguously on disk. Once the information is in memory, it takes next to no time to read the file information.
Timing FileUtils.sizeOfDirectory("/usr") takes under 8.7 seconds. This is relatively slow compared with the time du takes, but it still processes around 30K files per second.
An alternative might be to run Runtime.exec("du -s " + directory); however, this will only make a few seconds' difference at most. Most of the time is likely to be spent waiting for the disk if it's not in the cache.
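For completeness, a sketch of shelling out to du and parsing its output; this is Linux-only and assumes du's usual whitespace-separated "size path" output in 1 KB blocks.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class DuDirectorySize {

    // Returns the size of a directory in kilobytes by shelling out to du.
    static long sizeWithDu(String directory) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("du", "-s", directory).start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            // du -s prints "<size-in-KB>\t<path>"
            String line = reader.readLine();
            process.waitFor();
            return Long.parseLong(line.split("\\s+")[0]);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sizeWithDu(args.length > 0 ? args[0] : "/usr") + " KB");
    }
}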
We had a similar performance problem with File.listFiles() on directories with large number of files.
Our setup was one folder with 10 subfolders each with 10,000 files.
The folder was on a network share and not on the machine running the test.
We were using a FileFilter to only accept files with known extensions or a directory, so we could recurse down the directories.
Profiling revealed that about 70% of the time was spent calling File.isDirectory (which I assume Apache is calling). There were two calls to isDirectory for each file (one in the filter and one in the file processing stage).
File.isDirectory was slow because it had to hit the network share for each file.
Reversing the order of the check in the filter to check for valid name before valid directory saved a lot of time, but we still needed to call isDirectory for the recursive lookup.
My solution was to implement a version of listFiles in native code, that would return a data structure that contained all the metadata about a file instead of just the filename like File does.
This got rid of the performance problem but added the maintenance problem of having native code maintained by Java developers (luckily we only supported one OS).
I think that you need to read the metadata of a file.
Read this tutorial for more information. This might be the solution you are looking for:
http://download.oracle.com/javase/tutorial/essential/io/fileAttr.html
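For reference, a small example along the lines of that tutorial: reading a single file's size from its attributes, and walking a tree with walkFileTree so only metadata is touched, never file contents.

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

public class NioSizes {

    // Size of a single file, read from its metadata.
    static long fileSize(Path file) throws IOException {
        return Files.readAttributes(file, BasicFileAttributes.class).size();
    }

    // Total size of a directory tree; only attributes are read, never contents.
    static long directorySize(Path dir) throws IOException {
        final long[] total = {0};
        Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                total[0] += attrs.size();
                return FileVisitResult.CONTINUE;
            }
        });
        return total[0];
    }

    public static void main(String[] args) throws IOException {
        Path target = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(Files.isDirectory(target)
                ? directorySize(target) : fileSize(target));
    }
}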
Answering my own question..
This is not the best solution but works in my case..
I have created a batch script to get the size of the directory and then read it in the Java program. It gives me a shorter execution time when the number of files in the directory is more than 1 lakh (100,000), which is always the case for me. sizeOfDirectory takes around 30255 ms, and with the batch script I get 1700 ms. For a smaller number of files the batch script is more costly.
I'll add to what Peter Lawrey answered: when a directory has a lot of files directly inside it (not in subdirectories), File.listFiles() is extremely slow (I don't have exact numbers, I know it from experience). The number of files has to be large, several thousand if I remember correctly. If this is your case, what FileUtils will do is actually try to load all of their names into memory at once, which can be quite memory-consuming.
If that is your situation, I would suggest restructuring the directory to have some sort of hierarchy that ensures a small number of files in each sub-directory.
