A colleague of mine was telling me that he didn't see any issues with using tarballs to hold the data sent to a MapReduce job. My understanding of how Hadoop and MR/Spark work together is that the preferred Hadoop storage formats are designed so that the data files can be split along the Hadoop block size and fanned out to MR mappers or Spark workers (to be partitioned). Tar strikes me as a really terrible format for this because, AFAIK, tar is not in any way designed to accommodate the way Hadoop and its various job engines work. Am I missing something here?
Related
I have read lots of blog entries and articles about the "small files problem" in Hadoop, but many of them simply seem to be copy-pasted from one another. Furthermore, they all seem a little dated, and even the most recent ones (circa 2015) essentially describe what this Cloudera blog post already covered back in early 2009.
Does this mean no archiving solution has been found in 6 years?
Here is the reason for my research: I need to move and catalogue files as they arrive, in varying numbers, sometimes even one at a time, and then store them in HDFS.
These files will later be accessed and returned through a web service layer (it must be fast), to be opened and viewed by people or software.
The files may be videos, images, documents, whatever, and they need to be accessed later using an ID I produce with the Java class UUID.
The choice to use HDFS is entirely my PM's: I proposed HBase to compensate for the lack of indexing in HDFS (although I'm not sure it is an optimal solution), but he asked me to look outside HBase anyway in case we have to deal with bigger files (so far the biggest among 1000 has been 2 MB, but we expect 1 GB videos).
As far as I have understood, the small files problem happens when you use MapReduce jobs, because of memory consumption, but I was wondering:
Does it really matter how many files there are in HDFS if I am using Spark to extract them? Or if I am using webhdfs/v1/? Or Java?
Talking about storing a group of small files, so far I've found three main solutions, all of which are quite inconvenient in a production environment:
HAR: looks fantastic with the indexed file extraction, but the fact that I cannot append or add new files is quite troublesome. Does opening and recreating HARs put a heavy load on the system?
Sequence Files have the opposite pros and cons: you can append files, but they're not indexed, so there is an O(n) look-up time. Is it worth it? (See the sketch after this list.)
Merge them: impossible to do in my case.
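For reference, here is a minimal sketch (not production code) of what the SequenceFile option would look like for my use case: small files packed into one container keyed by UUID, with retrieval as the O(n) scan discussed above. The paths are placeholders, and the appendIfExists option assumes a reasonably recent Hadoop release.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.UUID;

    public class SmallFilePacker {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path container = new Path("/archive/container.seq");   // placeholder HDFS path

            // Append one incoming file under a fresh UUID key.
            String id = UUID.randomUUID().toString();
            byte[] content = Files.readAllBytes(Paths.get("/tmp/incoming/doc.pdf"));

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(container),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class),
                    SequenceFile.Writer.appendIfExists(true))) {
                writer.append(new Text(id), new BytesWritable(content));
            }

            // Retrieval: linear scan until the key matches -- the O(n) look-up.
            try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                    SequenceFile.Reader.file(container))) {
                Text key = new Text();
                BytesWritable value = new BytesWritable();
                while (reader.next(key, value)) {
                    if (key.toString().equals(id)) {
                        Files.write(Paths.get("/tmp/out/" + id), value.copyBytes());
                        break;
                    }
                }
            }
        }
    }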
Is there some new technology I'm missing regarding this common problem? Something along the lines of Avro or Parquet, but for files?
Here is some feedback on your solutions:
a) HAR is not appendable. You can unarchive and re-archive your HAR with the new files via the HDFS command line interface. Both methods are implemented as MapReduce jobs, so execution time depends on your compute cluster as well as on the size of your archive files. My colleague and I developed and use AHAR, a tool that allows you to append data more efficiently without rewriting the whole archive.
b) As far as I know, you are right about the high index look-up time. But note that with HAR you also have a higher look-up time due to a two-step indexing strategy.
This post gives you a very good overview of the small file problem and possible solutions. Maybe you can "just" increase the memory at the NameNode.
I have a question regarding the implementation of Hadoop in one of my projects. Basically the requirement is that we receive a bunch of logs on a daily basis containing information about videos (when a video was played, when it stopped, which user played it, etc.).
What we have to do is analyze these files and return stats data in response to an HTTP request.
Example request: http://somesite/requestData?startDate=someDate&endDate=anotherDate. Basically this request asks for the count of all videos played within a date range.
My question is: can we use Hadoop to solve this?
I have read in various articles that Hadoop is not real-time. So to approach this scenario, should I use Hadoop in conjunction with MySQL?
What I have thought of doing is to write a Map/Reduce job and store the count for each video for each day in MySQL. The Hadoop job can be scheduled to run once a day. The MySQL data can then be used to serve the requests in real time.
Is this approach correct? Is Hive useful in this in any way? Please provide some guidance on this.
Yes, your approach is correct - you can create the per-day data with an MR job or Hive and store it in MySQL for serving in real time.
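A minimal sketch of such a daily aggregation job, assuming a tab-separated log line containing a date, a video id and an event type (the field layout is a placeholder for your actual log format):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DailyPlayCounts {

        public static class PlayMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text outKey = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws java.io.IOException, InterruptedException {
                // Assumed layout: date <TAB> videoId <TAB> event <TAB> ...
                String[] fields = line.toString().split("\t");
                if (fields.length >= 3 && "PLAY".equals(fields[2])) {
                    outKey.set(fields[0] + "\t" + fields[1]);   // key = date + videoId
                    ctx.write(outKey, ONE);
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws java.io.IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                ctx.write(key, new IntWritable(sum));   // (date, videoId) -> play count
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "daily-play-counts");
            job.setJarByClass(DailyPlayCounts.class);
            job.setMapperClass(PlayMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // raw logs
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // per-day counts
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The resulting (date, videoId, count) rows can then be loaded into MySQL (for example with Sqoop or a plain LOAD DATA), and the HTTP layer only has to sum the counts for the requested date range.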
However, newer versions of Hive, when configured with Tez, can provide decent query performance. You could try storing your per-day data in Hive and serving it directly from there. If the query is a simple select, it should be fast enough.
Deciding to use Hadoop is an investment, as you'll need clusters and development/operational effort.
For a Hadoop solution to make sense, your data must be big. Big, as in terabytes of data, coming in real fast, possibly without proper catalog information. If you can store/process your data in your current environment, run your analysis there.
Assuming your aim is not educational, I strongly recommend you to reconsider your choice of Hadoop. Unless you have real big data, it'll only cost you more effort.
On the other hand, if you really need a distributed solution, I think your approach of daily runs is correct, except that there are better alternatives to writing a Map/Reduce job, such as Hive, Pig or Spark.
I have an .exe file (I don't have the source files, so I won't be able to edit the program) that takes as a parameter the path of the file to be processed and produces results at the end. For example, in the console I run this program as follows: program.exe -file file_to_process [other_parameters]. I also have an executable jar file which takes two parameters, file_to_process and a second file, plus [other_parameters]. In both cases I would like to split the input file into smaller parts and run the programs in parallel. Is there any way to do this efficiently with the Apache Spark Java framework? I'm new to parallel computation and I have read about RDDs and the pipe operator, but I don't know whether they would be a good fit in my case, because what I have is a path to a file.
I would be very grateful for some help or tips.
I ran into similar issues recently, and I have working code with Spark 2.1.0. The basic idea is that you put your exe with its dependencies (such as DLLs) into HDFS or your local filesystem, and use addFile to register them with the driver, which will also copy them to the worker executors. Then you can load your input as an RDD and use the mapPartitionsWithIndex function to save each partition locally and run the exe against it with Process (use SparkFiles.get to obtain the path on the worker executor).
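A rough sketch of that approach in Java (the program name and the -file flag come from the question above; the paths and the partition count are placeholders):

    import org.apache.spark.SparkConf;
    import org.apache.spark.SparkFiles;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;
    import java.util.Collections;

    public class RunExePerPartition {
        public static void main(String[] args) throws Exception {
            JavaSparkContext jsc = new JavaSparkContext(
                    new SparkConf().setAppName("run-exe-per-partition"));

            // Ship the executable (and any DLLs) to every executor.
            jsc.addFile("hdfs:///tools/program.exe");

            JavaRDD<String> lines = jsc.textFile("hdfs:///data/file_to_process", 8);

            JavaRDD<String> results = lines.mapPartitionsWithIndex((index, it) -> {
                // Write this partition to a local temp file on the executor.
                File part = File.createTempFile("part-" + index + "-", ".txt");
                try (BufferedWriter w = new BufferedWriter(new FileWriter(part))) {
                    while (it.hasNext()) {
                        w.write(it.next());
                        w.newLine();
                    }
                }
                // Run the external program against that local file.
                String exe = SparkFiles.get("program.exe");
                Process p = new ProcessBuilder(exe, "-file", part.getAbsolutePath())
                        .redirectErrorStream(true)
                        .start();
                p.waitFor();
                // Only the exit code is reported here; a real job would collect
                // whatever output the program writes for this partition.
                return Collections.singletonList(
                        "partition " + index + " exit=" + p.exitValue()).iterator();
            }, false);

            results.collect().forEach(System.out::println);
            jsc.stop();
        }
    }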
Hope that helps.
I think the general answer is "no". Spark is a framework, and in general it administers very specific mechanisms for cluster configuration, shuffling its own data, reading big inputs (based on HDFS), monitoring task completion and retries, and performing efficient computation. It is not well suited for a case where you have a program you can't touch that expects a file on the local filesystem.
I guess you could put your inputs on HDFS and then, since Spark accepts arbitrary Java/Scala code, use whatever language facilities you have to dump each chunk to a local file, launch a process (i.e. this), then build some complex logic to monitor for completion (maybe based on the content of the output). The mapPartitions() Spark method would be the one best suited for this.
That said, I would not recommend it. It will be ugly and complex, require you to mess with permissions on the nodes and things like that, and it would not take good advantage of Spark's strengths.
Spark is well suited for your problem though, especially if each line of your file can be processed independently. I would look to see whether there is a way to get the program's code, a library that does the same thing, or whether the algorithm is trivial enough to re-implement.
Probably not the answer you were looking for though :-(
I want to get better performance for data processing with Hadoop MapReduce. Do I need to use it along with HDFS, or can MapReduce be used with other kinds of distributed storage? Show me the way, please...
Hadoop is a framework which includes the MapReduce programming model for computation and HDFS for storage.
HDFS stands for Hadoop Distributed File System, which is inspired by the Google File System. The overall Hadoop project is based on the research papers published by Google:
research.google.com/archive/mapreduce-osdi04.pdf
http://research.google.com/archive/mapreduce.html
Using the MapReduce programming model, data is computed in parallel on different nodes across the cluster, which decreases the processing time.
You need to use HDFS or HBase to store your data in the cluster to get high performance. If you choose a normal file system instead, there will not be much difference. Once the data goes into the distributed file system, it is automatically divided into blocks and replicated (3 times by default) to provide fault tolerance. None of this is possible with a normal file system.
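If you want to see that splitting and replication for yourself, a small sketch like the following (the path is a placeholder) lists the blocks of a file and the datanodes holding each replica:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlocks {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/big-input.txt"));

            // One BlockLocation per HDFS block; getHosts() lists the datanodes
            // holding that block's replicas (3 by default).
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset=" + loc.getOffset()
                        + " length=" + loc.getLength()
                        + " replicas=" + String.join(",", loc.getHosts()));
            }
        }
    }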
Hope this helps!
First, your premise is wrong: the performance of Hadoop MapReduce is not directly related to the performance of HDFS. MapReduce is considered slow because of its architecture:
It processes data with Java. Each separate mapper and reducer is a separate JVM instance, which needs to be started, and that takes some time.
It writes intermediate data to the HDDs many times. At a minimum, mappers write their results (one), reducers read and merge them, writing the result set to disk (two), and the reducer results are written back to your filesystem, usually HDFS (three). You can find more details on the process here: http://0x0fff.com/hadoop-mapreduce-comprehensive-description/.
Second, Hadoop is an open framework and it supports many different filesystems. You can read data from FTP, S3, a local filesystem (an NFS share, for instance), MapR-FS, IBM GPFS, GlusterFS by Red Hat, etc. So you are free to choose the one you like. The main idea is that the MapReduce job specifies an InputFormat and OutputFormat able to work with your filesystem.
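For instance, a minimal job driver can point its input and output at non-HDFS locations just by using different URI schemes (the bucket and mount paths below are placeholders, and s3a requires the hadoop-aws connector on the classpath):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class NonHdfsPaths {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "non-hdfs-io");
            job.setJarByClass(NonHdfsPaths.class);

            // Input read from S3, output written to a shared local/NFS mount --
            // both resolved through Hadoop's FileSystem abstraction.
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path("s3a://my-bucket/logs/"));

            job.setOutputFormatClass(TextOutputFormat.class);
            FileOutputFormat.setOutputPath(job, new Path("file:///mnt/shared/out"));

            // Mapper/Reducer omitted; the identity map/reduce is used if none is set.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }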
Spark at the moment is considered a faster replacement for Hadoop MapReduce, as it keeps much of the computation in memory. But whether to use it really depends on your case.
I am trying to write a Hadoop MapReduce program in Java for which the input is an array and the output is also an array. But so far I have only seen people use files as its inputs and outputs. So I was just wondering whether MapReduce can have any other input and output formats.
Thanks
A wide variety of input and output formats are supported by Hadoop. Check the subclasses of InputFormat and OutputFormat, and extend InputFormat and OutputFormat if any custom formats are required. Check this article from Cloudera on the DB input/output formats.
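As an illustration of a non-file source, here is a rough sketch of reading map input rows from a database with DBInputFormat (the table, columns, JDBC URL and credentials are all placeholders):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;

    public class DbInputExample {

        // One row of a hypothetical "videos" table.
        public static class VideoRecord implements Writable, DBWritable {
            long id;
            String title;

            public void readFields(ResultSet rs) throws SQLException {
                id = rs.getLong(1);
                title = rs.getString(2);
            }
            public void write(PreparedStatement st) throws SQLException {
                st.setLong(1, id);
                st.setString(2, title);
            }
            public void readFields(DataInput in) throws IOException {
                id = in.readLong();
                title = in.readUTF();
            }
            public void write(DataOutput out) throws IOException {
                out.writeLong(id);
                out.writeUTF(title);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                    "jdbc:mysql://dbhost/mydb", "user", "password");

            Job job = Job.getInstance(conf, "db-input-example");
            job.setJarByClass(DbInputExample.class);
            job.setInputFormatClass(DBInputFormat.class);

            // The mapper will receive (LongWritable rowId, VideoRecord value) pairs.
            DBInputFormat.setInput(job, VideoRecord.class,
                    "videos", null /* conditions */, "id" /* orderBy */, "id", "title");

            // ... set mapper/reducer, output format and output path as usual ...
        }
    }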
Hadoop is built around a distributed file system, and the point of map-reduce is to tackle amounts of data that would usually not fit in memory - so input and output are usually stored on disk somehow (a.k.a. files).
The Hadoop MapReduce classes come with support for reading the different types of files supported by Hadoop (text files, sequence files), and you can also write your own sources; e.g. HBase comes with a map-reduce wrapper that reads its own file format. I haven't tried that, but you can, as the article pointed to by Praveen demonstrates, read from other sources.
Output is even easier - since you're writing Java code you can do whatever you like in your reduce phase, so if you want to, say, put messages into a queue in the reduce phase, just do that.