hadoop/emr how to store key-value pairs - java

I am running a series of MapReduce jobs on EMR. The 3rd MapReduce job needs the output of the 2nd MapReduce job, and that output is essentially over a million key-value pairs (both the key and the value are less than 1 KB). Is there a good way to store this information in a distributed store on the same cluster as the EMR jobs so the subsequent jobs can access it? I looked at DistributedCache, but isn't it more for storing files? I am not sure Hadoop is optimized for storing a million tiny files...
Or maybe I can somehow use another MapReduce job to combine all of the key-value pairs into ONE output file, and then put that entire file into DistributedCache.
Please advise. Thanks!

Usually, the output of a MapReduce job is stored in HDFS (or S3). The number of reducers of that job determines the number of output files. Why would you end up with a million tiny files? Do you run a million reducers? I doubt it.
So if you define a single reducer for your 2nd job, you'll automatically end up with a single output file, which will be stored in HDFS. Your 3rd job will be able to access and process this file as input. If the 2nd job needs multiple reducers, you'll have multiple output files. One million key-value pairs, with key and value up to 1 KB each, give you a file of less than 2 GB. With an HDFS block size of 64 MB, the file is stored as roughly 64 MB blocks, which allows the 3rd job to process those blocks in parallel (multiple mappers).
You should use DistributedCache only if the whole file needs to be read by every single mapper. At a size of up to 2 GB, however, that is a rather flawed approach.
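For reference, a minimal driver sketch of the single-reducer setup with the standard org.apache.hadoop.mapreduce API (class and path names are placeholders; plug in your 2nd job's mapper and reducer where indicated):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SecondJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "second-job");
        job.setJarByClass(SecondJobDriver.class);
        // job.setMapperClass(...);   // your 2nd job's mapper
        // job.setReducerClass(...);  // your 2nd job's reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // One reducer => exactly one output file (part-r-00000) in the output directory.
        job.setNumReduceTasks(1);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // output of job 1
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // read by job 3 as input
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}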

Related

Google Dataflow batch file processing poor performance

I'm trying to build a pipeline with Apache Beam 2.16.0 for processing a large number of XML files. The average count is seventy million per 24 hours, and at peak load it can go up to half a billion.
File sizes vary from ~1 KB to 200 KB (sometimes they can be even bigger, for example 30 MB).
Each file goes through various transformations, and the final destination is a BigQuery table for further analysis. So, first I read the XML file, then deserialize it into a POJO (with the help of Jackson), and then apply all required transformations. The transformations work pretty fast; on my machine I was able to get about 40,000 transformations per second, depending on file size.
My main concern is file reading speed. I have a feeling that all reading is done by only one worker, and I don't understand how this can be parallelized. I tested on a dataset of 10k test files.
A batch job on my local machine (MacBook Pro 2018: SSD, 16 GB RAM and a 6-core i7 CPU) can parse about 750 files/sec. If I run this on Dataflow, using an n1-standard-4 machine, I get only about 75 files/sec. It usually doesn't scale up, but even when it does (sometimes up to 15 workers), I get only about 350 files/sec.
More interesting is the streaming job. It immediately starts with 6-7 workers, and on the UI I can see 1200-1500 elements/sec, but usually it doesn't show a speed, and if I select the last item on the page, it shows that it has already processed 10000 elements.
The only difference between the batch and the streaming job is this option for FileIO:
.continuously(Duration.standardSeconds(10), Watch.Growth.never()))
Why does this make such a big difference in processing speed?
Run parameters:
--runner=DataflowRunner
--project=<...>
--inputFilePattern=gs://java/log_entry/*.xml
--workerMachineType=n1-standard-4
--tempLocation=gs://java/temp
--maxNumWorkers=100
Run region and bucket region are the same.
Pipeline:
pipeline.apply(
        FileIO.match()
            .withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW)
            .filepattern(options.getInputFilePattern())
            .continuously(Duration.standardSeconds(10), Watch.Growth.never()))
    .apply("xml to POJO", ParDo.of(new XmlToPojoDoFn()));
Example of xml file:
<LogEntry><EntryId>0</EntryId>
<LogValue>Test</LogValue>
<LogTime>12-12-2019</LogTime>
<LogProperty>1</LogProperty>
<LogProperty>2</LogProperty>
<LogProperty>3</LogProperty>
<LogProperty>4</LogProperty>
<LogProperty>5</LogProperty>
</LogEntry>
The real-life files and project are much more complex, with lots of nested nodes and a huge number of transformation rules.
Simplified code on GitHub: https://github.com/costello-art/dataflow-file-io
It contains only the "bottleneck" part: reading files and deserializing them into POJOs.
If I can process about 750 files/sec on my machine (which is one powerful worker), then I would expect about 7500 files/sec on 10 similar workers in Dataflow.
I tried to write some test code to check the behavior of FileIO.match and the number of workers [1].
In that code I set numWorkers to 50, but you can set whatever value you need. What I could see is that FileIO.match will find all the files that match the pattern, but after that you must deal with the content of each file separately.
For example, in my case I created a method that receives each file and then splits its content by the newline (\n) character (but you can handle it however you want; it also depends on the type of file: CSV, XML, ...).
Then I transformed each line into a TableRow, the format that BigQuery understands, and returned each value separately (out.output(tab)); this way Dataflow can handle the lines on different workers depending on the workload of the pipeline, for example 3000 lines spread over 3 different workers, each one with 1000 lines.
At the end, since it is a batch process, Dataflow waits to process all the lines and then inserts them into BigQuery.
I hope this test code helps you with yours.
[1] https://github.com/GonzaloPF/dataflow-pipeline/blob/master/java/randomDataToBQ/src/main/fromListFilestoBQ.java
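For illustration, here is a hedged sketch of that pattern (this is not the linked code; the file pattern is a placeholder, and the XML-to-POJO/TableRow step is left as a comment). FileIO.readMatches() plus a per-line fan-out lets Dataflow redistribute the work across workers:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.fs.EmptyMatchTreatment;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;

public class FileLinesPipeline {

  // Reads a whole matched file and emits one element per line.
  static class FileToLinesFn extends DoFn<FileIO.ReadableFile, String> {
    @ProcessElement
    public void processElement(@Element FileIO.ReadableFile file, OutputReceiver<String> out)
        throws Exception {
      for (String line : file.readFullyAsUTF8String().split("\n")) {
        out.output(line);
      }
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(FileIO.match()
            .filepattern("gs://your-bucket/path/*.xml")        // placeholder pattern
            .withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW))
        .apply(FileIO.readMatches())                           // distributes matched files to workers
        .apply("file to lines", ParDo.of(new FileToLinesFn()))
        .apply(Reshuffle.viaRandomKey());                      // break fusion so lines spread across workers
        // ... continue with your XML -> POJO (Jackson) -> TableRow transforms and BigQueryIO here
    p.run();
  }
}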

How does Hadoop distribute data and MapReduce tasks across multiple data nodes

I am new to Hadoop and I have read many pages about Hadoop MapReduce and HDFS, but I am still not able to clear up one concept.
Maybe this question is foolish or unusual; if so, I am sorry for that.
My question is: suppose I have created a word count program in Hadoop for a file of size 1 GB, in which the map function takes each line as input and outputs key-value pairs, and the reduce function takes key-value pairs as input, simply iterates over the list, and counts the total number of times a word appears in the file.
Now, since this file is stored in chunks across multiple data nodes, and MapReduce executes on each data node in parallel: say my file is stored on two data nodes, the chunk on the first data node contains the word "hadoop" 5 times, and the chunk on the second data node contains the word "hadoop" 7 times. Then the output of the whole MapReduce process would basically be:
hadoop:7
hadoop:5
because the 2 map-reduce functions are executed on 2 different data nodes in parallel.
But the output should be the sum of the counts of the word "hadoop" in both chunks, which is:
hadoop:13
So how would I achieve this, or am I missing some concept here? Please help; I am badly stuck on this concept, and I am sorry if I am unable to make you understand what I want to ask.
You might have read many pages about Hadoop MapReduce and HDFS, but you seem to have missed the ones covering the stage after Map and before Reduce, which is called Shuffle and Sort.
Basically, it shuffles the data from all mappers and sends the pairs with the same key to the same reducer, in sorted order. So, in your case, both hadoop:7 and hadoop:5 will go to the same reducer, which will reduce them to hadoop:12 (not 13!).
You can easily find more information about Shuffle and Sort on the web. There are questions like this one too, which you can read.
I think you are completely missing the concept of the reducer, because that is exactly its function: the reducer's input will be a key (in this case "hadoop") and the list of values associated with that key (7 and 5), so your reducer program will iterate over the values list, do the summation, and output hadoop, 12.
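For reference, a minimal word-count sketch with the standard org.apache.hadoop.mapreduce API (class names are arbitrary), showing where the shuffle-grouped values get summed:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);            // e.g. ("hadoop", 1) emitted on every data node
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();                      // 7 + 5 = 12 for the key "hadoop"
      }
      context.write(key, new IntWritable(sum));
    }
  }
}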

Split large space-separated file into chunks of small files

I have an input file of about 2 GB. It contains numbers (duplicates possible) from 1 to 9999, space separated. I want to read the file in small chunks (chunks of, say, 100,000 or 20,000 numbers). What approach should I take?
I am planning to process these chunks of data on different nodes in a distributed fashion. I cannot use HDFS or any other file system that would chunk the data automatically.
When you store that 2 GB of data in HDFS, it will be broken down into blocks. The default block size for HDFS is 64 MB. You can set it to any size you wish. For example, if you set the block size to 100 MB, your data will be broken down into approximately 20 blocks.
On the other hand, when you process the data through MapReduce, you can decide how much data each mapper processes, and therefore how many mappers run, by setting the split size.
For example, if you have 20 blocks of 100 MB in HDFS as mentioned earlier and you do not set any split size, Hadoop will figure that out for you and assign 20 mappers. But if you specify, for example, a split size of 25 MB, then you will have 80 mappers processing your data.
It is important to note that this is just an example. In practice, a higher number of mappers does not necessarily mean faster processing time. You'd have to look into optimisation to find the best number of splits to use.
Hope this helps.
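As an illustration, a driver sketch (paths are placeholders, not from the question) that caps the input split size so the job gets more mappers:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitSizeExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "split-size-example");
    job.setJarByClass(SplitSizeExample.class);

    long maxSplit = 25L * 1024 * 1024;                 // 25 MB per split
    FileInputFormat.setMaxInputSplitSize(job, maxSplit);
    // Equivalent configuration property:
    // job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.maxsize", maxSplit);

    FileInputFormat.addInputPath(job, new Path("/input/numbers.txt"));
    FileOutputFormat.setOutputPath(job, new Path("/output/chunked"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}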

Hadoop: Processing large serialized objects

I am working on the development of an application to process (and merge) several large Java serialized objects (on the order of GBs in size) using the Hadoop framework. Hadoop stores the blocks of a file distributed across different hosts. But since deserialization will require all the blocks to be present on a single host, it is going to hurt performance drastically. How can I deal with this situation, where the different blocks cannot be processed individually, unlike text files?
There are two issues: one is that each file must (in the initial stage) be processed in whole: the mapper that sees the first byte must handle all the rest of that file. The other problem is locality: for best efficiency, you'd like all the blocks of each such file to reside on the same host.
Processing files in whole:
One simple trick is to have the first-stage mapper process a list of filenames, not their contents. If you want 50 map jobs to run, make 50 files each with that fraction of the filenames. This is easy and works with java or streaming hadoop.
Alternatively, use a non-splittable input format such as NonSplitableTextInputFormat.
For more details, see "How do I process files, one per map?" and "How do I get each of my maps to work on one complete input-file?" on the hadoop wiki.
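For the non-splittable route, the usual pattern is a tiny InputFormat subclass like the sketch below (the class name is arbitrary); you would then set it with job.setInputFormatClass(WholeFileTextInputFormat.class):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;   // never split: the mapper that opens a file reads it end to end
  }
}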
Locality:
This leaves a problem, however: the blocks you are reading from are distributed all across HDFS, which is normally a performance gain but here a real problem. I don't believe there's any way to constrain certain blocks to travel together in HDFS.
Is it possible to place the files in each node's local storage? This is actually the most performant and easiest way to solve this: have each machine start jobs to process all the files in e.g. /data/1/**/*.data (being as clever as you care to be about efficiently using local partitions and number of CPU cores).
If the files originate from a SAN or from, say, S3 anyway, try just pulling from there directly: it's built to handle the swarm.
A note on using the first trick: If some of the files are much larger than others, put them alone in the earliest-named listing, to avoid issues with speculative execution. You might turn off speculative execution for such jobs anyway if the tasks are dependable and you don't want some batches processed multiple times.
It sounds like your input file is one big serialized object. Is that the case? Could you make each item its own serialized value with a simple key?
For example, if you wanted to use Hadoop to parallelize the resizing of images, you could serialize each image individually and use a simple index key. Your input file would be a text file with key-value pairs, where the key is the index and the value is the serialized blob.
I use this method when doing simulations in Hadoop. My serialized blob is all the data needed for the simulation, and the key is simply an integer representing a simulation number. This allows me to use Hadoop (in particular Amazon Elastic MapReduce) like a grid engine.
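If the blobs are binary, a SequenceFile keyed by an index is a natural way to pack them. A hedged sketch of the packing step (paths, the loop and serializeItem are placeholders, not from the answer):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;

public class BlobPacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path out = new Path("/data/simulations.seq");

    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(out),
            SequenceFile.Writer.keyClass(IntWritable.class),
            SequenceFile.Writer.valueClass(BytesWritable.class))) {
      for (int i = 0; i < 1000; i++) {
        byte[] blob = serializeItem(i);                      // your serialized object goes here
        writer.append(new IntWritable(i), new BytesWritable(blob));
      }
    }
  }

  private static byte[] serializeItem(int i) {
    return ("payload-" + i).getBytes();                      // stand-in for real serialization
  }
}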
I think the basic (unhelpful) answer is that you can't really do this, since this runs directly counter to the MapReduce paradigm. Units of input and output for mappers and reducers are records, which are relatively small. Hadoop operates in terms of these, not file blocks on disk.
Are you sure your process needs everything on one host? Anything that I'd describe as a merge can be implemented pretty cleanly as a MapReduce where there is no such requirement.
If you mean that you want to ensure certain keys (and their values) end up on the same reducer, you can use a Partitioner to define how keys are mapped onto reducer instances. Depending on your situation, this may be what you are really after.
I'll also say that it kind of sounds like you are trying to operate on HDFS files directly, rather than write a Hadoop MapReduce job. So maybe your question is really about how to hold several SequenceFiles open on HDFS, read their records, and merge them manually. That isn't really a Hadoop question then, but it still doesn't require the blocks to be on one host.
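For completeness, a minimal Partitioner sketch; the "same prefix goes to the same reducer" rule is just an illustrative assumption, not something from the question. You would register it with job.setPartitionerClass(PrefixPartitioner.class):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class PrefixPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    // All keys sharing the same prefix (before ':') land on the same reducer.
    String prefix = key.toString().split(":", 2)[0];
    return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}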

Sort a file with huge volume of data given memory constraint

Points:
We process thousands of flat files in a day, concurrently.
Memory constraint is a major issue.
We use a thread for each file we process.
We don't sort by columns. Each line (record) in the file is treated as one column.
Can't Do:
We cannot use unix/linux's sort commands.
We cannot use any database system no matter how light they can be.
And we cannot just load everything into a collection and use its sort mechanism: that would eat up all the memory and the program would get a heap error.
In that situation, how would you sort the records/lines in a file?
It looks like what you are looking for is external sorting.
Basically, you sort small chunks of data first, write them back to disk, and then iterate over those chunks to sort everything.
As others have mentioned, you can process this in steps.
I would like to explain it in my own words (I differ on point 3):
1. Read the file sequentially, processing N records at a time in memory (N is arbitrary, depending on your memory constraint and on the number T of temporary files that you want).
2. Sort the N records in memory and write them to a temp file. Loop until you have written all T temp files.
3. Open all T temp files at the same time, but read only one record per file (with buffers, of course). For each of these T records, find the smallest, write it to the final file, and advance only in that file.
Advantages:
The memory consumption is as low as you want.
You only do double the disk accesses compared to an everything-in-memory approach. Not bad! :-)
Example with numbers:
Original file with 1 million records.
Choose to have 100 temp files, so read and sort 10,000 records at a time, and drop each batch into its own temp file.
Open the 100 temp files at once, and read the first record of each into memory.
Compare those first records, write out the smallest, and advance that temp file.
Repeat the previous step one million times.
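A minimal sketch of these steps in plain Java (a PriorityQueue stands in for the manual "compare the T head records" loop; paths and the chunk size are placeholders):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalSort {
  public static void sort(Path input, Path output, int linesPerChunk) throws IOException {
    List<Path> chunks = new ArrayList<>();

    // Phase 1: read N lines at a time, sort in memory, write each chunk to a temp file.
    try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
      List<String> buffer = new ArrayList<>(linesPerChunk);
      String line;
      while ((line = in.readLine()) != null) {
        buffer.add(line);
        if (buffer.size() == linesPerChunk) {
          chunks.add(writeSortedChunk(buffer));
          buffer.clear();
        }
      }
      if (!buffer.isEmpty()) chunks.add(writeSortedChunk(buffer));
    }

    // Phase 2: open all temp files, keep one "head" line per file, always emit the smallest.
    PriorityQueue<ChunkReader> heads = new PriorityQueue<>();
    for (Path chunk : chunks) {
      ChunkReader r = new ChunkReader(chunk);
      if (r.current != null) heads.add(r);
    }
    try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
      while (!heads.isEmpty()) {
        ChunkReader smallest = heads.poll();
        out.write(smallest.current);
        out.newLine();
        if (smallest.advance()) heads.add(smallest);   // re-insert with its next line
      }
    }
  }

  private static Path writeSortedChunk(List<String> buffer) throws IOException {
    Collections.sort(buffer);
    Path tmp = Files.createTempFile("chunk", ".txt");
    Files.write(tmp, buffer, StandardCharsets.UTF_8);
    return tmp;
  }

  private static class ChunkReader implements Comparable<ChunkReader> {
    final BufferedReader reader;
    String current;

    ChunkReader(Path path) throws IOException {
      reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
      current = reader.readLine();
    }

    boolean advance() throws IOException {
      current = reader.readLine();
      if (current == null) reader.close();
      return current != null;
    }

    @Override
    public int compareTo(ChunkReader other) {
      return current.compareTo(other.current);
    }
  }
}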
EDITED
You mentioned a multi-threaded application, so I wonder...
As we have seen from the discussion of this requirement, using less memory costs performance, by a dramatic factor in this case. So I would also suggest using only one thread to run one sort at a time, rather than running it as a multi-threaded application.
If you run ten threads, each with a tenth of the available memory, your performance will be miserable, much, much worse than a tenth of the initial time. If you use only one thread, queue the other 9 requests, and process them in turn, your overall performance will be much better, and you will finish the ten tasks much faster.
After reading this response :
Sort a file with huge volume of data given memory constraint
I suggest you consider the distribution sort described there. It could be a huge gain in your context.
The improvement over my proposal is that you don't need to open all the temp files at once; you only open one of them. It saves your day! :-)
You can read the files in smaller parts, sort these, and write them to temporary files. Then you read two of them sequentially again and merge them into a bigger temporary file, and so on. If there is only one left, you have your sorted file. Basically, that is the merge sort algorithm performed on external files. It scales quite well to arbitrarily large files but causes some extra file I/O.
Edit: If you have some knowledge about the likely variance of the lines in your files, you can employ a more efficient algorithm (distribution sort). Simplified: you read the original file once and write each line to a temporary file that takes only lines with the same first char (or a certain range of first chars). Then you iterate over all the (now small) temporary files in ascending order, sort them in memory, and append them directly to the output file. If a temporary file turns out to be too big for sorting in memory, you can repeat the same process based on the 2nd char of the lines, and so on. So if your first partitioning was good enough to produce small enough files, you will have only 100% I/O overhead regardless of how large the file is, but in the worst case it can become much more than with the performance-wise stable merge sort.
In spite of your restriction, I would use the embedded database SQLite3. Like you, I work weekly with 10-15 million flat-file lines, and it is very, very fast to import and generate sorted data, and you only need a small, free executable (sqlite3.exe). For example, once you download the .exe, in a command prompt you can do this:
C:> sqlite3.exe dbLines.db
sqlite> create table tabLines(line varchar(5000));
sqlite> create index idx1 on tabLines(line);
sqlite> .separator '\r\n'
sqlite> .import 'FileToImport' tabLines
then:
sqlite> select * from tabLines order by line;
or save to a file:
sqlite> .output out.txt
sqlite> select * from tabLines order by line;
sqlite> .output stdout
I would spin up an EC2 cluster and run Hadoop's MergeSort.
Edit: not sure how much detail you would like, or on what. EC2 is Amazon's Elastic Compute Cloud - it lets you rent virtual servers by the hour at low cost. Here is their website.
Hadoop is an open-source MapReduce framework designed for parallel processing of large data sets. A job is a good candidate for MapReduce when it can be split into subsets that can be processed individually and then merged together, usually by sorting on keys (i.e. the divide-and-conquer strategy). Here is its website.
As mentioned by the other posters, external sorting is also a good strategy. I think the way I would decide between the two depends on the size of the data and speed requirements. A single machine is likely going to be limited to processing a single file at a time (since you will be using up available memory). So look into something like EC2 only if you need to process files faster than that.
You could use the following divide-and-conquer strategy:
Create a function H() that can assign each record in the input file a number. For a record r2 that should be sorted after a record r1, it must return a larger number for r2 than for r1. Use this function to partition all the records into separate files that each fit into memory, so you can sort them. Once you have done that, you can just concatenate the sorted files to get one large sorted file.
Suppose you have this input file, where each line represents a record:
Alan Smith
Jon Doe
Bill Murray
Johnny Cash
Let's build H() so that it uses the first letter of the record; you might get up to 26 files, but in this example you will get just 3:
<file1>
Alan Smith
<file2>
Bill Murray
<file10>
Jon Doe
Johnny Cash
Now you can sort each individual file, which would swap "Jon Doe" and "Johnny Cash" in <file10>. Now, if you just concatenate the 3 files, you'll have a sorted version of the input.
Note that you divide first and only conquer (sort) later. However, you make sure to do the partitioning in such a way that the resulting parts you need to sort don't overlap, which makes merging the result much simpler.
The method by which you implement the partitioning function H() depends very much on the nature of your input data. Once you have that part figured out the rest should be a breeze.
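A rough sketch of this H()-by-first-letter approach in plain Java (assuming each bucket fits in memory; file paths are placeholders, and a real version would keep a buffered writer per bucket instead of appending line by line):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionSort {
  public static void sort(Path input, Path output) throws IOException {
    // H(record) = upper-cased first character; a TreeMap keeps the buckets in sorted key order.
    Map<Character, Path> buckets = new TreeMap<>();

    try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
      String line;
      while ((line = in.readLine()) != null) {
        char h = line.isEmpty() ? ' ' : Character.toUpperCase(line.charAt(0));
        Path bucket = buckets.computeIfAbsent(h, PartitionSort::newBucket);
        Files.write(bucket, Collections.singletonList(line), StandardCharsets.UTF_8,
            StandardOpenOption.APPEND);
      }
    }

    // Sort each bucket independently, then concatenate them in key order.
    Files.deleteIfExists(output);
    Files.createFile(output);
    for (Path bucket : buckets.values()) {
      List<String> lines = new ArrayList<>(Files.readAllLines(bucket, StandardCharsets.UTF_8));
      Collections.sort(lines);
      Files.write(output, lines, StandardCharsets.UTF_8, StandardOpenOption.APPEND);
    }
  }

  private static Path newBucket(char h) {
    try {
      return Files.createTempFile("bucket-" + h + "-", ".txt");
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}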
If your restriction is only that you cannot use an external database system, you could try an embedded database (e.g. Apache Derby). That way, you get all the advantages of a database without any external infrastructure dependencies.
Here is a way to do it without heavy use of sorting inside Java and without using a DB.
Assumptions: You have 1 TB of space, and the files contain or start with a unique number, but are unsorted.
Divide the files N times.
Read those N files one by one, and create one file for each line/number.
Name that file with the corresponding number. While naming, keep a counter updated to store the least count.
Now you already have a root folder of files that can be sorted by name, or you can pause your program to give yourself time to run a command on your OS to sort the files by name. You can do it programmatically too.
Now you have a folder with files sorted by name; using the counter, start taking each file one by one, put its numbers into your OUTPUT file, and close it.
When you are done, you will have one large file with sorted numbers.
I know you mentioned not using a database, no matter how light... so maybe this is not an option. But what about HSQLDB in memory: submit the data, sort it with a query, purge it. Just a thought.
You can use an SQLite file DB: load the data into the DB and then let it sort and return the results for you.
Advantages: No need to worry about writing the best sorting algorithm.
Disadvantages: You will need disk space, and processing is slower.
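A hedged sketch of that approach over JDBC, assuming the xerial sqlite-jdbc driver is on the classpath (file names, table name and batch size are placeholders):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqliteSort {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:sqlite:lines.db")) {
      try (Statement st = conn.createStatement()) {
        st.execute("CREATE TABLE IF NOT EXISTS tab_lines(line TEXT)");
      }

      // Bulk-load the flat file inside one transaction.
      conn.setAutoCommit(false);
      try (BufferedReader in = Files.newBufferedReader(Paths.get("input.txt"), StandardCharsets.UTF_8);
           PreparedStatement ins = conn.prepareStatement("INSERT INTO tab_lines(line) VALUES (?)")) {
        String line;
        int count = 0;
        while ((line = in.readLine()) != null) {
          ins.setString(1, line);
          ins.addBatch();
          if (++count % 10_000 == 0) ins.executeBatch();   // flush in batches to bound memory
        }
        ins.executeBatch();
      }
      conn.commit();

      // Let SQLite do the sorting (it spills to disk on its own when needed).
      try (Statement st = conn.createStatement();
           ResultSet rs = st.executeQuery("SELECT line FROM tab_lines ORDER BY line");
           BufferedWriter out = Files.newBufferedWriter(Paths.get("sorted.txt"), StandardCharsets.UTF_8)) {
        while (rs.next()) {
          out.write(rs.getString(1));
          out.newLine();
        }
      }
    }
  }
}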
https://sites.google.com/site/arjunwebworld/Home/programming/sorting-large-data-files
You can do it with only two temp files - source and destination - and as little memory as you want.
On the first step your source is the original file; on the last step the destination is the result file.
On each iteration:
read from the source file into a sliding buffer a chunk of data half the size of the buffer;
sort the whole buffer;
write the first half of the buffer to the destination file;
shift the second half of the buffer to the beginning and repeat.
Keep a boolean flag that says whether you had to move some records in the current iteration.
If the flag remains false, your file is sorted.
If it is raised, repeat the process using the destination file as the source.
The maximum number of iterations is (file size) / (buffer size) * 2.
You could download GNU sort for Windows: http://gnuwin32.sourceforge.net/packages/coreutils.htm Even if that uses too much memory, it can merge smaller sorted files as well. It automatically uses temp files.
There is also the sort that comes with Windows in cmd.exe. Both of these commands can specify the character column to sort by.
File sort software for big files: https://github.com/lianzhoutw/filesort/
It is based on a file merge sort algorithm.
If you can move forward/backward in a file (seek) and rewrite parts of the file, then you could use bubble sort.
You will have to scan lines in the file, keeping only 2 rows in memory at a time, and swap them if they are not in the right order. Repeat the process until there are no rows left to swap.
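A hedged sketch of that seek-and-swap idea, under the added assumption of fixed-length records (variable-length lines cannot be swapped in place this easily):

import java.io.IOException;
import java.io.RandomAccessFile;

public class FileBubbleSort {
  public static void sort(String path, int recordLength) throws IOException {
    try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
      long records = file.length() / recordLength;
      byte[] a = new byte[recordLength];
      byte[] b = new byte[recordLength];
      boolean swapped = true;

      while (swapped) {                       // repeat passes until nothing moves
        swapped = false;
        for (long i = 0; i + 1 < records; i++) {
          file.seek(i * recordLength);
          file.readFully(a);                  // record i
          file.readFully(b);                  // record i + 1
          if (compare(a, b) > 0) {            // out of order: write them back swapped
            file.seek(i * recordLength);
            file.write(b);
            file.write(a);
            swapped = true;
          }
        }
      }
    }
  }

  private static int compare(byte[] x, byte[] y) {
    return new String(x).compareTo(new String(y));   // lexicographic record comparison
  }
}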
