We have a particular algorithm that we want to integrate with HDFS. The algorithm requires us to access data locally (the work would be done exclusively in the Mapper). However, we do want to take advantage of HDFS in terms of distributing the file (providing reliability and striping). After the calculation is performed, we'd use the Reducer to simply send back the answer, rather than perform any additional work. Avoiding network use is an explicit goal. Is there a configuration setting that would allow us to restrict network data access, so that when a MapReduce job is started it will only access its local DataNode?
UPDATE: Adding a bit of context
We're attempting to analyze this problem with string matching. Assume our cluster has N nodes and a file is stored with N GB of text. The file is stored into HDFS and distributed in even parts to the nodes (1 part per node). Can we create a MapReduce job that launches one process on each node to access the part of the file that's sitting on the same host? Or would the MapReduce framework distribute the work unevenly (e.g. 1 job accessing all N parts of the data, or .5N nodes attempting to process the whole file)?
If you set the number of reduce tasks to zero you can skip the shuffling and therefore the network cost of your algorithm.
When creating your job, this can be done with the following line of code:
job.setNumReduceTasks(0);
I don't know what your algorithm will do, but say it is a pattern matching algorithm looking for occurrences of a particular word; then the mappers would report the number of matches per split. If you want to add up the counts, you need network communication and a reducer.
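To make the per-split counting concrete, here is a minimal plain-Java sketch of the idea. It does not use any Hadoop classes: `SplitMatchCount`, `countInSplit` and `total` are hypothetical names, and the splits are just in-memory strings. It also ignores matches that straddle a split boundary.

```java
import java.util.List;

// Hypothetical stand-in for the pattern-count example; no Hadoop classes are used.
class SplitMatchCount {

    // "Mapper" step: count non-overlapping occurrences of word inside one split.
    static int countInSplit(String split, String word) {
        if (word.isEmpty()) {
            throw new IllegalArgumentException("word must not be empty");
        }
        int count = 0;
        int from = 0;
        while ((from = split.indexOf(word, from)) != -1) {
            count++;
            from += word.length();
        }
        return count;
    }

    // "Reducer" step: summing needs every per-split count in one place,
    // which is where the network communication comes from.
    static int total(List<Integer> perSplitCounts) {
        int sum = 0;
        for (int c : perSplitCounts) {
            sum += c;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> splits = List.of("foo bar foo", "bar baz", "foo");
        for (String split : splits) {
            System.out.println(countInSplit(split, "foo"));
        }
    }
}
```

With zero reducers, each mapper would simply write its own count; only the final `total` step requires gathering the numbers on one machine.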
First google match on a map-only example I found:
Map-Only MR jobs
Setting the reducers to zero would increase the data locality. This means the intermediate data generated by the mappers will be stored on HDFS. Of course, you have no control over which nodes will store the intermediate data, and if its size is greater than the number of mapper slots * block size, then remote access will be attempted to avoid starvation. My advice is to use the delay scheduler and set locality-delay-node-ms and locality-delay-rack-ms to a large value (i.e. the maximum expected running time of your mappers). This makes the delay scheduler wait as long as possible before requesting data remotely. However, this may lead to resource under-utilization and a longer running time (e.g. any node that does not store any data block will be idle for a long time, up to locality-delay-node-ms + locality-delay-rack-ms).
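As a rough illustration of what the two delays mean, here is a simplified model of the decision, not the scheduler's actual code. `LocalityDelaySketch` and `allowed` are hypothetical names, and the model assumes the scheduler first waits the node delay for a node-local slot, then the rack delay for a rack-local slot, before accepting any node.

```java
// Simplified model of delay scheduling; an assumption-laden sketch,
// not taken from the Hadoop source.
class LocalityDelaySketch {

    enum Locality { NODE_LOCAL, RACK_LOCAL, OFF_RACK }

    // Loosest locality level a task may accept after waiting waitedMs.
    static Locality allowed(long waitedMs, long nodeDelayMs, long rackDelayMs) {
        if (waitedMs < nodeDelayMs) {
            return Locality.NODE_LOCAL;   // still insisting on the local DataNode
        }
        if (waitedMs < nodeDelayMs + rackDelayMs) {
            return Locality.RACK_LOCAL;   // now willing to read within the rack
        }
        return Locality.OFF_RACK;         // both delays used up: any node is accepted
    }

    public static void main(String[] args) {
        // Hypothetical delays: 1 s for node locality, 2 s for rack locality.
        System.out.println(allowed(500, 1000, 2000));
        System.out.println(allowed(1500, 1000, 2000));
        System.out.println(allowed(3000, 1000, 2000));
    }
}
```

In this model a node holding no relevant block gets no work until both delays have elapsed, which is the under-utilization mentioned above.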
I have a program which accesses a single RocksDB using multiple threads.
Our workflow for a given document is to read the cache, do some work, then update the cache.
My code uses chained CompletableFutures to process multiple documents in order (and processes the first document before starting the subsequent document). So my RocksDB workload consists of (read, write) repeated several times for the same key.
Most of the time we get the correct value from the cache for each run through the workflow, but occasionally we will get stale data. Each operation could run on one of many threads in the Executor, but they will never run in parallel for the same key.
Is there a way to ensure that we get strong consistency? I wrote a unit test to confirm that this happens, and it happens between 1-3% of the time. I even added a read-after-write, and that reduced the inconsistency, but did not eliminate it.
I'm not sure what you mean by strong consistency here: RocksDB is strongly consistent. There is no across-the-network replication going on where you would see eventual consistency.
If you want a snapshotted read, use a snapshot sequence identifier when doing your reads.
This sounds more like a threading issue where your reads and writes are happening in non-deterministic order.
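If it is an ordering problem, one way to rule it out is to force every (read, write) pair for a key to queue behind the previous one. Below is a minimal sketch of that idea using only the JDK. `PerKeySerializer` is a hypothetical helper, and a `ConcurrentHashMap` stands in for RocksDB, so this shows the chaining pattern rather than any RocksDB API. Note that if one step throws, later steps queued behind it for that key will not run, and the map of tails is never cleaned up.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.UnaryOperator;

// Hypothetical helper: serializes read-modify-write steps per key.
class PerKeySerializer {
    // Last queued step for each key; new work is chained behind it.
    private final ConcurrentHashMap<String, CompletableFuture<Void>> tails = new ConcurrentHashMap<>();
    // Stand-in for the RocksDB cache (assumption: values are plain counters).
    private final ConcurrentHashMap<String, Long> cache = new ConcurrentHashMap<>();
    private final Executor executor;

    PerKeySerializer(Executor executor) {
        this.executor = executor;
    }

    // Queue "read, do work, write" for key behind any earlier step on the same key.
    CompletableFuture<Void> update(String key, UnaryOperator<Long> work) {
        return tails.compute(key, (k, tail) -> {
            CompletableFuture<Void> prev =
                    (tail == null) ? CompletableFuture.completedFuture(null) : tail;
            return prev.thenRunAsync(() -> {
                Long current = cache.getOrDefault(k, 0L); // read
                cache.put(k, work.apply(current));        // write
            }, executor);
        });
    }

    long get(String key) {
        return cache.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        PerKeySerializer s = new PerKeySerializer(pool);
        CompletableFuture<Void> last = CompletableFuture.completedFuture(null);
        for (int i = 0; i < 100; i++) {
            last = s.update("doc-1", v -> v + 1);
        }
        last.join();      // wait for the final step on this key
        pool.shutdown();
        System.out.println(s.get("doc-1"));
    }
}
```

The steps may still hop between threads in the pool, but each one starts only after the previous step for the same key has completed, so a read never overtakes the write before it.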
I have some Java processes (socket programs) running on different servers, some on the same network and some on different networks. Together these processes have the job of maintaining a global counter. A client can connect to any of these processes and issue a command to increase, decrease or get the counter value. The global counter should be eventually consistent (network partitions can occur and we can recover from them).
The solution I have thought of so far is to maintain, on each node, a count of increments and decrements for all the nodes. When an increment command is issued on a node, it increments its own local copy of its count of increments and then broadcasts its increment and decrement counts. The nodes that receive this broadcast take the max of the received counts and their local copy of the sender's counts and store the result as the latest count. When a get command is issued on any node, it returns the difference between the sum of all the increments and the sum of all the decrements. I assume this will take care of cases where broadcasts are received out of order and other unreliabilities. I don't want to use any persistence layer.
Is there a better way to implement this?
What protocol should I use to broadcast the counts? Will gossip on UDP work? Any Java libraries that might help?
You may be aware of this design pattern, but it still may be inspiring: https://en.wikipedia.org/wiki/Observer_pattern
You could simply make all of the instances of the program observe all of the other instances, then they will all notify each other if any one changes (check out the diagram in that link).
As far as Java libraries go, check these out and see if any of them make your life easier:
http://mina.apache.org/
http://commons.apache.org/proper/commons-net/
http://hc.apache.org/
It sounds like you need a PNCounter from Akka's Distributed Data library. It uses Gossip to communicate the counter's state to the network. You also have fine grained control over read and write consistency. So, for example, you can do a ReadMajority where "the value will be read and merged from a majority of replicas".
Incidentally, the PNCounter works as you describe, using two distributed counters to maintain increments and decrements.
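For reference, the merge rule described in the question (per-node increment and decrement counts, merged by taking the max) can be sketched in a few lines of plain Java. This is not Akka's implementation: `PNCounterSketch` is a hypothetical, single-threaded class that leaves out the gossip transport entirely and only shows why out-of-order or repeated broadcasts are harmless.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a PN-counter; not thread-safe and not Akka's class.
class PNCounterSketch {
    private final String nodeId;
    // Per-node totals; each node only ever raises its own entry.
    private final Map<String, Long> increments = new HashMap<>();
    private final Map<String, Long> decrements = new HashMap<>();

    PNCounterSketch(String nodeId) {
        this.nodeId = nodeId;
    }

    void increment() {
        increments.merge(nodeId, 1L, Long::sum);
    }

    void decrement() {
        decrements.merge(nodeId, 1L, Long::sum);
    }

    // Apply a received state: keep the max per node, so applying the same
    // or an older broadcast again changes nothing.
    void merge(PNCounterSketch other) {
        other.increments.forEach((id, n) -> increments.merge(id, n, Long::max));
        other.decrements.forEach((id, n) -> decrements.merge(id, n, Long::max));
    }

    // Sum of all increments minus sum of all decrements.
    long value() {
        long inc = 0;
        for (long n : increments.values()) {
            inc += n;
        }
        long dec = 0;
        for (long n : decrements.values()) {
            dec += n;
        }
        return inc - dec;
    }

    public static void main(String[] args) {
        PNCounterSketch a = new PNCounterSketch("node-a");
        PNCounterSketch b = new PNCounterSketch("node-b");
        a.increment();
        a.increment();
        b.decrement();
        a.merge(b);
        b.merge(a);
        System.out.println(a.value() + " " + b.value());
    }
}
```

Because the merge is a per-entry max, it is commutative, associative and idempotent, which is what lets the replicas converge no matter how the broadcasts are delivered.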
I've set up a 3-node cluster that was distributing tasks (steps? jobs?) pretty evenly until the most recent job, which has been assigned entirely to one machine.
Topology (do we still use this term for flink?):
kafka (3 topics on different feeds) -> flatmap -> union -> map
Is there something about this setup that would tell the cluster manager to put everything on one machine?
Also - what are the 'not set' values in the image? Some step I've missed? Or some to-be-implemented UI feature?
It is actually on purpose that Flink schedules your job on a single TaskManager. In order to understand it let me quickly explain Flink's resource scheduling algorithm.
First of all, in the Flink world a slot can accommodate more than one task (parallel instance of an operator). In fact, it can accommodate one parallel instance of each operator. The reason for this is that Flink executes not only streaming jobs in a streaming fashion but also batch jobs. By streaming fashion I mean that Flink brings all operators of your dataflow graph online so that intermediate results can be streamed directly to downstream operators where they are consumed. By default, Flink tries to combine one task of each operator in one slot.
When Flink schedules the tasks to the different slots, then it tries to co-locate the tasks with their inputs to avoid unnecessary network communication. For sources, the co-location depends on the implementation. For file-based sources, for example, Flink tries to assign local file input splits to the different tasks.
So if we apply this to your job, we see the following. You have three different sources with parallelism 1. All sources belong to the same resource sharing group, thus the single task of each operator will be deployed to the same slot. The initial slot is randomly chosen from the available instances (actually it depends on the order of the TaskManager registration at the JobManager) and then filled up. Let's say the chosen slot is on machine node1.
Next we have the three flat map operators, which have a parallelism of 2. Here again, one of the two sub-tasks of each flat map operator can be deployed to the same slot which already accommodates the three sources. The second sub-task, however, has to be placed in a new slot. When this happens, Flink tries to choose a free slot which is co-located with a slot in which one of the task's inputs is deployed (again to reduce network communication). Since only one slot of node1 is occupied and thus 31 are still free, it will also deploy the 2nd sub-task of each flatMap operator to node1.
The same now applies to the tumbling window reduce operation. Flink tries to co-locate all the tasks of the window operator with its inputs. Since all of its inputs run on node1 and node1 has enough free slots to accommodate 6 sub-tasks of the window operator, they will be scheduled to node1. It's important to note that 1 window task will run in the slot which contains the three sources and one task of each flatMap operator.
I hope this explains why Flink only uses the slots of a single machine for the execution of your job.
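The slot arithmetic behind this can be written down as a small sketch. It uses the numbers from the explanation above (parallelism 1 for the sources, 2 for the flat maps, 6 for the window, 32 slots on node1) and assumes all operators share one slot sharing group. `SlotSharingEstimate` is a hypothetical helper, not a Flink API.

```java
// Hypothetical helper illustrating slot sharing; not part of Flink.
class SlotSharingEstimate {

    // With slot sharing, one slot can hold one sub-task of every operator,
    // so the job needs as many slots as its highest operator parallelism.
    static int requiredSlots(int... parallelisms) {
        int max = 0;
        for (int p : parallelisms) {
            max = Math.max(max, p);
        }
        return max;
    }

    // True when a single TaskManager has enough slots for the whole job.
    static boolean fitsOnOneTaskManager(int slotsPerTaskManager, int... parallelisms) {
        return requiredSlots(parallelisms) <= slotsPerTaskManager;
    }

    public static void main(String[] args) {
        // sources = 1, flat maps = 2, window = 6; node1 offers 32 slots
        System.out.println(requiredSlots(1, 2, 6));
        System.out.println(fitsOnOneTaskManager(32, 1, 2, 6));
    }
}
```

Since the job needs far fewer slots than node1 offers, the co-location preference never has a reason to spill onto the other machines.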
The problem is that you are building a global window on an unkeyed (ungrouped) stream, so the window has to run on one machine.
Maybe you can also express your application logic differently so that you can group the stream.
The "(not set)" part is probably an issue in Flink's DataStream API, which is not setting default operator names.
Jobs implemented against the DataSet API will look like this:
Does Hadoop also process replicas? For example, worker node i, in the mapper phase, processes only the data stored on that machine. After the data (not a replica, but the original) has been processed in the mapper phase, or perhaps before it has finished, can machine i end up processing replica data stored on that machine? Or is a replica used only when some node goes down?
Yes, replicas are also processed in a specific scenario called speculative execution.
If machine i takes too much time to process the data block stored on it, the job's ApplicationMaster starts a duplicate parallel mapper against another replica of the data block stored on a different machine. This new speculative mapper runs on machine j, where the replica is stored.
Whichever mapper completes first has its output used. The other, slower mapper is killed and its resources are released.
Speculative execution is enabled by default. You can toggle it by modifying the properties below.
mapreduce.map.speculative
mapreduce.reduce.speculative
In any case, no more than one replica of a data block is stored on the same machine. Every replica of a data block is kept on a different machine.
The master node (JobTracker) may or may not pick the original data; in fact, it does not keep any information about which of the 3 replicas is the original, because when the data is saved a checksum verification is done on the file and it is saved cleanly. When the JobTracker wants to pick a slot for the mapper, it takes many things into account, such as the number of free map slots, the overhead of a TaskTracker and, last but not least, data locality. The closest node that satisfies almost all criteria is the one picked; it does not matter whether it holds the original or a replica and, as mentioned, that distinction is not even tracked.
I am writing an indexing app for MapReduce.
I was able to split inputs with NLineInputFormat, and now I've got a few hundred mappers in my app. However, only 2 per machine are active at the same time; the rest are "PENDING". I believe that this behavior slows the app significantly.
How do I make Hadoop run at least 100 of those at the same time per machine?
I am using the old hadoop api syntax. Here's what I've tried so far:
conf.setNumMapTasks(1000);
conf.setNumTasksToExecutePerJvm(500);
Neither of those seems to have any effect.
Any ideas how I can make the mappers actually RUN in parallel?
JobConf.setNumMapTasks() is just a hint to the MR framework, and I am not sure what effect calling it has. In your case the total number of map tasks across the whole job should be equal to the total number of lines in the input divided by the number of lines configured in the NLineInputFormat. You can find more details on the total number of map/reduce tasks across the whole job here.
The description for mapred.tasktracker.map.tasks.maximum says
The maximum number of map tasks that will be run simultaneously by a task tracker.
You need to configure mapred.tasktracker.map.tasks.maximum (which defaults to 2) to change the number of map tasks run in parallel on a particular node by the task tracker. I could not find the documentation for 0.20.2, so I am not sure whether the parameter exists or whether the same parameter name is used in the 0.20.2 release.
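To see how the two numbers interact, here is a back-of-the-envelope sketch in plain Java. `MapTaskEstimate` is a hypothetical helper, and the figures in `main` (100,000 input lines, 100 lines per split, 5 nodes) are made up for illustration; only the default of 2 map tasks per task tracker comes from the description above.

```java
// Hypothetical helper for rough estimates; not part of Hadoop.
class MapTaskEstimate {

    // Total map tasks with NLineInputFormat: input lines / lines per split, rounded up.
    static long totalMapTasks(long inputLines, long linesPerSplit) {
        return (inputLines + linesPerSplit - 1) / linesPerSplit;
    }

    // Map tasks that can run at once across the cluster.
    static long concurrentMaps(int nodes, int maxMapsPerTracker) {
        return (long) nodes * maxMapsPerTracker;
    }

    // Number of "waves" needed to run all map tasks.
    static long waves(long totalTasks, long concurrent) {
        return (totalTasks + concurrent - 1) / concurrent;
    }

    public static void main(String[] args) {
        long tasks = totalMapTasks(100_000, 100);  // made-up input size
        long atOnce = concurrentMaps(5, 2);        // 5 nodes, default of 2 per tracker
        System.out.println(tasks + " tasks, " + atOnce + " at once, "
                + waves(tasks, atOnce) + " waves");
    }
}
```

Raising the per-tracker maximum increases how many tasks run at once, but whether the job actually gets faster depends on how many cores and how much memory each machine has.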