Does Hadoop also process replicas? For example, worker node i, in the map phase, processes only the data stored on that machine. Once the original data (not a replica) has been processed in the map phase, or perhaps before it has finished, can it happen that machine i processes replica data stored on that machine? Or are replicas used only when some node goes down?
Yes, replicas can also be processed, in a specific scenario called speculative execution.
If machine i takes too long to process the data block stored on it, the job's ApplicationMaster starts a duplicate mapper in parallel against another replica of that data block stored on a different machine. This speculative mapper runs on the machine j where the replica is stored.
Whichever mapper completes first has its output used; the other, slower mapper is killed and its resources are released.
Speculative execution is enabled by default. You can toggle it with the properties below.
mapreduce.map.speculative
mapreduce.reduce.speculative
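For example, here is a minimal sketch of turning speculative execution off for a single job, assuming the new MapReduce API (the same keys can also be set cluster-wide in mapred-site.xml; "my-job" is just a placeholder name):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: disable speculative execution for this job only.
Configuration conf = new Configuration();
conf.setBoolean("mapreduce.map.speculative", false);
conf.setBoolean("mapreduce.reduce.speculative", false);
Job job = Job.getInstance(conf, "my-job");   // placeholder job name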
In any case, no more than one replica of a data block is ever stored on the same machine; every replica of a block is kept on a different machine.
The master node (JobTracker) may or may not pick the original data; in fact, it does not keep track of which of the 3 replicas is the original. When the data is written, a checksum verification is done and every replica is stored as an equal copy. When the JobTracker picks a slot for a mapper, it takes many things into account, such as the number of free map slots, the overhead on each TaskTracker and, last but not least, data locality. The closest node that satisfies most of these criteria is picked; it does not matter whether that node holds the original or a replica, and as mentioned, that identity is not maintained anyway.
I have a program which accesses a single RocksDB using multiple threads.
Our workflow for a given document is to read the cache, do some work, then update the cache.
My code uses chained CompletableFutures to process multiple documents in order (each document is fully processed before the next one starts). So my RocksDB workload consists of (read, write) repeated several times for the same key.
Most of the time we get the correct value from the cache for each run through the workflow, but occasionally we will get stale data. Each operation could run on one of many threads in the Executor, but they will never run in parallel for the same key.
Is there a way to ensure that we get strong consistency? I wrote a unit test to confirm that this happens, and it happens between 1-3% of the time. I even added a read-after-write, and that reduced the inconsistency, but did not eliminate it.
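(For reference, the chaining looks roughly like the hypothetical sketch below; Document, documents, executor, readFromCache, doWork and writeToCache are placeholder names.)
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of the chaining described above: each document's
// (read, work, write) only starts after the previous document's stages finish.
CompletableFuture<Void> chain = CompletableFuture.completedFuture(null);
for (Document doc : documents) {
    chain = chain.thenApplyAsync(ignored -> readFromCache(doc.key()), executor)
                 .thenApplyAsync(cached -> doWork(doc, cached), executor)
                 .thenAcceptAsync(result -> writeToCache(doc.key(), result), executor);
}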
Not sure what you are referring to as strong consistency: RocksDB is strongly consistent. There is no across-the-network replication going on where you would see eventual consistency.
If you want a snapshotted read, use a snapshot (sequence identifier) when doing your reads.
This sounds more like a threading issue where your reads and writes are happening in a non-deterministic order.
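If you do want a point-in-time view, a snapshot read in the RocksDB Java API looks roughly like the sketch below (db and key are assumed to exist already; error handling is minimal):
import org.rocksdb.ReadOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.Snapshot;

// Sketch: pin a snapshot so the read sees the DB as of one point in time.
static byte[] snapshottedGet(RocksDB db, byte[] key) throws RocksDBException {
    Snapshot snapshot = db.getSnapshot();
    try (ReadOptions readOptions = new ReadOptions().setSnapshot(snapshot)) {
        return db.get(readOptions, key);   // only sees writes made before the snapshot
    } finally {
        db.releaseSnapshot(snapshot);
    }
}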
I have a cluster with 3 nodes (on different machines) and some "business logic" that uses a distributed lock at startup.
Sometimes, when there is more latency, every node successfully acquires the exclusive lock, because the cluster has not yet finished starting up and the nodes do not yet see each other.
Subsequently the nodes see each other and the cluster is correctly configured with 3 nodes. I know there is a MembershipListener to capture the "member added" event, so I could execute the "business logic" again, but I would like to know whether there is a way to tell when cluster startup has properly finished, so that I can delay executing the "business logic" until the cluster is up.
I tried to use hazelcast.initial.wait.seconds, but configuring the right number of seconds isn't deterministic, and I don't know whether this also delays the member join operations.
Afaik, there is no such thing in Hazelcast. As the cluster is dynamic, a node can join and leave at any time, so the cluster is never "complete" or not.
You can, however:
Configure an initial wait, as you described, to help with initial latencies
Use hazelcast.initial.min.cluster.size to define the minimum number of members Hazelcast waits for at startup
Define a minimal quorum: the minimum number of nodes for the cluster to be considered usable/healthy (see cluster quorum)
Use the PartitionService to check whether the cluster is safe or whether there are pending migrations (a rough sketch of this and the minimum-cluster-size property follows below)
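The sketch below assumes a Hazelcast 3.x API and shows only the idea, not a complete startup routine:
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

// Sketch: block instance startup until at least 3 members are present,
// then check that partitions are safe before running the business logic.
Config config = new Config();
config.setProperty("hazelcast.initial.min.cluster.size", "3");

HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
if (hz.getPartitionService().isClusterSafe()) {
    // run the "business logic" here
}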
I've set up a 3-node cluster that was distributing tasks (steps? jobs?) pretty evenly until the most recent one, which has been assigned entirely to one machine.
Topology (do we still use this term for Flink?):
kafka (3 topics on different feeds) -> flatmap -> union -> map
Is there something about this setup that would tell the cluster manager to put everything on one machine?
Also - what are the 'not set' values in the image? Some step I've missed? Or some to-be-implemented UI feature?
It is actually on purpose that Flink schedules your job on a single TaskManager. In order to understand it let me quickly explain Flink's resource scheduling algorithm.
First of all, in the Flink world a slot can accommodate more than one task (parallel instance of an operator). In fact, it can accommodate one parallel instance of each operator. The reason for this is that Flink executes not only streaming jobs but also batch jobs in a streaming fashion. By streaming fashion I mean that Flink brings all operators of your dataflow graph online so that intermediate results can be streamed directly to downstream operators, where they are consumed. By default, Flink tries to combine one task of each operator in one slot.
When Flink schedules the tasks to the different slots, then it tries to co-locate the tasks with their inputs to avoid unnecessary network communication. For sources, the co-location depends on the implementation. For file-based sources, for example, Flink tries to assign local file input splits to the different tasks.
So if we apply this to your job, then we see the following. You have three different sources with parallelism 1. All sources belong to the same resource sharing group, thus the single task of each operator will be deployed to the same slot. The initial slot is chosen randomly from the available instances (actually it depends on the order of TaskManager registrations at the JobManager) and is then filled up. Let's say the chosen slot is on machine node1.
Next we have the three flatMap operators, which have a parallelism of 2. Here again, one of the two sub-tasks of each flatMap operator can be deployed to the same slot which already accommodates the three sources. The second sub-task, however, has to be placed in a new slot. When this happens, Flink tries to choose a free slot which is co-located with a slot in which one of the task's inputs is deployed (again to reduce network communication). Since only one slot of node1 is occupied and thus 31 are still free, it will deploy the 2nd sub-task of each flatMap operator also to node1.
The same now applies to the tumbling window reduce operation. Flink tries to co-locate all the tasks of the window operator with its inputs. Since all of its inputs run on node1 and node1 has enough free slots to accommodate the 6 sub-tasks of the window operator, they will be scheduled to node1. It's important to note that one window task will run in the slot which contains the three sources and one task of each flatMap operator.
I hope this explains why Flink only uses the slots of a single machine for the execution of your job.
The problem is that you are building a global window on an unkeyed (ungrouped) stream, so the window has to run on one machine.
Maybe you can also express your application logic differently so that you can group the stream.
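For example, if your records carry a field you can key on, a keyed window runs with parallelism greater than one and its tasks can be spread across machines. This is only a hypothetical sketch (DataStream API; the tuple field positions and window size are made up):
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;

// Hypothetical sketch: once the stream is keyed, the window operator can run
// with parallelism > 1 and be scheduled across several TaskManagers.
static DataStream<Tuple2<String, Long>> windowedCounts(DataStream<Tuple2<String, Long>> unioned) {
    return unioned
            .keyBy(0)                       // key by the first tuple field
            .timeWindow(Time.seconds(10))   // tumbling 10-second windows per key
            .sum(1);                        // aggregate the second tuple field
}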
The "(not set)" part is probably an issue in Flink's DataStream API, which is not setting default operator names.
Jobs implemented against the DataSet API will look like this:
We have a particular algorithm that we want to integrate with HDFS. The algorithm requires us to access data locally (the work would be done exclusively in the Mapper). However, we do want to take advantage of HDFS in terms of distributing the file (providing reliability and striping). After the calculation is performed, we'd use the Reducer simply to send back the answer, rather than perform any additional work. Avoiding network use is an explicit goal. Is there a configuration setting that would allow us to restrict network data access, so that when a MapReduce job is started it will only access its local DataNode?
UPDATE: Adding a bit of context
We're attempting to analyze this problem with string matching. Assume our cluster has N nodes and a file is stored with N GB of text. The file is stored in HDFS and distributed in even parts to the nodes (1 part per node). Can we create a MapReduce job that launches one process on each node to access the part of the file that's sitting on the same host? Or would the MapReduce framework unevenly distribute the work (e.g. 1 job accessing all N parts of the data, or 0.5N nodes attempting to process the whole file)?
If you set the number of reduce tasks to zero you can skip the shuffling and therefore the network cost of your algorithm.
While creating your job, this can be done with the following line of code:
job.setNumReduceTasks(0);
I don't know what your algorithm will do, but say it is a pattern-matching algorithm looking for occurrences of a particular word; then the mappers would report the number of matches per split. If you want to add up the counts, you need network communication and a reducer.
The first Google match on a map-only example I found:
Map-Only MR jobs
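A minimal map-only driver could look something like the sketch below (MatchDriver and MatchMapper are placeholder class names, and the input/output paths come from the command line):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a map-only job: with zero reducers, each mapper writes its output
// directly to HDFS and the shuffle phase is skipped entirely.
public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "map-only-matcher");  // placeholder job name
    job.setJarByClass(MatchDriver.class);          // placeholder driver class
    job.setMapperClass(MatchMapper.class);         // placeholder Mapper implementation
    job.setNumReduceTasks(0);                      // map-only: no shuffle, no reduce
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input path
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}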
Setting the number of reducers to zero would increase data locality. It means the intermediate data generated by the mappers will be stored on HDFS. Of course, you will not have any control over which nodes store the intermediate data, and if its size is greater than the number of mapper slots * block size, remote access will be attempted to avoid starvation. My advice is to use the delay scheduler and set locality-delay-node-ms and locality-delay-rack-ms to a large value (i.e. the maximum expected running time of your mappers). This makes the delay scheduler wait as long as possible before requesting data remotely. However, this may lead to resource under-utilization and a longer running time (e.g. any node that does not store any data block will be idle for a long time: locality-delay-node-ms + locality-delay-rack-ms).
While a Hadoop job is running or in progress, if I write something to HDFS or HBase, will that data be visible to all nodes in the cluster
1.)immediately?
2.)If not immediately then after how much time?
3.)Or the time really cannot be determined?
HDFS is strongly consistent, so once a write has completed successfully, the new data should be visible across all nodes immediately. Clearly the actual writing takes some time - see replication pipelining for some details on this.
This is in contrast to eventually consistent systems, where it may take an indefinite time (though often only a few milliseconds) before all nodes see a consistent view of the data.
Systems such as Cassandra have tunable consistency - each read and write can be performed at a different level of consistency to suit the operation being performed.
To the best of my understanding, the data is visible immediately after the write operation has finished.
Let's look at some aspects of the process:
When a client writes to HDFS, the data is written to all replicas, and after the write operation has finished it should be fully available
There is also only one place for metadata, the NameNode, which has no notion of isolation that would allow hiding data until some larger piece of work is done.
HBase is a different case, since it immediately writes only its log (WAL) to HDFS, and its HFiles are updated with new data only after compaction. At the same time, once HBase itself writes something into HDFS, that data is visible immediately.
In HDFS, data is visible once it is flushed or synced using the hflush() or hsync() method; these methods were introduced in version 0.21, I believe. hflush() gives you a guarantee that the data is visible to all readers. hsync() gives you a guarantee that the data was saved to disk (although it may still be in the disk cache). The write method alone does not give you any such guarantee. To answer your question: in HDFS, data is visible to everyone immediately after calling hflush() or hsync().
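A rough sketch of the difference, using the Hadoop FileSystem API (the path is just a placeholder):
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: hflush() makes the written bytes visible to new readers,
// hsync() additionally asks the DataNodes to persist them to disk.
static void writeVisibly(FileSystem fs) throws IOException {
    try (FSDataOutputStream out = fs.create(new Path("/tmp/visibility-demo"))) {   // hypothetical path
        out.writeBytes("first record\n");
        out.hflush();   // readers opening the file now can already see "first record"
        out.writeBytes("second record\n");
        out.hsync();    // same visibility, plus the bytes are flushed to disk on the DataNodes
    }
}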