Flink - cluster not using cluster - Java

I've set up a 3-node cluster that was distributing tasks (steps? jobs?) pretty evenly until the most recent job, which has all been assigned to one machine.
Topology (do we still use this term for Flink?):
kafka (3 topics on different feeds) -> flatmap -> union -> map
Is there something about this setup that would tell the cluster manager to put everything on one machine?
Also - what are the '(not set)' values in the image? Is it some step I've missed, or a to-be-implemented UI feature?

It is actually intentional that Flink schedules your job on a single TaskManager. To understand why, let me quickly explain Flink's resource scheduling algorithm.
First of all, in the Flink world a slot can accommodate more than one task (parallel instance of an operator). In fact, it can accommodate one parallel instance of each operator. The reason for this is that Flink executes not only streaming jobs but also batch jobs in a streaming fashion. By streaming fashion I mean that Flink brings all operators of your dataflow graph online so that intermediate results can be streamed directly to downstream operators, where they are consumed. By default, Flink tries to combine one task of each operator in one slot.
When Flink schedules the tasks to the different slots, then it tries to co-locate the tasks with their inputs to avoid unnecessary network communication. For sources, the co-location depends on the implementation. For file-based sources, for example, Flink tries to assign local file input splits to the different tasks.
So if we apply this to your job, we see the following. You have three different sources with parallelism 1. All sources belong to the same resource sharing group, thus the single task of each operator will be deployed to the same slot. The initial slot is chosen randomly from the available instances (actually it depends on the order of TaskManager registration at the JobManager) and then filled up. Let's say the chosen slot is on machine node1.
Next we have the three flatMap operators, which have a parallelism of 2. Here again, one of the two sub-tasks of each flatMap operator can be deployed to the same slot which already accommodates the three sources. The second sub-task, however, has to be placed in a new slot. When this happens, Flink tries to choose a free slot which is co-located with a slot in which one of the task's inputs is deployed (again to reduce network communication). Since only one slot of node1 is occupied and thus 31 are still free, it will deploy the 2nd sub-task of each flatMap operator also to node1.
The same now applies to the tumbling window reduce operation. Flink tries to co-locate all the tasks of the window operator with its inputs. Since all of its inputs run on node1 and node1 has enough free slots to accommodate the 6 sub-tasks of the window operator, they will all be scheduled to node1. It's important to note that one window task will run in the slot which contains the three sources and one task of each flatMap operator.
I hope this explains why Flink only uses the slots of a single machine for the execution of your job.
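If you do want to spread the job across more machines, a couple of knobs are raising the parallelism above a single TaskManager's slot count, or moving operators into their own slot sharing group. A minimal sketch, assuming the DataStream API (sourceA/sourceB/sourceC, MyFlatMap, MyMap, and the parallelism values are hypothetical, not taken from your job):

// sketch only: sourceA/sourceB/sourceC stand in for the three Kafka source streams
DataStream<String> merged = sourceA.union(sourceB, sourceC);

merged
    .flatMap(new MyFlatMap())
    .setParallelism(64)                    // more sub-tasks than one TaskManager has slots
    .slotSharingGroup("flatmap-group")     // and/or move these tasks out of the default sharing group
    .map(new MyMap())
    .setParallelism(64);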

The problem is that you are building a global window on an unkeyed (ungrouped) stream, so the window has to run on one machine.
Maybe you can also express your application logic differently so that you can group the stream.
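For example (a sketch only - the Tuple2 element type, key field, window size, and reduce logic are all hypothetical), keying the stream lets the window operator run with a parallelism greater than one:

// assuming "unioned" is the merged stream and carries Tuple2<String, Integer> records
unioned
    .keyBy(value -> value.f0)                                     // key the stream so the window can run in parallel
    .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))   // keyed tumbling window instead of a global windowAll
    .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));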
The "(not set)" part is probably an issue in Flink's DataStream API, which is not setting default operator names.
Jobs implemented against the DataSet API do get default operator names (screenshot omitted here).
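In the meantime, you can set the names yourself in the DataStream API; a minimal sketch with a hypothetical job:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class NamedOperatorsJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999).name("socket source")
           .filter(line -> !line.isEmpty()).name("non-empty filter")
           .print().name("stdout sink");

        env.execute("named-operators-example");
    }
}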

Related

Kafka KTable - shared aggregation across machines

Assume that I have a topic with numerous partitions. I'm writing K/V data in there and want to aggregate said data in tumbling windows by key.
Assume that I've launched as many worker instances as I have partitions and each worker instance is running on a separate machine.
How would I go about ensuring that the resultant aggregations include all values for each key? I.e., I don't want each worker instance to have some subset of the values.
Is this something that a StateStore would be used for? Does Kafka manage this on its own or do I need to come up with a method?
How would I go about ensuring that the resultant aggregations include all values for each key? I.e., I don't want each worker instance to have some subset of the values.
In general, Kafka Streams ensures that all values for the same key will be processed by the same (and only one) stream task, which also means only one application instance (what you described as "worker instance") will process the values for that key. Note that an app instance may run 1+ stream tasks, but these tasks are isolated.
This behavior is achieved through the partitioning of the data, and Kafka Streams ensures that a partition is always processed by the same and only one stream task. The logical link to keys/values is that, in Kafka and Kafka Streams, a key is always sent to the same partition (there is a gotcha here, but I'm not sure whether it makes sense to go into details for the scope of this question), hence one particular partition -- among possible many partitions -- contains all the values for the same key.
In some situations, such as when joining two streams A and B, you must ensure that the aggregation operates on the same key so that data from both streams is co-located in the same stream task -- which, again, is all about ensuring that the relevant input stream partitions, and thus the matching keys (from A and B, respectively), are made available in the same stream task. A typical method you'd use here is selectKey(). Once that is done, Kafka Streams ensures that, for joining the two streams A and B as well as for creating the joined output stream, all values for the same key will be processed by the same stream task and thus the same application instance.
Example:
Stream A has key userId with value { georegion }.
Stream B has key georegion with value { continent, description }.
Joining two streams only works (as of Kafka 0.10.0) when both streams use the same key. In this example, this means that you must re-key (and thus re-partition) stream A so that the resulting key is changed from userId to georegion. Otherwise, as of Kafka 0.10, you can't join A and B because data is not co-located in the stream task that is responsible for actually performing the join.
In this example, you could re-key/re-partition stream A via:
// Kafka 0.10.0.x (latest stable release as of Sep 2016)
A.map((userId, georegion) -> KeyValue.pair(georegion, userId)).through("rekeyed-topic")
// Upcoming versions of Kafka (not released yet)
A.map((userId, georegion) -> KeyValue.pair(georegion, userId))
The through() call is only required in Kafka 0.10.0 to actually trigger the re-partitioning; later versions of Kafka will do this automatically for you (this upcoming functionality is already completed and available in Kafka trunk).
Is this something that a StateStore would be used for? Does Kafka manage this on its own or do I need to come up with a method?
In general, no. The behavior above is achieved through partitioning, not through state stores.
Sometimes state stores are involved because of the operations you have defined for a stream, which might explain why you were asking this question. For example, a windowing operation will require state to be managed, and thus a state store will be created behind the scenes. But your actual question -- "ensuring that the resultant aggregations include all values for each key" -- has nothing to do with state stores; it's about the partitioning behavior.
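For instance, a windowed count per key creates such a store behind the scenes. A sketch using a newer Kafka Streams API (2.x) than the 0.10 snippets above; the topic name and store name are hypothetical, and default serdes are assumed:

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> events = builder.stream("events-topic");

// tumbling 5-minute windows per key; the count is backed by a window state store
KTable<Windowed<String>, Long> counts = events
        .groupByKey()
        .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
        .count(Materialized.as("per-key-window-counts"));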
With worker instance, I assume you mean a Kafka Streams application instance, right? (Because there is no master/worker pattern in Kafka Streams -- it's a library and not a framework -- we do not use the term "worker".)
If you want to co-locate data per key, you need to partition the data by key. Thus, either your data is already partitioned by key by your external producer when it gets written into a topic, or you explicitly set a new key within your Kafka Streams application (using, for example, selectKey() or map()) and re-distribute the data via a call to through().
(The explicit call to through() will not be necessary in future releases, i.e., from 0.10.1 on, Kafka Streams will re-distribute records automatically if necessary.)
If messages/records should be partitioned by key, the key must not be null. You can also change the partitioning scheme via the producer configuration partitioner.class (see https://kafka.apache.org/documentation.html#producerconfigs).
Partitioning is completely independent from StateStores, even if StateStores are usually used on top of partitioned data.
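A sketch of the selectKey() + through() route described above (the topic name and the extractRegion() helper are hypothetical, and newer releases repartition automatically without the explicit through()):

KStream<String, String> rekeyed = source
        .selectKey((oldKey, value) -> extractRegion(value))   // choose the new partitioning key
        .through("rekeyed-by-region");                        // write + re-read to repartition by the new key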

How to parallelize an RDD of LinkedLists?

I am developing an application in Spark, using the Spark Streaming Framework.
Right now my aim is to learn how parallelization works in Spark, and how I can use it to speed up my input data processing.
Here is my question:
I have a DStream that, in each batch interval, has an RDD in which only one partition has data - 4 LinkedLists within that partition, to be precise (I am not sure exactly how many partitions the RDD has, perhaps 4 given the number of cores in my PC, since I am running in local mode).
I use the following to try and parallelize my RDD:
JavaDStream<LinkedList<Integer>> rddWithPartitions = rddWithDataInOnePartition.repartition(4);
That is, with this I intend to parallelize my RDD so that it has one LinkedList per partition, and not four in one single partition.
When I do rddWithPartitions.print(), I indeed see what I think are 4 partitions filled with data, but when I go to the Spark UI, namely the Executors tab, I only see one, meaning (I think) that I am only using one worker, and thus the parallelization wasn't achieved.
I do have more than one task (although it is three and not four, as I thought would be the case), but I am not sure if I am using all four cores of my PC, each one processing one partition of my RDD.
How can I make sure that I achieved this parallelization?
I hope I was not confusing.
Thank you so much.
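One way to check (a sketch, assuming the JavaDStream from the question after the repartition(4) call) is to inspect each micro-batch RDD:

rddWithPartitions.foreachRDD(rdd -> {
    System.out.println("partitions: " + rdd.getNumPartitions());
    // glom() turns each partition into a list, so the sizes show how the
    // LinkedLists are spread across partitions
    rdd.glom().collect()
       .forEach(part -> System.out.println("elements in this partition: " + part.size()));
});

Note that in local mode all tasks run inside a single executor process, so seeing one executor in the Spark UI does not by itself mean that only one core is doing the work.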

Java - How can I ensure I run a single instance of a process in a clustered environment

I have a JVM process that wakes a thread every X minutes.
If a condition is true -> it starts a job (JobA).
Another JVM process does almost the same, but if the condition is true -
it sends a message to a message broker, which triggers the job on another server (JobB).
Now, to avoid a SPOF I want to add another instance of this machine in my cloud.
But then I want to ensure that only a single instance of JobA runs each time.
What are my options?
There are a number of patterns to solve this common problem. You need to choose based on your exact situation, depending on which factor has more weight in your case (performance, correctness, fault-tolerance, whether misfires are allowed, etc.). The two solution groups are:
The "Quartz" way: you can use a JDBCStore from the Quartz library which (partially) was designed for this very reason. It allows multiple nodes to communicate, and share state and workload between each other. This solution gives you a probably perfect solution at the cost of some extra coding and setting up a shared DB (9 tables I think) between the nodes.
Alternatively, your nodes can take care of the distribution themselves: locking on a resource (a single record in a DB, for example) can be enough to decide who is in charge for that iteration of the execution. Sharing previous state, however, will require a bit more work.
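A minimal sketch of that second approach - each node tries to lock a single "leader" row before running JobA. The table and column names, JDBC URL, and runJobA() are hypothetical, and the FOR UPDATE NOWAIT semantics depend on the database:

// uses java.sql.Connection / DriverManager / PreparedStatement / ResultSet
try (Connection conn = DriverManager.getConnection("jdbc:postgresql://dbhost/scheduler", "user", "pass")) {
    conn.setAutoCommit(false);
    try (PreparedStatement ps = conn.prepareStatement(
            "SELECT id FROM job_lock WHERE job_name = ? FOR UPDATE NOWAIT")) {
        ps.setString(1, "JobA");
        try (ResultSet rs = ps.executeQuery()) {
            if (rs.next()) {
                runJobA();   // only the instance holding the row lock runs the job this iteration
            }
        }
    } catch (SQLException lockNotAvailable) {
        // another instance already holds the lock for this iteration: skip
    }
    conn.commit();           // ending the transaction releases the row lock
}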

Is it possible to restrict a MapReduce job from accessing remote data?

We have a particular algorithm that we want to integrate with HDFS. The algorithm requires us to access data locally (the work would be done exclusively in the Mapper). However, we do want to take advantage of HDFS in terms of distributing the file (providing reliability and striping). After the calculation is performed, we'd use the Reducer to simply send back the answer, rather than perform any additional work. Avoiding network use is an explicit goal. Is there a configuration setting that would allow us to restrict network data access, so that when a MapReduce job is started it will only access its local DataNode?
UPDATE: Adding a bit of context
We're attempting to analyze this problem with string matching. Assume our cluster has N nodes and a file is stored with N GB of text. The file is stored in HDFS and distributed in even parts to the nodes (1 part per node). Can we create a MapReduce job that launches one process on each node to access the part of the file that's sitting on the same host? Or would the MapReduce framework unevenly distribute the work (e.g. 1 job accessing all N parts of the data, or 0.5N nodes attempting to process the whole file)?
If you set the number of reduce tasks to zero you can skip the shuffling and therefore the network cost of your algorithm.
While creating your job, this can be done with the following line of code:
job.setNumReduceTasks(0);
I don't know what your algorithm will do, but say it is a pattern-matching algorithm looking for occurrences of a particular word; then the mappers would report the number of matches per split. If you want to add up the counts, you need network communication and a reducer.
The first Google match on a map-only example that I found:
Map-Only MR jobs
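A minimal sketch of such a map-only job setup with the newer mapreduce API (MatchDriver, MatchMapper, and the paths are hypothetical):

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "local string matching");
job.setJarByClass(MatchDriver.class);
job.setMapperClass(MatchMapper.class);
job.setNumReduceTasks(0);                      // no reducers: map output is written straight to HDFS
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
FileInputFormat.addInputPath(job, new Path("/data/input"));
FileOutputFormat.setOutputPath(job, new Path("/data/output"));
System.exit(job.waitForCompletion(true) ? 0 : 1);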
Setting the number of reducers to zero would increase the data locality, meaning the output generated by the mappers will be written directly to HDFS. Of course, you will not have any control over which nodes store that output, and if its size is greater than the number of mapper slots * block size, remote access would be attempted to avoid starvation. My advice is to use the delay scheduler and set locality-delay-node-ms and locality-delay-rack-ms to a large value (i.e. the maximum expected running time of your mappers). This will make the delay scheduler wait as long as possible before requesting data remotely. However, this may lead to resource under-utilization and an increased running time (e.g. any node that does not store any data block will be idle for a long time, up to locality-delay-node-ms + locality-delay-rack-ms).

How to tell MapReduce how many mappers to use at the same time?

I am writing an indexing app for MapReduce.
I was able to split the input with NLineInputFormat, and now I've got a few hundred mappers in my app. However, only 2 per machine are active at the same time; the rest are "PENDING". I believe this behavior slows the app significantly.
How do I make hadoop run at least 100 of those at the same time per machine?
I am using the old Hadoop API syntax. Here's what I've tried so far:
conf.setNumMapTasks(1000);
conf.setNumTasksToExecutePerJvm(500);
None of those seems to have any effect.
Any ideas how I can make the mappers actually RUN in parallel?
JobConf.setNumMapTasks() is just a hint to the MR framework, and I am not sure of the effect of calling it. In your case, the total number of map tasks across the whole job should equal the total number of lines in the input divided by the number of lines per split configured in the NLineInputFormat. You can find more details on the total number of map/reduce tasks across the whole job here.
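For reference, a sketch with the old (mapred) API used in the question; treat the property key as an assumption for your Hadoop version (it is the one the old NLineInputFormat reads for lines per split), and the driver class is hypothetical:

JobConf conf = new JobConf(MyIndexerDriver.class);
conf.setInputFormat(NLineInputFormat.class);
conf.setInt("mapred.line.input.format.linespermap", 1000);   // roughly 1000 lines handled per map task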
The description for mapred.tasktracker.map.tasks.maximum says
The maximum number of map tasks that will be run simultaneously by a task tracker.
You need to configure mapred.tasktracker.map.tasks.maximum (which defaults to 2) to change the number of map tasks run in parallel on a particular node by the task tracker. I could not find the documentation for 0.20.2, so I am not sure whether the parameter exists or whether the same parameter name is used in the 0.20.2 release.
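For example, a sketch of the corresponding entry in mapred-site.xml on each TaskTracker node (this is a cluster-side daemon setting, not a per-job parameter; the value 100 is hypothetical, and the TaskTracker needs a restart to pick it up):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>100</value>   <!-- allow up to 100 concurrent map tasks on this TaskTracker -->
</property>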
