kafka -> storm -> flink : unexpected block data - java

I'm moving a topology from Storm to Flink. The topology has been reduced to KafkaSpout -> Bolt. The bolt just counts packets and does not try to decode them.
The compiled .jar is submitted to Flink via flink run -c <entry point> <path to .jar> and hits the following error:
java.lang.Exception: Call to registerInputOutput() of invokable failed
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:529)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flink.streaming.runtime.tasks.StreamTaskException: Cannot instantiate user function.
at org.apache.flink.streaming.api.graph.StreamConfig.getStreamOperator(StreamConfig.java:190)
at org.apache.flink.streaming.runtime.tasks.StreamTask.registerInputOutput(StreamTask.java:174)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:526)
... 1 more
Caused by: java.io.StreamCorruptedException: unexpected block data
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1365)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:294)
at org.apache.flink.util.InstantiationUtil.readObjectFromConfig(InstantiationUtil.java:255)
at org.apache.flink.streaming.api.graph.StreamConfig.getStreamOperator(StreamConfig.java:175)
... 3 more
My question(s):
Did I miss a configuration step with regard to the KafkaSpout? This was working in vanilla Storm.
Are there specific versions of the Storm libraries that I need to use? I'm including 0.9.4 in my build.
Is there something else that I might have missed?
Should I be using the Storm KafkaSpout, or would I be better served by writing my own using the Flink KafkaSource?
EDIT:
Here are the relevant pieces of code:
Topology:
BrokerHosts brokerHosts = new ZkHosts(configuration.getString("kafka/zookeeper"));
SpoutConfig kafkaConfig = new SpoutConfig(brokerHosts, configuration.getString("kafka/topic"), "/storm_env_values", "storm_env_DEBUG");
FlinkTopologyBuilder builder = new FlinkTopologyBuilder();
builder.setSpout("environment", new KafkaSpout(kafkaConfig), 1);
builder.setBolt("decode_bytes", new EnvironmentBolt(), 1).shuffleGrouping("environment");
Init:
FlinkLocalCluster cluster = new FlinkLocalCluster(); // replaces: LocalCluster cluster = new LocalCluster();
cluster.submitTopology("env_topology", conf, buildTopology());
The bolt is based on BaseRichBolt. The execute() function just logs the presence of any packet at debug level. There is no other code in there.
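For reference, here is a hypothetical reconstruction of such a bolt; only the EnvironmentBolt name comes from the topology code above, everything else (logger, counter) is an assumption about what "counting and logging packets" looks like against the Storm 0.9.x API:
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.Map;

// Sketch of a bolt that only counts incoming tuples and logs their arrival.
public class EnvironmentBolt extends BaseRichBolt {
    private static final Logger LOG = LoggerFactory.getLogger(EnvironmentBolt.class);
    private OutputCollector collector;
    private long packetCount;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        packetCount++;
        LOG.debug("packet #{} received", packetCount);
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // no output fields; the bolt is a sink that only counts/logs
    }
}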

I just had a look at this. There is one issue right now, but I got it working locally. You can apply this hotfix to your code and build the compatibility layer yourself.
KafkaSpout registers metrics. However, metrics are currently not supported by the compatibility layer. You need to remove the exception in FlinkTopologyContext.registerMetric(...) and just return null. (There is already an open PR that works on the integration of metrics, so I don't want to push this hotfix into the master branch.)
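A minimal sketch of that hotfix, assuming the generic registerMetric(String, T, int) overload that FlinkTopologyContext inherits from Storm's TopologyContext (not the exact upstream code):
// Inside FlinkTopologyContext: make metric registration a no-op instead of
// throwing, so that KafkaSpout's metric registration succeeds.
@Override
public <T extends IMetric> T registerMetric(String name, T metric, int timeBucketSizeInSecs) {
    return null;
}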
Furthermore, you need to add some configuration parameters to your topology configuration manually:
I just made up some values here:
Config c = new Config();
List<String> zkServers = new ArrayList<String>();
zkServers.add("localhost");
c.put(Config.STORM_ZOOKEEPER_SERVERS, zkServers);
c.put(Config.STORM_ZOOKEEPER_PORT, 2181);
c.put(Config.STORM_ZOOKEEPER_SESSION_TIMEOUT, 30);
c.put(Config.STORM_ZOOKEEPER_CONNECTION_TIMEOUT, 30);
c.put(Config.STORM_ZOOKEEPER_RETRY_TIMES, 3);
c.put(Config.STORM_ZOOKEEPER_RETRY_INTERVAL, 5);
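This Config is presumably what gets passed as the conf argument when submitting the topology, mirroring the init code from the question (a sketch, with buildTopology() being the question's own helper):
// Hand the manually built Storm Config to the cluster at submission time.
FlinkLocalCluster cluster = new FlinkLocalCluster();
cluster.submitTopology("env_topology", c, buildTopology());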
You also need to add some additional dependencies to your project. In addition to flink-storm you need:
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-kafka</artifactId>
<version>0.9.4</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.10</artifactId>
<version>0.8.1.1</version>
</dependency>
This works for me, using Kafka_2.10-0.8.1.1 and FlinkLocalCluster, executed within Eclipse.
It also works in a local Flink cluster started via bin/start-local-streaming.sh. For this, using the bin/flink run command, you need to use FlinkSubmitter instead of FlinkLocalCluster (see the sketch after the dependency list). Furthermore, you need the following dependencies in your jar:
<include>org.apache.storm:storm-kafka</include>
<include>org.apache.kafka:kafka_2.10</include>
<include>org.apache.curator:curator-client</include>
<include>org.apache.curator:curator-framework</include>
<include>com.google.guava:guava</include>
<include>com.yammer.metrics:metrics-core</include>
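A sketch of the FlinkSubmitter variant, assuming it mirrors Storm's StormSubmitter.submitTopology signature and reusing the conf and buildTopology() from the question:
// When packaging the job for bin/flink run, submit via FlinkSubmitter
// instead of starting a FlinkLocalCluster in-process.
FlinkSubmitter.submitTopology("env_topology", conf, buildTopology());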

Related

How can I acquire JanusGraphManagement over a remote connection?

I have a docker container running the gremlin-server.
It was started via:
./bin/gremlin-server.sh conf/gremlin-server/gremlin-server.yaml
From within a docker container, running this image:
https://hub.docker.com/r/janusgraph/janusgraph
The server is up and is listening at port 8182
$ docker ps
6019adda6081 janusgraph/janusgraph "docker-entrypoint.s…" 2 days ago Up 26 hours 0.0.0.0:8182->8182/tcp
I am interested in using a schema and indexes.
Janus offers this here: https://docs.janusgraph.org/basics/schema/
The following is the configuration I use to attempt to connect to the gremlin-server:
AbstractConfiguration config = new BaseConfiguration();
config.setListDelimiter('/');
// contents of conf/remote-graph.properties
config.setProperty("gremlin.remote.driver.sourceName", "g");
config.setProperty("gremlin.remote.remoteConnectionClass", "org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection");
// contents of conf/remote-objects.yaml:
config.setProperty("clusterConfiguration.hosts", databaseUrl);
config.setProperty("clusterConfiguration.port", 8182);
config.setProperty("clusterConfiguration.serializer.className", "org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0/");
config.setProperty("storage.backend", "cql");
config.setProperty("clusterConfiguration.serializer.config.ioRegistries", "org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry");
When I call
GraphTraversalSource g = traversal().withRemote(config);
I get a traversal source and everything seems fine. However, to use the management functionality that Janus provides, I seem to need a JanusGraphManagement object. I cannot take the generic Graph object above and cast it to a JanusGraph. The docs suggest using a JanusGraphFactory: https://docs.janusgraph.org/basics/configuration/#janusgraphfactory
So I call
JanusGraph janusGraph = JanusGraphFactory.open(config);
I get the following stack trace:
Exception in thread "main" java.lang.IllegalArgumentException: Could not find implementation class: org.janusgraph.diskstorage.cql.CQLStoreManager
at org.janusgraph.util.system.ConfigurationUtil.instantiate(ConfigurationUtil.java:60)
at org.janusgraph.diskstorage.Backend.getImplementationClass(Backend.java:440)
at org.janusgraph.diskstorage.Backend.getStorageManager(Backend.java:411)
at org.janusgraph.graphdb.configuration.builder.GraphDatabaseConfigurationBuilder.build(GraphDatabaseConfigurationBuilder.java:50)
at org.janusgraph.core.JanusGraphFactory.open(JanusGraphFactory.java:161)
at org.janusgraph.core.JanusGraphFactory.open(JanusGraphFactory.java:132)
at org.janusgraph.core.JanusGraphFactory.open(JanusGraphFactory.java:112)
at com.activitystream.database.GraphMigration.migrateDatabase(GraphMigration.java:69)
at com.activitystream.runners.persistence.DataStores.migrateDatabase(DataStores.java:27)
at com.activitystream.runners.persistence.EntityPersistenceRunner.main(EntityPersistenceRunner.java:23)
Caused by: java.lang.ClassNotFoundException: org.janusgraph.diskstorage.cql.CQLStoreManager
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:315)
at org.janusgraph.util.system.ConfigurationUtil.instantiate(ConfigurationUtil.java:56)
... 9 more
Is it possible to modify the schema over a remote connection?
If it is not possible, how can one modify the schema?
Any insight would be appreciated.
You basically have two choices, either:
Interact with your JanusGraphManagement object by way of scripts sent to Gremlin Server (typically by way of a session, but I guess you could package an entire "management script" together and submit it as one request; a sketch of that follows below), or
Bypass Gremlin Server and instantiate your JanusGraphManagement object locally, as directed in the JanusGraph documentation.
There is no way to return a JanusGraphManagement object to your client, as it is not a serializable object that can be sent back from the server.
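A minimal sketch of the first option, using the TinkerPop driver to submit a management script as a single request; the host, port, and the server-side graph binding name "graph" are assumptions based on the question's setup:
import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;

public class RemoteSchemaExample {
    public static void main(String[] args) throws Exception {
        // Assumes the JanusGraph Gremlin Server from the question, listening on 8182,
        // with the graph bound server-side under the name "graph".
        Cluster cluster = Cluster.build("localhost").port(8182).create();
        Client client = cluster.connect();
        try {
            String script =
                "mgmt = graph.openManagement()\n" +
                "if (mgmt.getVertexLabel('person') == null) { mgmt.makeVertexLabel('person').make() }\n" +
                "mgmt.commit()";
            // Submit the whole management script in one request and wait for it to finish.
            client.submit(script).all().get();
        } finally {
            client.close();
            cluster.close();
        }
    }
}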

Kafka Streams error "TaskAssignmentException: unable to decode subscription data: version=4"

During a deployment in which only the Kafka-Streams version changed from 1.1.1 to 2.x.x (without changing application.id), we got exceptions on the app nodes running the older Kafka-Streams version, and as a result Kafka Streams changed state to error and closed, while the app node with the new Kafka-Streams version consumed messages fine.
If we upgrade from 1.1.1 to 2.0.0, we get the error unable to decode subscription data: version=3; if from 1.1.1 to 2.3.0, unable to decode subscription data: version=4.
This can be really painful during a canary deployment: e.g. we have 3 app nodes with the previous Kafka-Streams version, and when we add one more node with the new version, all 3 existing nodes go into the error state. Error stack trace:
TaskAssignmentException: unable to decode subscription data: version=4
at org.apache.kafka.streams.processor.internals.assignment.SubscriptionInfo.decode(SubscriptionInfo.java:128)
at org.apache.kafka.streams.processor.internals.StreamPartitionAssignor.assign(StreamPartitionAssignor.java:297)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.performAssignment(ConsumerCoordinator.java:358)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.onJoinLeader(AbstractCoordinator.java:520)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.access$1100(AbstractCoordinator.java:93)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$JoinGroupResponseHandler.handle(AbstractCoordinator.java:472)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$JoinGroupResponseHandler.handle(AbstractCoordinator.java:455)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:822)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:802)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:204)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:167)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:127)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:563)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:390)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:293)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:193)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:364)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:316)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:290)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1149)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1115)
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:831)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:788)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:749)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:719)
The issue is reproducible on both Kafka broker versions 1.1.0 and 2.1.1, even with this simple Kafka-Streams DSL example:
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("default.key.serde", "org.apache.kafka.common.serialization.Serdes$StringSerde");
props.put("default.value.serde", "org.apache.kafka.common.serialization.Serdes$StringSerde");
props.put("application.id", "xxx");
StreamsBuilder streamsBuilder = new StreamsBuilder();
streamsBuilder.<String, String>stream("source")
.mapValues(value -> value + value)
.to("destination");
KafkaStreams kafkaStreams = new KafkaStreams(streamsBuilder.build(), props);
Is this a bug in Kafka Streams? Is there any workaround to prevent such a failure?

Zero flow files in onTrigger() method of AbstractProcessor of Apache Nifi

I am developing a custom processor for Apache NiFi. I have created a nar of my processor, put it in the lib folder of NiFi, and started NiFi. I have set up the remote debugger in Eclipse and enabled a breakpoint on the first line of onTrigger(). While debugging, I am running one processor at a time in my NiFi pipeline. I can see a single flow file in the input queue of my custom processor; however, my custom processor is not receiving any flow file. When I start my custom processor, it hits the breakpoint inside the onTrigger() method. Inside this method, when I do:
public class MyCustomProc extends AbstractProcessor {
    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
        List<FlowFile> flowFiles = session.get(5000);
        if (flowFiles == null || flowFiles.size() == 0) {
            return;
        }
        //...
flowFiles turns out to be of size zero! I am not able to guess in which direction I should look to find out why this is happening. Any hint on how I can diagnose this?
Edit
Stacktrace
2019-05-02 18:08:09,456 ERROR [Timer-Driven Process Thread-10] c.c.product.module.submodule.MyCustomProcessor MyCustomProcessor[id=016a1008-8956-1dbf-bd66-993e0ce98668] MyCustomProcessor[id=016a1008-8956-1dbf-bd66-993e0ce98668] failed to process due to org.apache.nifi.processor.exception.FlowFileHandlingException: StandardFlowFileRecord[uuid=408fbb3d-7cc2-48bc-be8f-6d0afdbddaf2,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1556800468726-1, container=default, section=1], offset=261, length=591447],offset=0,name=188149730353200,size=591447] transfer relationship not specified; rolling back session: {}
org.apache.nifi.processor.exception.FlowFileHandlingException: StandardFlowFileRecord[uuid=408fbb3d-7cc2-48bc-be8f-6d0afdbddaf2,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1556800468726-1, container=default, section=1], offset=261, length=591447],offset=0,name=188149730353200,size=591447] transfer relationship not specified
at org.apache.nifi.controller.repository.StandardProcessSession.checkpoint(StandardProcessSession.java:251)
at org.apache.nifi.controller.repository.StandardProcessSession.commit(StandardProcessSession.java:321)
at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:28)
at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1122)
at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:147)
at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:47)
at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:128)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.runAndReset(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
PS1: The method returns immediately from inside the if body, which gives me the following exception:
org.apache.nifi.processor.exception.FlowFileHandlingException: StandardFlowFileRecord transfer relationship not specified
This exception keeps recurring forever, since the flow file stays in the input queue of my custom processor.
PS2: I am getting the following error in the app log, though I am unsure whether this is the source of the problem:
2019-05-02 18:17:32,394 ERROR [Timer-Driven Process Thread-4] o.a.nifi.groups.StandardProcessGroup Unable to synchronize StandardProcessGroup[identifier=d25747e6-719e-3ed9-c6c5-56794af6555c] with Flow Registry because Process Group was placed under Version Control using Flow Registry with identifier 80016ab0-bfab-152b-ffff-ffffc441867c but cannot find any Flow Registry with this identifier
It is normal behavior to sometimes get zero flow files, which is why processors have the check that you have at the beginning.
The FlowFileHandlingException means that a flow file was obtained from the session, either from get or create, and that flow file was not transferred anywhere and was not removed, so basically it is unaccounted for. This could not happen just from returning at the beginning in that if statement, so the rest of the processor code is executing and producing this error. You haven't provided the rest of the code, so we can't see the problem.
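As an illustration of the contract the answer describes, a hedged sketch of an onTrigger() that accounts for every flow file it pulls (REL_SUCCESS is an assumed relationship of the processor, not something from the question):
@Override
public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
    final List<FlowFile> flowFiles = session.get(5000);
    if (flowFiles == null || flowFiles.isEmpty()) {
        return; // nothing was pulled from the session, so nothing needs accounting
    }
    for (FlowFile flowFile : flowFiles) {
        // ... actual processing would go here ...
        // Every flow file taken from the session must end up either transferred
        // to a relationship or removed; otherwise committing the session throws
        // FlowFileHandlingException ("transfer relationship not specified").
        session.transfer(flowFile, REL_SUCCESS); // or session.remove(flowFile)
    }
}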
The second issue is fairly self-explanatory. You have a process group under version control, but the registry client that was used to start version control no longer exists. I don't know how you created this scenario, because I believe the UI/API won't let you delete a registry client that has active flows under version control, but you should be able to stop version control on the process group.

Spark ML- failing to load model using MatrixFactorizationModel

I am trying to implement recommender system using Spark collaborative filtering.
First I prepare model and save to disk:
MatrixFactorizationModel model = trainModel(inputDataRdd);
model.save(jsc.sc(), "/op/tc/model/");
When I load the model from a separate process, the program fails with the exception below:
Code:
static JavaSparkContext jsc;
private static Options options;
static {
    SparkConf conf = new SparkConf().setAppName("TC recommender application");
    conf.set("spark.driver.allowMultipleContexts", "true");
    jsc = new JavaSparkContext(conf);
}
MatrixFactorizationModel model = MatrixFactorizationModel.load(jsc.sc(), "/op/tc/model/");
Exception:
Exception in thread "main" java.io.IOException: Not a file:
maprfs:/op/tc/model/data
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:324)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
at org.apache.spark.rdd.RDD$$anonfun$aggregate$1.apply(RDD.scala:1114)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.aggregate(RDD.scala:1107)
at org.apache.spark.mllib.recommendation.MatrixFactorizationModel.countApproxDistinctUserProduct(MatrixFactorizationModel.scala:96)
at org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:126)
at com.aexp.cxp.recommendation.ProductRecommendationIndividual.main(ProductRecommendationIndividual.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:742)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Is there any configuration I need to set to load the model? Any suggestion would be a great help.
In Spark, as in any other distributed computing framework, it is important to understand where the code runs when you are trying to debug it. It is also important to have access to the various types of logs. For example, in YARN, you would have:
the master logs, if you record them yourself
the aggregated slave logs (thanks YARN, useful feature!)
the YARN node manager logs (which will, for example, tell you why a container was killed, etc.)
etc.
Digging into Spark issues can be quite time consuming if you don't look in the right place from the start. Now, more specifically on this question, you have a clear stacktrace, which is not always the case, so you should use it to your advantage.
The top of the stacktrace is
Exception in thread "main" java.io.IOException: Not a file: maprfs:/op/tc/model/data
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:324)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
...
As you can see, the Spark job was executing a map operation when it failed. Who executes a map? The slaves. Therefore, you have to make sure your file is available on all slaves, not only on the master.
More generally, you always need to make a clear distinction in your head between the code you are writing for the master and the code you are writing for the slaves. This will help you detect this kind of interaction, as well as references to non-serializable objects and other common mistakes.
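One way to follow that advice here would be to keep the saved model on storage that every node can reach; a hedged sketch, where the path and filesystem prefix are assumptions rather than a verified fix:
// Save and load the model from a distributed filesystem path that all
// executors can read, rather than a directory local to the driver machine.
model.save(jsc.sc(), "maprfs:///op/tc/model/");
MatrixFactorizationModel sameModel =
    MatrixFactorizationModel.load(jsc.sc(), "maprfs:///op/tc/model/");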

Samza/Kafka Failed to Update Metadata

I am currently working on writing a Samza Script that will just take data from a Kafka topic and output the data to another Kafka topic. I have written a very basic StreamTask however upon execution I am running into an error.
The error is below:
Exception in thread "main" org.apache.samza.SamzaException: org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 193 ms.
at org.apache.samza.coordinator.stream.CoordinatorStreamSystemProducer.send(CoordinatorStreamSystemProducer.java:112)
at org.apache.samza.coordinator.stream.CoordinatorStreamSystemProducer.writeConfig(CoordinatorStreamSystemProducer.java:129)
at org.apache.samza.job.JobRunner.run(JobRunner.scala:79)
at org.apache.samza.job.JobRunner$.main(JobRunner.scala:48)
at org.apache.samza.job.JobRunner.main(JobRunner.scala)
Caused by: org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 193 ms
I am not entirely sure how to configure things or have the script write the required Kafka metadata. Below is my code for the StreamTask and the properties file. In the properties file I added the Metadata section to see if that would assist in the process, but to no avail. Is that the right direction, or am I missing something entirely?
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.TaskCoordinator;
import org.apache.samza.system.SystemStream;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
/*
* Take all messages received and send them to
* a Kafka topic called "words"
*/
public class TestStreamTask implements StreamTask {
    private static final SystemStream OUTPUT_STREAM = new SystemStream("kafka", "words"); // create new system stream for kafka topic "words"

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        String message = (String) envelope.getMessage(); // pull message from stream
        for (String word : message.split(" "))
            collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, word, 1)); // output message to new system stream for kafka topic "words"
    }
}
# Job
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=test-words
# YARN
yarn.package.path=file://${basedir}/target/${project.artifactId}-${pom.version}-dist.tar.gz
# Task
task.class=samza.examples.wikipedia.task.TestStreamTask
task.inputs=kafka.test
task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
task.checkpoint.system=kafka
task.checkpoint.replication.factor=1
# Metrics
metrics.reporters=snapshot,jmx
metrics.reporter.snapshot.class=org.apache.samza.metrics.reporter.MetricsSnapshotReporterFactory
metrics.reporter.snapshot.stream=kafka.metrics
metrics.reporter.jmx.class=org.apache.samza.metrics.reporter.JmxReporterFactory
# Serializers
serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
serializers.registry.metrics.class=org.apache.samza.serializers.MetricsSnapshotSerdeFactory
# Systems
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.samza.msg.serde=string
systems.kafka.consumer.zookeeper.connect=localhost:2181/
systems.kafka.consumer.auto.offset.reset=largest
systems.kafka.producer.bootstrap.servers=localhost:9092
# Metadata
systems.kafka.metadata.bootstrap.servers=localhost:9092
This question is about Kafka 0.8, which should be out of support if I am not mistaken.
That fact, combined with the observation that people only ran into this issue sometimes, not all the time (and nobody seems to have struggled with it in recent years), gives me very good confidence that upgrading to a more recent version of Kafka will resolve the problem.
