Push a Spark Dataset to Kafka - TimeoutException - java

I want to push data to Kafka from a Spark job. I am using the Spark Kafka integration as follows:
pivotDataDataset.selectExpr("CAST(columnName as STRING) value")
    .write()
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafkaServerIp:9092")
    .option("topic", "topicname")
    .save();
It gives me the following exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, prod-fdphadoop-krios-dn-1015, executor 1): org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:926)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:924)
at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$write$1.apply$mcV$sp(KafkaWriter.scala:89)
at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$write$1.apply(KafkaWriter.scala:89)
at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$write$1.apply(KafkaWriter.scala:89)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecutio
Caused by: org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
18/11/26 21:38:47 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, prod-fdphadoop-krios-dn-1015, executor 1): org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
I have been trying to debug this, but have not been able to resolve it.
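A TimeoutException with "Failed to update metadata" usually means the producer cannot fetch topic metadata from the bootstrap servers within max.block.ms (60 s by default), which most often points to a connectivity or advertised-listener problem between the executor hosts and the broker. As a debugging aid (my own sketch, not part of the original post), the following standalone producer can be run from one of the executor nodes against the same address the Spark job uses; if it fails with the same exception, the problem is the network or broker configuration rather than the Spark code:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaConnectivityCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafkaServerIp:9092"); // same address the Spark job uses
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("max.block.ms", "10000"); // fail fast instead of blocking 60 s on metadata

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() blocks on metadata first; a broken connection surfaces the same TimeoutException here.
            producer.send(new ProducerRecord<>("topicname", "connectivity-check")).get();
            System.out.println("Metadata fetch and send succeeded.");
        }
    }
}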

Related

Spark saveAsText gets SparkException: Task failed while writing rows, IllegalArgumentException

When I use Spark's rdd.saveAsText(path, LzopCodec.class), I get an exception like the one below. Is this caused by the Java buffer size limit?
Maybe one row's string is over 2 GB?
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3224 in stage 22.0 failed 4 times, most recent failure: Lost task 3224.3 in stage 22.0 (TID 10280, hadoop12.fj8.yiducloud.cn, executor 14): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException
at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:810)
at org.apache.hadoop.io.Text.encode(Text.java:451)
at org.apache.hadoop.io.Text.set(Text.java:198)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$2$$anonfun$31$$anonfun$apply$54.apply(RDD.scala:1505)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$2$$anonfun$31$$anonfun$apply$54.apply(RDD.scala:1504)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:125)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1413)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
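The guess about row size is plausible: Text.encode sizes a ByteBuffer from the string length, and if the encoded size overflows an int, ByteBuffer.allocate is handed a negative capacity and throws IllegalArgumentException. As a quick way to test that hypothesis (a sketch of my own; the input path and threshold are placeholders), you could scan the RDD for suspiciously long rows before saving:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class OversizedRowCheck {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("OversizedRowCheck").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Hypothetical input path; point this at the data that is passed to saveAsText.
            JavaRDD<String> lines = sc.textFile("hdfs:///path/to/input");
            // ~700M chars is already dangerous: UTF-8 encoding can push the byte count past Integer.MAX_VALUE.
            long suspicious = lines.filter(s -> s.length() > 700_000_000).count();
            System.out.println("rows that could overflow Text's ByteBuffer: " + suspicious);
        }
    }
}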

Spark ERROR executor: Exception in task 0.0 in stage 0.0 (tid 0) java.lang.ArithmeticException

I got the error below when I ran a Java web application using Cassandra 3.11.9 and Spark 3.0.1.
My question is: why did it happen only after deploying the application? In the development environment it did not occur.
2021-03-24 08:50:41.150 INFO 19613 --- [uler-event-loop] org.apache.spark.scheduler.DAGScheduler : ShuffleMapStage 0 (collectAsList at FalhaService.java:60) failed in 7.513 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (GDBHML08 executor driver): java.lang.ArithmeticException: integer overflow
at java.lang.Math.toIntExact(Math.java:1011)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.fromJavaDate(DateTimeUtils.scala:90)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$DateConverter$.toCatalystImpl(CatalystTypeConverters.scala:306)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$DateConverter$.toCatalystImpl(CatalystTypeConverters.scala:305)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:107)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:252)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:242)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:107)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$.$anonfun$createToCatalystConverter$2(CatalystTypeConverters.scala:426)
at com.datastax.spark.connector.datasource.UnsafeRowReader.read(UnsafeRowReaderFactory.scala:34)
at com.datastax.spark.connector.datasource.UnsafeRowReader.read(UnsafeRowReaderFactory.scala:21)
at com.datastax.spark.connector.datasource.CassandraPartitionReaderBase.$anonfun$getIterator$2(CassandraScanPartitionReaderFactory.scala:110)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
at com.datastax.spark.connector.datasource.CassandraPartitionReaderBase.next(CassandraScanPartitionReaderFactory.scala:66)
at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
2021-03-24 08:50:41.189 INFO 19613 --- [nio-8080-exec-2] org.apache.spark.scheduler.DAGScheduler : Job 0 failed: collectAsList at FalhaService.java:60, took 8.160348 s
The line of code where this error occurs:
List<Row> rows = dataset.collectAsList();
The surrounding block of code:
Dataset<Row> dataset = session.sql(sql.toString());
List<Row> rows = dataset.collectAsList();
ListIterator<Row> t = rows.listIterator();
while (t.hasNext()) {
    Row row = t.next();
    grafico = new EstGraficoRelEstTela();
    grafico.setSuperficie(row.getLong(0));
    grafico.setSubsea(row.getLong(1) + row.getLong(2));
    grafico.setNomeTipoSensor(row.getString(3));
    graficoLocalFalhas.add(grafico);
}
session.close();
Thanks,
It looks like you have incorrect data in the database: some date field that is far in the future. If you look into the source code, you can see that it first converts the date into milliseconds and then into days, and this conversion overflows the integer. That may also explain why the code works in the dev environment...
You may ask your administrator to check the files for corrupted data, for example, using the nodetool scrub command.
P.S. Are you sure that you're using Spark 3.0.1? The location of the function in the error matches Spark 3.1.1...
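To make the overflow concrete, here is a minimal sketch (my own illustration, not Spark's exact code path) of how a far-future date's day count stops fitting in an int, which is essentially what Math.toIntExact complains about in the stack trace above:

import java.time.LocalDate;

public class DateOverflowDemo {
    public static void main(String[] args) {
        // A corrupted date far in the future, e.g. the maximum year LocalDate supports.
        LocalDate farFuture = LocalDate.of(999_999_999, 1, 1);
        long epochDays = farFuture.toEpochDay();   // fits comfortably in a long
        System.out.println("days since epoch: " + epochDays);
        // Spark ultimately narrows the day count to an int; this is the step that blows up.
        int narrowed = Math.toIntExact(epochDays); // throws ArithmeticException: integer overflow
        System.out.println(narrowed);
    }
}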

PySpark 3.0.1 Failing to run Distributed training in Tensorflow 2.1.0

I'm trying to train a simple fashion_mnist model in TensorFlow, following the original TensorBoard API docs on hyperparameter tuning that you can find here.
Currently, for testing purposes, I'm running in standalone mode, so master = 'local[*]'.
I have installed pyspark==3.0.1 and tensorflow==2.1.0. The following is what I'm trying to run:
from typing import Any, Tuple

import tensorflow as tf
from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten


# For a given hyper parameter, this will run the train & return the model + accuracy which I'm looking for.
# This works when I run without spark.
def train(hparam) -> Tuple[Model, Any]:
    fashion_mnist = tf.keras.datasets.fashion_mnist  # was "fashion" in the original; assumed to be the Keras dataset
    (x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0
    model = Sequential([
        Flatten(),
        Dense(hparam['num_units'], activation=tf.nn.relu),
        Dropout(hparam['dropout']),
        Dense(10, activation=tf.nn.softmax),
    ])
    model.compile(
        optimizer=hparam['optimizer'],
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'],
    )
    model.fit(x_train, y_train, epochs=1)  # Run with 1 epoch to speed things up for demo purposes
    _, accuracy = model.evaluate(x_test, y_test)
    return model, accuracy
Here's the Spark code that I run:
from pyspark.sql import SparkSession
from tensorboard.plugins.hparams import api as hp

if __name__ == '__main__':
    hp_nums = hp.HParam('num_units', hp.Discrete([16, 32]))
    hp_dropouts = hp.HParam('dropout', hp.RealInterval(0.1, 0.2))
    hp_opts = hp.HParam('optimizer', hp.Discrete(['adam', 'sgd']))

    all_params = []  # contains a list of different hparams
    for num_units in hp_nums.domain.values:
        for dropout_rate in (hp_dropouts.domain.min_value, hp_dropouts.domain.max_value):
            for optimizer in hp_opts.domain.values:
                hparams = {
                    'num_units': num_units,
                    'dropout': dropout_rate,
                    'optimizer': optimizer,
                }
                all_params.append(hparams)

    spark_sess = SparkSession.builder.master(
        'local[*]'
    ).appName(
        'LocalTraining'
    ).getOrCreate()

    res = spark_sess.sparkContext.parallelize(
        all_params, len(all_params)  # was "all_hparams" in the original
    ).map(
        train  # the function above
    ).collect()

    temp = 0.0
    best_model = None
    for model, acc in res:
        if acc > temp:
            temp = acc  # track the best accuracy (assignment was missing in the original)
            best_model = model
    print("best accuracy is -> " + str(temp))
This looks alright to me, and it works for any simple map-reduce job (like the basic examples), which makes me believe my environment is set up correctly.
My Environment:
java : Java 11.0.8 2020-07-14 LTS
python: Python 3.6.5
pyspark: 3.0.1
tensorflow: 2.1.0
Keras: 2.3.1
windows: 10 (if this really matters)
cores : 8 (i5 10th gen)
Memory: 6G
But when I run the above piece of code, I get the following error. I can see the training run, and it just stops after one executor runs:
59168/60000 [============================>.] - ETA: 0s - loss: 0.7350 - accuracy: 0.7471
60000/60000 [==============================] - 3s 42us/step - loss: 0.7331 - accuracy: 0.7477
20/12/05 14:03:57 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.net.SocketException: Connection reset
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:252)
at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:271)
20/12/05 14:03:57 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
20/12/05 14:03:57 INFO TaskSchedulerImpl: Cancelling stage 0
20/12/05 14:03:57 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage cancelled
20/12/05 14:03:57 INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1), reason: Stage cancelled
20/12/05 14:03:57 INFO TaskSchedulerImpl: Stage 0 was cancelled
20/12/05 14:03:57 INFO DAGScheduler: ResultStage 0 (collect at C:/Users/<>/<>/<>/main.py:<>) failed in 7.506 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, host.docker.internal, executor driver): java.net.SocketException: Connection reset
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:252)
at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:271)
at java.base/java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:628)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:621)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, host.docker.internal, executor driver): java.net.SocketException: Connection reset
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:252)
at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:271)
at java.base/java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:628)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:621)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.foreach(Iterator.scala:941)
Driver stacktrace:
20/12/05 14:03:57 INFO DAGScheduler: Job 0 failed: collect at C:/<>/<>/<>/main.py, took 7.541442 s
Traceback (most recent call last):
File "C:/<>/<>/<>/main.py", line 68, in main
return res.collect()
File "C:\Users\<>\<>\<>\venv\lib\site-packages\pyspark\rdd.py", line 889, in collect
sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "C:\Users\<>\<>\<>\venv\lib\site-packages\py4j\java_gateway.py", line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\Users\<>\<>\<>\venv\lib\site-packages\pyspark\sql\utils.py", line 128, in deco
return f(*a, **kw)
File "C:\Users\<>\<>\<>\venv\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
The error is on the model.fit() line. It only happens when I call model.fit(); if I comment it out and put something else there, it works perfectly fine. I'm unsure why it fails on model.fit().

How to get the exception type of a nested exception in Java?

I want to perform some action if my code gets an org.apache.kafka.clients.consumer.OffsetOutOfRangeException. I tried this check:
if (e.getCause().getCause() instanceof OffsetOutOfRangeException)
but I am still getting a SparkException, not an OffsetOutOfRangeException.
ERROR Driver:86 - Error in executing stream
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 11, localhost, executor 0): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {dns_data-0=23245772}
at org.apache.kafka.clients.consumer.internals.Fetcher.parseFetchedData(Fetcher.java:588)
at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:354)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1000)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:136)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:68)
at org.apache.spark.streaming.kafka010.KafkaRDDIterator.next(KafkaRDD.scala:271)
at org.apache.spark.streaming.kafka010.KafkaRDDIterator.next(KafkaRDD.scala:231)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:393)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
Caused by: org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {dns_data-0=23245772}
at org.apache.kafka.clients.consumer.internals.Fetcher.parseFetchedData(Fetcher.java:588)
at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:354)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1000)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:136)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:68)
at org.apache.spark.streaming.kafka010.KafkaRDDIterator.next(KafkaRDD.scala:271)
at org.apache.spark.streaming.kafka010.KafkaRDDIterator.next(KafkaRDD.scala:231)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
Try the condition below:
e.getCause().getClass().equals(OffsetOutOfRangeException.class)
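If the exception can be wrapped at varying depths (Spark sometimes nests more than one SparkException), a more defensive variant is to walk the whole cause chain instead of assuming a fixed depth. This is my own sketch, not part of the answer above:

public class CauseChainCheck {
    // Returns true if any exception in the cause chain is an instance of the given type.
    static boolean hasCause(Throwable t, Class<? extends Throwable> type) {
        for (Throwable cur = t; cur != null; cur = (cur.getCause() == cur ? null : cur.getCause())) {
            if (type.isInstance(cur)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Simulate a doubly wrapped exception like the Spark/Kafka case above.
        Exception nested = new RuntimeException("outer",
                new RuntimeException("middle", new IllegalStateException("root")));
        System.out.println(hasCause(nested, IllegalStateException.class)); // true
        System.out.println(hasCause(nested, NullPointerException.class));  // false
    }
}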

Spark saveAsNewAPIHadoopFile java.io.IOException: Could not find a serializer for the Value class

I'm trying to store a java pair RDD as a Hadoop sequence file as follows:
JavaPairRDD<ImmutableBytesWritable, Put> putRdd = ...
config.set("io.serializations","org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
putRdd.saveAsNewAPIHadoopFile(outputPath, ImmutableBytesWritable.class, Put.class, SequenceFileOutputFormat.class, config);
But I get the following exception even though I'm setting io.serializations:
2017-04-06 14:39:32,623 ERROR [Executor task launch worker-0] executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: Could not find a serializer for the Value class: 'org.apache.hadoop.hbase.client.Put'. Please ensure that the configuration 'io.serializations' is properly configured, if you're using custom serialization.
at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:1192)
at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:1094)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:273)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:530)
at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.getSequenceWriter(SequenceFileOutputFormat.java:64)
at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.getRecordWriter(SequenceFileOutputFormat.java:75)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1030)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1014)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2017-04-06 14:39:32,669 ERROR [task-result-getter-0] scheduler.TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Any idea how I can fix this?
I found the fix: apparently Put (and all HBase mutations) has a specific serializer, MutationSerialization.
The following line fixes the issue:
config.setStrings("io.serializations",
        config.get("io.serializations"),
        MutationSerialization.class.getName(),
        ResultSerialization.class.getName());
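For completeness, here is a self-contained version of that fix as I understand it (my assumption: both serializers are the ones shipped in HBase's org.apache.hadoop.hbase.mapreduce package, and the resulting config is then passed to saveAsNewAPIHadoopFile as in the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.MutationSerialization;
import org.apache.hadoop.hbase.mapreduce.ResultSerialization;

public class HBaseSerializationConfig {
    // Appends the HBase Mutation/Result serializers to whatever io.serializations already contains.
    static Configuration withHBaseSerializers(Configuration config) {
        config.setStrings("io.serializations",
                config.get("io.serializations"),
                MutationSerialization.class.getName(),
                ResultSerialization.class.getName());
        return config;
    }

    public static void main(String[] args) {
        Configuration config = withHBaseSerializers(HBaseConfiguration.create());
        System.out.println(config.get("io.serializations"));
    }
}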
