S3 Checkpoint with Structured Streaming - java

I have tried the suggestions given in the Apache Spark (Structured Streaming) : S3 Checkpoint support
I am still facing this issue. Below is the error i get
17/07/06 17:04:56 WARN FileSystem: "s3n" is a deprecated filesystem
name. Use "hdfs://s3n/" instead.
Exception in thread "main" java.lang.IllegalArgumentException:
java.net.UnknownHostException: s3n
I have something like this as part of my code
SparkSession spark = SparkSession
.builder()
.master("local[*]")
.config("spark.hadoop.fs.defaultFS","s3")
.config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
.config("spark.hadoop.fs.s3n.awsAccessKeyId","<my-key>")
.config("spark.hadoop.fs.s3n.awsSecretAccessKey","<my-secret-key>")
.appName("My Spark App")
.getOrCreate();
and then checkpoint directory is being used like this:
StreamingQuery line = topicValue.writeStream()
.option("checkpointLocation","s3n://<my-bucket>/checkpointLocation/")
Any help is appreciated. Thanks in advance!

For checkpointing support of S3 in Structured Streaming you can try following way:
SparkSession spark = SparkSession
.builder()
.master("local[*]")
.appName("My Spark App")
.getOrCreate();
spark.sparkContext.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<my-key>")
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<my-secret-key>")
and then checkpoint directory can be like this:
StreamingQuery line = topicValue.writeStream()
.option("checkpointLocation","s3n://<my-bucket>/checkpointLocation/")
I hope this helps!

Related

Using azure blob storage for spark checkpointing

I am trying to use azure storage as checkpoint location in my spark structured streaming application.
I have seen few articles which talks about reading/writing to azure storage, but I have not seen anyone explaining about using azure storage as checkpoint location. Following is my simple code, reading from one kafka topic and writing back to another topic, added checkpoint location.
SparkConf conf = new SparkConf().setMaster("local[*]");
conf.set(
"fs.azure.account.key.<storage-name>.blob.core.windows.net",
"<storage-key>");
conf.set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
SparkSession spark = SparkSession.builder().appName("app-name").config(conf).getOrCreate();
Dataset<Row> df = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "input")
.load();
StreamingQuery ds = df
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "output")
.option("checkpointLocation", "wasbs://<container-name>#<storage-account-name>.blob.core.windows.net/<directory-name>")
.start();
ds.awaitTermination();
Azure connection details are correct. When I run this application, I could see one file(metadata) getting created at the specified azure storage location. However, app crashes after few seconds. Below is the exception.
Exception in thread "main" java.lang.IllegalArgumentException: Self-suppression not permitted
at java.lang.Throwable.addSuppressed(Throwable.java:1043)
at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
at org.apache.hadoop.fs.azure.NativeAzureFileSystem$NativeAzureFsOutputStream.close(NativeAzureFileSystem.java:818)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
at org.apache.commons.io.IOUtils.closeQuietly(IOUtils.java:339)
at org.apache.commons.io.IOUtils.closeQuietly(IOUtils.java:298)
at org.apache.spark.sql.execution.streaming.StreamMetadata$.write(StreamMetadata.scala:85)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.apply(StreamExecution.scala:124)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.apply(StreamExecution.scala:122)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.streaming.StreamExecution.<init>(StreamExecution.scala:122)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.<init>(MicroBatchExecution.scala:49)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:258)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:299)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:296)
at com.test.Test.main(Test.java:73)
Caused by: java.io.IOException: Stream is already closed.
at com.microsoft.azure.storage.blob.BlobOutputStreamInternal.close(BlobOutputStreamInternal.java:332)
at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
at org.apache.hadoop.fs.azure.NativeAzureFileSystem$NativeAzureFsOutputStream.close(NativeAzureFileSystem.java:818)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
at sun.nio.cs.StreamEncoder.implClose(StreamEncoder.java:320)
at sun.nio.cs.StreamEncoder.close(StreamEncoder.java:149)
at java.io.OutputStreamWriter.close(OutputStreamWriter.java:233)
at com.fasterxml.jackson.core.json.WriterBasedJsonGenerator.close(WriterBasedJsonGenerator.java:883)
at com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:3561)
at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:2909)
at org.json4s.jackson.Serialization$.write(Serialization.scala:27)
at org.apache.spark.sql.execution.streaming.StreamMetadata$.write(StreamMetadata.scala:78)
... 9 more
Let me know If anything needs to be configured to enable azure storage as checkpoint location or any version conflicts creating this problem.
Spark: 2.3.0
hadoop-azure : 2.7
azure-storage : 8.0

Spark Mongo DB Connection - MongoDB Version lesser than 3.2

I am trying to connect to MongoDB using Spark. (Java Spark API)
When trying to run submit the job, it fails for the with the below error message ,
20/07/05 17:32:00 ERROR DefaultMongoPartitioner:
---------------------------------------- WARNING: MongoDB version < 3.2 detected.
----------------------------------------
With legacy MongoDB installations you will need to explicitly configure the Spark Connector with a partitioner.
This can be done by: * Setting a "spark.mongodb.input.partitioner" in SparkConf. * Setting in the "partitioner" parameter in ReadConfig. * Passing the "partitioner" option to the DataFrameReader.
The following Partitioners are available:
* MongoShardedPartitioner - for sharded clusters, requires read access to the config database. * MongoSplitVectorPartitioner - for single nodes or replicaSets. Utilises the SplitVector command on the primary. * MongoPaginateByCountPartitioner - creates a specific number of partitions. Slow as requires a query for every partition. * MongoPaginateBySizePartitioner - creates partitions based on data size. Slow as requires a query for every partition.
Exception in thread "main" java.lang.UnsupportedOperationException: The DefaultMongoPartitioner requires MongoDB >= 3.2
at com.mongodb.spark.rdd.partitioner.DefaultMongoPartitioner.partitions(DefaultMongoPartitioner.scala:58)
at com.mongodb.spark.rdd.MongoRDD.getPartitions(MongoRDD.scala:137)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1928)
at org.apache.spark.rdd.RDD.count(RDD.scala:1164)
at org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:440)
at org.apache.spark.api.java.AbstractJavaRDDLike.count(JavaRDDLike.scala:46)
at com.virtualpairprogrammers.JavaIntroduction.main(JavaIntroduction.java:30)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:750)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I have tried the following options , but still throws the same error message,
SparkConf conf = new SparkConf()
.setAppName("MongoSparkConnectorTour")
.set("spark.app.id", "MongoSparkConnectorTour")
.set("spark.mongodb.input.uri", uri)
.set("spark.mongodb.output.uri", uri)
.set("partitioner", "MongoPaginateBySizePartitioner")
.set("spark.mongodb.input.partitionerOptions.partitionSizeMB", "64");
Let me know if i am missing something in here which is why it is throwing the error message.
Not able to identify why it still goes to DefaultMongoPartitioner
Thanks in Advance,
Sam
There was an issue with parameters sent,
Below are the correct format
SparkConf conf = new SparkConf()
.setAppName("MongoSparkConnectorTour")
.set("spark.app.id", "MongoSparkConnectorTour")
.set("spark.mongodb.input.uri", uri)
.set("spark.mongodb.output.uri", uri)
.set("spark.mongodb.input.partitioner", "MongoPaginateBySizePartitioner")
.set("spark.mongodb.input.partitionerOptions.partitionSizeMB", "64");
The issue is solved and i am able to connect and extract data without any issues

Kafka Streams error "TaskAssignmentException: unable to decode subscription data: version=4"

During deployment with only changed Kafka-Streams version from 1.1.1 to 2.x.x (without changing application.id), we got exceptions on app node with older Kafka-Streams version and, as a result, Kafka streams changed state to error and closed, meanwhile app node with new Kafka-Streams version consumes messages fine.
If we upgrade from 1.1.1 to 2.0.0, got error unable to decode subscription data: version=3; if from 1.1.1 to 2.3.0: unable to decode subscription data: version=4.
It might be really painful during canary deployment, e.g. we have 3 app nodes with previous Kafka-Streams version, and when we add one more node with a new version, all existing 3 nodes will be in error state. Error stack trace:
TaskAssignmentException: unable to decode subscription data: version=4
at org.apache.kafka.streams.processor.internals.assignment.SubscriptionInfo.decode(SubscriptionInfo.java:128)
at org.apache.kafka.streams.processor.internals.StreamPartitionAssignor.assign(StreamPartitionAssignor.java:297)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.performAssignment(ConsumerCoordinator.java:358)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.onJoinLeader(AbstractCoordinator.java:520)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.access$1100(AbstractCoordinator.java:93)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$JoinGroupResponseHandler.handle(AbstractCoordinator.java:472)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$JoinGroupResponseHandler.handle(AbstractCoordinator.java:455)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:822)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:802)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:204)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:167)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:127)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:563)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:390)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:293)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:193)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:364)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:316)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:290)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1149)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1115)
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:831)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:788)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:749)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:719)
Issue is reproducible on both Kafka broker versions 1.1.0 and 2.1.1, even with the simple Kafka-Streams DSL example:
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("default.key.serde", "org.apache.kafka.common.serialization.Serdes$StringSerde");
props.put("default.value.serde", "org.apache.kafka.common.serialization.Serdes$StringSerde");
props.put("application.id", "xxx");
StreamsBuilder streamsBuilder = new StreamsBuilder();
streamsBuilder.<String, String>stream("source")
.mapValues(value -> value + value)
.to("destination");
KafkaStreams kafkaStreams = new KafkaStreams(streamsBuilder.build(), props);
Is it a bug of kafka-streams? Does exist any workaround to prevent such failure?

Error connecting Spark local cluster

I am trying to run the following code in my local mac where a spark cluster with master and slaves are running
public void run(String inputFilePath) {
String master = "spark://192.168.1.199:7077";
SparkConf conf = new SparkConf()
.setAppName(WordCountTask.class.getName())
.setMaster(master);
JavaSparkContext context = new JavaSparkContext(conf);
context.textFile(inputFilePath)
.flatMap(text -> Arrays.asList(text.split(" ")).iterator())
.mapToPair(word -> new Tuple2<>(word, 1))
.reduceByKey((a, b) -> a + b)
.foreach(result -> LOGGER.info(
String.format("Word [%s] count [%d].", result._1(), result._2)));
}
}
However I get the following exception both in the master console and
Error while invoking RpcHandler#receive() on RPC id
5655526795459682754 java.io.EOFException
and in the program console
18/07/01 22:35:19 WARN StandaloneAppClient$ClientEndpoint: Failed to
connect to master 192.168.1.199:7077 org.apache.spark.SparkException:
Exception thrown in awaitResult
This runs well when I set the master as "local[*]" as given in this example.
I have seen examples where the jar is submited with spark-submit command but I am trying to run it programatically.
Just realised the version of Spark was different in the master/slave and the POM file of the code. Bumped up the version in the pom.xml to match the spark cluster and it worked.

SparkStreaming put a kafka message into Hbase [duplicate]

I am trying to use the HBase Java APIs to write data into HBase. I installed Hadoop/HBase through Ambari.
Here is how the configuration is currently set up:
final Configuration CONFIGURATION = HBaseConfiguration.create();
final HBaseAdmin HBASE_ADMIN;
HBASE_ADMIN = new HBaseAdmin(CONFIGURATION)
When I try to write to HBase, I check to make sure that the table exists
!HBASE_ADMIN.tableExists(tableName)
If not, create a new one. However, it appears that when attempting to check if the table exists exceptions are being thrown.
I'm wondering if I'm not correctly connected to HBase...is there any good way to verify that the configuration is correct and that I am connecting to HBase? The exception I'm getting is below.
Thanks.
java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:209)
at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:288)
at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:268)
at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:140)
at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:135)
at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:597)
at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:802)
at org.apache.hadoop.hbase.catalog.MetaReader.tableExists(MetaReader.java:359)
at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:287)
at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:301)
at com.business.project.hbase.HBaseMessageWriter.getTable(HBaseMessageWriter.java:40)
at com.business.project.hbase.HBaseMessageWriter.write(HBaseMessageWriter.java:59)
at com.business.project.hbase.HBaseMessageWriter.write(HBaseMessageWriter.java:54)
at com.business.project.storm.bolt.package.exampleBolt.execute(exampleBolt.java:19)
at backtype.storm.daemon.executor$fn__5697$tuple_action_fn__5699.invoke(executor.clj:659)
at backtype.storm.daemon.executor$mk_task_receiver$fn__5620.invoke(executor.clj:415)
at backtype.storm.disruptor$clojure_handler$reify__1741.onEvent(disruptor.clj:58)
at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125)
at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99)
at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80)
at backtype.storm.daemon.executor$fn__5697$fn__5710$fn__5761.invoke(executor.clj:794)
at backtype.storm.util$async_loop$fn__452.invoke(util.clj:465)
at clojure.lang.AFn.run(AFn.java:24)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.getMetaReplicaNodes(ZooKeeperWatcher.java:269)
at org.apache.hadoop.hbase.zookeeper.MetaRegionTracker.blockUntilAvailable(MetaRegionTracker.java:241)
at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:62)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateMeta(ConnectionManager.java:1203)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1164)
at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:294)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:130)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:55)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:201)
In addition to the configuration parameters suggested by Yosr, specifying
conf.set("zookeeper.znode.parent", "VALUE")
would help resolve the issue.
The property below resolved my issue
For Hortonworks:
hconfig.set("zookeeper.znode.parent", "/hbase-unsecure")
For cloudera:
hconfig.set("zookeeper.znode.parent", "/hbase")
You can use HBaseAdmin.checkHBaseAvailable(conf);
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.master", "ip_address:60000");
conf.set("hbase.zookeeper.quorum","ip_address");
conf.set("hbase.zookeeper.property.clientPort", "2181");
HBaseAdmin admin = new HBaseAdmin(conf);
boolean bool = admin.tableExists("table_name");
System.out.println( bool);
ip_address : this is the ip_adress of your hbase cluster, change your hbase zookeeper port (2181) if it is not the same on your configuration files.

Categories