What we are trying to do: we are evaluating Flink to perform batch processing using DataStream API in BATCH mode.
Minimal application to reproduce the issue:
FileSystem.initialize(GlobalConfiguration.loadConfiguration(System.getenv("FLINK_CONF_DIR")))
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setRuntimeMode(RuntimeExecutionMode.BATCH)
val inputStream = env.fromSource(
FileSource.forRecordStreamFormat(new TextLineFormat(), new Path("s3://testtest/2022/04/12/")).build(), WatermarkStrategy.noWatermarks()
.withTimestampAssigner(new SerializableTimestampAssigner[String]() {
override def extractTimestamp(element: String, recordTimestamp: Long): Long = -1
}), "MySourceName"
)
.map(str => {
val jsonNode = JsonUtil.getJSON(str)
val log = JsonUtil.getJSONString(jsonNode, "log")
if (StringUtils.isNotBlank(log)) {
log
} else {
""
}
})
.filter(StringUtils.isNotBlank(_))
val sink: FileSink[BaseLocation] = FileSink
// .forBulkFormat(new Path("/Users/temp/flinksave"), AvroWriters.forSpecificRecord(classOf[BaseLocation]))
.forBulkFormat(new Path("s3://testtest/avro"), AvroWriters.forSpecificRecord(classOf[BaseLocation]))
.withRollingPolicy(OnCheckpointRollingPolicy.build())
.withOutputFileConfig(config)
.build()
inputStream.map(data => {
val baseLocation = new BaseLocation()
baseLocation.setRegion(data)
baseLocation
}).sinkTo(sink)
inputStream.print("input:")
env.execute()
Flink version: 1.14.2
The program executes normally when the path is local.
The program does not report an error when the path is changed to s3://, but I do not see any files being written to S3 either.
This problem does not occur in standalone mode, only in my local development environment (JetBrains IntelliJ IDEA). Is it because I am missing some configuration? I have already configured flink-config.yaml like this:
s3.access-key: test
s3.secret-key: test
s3.endpoint: http://127.0.0.1:39000
Log:
18:42:25,524 INFO org.apache.flink.connector.base.source.reader.SourceReaderBase [] - Finished reading split(s) [0000000002]
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.SourceReaderBase [] - Finished reading split(s) [0000000001]
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager [] - Closing splitFetcher 0 because it is idle.
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager [] - Closing splitFetcher 0 because it is idle.
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher [] - Shutting down split fetcher 0
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher [] - Shutting down split fetcher 0
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher [] - Split fetcher 0 exited.
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher [] - Split fetcher 0 exited.
18:42:25,525 INFO org.apache.flink.connector.file.src.impl.StaticFileSplitEnumerator [] - Subtask 11 (on host '') is requesting a file source split
18:42:25,525 INFO org.apache.flink.connector.file.src.impl.StaticFileSplitEnumerator [] - No more splits available for subtask 11
18:42:25,525 INFO org.apache.flink.connector.file.src.impl.StaticFileSplitEnumerator [] - Subtask 8 (on host '') is requesting a file source split
18:42:25,525 INFO org.apache.flink.connector.file.src.impl.StaticFileSplitEnumerator [] - No more splits available for subtask 8
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.SourceReaderBase [] - Reader received NoMoreSplits event.
18:42:25,526 INFO org.apache.flink.connector.base.source.reader.SourceReaderBase [] - Reader received NoMoreSplits event.
I have written a Spark Streaming consumer to consume data from Kafka, and I found some weird behavior in my logs. The Kafka topic has 3 partitions, and the Spark Streaming job launches one executor for each partition.
The first executor always takes the parameters I provided when creating the streaming context, but the executors with ID 2 and 3 always override the Kafka parameters.
20/01/14 12:15:05 WARN StreamingContext: Dynamic Allocation is enabled for this application. Enabling Dynamic allocation for Spark Streaming applications can cause data loss if Write Ahead Log is not enabled for non-replayable sources like Flume. See the programming guide for details on how to enable the Write Ahead Log.
20/01/14 12:15:05 INFO FileBasedWriteAheadLog_ReceivedBlockTracker: Recovered 2 write ahead log files from hdfs://tlabnamenode/checkpoint/receivedBlockMetadata
20/01/14 12:15:05 INFO DirectKafkaInputDStream: Slide time = 5000 ms
20/01/14 12:15:05 INFO DirectKafkaInputDStream: Storage level = Serialized 1x Replicated
20/01/14 12:15:05 INFO DirectKafkaInputDStream: Checkpoint interval = null
20/01/14 12:15:05 INFO DirectKafkaInputDStream: Remember interval = 5000 ms
20/01/14 12:15:05 INFO DirectKafkaInputDStream: Initialized and validated org.apache.spark.streaming.kafka010.DirectKafkaInputDStream@12665f3f
20/01/14 12:15:05 INFO ForEachDStream: Slide time = 5000 ms
20/01/14 12:15:05 INFO ForEachDStream: Storage level = Serialized 1x Replicated
20/01/14 12:15:05 INFO ForEachDStream: Checkpoint interval = null
20/01/14 12:15:05 INFO ForEachDStream: Remember interval = 5000 ms
20/01/14 12:15:05 INFO ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream@a4d83ac
20/01/14 12:15:05 INFO ConsumerConfig: ConsumerConfig values:
auto.commit.interval.ms = 5000
auto.offset.reset = latest
bootstrap.servers = [1,2,3]
check.crcs = true
client.id = client-0
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = telemetry-streaming-service
heartbeat.interval.ms = 3000
interceptor.classes = []
internal.leave.group.on.close = true
isolation.level = read_uncommitted
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
Here is the log for other executors.
20/01/14 12:15:04 INFO Executor: Starting executor ID 2 on host 1
20/01/14 12:15:04 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40324.
20/01/14 12:15:04 INFO NettyBlockTransferService: Server created on 1
20/01/14 12:15:04 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/01/14 12:15:04 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(2, matrix-hwork-data-05, 40324, None)
20/01/14 12:15:04 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(2, matrix-hwork-data-05, 40324, None)
20/01/14 12:15:04 INFO BlockManager: external shuffle service port = 7447
20/01/14 12:15:04 INFO BlockManager: Registering executor with local external shuffle service.
20/01/14 12:15:04 INFO TransportClientFactory: Successfully created connection to matrix-hwork-data-05/10.83.34.25:7447 after 1 ms (0 ms spent in bootstraps)
20/01/14 12:15:04 INFO BlockManager: Initialized BlockManager: BlockManagerId(2, matrix-hwork-data-05, 40324, None)
20/01/14 12:15:19 INFO CoarseGrainedExecutorBackend: Got assigned task 1
20/01/14 12:15:19 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
20/01/14 12:15:19 INFO TorrentBroadcast: Started reading broadcast variable 0
20/01/14 12:15:19 INFO TransportClientFactory: Successfully created connection to matrix-hwork-data-05/10.83.34.25:38759 after 2 ms (0 ms spent in bootstraps)
20/01/14 12:15:20 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 8.1 KB, free 6.2 GB)
20/01/14 12:15:20 INFO TorrentBroadcast: Reading broadcast variable 0 took 163 ms
20/01/14 12:15:20 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 17.9 KB, free 6.2 GB)
20/01/14 12:15:20 INFO KafkaRDD: Computing topic telemetry, partition 1 offsets 237352170 -> 237352311
20/01/14 12:15:20 INFO CachedKafkaConsumer: Initializing cache 16 64 0.75
20/01/14 12:15:20 INFO CachedKafkaConsumer: Cache miss for CacheKey(spark-executor-telemetry-streaming-service,telemetry,1)
20/01/14 12:15:20 INFO ConsumerConfig: ConsumerConfig values:
auto.commit.interval.ms = 5000
auto.offset.reset = none
bootstrap.servers = [1,2,3]
check.crcs = true
client.id = client-0
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
If we look closely, in the first executor auto.offset.reset is latest, but for the other executors auto.offset.reset = none.
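For context, the difference between the two logs above looks consistent with what the spark-streaming-kafka-0-10 integration does for executor-side consumers, as far as I can tell from its source (`KafkaUtils.fixKafkaParams`): it forces `auto.offset.reset` to `none`, forces `enable.auto.commit` to `false`, and prefixes the group id with `spark-executor-` (which matches the `CacheKey(spark-executor-telemetry-streaming-service,...)` line in the executor log). The following is only a plain-Java sketch of that rewrite, not Spark's actual code; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class ExecutorKafkaParamsSketch {

    // Hypothetical helper: returns the parameters an executor-side consumer
    // would end up with. It mimics the behaviour seen in the logs and leaves
    // the driver-side map untouched.
    static Map<String, Object> forExecutor(Map<String, Object> driverParams) {
        Map<String, Object> executorParams = new HashMap<>(driverParams);
        executorParams.put("enable.auto.commit", false);
        executorParams.put("auto.offset.reset", "none");
        executorParams.put("group.id", "spark-executor-" + driverParams.get("group.id"));
        return executorParams;
    }

    public static void main(String[] args) {
        // Same values as the kafkaParams used by the driver in this question.
        Map<String, Object> driverParams = new HashMap<>();
        driverParams.put("group.id", "telemetry-streaming-service");
        driverParams.put("auto.offset.reset", "latest");
        driverParams.put("enable.auto.commit", false);

        Map<String, Object> executorParams = forExecutor(driverParams);
        System.out.println("driver   auto.offset.reset = " + driverParams.get("auto.offset.reset"));
        System.out.println("executor auto.offset.reset = " + executorParams.get("auto.offset.reset"));
        System.out.println("executor group.id          = " + executorParams.get("group.id"));
    }
}
```

If that reading is right, the override would be intentional rather than a misconfiguration: the executors only fetch offset ranges that the driver has already computed, so a reset policy on the executor side should never be needed.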
Here is how I am creating the streaming context
public void init() throws Exception {
final String BOOTSTRAP_SERVERS = PropertyFileReader.getInstance()
.getProperty("spark.streaming.kafka.broker.list");
final String DYNAMIC_ALLOCATION_ENABLED = PropertyFileReader.getInstance()
.getProperty("spark.streaming.dynamicAllocation.enabled");
final String DYNAMIC_ALLOCATION_SCALING_INTERVAL = PropertyFileReader.getInstance()
.getProperty("spark.streaming.dynamicAllocation.scalingInterval");
final String DYNAMIC_ALLOCATION_MIN_EXECUTORS = PropertyFileReader.getInstance()
.getProperty("spark.streaming.dynamicAllocation.minExecutors");
final String DYNAMIC_ALLOCATION_MAX_EXECUTORS = PropertyFileReader.getInstance()
.getProperty("spark.streaming.dynamicAllocation.maxExecutors");
final String DYNAMIC_ALLOCATION_EXECUTOR_IDLE_TIMEOUT = PropertyFileReader.getInstance()
.getProperty("spark.streaming.dynamicAllocation.executorIdleTimeout");
final String DYNAMIC_ALLOCATION_CACHED_EXECUTOR_IDLE_TIMEOUT = PropertyFileReader.getInstance()
.getProperty("spark.streaming.dynamicAllocation.cachedExecutorIdleTimeout");
final String SPARK_SHUFFLE_SERVICE_ENABLED = PropertyFileReader.getInstance()
.getProperty("spark.shuffle.service.enabled");
final String SPARK_LOCALITY_WAIT = PropertyFileReader.getInstance().getProperty("spark.locality.wait");
final String SPARK_KAFKA_CONSUMER_POLL_INTERVAL = PropertyFileReader.getInstance()
.getProperty("spark.streaming.kafka.consumer.poll.ms");
final String SPARK_KAFKA_MAX_RATE_PER_PARTITION = PropertyFileReader.getInstance()
.getProperty("spark.streaming.kafka.maxRatePerPartition");
final String SPARK_BATCH_DURATION_IN_SECONDS = PropertyFileReader.getInstance()
.getProperty("spark.batch.duration.in.seconds");
final String KAFKA_TOPIC = PropertyFileReader.getInstance().getProperty("spark.streaming.kafka.topic");
LOGGER.debug("connecting to brokers ::" + BOOTSTRAP_SERVERS);
LOGGER.debug("bootstrapping properties to create consumer");
kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", BOOTSTRAP_SERVERS);
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "telemetry-streaming-service");
kafkaParams.put("auto.offset.reset", "latest");
kafkaParams.put("enable.auto.commit", false);
kafkaParams.put("client.id","client-0");
// Below property should be enabled in properties and changed based on
// performance testing
kafkaParams.put("max.poll.records",
PropertyFileReader.getInstance().getProperty("spark.streaming.kafka.max.poll.records"));
LOGGER.info("registering as a consumer with the topic :: " + KAFKA_TOPIC);
topics = Arrays.asList(KAFKA_TOPIC);
sparkConf = new SparkConf()
// .setMaster(PropertyFileReader.getInstance().getProperty("spark.master.url"))
.setAppName(PropertyFileReader.getInstance().getProperty("spark.application.name"))
.set("spark.streaming.dynamicAllocation.enabled", DYNAMIC_ALLOCATION_ENABLED)
.set("spark.streaming.dynamicAllocation.scalingInterval", DYNAMIC_ALLOCATION_SCALING_INTERVAL)
.set("spark.streaming.dynamicAllocation.minExecutors", DYNAMIC_ALLOCATION_MIN_EXECUTORS)
.set("spark.streaming.dynamicAllocation.maxExecutors", DYNAMIC_ALLOCATION_MAX_EXECUTORS)
.set("spark.streaming.dynamicAllocation.executorIdleTimeout", DYNAMIC_ALLOCATION_EXECUTOR_IDLE_TIMEOUT)
.set("spark.streaming.dynamicAllocation.cachedExecutorIdleTimeout",
DYNAMIC_ALLOCATION_CACHED_EXECUTOR_IDLE_TIMEOUT)
.set("spark.shuffle.service.enabled", SPARK_SHUFFLE_SERVICE_ENABLED)
.set("spark.locality.wait", SPARK_LOCALITY_WAIT)
.set("spark.streaming.kafka.consumer.poll.ms", SPARK_KAFKA_CONSUMER_POLL_INTERVAL)
.set("spark.streaming.kafka.maxRatePerPartition", SPARK_KAFKA_MAX_RATE_PER_PARTITION);
LOGGER.debug("creating streaming context with minutes batch interval ::: " + SPARK_BATCH_DURATION_IN_SECONDS);
streamingContext = new JavaStreamingContext(sparkConf,
Durations.seconds(Integer.parseInt(SPARK_BATCH_DURATION_IN_SECONDS)));
/*
* todo: add checkpointing to the streaming context to recover from driver
* failures and also for offset management
*/
LOGGER.info("checkpointing the streaming transactions at hdfs path :: /checkpoint");
streamingContext.checkpoint("/checkpoint");
streamingContext.addStreamingListener(new DataProcessingListener());
}
@Override
public void execute() throws InterruptedException {
LOGGER.info("started telemetry pipeline executor to consume data");
// Data Consume from the Kafka topic
JavaInputDStream<ConsumerRecord<String, String>> telemetryStream = KafkaUtils.createDirectStream(
streamingContext, LocationStrategies.PreferConsistent(),
ConsumerStrategies.Subscribe(topics, kafkaParams));
telemetryStream.foreachRDD(rawRDD -> {
if (!rawRDD.isEmpty()) {
OffsetRange[] offsetRanges = ((HasOffsetRanges) rawRDD.rdd()).offsetRanges();
LOGGER.debug("list of OffsetRanges getting processed as a string :: "
+ Arrays.asList(offsetRanges).toString());
System.out.println("offsetRanges : " + offsetRanges.length);
SparkSession spark = JavaSparkSessionSingleton.getInstance(rawRDD.context().getConf());
JavaPairRDD<String, String> flattenedRawRDD = rawRDD.mapToPair(record -> {
//LOGGER.debug("flattening JSON record with telemetry json value ::: " + record.value());
ObjectMapper om = new ObjectMapper();
JsonNode root = om.readTree(record.value());
Map<String, JsonNode> flattenedMap = new FlatJsonGenerator(root).flatten();
JsonNode flattenedRootNode = om.convertValue(flattenedMap, JsonNode.class);
//LOGGER.debug("creating Tuple for the JSON record Key :: " + flattenedRootNode.get("/name").asText()
// + ", value :: " + flattenedRootNode.toString());
return new Tuple2<String, String>(flattenedRootNode.get("/name").asText(),
flattenedRootNode.toString());
});
Dataset<Row> rawFlattenedDataRDD = spark
.createDataset(flattenedRawRDD.rdd(), Encoders.tuple(Encoders.STRING(), Encoders.STRING()))
.toDF("sensor_path", "sensor_data");
Dataset<Row> groupedDS = rawFlattenedDataRDD.groupBy(col("sensor_path"))
.agg(collect_list(col("sensor_data").as("sensor_data")));
Dataset<Row> lldpGroupedDS = groupedDS.filter((FilterFunction<Row>) r -> r.getString(0).equals("Cisco-IOS-XR-ethernet-lldp-oper:lldp/nodes/node/neighbors/devices/device"));
LOGGER.info("printing the LLDP GROUPED DS ------------------>");
lldpGroupedDS.show(2);
LOGGER.info("creating telemetry pipeline to process the telemetry data");
HashMap<Object, Object> params = new HashMap<>();
params.put(DPConstants.OTSDB_CONFIG_F_PATH, ExternalizedConfigsReader.getPropertyValueFromCache("/opentsdb.config.file.path"));
params.put(DPConstants.OTSDB_CLIENT_TYPE, ExternalizedConfigsReader.getPropertyValueFromCache("/opentsdb.client.type"));
try {
LOGGER.info("<-------------------processing lldp data and write to hive STARTED ----------------->");
Pipeline lldpPipeline = PipelineFactory.getPipeline(PipelineType.LLDPTELEMETRY);
lldpPipeline.process(lldpGroupedDS, null);
LOGGER.info("<-------------------processing lldp data and write to hive COMPLETED ----------------->");
LOGGER.info("<-------------------processing groupedDS data and write to OPENTSDB STARTED ----------------->");
Pipeline pipeline = PipelineFactory.getPipeline(PipelineType.TELEMETRY);
pipeline.process(groupedDS, params);
LOGGER.info("<-------------------processing groupedDS data and write to OPENTSDB COMPLETED ----------------->");
}catch (Throwable t){
t.printStackTrace();
}
LOGGER.info("commiting offsets after processing the batch");
((CanCommitOffsets) telemetryStream.inputDStream()).commitAsync(offsetRanges);
}
});
streamingContext.start();
streamingContext.awaitTermination();
}
Am I missing something here? Any help is appreciated. Thanks.
I have a controller:
@GetMapping("/old")
public Product getOld() {
Product omeOld = productService.getOneOld();
log.info(String.valueOf(omeOld.getId()));
return omeOld;
}
Service:
@Override
@Transactional
public Product getOneOld() {
Product aNew = productsRepository.findTop1ByStatusOrderByCountAsc("NEW");
try {
Thread.sleep(5000L);
} catch (InterruptedException e) {
e.printStackTrace();
}
return aNew;
}
And repository:
@Repository
public interface ProductsRepository extends JpaRepository<Product, Long> {
Product findTop1ByStatusOrderByCountAsc(String status);
}
I start JMeter and send 5 requests in 5 threads. As a result I get 5 responses after 5 seconds, so each request was processed in about 5 seconds. But in the log I see the following:
2018-09-14 14:04:35.524 INFO 9048 --- [nio-8080-exec-1] c.e.l.demo.controller.ProductController : 1
2018-09-14 14:04:35.525 INFO 9048 --- [nio-8080-exec-2] c.e.l.demo.controller.ProductController : 1
2018-09-14 14:04:35.532 INFO 9048 --- [nio-8080-exec-3] c.e.l.demo.controller.ProductController : 1
2018-09-14 14:04:35.534 INFO 9048 --- [nio-8080-exec-4] c.e.l.demo.controller.ProductController : 1
2018-09-14 14:04:35.534 INFO 9048 --- [nio-8080-exec-6] c.e.l.demo.controller.ProductController : 1
Each thread selects the same row and processes it. I need the first thread to select the first row, the second thread to select the second row, and so on. I tried using @Lock(LockModeType.PESSIMISTIC_WRITE):
@Lock(LockModeType.PESSIMISTIC_WRITE)
Product findTop1ByStatusOrderByCountAsc(String status);
Now when I start JMeter I get the following behavior:
The first thread works for 5 seconds, after that the second thread works for 5 seconds, and so on: 25 seconds for all 5 threads. And in the log:
2018-09-14 14:11:40.564 INFO 13724 --- [nio-8080-exec-5] c.e.l.demo.controller.ProductController : 1
2018-09-14 14:11:45.566 INFO 13724 --- [nio-8080-exec-4] c.e.l.demo.controller.ProductController : 1
2018-09-14 14:11:50.567 INFO 13724 --- [nio-8080-exec-2] c.e.l.demo.controller.ProductController : 1
2018-09-14 14:11:55.568 INFO 13724 --- [nio-8080-exec-1] c.e.l.demo.controller.ProductController : 1
2018-09-14 14:12:00.570 INFO 13724 --- [nio-8080-exec-3] c.e.l.demo.controller.ProductController : 1
All threads select the same row (if I change this row in the first thread, the second thread will not select it when the conditions no longer match).
I tried this:
@Query(value = "Select * from products where status = ?1 order by count asc LIMIT 1 for update", nativeQuery = true)
Product findTop1ByStatusOrderByCountAsc(String status);
The result is the same.
But what I need is for the first thread to select the first row and lock it, and for the second thread to select the next unlocked row and process it. I tried the following:
@Query(value = "Select * from products where status = ?1 order by count asc LIMIT 1 for update of products skip locked", nativeQuery = true)
Product findTop1ByStatusOrderByCountAsc(String status);
And it works fine:
2018-09-14 14:25:00.355 INFO 7904 --- [io-8080-exec-10] c.e.l.demo.controller.ProductController : 4
2018-09-14 14:25:00.355 INFO 7904 --- [nio-8080-exec-4] c.e.l.demo.controller.ProductController : 3
2018-09-14 14:25:00.355 INFO 7904 --- [nio-8080-exec-9] c.e.l.demo.controller.ProductController : 1
2018-09-14 14:25:00.358 INFO 7904 --- [nio-8080-exec-5] c.e.l.demo.controller.ProductController : 5
2018-09-14 14:25:00.359 INFO 7904 --- [nio-8080-exec-2] c.e.l.demo.controller.ProductController : 6
Each select in each thread picks one row from the rows that are not locked!
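To make the behavior I am after explicit, independent of any database: each worker should take the first row in order that nobody else holds, without waiting for locked rows. Below is a minimal in-memory sketch of that "skip locked" idea. It does not talk to a database, and `Row`, `rows` and `claimFirstUnlocked` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicBoolean;

class SkipLockedSketch {

    // Hypothetical stand-in for one row of the products table.
    static final class Row {
        final long id;
        final AtomicBoolean locked = new AtomicBoolean(false);

        Row(long id) {
            this.id = id;
        }
    }

    // Builds rows with ids 1..count, already sorted the way ORDER BY would sort them.
    static List<Row> rows(int count) {
        List<Row> result = new ArrayList<>();
        for (long id = 1; id <= count; id++) {
            result.add(new Row(id));
        }
        return result;
    }

    // Locks and returns the first row that is not locked yet.
    // Rows that are already locked are skipped instead of waited for.
    static Optional<Row> claimFirstUnlocked(List<Row> ordered) {
        for (Row row : ordered) {
            if (row.locked.compareAndSet(false, true)) {
                return Optional.of(row);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws InterruptedException {
        List<Row> table = rows(5);
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            Thread worker = new Thread(() ->
                    claimFirstUnlocked(table).ifPresent(row ->
                            System.out.println(Thread.currentThread().getName()
                                    + " claimed row " + row.id)));
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
    }
}
```

With 5 rows and 5 threads, every thread ends up with a different row, which is the same outcome as in the log above. The order in which threads claim rows is not deterministic.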
But how can I reproduce this with Oracle? In Oracle I cannot write LIMIT 1, and if I use ROWNUM = 1 every thread always selects the same row.
I am trying to understand K-means clustering on an input .csv file which consists of 56376 rows and two columns, with the first column representing an id and the second column a group of words. An example of this data is:
1428951621 do rememb came milan 19 april 2013 maynardmonday 16
1429163429 rt windeerlust sehun hyungluhan yessehun do even rememb day today
The Scala code for processing this data looks like this:
val inputData = sc.textFile("test.csv")
// this is a changeable parameter for the number of clusters to use for kmeans
val numClusters = 4;
// number of iterations for the kmeans
val numIterations = 10;
// this is the size of the vectors to be created by Word2Vec this is tunable
val vectorSize = 600;
val filtereddata = inputData.filter(!_.isEmpty).
map(line=>line.split(",",-1)).
map(line=>(line(1),line(1).split(" ").filter(_.nonEmpty)))
val corpus = inputData.filter(!_.isEmpty).
map(line=>line.split(",",-1)).
map(line=>line(1).split(" ").toSeq)
val values:RDD[Seq[String]] = filtereddata.map(s=>s._2)
val keys = filtereddata.map(s=>s._1)
/*******************Word2Vec and normalisation*****************************/
val w2vec = new Word2Vec().setVectorSize(vectorSize);
val model = w2vec.fit(corpus)
val outtest:RDD[Seq[Vector]]= values.map(x=>x.map(m=>try {
model.transform(m)
} catch {
case e: Exception => Vectors.zeros(vectorSize)
}))
val convertest = outtest.map(m=>m.map(x=>(x.toArray)))
val withkey = keys.zip(convertest)
val filterkey = withkey.filter(!_._2.isEmpty)
val keysfinal= filterkey.map(x=>x._1)
val valfinal= filterkey.map(x=>x._2)
// for each collections of vectors that is one tweet, add the vectors
val reducetest = valfinal.map(x=>x.reduce((a,b)=>a.zip(b).map(t=>t._1+t._2)))
val filtertest = reducetest.map(x=>x.map(m=>(m,x.length)).map(m=>m._1/m._2))
val test = filtertest.map(x=>new DenseVector(x).asInstanceOf[Vector])
val normalizer = new Normalizer()
val data1= test.map(x=>(normalizer.transform(x)))
/*********************Clustering Algorithm***********************************/
val clusters = KMeans.train(data1,numClusters,numIterations)
val predictions= clusters.predict(data1)
val clustercount= keysfinal.zip(predictions).distinct.map(s=>(s._2,1)).reduceByKey(_+_)
val result= keysfinal.zip(predictions).distinct
result.saveAsTextFile(fileToSaveResults)
val wsse = clusters.computeCost(data1)
println(s"The number of clusters is $numClusters")
println("The cluster counts are:")
println(clustercount.collect().mkString(" "))
println(s"The wsse is: $wsse")
However, after some iterations it throws a "java.lang.NullPointerException" and exits at stage 36. The error looks like this:
17/10/07 14:42:10 INFO TaskSchedulerImpl: Adding task set 26.0 with 2 tasks
17/10/07 14:42:10 INFO TaskSetManager: Starting task 0.0 in stage 26.0 (TID 50, localhost, partition 0, ANY, 5149 bytes)
17/10/07 14:42:10 INFO TaskSetManager: Starting task 1.0 in stage 26.0 (TID 51, localhost, partition 1, ANY, 5149 bytes)
17/10/07 14:42:10 INFO Executor: Running task 1.0 in stage 26.0 (TID 51)
17/10/07 14:42:10 INFO Executor: Running task 0.0 in stage 26.0 (TID 50)
17/10/07 14:42:10 INFO deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
17/10/07 14:42:10 INFO deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
17/10/07 14:42:10 INFO deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
17/10/07 14:42:10 INFO deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
17/10/07 14:42:10 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
17/10/07 14:42:10 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
17/10/07 14:42:10 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
17/10/07 14:42:10 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
17/10/07 14:42:10 ERROR Executor: Exception in task 0.0 in stage 26.0 (TID 50)
java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
Kindly help me locate the issue in this code, as I am not able to understand it.
Note: this code was written by other people.
I think this has nothing to do with your code. This exception is thrown if one of the arguments passed to the ProcessBuilder is null. So I guess this must be a configuration issue or a bug in Hadoop.
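To show what I mean, here is a minimal sketch that reproduces the same exception without Hadoop. The class and method names are made up for this example, and the null element merely stands in for whatever value was never configured. My assumption is that in Hadoop's case it would be an unresolved path such as the Hadoop home directory or a helper binary, but I have not verified that for your setup.

```java
import java.util.Arrays;
import java.util.List;

class NullArgumentSketch {

    // Returns true if ProcessBuilder.start() rejects the command with a
    // NullPointerException. Only pass commands that are harmless to run.
    static boolean startThrowsNpe(List<String> command) {
        try {
            Process process = new ProcessBuilder(command).start();
            process.destroy();
            return false;
        } catch (NullPointerException e) {
            // start() checks every element of the command and throws on null.
            return true;
        } catch (java.io.IOException e) {
            // The command could not be launched for some other reason.
            return false;
        }
    }

    public static void main(String[] args) {
        // The null element stands in for a path that was never configured.
        List<String> command = Arrays.asList("echo", null);
        System.out.println("NullPointerException thrown: " + startThrowsNpe(command));
    }
}
```

The exception is raised inside `ProcessBuilder.start()` before any process is launched, which matches the top frame of your stack trace.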
From a quick search for "hadoop java.lang.ProcessBuilder.start nullpointerexception" it seems this is a known problem:
https://www.fachschaft.informatik.tu-darmstadt.de/forum/viewtopic.php?t=34250
Is it possible to run Hadoop jobs (like the WordCount sample) in the local mode on Windows without Cygwin?
I'm trying to use Flume to send data to Spark and then add the data to HBase.
I have tried Flume + Spark + HDFS and it works.
This is the source code:
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import org.apache.flume.source.avro.AvroFlumeEvent;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.*;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.flume.FlumeUtils;
import org.apache.spark.streaming.flume.SparkFlumeEvent;
import com.google.common.collect.Lists;
import com.sun.jersey.core.impl.provider.entity.XMLJAXBElementProvider.Text;
import scala.Tuple2;
import scala.Tuple4;
public class JavaFlumeEventTest {
private static final Pattern SPACE = Pattern.compile(" ");
private static Configuration conf = null;
/**
* initial
*/
static {
conf = HBaseConfiguration.create();
conf.addResource(new Path("file:///etc/hbase/conf/hbase-site.xml"));
conf.addResource(new Path("file:///etc/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("file:///etc/hadoop/conf/core-site.xml"));
conf.addResource(new Path("file:///etc/hadoop/conf/mapred-site.xml"));
conf.addResource(new Path("file:///etc/hadoop/conf/yarn-site.xml"));
conf.set("hbase.zookeeper.quorum", "elephant,tiger,horse");
conf.set("hbase.zookeeper.property.clientPort","2181");
conf.set("hbase.master", "elephant" + ":60000");
conf.set("hbase.cluster.distributed", "true");
conf.set("hbase.rootdir", "hdfs://elephant:8020/hbase");
}
/**
* Add new record
* @param tableName
* @param rowKey
* @param family
* @param qualifier
* @param value
*/
public static void addRecord (String tableName, String rowKey, String family, String qualifier, String value){
try {
System.out.println("===========HTable =========="+conf);
HTable table = new HTable(conf, tableName);
System.out.println("===========put ==========");
Put put = new Put(Bytes.toBytes(rowKey));
System.out.println("===========put Add==========");
put.add(Bytes.toBytes(family),Bytes.toBytes(qualifier),Bytes.toBytes(value));
System.out.println("===========table put ==========");
table.put(put);
System.out.println("insert recored " + rowKey + " to table " + tableName +" ok.");
} catch (IOException e) {
System.out.println("===========IOException ==========");
e.printStackTrace();
}
}
private JavaFlumeEventTest() {
}
public static void main(String[] args) {
String host = args[0];
int port = Integer.parseInt(args[1]);
Duration batchInterval = new Duration(Integer.parseInt(args[2]));
final String tableName = args[3];
final String columnFamily = args[4];
SparkConf sparkConf = new SparkConf()
.setAppName("JavaFlumeEventTest")
.set("spark.executor.memory", "256m");
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, batchInterval);
final Broadcast<String> broadcastTableName = ssc.sparkContext().broadcast(tableName);
final Broadcast<String> broadcastColumnFamily = ssc.sparkContext().broadcast(columnFamily);
JavaReceiverInputDStream<SparkFlumeEvent> flumeStream = FlumeUtils.createStream(ssc, host, port);
JavaDStream<String>
words = flumeStream.flatMap(new FlatMapFunction<SparkFlumeEvent,String>(){
@Override
public Iterable<String> call(SparkFlumeEvent arg0) throws Exception {
String body = new String(arg0.event().getBody().array(), Charset.forName("UTF-8"));
return Lists.newArrayList(SPACE.split(body));
}
});
JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
new PairFunction<String, String, Integer>() {
@Override
public Tuple2<String, Integer> call(String s) {
return new Tuple2<String, Integer>(s, 1);
}
}).reduceByKey(new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}
});
wordCounts.print();
wordCounts.foreach(new Function2<JavaPairRDD<String,Integer>, Time, Void>() {
@Override
public Void call(JavaPairRDD<String, Integer> values,
Time time) throws Exception {
values.foreach(new VoidFunction<Tuple2<String, Integer>> () {
@Override
public void call(Tuple2<String, Integer> tuple){
System.out.println("===========insert record========"+tuple._1()+"=="+tuple._2().toString());
JavaFlumeEventTest.addRecord("mytable","PutInpu",columnFamily,tuple._1(),tuple._2().toString());
System.out.println("===========Done record========"+tuple._1());
}} );
return null;
}});
flumeStream.count().map(new Function<Long, String>() {
@Override
public String call(Long in) {
return "Received " + in + " flume events.";
}
}).print();
ssc.start();
}
}
I exported it as a runnable jar and started it with Spark:
./bin/spark-submit --class JavaFlumeEventTest --master local[15] /home/training/software/JavaFlumeEventTest3.jar elephant 11000 5000 mytable cf
There is no exception, but no data is added to HBase.
I found that the thread stops at:
HTable table = new HTable(conf, tableName);
Here are the Spark terminal logs:
15/02/04 21:36:05 INFO DAGScheduler: Job 72 finished: print at JavaFlumeEventTest.java:139, took 0.056793 s
-------------------------------------------
Time: 1423103765000 ms
-------------------------------------------
(have,3)
(example,,1)
(dependencies,1)
(linked,1)
(1111,28)
(non-Spark,1)
(do,1)
(some,1)
(Hence,,1)
(from,2)
...
15/02/04 21:36:05 INFO JobScheduler: Finished job streaming job 1423103765000 ms.0 from job set of time 1423103765000 ms
15/02/04 21:36:05 INFO JobScheduler: Starting job streaming job 1423103765000 ms.1 from job set of time 1423103765000 ms
15/02/04 21:36:05 INFO SparkContext: Starting job: foreach at JavaFlumeEventTest.java:141
15/02/04 21:36:05 INFO DAGScheduler: Got job 73 (foreach at JavaFlumeEventTest.java:141) with 15 output partitions (allowLocal=false)
15/02/04 21:36:05 INFO DAGScheduler: Final stage: Stage 146(foreach at JavaFlumeEventTest.java:141)
15/02/04 21:36:05 INFO DAGScheduler: Parents of final stage: List(Stage 145)
15/02/04 21:36:05 INFO DAGScheduler: Missing parents: List()
15/02/04 21:36:05 INFO DAGScheduler: Submitting Stage 146 (ShuffledRDD[114] at reduceByKey at JavaFlumeEventTest.java:132), which has no missing parents
15/02/04 21:36:05 INFO MemoryStore: ensureFreeSpace(2544) called with curMem=141969, maxMem=280248975
15/02/04 21:36:05 INFO MemoryStore: Block broadcast_86 stored as values in memory (estimated size 2.5 KB, free 267.1 MB)
15/02/04 21:36:05 INFO MemoryStore: ensureFreeSpace(1862) called with curMem=144513, maxMem=280248975
15/02/04 21:36:05 INFO MemoryStore: Block broadcast_86_piece0 stored as bytes in memory (estimated size 1862.0 B, free 267.1 MB)
15/02/04 21:36:05 INFO BlockManagerInfo: Added broadcast_86_piece0 in memory on localhost:41505 (size: 1862.0 B, free: 267.2 MB)
15/02/04 21:36:05 INFO BlockManagerMaster: Updated info of block broadcast_86_piece0
15/02/04 21:36:05 INFO SparkContext: Created broadcast 86 from getCallSite at DStream.scala:294
15/02/04 21:36:05 INFO DAGScheduler: Submitting 15 missing tasks from Stage 146 (ShuffledRDD[114] at reduceByKey at JavaFlumeEventTest.java:132)
15/02/04 21:36:05 INFO TaskSchedulerImpl: Adding task set 146.0 with 15 tasks
15/02/04 21:36:05 INFO TaskSetManager: Starting task 0.0 in stage 146.0 (TID 466, localhost, PROCESS_LOCAL, 1122 bytes)
15/02/04 21:36:05 INFO TaskSetManager: Starting task 1.0 in stage 146.0 (TID 467, localhost, PROCESS_LOCAL, 1122 bytes)
15/02/04 21:36:05 INFO TaskSetManager: Starting task 2.0 in stage 146.0 (TID 468, localhost, PROCESS_LOCAL, 1122 bytes)
15/02/04 21:36:05 INFO TaskSetManager: Starting task 3.0 in stage 146.0 (TID 469, localhost, PROCESS_LOCAL, 1122 bytes)
15/02/04 21:36:05 INFO TaskSetManager: Starting task 4.0 in stage 146.0 (TID 470, localhost, PROCESS_LOCAL, 1122 bytes)
15/02/04 21:36:05 INFO TaskSetManager: Starting task 5.0 in stage 146.0 (TID 471, localhost, PROCESS_LOCAL, 1122 bytes)
15/02/04 21:36:05 INFO TaskSetManager: Starting task 6.0 in stage 146.0 (TID 472, localhost, PROCESS_LOCAL, 1122 bytes)
15/02/04 21:36:05 INFO TaskSetManager: Starting task 7.0 in stage 146.0 (TID 473, localhost, PROCESS_LOCAL, 1122 bytes)
15/02/04 21:36:05 INFO TaskSetManager: Starting task 8.0 in stage 146.0 (TID 474, localhost, PROCESS_LOCAL, 1122 bytes)
15/02/04 21:36:05 INFO TaskSetManager: Starting task 9.0 in stage 146.0 (TID 475, localhost, PROCESS_LOCAL, 1122 bytes)
15/02/04 21:36:05 INFO TaskSetManager: Starting task 10.0 in stage 146.0 (TID 476, localhost, PROCESS_LOCAL, 1122 bytes)
15/02/04 21:36:05 INFO TaskSetManager: Starting task 11.0 in stage 146.0 (TID 477, localhost, PROCESS_LOCAL, 1122 bytes)
15/02/04 21:36:05 INFO TaskSetManager: Starting task 12.0 in stage 146.0 (TID 478, localhost, PROCESS_LOCAL, 1122 bytes)
15/02/04 21:36:05 INFO TaskSetManager: Starting task 13.0 in stage 146.0 (TID 479, localhost, PROCESS_LOCAL, 1122 bytes)
15/02/04 21:36:05 INFO Executor: Running task 0.0 in stage 146.0 (TID 466)
15/02/04 21:36:05 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/02/04 21:36:05 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
===========insert record========have==3
===========HTable ==========Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hbase-default.xml, hbase-site.xml, file:/etc/hbase/conf/hbase-site.xml, file:/etc/hadoop/conf/hdfs-site.xml, file:/etc/hadoop/conf/core-site.xml, file:/etc/hadoop/conf/mapred-site.xml, file:/etc/hadoop/conf/yarn-site.xml
15/02/04 21:36:05 INFO Executor: Running task 1.0 in stage 146.0 (TID 467)
15/02/04 21:36:05 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/02/04 21:36:05 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
===========insert record========1111==28
===========HTable ==========Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hbase-default.xml, hbase-site.xml, file:/etc/hbase/conf/hbase-site.xml, file:/etc/hadoop/conf/hdfs-site.xml, file:/etc/hadoop/conf/core-site.xml, file:/etc/hadoop/conf/mapred-site.xml, file:/etc/hadoop/conf/yarn-site.xml
15/02/04 21:36:05 INFO Executor: Running task 2.0 in stage 146.0 (TID 468)
...
...
15/02/04 21:36:05 INFO ContextCleaner: Cleaned shuffle 1
15/02/04 21:36:05 INFO ContextCleaner: Cleaned shuffle 0
15/02/04 21:36:05 INFO ZooKeeper: Client environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT
15/02/04 21:36:05 INFO ZooKeeper: Client environment:host.name=elephant
15/02/04 21:36:05 INFO ZooKeeper: Client environment:java.version=1.7.0_45
15/02/04 21:36:05 INFO ZooKeeper: Client environment:java.vendor=Oracle Corporation
15/02/04 21:36:05 INFO ZooKeeper: Client environment:java.home=/usr/java/jdk1.7.0_45-cloudera/jre
15/02/04 21:36:05 INFO ZooKeeper: Client environment:java.class.path=::/home/training/software/spark-1.2.0-bin-hadoop2.3/conf:/home/training/software/spark-1.2.0-bin-hadoop2.3/lib/spark-assembly-1.2.0-hadoop2.3.0.jar:/home/training/software/spark-1.2.0-bin-hadoop2.3/lib/datanucleus-api-jdo-3.2.6.jar:/home/training/software/spark-1.2.0-bin-hadoop2.3/lib/datanucleus-core-3.2.10.jar:/home/training/software/spark-1.2.0-bin-hadoop2.3/lib/datanucleus-rdbms-3.2.9.jar
15/02/04 21:36:05 INFO ZooKeeper: Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
15/02/04 21:36:05 INFO ZooKeeper: Client environment:java.io.tmpdir=/tmp
15/02/04 21:36:05 INFO ZooKeeper: Client environment:java.compiler=<NA>
15/02/04 21:36:05 INFO ZooKeeper: Client environment:os.name=Linux
15/02/04 21:36:05 INFO ZooKeeper: Client environment:os.arch=amd64
15/02/04 21:36:05 INFO ZooKeeper: Client environment:os.version=2.6.32-279.el6.x86_64
15/02/04 21:36:05 INFO ZooKeeper: Client environment:user.name=training
15/02/04 21:36:05 INFO ZooKeeper: Client environment:user.home=/home/training
15/02/04 21:36:05 INFO ZooKeeper: Client environment:user.dir=/home/training/software/spark-1.2.0-bin-hadoop2.3
15/02/04 21:36:05 INFO ZooKeeper: Initiating client connection, connectString=tiger:2181,elephant:2181,horse:2181 sessionTimeout=90000 watcher=hconnection-0x575b43dd, quorum=tiger:2181,elephant:2181,horse:2181, baseZNode=/hbase
15/02/04 21:36:05 INFO RecoverableZooKeeper: Process identifier=hconnection-0x575b43dd connecting to ZooKeeper ensemble=tiger:2181,elephant:2181,horse:2181
15/02/04 21:36:05 INFO ClientCnxn: Opening socket connection to server tiger/192.168.137.12:2181. Will not attempt to authenticate using SASL (unknown error)
15/02/04 21:36:05 INFO ClientCnxn: Socket connection established to tiger/192.168.137.12:2181, initiating session
15/02/04 21:36:05 INFO ClientCnxn: Session establishment complete on server tiger/192.168.137.12:2181, sessionid = 0x24b573f71f00007, negotiated timeout = 40000
15/02/04 21:36:10 INFO JobScheduler: Added jobs for time 1423103770000 ms
15/02/04 21:36:15 INFO JobScheduler: Added jobs for time 1423103775000 ms
15/02/04 21:36:20 INFO JobScheduler: Added jobs for time 1423103780000 ms
By the way, I can add data to HBase from plain Java, but not through Flume and Spark.
Can anyone help me solve this problem?
Thanks!