What we are trying to do: we are evaluating Flink for batch processing using the DataStream API in BATCH mode.
Minimal application to reproduce the issue:
FileSystem.initialize(GlobalConfiguration.loadConfiguration(System.getenv("FLINK_CONF_DIR")))
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setRuntimeMode(RuntimeExecutionMode.BATCH)
val inputStream = env.fromSource(
  FileSource.forRecordStreamFormat(new TextLineFormat(), new Path("s3://testtest/2022/04/12/")).build(),
  WatermarkStrategy.noWatermarks()
    .withTimestampAssigner(new SerializableTimestampAssigner[String]() {
      override def extractTimestamp(element: String, recordTimestamp: Long): Long = -1
    }),
  "MySourceName"
)
.map(str => {
val jsonNode = JsonUtil.getJSON(str)
val log = JsonUtil.getJSONString(jsonNode, "log")
if (StringUtils.isNotBlank(log)) {
log
} else {
""
}
})
.filter(StringUtils.isNotBlank(_))
val sink: FileSink[BaseLocation] = FileSink
// .forBulkFormat(new Path("/Users/temp/flinksave"), AvroWriters.forSpecificRecord(classOf[BaseLocation]))
.forBulkFormat(new Path("s3://testtest/avro"), AvroWriters.forSpecificRecord(classOf[BaseLocation]))
.withRollingPolicy(OnCheckpointRollingPolicy.build())
.withOutputFileConfig(config)
.build()
inputStream.map(data => {
val baseLocation = new BaseLocation()
baseLocation.setRegion(data)
baseLocation
}).sinkTo(sink)
inputStream.print("input:")
env.execute()
Flink version: 1.14.2
The program executes normally when the path is local.
The program does not give an error when the path is changed to s3://, but I do not see any files being written to S3 either.
This problem does not occur in standalone mode, only in the local development environment (JetBrains IDEA). Am I missing some configuration? I have already configured flink-config.yaml like:
s3.access-key: test
s3.secret-key: test
s3.endpoint: http://127.0.0.1:39000
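One thing worth checking when running inside IntelliJ IDEA is whether this S3 configuration is actually picked up, since FLINK_CONF_DIR may not be set for the IDE run configuration. Below is a minimal, untested sketch (written in Java; the Flink classes are the same from Scala) of setting the S3 options programmatically instead of relying on the environment variable. The endpoint and credentials mirror the flink-config.yaml above; the s3.path.style.access option is an assumption that is often needed for MinIO-style endpoints, and the sketch assumes flink-s3-fs-hadoop or flink-s3-fs-presto is on the classpath of the local run.
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.FileSystem;

public class LocalS3Setup {
    // Call once before building the StreamExecutionEnvironment.
    public static void initS3() {
        Configuration conf = new Configuration();
        conf.setString("s3.access-key", "test");
        conf.setString("s3.secret-key", "test");
        conf.setString("s3.endpoint", "http://127.0.0.1:39000");
        conf.setString("s3.path.style.access", "true"); // assumption: usually required for local MinIO-style endpoints
        FileSystem.initialize(conf);
    }
}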
Log:
18:42:25,524 INFO org.apache.flink.connector.base.source.reader.SourceReaderBase [] - Finished reading split(s) [0000000002]
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.SourceReaderBase [] - Finished reading split(s) [0000000001]
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager [] - Closing splitFetcher 0 because it is idle.
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager [] - Closing splitFetcher 0 because it is idle.
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher [] - Shutting down split fetcher 0
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher [] - Shutting down split fetcher 0
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher [] - Split fetcher 0 exited.
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher [] - Split fetcher 0 exited.
18:42:25,525 INFO org.apache.flink.connector.file.src.impl.StaticFileSplitEnumerator [] - Subtask 11 (on host '') is requesting a file source split
18:42:25,525 INFO org.apache.flink.connector.file.src.impl.StaticFileSplitEnumerator [] - No more splits available for subtask 11
18:42:25,525 INFO org.apache.flink.connector.file.src.impl.StaticFileSplitEnumerator [] - Subtask 8 (on host '') is requesting a file source split
18:42:25,525 INFO org.apache.flink.connector.file.src.impl.StaticFileSplitEnumerator [] - No more splits available for subtask 8
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.SourceReaderBase [] - Reader received NoMoreSplits event.
18:42:25,526 INFO org.apache.flink.connector.base.source.reader.SourceReaderBase [] - Reader received NoMoreSplits event.
I have to modify a Dataset<Row> according to some rules that are in a List<Row>.
I want to iterate over the Dataset<Row> columns using Dataset.withColumn(...) as seen in the next example:
(import necessary libraries...)
SparkSession spark = SparkSession
.builder()
.appName("appname")
.config("spark.some.config.option", "some-value")
.getOrCreate();
Dataset<Row> dfToModify = spark.read().table("TableToModify");
List<Row> ListWithInfo = new ArrayList<>();
ListWithInfo.add(0,RowFactory.create("field1", "input1", "output1", "conditionAux1"));
ListWithInfo.add(1,RowFactory.create("field1", "input1", "output1", "conditionAux2"));
ListWithInfo.add(2,RowFactory.create("field1", "input2", "output3", "conditionAux3"));
ListWithInfo.add(3,RowFactory.create("field2", "input3", "output4", "conditionAux4"));
.
.
.
for (Row row : ListWithInfo) {
String field = row.getString(0);
String input = row.getString(1);
String output = row.getString(2);
String conditionAux = row.getString(3);
dfToModify = dfToModify.withColumn(field,
when(dfToModify.col(field).equalTo(input)
.and(dfToModify.col("conditionAuxField").equalTo(conditionAux))
,output)
.otherwise(dfToModify.col(field)));
}
The code does work as it should, but when there are more than 50 "rules" in the List, the program doesn't finish and this output is shown on the screen:
20/01/27 17:48:18 INFO spark.ContextCleaner: Cleaned accumulator 1653
20/01/27 17:48:18 INFO spark.ContextCleaner: Cleaned accumulator 1650
20/01/27 17:48:18 INFO spark.ContextCleaner: Cleaned accumulator 1635
20/01/27 17:48:18 INFO spark.ContextCleaner: Cleaned accumulator 1641
20/01/27 17:48:18 INFO spark.ContextCleaner: Cleaned accumulator 1645
20/01/27 17:48:18 INFO spark.ContextCleaner: Cleaned accumulator 1646
20/01/27 17:48:18 INFO storage.BlockManagerInfo: Removed broadcast_113_piece0 on **************** in memory (size: 14.5 KB, free: 3.0 GB)
20/01/27 17:48:18 INFO storage.BlockManagerInfo: Removed broadcast_113_piece0 on ***************** in memory (size: 14.5 KB, free: 3.0 GB)
20/01/27 17:48:18 INFO spark.ContextCleaner: Cleaned accumulator 1639
20/01/27 17:48:18 INFO spark.ContextCleaner: Cleaned accumulator 1649
20/01/27 17:48:18 INFO spark.ContextCleaner: Cleaned accumulator 1651
20/01/27 17:49:18 INFO spark.ExecutorAllocationManager: Request to remove executorIds: 6
20/01/27 17:49:18 INFO cluster.YarnClientSchedulerBackend: Requesting to kill executor(s) 6
20/01/27 17:49:18 INFO cluster.YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 6
20/01/27 17:49:18 INFO spark.ExecutorAllocationManager: Removing executor 6 because it has been idle for 60 seconds (new desired total will be 0)
20/01/27 17:49:19 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
20/01/27 17:49:19 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 6.
20/01/27 17:49:19 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 0)
20/01/27 17:49:19 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
20/01/27 17:49:19 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
20/01/27 17:49:19 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(6, *********************, 43387, None)
20/01/27 17:49:19 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
20/01/27 17:49:19 INFO cluster.YarnScheduler: Executor 6 on **************** killed by driver.
20/01/27 17:49:19 INFO spark.ExecutorAllocationManager: Existing executor 6 has been removed (new total is 0)
20/01/27 17:49:20 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
20/01/27 17:49:21 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
20/01/27 17:49:22 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
.
.
.
.
Is there any way to make it more efficient using Java Spark (without using a for loop or something similar)?
Finally I used the withColumns method of the Dataset<Row> object. This method needs two arguments:
.withColumns(Seq<String> ColumnsNames, Seq<Column> ColumnsValues);
and the column names in the Seq<String> cannot be duplicated.
The code is as follows:
SparkSession spark = SparkSession
.builder()
.appName("appname")
.config("spark.some.config.option", "some-value")
.getOrCreate();
Dataset<Row> dfToModify = spark.read().table("TableToModify");
List<Row> ListWithInfo = new ArrayList<>();
ListWithInfo.add(0,RowFactory.create("field1", "input1", "output1", "conditionAux1"));
ListWithInfo.add(1,RowFactory.create("field1", "input1", "output1", "conditionAux2"));
ListWithInfo.add(2,RowFactory.create("field1", "input2", "output3", "conditionAux3"));
ListWithInfo.add(3,RowFactory.create("field2", "input3", "output4", "conditionAux4"));
.
.
.
// initialize values for fields and conditions
String field_ant = ListWithInfo.get(0).getString(0).toLowerCase();
String first_input = ListWithInfo.get(0).getString(1);
String first_output = ListWithInfo.get(0).getString(2);
String first_conditionAux = ListWithInfo.get(0).getString(3);
Column whenColumn = when(dfToModify.col(field_ant).equalTo(first_input)
.and(dfToModify.col("conditionAuxField").equalTo(lit(first_conditionAux)))
,first_output);
// lists with the names of the fields and the conditions
List<Column> whenColumnList = new ArrayList<>();
List<String> fieldsNameList = new ArrayList<>();
for (Row row : ListWithInfo.subList(1,ListWithInfo.size())) {
String field = row.getString(0);
String input = row.getString(1);
String output = row.getString(2);
String conditionAux = row.getString(3);
if (field.equals(field_ant)) {
// if field equals field_ant, the new condition is added to the previous one
whenColumn = whenColumn.when(dfToModify.col(field).equalTo(input)
.and(dfToModify.col("conditionAuxField").equalTo(lit(conditionAux)))
,output);
} else {
// if field is different from the previous one:
// close the conditions for this field
whenColumn = whenColumn.otherwise(dfToModify.col(field_ant));
// add to the lists the field(String) and the conditions (columns)
whenColumnList.add(whenColumn);
fieldsNameList.add(field_ant);
// and initialize the conditions for the new field
whenColumn = when(dfToModify.col(field).equalTo(input)
.and(dfToModify.col("conditionAuxField").equalTo(lit(conditionAux)))
,output);
}
field_ant = field;
}
// add last values
whenColumnList.add(whenColumn);
fieldsNameList.add(field_ant);
// transform list to Seq
Seq<Column> whenColumnSeq = JavaConversions.asScalaBuffer(whenColumnList).seq();
Seq<String> fieldsNameSeq = JavaConversions.asScalaBuffer(fieldsNameList).seq();
Dataset<Row> dfModified = dfToModify.withColumns(fieldsNameSeq, whenColumnSeq);
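As a side note, JavaConversions is deprecated in newer Scala versions. Below is a small, untested sketch of the same List-to-Seq conversion using scala.collection.JavaConverters instead, assuming Scala 2.11/2.12 on the classpath:
import java.util.List;
import scala.collection.JavaConverters;
import scala.collection.Seq;

class SeqUtil {
    // Converts a java.util.List to a scala.collection.Seq, usable with withColumns(...)
    static <T> Seq<T> toScalaSeq(List<T> javaList) {
        return JavaConverters.asScalaBufferConverter(javaList).asScala().toSeq();
    }
}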
I have written a Spark Streaming consumer to consume data from Kafka, and I found some weird behavior in my logs. The Kafka topic has 3 partitions, and for each partition an executor is launched by the Spark Streaming job.
The first executor always takes the parameters I provided while creating the streaming context, but the executors with IDs 2 and 3 always override the Kafka parameters.
20/01/14 12:15:05 WARN StreamingContext: Dynamic Allocation is enabled for this application. Enabling Dynamic allocation for Spark Streaming applications can cause data loss if Write Ahead Log is not enabled for non-replayable sources like Flume. See the programming guide for details on how to enable the Write Ahead Log.
20/01/14 12:15:05 INFO FileBasedWriteAheadLog_ReceivedBlockTracker: Recovered 2 write ahead log files from hdfs://tlabnamenode/checkpoint/receivedBlockMetadata
20/01/14 12:15:05 INFO DirectKafkaInputDStream: Slide time = 5000 ms
20/01/14 12:15:05 INFO DirectKafkaInputDStream: Storage level = Serialized 1x Replicated
20/01/14 12:15:05 INFO DirectKafkaInputDStream: Checkpoint interval = null
20/01/14 12:15:05 INFO DirectKafkaInputDStream: Remember interval = 5000 ms
20/01/14 12:15:05 INFO DirectKafkaInputDStream: Initialized and validated org.apache.spark.streaming.kafka010.DirectKafkaInputDStream#12665f3f
20/01/14 12:15:05 INFO ForEachDStream: Slide time = 5000 ms
20/01/14 12:15:05 INFO ForEachDStream: Storage level = Serialized 1x Replicated
20/01/14 12:15:05 INFO ForEachDStream: Checkpoint interval = null
20/01/14 12:15:05 INFO ForEachDStream: Remember interval = 5000 ms
20/01/14 12:15:05 INFO ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream#a4d83ac
20/01/14 12:15:05 INFO ConsumerConfig: ConsumerConfig values:
auto.commit.interval.ms = 5000
auto.offset.reset = latest
bootstrap.servers = [1,2,3]
check.crcs = true
client.id = client-0
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = telemetry-streaming-service
heartbeat.interval.ms = 3000
interceptor.classes = []
internal.leave.group.on.close = true
isolation.level = read_uncommitted
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
Here is the log for the other executors.
20/01/14 12:15:04 INFO Executor: Starting executor ID 2 on host 1
20/01/14 12:15:04 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40324.
20/01/14 12:15:04 INFO NettyBlockTransferService: Server created on 1
20/01/14 12:15:04 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/01/14 12:15:04 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(2, matrix-hwork-data-05, 40324, None)
20/01/14 12:15:04 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(2, matrix-hwork-data-05, 40324, None)
20/01/14 12:15:04 INFO BlockManager: external shuffle service port = 7447
20/01/14 12:15:04 INFO BlockManager: Registering executor with local external shuffle service.
20/01/14 12:15:04 INFO TransportClientFactory: Successfully created connection to matrix-hwork-data-05/10.83.34.25:7447 after 1 ms (0 ms spent in bootstraps)
20/01/14 12:15:04 INFO BlockManager: Initialized BlockManager: BlockManagerId(2, matrix-hwork-data-05, 40324, None)
20/01/14 12:15:19 INFO CoarseGrainedExecutorBackend: Got assigned task 1
20/01/14 12:15:19 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
20/01/14 12:15:19 INFO TorrentBroadcast: Started reading broadcast variable 0
20/01/14 12:15:19 INFO TransportClientFactory: Successfully created connection to matrix-hwork-data-05/10.83.34.25:38759 after 2 ms (0 ms spent in bootstraps)
20/01/14 12:15:20 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 8.1 KB, free 6.2 GB)
20/01/14 12:15:20 INFO TorrentBroadcast: Reading broadcast variable 0 took 163 ms
20/01/14 12:15:20 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 17.9 KB, free 6.2 GB)
20/01/14 12:15:20 INFO KafkaRDD: Computing topic telemetry, partition 1 offsets 237352170 -> 237352311
20/01/14 12:15:20 INFO CachedKafkaConsumer: Initializing cache 16 64 0.75
20/01/14 12:15:20 INFO CachedKafkaConsumer: Cache miss for CacheKey(spark-executor-telemetry-streaming-service,telemetry,1)
20/01/14 12:15:20 INFO ConsumerConfig: ConsumerConfig values:
auto.commit.interval.ms = 5000
auto.offset.reset = none
bootstrap.servers = [1,2,3]
check.crcs = true
client.id = client-0
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
If we observe closely, on the first executor auto.offset.reset is latest, but for the other executors auto.offset.reset = none.
Here is how I am creating the streaming context:
public void init() throws Exception {
final String BOOTSTRAP_SERVERS = PropertyFileReader.getInstance()
.getProperty("spark.streaming.kafka.broker.list");
final String DYNAMIC_ALLOCATION_ENABLED = PropertyFileReader.getInstance()
.getProperty("spark.streaming.dynamicAllocation.enabled");
final String DYNAMIC_ALLOCATION_SCALING_INTERVAL = PropertyFileReader.getInstance()
.getProperty("spark.streaming.dynamicAllocation.scalingInterval");
final String DYNAMIC_ALLOCATION_MIN_EXECUTORS = PropertyFileReader.getInstance()
.getProperty("spark.streaming.dynamicAllocation.minExecutors");
final String DYNAMIC_ALLOCATION_MAX_EXECUTORS = PropertyFileReader.getInstance()
.getProperty("spark.streaming.dynamicAllocation.maxExecutors");
final String DYNAMIC_ALLOCATION_EXECUTOR_IDLE_TIMEOUT = PropertyFileReader.getInstance()
.getProperty("spark.streaming.dynamicAllocation.executorIdleTimeout");
final String DYNAMIC_ALLOCATION_CACHED_EXECUTOR_IDLE_TIMEOUT = PropertyFileReader.getInstance()
.getProperty("spark.streaming.dynamicAllocation.cachedExecutorIdleTimeout");
final String SPARK_SHUFFLE_SERVICE_ENABLED = PropertyFileReader.getInstance()
.getProperty("spark.shuffle.service.enabled");
final String SPARK_LOCALITY_WAIT = PropertyFileReader.getInstance().getProperty("spark.locality.wait");
final String SPARK_KAFKA_CONSUMER_POLL_INTERVAL = PropertyFileReader.getInstance()
.getProperty("spark.streaming.kafka.consumer.poll.ms");
final String SPARK_KAFKA_MAX_RATE_PER_PARTITION = PropertyFileReader.getInstance()
.getProperty("spark.streaming.kafka.maxRatePerPartition");
final String SPARK_BATCH_DURATION_IN_SECONDS = PropertyFileReader.getInstance()
.getProperty("spark.batch.duration.in.seconds");
final String KAFKA_TOPIC = PropertyFileReader.getInstance().getProperty("spark.streaming.kafka.topic");
LOGGER.debug("connecting to brokers ::" + BOOTSTRAP_SERVERS);
LOGGER.debug("bootstrapping properties to create consumer");
kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", BOOTSTRAP_SERVERS);
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "telemetry-streaming-service");
kafkaParams.put("auto.offset.reset", "latest");
kafkaParams.put("enable.auto.commit", false);
kafkaParams.put("client.id","client-0");
// Below property should be enabled in properties and changed based on
// performance testing
kafkaParams.put("max.poll.records",
PropertyFileReader.getInstance().getProperty("spark.streaming.kafka.max.poll.records"));
LOGGER.info("registering as a consumer with the topic :: " + KAFKA_TOPIC);
topics = Arrays.asList(KAFKA_TOPIC);
sparkConf = new SparkConf()
// .setMaster(PropertyFileReader.getInstance().getProperty("spark.master.url"))
.setAppName(PropertyFileReader.getInstance().getProperty("spark.application.name"))
.set("spark.streaming.dynamicAllocation.enabled", DYNAMIC_ALLOCATION_ENABLED)
.set("spark.streaming.dynamicAllocation.scalingInterval", DYNAMIC_ALLOCATION_SCALING_INTERVAL)
.set("spark.streaming.dynamicAllocation.minExecutors", DYNAMIC_ALLOCATION_MIN_EXECUTORS)
.set("spark.streaming.dynamicAllocation.maxExecutors", DYNAMIC_ALLOCATION_MAX_EXECUTORS)
.set("spark.streaming.dynamicAllocation.executorIdleTimeout", DYNAMIC_ALLOCATION_EXECUTOR_IDLE_TIMEOUT)
.set("spark.streaming.dynamicAllocation.cachedExecutorIdleTimeout",
DYNAMIC_ALLOCATION_CACHED_EXECUTOR_IDLE_TIMEOUT)
.set("spark.shuffle.service.enabled", SPARK_SHUFFLE_SERVICE_ENABLED)
.set("spark.locality.wait", SPARK_LOCALITY_WAIT)
.set("spark.streaming.kafka.consumer.poll.ms", SPARK_KAFKA_CONSUMER_POLL_INTERVAL)
.set("spark.streaming.kafka.maxRatePerPartition", SPARK_KAFKA_MAX_RATE_PER_PARTITION);
LOGGER.debug("creating streaming context with minutes batch interval ::: " + SPARK_BATCH_DURATION_IN_SECONDS);
streamingContext = new JavaStreamingContext(sparkConf,
Durations.seconds(Integer.parseInt(SPARK_BATCH_DURATION_IN_SECONDS)));
/*
* todo: add checkpointing to the streaming context to recover from driver
* failures and also for offset management
*/
LOGGER.info("checkpointing the streaming transactions at hdfs path :: /checkpoint");
streamingContext.checkpoint("/checkpoint");
streamingContext.addStreamingListener(new DataProcessingListener());
}
@Override
public void execute() throws InterruptedException {
LOGGER.info("started telemetry pipeline executor to consume data");
// Data Consume from the Kafka topic
JavaInputDStream<ConsumerRecord<String, String>> telemetryStream = KafkaUtils.createDirectStream(
streamingContext, LocationStrategies.PreferConsistent(),
ConsumerStrategies.Subscribe(topics, kafkaParams));
telemetryStream.foreachRDD(rawRDD -> {
if (!rawRDD.isEmpty()) {
OffsetRange[] offsetRanges = ((HasOffsetRanges) rawRDD.rdd()).offsetRanges();
LOGGER.debug("list of OffsetRanges getting processed as a string :: "
+ Arrays.asList(offsetRanges).toString());
System.out.println("offsetRanges : " + offsetRanges.length);
SparkSession spark = JavaSparkSessionSingleton.getInstance(rawRDD.context().getConf());
JavaPairRDD<String, String> flattenedRawRDD = rawRDD.mapToPair(record -> {
//LOGGER.debug("flattening JSON record with telemetry json value ::: " + record.value());
ObjectMapper om = new ObjectMapper();
JsonNode root = om.readTree(record.value());
Map<String, JsonNode> flattenedMap = new FlatJsonGenerator(root).flatten();
JsonNode flattenedRootNode = om.convertValue(flattenedMap, JsonNode.class);
//LOGGER.debug("creating Tuple for the JSON record Key :: " + flattenedRootNode.get("/name").asText()
// + ", value :: " + flattenedRootNode.toString());
return new Tuple2<String, String>(flattenedRootNode.get("/name").asText(),
flattenedRootNode.toString());
});
Dataset<Row> rawFlattenedDataRDD = spark
.createDataset(flattenedRawRDD.rdd(), Encoders.tuple(Encoders.STRING(), Encoders.STRING()))
.toDF("sensor_path", "sensor_data");
Dataset<Row> groupedDS = rawFlattenedDataRDD.groupBy(col("sensor_path"))
.agg(collect_list(col("sensor_data").as("sensor_data")));
Dataset<Row> lldpGroupedDS = groupedDS.filter((FilterFunction<Row>) r -> r.getString(0).equals("Cisco-IOS-XR-ethernet-lldp-oper:lldp/nodes/node/neighbors/devices/device"));
LOGGER.info("printing the LLDP GROUPED DS ------------------>");
lldpGroupedDS.show(2);
LOGGER.info("creating telemetry pipeline to process the telemetry data");
HashMap<Object, Object> params = new HashMap<>();
params.put(DPConstants.OTSDB_CONFIG_F_PATH, ExternalizedConfigsReader.getPropertyValueFromCache("/opentsdb.config.file.path"));
params.put(DPConstants.OTSDB_CLIENT_TYPE, ExternalizedConfigsReader.getPropertyValueFromCache("/opentsdb.client.type"));
try {
LOGGER.info("<-------------------processing lldp data and write to hive STARTED ----------------->");
Pipeline lldpPipeline = PipelineFactory.getPipeline(PipelineType.LLDPTELEMETRY);
lldpPipeline.process(lldpGroupedDS, null);
LOGGER.info("<-------------------processing lldp data and write to hive COMPLETED ----------------->");
LOGGER.info("<-------------------processing groupedDS data and write to OPENTSDB STARTED ----------------->");
Pipeline pipeline = PipelineFactory.getPipeline(PipelineType.TELEMETRY);
pipeline.process(groupedDS, params);
LOGGER.info("<-------------------processing groupedDS data and write to OPENTSDB COMPLETED ----------------->");
}catch (Throwable t){
t.printStackTrace();
}
LOGGER.info("commiting offsets after processing the batch");
((CanCommitOffsets) telemetryStream.inputDStream()).commitAsync(offsetRanges);
}
});
streamingContext.start();
streamingContext.awaitTermination();
}
Am I missing something here? Any help is appreciated. Thanks.
I want to loop within a test to implement a behavior that will respond to some requests with messages built from message template files, in which I replace some strings with the value of a Citrus test variable and the index of the loop. I think I almost got it working, but unfortunately my test crashes when I try to use the replaceAll String function within my behavior. See below the code snippet I wrote, in which I removed all the unnecessary parts to hopefully make my problem simple to understand.
public class myBehavior extends AbstractTestBehavior {
private String payloadData;
myBehavior withPayloadData(String payload) {
this.payloadData = payload;
return this;
}
@Override
public void apply() {
echo("[behavior] - OK ->behavior is invoked");
echo("[behavior]" + payloadData + " - OK ->variable from Test is correctly transmitted to behavior");
echo(func_asis(payloadData));
echo(func_replace(payloadData)); // if you uncomment this line the test will crash at starting time when invoking replace_all
}
String func_asis(String myvar)
{
String s = "This is a string in which nothing is replaced, OK fine !";
echo("[func_asis] OK ->in func_asis now ");
echo("[func_asis] myvar="+ myvar + " - OK ->variable from Test is correctly transmitted to func_asis");
return s;
}
String func_replace(String myvar)
{
String s = "This is a string in which to replace !!Name!! by the value of my citrus variable but it crashes";
echo("[func_replace] OK ->in func_replace");
echo("[func_replace] myvar="+ myvar + " - OK ->variable from Test is correctly transmitted to func_asis");
//s=s.replaceAll("!!Name!!",myvar); // This will crash when starting the test (not actually when running it) !!!
return s;
}
}
@CitrusTest
public void mySimpleTest() throws IOException {
description("Simple Test invoking a behavior which it self will invoke a java function");
variable("vm", "/dc/vm/folder/vm_basename");
repeat().until("i = 3")
.actions(
sleep(1000L),
applyBehavior(new myBehavior().withPayloadData("${vm}${i}"))
);
}
Here is the output of the test with the replaceAll invocation commented out in func_replace.
09:21:22,102 DEBUG port.LoggingReporter| BEFORE TEST SUITE
09:21:22,102 INFO port.LoggingReporter|
09:21:22,103 INFO port.LoggingReporter|
09:21:22,103 INFO port.LoggingReporter| BEFORE TEST SUITE: SUCCESS
09:21:22,103 INFO port.LoggingReporter| ------------------------------------------------------------------------
09:21:22,103 INFO port.LoggingReporter|
09:21:22,119 DEBUG t.TestContextFactory| Created new test context - using global variables: '{}'
09:21:22,128 INFO port.LoggingReporter|
09:21:22,128 INFO port.LoggingReporter| ------------------------------------------------------------------------
09:21:22,128 DEBUG port.LoggingReporter| STARTING TEST CitrusLearning.mySimpleTest <com.grge.citrus.cmptest.stratus>
09:21:22,128 INFO port.LoggingReporter|
09:21:22,129 DEBUG citrus.TestCase| Initializing test case
09:21:22,130 DEBUG context.TestContext| Setting variable: citrus.test.name with value: 'CitrusLearning.mySimpleTest'
09:21:22,130 DEBUG context.TestContext| Setting variable: citrus.test.package with value: 'com.grge.citrus.cmptest.stratus'
09:21:22,130 DEBUG context.TestContext| Setting variable: vm with value: '/dc/vm/folder/vm_basename'
09:21:22,130 DEBUG citrus.TestCase| Test variables:
09:21:22,131 DEBUG citrus.TestCase| citrus.test.name = CitrusLearning.mySimpleTest
09:21:22,131 DEBUG citrus.TestCase| citrus.test.package = com.grge.citrus.cmptest.stratus
09:21:22,131 DEBUG citrus.TestCase| vm = /dc/vm/folder/vm_basename
09:21:22,131 INFO port.LoggingReporter|
09:21:22,131 DEBUG port.LoggingReporter| TEST STEP 1/1: repeat
09:21:22,131 DEBUG port.LoggingReporter| TEST ACTION CONTAINER with 9 embedded actions
09:21:22,131 DEBUG context.TestContext| Setting variable: i with value: '1'
09:21:22,134 INFO actions.SleepAction| Sleeping 1000 ms
09:21:23,139 INFO actions.SleepAction| Returning after 1000 ms
09:21:23,139 INFO actions.EchoAction| [behavior] - OK ->behavior is invoked
09:21:23,139 INFO actions.EchoAction| [behavior]/dc/vm/folder/vm_basename1 - OK ->variable from Test is correctly transmitted to behavior
09:21:23,140 INFO actions.EchoAction| [func_asis] OK ->in func_asis now
09:21:23,140 INFO actions.EchoAction| [func_asis] myvar=/dc/vm/folder/vm_basename1 - OK ->variable from Test is correctly transmitted to func_asis
09:21:23,140 INFO actions.EchoAction| This is a string in which nothing is replaced, OK fine !
09:21:23,140 INFO actions.EchoAction| [func_replace] OK ->in func_replace
09:21:23,141 INFO actions.EchoAction| [func_replace] myvar=/dc/vm/folder/vm_basename1 - OK ->variable from Test is correctly transmitted to func_asis
09:21:23,141 INFO actions.EchoAction| This is a string in which to replace !!Name!! by the value of my citrus variable but it crashes
09:21:23,143 DEBUG leanExpressionParser| Boolean expression 2 = 3 evaluates to false
09:21:23,143 DEBUG context.TestContext| Setting variable: i with value: '2'
09:21:23,143 INFO actions.SleepAction| Sleeping 1000 ms
09:21:24,145 INFO actions.SleepAction| Returning after 1000 ms
09:21:24,145 INFO actions.EchoAction| [behavior] - OK ->behavior is invoked
09:21:24,145 INFO actions.EchoAction| [behavior]/dc/vm/folder/vm_basename2 - OK ->variable from Test is correctly transmitted to behavior
09:21:24,145 INFO actions.EchoAction| [func_asis] OK ->in func_asis now
09:21:24,145 INFO actions.EchoAction| [func_asis] myvar=/dc/vm/folder/vm_basename2 - OK ->variable from Test is correctly transmitted to func_asis
09:21:24,145 INFO actions.EchoAction| This is a string in which nothing is replaced, OK fine !
09:21:24,146 INFO actions.EchoAction| [func_replace] OK ->in func_replace
09:21:24,146 INFO actions.EchoAction| [func_replace] myvar=/dc/vm/folder/vm_basename2 - OK ->variable from Test is correctly transmitted to func_asis
09:21:24,146 INFO actions.EchoAction| This is a string in which to replace !!Name!! by the value of my citrus variable but it crashes
09:21:24,146 DEBUG leanExpressionParser| Boolean expression 3 = 3 evaluates to true
09:21:24,146 INFO port.LoggingReporter|
09:21:24,147 DEBUG port.LoggingReporter| TEST STEP 1/1 SUCCESS
Here is the crash log when uncommenting that line.
09:27:51,525 DEBUG port.LoggingReporter| BEFORE TEST SUITE
09:27:51,525 INFO port.LoggingReporter|
09:27:51,525 INFO port.LoggingReporter|
09:27:51,525 INFO port.LoggingReporter| BEFORE TEST SUITE: SUCCESS
09:27:51,525 INFO port.LoggingReporter| ------------------------------------------------------------------------
09:27:51,525 INFO port.LoggingReporter|
09:27:51,542 DEBUG t.TestContextFactory| Created new test context - using global variables: '{}'
09:27:51,551 INFO port.LoggingReporter|
09:27:51,552 ERROR port.LoggingReporter| TEST FAILED CitrusLearning.mySimpleTest <com.grge.citrus.cmptest.stratus> Nested exception is:
java.lang.IllegalArgumentException: No group with name {vm}
at java.util.regex.Matcher.appendReplacement(Matcher.java:849)
at java.util.regex.Matcher.replaceAll(Matcher.java:955)
at java.lang.String.replaceAll(String.java:2223)
at com.grge.citrus.cmptest.stratus.CitrusLearning$myBehavior.func_replace(CitrusLearning.java:180)
at com.grge.citrus.cmptest.stratus.CitrusLearning$myBehavior.apply(CitrusLearning.java:163)
at com.consol.citrus.dsl.design.AbstractTestBehavior.apply(AbstractTestBehavior.java:53)
at com.consol.citrus.dsl.design.ApplyTestBehaviorAction.doExecute(ApplyTestBehaviorAction.java:38)
at com.consol.citrus.actions.AbstractTestAction.execute(AbstractTestAction.java:42)
at com.consol.citrus.dsl.design.DefaultTestDesigner.applyBehavior(DefaultTestDesigner.java:193)
at com.consol.citrus.dsl.testng.TestNGCitrusTestDesigner.applyBehavior(TestNGCitrusTestDesigner.java:168)
at com.grge.citrus.cmptest.stratus.CitrusLearning.mySimpleTest(CitrusLearning.java:194)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:216)
at com.consol.citrus.testng.AbstractTestNGCitrusTest.invokeTestMethod(AbstractTestNGCitrusTest.java:121)
at com.consol.citrus.dsl.testng.TestNGCitrusTest.invokeTestMethod(TestNGCitrusTest.java:121)
at com.consol.citrus.dsl.testng.TestNGCitrusTestDesigner.invokeTestMethod(TestNGCitrusTestDesigner.java:73)
at com.consol.citrus.dsl.testng.TestNGCitrusTest.run(TestNGCitrusTest.java:110)
at com.consol.citrus.dsl.testng.TestNGCitrusTest.run(TestNGCitrusTest.java:56)
at org.testng.internal.MethodInvocationHelper.invokeHookable(MethodInvocationHelper.java:242)
at org.testng.internal.Invoker.invokeMethod(Invoker.java:579)
at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:719)
at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:989)
at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:125)
at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:109)
at org.testng.TestRunner.privateRun(TestRunner.java:648)
at org.testng.TestRunner.run(TestRunner.java:505)
at org.testng.SuiteRunner.runTest(SuiteRunner.java:455)
at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:450)
at org.testng.SuiteRunner.privateRun(SuiteRunner.java:415)
at org.testng.SuiteRunner.run(SuiteRunner.java:364)
at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:84)
at org.testng.TestNG.runSuitesSequentially(TestNG.java:1208)
at org.testng.TestNG.runSuitesLocally(TestNG.java:1137)
at org.testng.TestNG.runSuites(TestNG.java:1049)
at org.testng.TestNG.run(TestNG.java:1017)
at org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:135)
at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeSingleClass(TestNGDirectoryTestSuite.java:112)
at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.execute(TestNGDirectoryTestSuite.java:99)
at org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:146)
at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:373)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:334)
at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:119)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:407)
09:27:51,555 INFO port.LoggingReporter| ------------------------------------------------------------------------
09:27:51,555 INFO port.LoggingReporter|
09:27:51,595 INFO port.LoggingReporter|
09:27:51,595 INFO port.LoggingReporter| ------------------------------------------------------------------------
09:27:51,595 DEBUG port.LoggingReporter| AFTER TEST SUITE
Thanks a lot for any idea to fix this.
Having put some more time into it, I am starting to understand how Citrus treats variables. In the case above, the value of 'myvar' is actually set to "${vm}${i}", and it only gets replaced at execution time, when used from a test action. So I looked at custom test actions and found that even there variable instantiation is a bit tricky, but I could nevertheless achieve the first part of what I wanted, which is to replace some predefined values in a string with the content of test variables. See below the snippet that does this.
But now, because I invoke the replace function from an action, I cannot return a string to the invoker, which was what I was planning to do in my behavior. So I will investigate if and how that is possible, or look at some other design for my test (like a temporary file to store the string with replaced values, to be read by the behavior after func_replace has updated it); see also the sketch after the solution code below.
Anyway here is a solution to the problem I had above:
@Test
public class CitrusLearningL4 extends TestNGCitrusTestDesigner {
public class myBehavior extends AbstractTestBehavior {
private @CitrusResource TestContext parentContext;
myBehavior withContext(@Optional @CitrusResource TestContext context) {
this.parentContext=context;
return this;
}
@Override
public void apply() {
echo("[behavior] - OK ->behavior is invoked");
func_replace();
echo("[behavior] - OK ->behavior is finished");
}
void func_replace()
{
final String s = "This is a string in which '!!Name!!' is present because it was replaced by the value of my test variable";
echo("[func_replace] OK ->in func_replace");
action(new AbstractTestAction() {
public void doExecute(TestContext context) {
String s1 = s;
System.out.println("[anAction] - OK ->anAction is invoked");
String sVar=String.format("%s", (String) parentContext.getVariable("vm") + (String) parentContext.getVariable("i"));
System.out.println("[anAction] - OK ->" + s1.replaceAll("!!Name!!",sVar));
}
});
}
}
@Test @Parameters("context")
@CitrusTest
public void mySimpleBehaviorTest(@Optional @CitrusResource TestContext context) throws IOException {
description("Simple Test invoking a behavior which it self will invoke a java function");
variable("vm", "/dc/vm/folder/vm_basename");
repeat().until("i = 3")
.actions(
sleep(1000L),
applyBehavior(new myBehavior().withContext(context))
);
}
}
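Regarding returning the replaced string to the behavior's caller (the open point above): one possible direction, sketched here untested and using the same TestContext API already used in the solution, is to publish the result as a test variable instead of returning it. The variable name replacedPayload is hypothetical, not something from the original test.
// Inside the behavior: replace and publish the result as a test variable,
// so that later actions can reference it as ${replacedPayload} (hypothetical name).
action(new AbstractTestAction() {
    @Override
    public void doExecute(TestContext context) {
        String template = "This is a string in which to replace !!Name!!";
        String value = context.getVariable("vm") + context.getVariable("i");
        context.setVariable("replacedPayload", template.replaceAll("!!Name!!", value));
    }
});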
The problem you face is that
s=s.replaceAll("!!Name!!",myvar);
is not part of the Java DSL test. It is evaluated before the test runs; this behaviour is described in the Citrus documentation. The calls to echo(...), however, are part of the DSL and thus get executed when the test is actually run.
What you can do instead is let Citrus take care of the replacement for you. You want to replace "!!Name!!" with the values of vm and i concatenated. To achieve this, replace "!!Name!!" with "${vm}${i}" in your template and let Citrus do the heavy lifting for you:
String func_replace(String myvar)
{
String s = "This is a string in which to replace ${vm}${i} by the value of my citrus variable but it crashes";
echo("[func_replace] OK ->in func_replace");
echo("[func_replace] myvar="+ myvar + " - OK ->variable from Test is correctly transmitted to func_asis");
return s;
}
As the returned String is used within apply() in class myBehavior through echo(...), Citrus does the substitution for you.
I am trying to understand K-means clustering on an input .csv file which consists of 56376 rows and two columns, with the first column representing an id and the second column a group of words. An example of this data is:
1428951621 do rememb came milan 19 april 2013 maynardmonday 16
1429163429 rt windeerlust sehun hyungluhan yessehun do even rememb day today
The Scala code for processing this data looks like this:
val inputData = sc.textFile("test.csv")
// this is a changable parameter for the number of clusters to use for kmeans
val numClusters = 4;
// number of iterations for the kmeans
val numIterations = 10;
// this is the size of the vectors to be created by Word2Vec this is tunable
val vectorSize = 600;
val filtereddata = inputData.filter(!_.isEmpty).
map(line=>line.split(",",-1)).
map(line=>(line(1),line(1).split(" ").filter(_.nonEmpty)))
val corpus = inputData.filter(!_.isEmpty).
map(line=>line.split(",",-1)).
map(line=>line(1).split(" ").toSeq)
val values:RDD[Seq[String]] = filtereddata.map(s=>s._2)
val keys = filtereddata.map(s=>s._1)
/*******************Word2Vec and normalisation*****************************/
val w2vec = new Word2Vec().setVectorSize(vectorSize);
val model = w2vec.fit(corpus)
val outtest:RDD[Seq[Vector]]= values.map(x=>x.map(m=>try {
model.transform(m)
} catch {
case e: Exception => Vectors.zeros(vectorSize)
}))
val convertest = outtest.map(m=>m.map(x=>(x.toArray)))
val withkey = keys.zip(convertest)
val filterkey = withkey.filter(!_._2.isEmpty)
val keysfinal= filterkey.map(x=>x._1)
val valfinal= filterkey.map(x=>x._2)
// for each collections of vectors that is one tweet, add the vectors
val reducetest = valfinal.map(x=>x.reduce((a,b)=>a.zip(b).map(t=>t._1+t._2)))
val filtertest = reducetest.map(x=>x.map(m=>(m,x.length)).map(m=>m._1/m._2))
val test = filtertest.map(x=>new DenseVector(x).asInstanceOf[Vector])
val normalizer = new Normalizer()
val data1= test.map(x=>(normalizer.transform(x)))
/*********************Clustering Algorithm***********************************/
val clusters = KMeans.train(data1,numClusters,numIterations)
val predictions= clusters.predict(data1)
val clustercount= keysfinal.zip(predictions).distinct.map(s=>(s._2,1)).reduceByKey(_+_)
val result= keysfinal.zip(predictions).distinct
result.saveAsTextFile(fileToSaveResults)
val wsse = clusters.computeCost(data1)
println(s"The number of clusters is $numClusters")
println("The cluster counts are:")
println(clustercount.collect().mkString(" "))
println(s"The wsse is: $wsse")
However, after some iterations it throws a "java.lang.NullPointerException" and exits at stage 36. The error looks like this:
17/10/07 14:42:10 INFO TaskSchedulerImpl: Adding task set 26.0 with 2 tasks
17/10/07 14:42:10 INFO TaskSetManager: Starting task 0.0 in stage 26.0 (TID 50, localhost, partition 0, ANY, 5149 bytes)
17/10/07 14:42:10 INFO TaskSetManager: Starting task 1.0 in stage 26.0 (TID 51, localhost, partition 1, ANY, 5149 bytes)
17/10/07 14:42:10 INFO Executor: Running task 1.0 in stage 26.0 (TID 51)
17/10/07 14:42:10 INFO Executor: Running task 0.0 in stage 26.0 (TID 50)
17/10/07 14:42:10 INFO deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
17/10/07 14:42:10 INFO deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
17/10/07 14:42:10 INFO deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
17/10/07 14:42:10 INFO deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
17/10/07 14:42:10 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
17/10/07 14:42:10 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
17/10/07 14:42:10 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
17/10/07 14:42:10 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
17/10/07 14:42:10 ERROR Executor: Exception in task 0.0 in stage 26.0 (TID 50)
java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
Kindly help me localize the issue in this code, as I am not able to understand it.
Note: this code was written by other people.
I think this has nothing to do with your code. This exception is thrown if one of the arguments passed to the ProcessBuilder is null. So I guess this must be a configuration issue or a bug in Hadoop.
From some quick googling for "hadoop java.lang.ProcessBuilder.start nullpointerexception", it seems this is a known problem:
https://www.fachschaft.informatik.tu-darmstadt.de/forum/viewtopic.php?t=34250
See also: Is it possible to run Hadoop jobs (like the WordCount sample) in the local mode on Windows without Cygwin?
I'm writing an NIO server that processes some HTTP requests, and I want to use SocketChannel's method write(ByteBuffer[] srcs). The code looks like this:
@Override
public void send(ByteBuffer[] arr) throws IOException {
long writeBytes=channel.write(arr);
log.debug("writeBytes "+writeBytes);
}
But if arr is too big, such as 93k, it only writes:
DEBUG : 2017-08-25 15:03:41 > writeBytes 16384
and in the browser, of course, the response is not complete, only a part of it.
If I split it, such as:
@Override
public void send(byte[] bytes, int index, int length) throws IOException {
ByteBuffer buffer=ByteBuffer.allocate(1024);
try {
buffer.put(bytes,index,length);
}catch (BufferOverflowException e){
log.error(e.getMessage());
}
buffer.flip();
channel.write(buffer);
}
and use Thread.sleep(2) after every call, sending 93 times in a loop, it works, but I don't think that is a good way.
16384 is 16k, so I really think some buffer is 16k, but I couldn't find which buffer it is.
I saw that channel.socket().getSendBufferSize() is 8192.
I tried channel.socket().setSendBufferSize(4*1024*1024);
but it didn't work.
How can I transfer big data (more than 16k) to the browser in one go, without sleeping or waiting?
Thanks @Keyaman, you are right, I should read some tutorials.
I fixed it, and it works well:
while(arr[arr.length-1].hasRemaining()){
long writeBytes=channel.write(arr);
log.debug("writeBytes "+writeBytes);
}
but I still don't think it is a good way.
The log looks just like this:
DEBUG : 2017-08-25 16:26:56 > writeBytes 16384
DEBUG : 2017-08-25 16:26:56 > writeBytes 0
DEBUG : 2017-08-25 16:26:56 > writeBytes 0
DEBUG : 2017-08-25 16:26:56 > writeBytes 0
...
DEBUG : 2017-08-25 16:26:56 > writeBytes 16384
DEBUG : 2017-08-25 16:26:56 > writeBytes 0
DEBUG : 2017-08-25 16:26:56 > writeBytes 0
DEBUG : 2017-08-25 16:26:56 > writeBytes 0
...
DEBUG : 2017-08-25 16:26:56 > writeBytes 0
DEBUG : 2017-08-25 16:26:56 > writeBytes 12922
There should be a buffer that is 16k; when it's full, it flushes.
What is it, and can I set it?
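For what it is worth, the usual way to avoid both the sleep and the busy loop (the repeated writeBytes 0 above) is to run the channel in non-blocking mode and register interest in OP_WRITE with a Selector, resuming the write only when the channel becomes writable again. A rough, untested sketch, assuming a selector-driven server loop that keeps the pending ByteBuffer[] as the key's attachment:
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.SocketChannel;

class PendingWriter {
    // Call when data is queued, and again from the selector loop whenever
    // the key becomes writable; 'arr' is the pending buffer array kept as the attachment.
    static void writePending(SelectionKey key) throws IOException {
        SocketChannel channel = (SocketChannel) key.channel();
        ByteBuffer[] arr = (ByteBuffer[]) key.attachment();
        channel.write(arr);
        if (arr[arr.length - 1].hasRemaining()) {
            // socket send buffer is full: let the selector wake us up when writable
            key.interestOps(key.interestOps() | SelectionKey.OP_WRITE);
        } else {
            // everything flushed: stop watching OP_WRITE to avoid needless wake-ups
            key.interestOps(key.interestOps() & ~SelectionKey.OP_WRITE);
        }
    }
}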