Must the Spark Streaming developer install Hadoop on his computer? - java

I am trying to learn Spark Streaming. My demo works fine when I set the master to "local[2]", but when I set the master to a local cluster started in standalone mode, this error occurred:
lost an executor 2 (already removed): Unable to create executor due to java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
It should be noted that I submit the code from IntelliJ IDEA.
@Component
public final class JavaNetworkWordCount {
    private static final String SPACE = " ";

    @Bean("test")
    public void test() throws Exception {
        // Create a StreamingContext with two working threads and a batch interval of 1 second
        SparkConf conf = new SparkConf()
                .setJars(new String[]{"E:\\project\\spark-demo\\target\\spark-demo-0.0.1-SNAPSHOT.jar"})
                .setMaster("spark://10.4.41.93:7077")
                .set("spark.driver.host", "127.0.0.1")
                .setAppName("JavaWordCount");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Create a DStream that will connect to hostname:port, like localhost:9999
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("192.168.2.51", 9999);

        // Split each line into words
        JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(x.split(SPACE)).iterator());

        // Count each word in each batch
        JavaPairDStream<String, Integer> pairs = words.mapToPair(s -> new Tuple2<>(s, 1));
        JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey((i1, i2) -> i1 + i2);

        // Print the first ten elements of each RDD generated in this DStream to the console
        wordCounts.print();

        jssc.start();            // Start the computation
        jssc.awaitTermination(); // Wait for the computation to terminate
    }
}

It turns out that was the problem: I downloaded Hadoop and set HADOOP_HOME, and after restarting the cluster the error disappeared.
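For anyone hitting the same thing, here is a minimal sketch of the driver-side half of the workaround. The class name and path are hypothetical; the worker machines also need HADOOP_HOME set in their environment (the error above comes from executor creation), which is why restarting the cluster was needed.

public final class HadoopHomeWorkaround {
    // Call this before building the SparkConf. "C:\\hadoop" is a hypothetical path:
    // point it at wherever the Hadoop binaries (winutils.exe on Windows) were unpacked.
    // Hadoop falls back to the HADOOP_HOME environment variable if this property is unset.
    public static void setHadoopHome() {
        System.setProperty("hadoop.home.dir", "C:\\hadoop");
    }
}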

Related

How to fix Failed to open native connection to Cassandra at {server ip}:9042

I am trying to connect Spark and Cassandra using the spark-cassandra-connector. The connection gets established, but when I try to perform operations on the JavaRDD I get:
java.io.IOException: Failed to open native connection to Cassandra at {10.0.21.92}:9042
Here is the configuration and code I am trying to run:
SparkConf sparkConf = new SparkConf()
        .setAppName("Data Transformation")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .setMaster("local[4]");
sparkConf.set("spark.cassandra.connection.host", server ip);
sparkConf.set("spark.cassandra.connection.port", "9042");
sparkConf.set("spark.cassandra.connection.timeout_ms", "5000");
sparkConf.set("spark.cassandra.read.timeout_ms", "200000");
sparkConf.set("spark.cassandra.auth.username", user_name);
sparkConf.set("spark.cassandra.auth.password", password);
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
And below is the code where I perform operations on the JavaRDD:
CassandraJavaRDD<CassandraRow> cassandraRDD =
        CassandraJavaUtil.javaFunctions(sparkContext).cassandraTable(keySpaceName, tableName);

JavaRDD<GenericTriggerEntity> rdd = cassandraRDD.map(new Function<CassandraRow, GenericTriggerEntity>() {
    private static final long serialVersionUID = -165799649937652815L;

    @Override
    public GenericTriggerEntity call(CassandraRow row) throws Exception {
        GenericTriggerEntity genericTriggerEntity = new GenericTriggerEntity();
        if (row.getString("end") != null)
            genericTriggerEntity.setEnd(row.getString("end"));
        if (row.getString("key") != null)
            genericTriggerEntity.setKey(row.getString("key"));
        genericTriggerEntity.setKeyspacename(row.getString("keyspacename"));
        genericTriggerEntity.setPartitiondeleted(row.getString("partitiondeleted"));
        genericTriggerEntity.setRowdeleted(row.getString("rowDeleted"));
        genericTriggerEntity.setRows(row.getString("rows"));
        genericTriggerEntity.setStart(row.getString("start"));
        genericTriggerEntity.setTablename("tablename");
        genericTriggerEntity.setTriggerdate(row.getString("triggerdate"));
        genericTriggerEntity.setTriggertime(row.getString("triggertime"));
        genericTriggerEntity.setUuid(row.getUUID("uuid"));
        return genericTriggerEntity;
    }
});
Here is the JavaRDD operation I am performing:
JavaRDD<String> jsonDataRDDwords = rdd.flatMap(s -> Arrays.asList(SPACE.split((CharSequence) s)));
JavaPairRDD<String, Integer> jsonDataRDDones = jsonDataRDDwords.mapToPair(s -> new Tuple2<>(s, 1));
JavaPairRDD<String, Integer> jsonDataRDDcounts = jsonDataRDDones.reduceByKey((i1, i2) -> i1 + i2);
List<Tuple2<String, Integer>> jsonDatRDDoutput = jsonDataRDDcounts.collect();
I even tried telnet to the Cassandra server; the port is open.
I am able to establish the connection, but while performing the reduceByKey I get the above exception.
I am not able to figure out what the issue is. Is something wrong in the JavaRDD operation?
Any help would be appreciated.
Thanks in advance.
The above error was due to a dependency issue with cassandra-driver-core.
I solved it by adding the metrics dependency to my pom.xml:
<dependency>
    <groupId>io.dropwizard.metrics</groupId>
    <artifactId>metrics-core</artifactId>
    <version>3.2.2</version>
</dependency>
You can use the socat command to forward your local port to your remote Cassandra port:
apt-get install socat
socat tcp-listen:9042,fork tcp:10.0.21.92:9042 &
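If the dependency fix alone does not resolve it, a quick driver-side probe can separate connectivity/auth problems from problems in the RDD code. This is only a sketch, assuming the spark-cassandra-connector's CassandraConnector API and the sparkContext from the question:

import com.datastax.driver.core.Session;
import com.datastax.spark.connector.cql.CassandraConnector;
import org.apache.spark.api.java.JavaSparkContext;

public class CassandraConnectivityCheck {
    // Opens a session through the same connector configuration the RDD operations use;
    // if this throws, the problem is connectivity or auth rather than the reduceByKey logic.
    public static void probe(JavaSparkContext sparkContext) {
        CassandraConnector connector = CassandraConnector.apply(sparkContext.getConf());
        Session session = connector.openSession();
        try {
            session.execute("SELECT release_version FROM system.local");
        } finally {
            session.close();
        }
    }
}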

Flink Kafka - how to make App run in Parallel?

I am creating an app in Flink to:
Read messages from a topic
Do some simple processing on them
Write the result to a different topic
My code does work, however it does not run in parallel.
How do I do that?
It seems my code runs on only one thread/block.
On the Flink Web Dashboard:
The app goes to running status
But there is only one block shown in the overview of subtasks
And Bytes Received / Sent and Records Received / Sent are always zero (no update)
Here is my code. Please help me learn how to split my app so that it can run in parallel, and tell me whether I am writing the app correctly.
public class SimpleApp {
    public static void main(String[] args) throws Exception {
        // create execution environment INPUT
        StreamExecutionEnvironment env_in = StreamExecutionEnvironment.getExecutionEnvironment();

        // event time characteristic
        env_in.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // production ready (does NOT work if greater than 1)
        env_in.setParallelism(Integer.parseInt(args[0]));

        // configure kafka consumer
        Properties properties = new Properties();
        properties.setProperty("zookeeper.connect", "localhost:2181");
        properties.setProperty("bootstrap.servers", "localhost:9092");
        properties.setProperty("auto.offset.reset", "earliest");

        // create a kafka consumer
        final DataStream<String> consumer = env_in
                .addSource(new FlinkKafkaConsumer09<>("test", new SimpleStringSchema(), properties));

        // filter data
        SingleOutputStreamOperator<String> result = consumer.filter(new FilterFunction<String>() {
            @Override
            public boolean filter(String s) throws Exception {
                return s.substring(0, 2).contentEquals("PS");
            }
        });

        // Process data: transform String records to JSON objects
        SingleOutputStreamOperator<JSONObject> data = result.map(new MapFunction<String, JSONObject>() {
            @Override
            public JSONObject map(String value) throws Exception {
                JSONObject jsnobj = new JSONObject();
                if (value.substring(0, 2).contentEquals("PS")) {
                    // 1. Raw Data
                    jsnobj.put("Raw_Data", value.substring(0, value.length() - 6));

                    // 2. Comment
                    int first_index_comment = value.indexOf("$");
                    int last_index_comment = value.lastIndexOf("$") + 1;
                    String comment = value.substring(first_index_comment, last_index_comment);
                    comment = comment.substring(0, comment.length() - 6);
                    jsnobj.put("Comment", comment);
                } else {
                    jsnobj.put("INVALID", value);
                }
                return jsnobj;
            }
        });

        // Write JSON to Kafka topic
        data.addSink(new FlinkKafkaProducer09<JSONObject>("localhost:9092",
                "FilteredData",
                new SimpleJsonSchema()));

        env_in.execute();
    }
}
My code does work, but it seems to run on only a single thread (one block shown in the web interface). No data is passing through, hence the bytes sent/received are never updated.
How do I make it run in parallel?
To run your job in parallel you can do two things:
Increase the parallelism of your job at the env level, i.e. do something like
StreamExecutionEnvironment env_in =
        StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(4);
But this only increases parallelism on the Flink side after it reads the data, so if the source is producing data faster it might not be fully utilized.
To fully parallelize your job, set up multiple partitions for your Kafka topic, ideally as many as the parallelism you want for your Flink job. So you might want to do something like the following when creating your Kafka topic:
bin/kafka-topics.sh --create --zookeeper localhost:2181
--replication-factor 3 --partitions 4 --topic test
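Putting the two suggestions together, here is a sketch of how the source side could look once the topic has 4 partitions (reusing env_in and properties from the question; the explicit setParallelism on the source is optional once the environment parallelism is 4):

// With 4 partitions on "test" and a parallelism of 4, each consumer subtask reads
// one partition, and the downstream filter/map operators inherit that parallelism.
env_in.setParallelism(4);

DataStream<String> consumer = env_in
        .addSource(new FlinkKafkaConsumer09<>("test", new SimpleStringSchema(), properties))
        .setParallelism(4); // explicit per-operator parallelism for the source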

Switching JavaStreamingContext from INITIALIZED to ACTIVE

I'm using the example code provided with Spark Streaming, "JavaKafkaWordCount.java".
public final class JavaKafkaWordCount {
    private static final Pattern SPACE = Pattern.compile(" ");

    private JavaKafkaWordCount() {
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 4) {
            System.err.println("Usage: JavaKafkaWordCount <zkQuorum> <group> <topics> <numThreads>");
            System.exit(1);
        }

        StreamingExamples.setStreamingLogLevels();
        SparkConf sparkConf = new SparkConf().setAppName("JavaKafkaWordCount");

        // Create the context with a 2-second batch size
        JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));

        int numThreads = Integer.parseInt(args[3]);
        Map<String, Integer> topicMap = new HashMap<>();
        String[] topics = args[2].split(",");
        for (String topic : topics) {
            topicMap.put(topic, numThreads);
        }

        JavaPairReceiverInputDStream<String, String> messages =
                KafkaUtils.createStream(jssc, args[0], args[1], topicMap);

        JavaDStream<String> lines = messages.map(Tuple2::_2);
        JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(SPACE.split(x)).iterator());
        JavaPairDStream<String, Integer> wordCounts = words.mapToPair(s -> new Tuple2<>(s, 1))
                .reduceByKey((i1, i2) -> i1 + i2);

        wordCounts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
After creating the SparkConf object, it creates the JavaStreamingContext.
Then it defines all the functions needed to do the word count, and it starts the JavaStreamingContext. After that, it never comes back to wordCounts.print(), yet it keeps printing. How is that possible? What happens when the JSSC switches from INITIALIZED to ACTIVE? Is it a loop or what?
Internally, Spark Streaming uses a scheduler to execute all registered 'output operations'.
'Output operations' are operations that cause the materialization of the declared stream transformations, which are lazy just as in Spark.
In the particular case of the code in the question, wordCounts.print(); is such an 'output operation'; it is registered in the Spark Streaming scheduler, causing it to execute at each 'batch interval'. The 'batch interval' is defined at the moment the streaming context is created. Going back to the code above, in new JavaStreamingContext(sparkConf, new Duration(2000)); the 'batch interval' is 2000 ms, or 2 seconds.
In a nutshell:
Every 2 seconds, Spark Streaming will trigger the execution of wordCounts.print(), which in turn materializes the evaluation of the DStream with the data for that interval. The materialization process will apply all transformations defined on the DStream (and underlying RDD), such as the map, flatMap and mapToPair operations in the same code.
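To make the registration explicit, here is a sketch of another output operation on the same wordCounts stream; like print(), it would be scheduled once per batch and force the lazy transformations to be evaluated.

// Registered with the streaming scheduler just like print(): runs once per
// 2-second batch and materializes the flatMap/mapToPair/reduceByKey lineage.
wordCounts.foreachRDD((rdd, time) ->
        System.out.println("Batch " + time + ": " + rdd.count() + " distinct words"));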

Spark Checkpoint doesn't remember state (Java HDFS)

I already looked at "Spark streaming not remembering previous state", but it doesn't help.
I also looked at http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing but can't find JavaStreamingContextFactory, although I am using spark-streaming_2.11 v2.0.1.
My code works fine, but when I restart it, it won't remember the last checkpoint.
Function0<JavaStreamingContext> scFunction = new Function0<JavaStreamingContext>() {
    @Override
    public JavaStreamingContext call() throws Exception {
        // Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.milliseconds(SPARK_DURATION));
        // checkpointDir = "hdfs://user:pw@192.168.1.50:54310/spark/checkpoint";
        ssc.sparkContext().setCheckpointDir(checkpointDir);
        StorageLevel.MEMORY_AND_DISK();
        return ssc;
    }
};
JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(checkpointDir, scFunction);
Currently data is streaming from Kafka and I am performing some transformations and actions.
JavaPairDStream<Integer, Long> responseCodeCountDStream =
        logObject.transformToPair(MainApplication::responseCodeCount);
JavaPairDStream<Integer, Long> cumulativeResponseCodeCountDStream =
        responseCodeCountDStream.updateStateByKey(COMPUTE_RUNNING_SUM);
cumulativeResponseCodeCountDStream.foreachRDD(rdd -> {
    rdd.checkpoint();
    LOG.warn("Response code counts: " + rdd.take(100));
});
Could somebody point me in the right direction if I am missing something?
Also, I can see that the checkpoint is being saved in HDFS. But why won't it read from it?

Submitting Spark application on standalone cluster

I am rather new to using Spark and I am having issues running a simple word count application on a standalone cluster. I have a cluster consisting of one master node and one worker, launched on AWS using the spark-ec2 script. Everything works fine when I run the code locally using
./bin/spark-submit --class com.spark.SparkDataAnalysis --master local[*] ./uber-ingestion-0.0.1-SNAPSHOT.jar file:///root/textfile.txt s3n://bucket/wordcount
This saves the output into the specified directory as it should.
When I try to run the application using
./bin/spark-submit --class com.spark.SparkDataAnalysis --master spark://server-ip:7077 ./uber-ingestion-0.0.1-SNAPSHOT.jar file:///root/textfile.txt s3n://bucket/wordcount
it just keeps running and never produces a final result. The directory gets created, but only a temporary file of 0 bytes is present.
According to the Spark UI it keeps running the mapToPair function indefinitely.
Here is a picture of the Spark UI
Does anyone know why this is happening and how to solve it?
Here is the code:
public class SparkDataAnalysis {
    public static void main(String args[]) {
        SparkConf conf = new SparkConf().setAppName("SparkDataAnalysis");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> input = sc.textFile(args[0]);
        JavaRDD<String> words = input.flatMap(s -> Arrays.asList(s.split(" ")));
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(t -> new Tuple2<String, Integer>(t, 1))
                .reduceByKey((x, y) -> x + y);

        counts.saveAsTextFile(args[1]);
    }
}
I skipped using a standalone cluster via the spark-ec2 script and used Amazon EMR instead. There everything worked perfectly.
