I have an RDD and want to add more RDDs to it. How can I do that in Spark?
I have code like the snippet below. I want to return an RDD from the DStream I have.
JavaDStream<Object> newDStream = dStream.map(this);
JavaRDD<Object> rdd = context.sparkContext().emptyRDD();
return newDStream.wrapRDD(context.sparkContext().emptyRDD());
I cannot find much documentation about the wrapRDD method of the JavaDStream class provided by Apache Spark.
Since an RDD is immutable, what you can do is use sparkContext.parallelize to create a new RDD and then union it with the existing one:
List<Object> objectList = new ArrayList<Object>();
objectList.add("your content");
JavaRDD<Object> objectRDD = sparkContext.parallelize(objectList);
JavaRDD<Object> newRDD = oldRDD.union(objectRDD);
See https://spark.apache.org/docs/latest/rdd-programming-guide.html#parallelized-collections
You can use JavaStreamingContext.queueStream and fill it with a Queue<JavaRDD<YourType>>:
public JavaInputDStream<Object> FillDStream() {
    Queue<JavaRDD<Object>> rdds = new LinkedList<JavaRDD<Object>>();
    rdds.add(context.sparkContext().emptyRDD());
    rdds.add(context.sparkContext().emptyRDD());
    JavaInputDStream<Object> filledDStream = context.queueStream(rdds);
    return filledDStream;
}
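A nice property of the queue-based stream is that it is polled on each batch interval, so you can keep adding RDDs to the queue even after the streaming context has started. A rough sketch of that usage, assuming context is the same JavaStreamingContext as above (the content of the added RDD is just an illustration):
Queue<JavaRDD<Object>> rddQueue = new LinkedList<JavaRDD<Object>>();
JavaDStream<Object> stream = context.queueStream(rddQueue);
// RDDs added to the queue (even after context.start()) are consumed on later batch intervals.
rddQueue.add(context.sparkContext().parallelize(Arrays.asList((Object) "more content")));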
I am reading a txt file as a JavaRDD with the following command:
JavaRDD<String> vertexRDD = ctx.textFile(pathVertex);
Now, I would like to convert this to a JavaRDD<Row>, because in that txt file I have two columns of integers and I want to add a schema to the rows after splitting the columns.
I also tried this:
JavaRDD<Row> rows = vertexRDD.map(line -> line.split("\t"))
But it says I cannot assign the map function to an "Object" RDD.
How can I create a JavaRDD<Row> out of a JavaRDD<String>?
How can I apply map to the JavaRDD?
Thanks!
Creating a JavaRDD out of another is implicit when you apply a transformation such as map. Here, the RDD you create is an RDD of arrays of strings (the result of split).
To get an RDD of rows, just create a Row from each array:
JavaRDD<String> vertexRDD = ctx.textFile("");
JavaRDD<String[]> rddOfArrays = vertexRDD.map(line -> line.split("\t"));
JavaRDD<Row> rddOfRows = rddOfArrays.map(fields -> RowFactory.create(fields));
Note that if your goal is then to transform the JavaRDD<Row> to a dataframe (Dataset<Row>), there is a simpler way. You can change the delimiter option when using spark.read to avoid having to use RDDs:
Dataset<Row> dataframe = spark.read()
.option("delimiter", "\t")
.csv("your_path/file.csv");
You can define these two columns as fields of a class, and then you can use:
JavaRDD<Row> rows = rdd.map(new Function<ClassName, Row>() {
    @Override
    public Row call(ClassName target) throws Exception {
        return RowFactory.create(
            target.getField1(),
            target.getUsername());
    }
});
Then create the StructFields, and finally use:
StructType struct = DataTypes.createStructType(fields);
Dataset<Row> dataFrame = sparkSession.createDataFrame(rows, struct);
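For reference, a minimal sketch of how the fields list used above could be built (field names and types are placeholders matching the getters in the example, and require the org.apache.spark.sql.types imports):
// Build the list of StructFields that createStructType consumes.
List<StructField> fields = new ArrayList<StructField>();
fields.add(DataTypes.createStructField("field1", DataTypes.IntegerType, true));
fields.add(DataTypes.createStructField("username", DataTypes.StringType, true));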
I have an RDD and I need to convert it into a Dataset. I tried:
Dataset<Person> personDS = sqlContext.createDataset(personRDD, Encoders.bean(Person.class));
The above line throws the error:
cannot resolve method createDataset(org.apache.spark.api.java.JavaRDD<Main.Person>, org.apache.spark.sql.Encoder<T>)
However, I can convert to a Dataset after converting to a DataFrame. The code below works:
Dataset<Row> personDF = sqlContext.createDataFrame(personRDD, Person.class);
Dataset<Person> personDS = personDF.as(Encoders.bean(Person.class));
.createDataset() accepts RDD<T>, not JavaRDD<T>. JavaRDD is a wrapper around RDD in order to make calls from Java code easier. It contains the RDD internally, which can be accessed using .rdd(). The following creates a Dataset:
Dataset<Person> personDS = sqlContext.createDataset(personRDD.rdd(), Encoders.bean(Person.class));
Call .toDS() on your RDD and you will get a Dataset.
Let me know if it helps. Cheers.
In addition to the accepted answer, if you want to create a Dataset<Row> instead of a Dataset<Person> in Java, you can try it like this:
StructType yourStruct = ...; // Create your own StructType based on the individual field types
Dataset<Row> personDS = sqlContext.createDataset(personRDD.rdd(), RowEncoder.apply(yourStruct));
StructType schema = new StructType()
.add("Id", DataTypes.StringType)
.add("Name", DataTypes.StringType)
.add("Country", DataTypes.StringType);
Dataset<Row> dataSet = sqlContext.createDataFrame(yourJavaRDD, schema);
Be careful with the schema variable: it is not always easy to predict which data type you need to use, and sometimes it is better to just use StringType for all columns.
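If you do read everything as StringType, one option is to cast the columns you are sure about afterwards; a small sketch, where the column name "Id" is just an illustration:
// Hypothetical follow-up: cast a string column to the type you actually need.
Dataset<Row> typed = dataSet.withColumn("Id", dataSet.col("Id").cast(DataTypes.IntegerType));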
I am using Spark 2.1.0. The following code reads a text file, converts the content to a DataFrame, and then feeds it into a Word2Vec model:
SparkSession spark = SparkSession.builder().appName("word2vector").getOrCreate();
JavaRDD<String> lines = spark.sparkContext().textFile("input.txt", 10).toJavaRDD();
JavaRDD<List<String>> lists = lines.map(new Function<String, List<String>>(){
public List<String> call(String line){
List<String> list = Arrays.asList(line.split(" "));
return list;
}
});
JavaRDD<Row> rows = lists.map(new Function<List<String>, Row>() {
public Row call(List<String> list) {
return RowFactory.create(list);
}
});
StructType schema = new StructType(new StructField[] {
new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});
Dataset<Row> input = spark.createDataFrame(rows, schema);
input.show(3);
Word2Vec word2Vec = new Word2Vec().setInputCol("text").setOutputCol("result").setVectorSize(100).setMinCount(0);
Word2VecModel model = word2Vec.fit(input);
Dataset<Row> result = model.transform(input);
It throws an exception
java.lang.RuntimeException: Error while encoding: java.util.Arrays$ArrayList is not a valid external type for
schema of array
which happens at the line input.show(3), so createDataFrame() is causing the exception, because Arrays.asList() returns an Arrays$ArrayList which is not supported here. However, the official Spark documentation has the following code:
List<Row> data = Arrays.asList(
RowFactory.create(Arrays.asList("Hi I heard about Spark".split(" "))),
RowFactory.create(Arrays.asList("I wish Java could use case classes".split(" "))),
RowFactory.create(Arrays.asList("Logistic regression models are neat".split(" ")))
);
StructType schema = new StructType(new StructField[]{
new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});
Dataset<Row> documentDF = spark.createDataFrame(data, schema);
which works just fine. If Arrays$ArrayList is not supported, how come this code works? The difference is that I am converting a JavaRDD<Row> to a DataFrame, while the official documentation converts a List<Row> to a DataFrame. I believe the Spark Java API has an overloaded createDataFrame() method which takes a JavaRDD<Row> and converts it to a DataFrame based on the provided schema. I am confused about why it is not working. Can anyone help?
I encountered the same issue several days ago, and the only way I found to solve it is to use an array of arrays. Why? Here is the explanation:
An ArrayType is a wrapper for Scala arrays, which correspond one-to-one to Java arrays. A Java ArrayList is not mapped by default to a Scala array, so that's why you get the exception:
java.util.Arrays$ArrayList is not a valid external type for schema of array
Hence, passing a String[] directly should work:
RowFactory.create(line.split(" "))
But since create takes a varargs Object list, one element per column of the row, the String[] gets interpreted as a list of String columns. That's why a double array of String is required:
RowFactory.create(new String[][] {line.split(" ")})
That still leaves the mystery of why constructing a DataFrame from a Java List of rows works in the Spark documentation. The reason is that the SparkSession.createDataFrame overload that takes a java.util.List of rows as its first parameter performs special type checks and conversions, turning any Java Iterable (including ArrayList) into a Scala array.
The SparkSession.createDataFrame overload that takes a JavaRDD, however, maps the row content directly to the DataFrame.
To wrap up, this is the correct version:
SparkSession spark = SparkSession.builder().master("local[*]").appName("Word2Vec").getOrCreate();
SparkContext sc = spark.sparkContext();
sc.setLogLevel("WARN");
JavaRDD<String> lines = sc.textFile("input.txt", 10).toJavaRDD();
JavaRDD<Row> rows = lines.map(new Function<String, Row>(){
public Row call(String line){
return RowFactory.create(new String[][] {line.split(" ")});
}
});
StructType schema = new StructType(new StructField[] {
new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});
Dataset<Row> input = spark.createDataFrame(rows, schema);
input.show(3);
Hope this solves your problem.
It's exactly as the error says. ArrayList is not the equivalent of a Scala array. You should use a plain array (e.g. String[]) instead.
For me, the following works fine:
JavaRDD<Row> rowRdd = rdd.map(r -> RowFactory.create(r.split(",")));
I use Spark SQL in a Spark Streaming job to search in a Hive table.
The Kafka streaming works fine without problems. If I run hiveContext.runSqlHive(sqlQuery); outside directKafkaStream.foreachRDD it works fine, but I need the Hive table lookup inside the streaming job. Using JDBC (jdbc:hive2://) would work, but I want to use Spark SQL.
The relevant parts of my source code look as follows:
// set context
SparkConf sparkConf = new SparkConf().setAppName(appName).set("spark.driver.allowMultipleContexts", "true");
SparkContext sparkSqlContext = new SparkContext(sparkConf);
JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, Durations.seconds(batchDuration));
HiveContext hiveContext = new HiveContext(sparkSqlContext);
// Initialize Direct Spark Kafka Stream. Starts from top
JavaPairInputDStream<String, String> directKafkaStream =
KafkaUtils.createDirectStream(streamingContext,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
topicsSet);
// work on stream
directKafkaStream.foreachRDD((Function<JavaPairRDD<String, String>, Void>) rdd -> {
rdd.foreachPartition(tuple2Iterator -> {
// get message
Tuple2<String, String> item = tuple2Iterator.next();
// lookup
String sqlQuery = "SELECT something FROM somewhere";
Seq<String> resultSequence = hiveContext.runSqlHive(sqlQuery);
List<String> result = scala.collection.JavaConversions.seqAsJavaList(resultSequence);
});
return null;
});
// Start the computation
streamingContext.start();
streamingContext.awaitTermination();
I get no meaningful error, even if I surround it with try-catch.
I hope someone can help - Thanks.
//edit:
The solution looks like this:
// work on stream
directKafkaStream.foreachRDD((Function<JavaPairRDD<String, String>, Void>) rdd -> {
// driver
Map<String, String> lookupMap = getResult(hiveContext); //something with hiveContext.runSqlHive(sqlQuery);
rdd.foreachPartition(tuple2Iterator -> {
// worker
while (tuple2Iterator != null && tuple2Iterator.hasNext()) {
// get message
Tuple2<String, String> item = tuple2Iterator.next();
// lookup
String result = lookupMap.get(item._2());
}
});
return null;
});
Just because you want to use Spark SQL, that won't make it possible. Spark's rule number one is: no nested actions, transformations, or distributed data structures.
If you can express your query, for example as a join, you can push it one level higher to foreachRDD, and this pretty much exhausts your options for using Spark SQL here:
directKafkaStream.foreachRDD(rdd -> {
    hiveContext.runSqlHive(sqlQuery);  // runs on the driver
    rdd.foreachPartition(...);
});
Otherwise, a direct JDBC connection can be a valid option.
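For reference, a rough sketch of that JDBC fallback inside foreachPartition, assuming the Hive JDBC driver is on the classpath and that the host, port and query below are placeholders (uses the java.sql classes):
// Each executor opens its own connection to HiveServer2.
try (Connection conn = DriverManager.getConnection("jdbc:hive2://your-hive-host:10000/default");
     Statement stmt = conn.createStatement();
     ResultSet rs = stmt.executeQuery("SELECT something FROM somewhere")) {
    while (rs.next()) {
        String value = rs.getString(1); // use the value for the per-record lookup
    }
} catch (SQLException e) {
    // handle or log the lookup failure
}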
Please see the code below:
//Create Spark Context
SparkConf sparkConf = new SparkConf().setAppName("TestWithObjects").setMaster("local");
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
//Creating RDD
JavaRDD<Person> personsRDD = javaSparkContext.parallelize(persons);
//Creating SQL context
SQLContext sQLContext = new SQLContext(javaSparkContext);
DataFrame personDataFrame = sQLContext.createDataFrame(personsRDD, Person.class);
personDataFrame.show();
personDataFrame.printSchema();
personDataFrame.select("name").show();
personDataFrame.registerTempTable("peoples");
DataFrame result = sQLContext.sql("SELECT * FROM peoples WHERE name='test'");
result.show();
After this, I need to convert the DataFrame 'result' to a Person object or a List<Person>. Thanks in advance.
DataFrame is simply a type alias for Dataset[Row]. These operations are also referred to as “untyped transformations”, in contrast to the “typed transformations” that come with strongly typed Scala/Java Datasets.
The conversion from Dataset[Row] to Dataset[Person] is very simple in Spark:
DataFrame result = sQLContext.sql("SELECT * FROM peoples WHERE name='test'");
At this point, Spark converts your data into DataFrame = Dataset[Row], a collection of generic Row objects, since it does not know the exact type.
// Create an Encoder for the Java bean
Encoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Person> personDF = result.as(personEncoder);
personDF.show();
Now Spark converts the Dataset[Row] into a Dataset[Person] of type-specific Scala/Java JVM objects, as dictated by the class Person.
Please refer to the link below, provided by Databricks, for further details:
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
A DataFrame is stored as Rows, so you can use the Row methods to cast from untyped to typed. Take a look at the get methods.
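For example, a minimal sketch of that manual route (the no-arg constructor and setName setter on Person are assumptions about the OP's bean):
// Collect the rows on the driver and map them by hand using Row's getters.
List<Person> people = new ArrayList<Person>();
for (Row row : result.collectAsList()) {
    Person p = new Person();
    p.setName(row.<String>getAs("name")); // assumed bean setter
    people.add(p);
}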
If someone is looking to convert a JSON string column in a Dataset<Row> to a Dataset<PojoClass>:
Sample POJO: Testing
@Data
public class Testing implements Serializable {
private String name;
private String dept;
}
In the above code, @Data is a Lombok annotation that generates the getters and setters for this Testing class.
The actual conversion logic in Spark:
@Test
void shouldConvertJsonStringToPojo() {
var sparkSession = SparkSession.builder().getOrCreate();
var structType = new StructType(new StructField[] {
new StructField("employee", DataTypes.StringType, false, Metadata.empty()),
});
var ds = sparkSession.createDataFrame(new ArrayList<>(
Arrays.asList(RowFactory.create(new Object[]{"{ \"name\": \"test\", \"dept\": \"IT\"}"}))), structType);
var objectMapper = new ObjectMapper();
var bean = Encoders.bean(Testing.class);
var testingDataset = ds.map((MapFunction<Row, Testing>) row -> {
var dept = row.<String>getAs("employee");
return objectMapper.readValue(dept, Testing.class);
}, bean);
assertEquals("test", testingDataset.head().getName());
}