How to access a Java Spark Broadcast variable?

I am trying to broadcast a Dataset in Spark in order to access it from within a map function. The first print statement returns the first line of the broadcast dataset as expected. Unfortunately, the second print statement does not return a result; the execution simply hangs at this point.
Any idea what I'm doing wrong?
Broadcast<JavaRDD<Row>> broadcastedTrainingData = this.javaSparkContext.broadcast(trainingData.toJavaRDD());
System.out.println("Data:" + broadcastedTrainingData.value().first());
JavaRDD<Row> rowRDD = this.javaSparkContext.parallelize(stringAsList).map((Integer row) -> {
    System.out.println("Data (map):" + broadcastedTrainingData.value().first());
    return RowFactory.create(row);
});
The following pseudocode highlights what I want to achieve. My main goal is to broadcast the training dataset so I can use it from within a map function.
public Dataset<Row> getWSSE(Dataset<Row> trainingData, int clusterRange) {
    StructType structType = new StructType();
    structType = structType.add("ClusterAm", DataTypes.IntegerType, false);
    structType = structType.add("Cost", DataTypes.DoubleType, false);

    List<Integer> stringAsList = new ArrayList<>();
    for (int clusterAm = 2; clusterAm < clusterRange + 2; clusterAm++) {
        stringAsList.add(clusterAm);
    }

    Broadcast<Dataset> broadcastedTrainingData = this.javaSparkContext.broadcast(trainingData);
    System.out.println("Data:" + broadcastedTrainingData.value().first());

    JavaRDD<Row> rowRDD = this.javaSparkContext.parallelize(stringAsList).map((Integer row) -> RowFactory.create(row));
    StructType schema = DataTypes.createStructType(new StructField[]{DataTypes.createStructField("ClusterAm", DataTypes.IntegerType, false)});
    Dataset wsse = sqlContext.createDataFrame(rowRDD, schema).toDF();
    wsse.show();

    ExpressionEncoder<Row> encoder = RowEncoder.apply(structType);
    Dataset result = wsse.map(
            (MapFunction<Row, Row>) row -> RowFactory.create(
                    row.getAs("ClusterAm"),
                    new KMeans().setK(row.getAs("ClusterAm")).setSeed(1L)
                            .fit(broadcastedTrainingData.value())
                            .computeCost(broadcastedTrainingData.value())),
            encoder);
    result.show();

    broadcastedTrainingData.destroy();
    return wsse;
}

Dataset<Row> trainingData = ...; // <your dataset>

// Create the broadcast variable. No need to write ClassTag code by hand;
// use akka.japi.Util, which is available.
Broadcast<Dataset<Row>> broadcastedTrainingData = spark.sparkContext()
        .broadcast(trainingData, akka.japi.Util.classTag(Dataset.class));

// Here is the catch: when you are iterating over a Dataset,
// Spark will actually run it in distributed mode. So if you try to access
// your object directly (e.g. trainingData), it would be null,
// because you didn't ask Spark to explicitly send the outside variable to
// each machine where this runs in parallel.
// That is why you need a broadcast variable (the most common use of broadcast).
someSparkDataSet.foreach((row) -> {
    Dataset<Row> receivedBroadcast = broadcastedTrainingData.value();
    ...
    ...
});
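If the training data is small enough to collect to the driver, a related pattern (not part of the answer above, just a sketch) is to broadcast the collected rows rather than the Dataset handle; each executor then receives one read-only copy of the rows:
import java.util.List;
import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Row;

// Collect the (small) training data on the driver and broadcast the rows themselves.
List<Row> localTrainingRows = trainingData.collectAsList();
Broadcast<List<Row>> broadcastRows = javaSparkContext.broadcast(localTrainingRows);

someSparkDataSet.foreach((ForeachFunction<Row>) row -> {
    // Every task reads the same broadcast copy instead of re-evaluating the Dataset.
    List<Row> receivedRows = broadcastRows.value();
    // ... use receivedRows together with row here
});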

Related

How to transform particular code piece from Spark 1.6.2 to Spark 2.2.0?

I need to port my Spark 1.6.2 code to Spark 2.2.0 in Java.
DataFrame eventsRaw = sqlContext.sql("SELECT * FROM my_data");
Row[] rddRows = eventsRaw.collect();
for (int rowIdx = 0; rowIdx < rddRows.length; ++rowIdx) {
    Map<String, String> myProperties = new HashMap<>();
    myProperties.put("startdate", rddRows[rowIdx].get(1).toString());
    JEDIS.hmset("PK:" + rddRows[rowIdx].get(0).toString(), myProperties); // JEDIS is a Redis client for Java
}
As far as I understand, there is no DataFrame in Spark 2.2.0 for Java, only Dataset. However, if I substitute DataFrame with Dataset, I get Object[] instead of Row[] as the output of eventsRaw.collect(). Then get(1) is marked in red and I cannot compile the code.
How can I do this correctly?
DataFrame (Scala) is Dataset<Row>:
SparkSession spark;
...
Dataset<Row> eventsRaw = spark.sql("SELECT * FROM my_data");
but instead of collect you should rather use foreach (with a lazy singleton connection):
eventsRaw.foreach(
    (ForeachFunction<Row>) row -> ... // replace ... with appropriate logic
);
or foreachPartition (initialize connection for each partition):
eventsRaw.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
    ... // replace ... with appropriate logic
});

Spark error when convert JavaRDD to DataFrame: java.util.Arrays$ArrayList is not a valid external type for schema of array<string>

I am using Spark 2.1.0. The following code reads a text file, converts the content to a DataFrame, and then feeds it into a Word2Vec model:
SparkSession spark = SparkSession.builder().appName("word2vector").getOrCreate();
JavaRDD<String> lines = spark.sparkContext().textFile("input.txt", 10).toJavaRDD();
JavaRDD<List<String>> lists = lines.map(new Function<String, List<String>>() {
    public List<String> call(String line) {
        List<String> list = Arrays.asList(line.split(" "));
        return list;
    }
});

JavaRDD<Row> rows = lists.map(new Function<List<String>, Row>() {
    public Row call(List<String> list) {
        return RowFactory.create(list);
    }
});

StructType schema = new StructType(new StructField[] {
    new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});
Dataset<Row> input = spark.createDataFrame(rows, schema);
input.show(3);
Word2Vec word2Vec = new Word2Vec().setInputCol("text").setOutputCol("result").setVectorSize(100).setMinCount(0);
Word2VecModel model = word2Vec.fit(input);
Dataset<Row> result = model.transform(input);
It throws an exception:
java.lang.RuntimeException: Error while encoding: java.util.Arrays$ArrayList is not a valid external type for schema of array
which happens at the line input.show(3), so createDataFrame() is causing the exception, because Arrays.asList() returns an Arrays$ArrayList, which is not supported here. However, the official Spark documentation has the following code:
List<Row> data = Arrays.asList(
    RowFactory.create(Arrays.asList("Hi I heard about Spark".split(" "))),
    RowFactory.create(Arrays.asList("I wish Java could use case classes".split(" "))),
    RowFactory.create(Arrays.asList("Logistic regression models are neat".split(" ")))
);
StructType schema = new StructType(new StructField[]{
    new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});
Dataset<Row> documentDF = spark.createDataFrame(data, schema);
which works just fine. If Arrays$ArrayList is not supported, how come this code works? The difference is that I am converting a JavaRDD<Row> to a DataFrame, while the official documentation converts a List<Row> to a DataFrame. I believe the Spark Java API has an overloaded createDataFrame() method which takes a JavaRDD<Row> and converts it to a DataFrame based on the provided schema. I am confused about why it is not working. Can anyone help?
I encountered the same issue several days ago, and the only way to solve this problem is to use an array of arrays. Why? Here is the explanation:
An ArrayType is a wrapper for Scala arrays, which correspond one-to-one to Java arrays. A Java ArrayList is not mapped by default to a Scala array, which is why you get the exception:
java.util.Arrays$ArrayList is not a valid external type for schema of array
Hence, passing a String[] directly should have worked:
RowFactory.create(line.split(" "))
But since create takes a varargs list of Objects (a row may have a list of columns), the String[] gets interpreted as a list of String columns. That's why a double array of String is required:
RowFactory.create(new String[][] {line.split(" ")})
However, that still leaves the mystery of how the Spark documentation constructs a DataFrame from a Java List of rows. The reason is that the SparkSession.createDataFrame overload that takes a java.util.List of rows as its first parameter performs special type checks and conversions, turning every Java Iterable (including ArrayList) into a Scala array.
The SparkSession.createDataFrame overload that takes a JavaRDD, however, maps the row content directly to the DataFrame.
To wrap up, this is the correct version:
SparkSession spark = SparkSession.builder().master("local[*]").appName("Word2Vec").getOrCreate();
SparkContext sc = spark.sparkContext();
sc.setLogLevel("WARN");

JavaRDD<String> lines = sc.textFile("input.txt", 10).toJavaRDD();
JavaRDD<Row> rows = lines.map(new Function<String, Row>() {
    public Row call(String line) {
        return RowFactory.create(new String[][] {line.split(" ")});
    }
});

StructType schema = new StructType(new StructField[] {
    new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});

Dataset<Row> input = spark.createDataFrame(rows, schema);
input.show(3);
Hope this solves your problem.
It's exactly as the error says: ArrayList is not the equivalent of a Scala array. You should use a normal array (i.e. String[]) instead.
For me, the following works fine:
JavaRDD<Row> rowRdd = rdd.map(r -> RowFactory.create(r.split(",")));
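As an alternative sketch (not from the answers above), the RDD round-trip can be avoided entirely by reading the file with the DataFrame reader and using the built-in split function, which yields a proper array<string> column; this reuses the SparkSession from the snippets above:
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.split;

// spark.read().text() produces a single string column named "value".
Dataset<Row> textLines = spark.read().text("input.txt");
Dataset<Row> input = textLines.select(split(col("value"), " ").as("text"));
input.show(3);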

Load CSV data with partitions in Spark 2.0

In Spark 2.0, I have the following method, which loads the data into a Dataset:
public Dataset<Employee> GetDataFrameFromTextFile() {
    // The schema is encoded in a string
    String schemaString = "id firstname lastname accountNo";

    // Generate the schema based on the schema string
    List<StructField> fields = new ArrayList<>();
    for (String fieldName : schemaString.split("\t")) {
        StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
        fields.add(field);
    }
    StructType schema = DataTypes.createStructType(fields);

    return sparksession.read().schema(schema)
            .option("mode", "DROPMALFORMED")
            .option("sep", "|")
            .option("ignoreLeadingWhiteSpace", true)
            .option("ignoreTrailingWhiteSpace", true)
            .csv("D:\\HadoopDirectory\\Employee.txt")
            .as(Encoders.bean(Employee.class));
}
and in my driver code, a map operation is called on the Dataset:
Dataset<Employee> rowDataset = ad.GetDataFrameFromTextFile();
Dataset<String> map = rowDataset.map(new MapFunction<Employee, String>() {
    @Override
    public String call(Employee emp) throws Exception {
        return TraverseRuleByADRow(emp);
    }
}, Encoders.STRING());
When I run the driver program in Spark local mode with 8 cores on my laptop, I see the input file split into 8 partitions. Is there a way to load the file into more than 8 partitions, say 100 or 1000?
I know this is achievable if the source data comes from a SQL Server table via JDBC:
sparksession.read().format("jdbc")
        .option("url", urlCandi)
        .option("dbtable", tableName)
        .option("partitionColumn", partitionColumn)
        .option("lowerBound", String.valueOf(lowerBound))
        .option("upperBound", String.valueOf(upperBound))
        .option("numPartitions", String.valueOf(numberOfPartitions))
        .load().as(Encoders.bean(Employee.class));
Thanks
Use the repartition() method on the Dataset. According to the Scaladoc, there is no option to set the number of partitions while reading.
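A minimal sketch of how that could look here; the partition count of 100 is only an example value, and the MapFunction is the one from the question:
// Read as before, then explicitly shuffle the data into more partitions before the map.
Dataset<Employee> rowDataset = ad.GetDataFrameFromTextFile().repartition(100);
Dataset<String> map = rowDataset.map(
        (MapFunction<Employee, String>) emp -> TraverseRuleByADRow(emp),
        Encoders.STRING());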

DataFrame filtering based on a second DataFrame

Using Spark SQL, I have two DataFrames that are created from a single one, for example:
df = sqlContext.createDataFrame(...);
df1 = df.filter("value = 'abc'"); //[path, value]
df2 = df.filter("value = 'qwe'"); //[path, value]
I want to filter df1 so that rows are kept only if part of their 'path' equals some path in df2.
So if df1 has a row with path 'a/b/c/d/e', I want to find out whether df2 contains a row whose path is 'a/b/c'.
In SQL it would look like
SELECT * FROM df1 WHERE udf(path) IN (SELECT path FROM df2)
where udf is a user-defined function that shortens the original path from df1.
A naive solution is to use a JOIN and then filter the result, but it is slow, since df1 and df2 each have more than 10 million rows.
I also tried the following code, but first I had to create a broadcast variable from df2:
static Broadcast<DataFrame> bdf;
bdf = sc.broadcast(df2); // variable 'sc' is JavaSparkContext

sqlContext.createDataFrame(df1.javaRDD().filter(
    new Function<Row, Boolean>() {
        @Override
        public Boolean call(Row row) throws Exception {
            String foo = shortenPath(row.getString(0));
            return bdf.value().filter("path = '" + foo + "'").count() > 0;
        }
    }
), myClass.class);
The problem I'm having is that Spark gets stuck when the return value is evaluated, i.e. when the filtering of df2 is performed.
I would like to know how to work with two DataFrames to do this.
I really want to avoid JOIN. Any ideas?
EDIT:
In my original code, df1 has the alias 'first' and df2 the alias 'second'. This join is not Cartesian, and it also does not use broadcast.
df1 = df1.as("first");
df2 = df2.as("second");

df1.join(df2, df1.col("first.path").lt(df2.col("second.path")), "left_outer")
   .filter("isPrefix(first.path, second.path)")
   .na().drop("any");
isPrefix is a UDF:
UDF2 isPrefix = new UDF2<String, String, Boolean>() {
    @Override
    public Boolean call(String p, String s) throws Exception {
        // return true if (p.length() + 4 == s.length()) and s.contains(p)
    }
};
shortenPath - it cuts the last two segments off the path:
UDF1 shortenPath = new UDF1<String, String>() {
    @Override
    public String call(String s) throws Exception {
        String[] foo = s.split("/");
        String result = "";
        for (int i = 0; i < foo.length - 2; i++) {
            result += foo[i];
            if (i < foo.length - 3) result += "/";
        }
        return result;
    }
};
Example records (path is unique):
a/a/a/b/c abc
a/a/a qwe
a/b/c/d/e abc
a/b/c qwe
a/b/b/k foo
a/b/f/a bar
...
So df1 consists of
a/a/a/b/c abc
a/b/c/d/e abc
...
and df2 consists of
a/a/a qwe
a/b/c qwe
...
There are at least a few problems with your code:
You cannot execute an action or transformation inside another action or transformation. That means filtering the broadcast DataFrame simply cannot work, and you should get an exception.
The join you use is executed as a Cartesian product followed by a filter. Since Spark uses hashing for joins, only equality-based joins can be executed efficiently without a Cartesian product. This is slightly related to "Why using a UDF in a SQL query leads to cartesian product?"
If both DataFrames are relatively large and of similar size, broadcasting is unlikely to be useful. See "Why my BroadcastHashJoin is slower than ShuffledHashJoin in Spark".
Not important for performance, but isPrefix seems to be wrong. In particular, it looks like it can match both a prefix and a suffix.
The col("first.path").lt(col("second.path")) condition looks wrong. I assume you want a/a/a/b/c from df1 to match a/a/a from df2. If so, it should be gt, not lt.
Probably the best thing you can do is something similar to this:
import org.apache.spark.sql.functions.{col, regexp_extract}

val df = sc.parallelize(Seq(
  ("a/a/a/b/c", "abc"), ("a/a/a", "qwe"),
  ("a/b/c/d/e", "abc"), ("a/b/c", "qwe"),
  ("a/b/b/k", "foo"), ("a/b/f/a", "bar")
)).toDF("path", "value")

val df1 = df
  .where(col("value") === "abc")
  .withColumn("path_short", regexp_extract(col("path"), "^(.*)(/.){2}$", 1))
  .as("df1")

val df2 = df.where(col("value") === "qwe").as("df2")

val joined = df1.join(df2, col("df1.path_short") === col("df2.path"))
You can try to broadcast one of the tables like this (Spark >= 1.5.0 only):
import org.apache.spark.sql.functions.broadcast
df1.join(broadcast(df2), col("df1.path_short") === col("df2.path"))
and increase the auto-broadcast limits, but as I've mentioned above, it will most likely be less efficient than a plain HashJoin.
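For reference, a rough Java translation of the same idea, assuming the Spark 2.x Dataset API (the DataFrame class in 1.x exposes the same methods); df, the column names, and the regex are taken from the Scala snippet above:
import static org.apache.spark.sql.functions.broadcast;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.regexp_extract;

Dataset<Row> df1 = df.where(col("value").equalTo("abc"))
        .withColumn("path_short", regexp_extract(col("path"), "^(.*)(/.){2}$", 1))
        .alias("df1");
Dataset<Row> df2 = df.where(col("value").equalTo("qwe")).alias("df2");

// Equality join on the shortened path; broadcast(df2) is only the optional hint.
Dataset<Row> joined = df1.join(broadcast(df2),
        col("df1.path_short").equalTo(col("df2.path")));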
As a possible way of implementing IN with a subquery, a LEFT SEMI JOIN can be used:
JavaSparkContext javaSparkContext = new JavaSparkContext("local", "testApp");
SQLContext sqlContext = new SQLContext(javaSparkContext);

StructType schema = DataTypes.createStructType(new StructField[]{
    DataTypes.createStructField("path", DataTypes.StringType, false),
    DataTypes.createStructField("value", DataTypes.StringType, false)
});

// Prepare first DataFrame
List<Row> dataForFirstDF = new ArrayList<>();
dataForFirstDF.add(RowFactory.create("a/a/a/b/c", "abc"));
dataForFirstDF.add(RowFactory.create("a/b/c/d/e", "abc"));
dataForFirstDF.add(RowFactory.create("x/y/z", "xyz"));
DataFrame df1 = sqlContext.createDataFrame(javaSparkContext.parallelize(dataForFirstDF), schema);

df1.show();
// +---------+-----+
// |     path|value|
// +---------+-----+
// |a/a/a/b/c|  abc|
// |a/b/c/d/e|  abc|
// |    x/y/z|  xyz|
// +---------+-----+

// Prepare second DataFrame
List<Row> dataForSecondDF = new ArrayList<>();
dataForSecondDF.add(RowFactory.create("a/a/a", "qwe"));
dataForSecondDF.add(RowFactory.create("a/b/c", "qwe"));
DataFrame df2 = sqlContext.createDataFrame(javaSparkContext.parallelize(dataForSecondDF), schema);

// Use left semi join to filter df1 based on path in df2
Column pathContains = functions.column("firstDF.path").contains(functions.column("secondDF.path"));
DataFrame result = df1.as("firstDF").join(df2.as("secondDF"), pathContains, "leftsemi");

result.show();
// +---------+-----+
// |     path|value|
// +---------+-----+
// |a/a/a/b/c|  abc|
// |a/b/c/d/e|  abc|
// +---------+-----+
The physical plan of such a query will look like this:
== Physical Plan ==
Limit 21
 ConvertToSafe
  LeftSemiJoinBNL Some(Contains(path#0, path#2))
   ConvertToUnsafe
    Scan PhysicalRDD[path#0,value#1]
   TungstenProject [path#2]
    Scan PhysicalRDD[path#2,value#3]
It will use LeftSemiJoinBNL for the actual join operation, which should broadcast the values internally. For more details, refer to the actual implementation in Spark - LeftSemiJoinBNL.scala.
P.S. I didn't quite understand the need for removing the last two path segments, but if that's needed, it can be done as @zero323 proposed (using regexp_extract).

Convert Spark DataFrame to Pojo Object

Please see the code below:
//Create Spark Context
SparkConf sparkConf = new SparkConf().setAppName("TestWithObjects").setMaster("local");
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
//Creating RDD
JavaRDD<Person> personsRDD = javaSparkContext.parallelize(persons);
//Creating SQL context
SQLContext sQLContext = new SQLContext(javaSparkContext);
DataFrame personDataFrame = sQLContext.createDataFrame(personsRDD, Person.class);
personDataFrame.show();
personDataFrame.printSchema();
personDataFrame.select("name").show();
personDataFrame.registerTempTable("peoples");
DataFrame result = sQLContext.sql("SELECT * FROM peoples WHERE name='test'");
result.show();
After this, I need to convert the DataFrame 'result' to a Person object or a List<Person>. Thanks in advance.
DataFrame is simply a type alias of Dataset[Row]. Its operations are also referred to as “untyped transformations”, in contrast to the “typed transformations” that come with strongly typed Scala/Java Datasets.
The conversion from Dataset[Row] to Dataset[Person] is very simple in Spark:
DataFrame result = sQLContext.sql("SELECT * FROM peoples WHERE name='test'");
At this point, Spark converts your data into DataFrame = Dataset[Row], a collection of generic Row objects, since it does not know the exact type.
// Create an Encoder for the Java bean
Encoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Person> personDF = result.as(personEncoder);
personDF.show();
Now Spark converts the Dataset[Row] into a Dataset[Person] of type-specific Scala/Java JVM objects, as dictated by the Person class.
Please refer to the link below, provided by Databricks, for further details:
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
A DataFrame is stored as Rows, so you can use the Row methods to go from untyped to typed. Take a look at the get methods.
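A minimal sketch of that manual mapping, assuming the Spark 2.x API (where DataFrame is Dataset<Row>) and a Person bean with name and age properties; those property names are assumptions, adjust them to your schema:
import java.util.List;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

Encoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Person> people = result.map((MapFunction<Row, Person>) row -> {
    Person p = new Person();
    p.setName(row.getAs("name"));          // read each column by name ...
    p.setAge(row.<Integer>getAs("age"));   // ... and set it on the bean
    return p;
}, personEncoder);

List<Person> personList = people.collectAsList();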
If someone is looking for a way to convert a JSON string column in a Dataset<Row> to a Dataset<PojoClass>:
Sample POJO, Testing:
@Data
public class Testing implements Serializable {
    private String name;
    private String dept;
}
In the above code, @Data is a Lombok annotation that generates getters and setters for the Testing class.
The actual conversion logic in Spark:
@Test
void shouldConvertJsonStringToPojo() {
    var sparkSession = SparkSession.builder().getOrCreate();
    var structType = new StructType(new StructField[] {
        new StructField("employee", DataTypes.StringType, false, Metadata.empty()),
    });
    var ds = sparkSession.createDataFrame(new ArrayList<>(
        Arrays.asList(RowFactory.create(new Object[]{"{ \"name\": \"test\", \"dept\": \"IT\"}"}))), structType);

    var objectMapper = new ObjectMapper();
    var bean = Encoders.bean(Testing.class);
    var testingDataset = ds.map((MapFunction<Row, Testing>) row -> {
        var dept = row.<String>getAs("employee");
        return objectMapper.readValue(dept, Testing.class);
    }, bean);

    assertEquals("test", testingDataset.head().getName());
}
