JavaRDD<String> to JavaRDD<Row> - java

I am reading a txt file as a JavaRDD with the following command:
JavaRDD<String> vertexRDD = ctx.textFile(pathVertex);
Now I would like to convert this to a JavaRDD<Row>, because the txt file contains two columns of integers and I want to add a schema to the rows after splitting the columns.
I also tried this:
JavaRDD<Row> rows = vertexRDD.map(line -> line.split("\t"))
But it says I cannot assign the result of the map function to an "Object" RDD.
How can I create a JavaRDD<Row> out of a JavaRDD<String>?
How can I apply map to the JavaRDD?
Thanks!

Creating a JavaRDD out of another is implicit when you apply a transformation such as map. Here, the RDD you create is an RDD of arrays of strings (the result of split).
To get an RDD of rows, just create a Row from each array:
JavaRDD<String> vertexRDD = ctx.textFile("");
JavaRDD<String[]> rddOfArrays = vertexRDD.map(line -> line.split("\t"));
JavaRDD<Row> rddOfRows = rddOfArrays.map(fields -> RowFactory.create(fields));
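Since the question mentions two integer columns and a schema, here is a minimal sketch of the full path; the column names "src" and "dst" are assumptions, and spark refers to the SparkSession used with spark.read() below:
// Sketch: parse the two tab-separated integer columns and attach a schema.
// Column names "src" and "dst" are assumptions, not taken from the question.
JavaRDD<Row> intRows = vertexRDD.map(line -> {
    String[] parts = line.split("\t");
    return RowFactory.create(Integer.parseInt(parts[0]), Integer.parseInt(parts[1]));
});
StructType vertexSchema = new StructType(new StructField[] {
    new StructField("src", DataTypes.IntegerType, false, Metadata.empty()),
    new StructField("dst", DataTypes.IntegerType, false, Metadata.empty())
});
Dataset<Row> vertices = spark.createDataFrame(intRows, vertexSchema);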
Note that if your goal is then to turn the JavaRDD<Row> into a DataFrame (Dataset<Row>), there is a simpler way: you can change the delimiter option when using spark.read to avoid having to use RDDs at all:
Dataset<Row> dataframe = spark.read()
        .option("delimiter", "\t")
        .csv("your_path/file.csv");

You can define these two columns as fields of a class, and then you can use:
JavaRDD<Row> rows = rdd.map(new Function<ClassName, Row>() {
    @Override
    public Row call(ClassName target) throws Exception {
        return RowFactory.create(
                target.getField1(),
                target.getUsername());
    }
});
Then create the StructField list, and finally use:
StructType struct = DataTypes.createStructType(fields);
Dataset<Row> dataFrame = sparkSession.createDataFrame(rows, struct);
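For completeness, a minimal sketch of how the fields list used above could be built; the field names simply mirror the getters in the example and are assumptions:
// Sketch: build the StructField list consumed by DataTypes.createStructType(fields).
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("field1", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("username", DataTypes.StringType, true));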

Related

convert RDD to Dataset in Java Spark

I have an RDD; I need to convert it into a Dataset. I tried:
Dataset<Person> personDS = sqlContext.createDataset(personRDD, Encoders.bean(Person.class));
The above line throws the error:
cannot resolve method createDataset(org.apache.spark.api.java.JavaRDD<Main.Person>, org.apache.spark.sql.Encoder<T>)
However, I can convert it to a Dataset after converting to a DataFrame. The code below works:
Dataset<Row> personDF = sqlContext.createDataFrame(personRDD, Person.class);
Dataset<Person> personDS = personDF.as(Encoders.bean(Person.class));
.createDataset() accepts RDD<T>, not JavaRDD<T>. JavaRDD is a wrapper around RDD in order to make calls from Java code easier. It contains the RDD internally, which can be accessed using .rdd(). The following creates the Dataset:
Dataset<Person> personDS = sqlContext.createDataset(personRDD.rdd(), Encoders.bean(Person.class));
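For example, assuming a minimal Person JavaBean (the class is not shown in the question, so the fields below are assumptions), the whole conversion could look like this:
// Sketch: a minimal JavaBean plus the conversion via .rdd().
public static class Person implements Serializable {
    private String name;
    private int age;
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}

JavaRDD<Person> personRDD = ...; // built elsewhere, as in the question
Dataset<Person> personDS = sqlContext.createDataset(personRDD.rdd(), Encoders.bean(Person.class));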
Alternatively, in Scala (with spark.implicits._ imported) you can call .toDS() on your RDD and you will get a Dataset directly.
Let me know if it helps. Cheers.
In addition to the accepted answer: if you want to create a Dataset<Row> instead of a Dataset<Person> in Java, try it like this:
StructType yourStruct = ...; // create your own StructType based on the individual field types
Dataset<Row> personDS = sqlContext.createDataset(personRDD.rdd(), RowEncoder.apply(yourStruct));
StructType schema = new StructType()
        .add("Id", DataTypes.StringType)
        .add("Name", DataTypes.StringType)
        .add("Country", DataTypes.StringType);
Dataset<Row> dataSet = sqlContext.createDataFrame(yourJavaRDD, schema);
Be careful with the schema variable: it is not always easy to predict which data type you need to use, and sometimes it is better to just use StringType for all columns.
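A sketch of what yourJavaRDD could look like for that Id/Name/Country schema; jsc (a JavaSparkContext) and the comma-separated input file are assumptions:
// Sketch: build rows matching the Id/Name/Country schema above.
// Assumes comma-separated lines such as "1,Alice,France".
JavaRDD<Row> yourJavaRDD = jsc.textFile("people.csv")
        .map(line -> {
            String[] parts = line.split(",");
            return RowFactory.create(parts[0], parts[1], parts[2]);
        });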

Histogram with Spark Dataframe in Java

Is it possible to generate a histogram dataframe with Spark 2.1 in Java from a Dataset<Row> table?
Convert the Dataset into a JavaRDD whose element type can be Integer, Double, etc. using toJavaRDD().map().
Then convert the JavaRDD into a JavaDoubleRDD using the mapToDouble function.
Finally, apply histogram(int bucketCount) to get the histogram of the data.
Example:
I have a table in Spark named 'nation' with an Integer column 'n_nationkey'; this is how I did it:
String query = "select n_nationkey from nation" ;
Dataset<Row> df = spark.sql(query);
JavaRDD<Integer> jdf = df.toJavaRDD().map(row -> row.getInt(0));
JavaDoubleRDD example = jdf.mapToDouble(y -> y);
Tuple2<double[], long[]> resultsnew = example.histogram(5);
In case the column has a double type, you simply replace the mapping:
JavaRDD<Double> jdf = df.toJavaRDD().map(row -> row.getDouble(0));
JavaDoubleRDD example = jdf.mapToDouble(y -> y);
Tuple2<double[], long[]> resultsnew = example.histogram(5);
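If you then want the histogram back as a DataFrame rather than a Tuple2, one possible sketch (the column names bucket_start, bucket_end and count are assumptions):
// Sketch: turn the (bucket boundaries, counts) pair into a small DataFrame.
double[] buckets = resultsnew._1();   // bucketCount + 1 boundaries
long[] counts = resultsnew._2();      // bucketCount counts
List<Row> histRows = new ArrayList<>();
for (int i = 0; i < counts.length; i++) {
    histRows.add(RowFactory.create(buckets[i], buckets[i + 1], counts[i]));
}
StructType histSchema = new StructType(new StructField[] {
    new StructField("bucket_start", DataTypes.DoubleType, false, Metadata.empty()),
    new StructField("bucket_end", DataTypes.DoubleType, false, Metadata.empty()),
    new StructField("count", DataTypes.LongType, false, Metadata.empty())
});
Dataset<Row> histogramDF = spark.createDataFrame(histRows, histSchema);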

Spark error when convert JavaRDD to DataFrame: java.util.Arrays$ArrayList is not a valid external type for schema of array<string>

I am using Spark 2.1.0. The following code reads a text file, converts the content to a DataFrame, and then feeds it into a Word2Vec model:
SparkSession spark = SparkSession.builder().appName("word2vector").getOrCreate();
JavaRDD<String> lines = spark.sparkContext().textFile("input.txt", 10).toJavaRDD();
JavaRDD<List<String>> lists = lines.map(new Function<String, List<String>>() {
    public List<String> call(String line) {
        List<String> list = Arrays.asList(line.split(" "));
        return list;
    }
});
JavaRDD<Row> rows = lists.map(new Function<List<String>, Row>() {
    public Row call(List<String> list) {
        return RowFactory.create(list);
    }
});
StructType schema = new StructType(new StructField[] {
    new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});
Dataset<Row> input = spark.createDataFrame(rows, schema);
input.show(3);
Word2Vec word2Vec = new Word2Vec().setInputCol("text").setOutputCol("result").setVectorSize(100).setMinCount(0);
Word2VecModel model = word2Vec.fit(input);
Dataset<Row> result = model.transform(input);
It throws an exception:
java.lang.RuntimeException: Error while encoding: java.util.Arrays$ArrayList is not a valid external type for schema of array<string>
which happens at the line input.show(3), so createDataFrame() is causing the exception because Arrays.asList() returns an Arrays$ArrayList, which is not supported here. However, the official Spark documentation has the following code:
List<Row> data = Arrays.asList(
RowFactory.create(Arrays.asList("Hi I heard about Spark".split(" "))),
RowFactory.create(Arrays.asList("I wish Java could use case classes".split(" "))),
RowFactory.create(Arrays.asList("Logistic regression models are neat".split(" ")))
);
StructType schema = new StructType(new StructField[]{
new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});
Dataset<Row> documentDF = spark.createDataFrame(data, schema);
which works just fine. If Arrays$ArrayList is not supported, how come this code works? The difference is that I am converting a JavaRDD<Row> to a DataFrame, while the official documentation converts a List<Row> to a DataFrame. I believe the Spark Java API has an overloaded createDataFrame() method which takes a JavaRDD<Row> and converts it to a DataFrame based on the provided schema. I am confused about why it is not working. Can anyone help?
I encountered the same issue several days ago, and the only way I found to solve it is to use an array of arrays. Why? Here is the explanation:
An ArrayType is a wrapper for Scala arrays, which correspond one-to-one to Java arrays. A Java ArrayList is not mapped by default to a Scala array, which is why you get the exception:
java.util.Arrays$ArrayList is not a valid external type for schema of array<string>
Hence, passing a String[] directly should have worked:
RowFactory.create(line.split(" "))
But since create takes a varargs Object list (because a row may have a list of columns), the String[] gets interpreted as a list of String columns. That's why a two-dimensional String array is required:
RowFactory.create(new String[][] {line.split(" ")})
That still leaves the mystery of the Spark documentation constructing a DataFrame from a Java List of rows. It works because the SparkSession.createDataFrame overload that takes a java.util.List of rows as its first parameter performs special type checks and conversions, turning any Java Iterable (including ArrayList) into a Scala array.
The SparkSession.createDataFrame overload that takes a JavaRDD, however, maps the row content directly to the DataFrame.
To wrap up, this is the correct version:
SparkSession spark = SparkSession.builder().master("local[*]").appName("Word2Vec").getOrCreate();
SparkContext sc = spark.sparkContext();
sc.setLogLevel("WARN");
JavaRDD<String> lines = sc.textFile("input.txt", 10).toJavaRDD();
JavaRDD<Row> rows = lines.map(new Function<String, Row>() {
    public Row call(String line) {
        return RowFactory.create(new String[][] {line.split(" ")});
    }
});
StructType schema = new StructType(new StructField[] {
    new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});
Dataset<Row> input = spark.createDataFrame(rows, schema);
input.show(3);
Hope this solves your problem.
It's exactly as the error says: ArrayList is not the equivalent of a Scala array. You should use a plain array (i.e. String[]) instead.
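Since RowFactory.create is a varargs method, a bare String[] would be spread into one column per element; a minimal sketch of one way to keep the whole array in a single array<string> column (not from the original answer) is to cast it to Object:
// Sketch: the (Object) cast makes the varargs call see one value (the whole array)
// instead of spreading the String[] into separate columns.
JavaRDD<Row> rows = lines.map(line -> RowFactory.create((Object) line.split(" ")));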
For me, the following works fine:
JavaRDD<Row> rowRdd = rdd.map(r -> RowFactory.create(r.split(",")));

Reading parquet file in Spark from S3

I am reading data from S3 in the parquet format, and then I process this data as a DataFrame.
The question is: how can I efficiently iterate over the rows in the DataFrame? I know that the collect method loads data into memory, so, although my DataFrame is not big, I would prefer to avoid loading the complete data set into memory. How can I optimize the given code?
Also, I am using indices to access columns in the DataFrame. Can I access them by column name instead (I know the names)?
DataFrame parquetFile = sqlContext.read().parquet("s3n://"+this.aws_bucket+"/"+this.aws_key_members);
parquetFile.registerTempTable("mydata");
DataFrame eventsRaw = sqlContext.sql("SELECT * FROM mydata");
Row[] rddRows = eventsRaw.collect();
for (int rowIdx = 0; rowIdx < rddRows.length; ++rowIdx)
{
    Map<String, String> props = new HashMap<>();
    props.put("field1", rddRows[rowIdx].get(0).toString());
    props.put("field2", rddRows[rowIdx].get(1).toString());
    // further processing
}
You can use the map function in Spark to iterate over the whole DataFrame without collecting the Dataset/DataFrame:
Dataset<Row> namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19");
Dataset<String> namesDS = namesDF.map((MapFunction<Row, String>) row -> "Name:" + row.getString(0), Encoders.STRING());
namesDS.show();
You can define a separate map function if the operations you are doing are complex:
// Map function (COLUMN is the name of the column to read)
Row doSomething(Row row) {
    // get a column by name
    String field = row.getAs(COLUMN);
    // construct a new row from the existing/modified column values and return it
    return RowFactory.create(field);
}
This map function can then be passed to the DataFrame's map:
StructType structType = dataset.schema();
namesDF.map((MapFunction<Row, Row>) row -> doSomething(row),
        RowEncoder.apply(structType));
Source: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
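If you want to avoid collect() entirely and also access columns by name, one option in Spark 2.x is toLocalIterator(), which brings one partition at a time to the driver. This is only a sketch: the column names field1 and field2 are taken from the props keys in the question and may not match the real ones.
// Sketch: iterate without materializing the whole result on the driver,
// accessing columns by name instead of by index.
Dataset<Row> eventsRaw = spark.sql("SELECT * FROM mydata");
Iterator<Row> it = eventsRaw.toLocalIterator();
while (it.hasNext()) {
    Row row = it.next();
    Map<String, String> props = new HashMap<>();
    props.put("field1", row.getAs("field1").toString());
    props.put("field2", row.getAs("field2").toString());
    // further processing
}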

How to add more RDD to existing RDD in Spark?

I have an RDD and want to add more RDDs to it. How can I do this in Spark?
I have code like the one below. I want to return an RDD from the dStream I have.
JavaDStream<Object> newDStream = dStream.map(this);
JavaRDD<Object> rdd = context.sparkContext().emptyRDD();
return newDStream.wrapRDD(context.sparkContext().emptyRDD());
I cannot find much documentation about the wrapRDD method of the JavaDStream class provided by Apache Spark.
Since an RDD is immutable, what you can do is use sparkContext.parallelize to create a new RDD and return the union of the two.
List<Object> objectList = new ArrayList<>();
objectList.add("your content");
JavaRDD<Object> objectRDD = sparkContext.parallelize(objectList);
JavaRDD<Object> newRDD = oldRDD.union(objectRDD);
See https://spark.apache.org/docs/latest/rdd-programming-guide.html#parallelized-collections
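If you need to combine more than two RDDs, union calls can simply be chained; a minimal sketch, where anotherRDD is a hypothetical third RDD:
// Sketch: union is just another transformation, so calls can be chained.
JavaRDD<Object> combined = oldRDD.union(objectRDD).union(anotherRDD); // anotherRDD is hypothetical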
You can use JavaStreamingContext.queueStream and fill it with a Queue<JavaRDD<YourType>>:
public JavaInputDStream<Object> FillDStream() {
    Queue<JavaRDD<Object>> rdds = new LinkedList<JavaRDD<Object>>();
    rdds.add(context.sparkContext().emptyRDD());
    rdds.add(context.sparkContext().emptyRDD());
    JavaInputDStream<Object> filledDStream = context.queueStream(rdds);
    return filledDStream;
}
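The returned stream can then be consumed like any other DStream; a minimal sketch of how the batches might be processed:
// Sketch: process each RDD that arrives through the queue-backed stream.
JavaInputDStream<Object> stream = FillDStream();
stream.foreachRDD(rdd -> System.out.println("batch size: " + rdd.count()));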
