Load csv data with partition in spark 2.0


In Spark 2.0, I have the following method, which loads the data into a Dataset:
public Dataset<Employee> GetDataFrameFromTextFile()
{ // The schema is encoded in a string
String schemaString = "id firstname lastname accountNo";
// Generate the schema based on the string of schema
List<StructField> fields = new ArrayList<>();
// Split on the separator actually used in schemaString (a single space)
for (String fieldName : schemaString.split(" ")) {
StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
fields.add(field);
}
StructType schema = DataTypes.createStructType(fields);
return sparksession.read().schema(schema)
.option("mode", "DROPMALFORMED")
.option("sep", "|")
.option("ignoreLeadingWhiteSpace", true)
.option("ignoreTrailingWhiteSpace", true)
.csv("D:\\HadoopDirectory\\Employee.txt").as(Encoders.bean(Employee.class));
}
In my driver code, a map operation is called on the dataset:
Dataset<Employee> rowDataset = ad.GetDataFrameFromTextFile();
Dataset<String> map = rowDataset.map(new MapFunction<Employee, String>() {
@Override
public String call(Employee emp) throws Exception {
return TraverseRuleByADRow(emp);
}
},Encoders.STRING());
When I run the driver program in Spark local mode with 8 cores on my laptop, I see that the input file is split into 8 partitions. Is there a way to load the file into more than 8 partitions, say 100 or 1000?
I know this is achievable if the source data comes from a SQL Server table via JDBC:
sparksession.read().format("jdbc").option("url", urlCandi).option("dbtable", tableName).option("partitionColumn", partitionColumn).option("lowerBound", String.valueOf(lowerBound))
.option("upperBound", String.valueOf(upperBound))
.option("numPartitions", String.valueOf(numberOfPartitions))
.load().as(Encoders.bean(Employee.class));
Thanks

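Use the repartition() method of Dataset, for example rowDataset.repartition(100). According to the Scaladoc, there is no option to set the number of partitions while reading, so the data has to be repartitioned after it is loaded.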

Related

Add a column with constant value into Spark Dataset in Java?

I'm working on report generation with Spark, and I need to add a column with a constant value to a Dataset that is created with Dataset.select() and then written to a CSV file:
private static void buildReport(FileSystem fileSystem, Dataset<Row> joinedDs, String reportName) throws IOException {
Path report = new Path(reportName);
joinedDs.filter(aFlter)
.select(
joinedDs.col("AGREEMENT_ID"),
//... here I need to insert a column with constant value
joinedDs.col("ERROR_MESSAGE")
)
.write()
.format("csv")
.option("header", true)
.option("sep", ",")
.csv(reportName);
fileSystem.copyToLocalFile(report, new Path(reportName + ".csv"));
}
I don't want to insert the column manually into the generated CSV file; I'd like the column to be there at file creation time.

You can add it with the lit function (a static method of org.apache.spark.sql.functions) during select:
.select(
joinedDs.col("AGREEMENT_ID"),
lit("YOUR_CONSTANT_VALUE").as("YOUR_COL_NAME"),
joinedDs.col("ERROR_MESSAGE")
)

Convert Spark Dataset to JSON and Write to Kafka Producer

I want to read a table from Hive and write it to Kafka with a producer (batch job).
Currently, I am reading the table as a Dataset<Row> in my Java class and trying to convert each row to JSON so that I can send it as a JSON message using KafkaProducer.
Dataset<Row> data = spark.sql("select * from tablename limit 5");
List<Row> rows = data.collectAsList();
for(Row row: rows) {
List<String> stringList = new ArrayList<String>(Arrays.asList(row.schema().fieldNames()));
Seq<String> row_seq = JavaConverters.asScalaIteratorConverter(stringList.iterator()).asScala().toSeq();
Map map = (Map) row.getValuesMap(row_seq);
JSONObject json = new JSONObject();
json.putAll( map);
ProducerRecord<String, String> record = new ProducerRecord<String, String>(SPARK_CONF.get("topic.name"), json.toString());
producer.send(record);
}
I am getting a ClassCastException.

As soon as you write collectAsList(), you are no longer using Spark, just the raw Kafka Java API.
My suggestion would be to use the Spark Structured Streaming Kafka integration, which also supports batch writes.
Here is an example. You need to form a DataFrame with a value column and, optionally, a key column, because Kafka records consist of keys and values.
// Write key-value data from a DataFrame to a specific Kafka topic specified in an option
data.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.write()
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "topic_name")
.save();
As for getting the data into JSON, again, collectAsList() is the wrong tool. Do not pull the data onto a single node.
You can use data.map() to convert a Dataset from one format into another.
For example, you would map a Row into a String that is in JSON format:
row -> "{\"f0\":" + row.get(0) + "}"
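Note that a one-liner of this kind only yields valid JSON when the value is numeric, because string values are emitted without quotes. A minimal sketch of a safer per-row conversion in plain Java follows. The class name RowJson and the method toJson are hypothetical; inside the map function the field names and values would presumably come from row.schema().fieldNames() and row.get(i). Spark's own Dataset.toJSON() may make a hand-written conversion unnecessary.

```java
import java.util.StringJoiner;

public class RowJson {

    // Escapes the characters that must not appear raw inside a JSON string.
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters become \ u00XX escapes.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Builds one JSON object from parallel arrays of field names and values.
    // Numbers and booleans are written bare, null as null,
    // and everything else as a quoted, escaped string.
    public static String toJson(String[] names, Object[] values) {
        StringJoiner json = new StringJoiner(",", "{", "}");
        for (int i = 0; i < names.length; i++) {
            Object v = values[i];
            String rendered;
            if (v == null) {
                rendered = "null";
            } else if (v instanceof Number || v instanceof Boolean) {
                rendered = v.toString();
            } else {
                rendered = "\"" + escape(v.toString()) + "\"";
            }
            json.add("\"" + escape(names[i]) + "\":" + rendered);
        }
        return json.toString();
    }

    public static void main(String[] args) {
        // Illustrative values only; they are not taken from the question's table.
        System.out.println(toJson(new String[] {"id", "name"}, new Object[] {7, "Ann"}));
    }
}
```

The sketch does not special-case NaN or infinite doubles, which have no JSON representation, so real data may need an extra check.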

How to access Java Spark Broadcast variable?

I am trying to broadcast a Dataset in Spark in order to access it from within a map function. The first print statement returns the first line of the broadcast dataset as expected. Unfortunately, the second print statement does not return a result. The execution simply hangs at this point.
Any idea what I'm doing wrong?
Broadcast<JavaRDD<Row>> broadcastedTrainingData = this.javaSparkContext.broadcast(trainingData.toJavaRDD());
System.out.println("Data:" + broadcastedTrainingData.value().first());
JavaRDD<Row> rowRDD = this.javaSparkContext.parallelize(stringAsList).map((Integer row) -> {
System.out.println("Data (map):" + broadcastedTrainingData.value().first());
return RowFactory.create(row);
});
The following pseudocode highlights what I want to achieve. My main goal is to broadcast the training dataset so that I can use it from within a map function.
public Dataset<Row> getWSSE(Dataset<Row> trainingData, int clusterRange) {
StructType structType = new StructType();
structType = structType.add("ClusterAm", DataTypes.IntegerType, false);
structType = structType.add("Cost", DataTypes.DoubleType, false);
List<Integer> stringAsList = new ArrayList<>();
for (int clusterAm = 2; clusterAm < clusterRange + 2; clusterAm++) {
stringAsList.add(clusterAm);
}
Broadcast<Dataset> broadcastedTrainingData = this.javaSparkContext.broadcast(trainingData);
System.out.println("Data:" + broadcastedTrainingData.value().first());
JavaRDD<Row> rowRDD = this.javaSparkContext.parallelize(stringAsList).map((Integer row) -> RowFactory.create(row));
StructType schema = DataTypes.createStructType(new StructField[]{DataTypes.createStructField("ClusterAm", DataTypes.IntegerType, false)});
Dataset wsse = sqlContext.createDataFrame(rowRDD, schema).toDF();
wsse.show();
ExpressionEncoder<Row> encoder = RowEncoder.apply(structType);
Dataset result = wsse.map(
(MapFunction<Row, Row>) row -> RowFactory.create(row.getAs("ClusterAm"), new KMeans().setK(row.getAs("ClusterAm")).setSeed(1L).fit(broadcastedTrainingData.value()).computeCost(broadcastedTrainingData.value())),
encoder);
result.show();
broadcastedTrainingData.destroy();
return wsse;
}

Dataset<Row> trainingData = ...<Your dataset>;
// Create the broadcast variable. There is no need to write ClassTag code by hand;
// akka.japi.Util is available for that.
Broadcast<Dataset<Row>> broadcastedTrainingData = spark.sparkContext()
.broadcast(trainingData, akka.japi.Util.classTag(Dataset.class));
// Here is the catch: when you iterate over a Dataset,
// Spark actually runs the function in distributed mode. If you try to access
// your object directly (e.g. trainingData), it would be null,
// because you did not explicitly ask Spark to send that outside variable to
// each machine where the foreach runs in parallel.
// So you need to use a broadcast variable (the most common use of Broadcast).
someSparkDataSet.foreach((row) -> {
Dataset<Row> receivedBroadcast = broadcastedTrainingData.value();
...
...
});

Spark error when convert JavaRDD to DataFrame: java.util.Arrays$ArrayList is not a valid external type for schema of array<string>

I am using Spark 2.1.0. The following code reads a text file, converts the content to a DataFrame, and then feeds it into a Word2Vec model:
SparkSession spark = SparkSession.builder().appName("word2vector").getOrCreate();
JavaRDD<String> lines = spark.sparkContext().textFile("input.txt", 10).toJavaRDD();
JavaRDD<List<String>> lists = lines.map(new Function<String, List<String>>(){
public List<String> call(String line){
List<String> list = Arrays.asList(line.split(" "));
return list;
}
});
JavaRDD<Row> rows = lists.map(new Function<List<String>, Row>() {
public Row call(List<String> list) {
return RowFactory.create(list);
}
});
StructType schema = new StructType(new StructField[] {
new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});
Dataset<Row> input = spark.createDataFrame(rows, schema);
input.show(3);
Word2Vec word2Vec = new Word2Vec().setInputCol("text").setOutputCol("result").setVectorSize(100).setMinCount(0);
Word2VecModel model = word2Vec.fit(input);
Dataset<Row> result = model.transform(input);
It throws an exception:
java.lang.RuntimeException: Error while encoding: java.util.Arrays$ArrayList is not a valid external type for schema of array<string>
which happens at the line input.show(3), so createDataFrame() is causing the exception, because Arrays.asList() returns an Arrays$ArrayList, which is not supported here. However, the official Spark documentation has the following code:
List<Row> data = Arrays.asList(
RowFactory.create(Arrays.asList("Hi I heard about Spark".split(" "))),
RowFactory.create(Arrays.asList("I wish Java could use case classes".split(" "))),
RowFactory.create(Arrays.asList("Logistic regression models are neat".split(" ")))
);
StructType schema = new StructType(new StructField[]{
new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});
Dataset<Row> documentDF = spark.createDataFrame(data, schema);
which works just fine. If Arrays$ArrayList is not supported, how come this code works? The difference is that I am converting a JavaRDD<Row> to a DataFrame, whereas the official documentation converts a List<Row>. I believe the Spark Java API has an overloaded createDataFrame() method that takes a JavaRDD<Row> and converts it to a DataFrame based on the provided schema. I am confused about why it does not work. Can anyone help?

I encountered the same issue several days ago, and the only way I found to solve it is to use an array of arrays. Why? Here is the explanation:
An ArrayType is a wrapper for Scala Arrays, which correspond one-to-one to Java arrays. A Java ArrayList is not mapped to a Scala Array by default, which is why you get the exception:
java.util.Arrays$ArrayList is not a valid external type for schema of array<string>
Hence, passing a String[] directly should have worked:
RowFactory.create(line.split(" "))
But create takes a variable-length list of Objects as input, because a row may have several columns, so the String[] gets interpreted as a list of String columns. That is why a two-dimensional String array is required:
RowFactory.create(new String[][] {line.split(" ")})
That still leaves the mystery of why the Spark documentation can construct a DataFrame from a Java List of rows. The reason is that the SparkSession.createDataFrame overload that takes a java.util.List of rows as its first parameter performs special type checks and conversions, so it converts every Java Iterable (and therefore ArrayList) to a Scala Array.
The SparkSession.createDataFrame overload that takes a JavaRDD, however, maps the row content directly to the DataFrame.
To wrap up, this is the correct version:
SparkSession spark = SparkSession.builder().master("local[*]").appName("Word2Vec").getOrCreate();
SparkContext sc = spark.sparkContext();
sc.setLogLevel("WARN");
JavaRDD<String> lines = sc.textFile("input.txt", 10).toJavaRDD();
JavaRDD<Row> rows = lines.map(new Function<String, Row>(){
public Row call(String line){
return RowFactory.create(new String[][] {line.split(" ")});
}
});
StructType schema = new StructType(new StructField[] {
new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});
Dataset<Row> input = spark.createDataFrame(rows, schema);
input.show(3);
Hope this solves your problem.
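The varargs behaviour this answer relies on can be reproduced without Spark. The sketch below uses a hypothetical stand-in method, columnCount(Object... values), which has the same parameter shape as RowFactory.create but only reports how many values it received. It illustrates Java's argument passing, not Spark's encoder.

```java
import java.util.Arrays;
import java.util.List;

public class VarargsDemo {

    // Hypothetical stand-in with the same parameter shape as
    // RowFactory.create(Object... values): returns the number of "columns".
    public static int columnCount(Object... values) {
        return values.length;
    }

    public static void main(String[] args) {
        String[] words = "Hi I heard about Spark".split(" ");

        // A String[] is itself an Object[], so every word becomes its own column.
        System.out.println(columnCount(words));                  // 5

        // Wrapping the array makes it the single element of the varargs array.
        System.out.println(columnCount(new String[][] {words})); // 1

        // Casting to Object has the same effect in plain Java.
        System.out.println(columnCount((Object) words));         // 1

        // A List is one ordinary object, hence one column, but its runtime
        // class is the one named in the exception message.
        List<String> list = Arrays.asList(words);
        System.out.println(columnCount(list));                   // 1
        System.out.println(list.getClass().getName());           // java.util.Arrays$ArrayList
    }
}
```

This is why RowFactory.create(line.split(" ")) yields one column per word, while the String[][] wrapper yields a single array-valued column.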

It's exactly as the error says. ArrayList is not the equivalent of Scala's array. You should use a normal array (i.e. String[]) instead.

For me, the following works fine:
JavaRDD<Row> rowRdd = rdd.map(r -> RowFactory.create(r.split(",")));

Convert Spark DataFrame to Pojo Object

Please see below code:
//Create Spark Context
SparkConf sparkConf = new SparkConf().setAppName("TestWithObjects").setMaster("local");
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
//Creating RDD
JavaRDD<Person> personsRDD = javaSparkContext.parallelize(persons);
//Creating SQL context
SQLContext sQLContext = new SQLContext(javaSparkContext);
DataFrame personDataFrame = sQLContext.createDataFrame(personsRDD, Person.class);
personDataFrame.show();
personDataFrame.printSchema();
personDataFrame.select("name").show();
personDataFrame.registerTempTable("peoples");
DataFrame result = sQLContext.sql("SELECT * FROM peoples WHERE name='test'");
result.show();
After this, I need to convert the DataFrame 'result' to a Person object or a List of Person objects. Thanks in advance.

DataFrame is simply a type alias of Dataset[Row]. Operations on it are also referred to as “untyped transformations”, in contrast to the “typed transformations” that come with strongly typed Scala/Java Datasets.
The conversion from Dataset[Row] to Dataset[Person] is very simple in Spark:
DataFrame result = sQLContext.sql("SELECT * FROM peoples WHERE name='test'");
At this point, Spark holds your data as a DataFrame = Dataset[Row], a collection of generic Row objects, since it does not know the exact type.
// Create an Encoder for Java beans
Encoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Person> personDF = result.as(personEncoder);
personDF.show();
Now Spark converts the Dataset[Row] into a Dataset[Person] of type-specific Scala/Java JVM objects, as dictated by the class Person.
Please refer to the link below, provided by Databricks, for further details:
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
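Encoders.bean(Person.class) relies on JavaBean conventions: a public class with a public no-argument constructor and a getter/setter pair for each property. The question does not show its Person class, so the following is only a hypothetical sketch of the shape the encoder expects. The age property is an assumption added for illustration.

```java
import java.io.Serializable;

public class Person implements Serializable {
    private static final long serialVersionUID = 1L;

    private String name;
    private int age; // hypothetical second property, not taken from the question

    // A public no-argument constructor lets the bean be instantiated reflectively.
    public Person() {
    }

    public Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    // Each property needs a matching getter and setter.
    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public int getAge() {
        return age;
    }

    public void setAge(int age) {
        this.age = age;
    }

    public static void main(String[] args) {
        Person p = new Person();
        p.setName("test");
        p.setAge(30);
        System.out.println(p.getName() + " " + p.getAge());
    }
}
```

With a class of this shape, calling collectAsList() on the typed Dataset should return the rows as a List<Person> on the driver, which is the form the question asks for.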

A DataFrame is stored as Rows, so you can use the methods there to cast from untyped to typed. Take a look at the get methods.

If someone is looking to convert a JSON string column in a Dataset<Row> to a Dataset<PojoClass>:
Sample POJO: Testing
@Data
public class Testing implements Serializable {
private String name;
private String dept;
}
In the above code, @Data is a Lombok annotation that generates getters and setters for the Testing class.
Actual conversion logic in Spark:
@Test
void shouldConvertJsonStringToPojo() {
var sparkSession = SparkSession.builder().getOrCreate();
var structType = new StructType(new StructField[] {
new StructField("employee", DataTypes.StringType, false, Metadata.empty()),
});
var ds = sparkSession.createDataFrame(new ArrayList<>(
Arrays.asList(RowFactory.create(new Object[]{"{ \"name\": \"test\", \"dept\": \"IT\"}"}))), structType);
var objectMapper = new ObjectMapper();
var bean = Encoders.bean(Testing.class);
var testingDataset = ds.map((MapFunction<Row, Testing>) row -> {
var dept = row.<String>getAs("employee");
return objectMapper.readValue(dept, Testing.class);
}, bean);
assertEquals("test", testingDataset.head().getName());
}
