I have data that looks like this
+-------+---------+-------+---------+
|dataOne|OtherData|dataTwo|dataThree|
+-------+---------+-------+---------+
|   Best|     tree|      5|      533|
|     OK|     bush|      e|     3535|
|    MEH|      cow|      -|     3353|
|    MEH|      oak|   none|       12|
+-------+---------+-------+---------+
and I'm trying to get it into the output of
+-------+-----+
|dataOne|Count|
+-------+-----+
|   Best|    1|
|     OK|    1|
|    MEH|    2|
+-------+-----+
I have no problem getting the dataOne column into a dataframe by itself and showing its contents to make sure I'm only grabbing that column.
However, I can't find the correct syntax for turning a SQL query into the data I need. I tried creating the following dataframe from the temp view created over the entire data set:
Dataset<Row> dataOneCount = spark.sql("select dataOne, count(*) from dataFrame group by dataOne");
dataOneCount.show();
But Spark throws an exception instead. The documentation I was able to find on this only showed how to do this type of aggregation in Spark 1.6 and earlier, so any help would be appreciated.
Here's the error message I get; I've checked the data and there is no indexing error in it:
java.lang.ArrayIndexOutOfBoundsException: 11
I've also tried applying the countDistinct method from functions:
Column countNum = countDistinct(dataFrame.col("dataOne"));
Dataset<Row> result = dataOneDataFrame.withColumn("count",countNum);
result.show();
where dataOneDataFrame is a dataframe created by running
select dataOne from dataFrame
But it returns an analysis exception. I'm still new to Spark, so I'm not sure whether there's an error in how or when I'm evaluating the countDistinct method.
edit: To clarify, the first table shown is the result of the dataframe I've created by reading the text file and applying a custom schema to it (all columns are still strings):
Dataset<Row> dataFrame
Here is my full code
public static void main(String[] args) {
    SparkSession spark = SparkSession
            .builder()
            .appName("Log File Reader")
            .getOrCreate();

    // args[0] is the text file location
    JavaRDD<String> logsRDD = spark.sparkContext()
            .textFile(args[0], 1)
            .toJavaRDD();

    String schemaString = "dataOne OtherData dataTwo dataThree";

    List<StructField> fields = new ArrayList<>();
    String[] fieldName = schemaString.split(" ");
    for (String field : fieldName) {
        fields.add(DataTypes.createStructField(field, DataTypes.StringType, true));
    }
    StructType schema = DataTypes.createStructType(fields);

    JavaRDD<Row> rowRDD = logsRDD.map((Function<String, Row>) record -> {
        String[] attributes = record.split(" ");
        return RowFactory.create(attributes[0], attributes[1], attributes[2], attributes[3]);
    });

    Dataset<Row> dF = spark.createDataFrame(rowRDD, schema);

    // first attempt
    dF.groupBy(col("dataOne")).count().show();

    // trying with a sql statement
    dF.createOrReplaceTempView("view");
    dF.sparkSession().sql("select dataOne, count(*) from view group by dataOne").show();
}
The most likely culprit that comes to mind is the lambda function that builds each Row with RowFactory. The idea seems sound, but I'm not sure how it holds up or whether there's another way I could do it. Other than that I'm quite puzzled.
sample data
best tree 5 533
OK bush e 3535
MEH cow - 3353
MEH oak none 12
Using Scala syntax for convenience. It's very similar to the Java syntax:
// Input data
val df = {
  import org.apache.spark.sql._
  import org.apache.spark.sql.types._
  import scala.collection.JavaConverters._

  val simpleSchema = StructType(
    StructField("dataOne", StringType) ::
    StructField("OtherData", StringType) ::
    StructField("dataTwo", StringType) ::
    StructField("dataThree", IntegerType) :: Nil)

  val data = List(
    Row("Best", "tree", "5", 533),
    Row("OK", "bush", "e", 3535),
    Row("MEH", "cow", "-", 3353),
    Row("MEH", "oak", "none", 12)
  )

  spark.createDataFrame(data.asJava, simpleSchema)
}
df.show
+-------+---------+-------+---------+
|dataOne|OtherData|dataTwo|dataThree|
+-------+---------+-------+---------+
| Best| tree| 5| 533|
| OK| bush| e| 3535|
| MEH| cow| -| 3353|
| MEH| oak| none| 12|
+-------+---------+-------+---------+
df.groupBy(col("dataOne")).count().show()
+-------+-----+
|dataOne|count|
+-------+-----+
| MEH| 2|
| Best| 1|
| OK| 1|
+-------+-----+
I can submit the Java code given above as follows, with the four-row data file on S3, and it works fine:
$SPARK_HOME/bin/spark-submit \
--class sparktest.FromStackOverflow \
--packages "org.apache.hadoop:hadoop-aws:2.7.3" \
target/scala-2.11/sparktest_2.11-1.0.0-SNAPSHOT.jar "s3a://my-bucket-name/sample.txt"
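One note on the countDistinct attempt from the question (this is my addition, not part of the answer above): countDistinct is an aggregate function, so in the Java API it normally goes inside agg() rather than withColumn(). A minimal sketch, assuming the Dataset<Row> dF built in the question:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.countDistinct;

// countDistinct counts how many distinct dataOne values exist across the whole dataframe
dF.agg(countDistinct(col("dataOne")).as("distinct_dataOne")).show();

// For the per-value counts in the desired output, groupBy(...).count() as shown above is the right tool
dF.groupBy(col("dataOne")).count().show();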
Related
New to Spark (2.4.x) and using the Java API (not Scala!!!)
I have a Dataset that I've read in from a CSV file. It has a schema (named columns) like so:
id (integer) | name (string) | color (string) | price (double) | enabled (boolean)
An example row:
23 | "hotmeatballsoup" | "blue" | 3.95 | true
There are many (tens of thousands) rows in the dataset. I would like to write an expression using the proper Java/Spark API, that scrolls through each row and applies the following two operations on each row:
If the price is null, default it to 0.00; and then
If the color column value is "red", add 2.55 to the price
Since I'm so new to Spark I'm not even sure where to begin! My best attempt thus far is definitely wrong, but it's at least a starting point, I guess:
Dataset csvData = sparkSession.read()
.format("csv")
.load(fileToLoad.getAbsolutePath());
// ??? get rows somehow
Seq<Seq<String>> csvRows = csvData.getRows(???, ???);
// now how to loop through rows???
for (Seq<String> row : csvRows) {
// how apply two operations specified above???
if (row["price"] == null) {
row["price"] = 0.00;
}
if (row["color"].equals("red")) {
row["price"] = row["price"] + 2.55;
}
}
Can someone help nudge me in the right direction here?
You could use the Spark SQL API to achieve this. Null values could also be replaced using .fill() from DataFrameNaFunctions. Alternatively, you could convert the DataFrame to a typed Dataset and do these steps in .map, but the SQL API is simpler and more efficient in this case.
+---+---------------+-----+-----+-------+
| id| name|color|price|enabled|
+---+---------------+-----+-----+-------+
| 23|hotmeatballsoup| blue| 3.95| true|
| 24| abc| red| 1.0| true|
| 24| abc| red| null| true|
+---+---------------+-----+-----+-------+
Import the SQL functions statically before the class declaration:
import static org.apache.spark.sql.functions.*;
SQL API:
df.select(
col("id"), col("name"), col("color"),
when(col("color").equalTo("red").and(col("price").isNotNull()), col("price").plus(2.55))
.when(col("color").equalTo("red").and(col("price").isNull()), 2.55)
.otherwise(col("price")).as("price")
,col("enabled")
).show();
or using temp view and sql query:
df.createOrReplaceTempView("df");
spark.sql("select id,name,color, case when color = 'red' and price is not null then (price + 2.55) when color = 'red' and price is null then 2.55 else price end as price, enabled from df").show();
output:
+---+---------------+-----+-----+-------+
| id| name|color|price|enabled|
+---+---------------+-----+-----+-------+
| 23|hotmeatballsoup| blue| 3.95| true|
| 24| abc| red| 3.55| true|
| 24| abc| red| 2.55| true|
+---+---------------+-----+-----+-------+
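As mentioned at the top of this answer, another option is to default the null prices first with DataFrameNaFunctions.fill() and then only handle the red adjustment. A minimal sketch, assuming the same Dataset<Row> df (this combination is my illustration, not code from the original answer):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.when;

// Step 1: replace null prices with 0.00
Dataset<Row> withDefaults = df.na().fill(0.00, new String[]{"price"});

// Step 2: add 2.55 to the price of red rows, leave everything else untouched
Dataset<Row> adjusted = withDefaults.withColumn("price",
        when(col("color").equalTo("red"), col("price").plus(2.55))
                .otherwise(col("price")));

adjusted.show();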
I am using spark-sql 2.3.1 with Java 8.
I have a dataframe like the one below:
val df_data = Seq(
("G1","I1","col1_r1", "col2_r1","col3_r1"),
("G1","I2","col1_r2", "col2_r2","col3_r3")
).toDF("group","industry_id","col1","col2","col3")
.withColumn("group", $"group".cast(StringType))
.withColumn("industry_id", $"industry_id".cast(StringType))
.withColumn("col1", $"col1".cast(StringType))
.withColumn("col2", $"col2".cast(StringType))
.withColumn("col3", $"col3".cast(StringType))
+-----+-----------+-------+-------+-------+
|group|industry_id| col1| col2| col3|
+-----+-----------+-------+-------+-------+
| G1| I1|col1_r1|col2_r1|col3_r1|
| G1| I2|col1_r2|col2_r2|col3_r3|
+-----+-----------+-------+-------+-------+
val df_cols = Seq(
("1", "usa", Seq("col1","col2","col3")),
("2", "ind", Seq("col1","col2"))
).toDF("id","name","list_of_colums")
.withColumn("id", $"id".cast(IntegerType))
.withColumn("name", $"name".cast(StringType))
+---+----+------------------+
| id|name| list_of_colums|
+---+----+------------------+
| 1| usa|[col1, col2, col3]|
| 2| ind| [col1, col2]|
+---+----+------------------+
Question:
As shown above, the column information is in the "df_cols" dataframe and all the data is in the "df_data" dataframe.
How can I dynamically select columns from "df_data" based on a given id in "df_cols"?
Initial question:
val columns = df_cols
.where("id = 2")
.select("list_of_colums")
.rdd.map(r => r(0).asInstanceOf[Seq[String]]).collect()(0)
val df_data_result = df_data.select(columns(0), columns.tail: _*)
+-------+-------+
| col1| col2|
+-------+-------+
|col1_r1|col2_r1|
|col1_r2|col2_r2|
+-------+-------+
Updated question:
1) We can simply use two lists: the static columns plus the dynamic ones.
2) I think the "rdd" step is fine in this code; unfortunately, I don't know how to do it with the DataFrame API only.
val staticColumns = Seq[String]("group", "industry_id")
val dynamicColumns = df_cols
.where("id = 2")
.select("list_of_colums")
.rdd.map(r => r(0).asInstanceOf[Seq[String]]).collect()(0)
val columns: Seq[String] = staticColumns ++ dynamicColumns
val df_data_result = df_data.select(columns(0), columns.tail: _*)
+-----+-----------+-------+-------+
|group|industry_id| col1| col2|
+-----+-----------+-------+-------+
| G1| I1|col1_r1|col2_r1|
| G1| I2|col1_r2|col2_r2|
+-----+-----------+-------+-------+
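Since the question mentions Java 8, here is a rough Java sketch of the same selection. The helper name selectForId is mine, and it assumes Dataset<Row> variables dfData and dfCols with the schemas shown above:

import static org.apache.spark.sql.functions.col;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public static Dataset<Row> selectForId(Dataset<Row> dfData, Dataset<Row> dfCols, int id) {
    // Read the array column from the first row that matches the id
    List<String> dynamicColumns = dfCols
            .where(col("id").equalTo(id))
            .select("list_of_colums")
            .first()
            .getList(0);

    // Static columns first, then the dynamic ones taken from df_cols
    List<Column> columns = new ArrayList<>();
    columns.add(col("group"));
    columns.add(col("industry_id"));
    for (String name : dynamicColumns) {
        columns.add(col(name));
    }
    return dfData.select(columns.toArray(new Column[0]));
}

Calling selectForId(dfData, dfCols, 2).show() should then print the group, industry_id, col1 and col2 columns, matching the result above.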
The question below has solutions for Scala and PySpark, but the solution provided there does not produce consecutive index values:
Spark Dataframe :How to add a index Column : Aka Distributed Data Index
I have an existing Dataset in Apache Spark and I want to select some rows from it based on an index. I am planning to add one index column that contains unique values starting from 1, and I will fetch rows based on the values of that column.
I found the method below to add an index, but it uses order by:
df.withColumn("index", functions.row_number().over(Window.orderBy("a column")));
I do not want to use order by. I need the index in the same order the rows appear in the Dataset. Any help?
From what I gather, you are trying to add an index (with consecutive values) to a dataframe. Unfortunately, there is no built-in function in Spark that does that. You can only add an increasing index (but not necessarily with consecutive values) with df.withColumn("index", monotonicallyIncreasingId).
Nonetheless, there exists a zipWithIndex function in the RDD API that does exactly what you need. We can thus define a function that transforms the dataframe into an RDD, adds the index, and transforms it back into a dataframe.
I'm not an expert in Spark in Java (Scala is much more compact), so it might be possible to do better. Here is how I would do it.
public static Dataset<Row> zipWithIndex(Dataset<Row> df, String name) {
    JavaRDD<Row> rdd = df.javaRDD().zipWithIndex().map(t -> {
        Row r = t._1;
        Long index = t._2 + 1;
        ArrayList<Object> list = new ArrayList<>();
        r.toSeq().iterator().foreach(x -> list.add(x));
        list.add(index);
        return RowFactory.create(list);
    });
    StructType newSchema = df.schema()
        .add(new StructField(name, DataTypes.LongType, true, null));
    return df.sparkSession().createDataFrame(rdd, newSchema);
}
And here is how you would use it. Notice what the built-in Spark function produces in contrast with our approach.
Dataset<Row> df = spark.range(5)
.withColumn("index1", functions.monotonicallyIncreasingId());
Dataset<Row> result = zipWithIndex(df, "good_index");
// df
+---+-----------+
| id| index1|
+---+-----------+
| 0| 0|
| 1| 8589934592|
| 2|17179869184|
| 3|25769803776|
| 4|25769803777|
+---+-----------+
// result
+---+-----------+----------+
| id| index1|good_index|
+---+-----------+----------+
| 0| 0| 1|
| 1| 8589934592| 2|
| 2|17179869184| 3|
| 3|25769803776| 4|
| 4|25769803777| 5|
+---+-----------+----------+
The above answer worked for me with some adjustments. Below is a working IntelliJ scratch file. I'm on Spark 2.3.0:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import java.util.ArrayList;
class Scratch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("_LOCAL")
                .master("local")
                .getOrCreate();

        Dataset<Row> df = spark.range(5)
                .withColumn("index1", functions.monotonicallyIncreasingId());
        Dataset<Row> result = zipWithIndex(df, "good_index");
        result.show();
    }

    public static Dataset<Row> zipWithIndex(Dataset<Row> df, String name) {
        JavaRDD<Row> rdd = df.javaRDD().zipWithIndex().map(t -> {
            Row r = t._1;
            Long index = t._2 + 1;
            ArrayList<Object> list = new ArrayList<>();
            scala.collection.Iterator<Object> iterator = r.toSeq().iterator();
            while (iterator.hasNext()) {
                Object value = iterator.next();
                assert value != null;
                list.add(value);
            }
            list.add(index);
            return RowFactory.create(list.toArray());
        });
        StructType newSchema = df.schema()
            .add(new StructField(name, DataTypes.LongType, true, Metadata.empty()));
        return df.sparkSession().createDataFrame(rdd, newSchema);
    }
}
Output:
+---+------+----------+
| id|index1|good_index|
+---+------+----------+
| 0| 0| 1|
| 1| 1| 2|
| 2| 2| 3|
| 3| 3| 4|
| 4| 4| 5|
+---+------+----------+
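To then fetch rows by index, as the question asks, a simple filter on the new column is enough. A short usage sketch, assuming the result dataset from the code above:

import static org.apache.spark.sql.functions.col;

// Fetch the rows whose consecutive index falls in a given range
result.where(col("good_index").between(2, 4)).show();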
I have data coming in for the first column 'code' of a dataframe, as below:
'101-23','23-00-11','NOV-11-23','34-000-1111-1'
and now I want the values of the 'code' column to look like the following after taking the substring:
101,23-00,NOV-11,34-000-1111
The above can be achieved easily in plain Java as below:
String str = "23-00-11";
int index = str.lastIndexOf("-");
String ss = str.substring(0, index);
which gives
'23-00'
How can I do this with a dataframe, writing a UDF or applying it to the dataframe, with Spark 1.6.2 and Java 1.8?
I tried df.withColumn("code", substring("code", 0, 1)) but didn't find a way to get the last index. Please help.
from pyspark.sql.functions import *
newDf = df.withColumn('_c0', regexp_replace('_c0', '#', ''))\
.withColumn('_c1', regexp_replace('_c1', "'", ''))\
.withColumn('_c2', regexp_replace('_c2', '!', ''))
newDf.show()
Updated
import org.apache.spark.sql.functions._
val df11 = Seq("'101-23','23-00-11','NOV-11-23','34-000-1111-1'").toDS()
df11.show()
//df11.select(col("a"), substring_index(col("value"), ",", 1).as("b"))
val df111=df11.withColumn("value", substring(df11("value"), 0, 10))
df111.show()
Result :
+--------------------+
| value|
+--------------------+
|'101-23','23-00-1...|
+--------------------+
+----------+
| value|
+----------+
|'101-23','|
+----------+
import org.apache.spark.sql.functions._
df11: org.apache.spark.sql.Dataset[String] = [value: string]
df111: org.apache.spark.sql.DataFrame = [value: string]
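Neither snippet above actually strips the part after the last '-'. One way to do that (my sketch, not code from the original answer) is a regexp_replace that removes the final '-' and everything that follows it; regexp_replace is available in org.apache.spark.sql.functions in Spark 1.6, so this should work with the Java 1.8 setup from the question, assuming a DataFrame df with a 'code' column:

import static org.apache.spark.sql.functions.regexp_replace;

// "23-00-11" -> "23-00", "34-000-1111-1" -> "34-000-1111"
DataFrame result = df.withColumn("code",
        regexp_replace(df.col("code"), "-[^-]*$", ""));
result.show();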
I create a DataFrame from a Parquet file as follows:
DataFrame parquetFile = sqlContext.read().parquet("test_file.parquet");
parquetFile.printSchema();
parquetFile.registerTempTable("myData");
DataFrame data_df = sqlContext.sql("SELECT * FROM myData");
Now I want to print out all unique values of a column that is called field1.
I know that in Python it would be possible to import pandas as pd, convert data_df to a pandas DataFrame, and then use unique().
But how can I do it in Java?
It's very straightforward: you can use DISTINCT in the SQL query.
DataFrame data_df = sqlContext.sql("SELECT DISTINCT(field1) FROM myData");
Here's an example:
val myData = Seq("h", "h", "d", "b", "d").toDF("field1")
myData.createOrReplaceTempView("myData")
val sqlContext = spark.sqlContext
sqlContext.sql("SELECT DISTINCT(field1) FROM myData").show()
This gives the following output:
+------+
|field1|
+------+
| h|
| d|
| b|
+------+
Hope this helps. Best regards.
You can remove duplicates and get the distinct values with
parquetFile.dropDuplicates("field1")
This gives you only the rows that are distinct by field1. Note that groupBy alone returns a GroupedData object rather than a DataFrame, so it needs an aggregation such as count() before it can be shown:
DataFrame uniqueDF = data_df.groupBy("field1").count();
uniqueDF.show();
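For reference, the most direct DataFrame-API one-liner in Java (my addition, assuming the data_df from the question) would be:

// Keep only field1 and one row per distinct value
data_df.select("field1").distinct().show();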