I'm using the following code:
df = df.select(
    df.col("col").as("col1"),
    df.col("col_").as("col2"));
df = df.select("*").distinct();
df = df.sample(true, 0.8).limit(300);
df = df.withColumn("random", lit(0));
df.show();
I want to select distinct rows, then take a sample and limit it to 300 records; however, df.show() shows duplicate rows all over the place. What am I missing?
Thank you!
Assign the result to a new DataFrame:
val myDupeDF = myDF.select(myDF.col("EmpName"))
myDupeDF.show()
val myDistinctDf = myDF.select(myDF.col("EmpName")).distinct
myDistinctDf.show()
+-------+
|EmpName|
+-------+
| John|
| John|
| John|
+-------+
After distinct
+-------+
|EmpName|
+-------+
| John|
+-------+
Update for all columns
I chose all columns and it still works for me. I am using Spark 1.5.1.
val myDupeDF = myDF.select(myDF.col("*"))
myDupeDF.show()
val myDistinctDf = myDF.select(myDF.col("*")).distinct
myDistinctDf.show()
Result:
+-----+-------+------+----------+
|EmpId|EmpName|Salary|SalaryDate|
+-----+-------+------+----------+
| 1| John|1000.0|2016-01-01|
+-----+-------+------+----------+
Try this:
df = df.select(
    df.col("col").as("col1"),
    df.col("col_").as("col2"));
df = df.distinct();
df = df.sample(true, 0.8).limit(300);
df = df.withColumn("random", lit(0));
df.show();
But I think you need to specify a column name to perform the distinct operation:
df = df.select("COLUMNNAME").distinct();
I am using spark-sql 2.3.1 with Java 8.
I have a data frame like the one below:
import org.apache.spark.sql.types.{IntegerType, StringType}
import spark.implicits._

val df_data = Seq(
  ("G1", "I1", "col1_r1", "col2_r1", "col3_r1"),
  ("G1", "I2", "col1_r2", "col2_r2", "col3_r3")
).toDF("group", "industry_id", "col1", "col2", "col3")
  .withColumn("group", $"group".cast(StringType))
  .withColumn("industry_id", $"industry_id".cast(StringType))
  .withColumn("col1", $"col1".cast(StringType))
  .withColumn("col2", $"col2".cast(StringType))
  .withColumn("col3", $"col3".cast(StringType))
+-----+-----------+-------+-------+-------+
|group|industry_id| col1| col2| col3|
+-----+-----------+-------+-------+-------+
| G1| I1|col1_r1|col2_r1|col3_r1|
| G1| I2|col1_r2|col2_r2|col3_r3|
+-----+-----------+-------+-------+-------+
val df_cols = Seq(
  ("1", "usa", Seq("col1", "col2", "col3")),
  ("2", "ind", Seq("col1", "col2"))
).toDF("id", "name", "list_of_colums")
  .withColumn("id", $"id".cast(IntegerType))
  .withColumn("name", $"name".cast(StringType))
+---+----+------------------+
| id|name| list_of_colums|
+---+----+------------------+
| 1| usa|[col1, col2, col3]|
| 2| ind| [col1, col2]|
+---+----+------------------+
Question:
As shown above, the column information is in the "df_cols" dataframe, and all the data is in the "df_data" dataframe.
How can I dynamically select columns from "df_data" for a given id in "df_cols"?
Initial question:
val columns = df_cols
  .where("id = 2")
  .select("list_of_colums")
  .rdd.map(r => r(0).asInstanceOf[Seq[String]]).collect()(0)

val df_data_result = df_data.select(columns(0), columns.tail: _*)
+-------+-------+
| col1| col2|
+-------+-------+
|col1_r1|col2_r1|
|col1_r2|col2_r2|
+-------+-------+
Updated question:
1) We may just use two lists: static columns plus dynamic ones.
2) I think that "rdd" is OK in this code, but unfortunately I don't know how to rewrite it using only the DataFrame/Dataset API (see the sketch after the output below).
val staticColumns = Seq[String]("group", "industry_id")

val dynamicColumns = df_cols
  .where("id = 2")
  .select("list_of_colums")
  .rdd.map(r => r(0).asInstanceOf[Seq[String]]).collect()(0)

val columns: Seq[String] = staticColumns ++ dynamicColumns
val df_data_result = df_data.select(columns(0), columns.tail: _*)
+-----+-----------+-------+-------+
|group|industry_id| col1| col2|
+-----+-----------+-------+-------+
| G1| I1|col1_r1|col2_r1|
| G1| I2|col1_r2|col2_r2|
+-----+-----------+-------+-------+
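For reference, a possible DataFrame-only variant of the column lookup might look like this. This is just a sketch: it assumes spark.implicits._ is in scope, that list_of_colums is an array<string> column, and the new value names (dynamicCols, allCols, df_data_result_v2) are purely illustrative.
// Sketch: replaces the .rdd round-trip with a typed Dataset read.
import org.apache.spark.sql.functions.col

val dynamicCols: Seq[String] = df_cols
  .where("id = 2")
  .select("list_of_colums")
  .as[Seq[String]]   // typed access instead of r(0).asInstanceOf[...]
  .head()

val allCols = staticColumns ++ dynamicCols
val df_data_result_v2 = df_data.select(allCols.map(col): _*)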
I tried two ways to get distinct rows from a Parquet file, but neither seems to work.
Attempt 1:
Dataset<Row> df = sqlContext.read().parquet("location.parquet").distinct();
But throws
Cannot have map type columns in DataFrame which calls set operations
(intersect, except, etc.),
but the type of column canvasHashes is map<string,string>;;
Attempt 2:
Tried running sql queries:
Dataset<Row> df = sqlContext.read().parquet("location.parquet");
df.createOrReplaceTempView("df");
Dataset<Row> landingDF = sqlContext.sql("SELECT distinct on timestamp * from df");
The error I get:
== SQL ==
SELECT distinct on timestamp * from df
-----------------------------^^^
Is there a way to get distinct records while reading Parquet files? Is there any read option I can use?
The problem you face is explicitly stated in the exception message: because MapType columns are neither hashable nor orderable, they cannot be used as part of a grouping or partitioning expression.
Your SQL attempt is not logically equivalent to distinct on a Dataset. If you want to deduplicate data based on a set of compatible columns, you should use dropDuplicates:
df.dropDuplicates("timestamp")
which would be equivalent to
SELECT timestamp, first(c1) AS c1, first(c2) AS c2, ..., first(cn) AS cn,
first(canvasHashes) AS canvasHashes
FROM df GROUP BY timestamp
Unfortunately, if your goal is an actual DISTINCT it won't be so easy. One possible solution is to leverage Scala* Map hashing. You could define a Scala UDF like this:
spark.udf.register("scalaHash", (x: Map[String, String]) => x.##)
and then use it in your Java code to derive a column that can be used with dropDuplicates:
df
  .selectExpr("*", "scalaHash(canvasHashes) AS hash_of_canvas_hashes")
  .dropDuplicates(
    // All columns excluding canvasHashes / hash_of_canvas_hashes
    "timestamp", "c1", "c2", ..., "cn",
    // Hash used as surrogate of canvasHashes
    "hash_of_canvas_hashes"
  )
with SQL equivalent
SELECT
timestamp, c1, c2, ..., cn, -- All columns excluding canvasHashes
first(canvasHashes) AS canvasHashes
FROM df GROUP BY
timestamp, c1, c2, ..., cn -- All columns excluding canvasHashes
* Please note that java.util.Map with its hashCode won't work, as hashCode is not consistent.
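For illustration, here is a minimal, self-contained Scala sketch of that approach on toy data. Only the timestamp and canvasHashes columns from the question are kept; the map values and the local[*] session are invented for the example.
// Sketch of the scalaHash + dropDuplicates approach on made-up data.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  (1L, Map("k" -> "v")),
  (1L, Map("k" -> "v")),   // exact duplicate row
  (2L, Map("k" -> "w"))
).toDF("timestamp", "canvasHashes")

// Register the Scala UDF from above (Scala's Map ## hash, per the note on java.util.Map).
spark.udf.register("scalaHash", (m: Map[String, String]) => m.##)

val deduped = df
  .selectExpr("*", "scalaHash(canvasHashes) AS hash_of_canvas_hashes")
  .dropDuplicates("timestamp", "hash_of_canvas_hashes")
  .drop("hash_of_canvas_hashes")

deduped.show()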
1) If you want distinct values based on specific columns, you can do it like this:
val df = sc.parallelize(Array((1, 2), (3, 4), (1, 6))).toDF("no", "age")
scala> df.show
+---+---+
| no|age|
+---+---+
| 1| 2|
| 3| 4|
| 1| 6|
+---+---+
val distinctValuesDF = df.select(df("no")).distinct
scala> distinctValuesDF.show
+---+
| no|
+---+
| 1|
| 3|
+---+
2) If you want unique rows across all columns, use dropDuplicates:
scala> val df = sc.parallelize(Array((1, 2), (3, 4),(3, 4), (1, 6))).toDF("no", "age")
scala> df.show
+---+---+
| no|age|
+---+---+
| 1| 2|
| 3| 4|
| 3| 4|
| 1| 6|
+---+---+
scala> df.dropDuplicates().show()
+---+---+
| no|age|
+---+---+
| 1| 2|
| 3| 4|
| 1| 6|
+---+---+
Yes, the syntax is incorrect; it should be:
Dataset<Row> landingDF = sqlContext.sql("SELECT distinct * from df");
I have two dataframes, DF1 and DF2, with id as the unique column.
DF2 may contain new records as well as updated values for existing records of DF1. When the two dataframes are merged, the result should include the new records, the updated values for matching ids, and the remaining old records unchanged.
Input example:
id name
10 abc
20 tuv
30 xyz
and
id name
10 abc
20 pqr
40 lmn
When I merge these two dataframes, I want the result as:
id name
10 abc
20 pqr
30 xyz
40 lmn
Use an outer join followed by a coalesce. In Scala:
import org.apache.spark.sql.functions.coalesce
import spark.implicits._

val df1 = Seq((10, "abc"), (20, "tuv"), (30, "xyz")).toDF("id", "name")
val df2 = Seq((10, "abc"), (20, "pqr"), (40, "lmn")).toDF("id", "name")

df1.select($"id", $"name".as("old_name"))
  .join(df2, Seq("id"), "outer")
  .withColumn("name", coalesce($"name", $"old_name"))
  .drop("old_name")
coalesce returns the first non-null value among its arguments, which in this case gives:
+---+----+
| id|name|
+---+----+
| 20| pqr|
| 40| lmn|
| 10| abc|
| 30| xyz|
+---+----+
Alternatively, a left anti join followed by a union produces the desired output:
df1.join(df2, Seq("id"), "leftanti").union(df2).show
+---+----+
| id|name|
+---+----+
| 30| xyz|
| 10| abc|
| 20| pqr|
| 40| lmn|
+---+----+
I have data that looks like this
+--------------+---------+-------+---------+
| dataOne|OtherData|dataTwo|dataThree|
+--------------+---------+-------+---------+
| Best| tree| 5| 533|
| OK| bush| e| 3535|
| MEH| cow| -| 3353|
| MEH| oak| none| 12|
+--------------+---------+-------+---------+
and I'm trying to get it into the output of
+--------------+---------+
| dataOne| Count|
+--------------+---------+
| Best| 1|
| OK| 1|
| Meh| 2|
+--------------+---------+
I have no problem getting dataOne into a dataframe by itself and showing its contents to make sure I'm only grabbing the dataOne column.
However, I can't seem to find the correct syntax for turning that SQL query into the data I need. I tried creating the following dataframe from the temp view created over the entire data set:
Dataset<Row> dataOneCount = spark.sql(
    "select dataOne, count(*) from dataFrame group by dataOne");
dataOneCount.show();
But Spark throws an error (shown below).
The documentation I was able to find on this only showed how to do this type of aggregation in Spark 1.6 and prior, so any help would be appreciated.
Here's the error message I get; however, I've checked the data and there is no indexing error in it.
java.lang.ArrayIndexOutOfBoundsException: 11
I've also tried applying the countDistinct function:
Column countNum = countDistinct(dataFrame.col("dataOne"));
Dataset<Row> result = dataOneDataFrame.withColumn("count",countNum);
result.show();
where dataOneDataFrame is a dataFrame created from running
select dataOne from dataFrame
But it returns an analysis exception. I'm still new to Spark, so I'm not sure if there's an error with how or when I'm evaluating the countDistinct method.
Edit: To clarify, the first table shown is the result of the dataframe I've created from reading the text file and applying a custom schema to it (all columns are still strings):
Dataset<Row> dataFrame
Here is my full code
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("Log File Reader")
.getOrCreate();
//args[0] is the textfile location
JavaRDD<String> logsRDD = spark.sparkContext()
.textFile(args[0],1)
.toJavaRDD();
String schemaString = "dataOne OtherData dataTwo dataThree";
List<StructField> fields = new ArrayList<>();
String[] fieldName = schemaString.split(" ");
for (String field : fieldName){
fields.add(DataTypes.createStructField(field, DataTypes.StringType, true));
}
StructType schema = DataTypes.createStructType(fields);
JavaRDD<Row> rowRDD = logsRDD.map((Function<String, Row>) record -> {
String[] attributes = record.split(" ");
return RowFactory.create(attributes[0],attributes[1],attributes[2],attributes[3]);
});
Dataset<Row> dF = spark.createDataFrame(rowRDD, schema);
//first attempt
dF.groupBy(col("dataOne")).count().show();
//Trying with a sql statement
dF.createOrReplaceTempView("view");
dF.sparkSession().sql("select command, count(*) from view group by command").show();
The most likely culprit that comes to mind is the lambda function that returns the row using RowFactory. The idea seems sound, but I'm not sure how it really holds up, or if there's another way I could do it. Other than that, I'm quite puzzled.
sample data
best tree 5 533
OK bush e 3535
MEH cow - 3353
MEH oak none 12
Using Scala syntax for convenience. It's very similar to the Java syntax:
// Input data
val df = {
  import org.apache.spark.sql._
  import org.apache.spark.sql.types._
  import scala.collection.JavaConverters._

  val simpleSchema = StructType(
    StructField("dataOne", StringType) ::
    StructField("OtherData", StringType) ::
    StructField("dataTwo", StringType) ::
    StructField("dataThree", IntegerType) :: Nil)

  val data = List(
    Row("Best", "tree", "5", 533),
    Row("OK", "bush", "e", 3535),
    Row("MEH", "cow", "-", 3353),
    Row("MEH", "oak", "none", 12)
  )

  spark.createDataFrame(data.asJava, simpleSchema)
}
df.show
+-------+---------+-------+---------+
|dataOne|OtherData|dataTwo|dataThree|
+-------+---------+-------+---------+
| Best| tree| 5| 533|
| OK| bush| e| 3535|
| MEH| cow| -| 3353|
| MEH| oak| none| 12|
+-------+---------+-------+---------+
df.groupBy(col("dataOne")).count().show()
+-------+-----+
|dataOne|count|
+-------+-----+
| MEH| 2|
| Best| 1|
| OK| 1|
+-------+-----+
I can submit the Java code given above as follows, with the four-row data file on S3, and it works fine:
$SPARK_HOME/bin/spark-submit \
--class sparktest.FromStackOverflow \
--packages "org.apache.hadoop:hadoop-aws:2.7.3" \
target/scala-2.11/sparktest_2.11-1.0.0-SNAPSHOT.jar "s3a://my-bucket-name/sample.txt"
I create a DataFrame from a Parquet file as follows:
DataFrame parquetFile = sqlContext.read().parquet("test_file.parquet");
parquetFile.printSchema();
parquetFile.registerTempTable("myData");
DataFrame data_df = sqlContext.sql("SELECT * FROM myData");
Now I want to print out all unique values of a column that is called field1.
I know that with Python it would be possible to run import pandas as pd, convert data_df to a pandas DataFrame, and then use unique().
But how can I do it in Java?
It's very straightforward; you can use DISTINCT in the SQL query:
DataFrame data_df = sqlContext.sql("SELECT DISTINCT(field1) FROM myData");
Here's an example:
val myData = Seq("h", "h", "d", "b", "d").toDF("field1")
myData.createOrReplaceTempView("myData")
val sqlContext = spark.sqlContext
sqlContext.sql("SELECT DISTINCT(field1) FROM myData").show()
This gives the following output:
+------+
|field1|
+------+
| h|
| d|
| b|
+------+
Hope this helps. Best regards.
You can remove duplicates and get distinct values with:
parquetFile.dropDuplicates("field1")
This gives you only the rows that are distinct by field1.
// groupBy alone returns GroupedData, so aggregate to get a DataFrame of unique field1 values
DataFrame uniqueDF = data_df.groupBy("field1").count();
uniqueDF.show();