I create a DataFrame from a Parquet file as follows:
DataFrame parquetFile = sqlContext.read().parquet("test_file.parquet");
parquetFile.printSchema();
parquetFile.registerTempTable("myData");
DataFrame data_df = sqlContext.sql("SELECT * FROM myData");
Now I want to print out all unique values of a column that is called field1.
I know that in Python it would be possible to run import pandas as pd, convert data_df to a pandas DataFrame, and then use unique().
But how can I do it in Java?
It's very straightforward: you can use DISTINCT in the SQL query:
DataFrame data_df = sqlContext.sql("SELECT DISTINCT(field1) FROM myData");
Here's an example:
import spark.implicits._  // needed for toDF outside the spark-shell

val myData = Seq("h", "h", "d", "b", "d").toDF("field1")
myData.createOrReplaceTempView("myData")
val sqlContext = spark.sqlContext
sqlContext.sql("SELECT DISTINCT(field1) FROM myData").show()
This gives the following output:
+------+
|field1|
+------+
| h|
| d|
| b|
+------+
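An equivalent without SQL, as a minimal Java sketch using the DataFrame API directly (assuming the Spark 1.x data_df from the question):
// Select the column first, then deduplicate; distinct() returns a new DataFrame.
DataFrame uniqueValues = data_df.select("field1").distinct();
uniqueValues.show();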
Hope this helps. Best regards.
You can remove duplicates and get distinct values with
parquetFile.dropDuplicates("field1")
This gives you one row per distinct field1 value; note that the remaining columns come from an arbitrary surviving row, which is different from selecting only the distinct field1 values.
// groupBy alone returns GroupedData, which has no show(); add an aggregation such as count()
DataFrame uniqueDF = data_df.groupBy("field1").count();
uniqueDF.show();
I am using spark-sql 2.3.1 with Java 8.
I have a dataframe like below:
import org.apache.spark.sql.types._
import spark.implicits._

val df_data = Seq(
("G1","I1","col1_r1", "col2_r1","col3_r1"),
("G1","I2","col1_r2", "col2_r2","col3_r3")
).toDF("group","industry_id","col1","col2","col3")
.withColumn("group", $"group".cast(StringType))
.withColumn("industry_id", $"industry_id".cast(StringType))
.withColumn("col1", $"col1".cast(StringType))
.withColumn("col2", $"col2".cast(StringType))
.withColumn("col3", $"col3".cast(StringType))
+-----+-----------+-------+-------+-------+
|group|industry_id| col1| col2| col3|
+-----+-----------+-------+-------+-------+
| G1| I1|col1_r1|col2_r1|col3_r1|
| G1| I2|col1_r2|col2_r2|col3_r3|
+-----+-----------+-------+-------+-------+
val df_cols = Seq(
("1", "usa", Seq("col1","col2","col3")),
("2", "ind", Seq("col1","col2"))
).toDF("id","name","list_of_colums")
.withColumn("id", $"id".cast(IntegerType))
.withColumn("name", $"name".cast(StringType))
+---+----+------------------+
| id|name| list_of_colums|
+---+----+------------------+
| 1| usa|[col1, col2, col3]|
| 2| ind| [col1, col2]|
+---+----+------------------+
Question:
As shown above, I have the column information in the "df_cols" dataframe and all the data in the "df_data" dataframe.
How can I dynamically select columns from "df_data" for a given id in "df_cols"?
Initial question:
val columns = df_cols
.where("id = 2")
.select("list_of_colums")
.rdd.map(r => r(0).asInstanceOf[Seq[String]]).collect()(0)
val df_data_result = df_data.select(columns(0), columns.tail: _*)
+-------+-------+
| col1| col2|
+-------+-------+
|col1_r1|col2_r1|
|col1_r2|col2_r2|
+-------+-------+
Updated question:
1) We may just use two lists: static columns plus dynamic ones.
2) I think the "rdd" step is OK in this code, but unfortunately I don't know how to rewrite it using the DataFrame API only.
val staticColumns = Seq[String]("group", "industry_id")
val dynamicColumns = df_cols
.where("id = 2")
.select("list_of_colums")
.rdd.map(r => r(0).asInstanceOf[Seq[String]]).collect()(0)
val columns: Seq[String] = staticColumns ++ dynamicColumns
val df_data_result = df_data.select(columns(0), columns.tail: _*)
+-----+-----------+-------+-------+
|group|industry_id| col1| col2|
+-----+-----------+-------+-------+
| G1| I1|col1_r1|col2_r1|
| G1| I2|col1_r2|col2_r2|
+-----+-----------+-------+-------+
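On point 2, the collect-through-RDD step can be avoided with Row.getList; a hedged Java 8 sketch, assuming Dataset<Row> variables df_cols and df_data shaped like the frames above:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Read the array column of the matching row as a java.util.List - no RDD step.
Row firstRow = df_cols.where("id = 2").select("list_of_colums").first();
List<String> columns = new ArrayList<>(Arrays.asList("group", "industry_id"));
columns.addAll(firstRow.<String>getList(0));

// Dataset.select(String, String...) takes a head column plus the tail as varargs.
Dataset<Row> df_data_result = df_data.select(columns.get(0),
        columns.subList(1, columns.size()).toArray(new String[0]));
df_data_result.show();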
I have data coming in for the first column, 'code', of a dataframe, as below:
'101-23','23-00-11','NOV-11-23','34-000-1111-1'
and now I want the values of the 'code' column to look as below after taking the substring:
101,23-00,NOV-11,34-000-1111
The above can be achieved easily in Java as below:
String str = "23-00-11";
int index = str.lastIndexOf("-");
String ss = str.substring(0, index);
which gives
'23-00'
How can I do this with a dataframe, writing a UDF or applying it to the dataframe, with Spark 1.6.2 and Java 1.8?
I tried df.withColumn("code", substring("code", 0, 1)) but didn't find a way to get the last index. Please help.
from pyspark.sql.functions import *
newDf = df.withColumn('_c0', regexp_replace('_c0', '#', ''))\
.withColumn('_c1', regexp_replace('_c1', "'", ''))\
.withColumn('_c2', regexp_replace('_c2', '!', ''))
newDf.show()
Updated
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for toDS outside the spark-shell
val df11 = Seq("'101-23','23-00-11','NOV-11-23','34-000-1111-1'").toDS()
df11.show()
//df11.select(col("a"), substring_index(col("value"), ",", 1).as("b"))
val df111=df11.withColumn("value", substring(df11("value"), 0, 10))
df111.show()
Result:
+--------------------+
| value|
+--------------------+
|'101-23','23-00-1...|
+--------------------+
+----------+
| value|
+----------+
|'101-23','|
+----------+
import org.apache.spark.sql.functions._
df11: org.apache.spark.sql.Dataset[String] = [value: string]
df111: org.apache.spark.sql.DataFrame = [value: string]
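Neither snippet above actually cuts at the last hyphen. One way to do that without writing a UDF is a regex replace; a hedged Java sketch, assuming a dataframe df with a string column "code" (regexp_replace has been in org.apache.spark.sql.functions since Spark 1.5, so it should work on 1.6.2):
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.regexp_replace;

// "-[^-]*$" matches the last hyphen and everything after it; replacing that
// with "" keeps the prefix up to, but not including, the last '-'.
// '23-00-11' -> '23-00', '34-000-1111-1' -> '34-000-1111'
df = df.withColumn("code", regexp_replace(col("code"), "-[^-]*$", ""));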
I have data that looks like this
+--------------+---------+-------+---------+
| dataOne|OtherData|dataTwo|dataThree|
+--------------+---------+-------+---------+
| Best| tree| 5| 533|
| OK| bush| e| 3535|
| MEH| cow| -| 3353|
| MEH| oak| none| 12|
+--------------+---------+-------+---------+
and I'm trying to get it into the output of
+--------------+---------+
| dataOne| Count|
+--------------+---------+
|          Best|        1|
|            OK|        1|
|           MEH|        2|
+--------------+---------+
I have no problem getting dataOne into a dataframe by itself and showing its contents, to make sure I'm grabbing only the dataOne column.
However, I can't seem to find the correct syntax for turning that SQL query into the data I need. I tried creating the following dataframe from the temp view created over the entire data set:
Dataset<Row> dataOneCount = spark.sql("select dataOne, count(*) from dataFrame group by dataOne");
dataOneCount.show();
But Spark throws an error. The documentation I was able to find on this only showed how to do this type of aggregation in Spark 1.6 and prior, so any help would be appreciated.
Here's the error message I get; however, I've checked the data and there is no indexing error in it.
java.lang.ArrayIndexOutOfBoundsException: 11
I've also tried applying the countDistinct method from functions:
Column countNum = countDistinct(dataFrame.col("dataOne"));
Dataset<Row> result = dataOneDataFrame.withColumn("count",countNum);
result.show();
where dataOneDataFrame is a dataFrame created from running
select dataOne from dataFrame
But it returns an analysis exception. I'm still new to Spark, so I'm not sure whether there's an error in how or when I'm evaluating the countDistinct method.
Edit: to clarify, the first table shown is the result of the dataFrame I've created from reading the text file and applying a custom schema to it (all columns are still strings):
Dataset<Row> dataFrame
Here is my full code
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("Log File Reader")
.getOrCreate();
//args[0] is the textfile location
JavaRDD<String> logsRDD = spark.sparkContext()
.textFile(args[0],1)
.toJavaRDD();
String schemaString = "dataOne OtherData dataTwo dataThree";
List<StructField> fields = new ArrayList<>();
String[] fieldName = schemaString.split(" ");
for (String field : fieldName){
fields.add(DataTypes.createStructField(field, DataTypes.StringType, true));
}
StructType schema = DataTypes.createStructType(fields);
JavaRDD<Row> rowRDD = logsRDD.map((Function<String, Row>) record -> {
String[] attributes = record.split(" ");
return RowFactory.create(attributes[0],attributes[1],attributes[2],attributes[3]);
});
Dataset<Row> dF = spark.createDataFrame(rowRDD, schema);
//first attempt
dF.groupBy(col("dataOne")).count().show();
//Trying with a sql statement
dF.createOrReplaceTempView("view");
dF.sparkSession().sql("select command, count(*) from view group by command").show();
}
The most likely culprit that comes to mind is the lambda function that returns the row using RowFactory. The idea seems sound, but I'm not sure how it really holds up or whether there's another way I could do it. Other than that, I'm quite puzzled.
Sample data:
best tree 5 533
OK bush e 3535
MEH cow - 3353
MEH oak none 12
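One hedged guess at the ArrayIndexOutOfBoundsException, not confirmed here: if any input line has a different number of space-separated fields than expected, record.split(" ") returns a shorter array. A defensive variant of the parsing step, splitting on whitespace runs and dropping short rows:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

// Split on runs of whitespace and skip rows with fewer than 4 fields,
// so indexing attributes[3] can never go past the end of the token array.
JavaRDD<Row> rowRDD = logsRDD
        .map(record -> record.trim().split("\\s+"))
        .filter(attributes -> attributes.length >= 4)
        .map(attributes -> RowFactory.create(attributes[0], attributes[1], attributes[2], attributes[3]));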
Using Scala syntax for convenience. It's very similar to the Java syntax:
// Input data
val df = {
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import scala.collection.JavaConverters._
val simpleSchema = StructType(
StructField("dataOne", StringType) ::
StructField("OtherData", StringType) ::
StructField("dataTwo", StringType) ::
StructField("dataThree", IntegerType) :: Nil)
val data = List(
Row("Best", "tree", "5", 533),
Row("OK", "bush", "e", 3535),
Row("MEH", "cow", "-", 3353),
Row("MEH", "oak", "none", 12)
)
spark.createDataFrame(data.asJava, simpleSchema)
}
df.show
+-------+---------+-------+---------+
|dataOne|OtherData|dataTwo|dataThree|
+-------+---------+-------+---------+
| Best| tree| 5| 533|
| OK| bush| e| 3535|
| MEH| cow| -| 3353|
| MEH| oak| none| 12|
+-------+---------+-------+---------+
df.groupBy(col("dataOne")).count().show()
+-------+-----+
|dataOne|count|
+-------+-----+
| MEH| 2|
| Best| 1|
| OK| 1|
+-------+-----+
I can submit the Java code given above as follows, with the four-row data file on S3, and it works fine:
$SPARK_HOME/bin/spark-submit \
--class sparktest.FromStackOverflow \
--packages "org.apache.hadoop:hadoop-aws:2.7.3" \
target/scala-2.11/sparktest_2.11-1.0.0-SNAPSHOT.jar "s3a://my-bucket-name/sample.txt"
I'm using the following code:
df = df.select(
    df.col("col").as("col1"),
    df.col("col_").as("col2"));
df = df.select("*").distinct();
df= df.sample(true, 0.8).limit(300);
df= df.withColumn("random", lit(0));
df.show();
I want to select distinct rows, then take a sample and limit it to 300 records. However, df.show() shows duplicate rows all over the place. What am I missing?
Thank you!
Assign to a new dataframe:
val myDupeDF=myDF.select(myDF.col("EmpName"))
myDupeDF.show()
val myDistinctDf=myDF.select(myDF.col("EmpName")).distinct
myDistinctDf.show();
+-------+
|EmpName|
+-------+
| John|
| John|
| John|
+-------+
After distinct
+-------+
|EmpName|
+-------+
| John|
+-------+
Update for all columns:
I chose all columns and it still works for me. I am using Spark 1.5.1.
val myDupeDF=myDF.select(myDF.col("*"))
myDupeDF.show()
val myDistinctDf=myDF.select(myDF.col("*")).distinct
myDistinctDf.show();
Result:
+-----+-------+------+----------+
|EmpId|EmpName|Salary|SalaryDate|
+-----+-------+------+----------+
| 1| John|1000.0|2016-01-01|
+-----+-------+------+----------+
Try this:
df = df.select(
    df.col("col").as("col1"),
    df.col("col_").as("col2"));
df = df.distinct();
df= df.sample(true, 0.8).limit(300);
df= df.withColumn("random", lit(0));
df.show();
But I think you need to mention a column name to perform the distinct operation:
df = df.select("COLUMNNAME").distinct();
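Neither answer mentions it, but the duplicates may also come from the sampling step itself: sample(true, 0.8) samples with replacement, so the same row can be emitted several times even after distinct(). A one-line sketch of the fix, assuming duplicates from sampling are unwanted:
// Sample without replacement so rows made distinct stay distinct.
df = df.sample(false, 0.8).limit(300);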
I have two Spark DataFrames. One of them has two cols, id and Tag; the second has an id col but is missing the Tag. The first DataFrame is essentially a dictionary: each id appears once, while in the second DataFrame an id may appear several times. What I need is to create a new col in the second DataFrame that has the Tag as a function of the id in each row. I think this could be done by converting to RDDs first ..etc, but I thought there must be a more elegant way using DataFrames (in Java).

Example: given a df1 row -> id: 0, Tag: "A", a df2 row1 -> id: 0, Tag: null, and a df2 row2 -> id: 0, Tag: "B", I need to create a Tag col in the resulting DataFrame df3 that equals df1's Tag ("A") if the df2 Tag was null, but keeps the original Tag if not, resulting in df3 row1 -> id: 0, Tag: "A" and df3 row2 -> id: 0, Tag: "B". Hope the example is clear.
| ID | No. | Tag | new Tag Col |
| 1 | 10002 | A | A |
| 2 | 10003 | B | B |
| 1 | 10004 | null | A |
| 2 | 10005 | null | B |
All you need here is a left outer join and coalesce:
import org.apache.spark.sql.functions.coalesce
val df = sc.parallelize(Seq(
(1, 10002, Some("A")), (2, 10003, Some("B")),
(1, 10004, None), (2, 10005, None)
)).toDF("id", "no", "tag")
val lookup = sc.parallelize(Seq(
(1, "A"), (2, "B")
)).toDF("id", "tag")
df.join(lookup, df.col("id").equalTo(lookup.col("id")), "leftouter")
.withColumn("new_tag", coalesce(df.col("tag"), lookup.col("tag")))
This should be almost identical to the Java version.
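Since the question asked for Java, here is a hedged sketch of the same join, assuming Spark 2.x's Dataset<Row> (with Spark 1.x, substitute DataFrame) and the df and lookup frames built above:
import static org.apache.spark.sql.functions.coalesce;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// The left outer join keeps every df row; coalesce picks df.tag when it is
// non-null and otherwise falls back to the tag from the lookup table.
Dataset<Row> df3 = df
        .join(lookup, df.col("id").equalTo(lookup.col("id")), "leftouter")
        .withColumn("new_tag", coalesce(df.col("tag"), lookup.col("tag")));
df3.show();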