I have a dataset like this:
uid  group_a  group_b
1    3        unknown
1    unknown  4
2    unknown  3
2    2        unknown
I want to get the result:
uid  group_a  group_b
1    3        4
2    2        3
I tried to group the data by "uid", iterate over each group, and select the non-unknown value as the final value, but I don't know how to do it.
I would suggest you define a User Defined Aggregate Function (UDAF).
Built-in functions are great, but they are difficult to customize. If you own a UDAF, it is fully customizable and you can edit it according to your needs.
Concerning your problem, the following can be your solution; you can adjust it as needed.
The first task is to define the UDAF:
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, StringType, StructType}

class PingJiang extends UserDefinedAggregateFunction {

  // Input: the two group columns, both as strings
  def inputSchema = new StructType().add("group_a", StringType).add("group_b", StringType)

  // Buffer: one slot per output column
  def bufferSchema = new StructType().add("buff0", StringType).add("buff1", StringType)

  def dataType: DataType = StringType
  def deterministic = true

  def initialize(buffer: MutableAggregationBuffer) = {
    buffer.update(0, "")
    buffer.update(1, "")
  }

  // Keep the last non-"unknown" value seen for each column
  def update(buffer: MutableAggregationBuffer, input: Row) = {
    if (!input.isNullAt(0) && !input.getString(0).equalsIgnoreCase("unknown")) {
      buffer.update(0, input.getString(0))
    }
    if (!input.isNullAt(1) && !input.getString(1).equalsIgnoreCase("unknown")) {
      buffer.update(1, input.getString(1))
    }
  }

  // When merging partial buffers, keep whichever side already holds a value
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    if (buffer1.getString(0).isEmpty) buffer1.update(0, buffer2.getString(0))
    if (buffer1.getString(1).isEmpty) buffer1.update(1, buffer2.getString(1))
  }

  // Emit both values as a single comma-separated string
  def evaluate(buffer: Row): String =
    buffer.getString(0) + "," + buffer.getString(1)
}
Then you call it from your main class and do some post-processing to get the result you need:
import org.apache.spark.sql.functions.split
import spark.implicits._   // assumes a SparkSession named `spark` is in scope

val data = Seq(
  (1, "3", "unknown"),
  (1, "unknown", "4"),
  (2, "unknown", "3"),
  (2, "2", "unknown"))
  .toDF("uid", "group_a", "group_b")

val udaf = new PingJiang()

val result = data.groupBy("uid").agg(udaf($"group_a", $"group_b").as("ping"))
  .withColumn("group_a", split($"ping", ",")(0))
  .withColumn("group_b", split($"ping", ",")(1))
  .drop("ping")

result.show(false)
Visit the databricks and augmentiq write-ups for a better understanding of UDAFs.
Note: the above solution keeps a non-"unknown" value for each column per group if one is present (you can always adjust it according to your needs).
After you convert the dataset to a pair RDD you can use the reduceByKey operation to find the single known value. The following example assumes that there is only one known value per uid per column, and otherwise returns the first known value:
val input = List(
  ("1", "3", "unknown"),
  ("1", "unknown", "4"),
  ("2", "unknown", "3"),
  ("2", "2", "unknown")
)

val pairRdd = sc.parallelize(input).map(l => (l._1, (l._2, l._3)))

val result = pairRdd.reduceByKey { (a, b) =>
  val groupA = if (a._1 != "unknown") a._1 else b._1
  val groupB = if (a._2 != "unknown") a._2 else b._2
  (groupA, groupB)
}
The result will be a pairRdd that looks like this
(uid, (group_a, group_b))
(1,(3,4))
(2,(2,3))
You can return to the plain line format with a simple map operation.
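For example, a minimal sketch of that map (my illustration, assuming you want flat (uid, group_a, group_b) tuples back):
// Flatten (uid, (group_a, group_b)) back into (uid, group_a, group_b) rows
val flat = result.map { case (uid, (groupA, groupB)) => (uid, groupA, groupB) }
flat.collect().foreach(println)   // (1,3,4) and (2,2,3)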
You could replace all "unknown" values with null and then use the function first() with ignoreNulls set to true (building one expression per column, as shown here) to get the first non-null value in each column per group:
import org.apache.spark.sql.functions.{col, first, when}

// We are only going to apply our function to the last 2 columns
val cols = df.columns.drop(1)

// Create the aggregation expressions
val exprs = cols.map(first(_, true))

// Putting it all together
df.select(df.columns
    .map(c => when(col(c) === "unknown", null)
      .otherwise(col(c)).as(c)): _*)
  .groupBy("uid")
  .agg(exprs.head, exprs.tail: _*).show()
+---+--------------------+--------------------+
|uid|first(group_1, true)|first(group_b, true)|
+---+--------------------+--------------------+
| 1| 3| 4|
| 2| 2| 3|
+---+--------------------+--------------------+
Data:
val df = sc.parallelize(Array(("1","3","unknown"),("1","unknown","4"),
("2","unknown","3"),("2","2","unknown"))).toDF("uid","group_1","group_b")
Related
I've written the following Aerospike filter in Java. Both Field1 and Field2 are booleans. For some reason, the "filterByField.ValidBoth" condition does not yield true, although the record matches the criteria.
Since it's a boolean, I'm using 1 for true and 0 for false.
Am I missing something?
public Exp getFilterByFieldFilter(FilterByField filterByField) {
    if (filterByField == null || "".equals(filterByField)) {
        return Exp.val(true);
    }

    if (filterByField == filterByField.All) {
        return Exp.val(true);
    } else if (filterByField == filterByField.ValidBoth) {
        return Exp.and(Exp.eq(Exp.intBin("Field1"), Exp.val(0)),
                       Exp.eq(Exp.intBin("Field2"), Exp.val(0)));
    }
}
From what I can see in the database results through AQL, bins that are not set to true are simply absent from the record.
Should I write my filter in a different way to check this condition? If so, what would that condition look like?
I tried checking for Exp.val(NULL) but got an error.
Here's my database result set through AQL
[
  {
    "PK": "1",
    "Name": "ABC",
    "Field1": 1,
    "Field2": 1
  },
  {
    "PK": "2",
    "Name": "EFG",
    "Field1": 1
  },
  {
    "PK": "3",
    "Name": "XYZ"
  }
]
If the bins Field1 and Field2 contain boolean values, then your expression should be constructed in this fashion (whatever the desired logic is):
Exp.eq(Exp.boolBin("Field1"), Exp.val(false))
I tested the construct below; it seems to work for me:
WritePolicy wPolicy = new WritePolicy();
Bin b1 = new Bin("Field1", Value.get(0));
Bin b2 = new Bin("Field2", Value.get(0));
Bin b3 = new Bin("Data", Value.get("data"));
wPolicy.recordExistsAction = RecordExistsAction.REPLACE;
client.put(wPolicy, key, b1, b2, b3);
//client.put(wPolicy, key, b1, b3);
Exp condFilter = Exp.and(
Exp.eq(Exp.intBin("Field1"),Exp.val(0) ),
Exp.eq(Exp.intBin("Field2"),Exp.val(0) )
);
Policy policy = new Policy();
policy.filterExp = Exp.build(condFilter);
Record record = client.get(policy, key);
System.out.println("Read back the record.");
System.out.println("Record values are:");
System.out.println(record);
//Get record without filter condition
record = client.get(null, key);
System.out.println(record);
Valid condition:
Read back the record.
Record values are:
(gen:18),(exp:0),(bins:(Field1:0),(Field2:0),(Data:data))
(gen:18),(exp:0),(bins:(Field1:0),(Field2:0),(Data:data))
Invalid Condition (no Field2 bin):
Read back the record.
Record values are:
null
(gen:19),(exp:0),(bins:(Field1:0),(Data:data))
I have the following YAML file that I am trying to update, depending on whether a value for a particular key exists.
If a productName with a value of test exists in the YAML file, I want to update its respective productUrl with a new value.
If I have a new productName called test that does not exist in the YAML file, I want to be able to add a new entry for this productName and its productUrl.
company:
  products:
    - productName: abc
      productUrl: https://company/product-abc
    - productName: def
      productUrl: https://company/product-def
    - productName: ghi
      productUrl: https://company/product-ghi
    - productName: jkl
      productUrl: https://company/product-jkl
    - productName: mno
      productUrl: https://company/product-mno
    - productName: pqr
      productUrl: https://company/product-pqr
This is what I have so far but I'm not sure if this can be re-written in a much cleaner way, or if there's a bug in my approach.
@Grab('org.yaml:snakeyaml:1.17')
import org.yaml.snakeyaml.Yaml

Yaml parser = new Yaml()
def p = parser.load(("company.yml" as File).text)

Boolean isProductNew = true

p.company.products.each { i ->
    if (i.productName == 'test') {
        i.productUrl = 'https://company/product-new-test'
        isProductNew = false
    }
}

if (isProductNew) {
    p.company.products << ["productName": "test", "productUrl": "https://company/product-test"]
}

println p
You can write this in a cleaner way:
def prod = p.company.products.find { it.productName == 'test' }
if (!prod) {
    prod = [productName: 'test']
    p.company.products << prod
}
prod.productUrl = 'https://company/product-test'
I am using spark-sql 2.3.1 with Java 8.
I have a data frame like below:
val df_data = Seq(
  ("G1", "I1", "col1_r1", "col2_r1", "col3_r1"),
  ("G1", "I2", "col1_r2", "col2_r2", "col3_r3")
).toDF("group", "industry_id", "col1", "col2", "col3")
  .withColumn("group", $"group".cast(StringType))
  .withColumn("industry_id", $"industry_id".cast(StringType))
  .withColumn("col1", $"col1".cast(StringType))
  .withColumn("col2", $"col2".cast(StringType))
  .withColumn("col3", $"col3".cast(StringType))
+-----+-----------+-------+-------+-------+
|group|industry_id| col1| col2| col3|
+-----+-----------+-------+-------+-------+
| G1| I1|col1_r1|col2_r1|col3_r1|
| G1| I2|col1_r2|col2_r2|col3_r3|
+-----+-----------+-------+-------+-------+
val df_cols = Seq(
  ("1", "usa", Seq("col1", "col2", "col3")),
  ("2", "ind", Seq("col1", "col2"))
).toDF("id", "name", "list_of_colums")
  .withColumn("id", $"id".cast(IntegerType))
  .withColumn("name", $"name".cast(StringType))
+---+----+------------------+
| id|name| list_of_colums|
+---+----+------------------+
| 1| usa|[col1, col2, col3]|
| 2| ind| [col1, col2]|
+---+----+------------------+
Question:
As shown above, the column information is in the "df_cols" dataframe and all the data is in the "df_data" dataframe.
How can I dynamically select columns from "df_data" for a given id of "df_cols"?
Initial question:
val columns = df_cols
  .where("id = 2")
  .select("list_of_colums")
  .rdd.map(r => r(0).asInstanceOf[Seq[String]]).collect()(0)

val df_data_result = df_data.select(columns(0), columns.tail: _*)
+-------+-------+
| col1| col2|
+-------+-------+
|col1_r1|col2_r1|
|col1_r2|col2_r2|
+-------+-------+
Updated question:
1) We may just use 2 lists: static columns + dynamic ones
2) I think that the "rdd" step is OK in this code, but unfortunately I don't know how to do it with the Dataframe API only (see the sketch after the output below for one option).
val staticColumns = Seq[String]("group", "industry_id")

val dynamicColumns = df_cols
  .where("id = 2")
  .select("list_of_colums")
  .rdd.map(r => r(0).asInstanceOf[Seq[String]]).collect()(0)

val columns: Seq[String] = staticColumns ++ dynamicColumns
val df_data_result = df_data.select(columns(0), columns.tail: _*)
+-----+-----------+-------+-------+
|group|industry_id| col1| col2|
+-----+-----------+-------+-------+
| G1| I1|col1_r1|col2_r1|
| G1| I2|col1_r2|col2_r2|
+-----+-----------+-------+-------+
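Here is a minimal sketch (my addition, under the assumption that avoiding the RDD round-trip is the goal) that reads the column list with the Dataset API instead of .rdd:
// Read the list of columns for id = 2 without going through .rdd
// (assumes spark.implicits._ is in scope, as in the examples above)
val dynamicColumns: Seq[String] = df_cols
  .where("id = 2")
  .select("list_of_colums")
  .as[Seq[String]]
  .head()

val columns = Seq("group", "industry_id") ++ dynamicColumns
val df_data_result = df_data.select(columns.head, columns.tail: _*)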
I have a Spark data frame containing a JSON column that is formatted differently from the standard:
|col_name |
|{a=6236.0, b=0.0} |
|{a=323, b=2.3} |
As you can see, the JSON uses the = sign for the fields instead of :.
If I use the predefined function from_json, this yields null because the column is not in the standard format. Is there another way to parse this column into two separate columns?
I don't see any simple way to parse this input directly. You need to break the string apart and construct valid JSON with a udf. Check this out:
scala> val df = Seq(("{a=6236.0, b=0.0}"),("{a=323, b=2.3} ")).toDF("data")
df: org.apache.spark.sql.DataFrame = [data: string]
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> import org.apache.spark.sql.functions.{from_json, udf}
import org.apache.spark.sql.functions.{from_json, udf}
scala> val sch1 = new StructType().add($"a".string).add($"b".string)
sch1: org.apache.spark.sql.types.StructType = StructType(StructField(a,StringType,true), StructField(b,StringType,true))
scala> def json1(x:String):String=
| {
| val coly = x.replaceAll("[{}]","").split(",")
| val cola = coly(0).trim.split("=")
| val colb = coly(1).trim.split("=")
| "{\""+cola(0)+"\":"+cola(1)+ "," + "\"" +colb(0) + "\":" + colb(1) + "}"
| }
json1: (x: String)String
scala> val my_udf = udf( json1(_:String):String )
my_udf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> df.withColumn("n1",my_udf('data)).select(from_json($"n1",sch1) as "data").select("data.*").show(false)
+------+---+
|a |b |
+------+---+
|6236.0|0.0|
|323 |2.3|
+------+---+
We have a collection 'message' with the following fields:
_id | messageId | chainId | createOn
1 | 1 | A | 155
2 | 2 | A | 185
3 | 3 | A | 225
4 | 4 | B | 226
5 | 5 | C | 228
6 | 6 | B | 300
We want to select all fields of the documents matching the following criteria:
distinct by field 'chainId'
ordered (sorted) by 'createOn' in descending order
So the expected result is:
_id | messageId | chainId | createOn
3 | 3 | A | 225
5 | 5 | C | 228
6 | 6 | B | 300
We are using spring-data in our Java application. I have tried different approaches, but nothing has helped me so far.
Is it possible to achieve the above with a single query?
What you want can be achieved with the aggregation framework. The basic form of this (which is useful to others) is:
db.collection.aggregate([
    // Group on the distinct chainId values, keeping the first value seen for each other field
    { "$group": {
        "_id": "$chainId",
        "docId": { "$first": "$_id" },
        "messageId": { "$first": "$messageId" },
        "createOn": { "$first": "$createOn" }
    }},
    // Then sort
    { "$sort": { "createOn": -1 } }
])
So that "groups" on the distinct values of "messageId" while taking the $first boundary values for each of the other fields. Alternately if you want the largest then use $last instead, but for either smallest or largest by row it probably makes sense to $sort first, otherwise just use $min and $max if the whole row is not important.
See the MongoDB aggregate() documentation for more information on usage, as well as the driver JavaDocs and SpringData Mongo connector documentation for more usage of the aggregate method and possible helpers.
Here is a solution using the MongoDB Java driver:
final MongoClient mongoClient = new MongoClient();
final DB db = mongoClient.getDB("mstreettest");
final DBCollection collection = db.getCollection("message");
final BasicDBObject groupFields = new BasicDBObject("_id", "$chainId");
groupFields.put("docId", new BasicDBObject("$first", "$_id"));
groupFields.put("messageId", new BasicDBObject("$first", "$messageId"));
groupFields.put("createOn", new BasicDBObject("$first", "$createdOn"));
final DBObject group = new BasicDBObject("$group", groupFields);
final DBObject sortFields = new BasicDBObject("createOn", -1);
final DBObject sort = new BasicDBObject("$sort", sortFields);
final DBObject projectFields = new BasicDBObject("_id", 0);
projectFields.put("_id", "$docId");
projectFields.put("messageId", "$messageId");
projectFields.put("chainId", "$_id");
projectFields.put("createOn", "$createOn");
final DBObject project = new BasicDBObject("$project", projectFields);
final AggregationOutput aggregate = collection.aggregate(group, sort, project);
and the result will be:
{ "_id" : 5 , "messageId" : 5 , "createOn" : { "$date" : "2014-04-23T04:45:45.173Z"} , "chainId" : "C"}
{ "_id" : 4 , "messageId" : 4 , "createOn" : { "$date" : "2014-04-23T04:12:25.173Z"} , "chainId" : "B"}
{ "_id" : 1 , "messageId" : 1 , "createOn" : { "$date" : "2014-04-22T08:29:05.173Z"} , "chainId" : "A"}
I tried it with Spring Data Mongo and it didn't work when grouping by chainId; the exception was java.lang.NumberFormatException: For input string: "C".
Replace this line:
final DBObject group = new BasicDBObject("$group", groupFields);
with this one:
final DBObject group = new BasicDBObject("_id", groupFields);
Here is a solution using springframework.data.mongodb:
Aggregation aggregation = Aggregation.newAggregation(
    Aggregation.group("chainId")
        .first("_id").as("docId")
        .first("messageId").as("messageId")
        .first("createOn").as("createOn"),
    Aggregation.sort(new Sort(Sort.Direction.DESC, "createOn"))
);
AggregationResults<XxxBean> results = mongoTemplate.aggregate(aggregation, "collection_name", XxxBean.class);