Get JSON head node value in Scala

I'm a beginner in Scala, using the "json4s" library for JSON parsing, and I have JSON data formatted like below:
scala> val str = """
| {
| "index_key": {
| "time":"12938473",
| "event_detail": {
| "event_name":"click",
| "location":"US"
| }
| }
| }
| """
I'm trying to get "index_key" and assign it to a variable. I tried the below:
scala> val json = parse(str)
json: org.json4s.JValue = JObject(List((index_key,JObject(List((time,JString(12938473)), (event_detail,JObject(List((event_name,JString(click)), (location,JString(US))))))))))
scala> json.values
res40: json.Values = Map(index_key -> Map(time -> 12938473, event_detail -> Map(event_name -> click, location -> US)))
I can get at the Map via "json.values" using "json.values.head" or "json.values.keys", but I cannot extract the first key, "index_key", from it. Could anyone please tell me how to get the key "index_key", and how "res40: json.Values" relates to the Map type? Thanks a lot.

I'm not familiar with json4s specifically, but I'm pretty sure it acts like most other JSON libraries in that it provides you with a nice DSL for extracting data from parsed JSON.
I had a look at the docs and found this:
scala> val json =
  ("person" ->
    ("name" -> "Joe") ~
    ("age" -> 35) ~
    ("spouse" ->
      ("person" ->
        ("name" -> "Marilyn") ~
        ("age" -> 33)
      )
    )
  )
scala> json \\ "spouse"
res0: org.json4s.JsonAST.JValue = JObject(List(
(person,JObject(List((name,JString(Marilyn)), (age,JInt(33)))))))
The \\ operator searches the JSON structure and extracts the data at any matching node. Note that the double backslash operator works recursively; to match only direct children of the current node you would use the single backslash operator, i.e. '\'.
For your example it would be json \ "index_key" which would return the JSON at that node.
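Putting that together with the JSON from the question, a quick sketch (assuming either the native or jackson JsonMethods import for parse):
import org.json4s._
import org.json4s.jackson.JsonMethods._ // or org.json4s.native.JsonMethods._

val json = parse(str)
json \ "index_key"            // JObject(List((time,JString(12938473)), (event_detail,JObject(...))))
json \ "index_key" \ "time"   // JString(12938473)
json \\ "event_name"          // JString(click), found recursively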

The head node's key, "index_key", can be retrieved as below, thanks to the answer from @bjfletcher:
parse(str).asInstanceOf[JObject].values.head._1
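If you would rather avoid the asInstanceOf cast, a pattern match gives you the same key a bit more safely (a small sketch, not part of the original answer):
parse(str) match {
  case JObject(fields) => fields.headOption.map(_._1) // Some("index_key")
  case _               => None
}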

Related

Serialization Error: How do I create a UDF that consumes an ArrayType(StringType) column using a function that consumes java.util.List[String]?

I have a dataframe with schema:
df.printSchema()
root
|-- _1: integer (nullable = false)
|-- _2: array (nullable = true)
| |-- element: string (containsNull = true)
Contents look like this
df.show(1)
+---+--------------------+
| _1| _2|
+---+--------------------+
| 1|[863e3434fffffff,...|
+---+--------------------+
I want to add another column called "compacted" of type array[string] that will store the results of my function below, using a UDF. The function accepts a Java List (java.util.List[String]) as input and outputs a Java List as well, but I have the function converting the output to a Scala Array, like this:
def compactf(s: java.util.List[String]) = { H3.instance.compactAddress(s).asScala.toArray }
The function works just as I expect it to, returning a Scala Array.
compactf(my_test_java_list)
res48: Array[String] = Array(863e3434fffffff, 863e3435fffffff, 863e3092fffffff, 863e3090fffffff, 863e30ba7ffffff, 863e30bafffffff, 863e356b7ffffff, 863e356a7ffffff, 863e350d7ffffff, 863e350f7ffffff, 863e35c5fffffff, 863e35c57ffffff, 863e35d8fffffff, 863e35d9fffffff, 863e3436fffffff, 863e34347ffffff, 863e34357ffffff, 863e342afffffff, 863e3428fffffff, 863e30927ffffff, 863e30907ffffff, 863e3091fffffff, 863e308e7ffffff, 863e308efffffff, 863e30bb7ffffff, 863e30b87ffffff, 863e30b8fffffff, 863e30a77ffffff, 863e30a67ffffff, 863e35697ffffff, 863e35687ffffff, 863e356afffffff, 863e35757ffffff, 863e35777ffffff, 863e350dfffffff, 863e350c7ffffff, 863e350e7ffffff, 863e3511fffffff, 863e35117ffffff, 863e35c4fffffff)
However, when I try to incorporate it into a UDF (below), it doesn't work; it fails with a serialization error (Task not serializable):
val compact2udf = udf(compactf _)
df.withColumn("compacted", compact2udf(col("_2")))
df.withColumn("compacted", compact2udf(col("_2"))).show()
org.apache.spark.SparkException: Task not serializable
What I want is:
+---+--------------------+--------------------+
| _1| _2| compacted|
+---+--------------------+--------------------+
| 1|[863e3434fffffff,...|[863e3092fffffff,...|
+---+--------------------+--------------------+
Any pointers appreciated!
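No answer was captured here, but for what it's worth, a common shape for this kind of UDF is to accept a Scala Seq (Spark hands array<string> columns to UDFs as Seq[String], not java.util.List[String]) and to obtain the non-serializable client inside the function rather than capturing it from the driver. A rough sketch, treating H3.instance from the question as a hypothetical handle to the library:
import scala.collection.JavaConverters._
import org.apache.spark.sql.functions.{col, udf}

val compactUdf = udf { (cells: Seq[String]) =>
  // resolve the client inside the UDF so nothing non-serializable is
  // captured in the task closure
  val h3 = H3.instance // hypothetical, as in the question
  h3.compactAddress(cells.asJava).asScala.toArray
}

df.withColumn("compacted", compactUdf(col("_2"))).show()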

How to parse a column that has a custom json format from a spark DataFrame

I have a Spark data frame containing a JSON column, formatted differently from the standard:
|col_name |
|{a=6236.0, b=0.0} |
|{a=323, b=2.3} |
As you can see, the JSON uses the = sign for the fields instead of :.
If I use the predefined function from_json this will yield null as the column doesn't have the standard format. Is there another way to parse this column into two separate columns?
I don't see any simple way to parse this input directly. You need to break the string apart and construct valid JSON using a UDF. Check this out:
scala> val df = Seq(("{a=6236.0, b=0.0}"),("{a=323, b=2.3} ")).toDF("data")
df: org.apache.spark.sql.DataFrame = [data: string]
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val sch1 = new StructType().add($"a".string).add($"b".string)
sch1: org.apache.spark.sql.types.StructType = StructType(StructField(a,StringType,true), StructField(b,StringType,true))
scala> def json1(x:String):String=
| {
| val coly = x.replaceAll("[{}]","").split(",")
| val cola = coly(0).trim.split("=")
| val colb = coly(1).trim.split("=")
| "{\""+cola(0)+"\":"+cola(1)+ "," + "\"" +colb(0) + "\":" + colb(1) + "}"
| }
json1: (x: String)String
scala> val my_udf = udf( json1(_:String):String )
my_udf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> df.withColumn("n1",my_udf('data)).select(from_json($"n1",sch1) as "data").select("data.*").show(false)
+------+---+
|a |b |
+------+---+
|6236.0|0.0|
|323 |2.3|
+------+---+
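As a side note, the same result can often be reached without a UDF by rewriting the string into valid JSON with regexp_replace and then applying from_json (a sketch, reusing the df and sch1 defined above):
import org.apache.spark.sql.functions.{from_json, regexp_replace}

// quote the keys and turn '=' into ':' so {a=6236.0, b=0.0} becomes {"a":6236.0, "b":0.0}
df.select(from_json(regexp_replace($"data", "(\\w+)=", "\"$1\":"), sch1) as "data")
  .select("data.*")
  .show(false)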

Elasticsearch - how to group by and count matches in an index

I have an instance of Elasticsearch running with thousands of documents. My index has 2 fields like this:
| type    | date_added              |
| walking | 2018-11-27T00:00:00.000 |
| walking | 2018-11-26T00:00:00.000 |
| running | 2018-11-24T00:00:00.000 |
| running | 2018-11-25T00:00:00.000 |
| walking | 2018-11-27T04:00:00.000 |
I want to group by the "type" field and count how many matches were found for each value, within a certain date range.
In SQL I would do something like this:
select type,
count(type)
from index
where date_added between '2018-11-20' and '2018-11-30'
group by type
I want to get something like this:
| type | count |
| running | 2 |
| walking | 3 |
I'm using the High Level REST Client API in my project. So far my query looks like this; it only filters by the start and end time:
SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
sourceBuilder.query(QueryBuilders
        .boolQuery()
        .must(QueryBuilders
                .rangeQuery("date_added")
                .from(start.getTime())
                .to(end.getTime())));
How can I do a "group by" in the "type" field? Is it possible to do this in ElasticSearch?
That's a good start! Now you need to add a terms aggregation to your query:
SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
sourceBuilder.query(QueryBuilders.boolQuery()
        .must(QueryBuilders
                .rangeQuery("date_added")
                .from(start.getTime())
                .to(end.getTime())));
// add these two lines
TermsAggregationBuilder groupBy = AggregationBuilders.terms("byType").field("type.keyword");
sourceBuilder.aggregation(groupBy);
After using Val's reply to aggregate the field, I wanted to print each aggregation bucket of my query together with its count. Here's what I did:
Terms terms = searchResponse.getAggregations().get("byType");
Collection<Terms.Bucket> buckets = (Collection<Terms.Bucket>) terms.getBuckets();
for (Terms.Bucket bucket : buckets) {
    System.out.println("Type: " + bucket.getKeyAsString() + " = Count(" + bucket.getDocCount() + ")");
}
This is the output after running the query in an index with 2700 documents with a field called "type" and 2 different types:
Type: walking = Count(900)
Type: running = Count(1800)

Apply a function or operation on a DataFrame (Java) that strips the text after the last special character

I have data coming in for the first column, 'code', of the DataFrame as below:
'101-23','23-00-11','NOV-11-23','34-000-1111-1'
and now I want the values of the 'code' column to be as below after taking the substring:
101,23-00,NOV-11,34-000-1111
The above can be achieved easily with Java code as below:
String str = "23-00-11";
int index = str.lastIndexOf("-");
String ss = str.substring(0, index);
which gives
'23-00'
How do I do this with a DataFrame, writing a UDF or applying a function to the DataFrame, with Spark 1.6.2 and Java 1.8?
I tried df.withColumn("code", substring(col("code"), 0, 1)) but couldn't find a way to get the last index. Please help.
from pyspark.sql.functions import *
newDf = df.withColumn('_c0', regexp_replace('_c0', '#', ''))\
.withColumn('_c1', regexp_replace('_c1', "'", ''))\
.withColumn('_c2', regexp_replace('_c2', '!', ''))
newDf.show()
Updated
import org.apache.spark.sql.functions._
val df11 = Seq("'101-23','23-00-11','NOV-11-23','34-000-1111-1'").toDS()
df11.show()
//df11.select(col("a"), substring_index(col("value"), ",", 1).as("b"))
val df111=df11.withColumn("value", substring(df11("value"), 0, 10))
df111.show()
Result :
+--------------------+
| value|
+--------------------+
|'101-23','23-00-1...|
+--------------------+
+----------+
| value|
+----------+
|'101-23','|
+----------+
import org.apache.spark.sql.functions._
df11: org.apache.spark.sql.Dataset[String] = [value: string]
df111: org.apache.spark.sql.DataFrame = [value: string]
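For the record, the transformation the question actually asks for (drop everything after the last '-') can be done with a single regexp_replace, no UDF needed. A sketch in Scala; the same function exists in the Java API and has been available since Spark 1.5:
import org.apache.spark.sql.functions.regexp_replace

val codes = Seq("101-23", "23-00-11", "NOV-11-23", "34-000-1111-1").toDF("code")

// "-[^-]*$" matches the last '-' and everything after it, so replacing the
// match with "" keeps the part before the last hyphen
codes.withColumn("code", regexp_replace($"code", "-[^-]*$", "")).show()
// 101, 23-00, NOV-11, 34-000-1111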

How to iterate grouped data in spark?

I have a dataset like this:
uid  group_a  group_b
1    3        unknown
1    unknown  4
2    unknown  3
2    2        unknown
I want to get the result:
uid  group_a  group_b
1    3        4
2    2        3
I tried to group the data by "uid", iterate over each group, and select the non-unknown value as the final value, but I don't know how to do it.
I would suggest you define a User Defined Aggregate Function (UDAF).
Built-in functions are great, but they are difficult to customize. If you write your own UDAF, it is fully customizable and you can adapt it to your needs.
Concerning your problem, the following can be your solution; edit it as needed.
The first task is to define the UDAF:
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{StringType, StructType}

class PingJiang extends UserDefinedAggregateFunction {

  def inputSchema = new StructType().add("group_a", StringType).add("group_b", StringType)
  def bufferSchema = new StructType().add("buff0", StringType).add("buff1", StringType)
  def dataType = StringType
  def deterministic = true

  def initialize(buffer: MutableAggregationBuffer) = {
    buffer.update(0, "")
    buffer.update(1, "")
  }

  def update(buffer: MutableAggregationBuffer, input: Row) = {
    if (!input.isNullAt(0)) {
      val groupa = input.getString(0)
      val groupb = input.getString(1)
      if (!groupa.equalsIgnoreCase("unknown")) {
        buffer.update(0, groupa)
      }
      if (!groupb.equalsIgnoreCase("unknown")) {
        buffer.update(1, groupb)
      }
    }
  }

  // keep whichever side already holds a non-empty value for each slot
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    if (buffer1.getString(0).isEmpty) buffer1.update(0, buffer2.getString(0))
    if (buffer1.getString(1).isEmpty) buffer1.update(1, buffer2.getString(1))
  }

  // emit both values so the caller can split on ","
  def evaluate(buffer: Row): String = {
    buffer.getString(0) + "," + buffer.getString(1)
  }
}
Then you call it from your main class and do some manipulation to get the result you need:
// assuming a SparkSession in scope named spark
import spark.implicits._
import org.apache.spark.sql.functions.split

val data = Seq(
  (1, "3", "unknown"),
  (1, "unknown", "4"),
  (2, "unknown", "3"),
  (2, "2", "unknown"))
  .toDF("uid", "group_a", "group_b")

val udaf = new PingJiang()

val result = data.groupBy("uid").agg(udaf($"group_a", $"group_b").as("ping"))
  .withColumn("group_a", split($"ping", ",")(0))
  .withColumn("group_b", split($"ping", ",")(1))
  .drop("ping")

result.show(false)
See the Databricks and AugmentIQ articles for a better understanding of UDAFs.
Note: the above solution keeps a non-unknown value for each group if one is present (you can always adapt it to your needs).
After you reformat the dataset as a pair RDD you can use the reduceByKey operation to find the single known value. The following example assumes that there is only one known value per uid, and otherwise returns the first known value:
val input = List(
("1", "3", "unknown"),
("1", "unknown", "4"),
("2", "unknown", "3"),
("2", "2", "unknown")
)
val pairRdd = sc.parallelize(input).map(l => (l._1, (l._2, l._3)))
val result = pairRdd.reduceByKey { (a, b) =>
val groupA = if (a._1 != "unknown") a._1 else b._1
val groupB = if (a._2 != "unknown") a._2 else b._2
(groupA, groupB)
}
The result will be a pair RDD that looks like this:
(uid, (group_a, group_b))
(1,(3,4))
(2,(2,3))
You can return to the plain line format with a simple map operation.
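For example (a quick sketch):
// flatten (uid, (group_a, group_b)) back into plain tuples
val flat = result.map { case (uid, (groupA, groupB)) => (uid, groupA, groupB) }
flat.collect.foreach(println) // (1,3,4), (2,2,3)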
You could replace all "unknown" values by null, and then use the function first() inside a map (as shown here), to get the first non-null values in each column per group:
import org.apache.spark.sql.functions.{col,first,when}
// We are only gonna apply our function to the last 2 columns
val cols = df.columns.drop(1)
// Create expression
val exprs = cols.map(first(_,true))
// Putting it all together
df.select(df.columns
.map(c => when(col(c) === "unknown", null)
.otherwise(col(c)).as(c)): _*)
.groupBy("uid")
.agg(exprs.head, exprs.tail: _*).show()
+---+--------------------+--------------------+
|uid|first(group_1, true)|first(group_b, true)|
+---+--------------------+--------------------+
| 1| 3| 4|
| 2| 2| 3|
+---+--------------------+--------------------+
Data:
val df = sc.parallelize(Array(("1","3","unknown"),("1","unknown","4"),
("2","unknown","3"),("2","2","unknown"))).toDF("uid","group_1","group_b")
