How to write a pig algebraic udf for group by - java

i want to write a pig code to perform group by and generate sum of 31 fields, but before that i need to do some custom processing for which i wrote an eval function. i think i can make it run faster if i can include the GROUP and SUM operations into the UDF. To do this can i use algebraic UDF if yes how would my return schema of inital(), intermed() and final() look like, if no how else can i implement this. below is my code and thanks.
a = LOAD './a' using PigStorage('|') AS (val:int, grp1, grp2, amt1:long, amt2:long, amt3 ... amt31:long);
b = FOREACH a GENERATE myudfs.Custom(val) AS custom_val, grp1, grp2, amt1 ... amt31;
c = GROUP b BY (custom_val,grp1, grp2);
d = FOREACH c GENERATE group, SUM(b.amt1) ... SUM(b.amt31);
store d into './op';

How is it possible to perform GROUP within a UDF...?
GROUP is being translated in Pig into a MapReduce job (intermediate key of this job will be combined from custom_val,grp1, grp2).
The ability to iterate (FOREACH) through the entire list of tuples for a certain group is being done in the the Reducer.
Algebraic UDF will not "include the GROUP", but will be executed as part of the GROUP aggregations. So I think that Algebraic is not relevant here.
I guess that the only optimization that you might do here, is to group on the original val, and to call myudfs.Custom(val) only after the GROUP.
Assuming that your UDF is an injective function.
a = LOAD './a' using PigStorage('|') AS (val:int, grp1, grp2, amt1:long, amt2:long, amt3 ... amt31:long);
c = GROUP b BY (val,grp1, grp2);
d = FOREACH c GENERATE myudfs.Custom(group) AS custom_val, SUM(b.amt1) ... SUM(b.amt31);
store d into './op';

Related

Implementing Hashing with spark

So,I have this implementation of seperate chaining hashing in Java : https://github.com/Big-data-analytics-project/Static-hashing-closed/blob/main/Static%20hashing%20closed
The next step is emplementing it using spark, I tried reading tutorials but I'm still lost. How can I do this ?
One possibility is to create a jar from your hashing implementation and register it inside the Spark application as UDF like this:
spark.udf.registerJavaFunction("udf_hash", "function_name_inside_jar", <returnType e.g: StringType()>)
after this, you can use it via SQL expression, like this:
df = df.withColumn("hashed_column", expr("udf_hash({})".format("column")))
useful links:
Register UDF to SqlContext from Scala to use in PySpark
Spark: How to map Python with Scala or Java User Defined Functions?
Important you have to define your jar in spark-submit using --jars
you can use below UDF to get this achived:
#1.define hash id calculation UDF
def calculate_hashidUDF = udf((uid: String) => {
val md = java.security.MessageDigest.getInstance("SHA-1")
new BigInteger( DatatypeConverter.printHexBinary(md.digest(uid.getBytes)), 16).mod(BigInteger.valueOf(10000))
})
#2.register hash id calculation UDF as spark sql function
spark.udf.register("hashid", calculate_hashidUDF)
for direct hash value use md in above def, this function how ever will return values from 1 to 10000
once you register as spark udf then you can use hashid in spark.sql aswell.

How to access the entries in every row and apply custom functions?

My input was a kafka-stream with only one value which is comma-separated. It looks like this.
"id,country,timestamp"
I already splitted the dataset so that i have something like the following structured stream
Dataset<Row> words = df
.selectExpr("CAST (value AS STRING)")
.as(Encoders.STRING())
.withColumn("id", split(col("value"), ",").getItem(0))
.withColumn("country", split(col("value"), ",").getItem(1))
.withColumn("timestamp", split(col("value"), ",").getItem(2));
+----+---------+----------+
|id |country |timestamp |
+----+---------+----------+
|2922|de |1231231232|
|4195|de |1231232424|
|6796|fr |1232412323|
+----+---------+----------+
Now I have a dataset with 3 columns. Now i want to use the entries in each row in a custom function e.g.
Dataset<String> words.map(row -> {
//do something with every entry of each row e.g.
Person person = new Person(id, country, timestamp);
String name = person.getName();
return name;
};
In the end i want to sink out again a comma-separated String.
Data frame has a schema so you cant just call a map function on it without defining a new schema.
You can either cast to RDD and use a map , or use a DF map with encoder.
Another option is I think you can use spark SQL with user defined functions, you can read about it.
If your use case is really simple as you are showing, doing something like :
var nameRdd = words.rdd.map(x => {f(x)})
which seems like is all you need
if you still want a dataframe you can use something like:
val schema = StructType(Seq[StructField](StructField(dataType = StringType, name = s"name")))
val rddToDf = nameRdd.map(name => Row.apply(name))
val df = sparkSession.createDataFrame(rddToDf, schema)
P.S dataframe === dataset
If you have a custom function that is not available by composing functions in the existing spark API[1], then you can either drop down to the RDD level (as #Ilya suggested), or use a UDF[2].
Typically I'll try to use the spark API functions on a dataframe whenever possible, as they generally will be the best optimized.
If thats not possible I will construct a UDF:
import org.apache.spark.sql.functions.{col, udf}
val squared = udf((s: Long) => s * s)
display(spark.range(1, 20).select(squared(col("id")) as "id_squared"))
In your case you need to pass multiple columns to your UDF, you can pass them in comma separated squared(col("col_a"), col("col_b")).
Since you are writing your UDF in Scala it should be pretty efficient, but keep in mind if you use Python, in general there will be extra latency due to data movements between JVM and Python.
[1]https://spark.apache.org/docs/latest/api/scala/index.html#package
[2]https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html

Get unique words from Spark Dataset in Java

I'm using Apache Spark 2 to tokenize some text.
Dataset<Row> regexTokenized = regexTokenizer.transform(data);
It returns Array of String.
Dataset<Row> words = regexTokenized.select("words");
sample data looks like this.
+--------------------+
| words|
+--------------------+
|[very, caring, st...|
|[the, grand, cafe...|
|[i, booked, a, no...|
|[wow, the, places...|
|[if, you, are, ju...|
Now, I want to get it all unique words. I tried out a couple of filters, flatMap, map functions and reduce. I couldn't figure it out because I'm new to the Spark.
based on the #Haroun Mohammedi answer, I was able to figure it out in Java.
Dataset<Row> uniqueWords = regexTokenized.select(explode(regexTokenized.col("words"))).distinct();
uniqueWords.show();
I'm coming from scala but I do believe that there's a similar way in Java.
I think in this case you have to use the explode method in order to transform your data into a Dataset of words.
This code should give you the desired results :
import org.apache.spark.sql.functions.explode
val dsWords = regexTokenized.select(explode("words"))
val dsUniqueWords = dsWords.distinct()
For information about the explode methode please refer to the official documentation
Hope it helps.

How to remove duplicate columns after a JOIN in Pig?

Let's say I JOIN two relations like:
-- part looks like:
-- 1,5.3
-- 2,4.9
-- 3,4.9
-- original looks like:
-- 1,Anju,3.6,IT,A,1.6,0.3
-- 2,Remya,3.3,EEE,B,1.6,0.3
-- 3,akhila,3.3,IT,C,1.3,0.3
jnd = JOIN part BY $0, original BY $0;
The output will be:
1,5.3,1,Anju,3.6,IT,A,1.6,0.3
2,4.9,2,Remya,3.3,EEE,B,1.6,0.3
3,4.9,3,akhila,3.3,IT,C,1.3,0.3
Notice that $0 is shown twice in each tuple. EG:
1,5.3,1,Anju,3.6,IT,A,1.6,0.3
^ ^
|-----|
I can remove the duplicate key manually by doing:
jnd = foreach jnd generate $0,$1,$3,$4 ..;
Is there a way to remove this dynamically? Like remove(the duplicate key joiner).
Have faced the same kind of issue while working on Data Set Joining and other data processing techniques where in output the column names get repeated.
So was working on UDF which will remove the duplicates column by using schema name of that field and retaining the first unique column occurrence data.
Pre-Requisite:
Name of all the fields should be present
You need to download this UDF file and make it jar so as to use it.
UDF file location from GitHub :
GitHub UDF Java File Location
We will take the above question as example.
--Data Set A contains this data
-- 1,5.3
-- 2,4.9
-- 3,4.9
--Data Set B contains this data
-- 1,Anju,3.6,IT,A,1.6,0.3
-- 2,Remya,3.3,EEE,B,1.6,0.3
-- 3,Akhila,3.3,IT,C,1.3,0.3
PIG Script:
REGISTER /home/user/
DSA = LOAD '/home/user/DSALOC' AS (ROLLNO:int,CGPA:float);
DSB = LOAD '/home/user/DSBLOC' AS (ROLLNO:int,NAME:chararray,SUB1:float,BRANCH:chararray,GRADE:chararray,SUB2:float);
JOINOP = JOIN DSA BY ROLLNO,DSB BY ROLLNO;
We will get column name after joining as
DSA::ROLLNO:int,DSA::CGPA:float,DSB::ROLLNO:int,DSB::NAME:chararray,DSB::SUB1:float,DSB::BRANCH:chararray,DSB::GRADE:chararray,DSB::SUB2:float
For making it to
DSA::ROLLNO:int,DSA::CGPA:float,DSB::NAME:chararray,DSB::SUB1:float,DSB::BRANCH:chararray,DSB::GRADE:chararray,DSB::SUB2:float
DSB::ROLLNO:int is removed.
We need to use the UDF as
JOINOP_NODUPLICATES = FOREACH JOINOP GENERATE FLATTEN(org.imagine.REMOVEDUPLICATECOLUMNS(*));
Where org.imagine.REMOVEDUPLICATECOLUMNS is the UDF.
This UDF removes duplicate columns by using Name in schema.So DSA::ROLLNO:int is retained and DSB::ROLLNO:int is removed from the dataset.

Find out which field matched term in custom score script

I am using a custom score query with a multiMatchQuery. Ultimately what I want is simple and requires little explaination. In my Java Custom Score Script, I want to be able to find out which field a result matched to.
Example:
If I search Starbucks and a result comes back with the name Starbucks then I want to be able to know that name.basic was the field that matched my query. If I search for coffee and starbucks comes back I want to be able to know that tags was the field that matched.
Is there anyway to do this?
Search Query Code:
def basicSearchableSearch(t: String, lat: Double, lon: Double, r: Double, z: Int, bb: BoundingBox, max: Int): SearchResponse = {
val multiQuery = filteredQuery(
multiMatchQuery(t)
//Matches businesses and POIs
.field("name.basic").operator(Operator.OR)
.field("name.no_space")
//Businesses only
.field("tags").boost(6f),
geoBoundingBoxFilter("location")
.bottomRight(bb.botRight.y,bb.botRight.x)
.topLeft(bb.topLeft.y,bb.topLeft.x)
)
val customQuery = customScoreQuery(
multiQuery
)
.script("customJavaScript")
.lang("native")
.param("lat",lat)
.param("lon",lon)
.param("zoom",z)
global.Global.getClient().prepareSearch("searchable")
.setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
.setQuery(customQuery)
.setFrom(0).setSize(max)
.execute()
.actionGet();
}
It's only simple for simple queries. On complex queries, the question which field matched is actually quite nontrivial. So, I cannot think of any efficient way to do it.
Perhaps, you could consider moving your custom score calculation closer to the match. The multi_match query is basically a shortcut for a set of match queries on the same query string combined by a dis_max query. So, you are currently building something like this:
custom_score(
filtered(
dis_max(match_1, match_2, match_3)
)
)
What you can do is to move your custom_score under dis_max and build something like this:
filtered(
dis_max(
custom_score_1(match_1),
custom_score_2(match_2),
custom_score_3(match_3)
)
)
Obviously, this will be a somewhat different query, since dis_max will operate on custom score instead of original score.

Categories