Hadoop PIG Custom UDF method in foreach generate loop - java

Is it possible to write a UDF that does the following?
records = load INPUT using PigStorage() AS (vin:chararray , longString:chararray);
simpleData = foreach records generate vin , myUdfFunctionGetValue(longString , 'someKey');
Here longString has the structure "key:Value;key2:Value2;someKey:Value3...."
So I need to parse longString and get the value for the requested key. Am I going in the wrong direction, or is this possible in Pig?

You can do this easily with a Python UDF.
UDF:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

@outputSchema("output:chararray")
def key_value_parser(s, k):
    try:
        # "key1:Value1;key2:Value2;..." -> {"key1": "Value1", "key2": "Value2", ...}
        d = dict(x.split(':') for x in s.split(';'))
        return d[k]
    except Exception:
        return None
Pig:
REGISTER '/root/path/name_of_udf.py' USING jython as udf;
data = LOAD 'input' USING PigStorage() AS (vin:chararray, longString:chararray);
parsedString = FOREACH data GENERATE udf.key_value_parser(longString, 'key3');
DUMP parsedString;
Assuming longString is of the form key1:Value1;Key2:Value2;key3:Value3; ...
Output:
(Value3)
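If you would rather write the UDF in Java (as the question title suggests), a minimal EvalFunc sketch could look like the following; the package and class names are illustrative:
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class GetValue extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2 || input.get(0) == null) {
            return null;
        }
        String longString = (String) input.get(0);
        String wantedKey = (String) input.get(1);
        // Split "key1:Value1;key2:Value2;..." into pairs and return the value of the wanted key.
        for (String pair : longString.split(";")) {
            String[] kv = pair.split(":", 2);
            if (kv.length == 2 && kv[0].equals(wantedKey)) {
                return kv[1];
            }
        }
        return null;
    }
}
Build it into a jar, REGISTER the jar in the Pig script, and call it as myudfs.GetValue(longString, 'someKey') in the FOREACH ... GENERATE.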

Related

Implementing Hashing with spark

So, I have this implementation of separate chaining hashing in Java: https://github.com/Big-data-analytics-project/Static-hashing-closed/blob/main/Static%20hashing%20closed
The next step is implementing it using Spark. I tried reading tutorials but I'm still lost. How can I do this?
One possibility is to create a jar from your hashing implementation and register it inside the Spark application as a UDF, like this:
spark.udf.registerJavaFunction("udf_hash", "<fully qualified class name of the UDF inside the jar>", <return type, e.g. StringType()>)
After this, you can use it via a SQL expression, like this:
df = df.withColumn("hashed_column", expr("udf_hash({})".format("column")))
Useful links:
Register UDF to SqlContext from Scala to use in PySpark
Spark: How to map Python with Scala or Java User Defined Functions?
Important: you have to add your jar in spark-submit using --jars.
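For reference, a rough sketch of what the class inside the jar could look like: a Spark Java UDF1 that wraps the hashing code. The package/class names and the placeholder bucket logic below are made up; swap in the actual implementation from the linked repository.
package com.example.hashing;

import org.apache.spark.sql.api.java.UDF1;

public class HashUDF implements UDF1<String, Integer> {
    @Override
    public Integer call(String key) {
        // Delegate to the separate-chaining hash implementation here;
        // a plain hashCode-based bucket is only a placeholder.
        int buckets = 10000;
        return Math.floorMod(key == null ? 0 : key.hashCode(), buckets);
    }
}
It would then be registered from PySpark with something like spark.udf.registerJavaFunction("udf_hash", "com.example.hashing.HashUDF", IntegerType()).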
You can use the UDF below to achieve this:
import java.math.BigInteger
import javax.xml.bind.DatatypeConverter
import org.apache.spark.sql.functions.udf

// 1. Define the hash id calculation UDF.
def calculate_hashidUDF = udf((uid: String) => {
  val md = java.security.MessageDigest.getInstance("SHA-1")
  new BigInteger(DatatypeConverter.printHexBinary(md.digest(uid.getBytes)), 16).mod(BigInteger.valueOf(10000))
})

// 2. Register the hash id calculation UDF as a Spark SQL function.
spark.udf.register("hashid", calculate_hashidUDF)
For the raw hash value, use md directly in the def above; this function, however, will return values from 0 to 9999 (because of the mod 10000).
Once you register it as a Spark UDF, you can use hashid in spark.sql as well.
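As a usage illustration from the Java side (assuming the UDF above has already been registered under the name hashid in the same SparkSession, and that a view named events with a string column uid exists; both names are hypothetical):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HashIdExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("hashid-demo").getOrCreate();
        // "hashid" must have been registered beforehand; "events"/"uid" are placeholders.
        Dataset<Row> hashed = spark.sql("SELECT uid, hashid(uid) AS bucket FROM events");
        hashed.show();
    }
}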

How to access the entries in every row and apply custom functions?

My input is a Kafka stream with only one value, which is comma-separated. It looks like this:
"id,country,timestamp"
I already split the dataset so that I have something like the following structured stream:
Dataset<Row> words = df
.selectExpr("CAST (value AS STRING)")
.as(Encoders.STRING())
.withColumn("id", split(col("value"), ",").getItem(0))
.withColumn("country", split(col("value"), ",").getItem(1))
.withColumn("timestamp", split(col("value"), ",").getItem(2));
+----+---------+----------+
|id |country |timestamp |
+----+---------+----------+
|2922|de |1231231232|
|4195|de |1231232424|
|6796|fr |1232412323|
+----+---------+----------+
Now I have a dataset with 3 columns and I want to use the entries of each row in a custom function, e.g.
Dataset<String> names = words.map(row -> {
    // do something with every entry of each row, e.g.
    Person person = new Person(id, country, timestamp);
    String name = person.getName();
    return name;
});
In the end I want to sink out a comma-separated String again.
A DataFrame has a schema, so you can't just call a map function on it without defining a new schema.
You can either convert to an RDD and use a map, or use a DataFrame map with an encoder.
Another option, I think, is to use Spark SQL with user-defined functions; you can read about it.
If your use case is really as simple as you are showing, doing something like:
var nameRdd = words.rdd.map(x => {f(x)})
seems like all you need.
If you still want a DataFrame, you can use something like:
val schema = StructType(Seq[StructField](StructField(dataType = StringType, name = s"name")))
val rddToDf = nameRdd.map(name => Row.apply(name))
val df = sparkSession.createDataFrame(rddToDf, schema)
P.S. a DataFrame is just a Dataset[Row].
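Since the question uses the Java API, here is a minimal sketch of the "DataFrame map with an encoder" option on the words dataset from the question; Person is assumed to be the asker's own class with the constructor and getName() shown above:
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

Dataset<String> names = words.map(
    (MapFunction<Row, String>) row -> {
        // Pull the three columns out of the row, then do the custom work.
        String id = row.getAs("id");
        String country = row.getAs("country");
        String timestamp = row.getAs("timestamp");
        Person person = new Person(id, country, timestamp);
        return person.getName();
    },
    Encoders.STRING());
The second argument (Encoders.STRING()) is what gives the resulting Dataset<String> its schema.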
If you have a custom function that is not available by composing functions in the existing Spark API [1], then you can either drop down to the RDD level (as @Ilya suggested) or use a UDF [2].
Typically I'll try to use the spark API functions on a dataframe whenever possible, as they generally will be the best optimized.
If that's not possible, I will construct a UDF:
import org.apache.spark.sql.functions.{col, udf}
val squared = udf((s: Long) => s * s)
display(spark.range(1, 20).select(squared(col("id")) as "id_squared"))
In your case you need to pass multiple columns to your UDF; you can pass them comma-separated: squared(col("col_a"), col("col_b")) (a Java sketch of the multi-column case follows the references below).
Since you would be writing your UDF in Scala it should be pretty efficient, but keep in mind that if you use Python there will generally be extra latency due to data movement between the JVM and Python.
[1] https://spark.apache.org/docs/latest/api/scala/index.html#package
[2] https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html
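Building on the point about multiple columns: here is a hedged Java sketch of registering a UDF2 and calling it on two of the columns from the question. The UDF name "combine" and its body are illustrative only.
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;

SparkSession spark = SparkSession.builder().appName("udf-demo").getOrCreate();

// Register a two-argument UDF that concatenates id and country.
spark.udf().register("combine",
    (UDF2<String, String, String>) (id, country) -> id + "-" + country,
    DataTypes.StringType);

// Call it with multiple columns, comma separated.
Dataset<Row> result = words.withColumn("combined",
    callUDF("combine", col("id"), col("country")));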

How to Convert DataSet<Row> to DataSet of JSON messages to write to Kafka?

I use Spark 2.1.1.
I have the following DataSet<Row> ds1;
name | ratio | count // column names
"hello" | 1.56 | 34
(ds1.isStreaming gives true)
and I am trying to generate Dataset<String> ds2. In other words, when I write to a Kafka sink I want to write something like this:
{"name": "hello", "ratio": 1.56, "count": 34}
I have tried something like df2.toJSON().writeStream().foreach(new KafkaSink()).start(), but then it gives the following error:
Queries with streaming sources must be executed with writeStream.start()
There are to_json and json_tuple, however I am not sure how to leverage them here.
I tried the following using the json_tuple() function:
Dataset<String> df4 = df3.select(json_tuple(new Column("result"), " name", "ratio", "count")).as(Encoders.STRING());
and I get the following error:
cannot resolve 'result' given input columns: [name, ratio, count];;
tl;dr Use the struct function followed by to_json (toJSON was broken for streaming datasets due to SPARK-17029, which got fixed just 20 days ago).
Quoting the scaladoc of struct:
struct(colName: String, colNames: String*): Column Creates a new struct column that composes multiple input columns.
Given you use the Java API, you have a few variants of the struct function, too:
public static Column struct(Column... cols) Creates a new struct column.
With to_json function your case is covered:
public static Column to_json(Column e) Converts a column containing a StructType into a JSON string with the specified schema.
The following is Scala code (translating it to Java is your home exercise; a rough Java sketch is included after the output):
val ds1 = Seq(("hello", 1.56, 34)).toDF("name", "ratio", "count")
val recordCol = to_json(struct("name", "ratio", "count")) as "record"
scala> ds1.select(recordCol).show(truncate = false)
+----------------------------------------+
|record |
+----------------------------------------+
|{"name":"hello","ratio":1.56,"count":34}|
+----------------------------------------+
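Since the question uses the Java API, a rough (untested) Java equivalent of the Scala snippet above could be:
import static org.apache.spark.sql.functions.struct;
import static org.apache.spark.sql.functions.to_json;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// ds1 is the Dataset<Row> with columns name, ratio, count from the question.
Column recordCol = to_json(struct(ds1.col("name"), ds1.col("ratio"), ds1.col("count"))).as("record");
Dataset<String> ds2 = ds1.select(recordCol).as(Encoders.STRING());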
I've also given your solution a try (with Spark 2.3.0-SNAPSHOT built today) and it seems to work perfectly.
val fromKafka = spark.
  readStream.
  format("kafka").
  option("subscribe", "topic1").
  option("kafka.bootstrap.servers", "localhost:9092").
  load.
  select('value cast "string")

fromKafka.
  toJSON.             // <-- JSON conversion
  writeStream.
  format("console").  // using console sink
  start
format("kafka") was added in SPARK-19719 and is not available in 2.1.0.

How to write a pig algebraic udf for group by

I want to write a Pig script to perform a GROUP BY and generate the sum of 31 fields, but before that I need to do some custom processing, for which I wrote an eval function. I think I can make it run faster if I can include the GROUP and SUM operations in the UDF. To do this, can I use an Algebraic UDF? If yes, how would the return schemas of initial(), intermed() and final() look? If not, how else can I implement this? Below is my code; thanks.
a = LOAD './a' using PigStorage('|') AS (val:int, grp1, grp2, amt1:long, amt2:long, amt3 ... amt31:long);
b = FOREACH a GENERATE myudfs.Custom(val) AS custom_val, grp1, grp2, amt1 ... amt31;
c = GROUP b BY (custom_val,grp1, grp2);
d = FOREACH c GENERATE group, SUM(b.amt1) ... SUM(b.amt31);
store d into './op';
How is it possible to perform a GROUP within a UDF...?
GROUP is translated by Pig into a MapReduce job (the intermediate key of this job is composed of custom_val, grp1, grp2).
The ability to iterate (FOREACH) over the entire list of tuples for a certain group is provided by the reducer.
An Algebraic UDF will not "include the GROUP"; it will be executed as part of the GROUP's aggregations. So I think that Algebraic is not what you need here.
I guess the only optimization you might do here is to group on the original val and call myudfs.Custom(val) only after the GROUP, assuming your UDF is an injective function.
a = LOAD './a' using PigStorage('|') AS (val:int, grp1, grp2, amt1:long, amt2:long, amt3 ... amt31:long);
c = GROUP a BY (val, grp1, grp2);
d = FOREACH c GENERATE myudfs.Custom(group.val) AS custom_val, SUM(a.amt1) ... SUM(a.amt31);
store d into './op';
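For completeness, since the question asks what initial(), intermed() and final() would look like: below is a rough skeleton of an Algebraic UDF (this is essentially how the built-in SUM is implemented). As noted above, it runs as part of the GROUP's aggregation (map side, combiner, reducer) rather than replacing the GROUP; the class name is illustrative.
package myudfs;

import java.io.IOException;
import org.apache.pig.Algebraic;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class CustomSum extends EvalFunc<Long> implements Algebraic {

    private static final TupleFactory TF = TupleFactory.getInstance();

    // Sum the single-field tuples contained in the bag at position 0.
    private static long sumBag(Tuple input) throws IOException {
        DataBag bag = (DataBag) input.get(0);
        long sum = 0L;
        for (Tuple t : bag) {
            sum += ((Number) t.get(0)).longValue();
        }
        return sum;
    }

    @Override
    public Long exec(Tuple input) throws IOException {
        // Non-combined execution path.
        return sumBag(input);
    }

    public String getInitial()  { return Initial.class.getName(); }
    public String getIntermed() { return Intermed.class.getName(); }
    public String getFinal()    { return Final.class.getName(); }

    // Map side: emits a tuple holding a partial sum.
    public static class Initial extends EvalFunc<Tuple> {
        @Override
        public Tuple exec(Tuple input) throws IOException {
            return TF.newTuple(sumBag(input));
        }
    }

    // Combiner: merges partial sums into another partial-sum tuple.
    public static class Intermed extends EvalFunc<Tuple> {
        @Override
        public Tuple exec(Tuple input) throws IOException {
            return TF.newTuple(sumBag(input));
        }
    }

    // Reduce side: produces the final long value.
    public static class Final extends EvalFunc<Long> {
        @Override
        public Long exec(Tuple input) throws IOException {
            return sumBag(input);
        }
    }
}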

Embedded Pig in Java with Cassandra: Can't look backwards more than one token in stream

I am running a pig (0.9.1) script from Java in local mode that gets records from Cassandra (1.0.6). The script is:
rows = LOAD 'cassandra://Keyspace/Data' USING CassandraStorage() AS (key, columns: bag {T: tuple(name, value)});
cols = FOREACH rows GENERATE flatten(columns);
colnames = FOREACH cols GENERATE $0;
namegroups = GROUP colnames BY (chararray) $0;
namecounts = FOREACH namegroups GENERATE COUNT($1), group;
orderednames = ORDER namecounts BY $0;
topnames = LIMIT orderednames 50;
dump topnames;
Whenever I try to run the script, I get:
org.apache.pig.impl.logicalLayer.FrontendException: Error during parsing. can't look backwards more than one token in this stream
Interestingly, when I run a Pig script that just reads and writes the filesystem (no Cassandra), it works fine. I am using the CassandraStorage class that comes with Cassandra.
Any ideas? Thanks.
