So, I have this implementation of separate chaining hashing in Java: https://github.com/Big-data-analytics-project/Static-hashing-closed/blob/main/Static%20hashing%20closed
The next step is implementing it using Spark. I tried reading tutorials, but I'm still lost. How can I do this?
One possibility is to create a jar from your hashing implementation and register it inside the Spark application as a UDF, like this:
spark.udf.registerJavaFunction("udf_hash", "function_name_inside_jar", <returnType e.g: StringType()>)
After this, you can use it via a SQL expression, like this:
df = df.withColumn("hashed_column", expr("udf_hash({})".format("column")))
Useful links:
Register UDF to SqlContext from Scala to use in PySpark
Spark: How to map Python with Scala or Java User Defined Functions?
Important: you have to pass your jar to spark-submit using --jars.
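For reference, the second argument to registerJavaFunction is the fully qualified name of a class that implements one of the org.apache.spark.sql.api.java.UDFn interfaces. A minimal sketch of such a class (the package and class names are placeholders, and the body only stands in for your actual hashing logic):

package com.example.hashing;  // hypothetical package

import org.apache.spark.sql.api.java.UDF1;

// Sketch of a UDF class to package into the jar; replace the body with a call
// into your separate-chaining hash table (e.g. returning the bucket index).
public class HashKeyUDF implements UDF1<String, Integer> {
    @Override
    public Integer call(String key) throws Exception {
        return Math.abs(key.hashCode()) % 16;  // placeholder bucket calculation
    }
}

You would then register it with something like spark.udf.registerJavaFunction("udf_hash", "com.example.hashing.HashKeyUDF", IntegerType()).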
You can use the UDF below to achieve this:
// 1. Define the hash id calculation UDF
import java.math.BigInteger
import javax.xml.bind.DatatypeConverter
import org.apache.spark.sql.functions.udf

def calculate_hashidUDF = udf((uid: String) => {
  val md = java.security.MessageDigest.getInstance("SHA-1")
  new BigInteger(DatatypeConverter.printHexBinary(md.digest(uid.getBytes)), 16).mod(BigInteger.valueOf(10000))
})

// 2. Register the hash id calculation UDF as a Spark SQL function
spark.udf.register("hashid", calculate_hashidUDF)
If you want the raw hash value, return the md digest directly in the definition above; as written, the function returns values from 0 to 9999 (because of the mod 10000).
Once you register it as a Spark UDF, you can use hashid in spark.sql as well.
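For example (a sketch using the Java Dataset API; the table name events and column uid are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Hypothetical usage once "hashid" is registered on the same SparkSession.
Dataset<Row> hashed = spark.sql("SELECT uid, hashid(uid) AS bucket FROM events");
hashed.show();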
I am doing something like this to use a right join in a Spark application in Java:
Dataset<Row> dataset3 = dataset1.join(dataset2,
(Seq<String>) dataset1.col("target_guid"),RightOuter.sql());
But I'm getting this error:
java.lang.ClassCastException: org.apache.spark.sql.Column cannot be
cast to scala.collection.Seq
Other than this, I couldn't find a way to use joins on Datasets in Java.
Could anyone help me find a way to do this?
If you want to use the following Dataset API in Java:
def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
then convert the string list into a Seq. Keep the method below handy for converting a Java list to a Scala Seq, as most of the Spark APIs accept Scala Seqs:
import java.util.List;
import scala.collection.JavaConversions;
import scala.collection.mutable.Buffer;

<T> Buffer<T> toScalaSeq(List<T> list) {
    return JavaConversions.asScalaBuffer(list);
}
Also, you can't use RightOuter.sql() as the joinType, since it evaluates to 'RIGHT OUTER'. The supported join types include:
'inner', 'outer', 'full', 'fullouter', 'full_outer', 'leftouter', 'left', 'left_outer', 'rightouter', 'right', 'right_outer', 'leftsemi', 'left_semi', 'leftanti', 'left_anti', 'cross'
Now you can use:
Dataset<Row> dataset3 = dataset1.join(dataset2,
toScalaSeq(Arrays.asList("target_guid")), "rightouter");
You can also change your code to something like this:
Dataset<Row> dataset3 = dataset1.as("dataset1").join(dataset2.as("dataset2"),
    dataset1.col("target_guid").equalTo(dataset2.col("target_guid")), "right_outer");
My input is a Kafka stream with only one value, which is comma-separated. It looks like this:
"id,country,timestamp"
I have already split the dataset so that I have something like the following structured stream:
Dataset<Row> words = df
.selectExpr("CAST (value AS STRING)")
.as(Encoders.STRING())
.withColumn("id", split(col("value"), ",").getItem(0))
.withColumn("country", split(col("value"), ",").getItem(1))
.withColumn("timestamp", split(col("value"), ",").getItem(2));
+----+---------+----------+
|id |country |timestamp |
+----+---------+----------+
|2922|de |1231231232|
|4195|de |1231232424|
|6796|fr |1232412323|
+----+---------+----------+
Now I have a dataset with 3 columns, and I want to use the entries of each row in a custom function, e.g.:
Dataset<String> names = words.map(row -> {
    // do something with every entry of each row, e.g.
    Person person = new Person(id, country, timestamp);
    String name = person.getName();
    return name;
});
In the end I want to sink out a comma-separated string again.
A DataFrame has a schema, so you can't just call a map function on it without defining a new schema.
You can either convert to an RDD and use a map, or use a DataFrame map with an encoder.
Another option, I think, is to use Spark SQL with user-defined functions; you can read about it.
If your use case is really as simple as you are showing, doing something like:
var nameRdd = words.rdd.map(x => {f(x)})
seems like all you need.
If you still want a DataFrame, you can use something like:
val schema = StructType(Seq[StructField](StructField(dataType = StringType, name = s"name")))
val rddToDf = nameRdd.map(name => Row.apply(name))
val df = sparkSession.createDataFrame(rddToDf, schema)
P.S. A DataFrame is just a Dataset[Row].
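Since the question's pipeline is in Java, the "DataFrame map with an encoder" route could look roughly like this (a sketch, assuming the hypothetical Person class from the question takes the three string fields in its constructor):

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Map each row through the custom logic; the explicit encoder tells Spark
// how to serialize the resulting Dataset<String>.
Dataset<String> names = words.map(
        (MapFunction<Row, String>) row -> {
            Person person = new Person(
                    row.<String>getAs("id"),
                    row.<String>getAs("country"),
                    row.<String>getAs("timestamp"));
            return person.getName();
        },
        Encoders.STRING());

From there you can either write names out directly or rebuild a single comma-separated column with concat_ws before the sink.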
If you have a custom function that is not available by composing functions in the existing Spark API [1], then you can either drop down to the RDD level (as @Ilya suggested) or use a UDF [2].
Typically I'll try to use the Spark API functions on a DataFrame whenever possible, as they are generally the best optimized.
If that's not possible, I will construct a UDF:
import org.apache.spark.sql.functions.{col, udf}
val squared = udf((s: Long) => s * s)
display(spark.range(1, 20).select(squared(col("id")) as "id_squared"))
In your case you need to pass multiple columns to your UDF; you can pass them comma-separated: squared(col("col_a"), col("col_b")).
Since you are writing your UDF in Scala it should be pretty efficient, but keep in mind that if you use Python, there will generally be extra latency due to data movement between the JVM and Python.
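If you keep the rest of the pipeline in Java (as in the question), a hypothetical multi-column UDF could be registered and called like this (the name row_to_name and its body are placeholders for your own per-row logic):

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF3;
import org.apache.spark.sql.types.DataTypes;

// Register a three-argument UDF and apply it to the id/country/timestamp columns.
spark.udf().register("row_to_name",
        (UDF3<String, String, String, String>) (id, country, timestamp) ->
                id + "," + country + "," + timestamp,  // placeholder per-row logic
        DataTypes.StringType);

Dataset<Row> withName = words.withColumn("name",
        callUDF("row_to_name", col("id"), col("country"), col("timestamp")));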
[1]https://spark.apache.org/docs/latest/api/scala/index.html#package
[2]https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html
I'm using Apache Spark 2 to tokenize some text.
Dataset<Row> regexTokenized = regexTokenizer.transform(data);
It returns an array of strings.
Dataset<Row> words = regexTokenized.select("words");
Sample data looks like this:
+--------------------+
| words|
+--------------------+
|[very, caring, st...|
|[the, grand, cafe...|
|[i, booked, a, no...|
|[wow, the, places...|
|[if, you, are, ju...|
Now I want to get all the unique words. I tried a couple of filter, flatMap, map, and reduce functions, but I couldn't figure it out because I'm new to Spark.
Based on @Haroun Mohammedi's answer, I was able to figure it out in Java:
Dataset<Row> uniqueWords = regexTokenized.select(explode(regexTokenized.col("words"))).distinct();
uniqueWords.show();
I'm coming from Scala, but I do believe there's a similar way in Java.
I think in this case you have to use the explode method in order to transform your data into a Dataset of words.
This code should give you the desired results:
import org.apache.spark.sql.functions.{col, explode}

val dsWords = regexTokenized.select(explode(col("words")))
val dsUniqueWords = dsWords.distinct()
For information about the explode method, please refer to the official documentation.
Hope it helps.
Is it possible to code a UDF function that will do the following?
records = load INPUT using PigStorage() AS (vin:chararray , longString:chararray);
simpleData = foreach records generate vin , myUdfFunctionGetValue(longString , 'someKey');
Here longString has the structure "key:Value;key2:Value2,someKey:Value3....".
So I need to parse longString and get the value of the requested key. Am I going in the wrong direction, or is this possible in Pig?
You can do this easily with a Python UDF.
UDF:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

@outputSchema("output:chararray")
def key_value_parser(s, k):
    # Build a dict from "key:value" pairs separated by ';' and look up the key.
    try:
        d = dict([x.split(':') for x in s.split(';')])
        return d[k]
    except Exception:
        return None
Pig:
REGISTER '/root/path/name_of_udf.py' USING jython as udf;
data = LOAD 'input' USING PigStorage() AS (vin:chararray, longString:chararray);
parsedString = FOREACH data GENERATE udf.key_value_parser(longString, 'key3');
DUMP parsedString;
Assuming longString is of the form key1:Value1;Key2:Value2;key3:Value3; ...
Output:
(Value3)
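If you would rather keep the UDF in Java (matching the myUdfFunctionGetValue call in the question), a minimal EvalFunc sketch could look like this (the class name and the ';'/':' delimiters are assumptions; adjust them to your data):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical Java equivalent: takes (longString, key) and returns the value
// for that key, or null if the key is missing or the input is malformed.
public class GetValueForKey extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2 || input.get(0) == null) {
            return null;
        }
        String longString = (String) input.get(0);
        String wantedKey = (String) input.get(1);
        for (String pair : longString.split(";")) {
            String[] kv = pair.split(":", 2);
            if (kv.length == 2 && kv[0].equals(wantedKey)) {
                return kv[1];
            }
        }
        return null;
    }
}

You would REGISTER the jar containing this class and call it in the FOREACH exactly like the Python version above.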
I want to write a Pig script that performs a GROUP BY and generates the sum of 31 fields, but before that I need to do some custom processing, for which I wrote an eval function. I think I can make it run faster if I can include the GROUP and SUM operations in the UDF. To do this, can I use an Algebraic UDF? If yes, what would the return schemas of initial(), intermed(), and final() look like? If no, how else can I implement this? Below is my code. Thanks.
a = LOAD './a' using PigStorage('|') AS (val:int, grp1, grp2, amt1:long, amt2:long, amt3 ... amt31:long);
b = FOREACH a GENERATE myudfs.Custom(val) AS custom_val, grp1, grp2, amt1 ... amt31;
c = GROUP b BY (custom_val,grp1, grp2);
d = FOREACH c GENERATE group, SUM(b.amt1) ... SUM(b.amt31);
store d into './op';
How is it possible to perform GROUP within a UDF...?
GROUP is translated in Pig into a MapReduce job (the intermediate key of this job will be composed of custom_val, grp1, grp2).
The ability to iterate (FOREACH) through the entire list of tuples for a certain group is provided in the Reducer.
An Algebraic UDF will not "include the GROUP"; it will be executed as part of the GROUP aggregations. So I think Algebraic is not relevant here.
I guess the only optimization you might do here is to group on the original val, and to call myudfs.Custom(val) only after the GROUP.
Assuming that your UDF is an injective function.
a = LOAD './a' using PigStorage('|') AS (val:int, grp1, grp2, amt1:long, amt2:long, amt3 ... amt31:long);
c = GROUP a BY (val, grp1, grp2);
d = FOREACH c GENERATE myudfs.Custom(group.val) AS custom_val, SUM(a.amt1) ... SUM(a.amt31);
store d into './op';