I am writing a UDF in Java.
I'd like to perform a more complex operation on a Dataset<Row>. For that
I think I need to pass the Dataset<Row> as the input to my UDF and return the output. Here is my code:
UDF1<Dataset<Row>, String> myUDF = new UDF1<Dataset<Row>, String>() {
    public String call(Dataset<Row> input) throws Exception {
        System.out.println(input);
        return "test";
    }
};
// Register the UDF with our SQLContext
spark.udf().register("myUDF", myUDF, DataTypes.StringType);
But when I go and try to use myUDF, it seems that the callUDF function only accepts a Column, not a Dataset<Row>.
Can anyone help me figure out how I can pass a Dataset<Row> as an input parameter to a UDF? Is there any other way I can call my UDF in Spark SQL?
There are a few questions here.
First of all, a UDF is a function that works with (the values inside) Columns. In a sense, you could use the struct function to combine the required columns and pretend you are working with an entire Dataset.
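As a rough Java sketch of that struct approach (the column names "name" and "count", the df and spark variables, and the UDF body are placeholders for illustration, not part of the question):

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.struct;

// The UDF receives a single Row holding the selected columns, not the whole Dataset.
UDF1<Row, String> rowUDF = row -> row.getString(0) + ":" + row.get(1);
spark.udf().register("rowUDF", rowUDF, DataTypes.StringType);

// Pack the columns you need into a struct and pass that one Column to the UDF.
Dataset<Row> withCombined =
        df.withColumn("combined", callUDF("rowUDF", struct(df.col("name"), df.col("count"))));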
If, however, you want to work with an entire Dataset, you really want a pure Java/Scala method that simply accepts the Dataset. There is not much Spark can do about it; it's simply Java/Scala programming.
There's however a very nice method that I don't see much use of, i.e. Dataset.transform:
transform[U](t: (Dataset[T]) ⇒ Dataset[U]): Dataset[U] Concise syntax for chaining custom transformations.
That allows you to chain methods that accept a Dataset, which makes for very readable code (and seems to be exactly what you want).
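From Java, the plain-method version could look roughly like this (the method names and columns are made up for illustration; in Scala the same steps could be chained with .transform):

// An ordinary Java method that takes the whole Dataset and returns a new one.
static Dataset<Row> onlyRecentRows(Dataset<Row> ds) {
    return ds.filter(ds.col("year").gt(2015)); // "year" is a placeholder column
}

static Dataset<Row> withRenamedColumns(Dataset<Row> ds) {
    return ds.withColumnRenamed("nm", "name"); // placeholder columns again
}

// Chained the ordinary Java way:
Dataset<Row> result = withRenamedColumns(onlyRecentRows(df));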
I am not sure whether this is possible, and after a lot of research I ended up here to ask for your help or guidance.
So, let's say I have a JSON array that contains 10 different types of objects. The JSON is retrieved through an API for sports games.
What I need to do is filter these objects in my application. I am using Java, and so far I have concluded that I will use stream filters and predicates. I am aware that I can create different kinds of predicates and pass them to the stream.filter() function, but is it possible to do this somehow dynamically?
For example, I need to filter this array by time. That predicate would be:
return p -> p.getTime() > 1;
And then:
return match.stream().filter( predicate ).collect(Collectors.<Match>toList());
What if another filter has one more condition, such as the team name? Is it possible to somehow add the other predicate, and also add an "AND"/"OR" condition between the two? I need to do this dynamically, using one filter function with different predicates.
Is there a way to make something like a custom query, store it in a database, retrieve it, and use it like a predicate? Or is it possible for the predicate itself to be stored in a database?
If I am completely wrong about this, please guide me to find another way to do it. Otherwise, any help would be appreciated. Thank you, and happy new year to all. :)
This is an interesting problem, and I think it will not be an uncommon one either, considering data lake scenarios.
I think, as suggested in a comment above, the way to go is to build a Predicate. You can have a predicate that applies the conditions with AND or OR and then supply it to the stream processor, like this (assuming you have a base class Data to which you have mapped your API output):
/* Create the predicate with the conditions. Showing 2 here with an "AND" combination. */
Predicate<Data> p = d -> d.getTime() > 1;
p = p.and( d -> d.getName().equals( "Football" ) ); //Consider ".or()" here, if that is what you need.

/* Supply this predicate to the stream processor. */
match.stream().filter( p ).collect( Collectors.<Match>toList() );
Using an and() call is the same as calling .filter() one after the other on the stream processor. Something like this:
stream.filter(...).filter(...)...
So, you will be able to construct such a stream call in a for loop.
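For instance, here is a rough sketch of both variants; the condition list, the getTime()/getTeamName() accessors on Match, and the team name are assumptions for illustration:

import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// The conditions could be assembled at runtime from user input or configuration.
List<Predicate<Match>> conditions = Arrays.asList(
        m -> m.getTime() > 1,
        m -> "Liverpool".equals(m.getTeamName()));

// AND semantics: chain one filter() per condition in a loop...
Stream<Match> stream = match.stream();
for (Predicate<Match> condition : conditions) {
    stream = stream.filter(condition);
}
List<Match> filtered = stream.collect(Collectors.toList());

// ...or, equivalently, fold the conditions into a single predicate
// (use Predicate::or instead for OR semantics).
Predicate<Match> combined = conditions.stream().reduce(m -> true, Predicate::and);
List<Match> filteredAgain = match.stream().filter(combined).collect(Collectors.toList());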
Is there a way to make something like a custom query, store it in a database, retrieve it, and use it like a predicate? Or is it possible for the predicate itself to be stored in a database?
You could do this within your Predicate itself. That is, instead of writing the logic as shown above, you could make a database call to fetch your Java code. However, you would then have to compile it dynamically using JavaCompiler, which may be a bit complicated. You might instead consider a JVM-based scripting language like Groovy for that kind of thing.
I am trying to implement my own org.apache.spark.ml.Transformer, and I need to pass the contents of my org.apache.spark.sql.Dataset in CSV format to a Java library that accepts a java.io.Reader. I am struggling here because these really seem to be two different worlds. Ideally I don't want to have to create a string out of it; I would want to stream it. At this specific step the data shouldn't be larger than about a gigabyte, though, so I guess I could make do with a String-based solution if it is absolutely needed.
In an attempt to get a string I tried something like:
class Bla(override val uid: String) extends Transformer {
  [...]
  def transform(df: Dataset[_]): DataFrame = {
    df.rdd.map(x => x.mkString(","))
    [...]
But I get several errors:
value mkString is not a member of _$1
polymorphic expression cannot be instantiated to expected type;
 found   : [U]org.apache.spark.rdd.RDD[U]
 required: org.apache.spark.sql.DataFrame
    (which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
So any suggestions?
Edit: I have made a little outline of what I need to do at https://github.com/jonalv/spark_java_problem
I'm trying to find out whether there is a way to directly query a struct from a Spark schema derived from a dataset of rows. Is there some sort of Java equivalent of Scala's dataframe.schema("nameOfStruct")?
I've tried to find such a prebuilt function, but the only thing I could find was a way to iterate through the list of StructFields or to create an iterator. This seems really redundant when Scala provides a much easier way of doing things, especially since I don't want to loop through the fields or have to know the exact index of my desired struct.
//adding the metadata to a column
final Metadata metadata = new MetadataBuilder().putLong("metadataExample", 1).build();
final Dataset<Row> dfWithColumnMetadata = df.withColumn("column_example", df.col("column_example"), metadata);

/* Now I want to find the exact Struct and its metadata without having to loop through
   an array or create an iterator. However, the array version is the easiest way I could find.
   The con here is that I need to know the exact index of the column. */
System.out.println(dfWithColumnMetadata.schema().fields()[0].metadata().toString());
Is there a way that I could get something like Scala's df.schema("column_example").metadata() ?
I think you can use:
dfWithColumnMetadata.schema().apply("column_example").metadata()
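For context, a small sketch of how that call can be used from Java, continuing the dfWithColumnMetadata example from the question:

import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;

// Look the struct up by name instead of by index.
StructField field = dfWithColumnMetadata.schema().apply("column_example");
Metadata fieldMetadata = field.metadata();
System.out.println(fieldMetadata.getLong("metadataExample")); // prints 1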
I need to pass a relation to a UDF in Pig:
articles = load x using ...;
groupedArticles = udfs.MyUDF(articles);
Is something like this possible? Any workaround?
Thanks.
I guess you mean to pass all the fields of the relation to the UDF? Passing the relation itself would not make sense. In any case, this depends on what your load statement looks like. If you load each entry as a tuple
articles = load x using ... as (entry:(a:int, b:chararray, ...));
then you could pass that tuple to the UDF:
groupedArticles = foreach articles generate udfs.MyUDF(entry);
Passing the whole line as a tuple is probably the most generic way, but you then have to deal with a generic tuple in your UDF.
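To illustrate that last point, here is a minimal sketch of a Java UDF that receives the whole entry as one generic tuple; the field positions and types are assumptions you would adapt to your actual load schema:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class MyUDF extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        // Fields arrive positionally; cast them according to your schema.
        Integer a = (Integer) input.get(0);
        String b = (String) input.get(1);
        return a + ":" + b;
    }
}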
I want to perform linear regression on a collection of data using Java. I have a couple of questions.
What data types does the linear regression method accept?
I have tried to load the data in pure nominal format as well as numeric, but when I try to pass that 'data' (an Instances variable created in the program) to LinearRegression, it throws this exception: Cannot handle Multi-Valued nominal class.
How can I print the linear regression output to the console in Java? I'm unable to produce the code to do so. After going through the predefined LinearRegression.java class, I learned that buildClassifier() is the method that takes 'data' as input, but I'm unable to move forward from there. Can anyone help me understand the sequence of steps to follow to get the output to the console?
protected static void useLinearRegression() throws Exception {
    BufferedReader reader = new BufferedReader(new FileReader("c:\\somePath\\healthCare.arff"));
    Instances data = new Instances(reader);
    data.setClassIndex(data.numAttributes() - 1);
    LinearRegression2 rl = new LinearRegression2();
    rl.buildClassifier(data); //What after this? or before?
}
Linear regression should accept both nominal and numeric data types. It is simply that the target class cannot be a nominal attribute.
The model's toString() method should be able to print out the model (other classifier options may also be required depending on your needs), but if you are also after predictions and summaries, you may need an Evaluation object as well. There, you could use toSummaryString() or toMatrixString() to obtain other statistics about the model that was generated.
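As a rough sketch of those steps, inside a method that declares throws Exception (assuming the last attribute is the numeric target and reusing the ARFF path from the question):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

Instances data = new DataSource("c:/somePath/healthCare.arff").getDataSet();
data.setClassIndex(data.numAttributes() - 1); // the numeric target

LinearRegression lr = new LinearRegression();
lr.buildClassifier(data);

// Print the fitted model (intercept and coefficients) to the console.
System.out.println(lr);

// Optional: cross-validated statistics such as correlation and error measures.
Evaluation eval = new Evaluation(data);
eval.crossValidateModel(lr, data, 10, new Random(1));
System.out.println(eval.toSummaryString());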
Hope this helps!