Pig: pass relation as argument to UDF - Java

I need to pass a relation to a UDF in Pig:
articles = load x using ...;
groupedArticles = udfs.MyUDF(articles);
Is something like this possible? Any workaround?
Thanks.

I guess you mean to pass all fields of the relation to the UDF? Passing the relation itself would not make sense. In any case, this depends on what your load statement looks like. If you load each entry as a tuple, e.g. load x using ... as (entry:(a:int, b:chararray, ...)), then you can pass that to the UDF: groupedArticles = foreach articles generate udfs.MyUDF(entry). Passing the whole line as a tuple is probably the most generic way, but then you have to deal with a generic tuple inside your UDF.
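To make that concrete, here is a minimal sketch of what such a UDF could look like on the Java side, assuming the whole entry arrives as one generic tuple (the class name and field positions are illustrative, not taken from the question):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class MyUDF extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        // With a generic tuple the UDF has to pick fields out by
        // position and cast them itself.
        Integer a = (Integer) input.get(0);
        String b = (String) input.get(1);
        return b + ":" + a;
    }
}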

Related

How to construct predicates dynamically in Java

I am not sure whether this is possible, and after a lot of research I ended up here to ask for your help or guidance.
So, let's say I have a JSON array that has 10 different types of objects inside it. The JSON is retrieved through an API for sports games.
What I need to do is filter through these objects in my application. I am using Java, and so far I have concluded that I will use stream filters and predicates. I am aware that I can create different types of predicates and put them in the stream.filter() function, but is it possible to do this dynamically?
For example, I need to filter this array by time. This predicate will be
return p -> p.getTime() > 1;
And then:
return match.stream().filter( predicate ).collect(Collectors.<Match>toList());
What if another filter has one more condition, such as team name? Is it possible to somehow add the other predicate, and also add an "AND"/"OR" condition between the two? I need to do this dynamically, using one filter function with different predicates.
Is there a way to make something like a custom query, store it in a database, retrieve it, and use it as a predicate? Or is it possible to store the predicate itself in a database?
If I am completely wrong about this, please guide me toward another way to do it. Otherwise, any help would be appreciated. Thank you and happy new year to all. :)
This is an interesting problem, and I think it will not be an uncommon one, considering data-lake scenarios.
I think, as suggested in a comment above, the way to go is to have a Predicate. You may have a predicate that applies the conditions as AND or OR and then supply it to the stream processor, like this (assuming that you have a base class Data to which you have mapped your API output):
/* Create the predicate with the conditions. Two are shown here, combined with "and". */
Predicate<Data> p = d -> d.getTime() > 1;
p = p.and( d -> d.getName().equals( "Football" ) ); //Use .or() here instead, if that is what you need. Note that and() returns a new predicate, so the result must be reassigned.
/* Supply this predicate to the stream processor. */
match.stream().filter( p ).collect( Collectors.<Match>toList() );
Using an and() call is the same as calling .filter() one after the other on the stream processor. Something like this:
stream.filter(...).filter(...)...
So, you will be able to construct such a stream call in a for loop.
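For a fully dynamic case, you can also collect the conditions into a list and reduce them into a single predicate. A minimal sketch (the Data accessors are assumed, as above):

import java.util.List;
import java.util.function.Predicate;

List<Predicate<Data>> conditions = List.of(
        d -> d.getTime() > 1,
        d -> d.getName().equals("Football")
);

// Reduce with Predicate::and; use Predicate::or for an "OR" combination.
// The identity d -> true means an empty list filters nothing out.
Predicate<Data> combined = conditions.stream()
        .reduce(d -> true, Predicate::and);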
Is there a way to make something like a custom query to store it in a database and retrieve it and use it like a predicate? Or the predicate itself is it possible to be stored in a database?
You may do this within your Predicate itself. That is, instead of writing the logic inline as shown above, you could make a database call that fetches Java code. However, you would then have to compile it dynamically using the JavaCompiler API, which may be a bit complicated. Alternatively, you may consider a JVM-based scripting language like Groovy for such things.
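If dynamic compilation feels too heavy, a simpler variant is to store each condition as plain data (field, operator, value) and map it back to a predicate at runtime. A minimal sketch under that assumption (the FilterRule shape and the field names are hypothetical, not from the question):

import java.util.function.Predicate;

// Hypothetical shape of a filter row as stored in the database.
record FilterRule(String field, String op, String value) {}

static Predicate<Data> toPredicate(FilterRule rule) {
    switch (rule.field()) {
        case "time": {
            long t = Long.parseLong(rule.value());
            return ">".equals(rule.op()) ? d -> d.getTime() > t
                                         : d -> d.getTime() < t;
        }
        case "name":
            return d -> d.getName().equals(rule.value());
        default:
            throw new IllegalArgumentException("Unknown field: " + rule.field());
    }
}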

Java - Is there a way to query an Apache Spark schema without iterating?

I'm trying to find out if there's a way to directly query a struct from a Spark schema derived from a dataset of rows. Is there some sort of Java equivalent to the Scala-provided dataframe.schema("nameOfStruct")?
I've tried finding such a prebuilt function, but the only thing I could find was a way to iterate through a list of Structs or make an iterator. This seems really redundant when Scala provides a much easier way of doing things, especially if I don't want to check through a loop or find the exact index of my desired Struct.
//adding the metadata to a column
final Metadata metadata = new MetadataBuilder().putLong("metadataExample", 1).build();
final Dataset<Row> dfWithColumnMetadata = df.withColumn("column_example", df.col("column_example"), metadata);
/*now I want to find the exact Struct and its metadata without having to loop through
an array or create an iterator. However, the array version is the easiest way I could find.
The con here is that I need to know the exact index of the column.*/
System.out.println(dfWithColumnMetadata.schema().fields()[0].metadata().toString());
Is there a way that I could get something like Scala's df.schema("column_example").metadata() ?
I think you can use:
dfWithColumnMetadata.schema().apply("column_example").metadata()
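For completeness, here is how that reads in Java, building on the question's snippet. StructType.apply returns the StructField directly, so no iteration or index lookup is needed:

import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;

// apply() throws IllegalArgumentException if the field name is unknown.
StructField field = dfWithColumnMetadata.schema().apply("column_example");
Metadata md = field.metadata();
System.out.println(md.getLong("metadataExample")); // 1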

How to pass Row in UDF?

I am writing a UDF in Java.
I'd like to perform a more complex operation on a Dataset<Row>. For that, I think I need to pass the Dataset<Row> as the input to my UDF and return the output. Here is my code:
UDF1<Dataset<Row>, String> myUDF = new UDF1<Dataset<Row>, String>() {
    public String call(Dataset<Row> input) throws Exception {
        System.out.println(input);
        return "test";
    }
};
// Register the UDF with our SQLContext
spark.udf().register("myUDF", myUDF, DataTypes.StringType);
But when I go and try to use myUDF, it seems that the callUDF function only accepts a Column, not a Dataset<Row>.
Can anyone help with how I can pass a Dataset<Row> as an input parameter to a UDF? Is there any other way I can call my UDF in Spark SQL?
There are a few questions here.
First of all, a UDF is a function that works with (the values inside) Columns. In a sense, you could use the struct function to combine the required columns and pretend you are working with an entire Dataset.
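A minimal sketch of that struct approach (the column names and the DataFrame df are assumptions for illustration):

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.struct;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// The UDF receives the combined columns as a single Row.
UDF1<Row, String> rowUDF = row -> row.getString(0) + "/" + row.getInt(1);
spark.udf().register("rowUDF", rowUDF, DataTypes.StringType);

// struct(...) packs the columns into one value that the UDF sees as a Row.
df.withColumn("combined", callUDF("rowUDF", struct(df.col("title"), df.col("count"))));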
If, however, you want to work with an entire Dataset, you really want a pure Java/Scala method that simply accepts the Dataset. There's not much Spark can do about it; it's simply Java/Scala programming.
There's however a very nice method that I don't see much use of, i.e. Dataset.transform:
transform[U](t: (Dataset[T]) ⇒ Dataset[U]): Dataset[U] Concise syntax for chaining custom transformations.
That allows for chaining methods that accept a Dataset, which makes for very readable code (and seems to be exactly what you want).
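In Java, the same effect is easy to get with a plain method that takes and returns a Dataset. A minimal sketch (the transformation itself is just an assumed example):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.lit;

public class Transformations {
    // A plain Java method: no UDF registration is needed to work on a whole Dataset.
    public static Dataset<Row> withSource(Dataset<Row> ds) {
        return ds.withColumn("source", lit("api"));
    }
}

// Usage is then ordinary method composition instead of callUDF:
//   Dataset<Row> result = Transformations.withSource(df);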

Java ORM: modify record in generic way

Which Java ORM supports non-type-safe modification of objects?
I want to modify objects / records in a generic way, where the field name is a string input parameter and the value is a generic object parameter. Do you know of something like this?
For example, Core Data on iOS can do this through key-value coding (setValue:forKey:).
I went through the ORMLite tutorial and realized that I need to get the appropriate Dao to insert an item.
This is exactly how ActiveJDBC works. Check it out at http://javalite.io.
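A minimal sketch of what that looks like with ActiveJDBC (the model and column names are made up for illustration; ActiveJDBC models also require its build-time instrumentation step):

import org.javalite.activejdbc.Base;
import org.javalite.activejdbc.Model;

public class Article extends Model {}

// Usage: attributes are addressed by name, no typed setters required.
Base.open("org.h2.Driver", "jdbc:h2:mem:test", "sa", "");
Article article = new Article();
article.set("title", "Generic updates");  // field name as a String, value as an Object
article.set("views", 42);
article.saveIt();
Base.close();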

How to avoid a large if-else statement in Java

I'm developing a framework in Java which relies on a number of XML files with a large number of parameters.
When reading the parameters from the XML file, I have to have a large if-else statement to decide what each parameter is and then call the appropriate method.
Is this normal? To have a large if-else statement?
I am thinking that there is a simpler and neater way of doing this, e.g. Java XML mapping or Java reflection. Is this the answer? If so, can you please provide examples of how this is done, so I don't have to rely on a large if-else statement?
Thanks!
You want to first create an interface:
public interface XMLParameterHandler {
    void handle_parameter(String xmlData);
}
Next you want to create a map:
private Map<String, XMLParameterHandler> handlers;
...and initialize it with one of the relevant Map implementations:
this.handlers = new HashMap<>();
You need to implement the interface on a number of classes, one for each parameter you intend to handle. This is a good use of inner classes. Insert each of these implemented handlers into the map:
handlers.put ("Param1", new XMLParam1HandlerImpl());
handlers.put ("Param2", new XMLParam2HandlerImpl());
Then you can call the handler from the xml processing loop:
handlers.get(paramValue).handle_parameter(xmlData);
There is JAXB (http://en.wikipedia.org/wiki/Java_Architecture_for_XML_Binding) for mapping java class to xml.
But you can't map methods with it: you can only map attributes to XML file values (deserializing parameters from the XML).
I recommend using a Map that has the parameter as the key and the XML entry as the value (not the whole XML).
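For the JAXB route, a minimal sketch of mapping a parameters element onto a class (the element and field names are assumptions):

import java.io.StringReader;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement(name = "parameters")
public class Parameters {
    public int timeout;   // <timeout>30</timeout>
    public String mode;   // <mode>fast</mode>
}

// Usage: JAXB fills the fields, so no if-else over parameter names is needed.
Parameters p = (Parameters) JAXBContext.newInstance(Parameters.class)
        .createUnmarshaller()
        .unmarshal(new StringReader(
                "<parameters><timeout>30</timeout><mode>fast</mode></parameters>"));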
Reflection would be one approach. Perhaps combined with a custom annotation on the target method to indicate which parameter to pass to that method. This is an advanced technique, though.
A more standard technique would be to use a map, where the key is the attribute name, and the value is an instance of an implementation of some interface you define, like AttributeHandler. The implementations then contain the code for each attribute. This involves writing a lot of little classes, but you can do them as anonymous classes to save space and keep the code inline.
a large if-else statement to decide what the parameters is and then call appropriate methods
You could instead use the Strategy design pattern, with one Strategy object per parameter, and use a map from the parameter name to the Strategy object to use. I've found this approach useful for even a moderately complicated application of XML.
It sounds to me as if you want a data-driven rule-based approach to writing your application, rather like you get in XSLT. One way of achieving this is to write it in XSLT instead of Java - XSLT, after all, was specifically designed for processing XML, while Java wasn't. If you can't do that, you could study how XSLT does it using rules and actions, and emulate this design in your Java code.
N functions with M parameters can always be implemented with a single function with M + 1 parameters.
If you need a big if then else statement to decide which method to dispatch to, then you can just add a parameter to your method and call a single method.
You shouldn't need an if-then-else statement to bind the parameter values.
If there is complex logic dependent on the particular parameter values, you might use a table-driven approach. You can map various combinations of parameter values into equivalence classes, then map the various equivalence-class combinations to a row in a table with a unique id, and then have a switch statement based on that unique id.
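A minimal sketch of that table-driven idea (the equivalence classes, action ids, and the readModeFromXml() accessor are invented for illustration):

import java.util.Map;

// Table mapping parameter-value equivalence classes to action ids.
Map<String, Integer> actionIds = Map.of(
        "mode=fast", 1,
        "mode=safe", 2
);

String modeValue = readModeFromXml(); // hypothetical accessor for the parsed value
int id = actionIds.getOrDefault("mode=" + modeValue, 0);
switch (id) {
    case 1: /* configure for speed */  break;
    case 2: /* configure for safety */ break;
    default: /* fall back to defaults */ break;
}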
