I have a lot of custom DataFrame transformations in my code.
The first group is simple casting:
dframe = dframe.withColumn("account_number", col("account").cast("decimal(38,0)"));
The second group is UDF transformations:
(UDF1<Timestamp, Integer>) s -> s.toLocalDateTime().getMonthValue()
dframe = dframe.withColumn("month", callUDF("monthExtractor", dframe.col("trans_date_t")));
They all work, so the code is tested. But my final goal is to build an ML Pipeline out of this code so I can reuse it. Is there a way to convert the code above into Transformers?
You can create your own feature transformation (with a UDF or another method) by extending Spark's Transformer and overriding its transform method to apply your own operation.
The Spark source on GitHub gives some insight into extending Transformer this way, provided you create the necessary wrapper objects.
override def transform(dataset: Dataset[_]): DataFrame = {
  transformSchema(dataset.schema, logging = true)
  val xModel = new feature.XModel()
  val xOp = udf { xModel.transform _ }
  dataset.withColumn($(outputCol), xOp(col($(inputCol))))
}
Here xModel and xOp are abstractions: the model transforms your dataset according to the operation you define.
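Applied to the Java code in the question, a minimal sketch of such a custom Transformer could look like the class below. This is only a sketch, not tested: it assumes the "monthExtractor" UDF from the question is already registered on the SparkSession, and it hard-codes the "trans_date_t" / "month" column names instead of exposing them as Params.
import java.util.UUID;

import org.apache.spark.ml.Transformer;
import org.apache.spark.ml.param.ParamMap;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

// Wraps the withColumn/UDF call from the question so it can be used as a Pipeline stage.
public class MonthExtractorTransformer extends Transformer {

    private final String uid = "monthExtractor_" + UUID.randomUUID();

    @Override
    public Dataset<Row> transform(Dataset<?> dataset) {
        // Same transformation as before, assuming "monthExtractor" is already registered
        return dataset.withColumn("month", callUDF("monthExtractor", col("trans_date_t")));
    }

    @Override
    public StructType transformSchema(StructType schema) {
        // Declare the column this stage adds
        return schema.add("month", DataTypes.IntegerType, true);
    }

    @Override
    public Transformer copy(ParamMap extra) {
        return new MonthExtractorTransformer();
    }

    @Override
    public String uid() {
        return uid;
    }
}
The casting transformations can be wrapped the same way, and the resulting stages can then be chained together in a Pipeline via setStages.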
I have a PMML model that was exported from Python and I'm using it in Spark for downstream processing. Since the JPMML Evaluator isn't serializable, I'm using it inside mapPartitions. This works fine but takes a while to complete, because mapPartitions has to materialize the iterator and collect/build the new RDD. I'm wondering if there's a more optimal way to run the Evaluator.
I've noticed that when Spark is executing this RDD, my CPU is underutilized (it drops to ~30%). Also, in the Spark UI the Task Time (GC Time) is shown in red at 53s/15s.
JavaRDD<ClassifiedPojo> classifiedRdd = toBeClassifiedRdd.mapPartitions( r -> {
  // initialize the JPMML evaluator once per partition
  List<ClassifiedPojo> list = new ArrayList<>();
  while( r.hasNext() ) {
    // classify r.next() and collect the result
    list.add( new ClassifiedPojo() );
  }
  return list.iterator();
});
Finally! I had to do 2 things.
First, I had to fix the SAX Locator by running this:
LocatorNullifier locatorNullifier = new LocatorNullifier();
locatorNullifier.applyTo(pmml);
Second, I refactored my mapPartitions to use Streams, details here.
This gave me a big boost. Hope it helps
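For reference, here is a rough sketch of what that Streams refactor can look like (my reading of the fix, not the poster's exact code): instead of filling a List and returning list.iterator(), wrap the partition iterator in a lazy Stream, so rows are classified as Spark pulls them and the whole partition is never buffered, which should also ease the GC pressure mentioned above. buildEvaluator and classify are hypothetical helpers standing in for the JPMML setup and scoring calls.
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

JavaRDD<ClassifiedPojo> classifiedRdd = toBeClassifiedRdd.mapPartitions( r -> {
  Evaluator evaluator = buildEvaluator();  // hypothetical helper: initialize the JPMML evaluator once per partition
  Stream<ClassifiedPojo> classified = StreamSupport
      .stream( Spliterators.spliteratorUnknownSize( r, 0 ), false )
      .map( record -> classify( evaluator, record ) );  // hypothetical helper: score one record
  return classified.iterator();
});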
I have a scheduler that collects our cluster metrics and writes the data to an HDFS file using an older version of the Cloudera API. But we recently updated our JARs, and the original code now fails with an exception.
java.lang.ClassCastException: org.apache.hadoop.io.ArrayWritable cannot be cast to org.apache.hadoop.hive.serde2.io.ParquetHiveRecord
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:116)
at parquet.hadoop.ParquetWriter.write(ParquetWriter.java:324)
I need help using the ParquetHiveRecord class to write the data (which are POJOs) in Parquet format.
Code sample below:
Writable[] values = new Writable[20];
... // populate values with all values
ArrayWritable value = new ArrayWritable(Writable.class, values);
writer.write(value); // <-- Getting exception here
Details of "writer" (of type ParquetWriter):
MessageType schema = MessageTypeParser.parseMessageType(SCHEMA); // SCHEMA is a string with our schema definition
ParquetWriter<ArrayWritable> writer = new ParquetWriter<ArrayWritable>(fileName,
    new DataWritableWriteSupport() {
      @Override
      public WriteContext init(Configuration conf) {
        if (conf.get(DataWritableWriteSupport.PARQUET_HIVE_SCHEMA) == null)
          conf.set(DataWritableWriteSupport.PARQUET_HIVE_SCHEMA, schema.toString());
        return super.init(conf);
      }
    });
Also, we were using CDH and CM 5.5.1 before; now we're on 5.8.3.
Thanks!
I think you need to use DataWritableWriter rather than ParquetWriter. The class cast exception indicates the write support class is expecting an instance of ParquetHiveRecord instead of ArrayWritable. DataWritableWriter likely breaks down the individual records in ArrayWritable to individual messages in the form of ParquetHiveRecord and sends each to the write support.
Parquet is sort of mind bending at times. :)
Looking at the code of the DataWritableWriteSupport class:
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java
You can see it already uses DataWritableWriter internally, so you do not need to create an instance of DataWritableWriter yourself; the idea of a write support is that you can write different formats to Parquet.
What you do need to do is wrap your writables in a ParquetHiveRecord.
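A rough sketch of that wrapping is below. It is only an illustration: ParquetHiveRecord carries the row Writable together with a StructObjectInspector describing its layout, and buildInspectorForSchema / createParquetHiveRecordWriter are hypothetical helpers you would have to supply (the inspector must match your SCHEMA, and the writer must now be typed to ParquetHiveRecord instead of ArrayWritable).
Writable[] values = new Writable[20];
... // populate values with all values
ArrayWritable row = new ArrayWritable(Writable.class, values);

StructObjectInspector inspector = buildInspectorForSchema();                 // hypothetical: inspector matching SCHEMA
ParquetWriter<ParquetHiveRecord> writer = createParquetHiveRecordWriter();   // hypothetical: same setup as before, typed to ParquetHiveRecord

writer.write(new ParquetHiveRecord(row, inspector)); // wrap the ArrayWritable before handing it to the write support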
I'm building checkboxes in Scala. I found a nice example, but it's in Java and I couldn't convert it to Scala.
This is the Java code:
Form<StudentFormData> formData = Form.form(StudentFormData.class).fill(studentData);
Scala's play.api.data.Form class doesn't have the "fill" and "form" methods that Java's play.data.Form has. How can I create a Form in Scala?
Here is a function I use to get the data from the form and create a Location object.
def add = DBAction { implicit rs =>
  val data = LocationForm.form.bindFromRequest.get
  Locations.create(Some(data.venueName), data.lat, data.lon)
  Redirect(routes.LocationController.all)
}
I'm trying to read a CSV from HDFS, parse it with Cascading, and then use the resulting tuple stream to supply the regex patterns applied to another tuple stream via RegexParser. As far as I can tell, the only way to do this is to write a custom Function of my own, and I was wondering if anybody knew how to do this with the Java API instead.
Pointers on how to write my own Function inside the Cascading framework would be welcome, too.
I'm running Cascading 2.5.1
The best resource for this question is the Palo Alto Cascading example tutorial. It's in Java and covers a lot of use cases, including writing custom functions.
https://github.com/Cascading/CoPA/wiki
And yes, writing a function that allows an input regex that references other argument inputs is your best option.
public class SampleFunction extends BaseOperation implements Function
{
  public SampleFunction( Fields fieldDeclaration )
  {
    // two incoming arguments: the regex and the text it is applied to
    super( 2, fieldDeclaration );
  }

  public void operate( FlowProcess flowProcess, FunctionCall functionCall )
  {
    TupleEntry arguments = functionCall.getArguments();
    String regex = arguments.getString( 0 );
    String text = arguments.getString( 1 );

    String parsed = someRegexOperation( regex, text ); // apply the regex to the text

    Tuple result = new Tuple();
    result.add( parsed );
    functionCall.getOutputCollector().add( result );
  }
}
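To use it, wire the function into an Each pipe; the field names here ("regex", "text", "parsed") and the upstream pipe are assumptions for illustration:
Pipe assembly = new Pipe( "parse" );
assembly = new Each( assembly,
                     new Fields( "regex", "text" ),                // argument selector handed to operate()
                     new SampleFunction( new Fields( "parsed" ) ), // declares the output field
                     Fields.ALL );                                 // keep the input fields plus the new "parsed" field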
I'm relying on an old Java API that kinda sucks and loves to throw null pointer exceptions when data is missing. I want to create a subclass that has option type accessors but preserves the old accessors until I decide I need to create safe accessors for them. Is there a good way to create a subclass from a copy of the original object? I'd like to achieve something like the following:
class SafeIssue extends Issue {
  def safeMethod: Option[Value] = { /* ... */ }
}
val issue = oldapi.getIssue()
val safeIssue = SafeIssue(issue)
//Preserves issue's methods and data if I need them
val unsafeVal = safeIssue.unsafeMethod
val maybeVal = safeIssue.safeMethod
Why not try an implicit conversion instead? That works better with Java APIs that like to create their own objects. So you would write:
class SafeIssue(issue: Issue) {
def original = issue
def safeFoo = Option(issue.foo)
// ... You must write any of these you need
}
implicit def make_issues_safe(issue: Issue) = new SafeIssue(issue)
Then, as long as you've supplied the method, you can write things like
val yay = Issue.myStaticFactoryMethodThing.safeFoo.map(x => pleaseNoNull(x))
You can then decide whether to carry SafeIssue or Issue around in your code, and you can always get the Issue back from a SafeIssue via the exposed original method (or you could make the issue parameter a val).