Feeding a Spark Dataset to a java.io.Reader in CSV format - java

I am trying to implement my own org.apache.spark.ml.Transformer, and I need to pass the contents of my org.apache.spark.sql.Dataset in CSV format to my Java library, which accepts a java.io.Reader. I am struggling here because it seems these are really two different worlds. Ideally I don't want to have to create a String out of it; I would rather stream it. At this specific step the data shouldn't be larger than about a gigabyte, though, so I guess I could make do with a String-based solution if it is absolutely needed.
In an attempt to get a string I tried something like:
class Bla(override val uid: String) extends Transformer {
  [...]
  def transform(df: Dataset[_]): DataFrame = {
    df.rdd.map(x => x.mkString(","))
  [...]
But I get several errors:
value mkString is not a member of _$1
polymorphic expression cannot be instantiated to expected type; found :
[U]org.apache.spark.rdd.RDD[U]
required: org.apache.spark.sql.DataFrame (which expands to)
org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
So any suggestions?
Edit: I have made a little outline of what I need to do at https://github.com/jonalv/spark_java_problem
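For reference, a minimal Java sketch of the String-based fallback described above: collect the rows as CSV lines on the driver and wrap them in a StringReader. The class and method names are made up, and note that mkString does no CSV quoting or escaping.

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

import java.io.Reader;
import java.io.StringReader;
import java.util.List;

public class CsvBridge {
    // Materialize the Dataset as CSV text on the driver and expose it as a Reader.
    // Not streaming, but workable for data around a gigabyte as described above.
    public static Reader asCsvReader(Dataset<Row> df) {
        List<String> lines = df
                .map((MapFunction<Row, String>) row -> row.mkString(","), Encoders.STRING())
                .collectAsList();
        return new StringReader(String.join("\n", lines));
    }
}

A more streaming-oriented variant could iterate df.toLocalIterator() and push lines through a PipedWriter/PipedReader pair, at the cost of an extra thread.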

Related

Can a YAML value string be evaluated in Java?

Is it possible to pass Java code as a value in a YAML file? For example, something like this:
---
dueDate: "DueDateCalc()"
DueDateCalc() might be a method defined in the Java code that is parsing the YAML. It would then set the Java dueDate property to the return value of the predefined DueDateCalc() method.
This is possible within the constraints of Java runtime reflection; however, you need to implement it yourself.
For example, your YAML could look like this:
---
dueDate: !call DueDateCalc
!call is a local tag telling the loading code that the scalar value DueDateCalc should be interpreted as a method to be called (the tag name is chosen by you, not something predefined). You can implement this with a custom constructor for the !call tag that searches for a method with the given name within some given class, and then calls it on some given object.
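As a rough illustration (assuming SnakeYAML 1.x; the DueDates class is hypothetical and stands for wherever your callable methods live), the custom constructor could look something like this:

import java.lang.reflect.Method;

import org.yaml.snakeyaml.constructor.AbstractConstruct;
import org.yaml.snakeyaml.constructor.Constructor;
import org.yaml.snakeyaml.nodes.Node;
import org.yaml.snakeyaml.nodes.ScalarNode;
import org.yaml.snakeyaml.nodes.Tag;

// Hypothetical class whose static methods the YAML may refer to.
class DueDates {
    public static String DueDateCalc() {
        return java.time.LocalDate.now().plusDays(30).toString();
    }
}

public class CallConstructor extends Constructor {
    public CallConstructor() {
        this.yamlConstructors.put(new Tag("!call"), new ConstructCall());
    }

    private class ConstructCall extends AbstractConstruct {
        @Override
        public Object construct(Node node) {
            // The scalar value is interpreted as the name of a static, parameterless method.
            String methodName = ((ScalarNode) node).getValue();
            try {
                Method m = DueDates.class.getMethod(methodName);
                return m.invoke(null);
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("Cannot resolve !call " + methodName, e);
            }
        }
    }
}

Loading dueDate: !call DueDateCalc with new Yaml(new CallConstructor()) would then place the method's return value under the dueDate key.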
What about parameters? Well, still possible, but it will get ugly fast. The first problem is how you define the parameters:
with nested YAML sequences: !call [MyMethod, [1, 2, 3]]
with a scalar that needs to be parsed: !call MyMethod(1, 2, 3)
The former option lets YAML parse the parameters and you'll get a list; the latter option requires you to parse the method call yourself from the string you get from YAML.
The second problem is to load the values into Java variables so that you can pass them as an argument list. Java reflection lets you get the method's parameter types, and you can use those to load the parameter values. For example, if the first parameter's type is a String, you would parse 1 as "1", while if it's an int, you can parse 1 as an int. This is possible with SnakeYAML's built-in facilities if you're using nested YAML sequences for the method call encoding.
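A hedged sketch of that conversion step, assuming the nested-sequence encoding so the arguments arrive as a list (the invokeWithArgs helper and its minimal String-only conversion rule are illustrative, not part of SnakeYAML):

import java.lang.reflect.Method;
import java.util.List;

public class ReflectiveCalls {
    // Find a method by name and arity, convert each argument to the declared
    // parameter type (only the String case is handled here), then invoke it.
    public static Object invokeWithArgs(Object target, String name, List<Object> args) throws Exception {
        for (Method m : target.getClass().getMethods()) {
            if (!m.getName().equals(name) || m.getParameterCount() != args.size()) {
                continue;
            }
            Class<?>[] types = m.getParameterTypes();
            Object[] converted = new Object[args.size()];
            for (int i = 0; i < args.size(); i++) {
                Object arg = args.get(i);
                converted[i] = types[i] == String.class ? String.valueOf(arg) : arg;
            }
            return m.invoke(target, converted);
        }
        throw new NoSuchMethodException(name + " with " + args.size() + " arguments");
    }
}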
This would even work if the parameters are class objects with complex structure; you'd just use normal YAML syntax and the objects will be loaded properly. Referring to variables in your code is not directly possible, but you could define another tag, !lookup, which retrieves values from a given Map structure.
While reflection lets you make method calls, you cannot directly evaluate an expression like 6*9. So before you try and implement anything, evaluate which functionality you need and check whether it's doable via reflection.

How to pass Row in UDF?

I am writing a UDF in Java.
I'd like to perform a more complex operation on a Dataset<Row>. For that I think I need to pass the Dataset<Row> as the input to my UDF and return the output. Here is my code:
UDF1<Dataset<Row>, String> myUDF = new UDF1<Dataset<Row>, String>() {
    public String call(Dataset<Row> input) throws Exception {
        System.out.println(input);
        return "test";
    }
};

// Register the UDF with our SQLContext
spark.udf().register("myUDF", myUDF, DataTypes.StringType);
But when I go and try to use myUDF, it seems like the callUDF function only accepts a Column, not a Dataset<Row>.
Can anyone help me with how I can pass the Dataset<Row> as an input parameter to a UDF? Is there any other way I can call my UDF in Spark SQL?
There are a few questions here.
First of all, a UDF is a function that works with (the values inside) Columns. In a sense, you could use the struct function to combine the required columns and pretend you work with an entire Dataset.
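For example, in Java the struct approach could look roughly like this (the column names a and b and the UDF name are made up):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.struct;

public class StructUdfExample {
    public static Dataset<Row> addCombined(SparkSession spark, Dataset<Row> df) {
        // The UDF receives one Row per record, containing only the columns
        // that were packed into the struct below.
        spark.udf().register("rowToString",
                (UDF1<Row, String>) row -> row.mkString("|"),
                DataTypes.StringType);

        return df.withColumn("combined",
                callUDF("rowToString", struct(col("a"), col("b"))));
    }
}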
If however you want to work with an entire Dataset, you really want a pure Java/Scala method that simply accepts the Dataset. There's not much Spark can do about it; it's simply Java/Scala programming.
There's however a very nice method that I don't see much use of, i.e. Dataset.transform:
transform[U](t: (Dataset[T]) ⇒ Dataset[U]): Dataset[U] Concise syntax for chaining custom transformations.
That allows for chaining methods that accept a Dataset, which makes for very readable code (and seems to be exactly what you want).
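In Java, the plain-method style from above can be sketched like this (method and column names are illustrative):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.upper;

public class Transformations {
    // A plain method that accepts and returns a Dataset; no UDF machinery needed.
    public static Dataset<Row> normalizeNames(Dataset<Row> ds) {
        return ds.withColumn("name", upper(col("name")));
    }

    public static Dataset<Row> dropIncomplete(Dataset<Row> ds) {
        return ds.na().drop();
    }

    public static Dataset<Row> pipeline(Dataset<Row> ds) {
        // Equivalent in spirit to ds.transform(...).transform(...) in Scala.
        return dropIncomplete(normalizeNames(ds));
    }
}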

Load class according to string

I'm quite new to Java and need some help implementing the following:
In a MongoDB instance I save the following data structure:
{
  name: String,
  type: String, // or anything else that might make this task easier
  data: [Array of objects]
}
The data structure of the data field depends on the type of the document.
The type also declares a certain class that handles populating and parsing the data.
My idea of accomplishing something like this with my current knowledge would be to have a HashMap<String, Class> and then register the type and the appropriate class.
Another way (which I don't yet know how to implement) would be to store the exact class name as the type and then resolve this string to a Class, but since the HashMap way would probably be easier, I thought I'd rather try that first.
What would be a good approach solving this problem?
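A minimal sketch of the HashMap-based registry, with Class.forName as a fallback when the type string happens to be a fully qualified class name (the DataHandler interface is made up for illustration):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical handler contract; one implementation per document type.
interface DataHandler {
    void populate(List<Object> data);
}

class HandlerRegistry {
    private static final Map<String, Class<? extends DataHandler>> REGISTRY = new HashMap<>();

    static void register(String type, Class<? extends DataHandler> handlerClass) {
        REGISTRY.put(type, handlerClass);
    }

    static DataHandler forType(String type) throws ReflectiveOperationException {
        Class<? extends DataHandler> handlerClass = REGISTRY.get(type);
        if (handlerClass == null) {
            // Fallback: treat the type string as a fully qualified class name.
            handlerClass = Class.forName(type).asSubclass(DataHandler.class);
        }
        return handlerClass.getDeclaredConstructor().newInstance();
    }
}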

Perform Linear Regression on data (from .arff file) - JAVA, Weka

I want to perform Linear Regression on a collection of data using Java. I have a couple of questions:
What data types does the linear regression method accept?
I have tried to load the data in pure nominal format as well as numeric, but when I try to pass that 'data' (an Instances variable created in the program) to Linear Regression it gives me this exception: Cannot handle Multi-Valued nominal class
How can I print the Linear Regression output to the console in Java? I'm unable to produce the code to do so. After going through the predefined LinearRegression.java class, I learned that buildClassifier() is the method that takes 'data' as input. But then I'm unable to move forward. Can anyone help me understand the sequence of steps to follow to get the output to the console?
protected static void useLinearRegression(Instances data) throws Exception {
    BufferedReader reader = new BufferedReader(new FileReader("c:\\somePath\\healthCare.arff"));
    Instances data1 = new Instances(reader);
    data1.setClassIndex(data1.numAttributes() - 1);
    LinearRegression2 rl = new LinearRegression2();
    rl.buildClassifier(data1); // What after this? or before?
}
Linear Regression should accept both nominal and numeric data types. It is simply that the target class cannot be a nominal data type.
The Model's toString() method should be able to spit out the model (other classifier options may also be required depending on your needs), but if you are also after the predictions and summaries, you may also need an Evaluation object. There, you could use toSummaryString() or toMatrixString() to obtain some other statistics about the model that was generated.
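Putting that together, a hedged sketch using the stock weka.classifiers.functions.LinearRegression (rather than the LinearRegression2 copy from the question) might look like this:

import java.io.BufferedReader;
import java.io.FileReader;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;

public class RegressionDemo {
    public static void main(String[] args) throws Exception {
        Instances data;
        try (BufferedReader reader = new BufferedReader(new FileReader("c:\\somePath\\healthCare.arff"))) {
            data = new Instances(reader);
        }
        // The class (target) attribute must be numeric for linear regression.
        data.setClassIndex(data.numAttributes() - 1);

        LinearRegression model = new LinearRegression();
        model.buildClassifier(data);
        System.out.println(model); // toString() prints the regression formula

        // Evaluate on the training data just to obtain the summary statistics.
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(model, data);
        System.out.println(eval.toSummaryString());
    }
}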
Hope this Helps!

Performance-effective way to transform XML data represented as Writable

I'm working on a utility method that converts XML data into a formatted String, and before you conclude it's a trivial task for javax.xml.transform.Transformer, let me explain the specific constraints I'm facing.
The input data does not exist at the moment the conversion starts. It is represented as a groovy.lang.Writable (javadoc) instance that I can output into any java.io.Writer instance. The method signature looks like this:
static String serializeToString(Writable source)
My current solution involves a few steps and does provide the expected result:
Create StringWriter, output source there and convert to String
Create javax.xml.transform.stream.StreamSource instance based on this string (using StringReader)
Create new StringWriter instance and wrap it into javax.xml.transform.stream.StreamResult
Perform transformation using instance of javax.xml.transform.Transformer
Convert StringWriter to String
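For reference, a straightforward Java rendering of those five steps might look like this (the indent output property is the usual way to request formatting, though the exact behaviour depends on the Transformer implementation):

import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

import groovy.lang.Writable;

public class XmlFormatting {
    static String serializeToString(Writable source) throws Exception {
        // 1. Output the Writable into a StringWriter and take the raw String.
        StringWriter raw = new StringWriter();
        source.writeTo(raw);

        // 2-4. Re-parse the String and run it through an identity transform
        //      configured to indent the output.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        StringWriter formatted = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(raw.toString())),
                              new StreamResult(formatted));

        // 5. Return the formatted String.
        return formatted.toString();
    }
}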
While the solution does work, I'm not pleased with its efficiency. This method will be used very often and I want to optimize it. What I'd like to avoid is the necessity of performing multiple conversions along the way:
From Writable to String (unformatted)
From String to StreamSource (which means that data will be parsed again)
From StreamSource to String again (formatted)
So the question is whether it's possible to build a pipe-like flow which eliminates the unnecessary conversions.
UPDATE #1:
To give a bit more context: I'm converting a GPathResult instance to a formatted string using the StreamingMarkupBuilder.bindNode() method, which produces a Writable instance. Unfortunately there is no way to tell StreamingMarkupBuilder to produce formatted output.
UPDATE #2:
I did experiment with an implementation based on PipedWriter + PipedReader, but the experiments didn't show much speed gain from this approach. It looks like this isn't that critical an issue in this case.
I don't know exactly what you mean by "XML data", but you could think of representing the "yet-to-be" stuff as a SAXSource directly, thereby bypassing the "to-string" and "parse-string" steps.
