Java-callable n-Sampler for Spark Dataset

I'm migrating code from Python to Java and want to build an n-Sampler for Dataset<Row>. It's been a bit frustrating: I ended up cheating and writing a very inefficient Scala function for it based on other posts, which I then call from my Java code, but even that hasn't worked.
N-Sample behaviour:
- Select N-rows randomly from dataset
- No repetitions (no replacement)
Current Solution (broken)
import scala.util.Random

object ScalaFunctions {
  def nSample(df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row], n: Int): org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = {
    // inefficient! Shuffles the entire dataframe
    val output = Random.shuffle(df).take(n)
    return output.asInstanceOf[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]]
  }
}
Error Message
Error:(6, 25) inferred type arguments [org.apache.spark.sql.Row,org.apache.spark.sql.Dataset] do not conform to method shuffle's type parameter bounds [T,CC[X] <: TraversableOnce[X]]
val output = Random.shuffle(df).take(n)
Error:(6, 33) type mismatch;
found : org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
required: CC[T]
val output = Random.shuffle(df).take(n)
I'm new to Java and Scala, so even though I understand the shuffle function doesn't seem to like Datasets, I have no idea how to fix it.
- Virtual beer if you have a solution that doesn't involve shuffling the entire dataframe (for me, this could be like 4M rows) for a small n sample (250)
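For what it's worth, one possible approach (a sketch of my own, not from the original post): oversample with Dataset.sample using a fraction slightly above n / count, then trim to exactly n with limit. sample draws rows independently per partition, so the whole DataFrame is never shuffled. The 1.2 oversampling factor is an arbitrary assumption, and the result can occasionally contain slightly fewer than n rows.

import org.apache.spark.sql.{Dataset, Row}

object ScalaFunctions {
  // Sketch: approximate n-sample without shuffling the whole DataFrame.
  // sample() keeps each row with probability `fraction`, per partition;
  // limit(n) then trims the result down to at most n rows.
  def nSample(df: Dataset[Row], n: Int): Dataset[Row] = {
    val total = df.count()
    if (total <= n) {
      df
    } else {
      // oversample a little so the random draw usually yields at least n rows
      val fraction = math.min(1.0, n.toDouble / total * 1.2)
      df.sample(withReplacement = false, fraction = fraction).limit(n)
    }
  }
}

Like the object above, this is callable from Java as ScalaFunctions.nSample(df, 250).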

Related

groovy.lang.MissingMethodException: No signature of method: java.lang.String.name() is applicable for argument types: () values: []

I am new to Groovy scripting. I have a requirement to upgrade my project from Java 1.7 to Java 8; after upgrading both Java and Groovy I am facing an issue with the Groovy syntax.
In Java 7 it worked without any issues, but in Java 8 I am facing the error below.
The error occurs while I am trying to find a specific tag element, as shown below:
def tradeString = msg."**".find{it.name() == "m__tradeString"}
groovy.lang.MissingMethodException: No signature of method: java.lang.String.name() is applicable for argument types: () values: []
Possible solutions: take(int), take(int), any(), any(groovy.lang.Closure), wait(), dump()
Note: in some places the name() method works, but in other places I get this error.
Please help me here.
Thanks in advance.
After discussion in the comments...
For now I can see you are iterating through a collection of groovy.util.Node objects, and calling it.name() is valid for those.
But the initial error message in your question also shows a String object, so you probably have at least two different types (String and Node), and you have to handle both.
def tradeString = msg."**".find {
    String expected = 'm__tradeString'
    String actual = it instanceof groovy.util.Node ? it.name() : it
    actual == expected
}
This code can be shortened, of course.
For cases like this I suggest inspecting the initial collection in a debugger, or printing each object's .getClass() and .toString(), so it becomes clearer what you are dealing with.
groovy.lang.MissingMethodException: No signature of method: java.lang.String.name() is applicable for argument types: () values: []
This means that the method name() was invoked on a java.lang.String, but that class has no such method.
Less code version:
// this method can be reused
static String extractNodeName(node) {
    node instanceof groovy.util.Node ? node.name() : node
}

def tradeString = msg."**".find { extractNodeName(it) == 'm__tradeString' }
It happened to me when I used a variable with the same name as a method.
Example:
A function with an sshCommand String parameter, inside which I called:
sshCommand remote: remote, command: fullCommand
(fullCommand was sshCommand with some flags I added).
So I had a String, and then I called the method of the same name, which is not a String, and that gave the error.

Getting shortestPaths in GraphFrames with Java

I am new to Spark and GraphFrames.
When I wanted to learn about the shortestPaths method in GraphFrame, the GraphFrames documentation gave me sample code in Scala, but not in Java.
In their documentation they provide the following Scala code:
import org.graphframes.{examples,GraphFrame}
val g: GraphFrame = examples.Graphs.friends // get example graph
val results = g.shortestPaths.landmarks(Seq("a", "d")).run()
results.select("id", "distances").show()
and in Java, I tried:
import org.graphframes.GraphFrames;
import scala.collection.Seq;
import scala.collection.JavaConverters;
GraphFrame g = new GraphFrame(...,...);
Seq landmarkSeq = JavaConverters.collectionAsScalaIterableConverter(Arrays.asList((Object)"a",(Object)"d")).asScala().toSeq();
g.shortestPaths().landmarks(landmarkSeq).run().show();
or
g.shortestPaths().landmarks(new ArrayList<Object>(List.of((Object)"a",(Object)"d"))).run().show();
Casting to java.lang.Object was necessary since the API demands Seq<Object> or ArrayList<Object>, and I could not get the code to compile when passing an ArrayList<String>.
After running the code, I saw the message:
Exception in thread "main" org.apache.spark.sql.AnalysisException: You're using untyped Scala UDF, which does not have the input type information. Spark may blindly pass null to the Scala closure with primitive-type argument, and the closure will see the default value of the Java type for the null argument, e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. To get rid of this error, you could:
1. use typed Scala UDF APIs(without return type parameter), e.g. `udf((x: Int) => x)`
2. use Java UDF APIs, e.g. `udf(new UDF1[String, Integer] { override def call(s: String): Integer = s.length() }, IntegerType)`, if input types are all non primitive
3. set spark.sql.legacy.allowUntypedScalaUDF to true and use this API with caution;
To follow option 3, I added this line:
System.setProperty("spark.sql.legacy.allowUntypedScalaUDF","true");
but the situation did not change.
Since there is only a limited amount of sample code and few Stack Overflow questions about GraphFrames in Java, I could not find any useful information while searching around.
Could anyone experienced in this area help me solve this problem?
This seems to be a bug in GraphFrames 0.8.0.
See Issue #367 on GitHub.
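As a side note (my own addition, not part of the answer): Spark SQL flags such as spark.sql.legacy.allowUntypedScalaUDF are normally applied through the Spark configuration rather than a JVM system property. A sketch, shown in Scala like the documentation snippet above; the same config call exists on the Java SparkSession builder:

import org.apache.spark.sql.SparkSession

// set the legacy flag when the session is created
val spark = SparkSession.builder()
  .appName("graphframes-shortest-paths")  // hypothetical app name
  .config("spark.sql.legacy.allowUntypedScalaUDF", "true")
  .getOrCreate()

// or on an already running session
spark.conf.set("spark.sql.legacy.allowUntypedScalaUDF", "true")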

Apache Beam / Scala Combine.perKey

I'm looking for a working example of the Apache Beam Combine.perKey transform in Scala, using the Beam Java API.
I'm running into issues with Scala/Java type interoperability.
I can't get Combine.perKey to compile; I can never get the syntax right.
Example:
val sales: PCollection[KV[(Int, Int), Long]]
sales.apply(Combine.perKey[(Int, Int), Long, Long](new SumLongs()))
import org.apache.beam.sdk.transforms.SerializableFunction

class SumLongs extends SerializableFunction[Iterable[Long], Long] {
  override def apply(input: Iterable[Long]): Long = {
    var sum = 0L
    for (item <- input) {
      sum += item
    }
    sum
  }
}
It gives the error "too many type arguments for perKey". When I take the type arguments out, it states "Unspecified type parameters: OutputT".
I just needed to change the Scala Iterable to java.lang.Iterable.
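Putting that together, a sketch of what the corrected code might look like (the asScala conversion and the two explicit type arguments on perKey are my own additions, not spelled out in the answer; sales is the PCollection[KV[(Int, Int), Long]] from the question):

import org.apache.beam.sdk.transforms.{Combine, SerializableFunction}
import scala.collection.JavaConverters._

class SumLongs extends SerializableFunction[java.lang.Iterable[Long], Long] {
  override def apply(input: java.lang.Iterable[Long]): Long = {
    var sum = 0L
    // the Java Iterable needs asScala before it can drive a Scala for comprehension
    for (item <- input.asScala) {
      sum += item
    }
    sum
  }
}

// this perKey overload takes only the key and value types, which is why
// passing three type arguments produced "too many type arguments for perKey"
val totals = sales.apply(Combine.perKey[(Int, Int), Long](new SumLongs()))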

Java 8 Stream from scala code

I'm trying to use a Java 8 Stream from Scala code as below, and I am stuck on a compilation error.
any help appreciated!!
def sendRecord(record: String): Unit throws Exception
bufferedReader.lines().forEach(s => sendRecord(s))
Cannot resolve forEach with such signature, expect: Consumer[_ >: String], actual: (Nothing)
PS: though there is some indication that it should be almost straightforward, like https://gist.github.com/adriaanm/892d6063dd485d7dd221, it doesn't seem to work. I'm running Scala 2.11.8.
You can convert the stream to an iterator to iterate over the Java Stream, like:
import scala.collection.JavaConverters._
bufferedReader.lines().iterator().asScala.foreach(s => sendRecord(s))
Look at the top of the file you linked in your question. It mentions the -Xexperimental flag. If you run the Scala compiler or REPL with this flag, Scala functions will be translated to their Java equivalents. Another option is to just pass in a Java function manually:
scala> java.util.stream.Stream.of("asd", "dsa").map(new java.util.function.Function[String, String] { def apply(s: String) = s + "x" }).toArray
res0: Array[Object] = Array(asdx, dsax)
You can also create an (implicit) conversion to wrap Scala functions for you.
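For example, a sketch of what such an implicit conversion could look like (my own illustration, not from the original answer); annotating the lambda's parameter type lets the compiler apply it:

import java.util.function.Consumer
import scala.language.implicitConversions

// wraps a plain Scala function in a java.util.function.Consumer
implicit def toConsumer[A](f: A => Unit): Consumer[A] =
  new Consumer[A] {
    override def accept(a: A): Unit = f(a)
  }

bufferedReader.lines().forEach((s: String) => sendRecord(s))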
You can also wait for Scala 2.12; with that version you won't need the flag anymore.
Update
As Scala 2.12 is out, the code in the question now compiles normally.
The problem is that the expression in Scala does not automatically implement the expected Java functional interface Consumer.
See these questions for details on how to solve it; in Scala 2.12 it will probably work without conversions:
"Lambdifying" scala Function in Java
Smooth way of using Function<A, R> java interface from scala?
In Scala 2.12 you can work with Java streams very easily:
import java.util.Arrays

val stream = Arrays.stream(Array(1, 2, 3, 4, 6, 7))

stream
  .map {
    case i: Int if i % 2 == 0 => i * 2
    case i: Int if i % 2 == 1 => i * 2 + 2
  }
  .forEach(println)

Expression of type Pointer[Float] doesn't conform to expected type Pointer[Float] in Scala

I am new to Scala and I have been trying to use the bridj library, which is implemented in Java.
Here is the code (allocateFloats is a method of the class org.bridj.Pointer):
import org.bridj.Pointer
import org.bridj.Pointer._

class EstimatingPi {
  def main(args: Array[String]) {
    val n: Int = 1024
    val aPtr: Pointer[Float] = allocateFloats(n)
  }
}
This results in "Expression of type Pointer[Float] doesn't conform to expected type Pointer[Float]". But if I don't specify the type of aPtr, as shown below, the code compiles.
val aPtr = allocateFloats(n)
I tried to find the solution online but the questions are mostly like "Expression of type someClass[T1] doesn't conform to expected type someClass[T2]". But in my case, they are the same type.
I would really appreciate any help.
One of them is probably a java.lang.Float, while the other is scala.Float. You would probably need:
val aPtr: Pointer[java.lang.Float] = allocateFloats(n)
If it is not obvious from the allocateFloats documentation, you can find out what type it is by doing something like
allocateFloats(0).getClass.getName
which will give you the full name with a [L before and ; after, e.g.
[Ljava.lang.Float;
