Apache Beam / Scala Combine.perKey - Java

I'm looking for someone to provide a working example of using the Apache Beam Combine.perKey transform in Scala via the Beam Java API.
I'm running into issues with Scala/Java type interoperability.
I can't get Combine.perKey to work; I can never get the syntax right.
Example:
val sales: PCollection[KV[(Int, Int), Long]]
sales.apply(Combine.perKey[(Int,Int),Long,Long](new SumLongs()))
import org.apache.beam.sdk.transforms.SerializableFunction
class SumLongs extends SerializableFunction[Iterable[Long], Long] {
  override def apply(input: Iterable[Long]): Long = {
    var sum = 0L
    for (item <- input) {
      sum += item
    }
    sum
  }
}
It gives the error "too many type arguments for perKey". When I take the type arguments out, it states "Unspecified type parameters: OutputT".

I just needed to replace the Scala Iterable with java.lang.Iterable in the SerializableFunction signature.
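Equivalently, the same combiner can be written as a plain Java class, where the unqualified Iterable already is java.lang.Iterable; a minimal sketch with the same logic as the snippet above, not the original poster's code:
import org.apache.beam.sdk.transforms.SerializableFunction;

// Sums the Long values grouped under each key; usable with Combine.perKey from Scala or Java.
public class SumLongs implements SerializableFunction<Iterable<Long>, Long> {
    @Override
    public Long apply(Iterable<Long> input) {
        long sum = 0L;
        for (Long item : input) {
            sum += item;
        }
        return sum;
    }
}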

Related

Getting shortestPaths in GraphFrames with Java

I am new to Spark and GraphFrames.
When I wanted to learn about the shortestPaths method in GraphFrame, the GraphFrames documentation gave me sample code in Scala, but not in Java.
In their documentation, they provide the following (Scala code):
import org.graphframes.{examples,GraphFrame}
val g: GraphFrame = examples.Graphs.friends // get example graph
val results = g.shortestPaths.landmarks(Seq("a", "d")).run()
results.select("id", "distances").show()
and in Java, I tried:
import org.graphframes.GraphFrames;
import scala.collection.Seq;
import scala.collection.JavaConverters;
GraphFrame g = new GraphFrame(...,...);
Seq landmarkSeq = JavaConverters.collectionAsScalaIterableConverter(Arrays.asList((Object)"a",(Object)"d")).asScala().toSeq();
g.shortestPaths().landmarks(landmarkSeq).run().show();
or
g.shortestPaths().landmarks(new ArrayList<Object>(List.of((Object)"a",(Object)"d"))).run().show();
Casting to java.lang.Object was necessary since the API demands Seq<Object> or ArrayList<Object>, and it would not compile when I passed an ArrayList<String>.
After running the code, I saw the message:
Exception in thread "main" org.apache.spark.sql.AnalysisException: You're using untyped Scala UDF, which does not have the input type information. Spark may blindly pass null to the Scala closure with primitive-type argument, and the closure will see the default value of the Java type for the null argument, e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. To get rid of this error, you could:
1. use typed Scala UDF APIs(without return type parameter), e.g. `udf((x: Int) => x)`
2. use Java UDF APIs, e.g. `udf(new UDF1[String, Integer] { override def call(s: String): Integer = s.length() }, IntegerType)`, if input types are all non primitive
3. set spark.sql.legacy.allowUntypedScalaUDF to true and use this API with caution;
To follow option 3, I added the code:
System.setProperty("spark.sql.legacy.allowUntypedScalaUDF", "true");
but the situation did not change.
Since there is only a limited amount of sample code and Stack Overflow questions about GraphFrames in Java, I could not find any useful information while searching around.
Could anyone experienced in this area help me solve this problem?
This seems to be a bug in GraphFrames 0.8.0.
See issue #367 on GitHub.
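Independently of the bug, spark.sql.legacy.allowUntypedScalaUDF is a Spark SQL configuration, so the usual way to apply option 3 is to set it on the Spark config (via the session builder or --conf on spark-submit) rather than as a JVM system property. A minimal sketch; the app name is illustrative:
import org.apache.spark.sql.SparkSession;

// Set the legacy flag in the Spark config when the session is created.
SparkSession spark = SparkSession.builder()
    .appName("graphframes-shortest-paths")
    .config("spark.sql.legacy.allowUntypedScalaUDF", "true")
    .getOrCreate();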

Java-callable n-Sampler for Spark Dataset

I'm migrating code from Python to Java and want to build an n-Sampler for Dataset<Row>. It's been a bit frustrating: I ended up cheating and writing a very inefficient Scala function for it based on other posts. I then call the function from my Java code, but even that hasn't worked.
N-Sample behaviour:
- Select N-rows randomly from dataset
- No repetitions (no replacement)
Current Solution (broken)
import scala.util.Random
object ScalaFunctions {
  def nSample(df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row], n: Int): org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = {
    // inefficient! Shuffles entire dataframe
    val output = Random.shuffle(df).take(n)
    return output.asInstanceOf[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]]
  }
}
Error Message
Error:(6, 25) inferred type arguments [org.apache.spark.sql.Row,org.apache.spark.sql.Dataset] do not conform to method shuffle's type parameter bounds [T,CC[X] <: TraversableOnce[X]]
val output = Random.shuffle(df).take(n)
Error:(6, 33) type mismatch;
found : org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
required: CC[T]
val output = Random.shuffle(df).take(n)
I'm new to Java and Scala, so even though I understand the shuffle function doesn't seem to like Datasets, I have no idea how to fix it.
- Virtual beer if you have a solution that doesn't involve shuffling the entire dataframe (which for me could be around 4M rows) just to take a small sample (n = 250)
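One way to avoid shuffling the whole dataframe is to oversample with Dataset.sample (which neither shuffles nor collects the data) and trim to exactly n with limit. A hedged sketch in plain Java, so no Scala helper is needed; the nSample name and the 20% oversampling margin are illustrative choices:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Returns up to n roughly-random rows without replacement, without shuffling the whole dataframe.
// sample() only approximates the requested fraction, so oversample slightly and trim with limit().
public static Dataset<Row> nSample(Dataset<Row> df, int n) {
    long total = df.count();
    if (total <= n) {
        return df;
    }
    double fraction = Math.min(1.0, 1.2 * n / total); // ~20% oversampling margin
    return df.sample(false, fraction).limit(n);
}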

Is there any replacement or alternative way to use the scala.collection.TraversableOnce.mkString() method in Java 7?

I am able to use the mkString method in Scala successfully, but when I try to do the same from Java, there is no mkString method available in Java 7. Is there any way I can do the same thing in Java?
Below is my code for reference:
val records: util.List[Tuple2[Void, Array[AnyRef]]] = dataSource.collect
import scala.collection.JavaConversions._
for (record <- records) {
  println(record.f1.mkString(","))
}
You can use Arrays.deepToString to get the string representation of your array elements.
List<Tuple2<Void, Object[]>> records = dataSource.collect();
Tuple2<Void, Object[]> record = records.iterator().next();
System.out.println(Arrays.deepToString(record.f1));
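Note that deepToString wraps the result in brackets (e.g. [a, b, c]), so it is not byte-for-byte the same as mkString(","). If the exact output matters, a minimal Java 7 sketch of the equivalent join, assuming record.f1 is a flat Object[] as above:
StringBuilder sb = new StringBuilder();
for (int i = 0; i < record.f1.length; i++) {
    if (i > 0) {
        sb.append(",");  // same separator as mkString(",")
    }
    sb.append(record.f1[i]);
}
System.out.println(sb.toString());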

How to convert a Java Collection/List to a Scala seq?

I'm trying to instantiate a Kafka Scala case class from Java code, and it has the following signature:
case class OffsetFetchRequest(groupId: String,
requestInfo: Seq[TopicAndPartition],
versionId: Short = OffsetFetchRequest.CurrentVersion,
correlationId: Int = 0,
clientId: String = OffsetFetchRequest.DefaultClientId)
I'm able to send all the requested parameters, except for the Seq[TopicAndPartition].
On the Java side, I have the following code:
OffsetFetchRequest offsetFetchRequest = new OffsetFetchRequest(
"someGroup",
topicAndPartitions,
(short)1,
1,
"clientId");
As expected, a java.util.List is not compatible with a Scala Seq. However, I've tried all kinds of conversion methods in JavaConversions and JavaConverters, and I can't find anything that fits this case.
How can I create a Scala seq from a normal java.util.List or even a java.util.Collection? Or am I approaching this incorrectly?
Use scala.collection.JavaConversions.asScalaBuffer, which converts a Java list to a Scala buffer; its toList method can then be used to convert that to an immutable Seq.
Alternatively, you could use a CyclicIterator as well.
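A minimal sketch of the Java side, assuming a pre-2.13 Scala version where JavaConversions is still available (the JavaConverters form in the comment is the non-deprecated equivalent):
import scala.collection.JavaConversions;
import scala.collection.Seq;

// Wrap the Java list as a Scala Buffer, then turn it into an immutable Seq.
Seq<TopicAndPartition> requestInfo = JavaConversions.asScalaBuffer(topicAndPartitions).toList();
// Or, without the deprecated API:
// scala.collection.JavaConverters.asScalaBufferConverter(topicAndPartitions).asScala().toList();

OffsetFetchRequest offsetFetchRequest = new OffsetFetchRequest(
    "someGroup",
    requestInfo,
    (short) 1,
    1,
    "clientId");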

Filtering a Scala Set from Java

I want to filter a Scala Set from Java; below is my code.
scala.collection.immutable.Set<Member> set = cluster.state().members();
Function1<Member, UniqueAddress> filter = new AbstractFunction1<Member, UniqueAddress>() {
  public UniqueAddress apply(Member member) {
    return member.uniqueAddress();
  }
};
scala.collection.immutable.Set<UniqueAddress> set1 = set.filter(filter);
But it fails with this error:
The method filter(Function1<Member,Object>) in the type TraversableLike<Member,Traversable<Member>> is not applicable for the arguments (Function1<Member,UniqueAddress>)
How can I fix this?
After filtering a Set<Member>, you'll get a Set<Member>, not a Set<UniqueAddress>. Do you mean map? At any rate, given how Scala collections use implicits, I really wouldn't recommend working with them from Java, except by either:
- converting them to Java collections using JavaConversions first (but of course, this doesn't give you equivalents to map/filter/etc.), or
- writing a wrapper specifically for using them from Java.
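A minimal sketch of the first option, assuming the goal was really a map to UniqueAddress (Member, UniqueAddress and cluster come from the Akka API used in the question):
import java.util.HashSet;
import java.util.Set;
import scala.collection.JavaConversions;

// Convert the Scala Set to a Java Set, then do the mapping on the Java side.
Set<Member> members = JavaConversions.setAsJavaSet(cluster.state().members());
Set<UniqueAddress> addresses = new HashSet<UniqueAddress>();
for (Member member : members) {
    addresses.add(member.uniqueAddress());
}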
