I want to use an RTree like https://github.com/davidmoten/rtree2 as a Spark broadcast variable.
However, this fails with java.io.NotSerializableException: com.github.davidmoten.rtree2.RTree, i.e. it is not supported.
Is there a workaround? https://github.com/davidmoten/rtree2 suggests:
// deserialize the entries from disk (for example)
List<Entry<Thing, Point>> entries = ...
// bulk load
RTree<Thing, Point> tree = RTree.maxChildren(28).star().create(entries);
But I do not know how to fit this into the context of the broadcast variable. That is, I certainly could broadcast the list of entries, but I do not know where to hook into the initialization of the RTree on all the executors when using a UDF (a sketch of what I have in mind follows the failing example below).
Certainly it should be possible via mapPartitions, but I would by far prefer the UDF approach.
import com.github.davidmoten.grumpy.core.Position
import com.github.davidmoten.rtree2.geometry.{Geometries, Point}
import com.github.davidmoten.rtree2.{Entry, Iterables, RTree}
val sydney = Geometries.point(151.2094, -33.86)
val canberra = Geometries.point(149.1244, -35.3075)
val brisbane = Geometries.point(153.0278, -27.4679)
val bungendore = Geometries.point(149.4500, -35.2500)
var tree = RTree.star.create[String, Point]
tree = tree.add("Sydney", sydney)
tree = tree.add("Brisbane", brisbane)
val broadcastVar = spark.sparkContext.broadcast(tree)
This fails with the aforementioned exception.
By the way, this also applies to:
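Roughly, what I have in mind for the entries-based workaround is the following untested sketch: broadcast only plain, serializable data and build the tree lazily, once per executor, inside a serializable holder. The nearestCity UDF and its nearest() lookup are only illustrative:

import com.github.davidmoten.rtree2.RTree
import com.github.davidmoten.rtree2.geometry.{Geometries, Point}
import org.apache.spark.sql.functions.udf

// Broadcast only plain, serializable data; rebuild the tree lazily on each executor.
// The tree is @transient (never serialized) and lazy (built on first access), and a
// broadcast value is deserialized once per executor, so the tree is built once per executor.
class RTreeHolder(entries: Seq[(String, Double, Double)]) extends Serializable {
  @transient lazy val tree: RTree[String, Point] = {
    var t = RTree.star.create[String, Point]
    for ((name, lon, lat) <- entries)
      t = t.add(name, Geometries.point(lon, lat))
    t
  }
}

val cities = Seq(("Sydney", 151.2094, -33.86), ("Brisbane", 153.0278, -27.4679))
val bcHolder = spark.sparkContext.broadcast(new RTreeHolder(cities))

// Illustrative lookup: return the value of the nearest entry within distance 1.0
val nearestCity = udf { (lon: Double, lat: Double) =>
  bcHolder.value.tree
    .nearest(Geometries.point(lon, lat), 1.0, 1)
    .iterator().next().value()
}

But I am not sure whether this is the idiomatic way to initialize the tree on the executors.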
https://github.com/davidmoten/rtree
RTree tree = ...;
OutputStream os = ...;
Serializer serializer =
Serializers.flatBuffers().utf8();
serializer.write(tree, os);
Apparently this should work, but at least for Spark it fails with the same exception.
Edit 2
Workaround:
Use something like https://github.com/plokhotnyuk/rtree2d, which properly supports serialization. Nonetheless, it would be interesting to know how to retrofit this to the first example.
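For reference, a minimal sketch of that workaround, assuming rtree2d's EuclideanPlane entry/RTree API roughly as documented in its README (untested):

import com.github.plokhotnyuk.rtree2d.core._
import EuclideanPlane._

// rtree2d entries and trees are serializable, so the tree itself can be broadcast
val entries = Seq(
  entry(151.2094f, -33.86f, "Sydney"),
  entry(153.0278f, -27.4679f, "Brisbane")
)
val tree2d = RTree(entries)
val broadcastTree = spark.sparkContext.broadcast(tree2d)
// e.g. inside a UDF: broadcastTree.value.searchAll(minX, minY, maxX, maxY)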
Related
I'm asking this question because I'm having trouble setting variables in Apache Flink. I would like to use a stream to fetch data with which I will initialize the variables I need for the second stream. The problem is that the streams execute in parallel, which results in a missing value when initializing the second stream. Sample code:
KafkaSource<Object> mainSource1 = KafkaSource.<Object>builder()
.setBootstrapServers(...)
.setTopicPattern(Pattern.compile(...))
.setGroupId(...)
.setStartingOffsets(OffsetsInitializer.earliest())
.setDeserializer(new ObjectDeserializer())
.build();
DataStream<Market> mainStream1 = env.fromSource(mainSource1, WatermarkStrategy.forMonotonousTimestamps(), "mainSource1");
// fetching data from the stream and setting variables
Map<TopicPartition, Long> endOffset = new HashMap<>();
endOffset.put(new TopicPartition("topicName", 0), offsetFromMainStream1);
KafkaSource<Object> mainSource2 = KafkaSource.<Object>builder()
.setBootstrapServers(...)
.setTopicPattern(Pattern.compile(...))
.setGroupId(...)
.setStartingOffsets(OffsetsInitializer.earliest())
.setBounded(OffsetsInitializer.offsets(endOffset))
.setDeserializer(new ObjectDeserializer())
.build();
DataStream<Market> mainStream2 = env.fromSource(mainSource2, WatermarkStrategy.forMonotonousTimestamps(), "mainSource2");
// further stream operations
I would like to consume the first stream, fetch the data from it, and set it locally, so that I can then use it in operations on the second stream.
You want to use one stream's data to control another stream's behavior. The best way to do this is the broadcast state pattern.
This involves creating a BroadcastStream from mainStream1 and then connecting mainStream2 to it; the connected stream can then access the data from mainStream1.
Here is a high-level example based on your code. I am assuming that the key is a String.
// Broadcast Stream
MapStateDescriptor<String, Market> stateDescriptor = new MapStateDescriptor<>(
"RulesBroadcastState",
BasicTypeInfo.STRING_TYPE_INFO,
TypeInformation.of(new TypeHint<Market>() {}));
// broadcast the rules and create the broadcast state
BroadcastStream<Market> mainStream1BroadcastStream = mainStream1
.broadcast(stateDescriptor);
DataStream<Market> yourOutput = mainStream2
.keyBy(/* key by id */)
.connect(mainStream1BroadcastStream)
.process(
new KeyedBroadcastProcessFunction<String, Market, Market, Market>() {
// You can access mainStream1 output and mainStream2 data here.
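// Typically you override processBroadcastElement() to put each element of mainStream1
// into the broadcast state, and processElement() to read that state while processing
// the elements of mainStream2 (method bodies omitted in this sketch).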
}
);
This concept is explained in detail in the Flink documentation; the code above is a modified version of the example shown there:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/broadcast_state/#the-broadcast-state-pattern
I have a question that has been on my mind for a long time. When I was writing code in Spring, there was a lot of dirty, useless boilerplate code for DTOs and domain objects. At the language level, I see no way around this in Java, but I see some light in Kotlin. Here is my question:
Style 1. It is common to write the following code (Java, C++, C#, ...):
// annot: AdminPresentation
val override = FieldMetadataOverride()
override.broadleafEnumeration = annot.broadleafEnumeration
override.hideEnumerationIfEmpty = annot.hideEnumerationIfEmpty
override.fieldComponentRenderer = annot.fieldComponentRenderer
Style 2. The previous code can be simplified by using T.apply() in Kotlin:
override.apply {
broadleafEnumeration = annot.broadleafEnumeration
hideEnumerationIfEmpty = annot.hideEnumerationIfEmpty
fieldComponentRenderer = annot.fieldComponentRenderer
}
Style 3. Can such code be simplified even further, to something like this?
override.copySameNamePropertiesFrom (annot) { // provide property list here
broadleafEnumeration
hideEnumerationIfEmpty
fieldComponentRenderer
}
First-priority requirements
Provide the property name list only once.
The property names are provided as normal code, so that we get IDE auto-completion.
Second-priority requirements
It is preferable to avoid run-time cost for Style 3. (For example, reflection may be a possible implementation, but it does introduce cost.)
It is preferable to generate code like Style 1/Style 2 directly.
Don't care
The exact final syntax of Style 3.
I am a novice in Kotlin. Is it possible to use Kotlin to define something like 'Style 3'?
It should be pretty simple to write a short helper to do this, one which even supports copying every matching property or just a selection of properties.
It is probably less useful, though, if you're writing Kotlin code and heavily utilising data classes and val (immutable) properties. Check it out:
import kotlin.reflect.KMutableProperty
import kotlin.reflect.KProperty
import kotlin.reflect.full.isSupertypeOf
import kotlin.reflect.full.memberProperties

fun <T : Any, R : Any> T.copyPropsFrom(fromObject: R, vararg props: KProperty<*>) {
// only consider mutable properties
val mutableProps = this::class.memberProperties.filterIsInstance<KMutableProperty<*>>()
// if source list is provided use that otherwise use all available properties
val sourceProps = if (props.isEmpty()) fromObject::class.memberProperties else props.toList()
// copy all matching
mutableProps.forEach { targetProp ->
sourceProps.find {
// make sure properties have same name and compatible types
it.name == targetProp.name && targetProp.returnType.isSupertypeOf(it.returnType)
}?.let { matchingProp ->
targetProp.setter.call(this, matchingProp.getter.call(fromObject))
}
}
}
This approach uses reflection, but it uses Kotlin reflection, which is very lightweight. I haven't timed anything, but it should run at almost the same speed as copying properties by hand.
Now given 2 classes:
data class DataOne(val propA: String, val propB: String)
data class DataTwo(var propA: String = "", var propB: String = "")
You can do the following:
var data2 = DataTwo()
var data1 = DataOne("a", "b")
println("Before")
println(data1)
println(data2)
// this copies all matching properties
data2.copyPropsFrom(data1)
println("After")
println(data1)
println(data2)
data2 = DataTwo()
data1 = DataOne("a", "b")
println("Before")
println(data1)
println(data2)
// this copies only matching properties from the provided list
// with complete refactoring and completion support
data2.copyPropsFrom(data1, DataOne::propA)
println("After")
println(data1)
println(data2)
Output will be:
Before
DataOne(propA=a, propB=b)
DataTwo(propA=, propB=)
After
DataOne(propA=a, propB=b)
DataTwo(propA=a, propB=b)
Before
DataOne(propA=a, propB=b)
DataTwo(propA=, propB=)
After
DataOne(propA=a, propB=b)
DataTwo(propA=a, propB=)
I have a scikit-learn model that I'm using in my Java app using JPMML. I'm trying to set the InputFields using the name of the column that was used during training, but "inField.getName().getValue()" is obfuscated to "x{#}". Is there any way I could map "x{#}" back to the original feature/attribute name?
Map<FieldName, FieldValue> arguments = new LinkedHashMap<>();
for (InputField inField : patternEvaluator.getInputFields()) {
int value = activeFeatures.contains(inField.getName().getValue()) ? 1 : 0;
FieldValue inputFieldValue = inField.prepare(value);
arguments.put(inField.getName(), inputFieldValue);
}
Map<FieldName, ?> results = patternEvaluator.evaluate(arguments);
Here's how I'm generating the model:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn2pmml import PMMLPipeline, sklearn2pmml
data = pd.read_csv('/pydata/training.csv')
X = data[data.keys()[:-1]].as_matrix()
y = data['classname'].as_matrix()
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=0)
estimators = [("read", RandomForestClassifier(n_jobs=5,n_estimators=200, max_features='auto'))]
pipe = PMMLPipeline(estimators)
pipe.fit(X_train,y_train)
pipe.active_fields = np.array(data.columns)
sklearn2pmml(pipe, "/pydata/model.pmml", with_repr = True)
Thanks
Does the PMML document contain actual field names at all? Open it in a text editor and see what the values of the /PMML/DataDictionary/DataField@name attributes are.
Your question indicates that the conversion from Scikit-Learn to PMML was incomplete, because it didn't include information about active field (aka input field) names. In that case they are assumed to be x1, x2, ..., xn.
Your pipeline only includes the estimator, that is why the names are lost. You have to include all the preprocessing steps as well in order to get them into the PMML.
Let's assume you do not do any preprocessing at all; then this is probably what you need (I do not repeat the parts of your code that are still required for this snippet):
from sklearn_pandas import DataFrameMapper

nones = [(d, None) for d in data.columns]
mapper = DataFrameMapper(nones, df_out=True)
lm = PMMLPipeline([
    ("mapper", mapper),
    ("estimator", RandomForestClassifier(n_jobs=5, n_estimators=200, max_features='auto'))
])
lm.fit(X_train,y_train)
sklearn2pmml(lm, "ScikitLearnNew.pmml", with_repr=True)
In case you do require some preprocessing of your data, you can use any other transformer (e.g. LabelBinarizer) instead of None. But the preprocessing has to happen inside the pipeline in order to be included in the PMML.
I am prototyping an interface to our application to allow other people to use Python; our application is written in Java. I would like to pass some of our data from the Java app to the Python code, but I am unsure how to pass an object to Python. I have done a simple Java-to-Python function call with simple parameters using Jython and found it very useful for what I am trying to do. Given the class below, how can I use it in Python/Jython as an input to a function or class?
public class TestObject
{
private double[] values;
private int length;
private int anotherVariable;
//getters, setters
}
One solution: you could use some sort of messaging system, queue, or broker to serialize/deserialize and pass messages between Python and Java, and then create workers/producers/consumers to put work on the queues to be processed in Python or Java.
Also consider checking out https://www.py4j.org/ for inspiration; Py4J is used heavily by PySpark and Hadoop-type stuff.
To answer your question more immediately, here is an example using json-simple:
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.io.Writer;

import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
public class TestObject
{
    private double[] values;
    private int length;
    private int anotherVariable;
    private boolean someBool;
    private String someString;

    //getters, setters

    public String toJSON() throws IOException {
        JSONObject obj = new JSONObject();

        // json-simple stores arrays as a JSONArray, so copy the primitive values over
        JSONArray vals = new JSONArray();
        for (double v : this.values) {
            vals.add(v);
        }
        obj.put("values", vals);
        obj.put("length", this.length);
        obj.put("bool_val", this.someBool);
        obj.put("string_key", this.someString);

        StringWriter out = new StringWriter();
        obj.writeJSONString(out);
        return out.toString();
    }

    public void writeObject() throws IOException {
        try (Writer writer = new BufferedWriter(
                new OutputStreamWriter(
                        new FileOutputStream("anObject.json"), "utf-8"))) {
            writer.write(this.toJSON());
        }
    }

    public void setObject() {
        this.values = new double[] { 100.134 };
        this.length = 12;
        this.anotherVariable = 15;
        this.someBool = true;
        this.someString = "spam";
    }
}
And in Python:
class DoStuffWithObject(object):

    def __init__(self, obj):
        self.obj = obj
        self.changeObj()
        self.writeObj()

    def changeObj(self):
        self.obj['values'] = 100.134
        self.obj['length'] = 12
        self.obj['anotherVariable'] = 15
        self.obj['someString'] = "spam"

    def writeObj(self):
        ''' write back to file '''
        with open('anObject.json', 'w') as f:
            json.dump(self.obj, f)

    def someOtherMethod(self, s):
        ''' do something else '''
        print('hello {}'.format(s))
import json
with open('anObject.json', 'r') as f:
    obj = json.loads(f.read())

# print out obj['values'], obj['someBool'], ...
for key in obj:
    print(key, obj[key])

aThing = DoStuffWithObject(obj)
aThing.someOtherMethod('there')
And then in Java read back the object. There are existing solutions implementing this idea (JSON-RPC, XML-RPC, and variants). Depending on your needs, you may also want to consider using something like MongoDB ( http://docs.mongodb.org/ecosystem/drivers/java/ ), the benefit being that Mongo speaks JSON natively.
See:
https://spring.io/guides/gs/messaging-reactor/
http://spring.io/guides/gs/messaging-rabbitmq/
http://spring.io/guides/gs/scheduling-tasks/
Celery-like Java projects:
Jedis
RabbitMQ
ZeroMQ
A more comprehensive list of queues:
http://queues.io/
Resources referenced:
http://www.oracle.com/technetwork/articles/java/json-1973242.html
How do I create a file and write to it in Java?
https://code.google.com/p/json-simple/wiki/EncodingExamples
I agree with the answer above. I think the bottom line is that "Python and Java are separate interpreter environments." You therefore shouldn't expect to transfer "an object" from one to the other, and you shouldn't expect to "call methods." But it is reasonable to pass data from one to the other by serializing and deserializing it through some intermediate data format (e.g. JSON), as you would do with any other program.
In some environments, such as Microsoft Windows, it's possible that a technology like OLE (dot-Net) might be usable to allow the environments to be linked together "actively," where the various systems implement and provide OLE objects. But I don't have any personal experience with whether, or how, this might be done.
Therefore, the safest thing to do is to treat them as "records" and to use serialization techniques on both sides. (Or, if you got very adventurous, run (say) Java in a child process.) An "adventurous" design could get out of hand very quickly, with little return on investment.
You need to make the Python file into an exe using py2exe; refer to this link: https://www.youtube.com/watch?v=kyoGfnLm4LA. Then use the program from Java and pass arguments.
Please refer to this link, which has the details:
Calling fortran90 exe program from java is not executing
I'm replacing some code generation components in a Java program with Scala macros, and am running into the Java Virtual Machine's limit on the size of the generated byte code for individual methods (64 kilobytes).
For example, suppose we have a large-ish XML file that represents a mapping from integers to integers that we want to use in our program. We want to avoid parsing this file at run time, so we'll write a macro that will do the parsing at compile time and use the contents of the file to create the body of our method:
import scala.language.experimental.macros
import scala.reflect.macros.Context
object BigMethod {
// For this simplified example we'll just make some data up.
val mapping = List.tabulate(7000)(i => (i, i + 1))
def lookup(i: Int): Int = macro lookup_impl
def lookup_impl(c: Context)(i: c.Expr[Int]): c.Expr[Int] = {
import c.universe._
val switch = reify(new scala.annotation.switch).tree
val cases = mapping map {
case (k, v) => CaseDef(c.literal(k).tree, EmptyTree, c.literal(v).tree)
}
c.Expr(Match(Annotated(switch, i.tree), cases))
}
}
In this case the compiled method would be just over the size limit, but instead of a nice error saying that, we're given a giant stack trace with a lot of calls to TreePrinter.printSeq and are told that we've slain the compiler.
I have a solution that involves splitting the cases into fixed-sized groups, creating a separate method for each group, and adding a top-level match that dispatches the input value to the appropriate group's method. It works, but it's unpleasant, and I'd prefer not to have to use this approach every time I write a macro where the size of the generated code depends on some external resource.
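For illustration, that workaround generates code with roughly the following shape; this is a hand-written sketch of a tiny mapping with hypothetical method names, not the actual macro output:

// Each group's method stays well under the 64 KB limit; a top-level method dispatches.
object SplitLookupExample {
  private def lookup0(i: Int): Int = (i: @scala.annotation.switch) match {
    case 0 => 1
    case 1 => 2
    case 2 => 3
  }

  private def lookup1(i: Int): Int = (i: @scala.annotation.switch) match {
    case 3 => 4
    case 4 => 5
    case 5 => 6
  }

  def lookup(i: Int): Int = (i / 3) match {
    case 0 => lookup0(i)
    case 1 => lookup1(i)
  }
}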
Is there a cleaner way to tackle this problem? More importantly, is there a way to deal with this kind of compiler error more gracefully? I don't like the idea of a library user getting an unintelligible "That entry seems to have slain the compiler" error message just because some XML file that's being processed by a macro has crossed some (fairly low) size threshold.
IMO, putting data into .class files isn't really a good idea.
They are parsed as well, they're just binary, and storing them in the JVM may have a negative impact on the performance of the garbage collector and the JIT compiler.
In your situation, I would pre-compile the XML into a binary file of a suitable format and parse that. Eligible formats with existing tooling are e.g. FastRPC or good old DBF. Or maybe pre-fill an ElasticSearch repository if you need quick advanced lookups and searches. Some implementations of the latter may also provide basic indexing, which could even leave the parsing out - the app would just read from the respective offset.
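As a minimal illustration of that idea, here is a sketch using a plain DataOutputStream/DataInputStream pair instead of FastRPC or DBF (the BinaryMapping object and the file layout are hypothetical; the write step would run once at build time, e.g. from the parsed XML):

import java.io.{DataInputStream, DataOutputStream, FileInputStream, FileOutputStream}

object BinaryMapping {
  // Build time: write the number of pairs followed by the (key, value) ints.
  def write(path: String, mapping: Seq[(Int, Int)]): Unit = {
    val out = new DataOutputStream(new FileOutputStream(path))
    try {
      out.writeInt(mapping.size)
      for ((k, v) <- mapping) { out.writeInt(k); out.writeInt(v) }
    } finally out.close()
  }

  // Run time: read the pairs back; no XML parsing and no oversized generated method.
  def read(path: String): Map[Int, Int] = {
    val in = new DataInputStream(new FileInputStream(path))
    try {
      val n = in.readInt()
      (1 to n).map(_ => in.readInt() -> in.readInt()).toMap
    } finally in.close()
  }
}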
Since somebody has to say something, I followed the instructions at Importers to try to compile the tree before returning it.
If you give the compiler plenty of stack, it will correctly report the error.
(It didn't seem to know what to do with the switch annotation, left as a future exercise.)
apm@mara:~/tmp/bigmethod$ skalac bigmethod.scala ; skalac -J-Xss2m biguser.scala ; skala bigmethod.Test
Error is java.lang.RuntimeException: Method code too large!
Error is java.lang.RuntimeException: Method code too large!
biguser.scala:5: error: You ask too much of me.
Console println s"5 => ${BigMethod.lookup(5)}"
^
one error found
as opposed to
apm@mara:~/tmp/bigmethod$ skalac -J-Xss1m biguser.scala
Error is java.lang.StackOverflowError
Error is java.lang.StackOverflowError
biguser.scala:5: error: You ask too much of me.
Console println s"5 => ${BigMethod.lookup(5)}"
^
where the client code is just that:
package bigmethod
object Test extends App {
Console println s"5 => ${BigMethod.lookup(5)}"
}
My first time using this API, but not my last. Thanks for getting me kickstarted.
package bigmethod
import scala.language.experimental.macros
import scala.reflect.macros.Context
object BigMethod {
// For this simplified example we'll just make some data up.
//final val size = 700
final val size = 7000
val mapping = List.tabulate(size)(i => (i, i + 1))
def lookup(i: Int): Int = macro lookup_impl
def lookup_impl(c: Context)(i: c.Expr[Int]): c.Expr[Int] = {
def compilable[T](x: c.Expr[T]): Boolean = {
import scala.reflect.runtime.{ universe => ru }
import scala.tools.reflect._
//val mirror = ru.runtimeMirror(c.libraryClassLoader)
val mirror = ru.runtimeMirror(getClass.getClassLoader)
val toolbox = mirror.mkToolBox()
val importer0 = ru.mkImporter(c.universe)
type ruImporter = ru.Importer { val from: c.universe.type }
val importer = importer0.asInstanceOf[ruImporter]
val imported = importer.importTree(x.tree)
val tree = toolbox.resetAllAttrs(imported.duplicate)
try {
toolbox.compile(tree)
true
} catch {
case t: Throwable =>
Console println s"Error is $t"
false
}
}
import c.universe._
val switch = reify(new scala.annotation.switch).tree
val cases = mapping map {
case (k, v) => CaseDef(c.literal(k).tree, EmptyTree, c.literal(v).tree)
}
//val res = c.Expr(Match(Annotated(switch, i.tree), cases))
val res = c.Expr(Match(i.tree, cases))
// before returning a potentially huge tree, try compiling it
//import scala.tools.reflect._
//val x = c.Expr[Int](c.resetAllAttrs(res.tree.duplicate))
//val y = c.eval(x)
if (!compilable(res)) c.abort(c.enclosingPosition, "You ask too much of me.")
res
}
}