I'm trying to map a function across a JavaRDD in Spark, and I keep getting a NotSerializableException on the map call.
public class SparkPrunedSet extends AbstractSparkSet {
    private final ColumnPruner pruner;

    public SparkPrunedSet(@JsonProperty("parent") SparkSet parent, @JsonProperty("pruner") ColumnPruner pruner) {
        super(parent);
        this.pruner = pruner;
    }

    public JavaRDD<Record> getRdd(SparkContext context) {
        JavaRDD<Record> rdd = getParent().getRdd(context);
        Function<Record, Record> mappingFunction = makeRecordTransformer(pruner);
        // The line below throws the error
        JavaRDD<Record> mappedRdd = rdd.map(mappingFunction);
        return mappedRdd;
    }

    private Function<Record, Record> makeRecordTransformer(ColumnPruner pruner) {
        return new Function<Record, Record>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Record call(Record record) throws Exception {
                // Obviously I'd like to do something more useful in here, but this is enough
                // to throw the error
                return record;
            }
        };
    }
}
When it runs, I get:
java.io.NotSerializableException: com.package.SparkPrunedSet
Record is an interface that extends Serializable, and MapRecord is an implementation of it. Similar code exists and works elsewhere in the codebase, except that it uses rdd.filter instead. I've read through most of the other Stack Overflow entries on this, and none of them seem to help. I thought it might have to do with trouble serializing SparkPrunedSet (although I don't understand why it would even need to do this), so I set all of its fields to transient, but that didn't help either. Does anyone have any ideas?
The Function you are creating for the transformation is, in fact, an (anonymous) inner class of SparkPrunedSet, so every instance of that function carries an implicit reference to the SparkPrunedSet object that created it.
Serializing the function therefore requires serializing SparkPrunedSet as well.
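One way to break that reference, sketched below under the assumption that the transformer only needs the ColumnPruner (and that ColumnPruner is itself Serializable), is to move the function into a static nested class. A static nested class has no hidden pointer to the enclosing instance, so Spark only serializes the function and the fields it actually declares. PruningFunction is a name made up for this sketch:

private static class PruningFunction implements Function<Record, Record> {
    private static final long serialVersionUID = 1L;

    // Assumes ColumnPruner is Serializable; if not, it must be made so,
    // or rebuilt inside call().
    private final ColumnPruner pruner;

    PruningFunction(ColumnPruner pruner) {
        this.pruner = pruner;
    }

    @Override
    public Record call(Record record) throws Exception {
        // Apply the pruner to the record here; returning it unchanged keeps
        // the sketch minimal.
        return record;
    }
}

In getRdd() the map call then becomes rdd.map(new PruningFunction(pruner)), and SparkPrunedSet itself no longer has to be serializable at all.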
Maybe an existing question has the answer, but based on my search terms I could not find something that fits, so I am posting the question here. This question may seem silly, but this is something new to me. Please provide some suggestions or workarounds.
I have a Populator class which has a Map. I want this map to be populated with various values during code execution; at the end I want to obtain all the values in the Map and process them further based on my requirements.
As of now I am using a static method and variable to achieve this, and it seems to work fine. But my mentor advised me that this will not be thread-safe when processing multiple requests in parallel. I would like to know how to make the code below thread-safe.
I will explain with code for better understanding.
Following is my Populator class, which is used to populate the Map during processing; I access the Map at the end for further processing. I have also created a class AnotherPopulator as a workaround for the issue, but it does not work as per my need:
public class Populator {
    @Getter
    private static final HashMap<String, String> namespaces = new HashMap<>();

    public static void namespacePopulator(String key, String value) {
        namespaces.put(key, value);
    }
}

@NoArgsConstructor
@Getter
class AnotherPopulator {
    private final HashMap<String, String> namespaces = new HashMap<>();

    public void namespacePopulator(String key, String value) {
        this.namespaces.put(key, value);
    }
}
Following are classes A and B, which are invoked by Main to populate the Map during execution:
public class A {
    public void populatorA() {
        Populator.namespacePopulator("KeyA", "ValueA");
    }

    public void anotherPopulatorA() {
        AnotherPopulator anotherPopulator = new AnotherPopulator();
        anotherPopulator.namespacePopulator("KeyAA", "ValueA1");
    }
}

public class B {
    public void populatorB() {
        Populator.namespacePopulator("KeyB", "ValueB");
    }

    public void anotherPopulatorB() {
        AnotherPopulator anotherPopulator = new AnotherPopulator();
        anotherPopulator.namespacePopulator("KeyB1", "ValueB1");
    }
}
Following is my Main class, which invokes A and B and then finally obtains the Map with all the values populated during execution:
public class Main {
    public static void main(String[] args) {
        A a = new A();
        B b = new B();
        a.populatorA();
        b.populatorB();
        // This gives the desired result but does not provide thread safety
        System.out.println(Populator.getNamespaces());
        System.out.println("****************************");
        // This provides thread safety but not the desired result, since a new
        // object is created at every stage
        AnotherPopulator anotherPopulator = new AnotherPopulator();
        System.out.println(anotherPopulator.getNamespaces());
        // I would like to populate a Map from various classes during execution and
        // finally obtain all the values that were added, but in a thread-safe way
    }
}
Following is the output I get. The first part has the values I need, but it is not a thread-safe approach. The second part does not have the values I need, but I believe it is thread-safe.
{KeyB=ValueB, KeyA=ValueA}
****************************
{}
I would like to know how I can declare a Map in a thread-safe way, populate it over the entire execution life cycle, and finally obtain all the values together.
I hope I was able to explain the issue clearly. Any help/workarounds/suggestions would be really helpful. Thanks in advance.
As mentioned in the comments, use a ConcurrentHashMap:
public class Populator {
    @Getter
    private static final ConcurrentHashMap<String, String> namespaces = new ConcurrentHashMap<>();

    public static void namespacePopulator(String key, String value) {
        namespaces.putIfAbsent(key, value);
    }
}
Note that putIfAbsent only inserts a value when the key is not already present; if you want later writes to overwrite earlier ones, plain put is just as thread-safe on a ConcurrentHashMap.
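As a quick illustration, here is a minimal sketch (the thread bodies and keys are made up for the example) showing the map being populated from two threads and read once both have finished:

public class PopulatorDemo {
    public static void main(String[] args) throws InterruptedException {
        // Each thread writes through the same static, thread-safe map.
        Thread t1 = new Thread(() -> Populator.namespacePopulator("KeyA", "ValueA"));
        Thread t2 = new Thread(() -> Populator.namespacePopulator("KeyB", "ValueB"));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Both entries are guaranteed to be visible here, with no external locking.
        System.out.println(Populator.getNamespaces());
    }
}

One thing ConcurrentHashMap does not solve on its own: the map is still shared across all requests, so if each request needs its own set of values, entries from parallel requests will still end up mixed together.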
I'm trying to implement (I'm just starting to work with Java and Flink) non-keyed state in a KafkaConsumer object, since at this stage no keyBy() is called. This object is the front end and the first module to handle messages from Kafka.
SourceOutput is a proto file representing the message.
I have the KafkaConsumer object:
public class KafkaSourceFunction extends ProcessFunction<byte[], SourceOutput> implements Serializable
{
    @Override
    public void processElement(byte[] bytes, ProcessFunction<byte[], SourceOutput>.Context
        context, Collector<SourceOutput> collector) throws Exception
    {
        // Here, I want to call the sorting method
        collector.collect(output);
    }
}
I have an object (KafkaSourceSort) that does all the sorting; it should keep unordered messages in a priority queue in the state, and it is also responsible for delivering a message through the collector once it arrives in the right order.
class SessionInfo
{
    public PriorityQueue<SourceOutput> orderedMessages = null;

    public void putMessage(SourceOutput msg)
    {
        if (orderedMessages == null)
            orderedMessages = new PriorityQueue<SourceOutput>(new SequenceComparator());
        orderedMessages.add(msg);
    }
}

public class KafkaSourceState implements Serializable
{
    public TreeMap<String, SessionInfo> sessions = new TreeMap<>();
}
I read that I need to use non-keyed state (ListState), which should contain a map of sessions, where each session holds a priority queue of all the messages related to that session.
I found an example, so I implemented this:
// SinkFunction's type parameter must match the type that invoke() receives
public class KafkaSourceSort implements SinkFunction<KafkaSourceState>,
        CheckpointedFunction
{
    private transient ListState<KafkaSourceState> checkpointedState;
    private KafkaSourceState state;

    @Override
    public void snapshotState(FunctionSnapshotContext functionSnapshotContext) throws Exception
    {
        checkpointedState.clear();
        checkpointedState.add(state);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception
    {
        ListStateDescriptor<KafkaSourceState> descriptor =
            new ListStateDescriptor<KafkaSourceState>(
                "KafkaSourceState",
                TypeInformation.of(new TypeHint<KafkaSourceState>() {}));
        checkpointedState = context.getOperatorStateStore().getListState(descriptor);
        if (context.isRestored())
        {
            // ListState.get() returns an Iterable, so it cannot simply be cast;
            // take the restored entry (or entries) from the iterable instead.
            for (KafkaSourceState restored : checkpointedState.get())
            {
                state = restored;
            }
        }
    }

    @Override
    public void invoke(KafkaSourceState value, SinkFunction.Context context) throws Exception
    {
        state = value;
        // ...
    }
}
I see that I need to implement an invoke() method, which will presumably be called from processElement(), but the signature of invoke() doesn't contain the collector, and I don't understand how to do this, or even whether what I have done so far is OK.
Any help will be appreciated. Thanks.
A SinkFunction is a terminal node in the DAG that is your job graph. It doesn't have a Collector in its interface because it cannot emit anything downstream; it is expected to connect to an external service or data store and send data there.
If you share more about what you are trying to accomplish, perhaps we can offer more assistance. There may be an easier way to go about this.
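For instance, if the reordering really has to happen before any keyBy(), one option (a rough sketch only, reusing the names from the question) is to keep the buffering inside the ProcessFunction itself and have that same function implement CheckpointedFunction. That way one object has both the Collector (in processElement) and the non-keyed ListState (in the two checkpointing callbacks). The parseFrom call assumes SourceOutput is the protobuf-generated class:

public class KafkaSourceFunction extends ProcessFunction<byte[], SourceOutput>
        implements CheckpointedFunction {

    private transient ListState<KafkaSourceState> checkpointedState;
    private KafkaSourceState state = new KafkaSourceState();

    @Override
    public void processElement(byte[] bytes, Context context,
            Collector<SourceOutput> collector) throws Exception {
        SourceOutput message = SourceOutput.parseFrom(bytes);
        // A real implementation would buffer out-of-order messages in `state`
        // here and only emit what is now in order; the sketch just forwards.
        collector.collect(message);
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        checkpointedState.clear();
        checkpointedState.add(state);
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) throws Exception {
        ListStateDescriptor<KafkaSourceState> descriptor =
            new ListStateDescriptor<>("KafkaSourceState",
                TypeInformation.of(new TypeHint<KafkaSourceState>() {}));
        checkpointedState = ctx.getOperatorStateStore().getListState(descriptor);
        if (ctx.isRestored()) {
            // After a rescale there may be several entries; merge as needed.
            for (KafkaSourceState restored : checkpointedState.get()) {
                state = restored;
            }
        }
    }
}

This folds the sorting into the source-side function instead of a separate sink, which avoids the terminal-node problem entirely.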
I need variables initialized in an outer class to be available in an inner class, so I used static variables. This is a Flink application.
When built with Eclipse's export-runnable-jar, it works fine: the state of the variables is retained.
When built with Maven or Eclipse's export-jar, it fails: the state of the variables is lost.
FileMonitorWrapper.fileInputDir stays "" and never picks up the value that was passed in.
Sounds strange. Any thoughts?
static transient String fileInputDir = "";
static transient String fileArchiveDir = "";

@SuppressWarnings("serial")
public DataStream<String> ScanDirectoryForFile(String inputDir, String inputFilePattern, String archiveDir, StreamExecutionEnvironment env) {
    try {
        FileMonitorWrapper.fileArchiveDir = archiveDir;
        FileMonitorWrapper.fileInputDir = inputDir;
        filteredDirFiles = dirFiles.filter(new FileMapper());
        .
        .
        .
    }
}

@SuppressWarnings("serial")
static class FileMapper implements FilterFunction<TimestampedFileInputSplit> {
    @Override
    public boolean filter(TimestampedFileInputSplit value) throws Exception {
        if (value.toString().contains("done"))
            FileMonitorWrapper.doneFound = true;
        if (value.toString().contains("dat"))   // stray ';' removed here; it made this check a no-op
            FileMonitorWrapper.datFound = true;
        if (FileMonitorWrapper.datFound && FileMonitorWrapper.doneFound) {
            try {
                if (value.getPath().toString().contains("done")) {
                    Files.move(Paths.get(FileMonitorWrapper.fileInputDir + "\\" + value.getPath().getName()),
                        Paths.get(FileMonitorWrapper.fileArchiveDir + "\\" + value.getPath().getName()));
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            return (!value.toString().contains("done"));
        }
        else
            return false;
    }
}
}
Generally speaking, serialization of POJOs does not capture the state of static variables, and from what I have read, Flink serialization is no different.
So when you say that the static variable state is "retained" in some cases, I think you are misinterpreting the evidence: something else is preserving the state of the static variables, or they are being initialized to values that happen to be the same in the "before" and "after" cases.
Why am I so sure about this? Serializing static variables doesn't make much sense. Consider this:
public class Cat {
    private static List<Cat> allCats = new ArrayList<>();

    private String name;
    private String colour;

    public Cat(...) {
        ...
        allCats.add(this);
    }
    ...
}
Cat fluffy = new Cat("fluffy", ...);
Cat claus = new Cat("claus", ...);
If the static field of Cat were serialized:
Every time a serial stream contained a Cat, it would (must) contain all cats created so far.
Whenever I deserialized a stream containing a Cat, I would also need to deserialize the ArrayList<Cat>. What would I do with it?
Do I overwrite allCats with it? (And lose track of the other cats?)
Do I throw it away?
Do I try to merge the lists? (How? With what semantics? Do I end up with two cats called "fluffy"?)
Basically, there is no semantics for this scenario that works out well in general. The (universal) solution is to NOT serialize static variables.
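In Flink specifically, the usual way out is to stop relying on statics altogether and pass the values into the function as instance fields; those fields are serialized together with the function when it is shipped to the task managers, regardless of how the jar was built. A rough sketch, reusing the names from the question:

@SuppressWarnings("serial")
static class FileMapper implements FilterFunction<TimestampedFileInputSplit> {
    // Instance fields travel with the serialized function, so every worker
    // sees the values that were set on the client side.
    private final String fileInputDir;
    private final String fileArchiveDir;

    FileMapper(String fileInputDir, String fileArchiveDir) {
        this.fileInputDir = fileInputDir;
        this.fileArchiveDir = fileArchiveDir;
    }

    @Override
    public boolean filter(TimestampedFileInputSplit value) throws Exception {
        // Same logic as before, but reading the instance fields instead of
        // the statics on FileMonitorWrapper.
        return !value.toString().contains("done");
    }
}

with the call site changed to dirFiles.filter(new FileMapper(inputDir, archiveDir)). Note that the mutable doneFound/datFound flags have the same underlying problem: each parallel task gets its own deserialized copy of the function, so cross-record bookkeeping like that needs Flink state, not statics.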
Does someone have sample code for replacing a Java List (LinkedList or ArrayList) with something similar in Berkeley DB? My problem is that I have to replace Lists to scale beyond main-memory limits. Some simple sample code would be really nice.
I've now used a simple TupleBinding for Integers (the keys) and a SerialBinding for the Diff class (the data values).
Now I'm receiving the error:
14:03:29.287 [pool-5-thread-1] ERROR o.t.g.view.model.TraverseCompareTree - org.treetank.diff.Diff; local class incompatible: stream classdesc serialVersionUID = 8484615870884317488, local class serialVersionUID = -8805161170968505227
java.io.InvalidClassException: org.treetank.diff.Diff; local class incompatible: stream classdesc serialVersionUID = 8484615870884317488, local class serialVersionUID = -8805161170968505227
The listener and TransactionRunner classes which I'm using are:
/** {@inheritDoc} */
@Override
public void diffListener(final EDiff paramDiff, final IStructuralItem paramNewNode,
        final IStructuralItem paramOldNode, final DiffDepth paramDepth) {
    try {
        mRunner.run(new PopulateDatabase(mDiffDatabase, mKey++,
            new Diff(paramDiff, paramNewNode.getNodeKey(), paramOldNode.getNodeKey(), paramDepth)));
    } catch (final Exception e) {
        LOGWRAPPER.error(e.getMessage(), e);
    }
}

private static class PopulateDatabase implements TransactionWorker {
    private StoredMap<Integer, Diff> mMap;
    private int mKey;
    private Diff mValue;

    public PopulateDatabase(final DiffDatabase paramDatabase, final int paramKey, final Diff paramValue) {
        Objects.requireNonNull(paramDatabase);
        Objects.requireNonNull(paramValue);
        mMap = paramDatabase.getMap();
        mKey = paramKey;
        mValue = paramValue;
    }

    @Override
    public void doWork() throws DatabaseException {
        mMap.put(mKey, mValue);
    }
}
I don't know why it doesn't work :-/
Edit: Sorry, I just had to delete the generated environment/database and create a new one.
I'm afraid it won't be that simple. A first step you might want to take is to refactor your code so that all accesses to the list go through a separate class (call it a DAO, if you like). Then it will be a lot easier to move to a database instead of the list.
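Something like the following hypothetical interface (the names are made up for this sketch) is all that first step needs; once every caller goes through it, swapping the in-memory implementation for a Berkeley DB StoredMap-backed one is a local change:

// A minimal DAO boundary for the diffs.
public interface DiffStore {
    void put(int key, Diff value);
    Diff get(int key);
}

// In-memory implementation to start from; a StoredMap-backed implementation
// can later replace it behind the same interface.
public class InMemoryDiffStore implements DiffStore {
    private final java.util.Map<Integer, Diff> map = new java.util.HashMap<>();

    @Override
    public void put(int key, Diff value) {
        map.put(key, value);
    }

    @Override
    public Diff get(int key) {
        return map.get(key);
    }
}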
Berkeley DB is severe overkill for this type of task. It's a fair beast to configure and set up, and I believe the license is now commercial. You'll be much better off using a disk-backed list or map. As an example of the latter, take a look at Kyoto Cabinet: it's extremely fast, implements the standard Java Collections interfaces, and is as easy to use as a List or Map. See my other answer for example code.
I have an array that I have created from a database ResultSet. I am trying to serialize it so that I can send it over a socket stream. At the moment I am getting an error telling me that the array is not serializable. The code is below; the first part is the class whose objects populate the array:
class ProteinData
{
    private int ProteinKey;

    // No-arg constructor added: readJavaObject() below calls new ProteinData()
    public ProteinData()
    {
    }

    public ProteinData(Integer ProteinKey)
    {
        this.ProteinKey = ProteinKey;
    }

    public Integer getProteinKey() {
        return this.ProteinKey;
    }

    public void setProteinKey(Integer ProteinKey) {
        this.ProteinKey = ProteinKey;
    }
}
The code to populate the array:
public List<ProteinData> readJavaObject(String query, Connection con) throws Exception
{
    List<ProteinData> tableData = new ArrayList<>();
    PreparedStatement stmt = con.prepareStatement(query);
    ResultSet query_results = stmt.executeQuery();
    while (query_results.next())
    {
        ProteinData pro = new ProteinData();
        pro.setProteinKey(query_results.getInt("ProteinKey"));
        tableData.add(pro);
    }
    query_results.close();
    stmt.close();
    return tableData;
}
And the code to call this is:
List dataList = (List) this.readJavaObject(query, con);
ObjectOutputStream output_stream = new ObjectOutputStream(socket.getOutputStream());
output_stream.writeObject(dataList);
And the code receiving this is:
List dataList = (List) input_stream.readObject();
Can someone help me serialize this list? All I can find in forums are simple arrays (e.g. int[]).
I tried adding Serializable to the class together with the UID number, but got a java.lang.ClassNotFoundException: socketserver.ProteinData error message. Does anyone know why?
Thanks for any help.
Basically, you need the classes you want to serialize to implement Serializable. And if you want to avoid the warning related to the serial version, each one should also have a long serialVersionUID, a code used to distinguish your specific version of the class. Read a tutorial like this one for additional info; serialization is not so hard to handle.
However, remember that serialization is faulty when used between two different versions of the JVM (and it has some flaws in general).
Just a side note: the interface Serializable doesn't actually give any required feature to the class itself (it's not a typical interface); it is used just to distinguish classes that are supposed to be sent over streams from all the others. Of course, if a class is Serializable, all the components it uses (instance variables) must be serializable too in order to send the whole object.
Change your class declaration to:
class ProteinData implements Serializable {
...
}
I would have thought that as a minimum you would need
class ProteinData implements Serializable
and a
private static final long serialVersionUID = 1234556L;
in the class (Eclipse will generate the magic number for you).
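Putting the two answers together, a minimal version of the class would look like this sketch (field naming tidied to the usual Java convention):

import java.io.Serializable;

class ProteinData implements Serializable {
    // Pins the version so both ends of the socket agree on the class shape.
    private static final long serialVersionUID = 1234556L;

    private int proteinKey;

    public ProteinData() {
    }

    public ProteinData(Integer proteinKey) {
        this.proteinKey = proteinKey;
    }

    public Integer getProteinKey() {
        return proteinKey;
    }

    public void setProteinKey(Integer proteinKey) {
        this.proteinKey = proteinKey;
    }
}

As for the java.lang.ClassNotFoundException: socketserver.ProteinData mentioned in the question: deserialization looks the class up by its fully qualified name, so the receiving JVM must have the very same class (same package, same name) on its classpath.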