I want to provide communication between many JVMs using protobuf. These JVMs run a component-based middleware, so they exchange arbitrary objects that I cannot anticipate because they are written by third-party developers.
The problem is that I want to free component developers from the burden of specifying the serialization mechanism. I think this decision has some advantages:
There are legacy components that were written without a specific serialization mechanism in mind (in fact, they use built-in Java serialization).
If the channel manages the encoding/decoding of messages, you can connect any pair of components.
It is easier to write components.
However, the only way of doing automatic serialization is Java's built-in serialization, and as we all know, that's very slow. So, my question is: can we create a mechanism that, given a Java object, builds a protobuf message with its content that we can send to another process?
I am aware that this is not the way you should use protobuf and I can see some problems. Let me first explain how I think we can achieve my goal.
1. If an object (O) of class (C) has never been serialized, go to step 2; otherwise, we already have a message class to serialize this class and can go to step 7.
2. Build a proto specification using reflection on class C, just as built-in serialization does (a sketch of this step follows the notes below).
3. Generate the message class using protoc.
4. Compile the generated class with the Java compiler.
5. Generate a class on the fly using ASM for bytecode manipulation. This class will transform O into a message we can send. It will also perform the opposite transformation.
6. Save in a cache all the classes generated for objects of class C.
7. Use the class generated in step 5 to create a message.
8. Send the message with whatever mechanism the channel supports (e.g. sockets, shared memory).
Note 1: You can see that we are doing this on one side of the communication channel; we need to do it on both sides. I think it is possible to send the first message using built-in serialization (using that first object to build the protobuf message) and further objects with protobuf.
Note 2: Step 5 is not required, but it is useful to avoid reflection every time you send an object.
Note 3: Protobuf is not mandatory here. I am including it because maybe it offers some tool to deal with the problem I have.
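To make step 2 a bit more concrete, here is a minimal sketch (my own, not taken from any library) of deriving a proto definition from an arbitrary class via reflection. The type mapping is deliberately naive and it ignores nested objects, collections, enums and inheritance; the names ProtoSchemaSketch and protoFor are just placeholders:

import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Map;

public final class ProtoSchemaSketch {

    // Very rough Java-type -> proto-type mapping; a real implementation would
    // also have to handle nested messages, repeated fields, enums, etc.
    private static final Map<Class<?>, String> SCALARS = Map.of(
            int.class, "int32", long.class, "int64",
            float.class, "float", double.class, "double",
            boolean.class, "bool", String.class, "string");

    public static String protoFor(Class<?> c) {
        StringBuilder proto = new StringBuilder("syntax = \"proto3\";\n\n");
        proto.append("message ").append(c.getSimpleName()).append(" {\n");
        int tag = 1;
        for (Field f : c.getDeclaredFields()) {
            int mods = f.getModifiers();
            if (Modifier.isStatic(mods) || Modifier.isTransient(mods)) {
                continue; // mirror built-in serialization: skip static and transient fields
            }
            String type = SCALARS.getOrDefault(f.getType(), "bytes"); // fallback for unmapped types
            proto.append("  ").append(type).append(' ')
                 .append(f.getName()).append(" = ").append(tag++).append(";\n");
        }
        return proto.append("}\n").toString();
    }

    // Tiny demo type, only for illustration.
    static class Point { int x; int y; String label; }

    public static void main(String[] args) {
        System.out.println(protoFor(Point.class));
    }
}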
I can see that there is a lot of work to do. I can also see that it may not work in some corner cases. Thus, I am wondering: is there a library already built that is capable of doing this?
Related
Is there a way to completely disable java deserialization?
Java deserialization as in java.io.ObjectInputStream potentially opens an application to security issues by deserializing so-called serialization gadgets.
I do not use Java serialization intentionally, but it is hard to make sure that no library that is trusted with some outside input will ever perform deserialization. For this reason I would love some kind of kill switch to disable serialization completely.
This is different from caching issues - I want to make sure no object is ever deserialized in my application, including through libraries.
A simple way to prevent deserialization is to define an aggressive deserialization filter (introduced in Java 9 via JEP 290).
For example, with java -Djdk.serialFilter=maxbytes=0 MyApp, any deserialization attempt (byte stream size > 0 bytes) will throw a java.io.InvalidClassException: filter status: REJECTED.
Or you could use maxrefs=0, or simply exclude all classes using a wildcard, i.e. java -Djdk.serialFilter=!* MyApp or java -Djdk.serialFilter='!*' MyApp on Unix where the "!" needs to be escaped.
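If a command-line flag is not an option, the same effect can be achieved programmatically with the java.io.ObjectInputFilter API from JEP 290. A minimal sketch (the class name DisableDeserialization is mine); note that the JVM-wide filter can only be set once:

import java.io.ObjectInputFilter;

public final class DisableDeserialization {
    public static void main(String[] args) {
        // Equivalent of -Djdk.serialFilter=!* : reject every class.
        ObjectInputFilter rejectAll = ObjectInputFilter.Config.createFilter("!*");
        ObjectInputFilter.Config.setSerialFilter(rejectAll); // can only be set once per JVM
        // ... start the real application; any ObjectInputStream.readObject()
        // will now fail with InvalidClassException: filter status: REJECTED
    }
}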
You can use a Java agent to do that. Try this one. Also, a nice read is this blog post discussing more on the topic of disabling deserialization.
I have noticed that in my web-based project we implement Serializable in every DTO class, yet we do not use ObjectOutputStream/ObjectInputStream anywhere in the project, while every serialization tutorial uses ObjectOutputStream/ObjectInputStream. Does serialization happen even without them, i.e. stream conversion and sending over the network without using ObjectOutputStream/ObjectInputStream?
Does serialization happen even without them, i.e. stream conversion and sending over the network without using ObjectOutputStream/ObjectInputStream?
First of all, Serialization doesn't necessarily have anything to do with a network (or a temp file as per your original question).
Secondly, Java Object Serialisation by definition involves java.io.Serializable and java.io.ObjectOutputStream.
Thirdly, there are other things besides your own code executing in any application. The JRE classes, for a start. It is open to any of those to use serialization. For example (and please note that this is a list of examples, without the slightest pretension to being exhaustive):
RMI
Serialization of sessions by web containers
EJB, which is built on RMI
Object messages in JMS
...
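To illustrate the second point with a minimal, self-contained example (class names are mine): implementing java.io.Serializable only marks a class as serializable; bytes are actually produced when something, whether your own code, RMI, a web container or JMS, writes the object to an ObjectOutputStream.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationDemo {

    // A typical DTO: marked Serializable, but never serialized by itself.
    static class UserDto implements Serializable {
        private static final long serialVersionUID = 1L;
        String name = "alice";
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new UserDto()); // serialization happens here, and only here
        }
        System.out.println("serialized " + bytes.size() + " bytes");
    }
}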
I am new to the project, and I am trying to create a connector between Dataflow and a database.
The documentation clearly states that I should use a Source and a Sink, but I see a lot of people directly using a PTransform associated with a PInput or a PDone.
The Source/Sink API is experimental (which explains all the examples using PTransform), but it seems easier to integrate with a custom runner (e.g. Spark).
If I refer to the code, both approaches are used, and I cannot see any use case where it would be more interesting to use the PTransform API.
Is the Source/Sink API supposed to replace the PTransform API?
Did I miss something that clearly differentiates the two approaches?
Is the Source/Sink API stable enough to be considered the right way to code inputs and outputs?
Thanks for your advice!
The philosophy of Dataflow is that PTransform is the main unit of abstraction and composability, i.e., any self-contained data processing task should be encapsulated as a PTransform. This includes the task of connecting to a third-party storage system: ingesting data from somewhere or exporting it to somewhere.
Take, for example, Google Cloud Datastore. In the code snippet:
PCollection<Entity> entities =
p.apply(DatastoreIO.readFrom(dataset, query));
...
p.apply(some processing)
.apply(DatastoreIO.writeTo(dataset));
the return type of DatastoreIO.readFrom(dataset, query) is a subclass of PTransform<PBegin, PCollection<Entity>>, and the type of DatastoreIO.writeTo(dataset) is a subclass of PTransform<PCollection<Entity>, PDone>.
It is true that these functions are under the hood implemented using the Source and Sink classes, but to a user who just wants to read or write something to Datastore, that's an implementation detail that usually should not matter (however, see the note at the end of this answer about exposing the Source or Sink class). Any connector, or for that matter, any other data processing task is a PTransform.
Note: Currently connectors that read from somewhere tend to be PTransform<PBegin, PCollection<T>>, and connectors that write to somewhere tend to be PTransform<PCollection<T>, PDone>, but we are considering options to make it easier to use connectors in more flexible ways (for example, reading from a PCollection of filenames).
However, of course, this detail matters to somebody who wants to implement a new connector. In particular, you may ask:
Q: Why do I need the Source and Sink classes at all, if I could just implement my connector as a PTransform?
A: If you can implement your connector by just using the built-in transforms (such as ParDo, GroupByKey etc.), that's a perfectly valid way to develop a connector. However, the Source and Sink classes provide some low-level capabilities that, in case you need them, would be cumbersome or impossible to develop yourself.
For example, BoundedSource and UnboundedSource provide hooks for controlling how parallelization happens (both initial and dynamic work rebalancing - BoundedSource.splitIntoBundles, BoundedReader.splitAtFraction), while these hooks are not currently exposed for arbitrary DoFns.
You could technically implement a parser for a file format by writing a DoFn<FilePath, SomeRecord> that takes the filename as input, reads the file and emits SomeRecord, but this DoFn would not be able to dynamically parallelize reading parts of the file onto multiple workers in case the file turned out to be very large at runtime. On the other hand, FileBasedSource has this capability built-in, as well as handling of glob filepatterns and such.
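For concreteness, such a naive DoFn might look like the sketch below, assuming the pre-Beam Dataflow Java SDK (com.google.cloud.dataflow.sdk), where processElement is overridden directly; the class name ReadWholeFileFn is hypothetical. It works, but the runner has no way to split a single large file across workers:

import com.google.cloud.dataflow.sdk.transforms.DoFn;
import java.nio.file.Files;
import java.nio.file.Paths;

// Takes a filename as input and emits the file's lines.
// Unlike FileBasedSource, one big file cannot be read in parallel.
public class ReadWholeFileFn extends DoFn<String, String> {
  @Override
  public void processElement(ProcessContext c) throws Exception {
    for (String line : Files.readAllLines(Paths.get(c.element()))) {
      c.output(line);
    }
  }
}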
Likewise, you could try implementing a connector to a streaming system by implementing a DoFn that takes a dummy element as input, establishes a connection and streams all elements into ProcessingContext.output(), but DoFns currently don't support writing unbounded amounts of output from a single bundle, nor do they explicitly support the checkpointing and deduplication machinery needed for the strong consistency guarantees Dataflow gives to streaming pipelines. UnboundedSource, on the other hand, supports all this.
Sink (more precisely, the Write.to() PTransform) is also interesting: it is just a composite transform that you could write yourself if you wanted to (i.e. it has no hard-coded support in the Dataflow runner or backend), but it was developed with consideration for typical distributed fault tolerance issues that arise when writing data to a storage system in parallel, and it provides hooks that force you to keep those issues in mind: e.g., because bundles of data are written in parallel, and some bundles may be retried or duplicated for fault tolerance, there is a hook for "committing" just the results of the successfully completed bundles (WriteOperation.finalize).
To summarize: using Source or Sink APIs to develop a connector helps you structure your code in a way that will work well in a distributed processing setting, and the source APIs give you access to advanced capabilities of the framework. But if your connector is a very simple one that needs neither, then you are free to just assemble your connector from other built-in transforms.
Q: Suppose I decide to make use of Source and Sink. Then how do I package my connector as a library: should I just provide the Source or Sink class, or should I wrap it into a PTransform?
A: Your connector should ultimately be packaged as a PTransform, so that the user can just p.apply() it in their pipeline. However, under the hood your transform can use Source and Sink classes.
A common pattern is to expose the Source and Sink classes as well, making use of the Fluent Builder pattern, and letting the user wrap them into a Read.from() or Write.to() transform themselves, but this is not a strict requirement.
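A rough sketch of that packaging pattern, again assuming the pre-Beam Dataflow SDK (in Apache Beam the overridden method is expand() rather than apply()); ReadFromMyDb is a hypothetical name and the BoundedSource implementation itself is left out:

import com.google.cloud.dataflow.sdk.io.BoundedSource;
import com.google.cloud.dataflow.sdk.io.Read;
import com.google.cloud.dataflow.sdk.transforms.PTransform;
import com.google.cloud.dataflow.sdk.values.PBegin;
import com.google.cloud.dataflow.sdk.values.PCollection;

// The user only sees a PTransform they can p.apply(); the Source that does
// the actual reading stays an implementation detail of the transform.
public class ReadFromMyDb<T> extends PTransform<PBegin, PCollection<T>> {
  private final BoundedSource<T> source; // your BoundedSource implementation

  public ReadFromMyDb(BoundedSource<T> source) {
    this.source = source;
  }

  @Override
  public PCollection<T> apply(PBegin input) {
    // Delegate to the built-in Read transform.
    return input.apply(Read.from(source));
  }
}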
I'm currently stuck between two options:
1) Store the object's information in the file.xml that is returned to my application at initialization, to be displayed when the GUI is loaded, and then perform asynchronous calls to my backend whenever the object is edited via the GUI (saving to the file.xml in the process).
-or-
2) Make the whole thing asynchronous, so that when my custom object is brought up for editing by the end user, it queries the backend for the object, returns the XML to be displayed in the GUI, and then makes another asynchronous call if something was changed.
Either way, I see many cons to both of these approaches. I really only need one representation of the object (on the backend) and would rather not manage the front-end version of the object, as well as the conversion of my object to an XML representation and then breaking that out into another object on the Flex front end to be used in datagrids.
Is there a better way to do this that allows me to only manage my backend java object and create the interface to it on the front-end without worrying about the asynchronous nature of it and multiple representations of the same object?
You should look at Granite Data Services: http://www.graniteds.org. If you are using Hibernate, it should be your first choice, as BlazeDS is not as advanced. Granite implements a great facade in Flex for accessing backend Java objects, with custom serialization in AMF, support for lazy loading, an entity cache on the Flex side, and bean validation. Globally, it is a top-down approach with generation of AS3 classes from your Java classes.
If you need real-time features you can push data changes to the Flex client (Gravity module) and resolve conflicts on the front side, or implement conflict resolvers on the backend.
Still, you will eventually have to deal with advanced conflicts (with some "deprecated" Flex objects to reconcile on the server: you don't want to deal with that). A basic safeguard, for instance, is to add a version field and automatically reject manipulation of stale objects on the backend (there are many ways to do that); you will then have to implement a custom way for the Flex client to update itself to the current state, implying that some work could be dropped (data lost) on the Flex client.
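As an illustration of that version-field safeguard, assuming the backend entities are mapped with JPA/Hibernate (the Document entity below is purely hypothetical): a @Version column makes the backend reject stale updates automatically with an OptimisticLockException.

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;

@Entity
public class Document {

    @Id
    private Long id;

    // Incremented on every successful update; an update carrying an old
    // version number is rejected by the backend automatically.
    @Version
    private long version;

    private String title;
}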
If not many people work on the same objects in your Flex application, such conflicts will not happen a lot, much as in a distributed VCS.
Depending on your real-time needs (what is the frequency of changes to your Java object? This is the most important question), you can either "cache" changes on the Flex side and then update the whole thing at once (but you'll get troublesome conflicts if changes have happened in the meantime), or check the server side every time (Granite enables this), which yields fewer and simpler conflicts but probably more code to synchronize objects and more network traffic.
I have a post-compilation step that manipulates the Java bytecode of generated classes. I'd like to make life as painless as possible for library consumers, so I'm looking at ways I can make this process automatic and (if possible) compiler agnostic.
The Annotation Processing API provides many of the desired features (automatic service discovery; supported by Eclipse). Unfortunately, this is aimed at code generators and doesn't support manipulation of existing artefacts:
The initial inputs to the tool are considered to be created by the zeroth round; therefore, attempting to create a source or class file corresponding to one of those inputs will result in a FilerException.
The Decorator pattern recommended by the API is not an option.
I can see how to perform the step with a runtime agent/instrumentation, but this is a worse option than a manual build step as it would require anyone even peripherally touched by the API to configure their JVMs in a non-obvious manner.
Is there a way to plug into or wrap the compiler tool as invoked by javac? Has anyone successfully subverted the annotation processors to manipulate bytecode, no matter what the doc says?
The Groovy compiler is the only bytecode compiler which allows you to hook into the compilation process (example: generating bytecode to support the Singleton pattern).
The Annotation Processing API is not meant to change the code. As you have already found out, all you can do is install a classloader, examine the bytecode at runtime and manipulate it. It's braindead but it works. This follows the general "we're afraid that a developer could try something stupid" theme which you will find throughout Java. There is no way to extend javac. The relevant classes are either private, final or will change with the next version of Java.
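For reference, the runtime agent/instrumentation route mentioned in the question boils down to something like the sketch below using java.lang.instrument; RewritingAgent is a hypothetical name, the actual rewriting is left out, and the jar must declare Premain-Class in its manifest and be passed via -javaagent, which is exactly the non-obvious JVM configuration the question wants to avoid.

import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Packaged in a jar whose manifest contains "Premain-Class: RewritingAgent"
// and loaded with -javaagent:rewriting-agent.jar.
public class RewritingAgent {
    public static void premain(String args, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                                    Class<?> classBeingRedefined,
                                    ProtectionDomain protectionDomain,
                                    byte[] classfileBuffer) {
                // Inspect/rewrite the bytecode here (e.g. with ASM) and return
                // the new bytes, or return null to leave the class unchanged.
                return null;
            }
        });
    }
}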
Another option is to write annotated Java, for example you write a class "ExampleTpl.java". Then, you use a precompiler which expands the annotations in that file to get "Example.java". In the rest of the code, you use Example and ignore ExampleTpl.
For Eclipse, there is a bug report to automate this step. I'm not aware of any other work in this area.
It can be done.
Take a look at my blog post Roman Numerals, in our Java, where an annotation processor is used to rewrite code. The limitation is that it works with Sun's javac only.