Source Vs PTransform - java

I am new to the project, and I am trying to create a connector between Dataflow and a database.
The documentation clearly states that I should use a Source and a Sink, but I see a lot of people directly using a PTransform associated with a PInput or a PDone.
The Source/Sink API is still experimental (which explains why all the examples use PTransform), but it seems easier to integrate with a custom runner (e.g. Spark).
Looking at the code, both approaches are used, and I cannot see a use case where the PTransform API would be the more interesting choice.
Is the Source/Sink API supposed to replace the PTransform API?
Did I miss something that clearly differentiates the two approaches?
Is the Source/Sink API stable enough to be considered the right way to write inputs and outputs?
Thanks for your advice!

The philosophy of Dataflow is that PTransform is the main unit of abstraction and composability, i.e., any self-contained data processing task should be encapsulated as a PTransform. This includes the task of connecting to a third-party storage system: ingesting data from somewhere or exporting it to somewhere.
Take, for example, Google Cloud Datastore. In the code snippet:
PCollection<Entity> entities =
    p.apply(DatastoreIO.readFrom(dataset, query));
...
p.apply(some processing)
 .apply(DatastoreIO.writeTo(dataset));
the return type of DatastoreIO.readFrom(dataset, query) is a subclass of PTransform<PBegin, PCollection<Entity>>, and the type of DatastoreIO.writeTo(dataset) is a subclass of PTransform<PCollection<Entity>, PDone>.
It is true that these functions are under the hood implemented using the Source and Sink classes, but to a user who just wants to read or write something to Datastore, that's an implementation detail that usually should not matter (however, see the note at the end of this answer about exposing the Source or Sink class). Any connector, or for that matter, any other data processing task is a PTransform.
Note: Currently connectors that read from somewhere tend to be PTransform<PBegin, PCollection<T>>, and connectors that write to somewhere tend to be PTransform<PCollection<T>, PDone>, but we are considering options to make it easier to use connectors in more flexible ways (for example, reading from a PCollection of filenames).
However, of course, this detail matters to somebody who wants to implement a new connector. In particular, you may ask:
Q: Why do I need the Source and Sink classes at all, if I could just implement my connector as a PTransform?
A: If you can implement your connector by just using the built-in transforms (such as ParDo, GroupByKey etc.), that's a perfectly valid way to develop a connector. However, the Source and Sink classes provide some low-level capabilities that, in case you need them, would be cumbersome or impossible to develop yourself.
For example, BoundedSource and UnboundedSource provide hooks for controlling how parallelization happens (both initial and dynamic work rebalancing - BoundedSource.splitIntoBundles, BoundedReader.splitAtFraction), while these hooks are not currently exposed for arbitrary DoFns.
You could technically implement a parser for a file format by writing a DoFn<FilePath, SomeRecord> that takes the filename as input, reads the file and emits SomeRecord, but this DoFn would not be able to dynamically parallelize reading parts of the file onto multiple workers in case the file turned out to be very large at runtime. On the other hand, FileBasedSource has this capability built-in, as well as handling of glob filepatterns and such.
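For concreteness, here is a minimal sketch of that naive DoFn-based parser, using the Dataflow 1.x SDK; SomeRecord, its parse method, and local-file access via java.nio are assumptions for illustration only:

```java
import com.google.cloud.dataflow.sdk.transforms.DoFn;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class ParseFileFn extends DoFn<String, SomeRecord> {
  @Override
  public void processElement(ProcessContext c) throws Exception {
    String path = c.element();
    // The whole file is read by a single worker; unlike FileBasedSource,
    // this work cannot be split dynamically if the file turns out to be huge.
    for (String line : Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8)) {
      c.output(SomeRecord.parse(line)); // hypothetical record type and parser
    }
  }
}
```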
Likewise, you could try implementing a connector to a streaming system by writing a DoFn that takes a dummy element as input, establishes a connection, and streams all elements into ProcessContext.output(), but DoFns currently don't support writing unbounded amounts of output from a single bundle, nor do they explicitly support the checkpointing and deduplication machinery needed for the strong consistency guarantees Dataflow gives to streaming pipelines. UnboundedSource, on the other hand, supports all of this.
Sink (more precisely, the Write.to() PTransform) is also interesting: it is just a composite transform that you could write yourself if you wanted to (i.e. it has no hard-coded support in the Dataflow runner or backend), but it was developed with consideration for typical distributed fault tolerance issues that arise when writing data to a storage system in parallel, and it provides hooks that force you to keep those issues in mind: e.g., because bundles of data are written in parallel, and some bundles may be retried or duplicated for fault tolerance, there is a hook for "committing" just the results of the successfully completed bundles (WriteOperation.finalize).
To summarize: using Source or Sink APIs to develop a connector helps you structure your code in a way that will work well in a distributed processing setting, and the source APIs give you access to advanced capabilities of the framework. But if your connector is a very simple one that needs neither, then you are free to just assemble your connector from other built-in transforms.
Q: Suppose I decide to make use of Source and Sink. Then how do I package my connector as a library: should I just provide the Source or Sink class, or should I wrap it into a PTransform?
A: Your connector should ultimately be packaged as a PTransform, so that the user can just p.apply() it in their pipeline. However, under the hood your transform can use Source and Sink classes.
A common pattern is to expose the Source and Sink classes as well, making use of the Fluent Builder pattern, and letting the user wrap them into a Read.from() or Write.to() transform themselves, but this is not a strict requirement.
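As a rough illustration of that packaging pattern, here is a sketch of a connector that exposes a PTransform while delegating to a Source under the hood (MyIO, MySource, and MyRecord are hypothetical names; the built-in Read.from() transform is the one mentioned above):

```java
import com.google.cloud.dataflow.sdk.io.Read;
import com.google.cloud.dataflow.sdk.transforms.PTransform;
import com.google.cloud.dataflow.sdk.values.PBegin;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class MyIO {
  // Fluent-style entry point: p.apply(MyIO.read("my_table"))
  public static ReadTransform read(String table) {
    return new ReadTransform(table);
  }

  public static class ReadTransform extends PTransform<PBegin, PCollection<MyRecord>> {
    private final String table;

    private ReadTransform(String table) {
      this.table = table;
    }

    @Override
    public PCollection<MyRecord> apply(PBegin input) {
      // Under the hood, delegate to the Source via the built-in Read transform.
      return input.apply(Read.from(new MySource(table))); // MySource extends BoundedSource<MyRecord>
    }
  }
}
```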

Related

Is there a mature Java Workflow Engine for BPM backed by NoSQL?

I am researching how to build a general application or microservice to enable building workflow-centric applications. I have done some research on frameworks (see below), and the most promising candidates share a hard reliance upon RDBMSes to store workflow and process state, combined with JPA-annotated entities. In my opinion, this limits the possibility of designing a general, data-driven workflow microservice. It seems that a truly general workflow system could be built upon NoSQL solutions like MongoDB or Cassandra by storing data objects and rules in JSON or XML. These would allow executing code to enforce types or schemas while using one or two simple Java objects to retrieve and save entities. As I see it, this could enable a single application to be deployed as a Controller for different domains' Model-View pairs without modification (admittedly given a very clever interface).
I have tried to find a workflow engine/BPM framework that supports NoSQL backends. The closest I have found is Activiti-Neo4J, which appears to be an abandoned project enabling a connector between Activiti and Neo4j.
Is there a Java workflow engine/BPM framework that supports NoSQL backends and generalizes data objects without requiring specific POJO entities?
If I were to give up on my ideal, magically general solution, I would probably choose a framework like jBPM or Activiti since they have great feature sets and are mature. In trying to find other candidates, I have found a veritable graveyard of abandoned projects like this one on Java-Source.net.
Yes, Temporal Workflow has pluggable persistence and runs on Cassandra as well as on SQL databases. It has been tested with up to 100 Cassandra nodes and can support tens of thousands of events per second and hundreds of millions of open workflows.
It lets you model your workflow logic as plain old Java classes and ensures that the code is fully fault tolerant and durable across all sorts of failures, including local variables and threads.
See this presentation, which goes into more detail about the programming model.
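For a flavor of what "plain old Java classes" means here, a minimal sketch using the Temporal Java SDK (the workflow name and its body are just placeholders):

```java
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

import java.time.Duration;

@WorkflowInterface
public interface OrderWorkflow {
  @WorkflowMethod
  void processOrder(String orderId);
}

// The implementation is ordinary Java; local variables, loops, and sleeps are
// made durable by the engine, so the logic survives process restarts.
class OrderWorkflowImpl implements OrderWorkflow {
  @Override
  public void processOrder(String orderId) {
    int attempts = 0;                     // durable local state
    Workflow.sleep(Duration.ofHours(1));  // durable timer, not Thread.sleep
    attempts++;
  }
}
```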
I think the reason workflow engines are often based on an RDBMS is not the database schema but rather the need for a transaction-safe data store.
Transactional robustness is an important factor for workflow engines, especially for long-running or nested transactions which are typical for complex workflows.
So maybe this is one reason why most engines (like Activiti) did not focus on a data-driven approach. (I am not talking about data replication here, which is covered by NoSQL databases in most cases.)
If you take a look at the Imixs-Workflow project you will find a different approach based on Java Enterprise. This engine uses a generic data object which can consume any kind of serializable data values. The problem of data retrieval is solved with Lucene search technology: each object is translated into a virtual document with name/value pairs for each item. This makes it easy to search through the processed business data as well as to query structured workflow data like the status information or the process owners. So this is one possible solution.
Apart from that, you always have the option to store your business data in a NoSQL database. This is independent of the workflow data of a running process instance, as long as you link the two objects together.
Going back to the aspect of transactional robustness, it's a good idea to store the reference to your NoSQL data store in the process instance, which is transaction aware. Also take a look here.
So the only problem you can run into is the fact that it's very hard to synchronize a transaction context from EJB/JPA to an 'external' NoSQL database. For example: what will you do when your data has been successfully saved into your NoSQL data store (e.g. Cassandra), but the transaction of the workflow engine fails and a rollback is triggered?
The designers of the Activiti project were also aware of the problem you have stated, but knew it would be quite a rewrite to implement such flexibility, which, arguably, should have been designed into the project from the beginning. As you'll see in the link below, the problem has been a lack of interfaces toward which to code implementations other than the relational-database one. With version 6 they went ahead, ripped off the bandaid, and refactored the framework with a set of interfaces for which different implementations (think Neo4j, MongoDB, or whatever other persistence technology you fancy) can be written and plugged in.
In the linked article below, they provide some code examples for a simple in-memory implementation of those interfaces. It looks pretty cool and sounds like it may be precisely what you're looking for.
https://www.javacodegeeks.com/2015/09/pluggable-persistence-in-activiti-6.html

Serializing arbitrary objects with Protobuf in Java

I want to provide communication between many JVMs using protobuf. Those JVMs run a component-based middleware, so there are arbitrary objects that I cannot anticipate because they are written by third-party developers.
The problem is that I want to free component developers from the burden of specifying the serialization mechanism. I think this decision has some advantages:
There are legacy components that were written without a specific serialization mechanism in mind (in fact, they use built-in Java serialization)
If the channel manages the encoding/decoding of messages then you can connect any pair of components
It is easier to write components.
However, the only way of doing automatic serialization is Java's built-in serialization, which, as we all know, is very slow. So my question is: can we create a mechanism that, given a Java object, builds a protobuf message with its content that we can send to another process?
I am aware that this is not the way you should use protobuf and I can see some problems. Let me first explain how I think we can achieve my goal.
1. If an object O of class C has never been serialized, go to step 2; otherwise, we already have a message class for this class and we can go to step 7.
2. Build a .proto specification using reflection on class C, the way the built-in serialization does.
3. Generate the message class using protoc.
4. Compile the generated class with the Java compiler.
5. Generate a class on the fly, using ASM for bytecode manipulation, that transforms O into a message we can send; it also performs the opposite transformation.
6. Cache all the classes generated for objects of class C.
7. Use the class generated in step 5 to create a message.
8. Send the message with whatever mechanism the channel supports (e.g. sockets, shared memory).
Note 1: You can see that we are doing this on one side of the communication channel; we need to do it on both sides. I think it is possible to send the first message using built-in serialization (using the first object to build the protobuf message) and further objects with protobuf.
Note 2: Step 5 is not required, but it is useful to avoid reflection every time you send an object.
Note 3: Protobuf is not mandatory here. I am including it because maybe it offers some tool to deal with the problem I have.
I can see that there is a lot of work to do. I can also see that maybe it won't work in some corner cases. Thus, I am wondering: is there a library already built that is capable of doing this?
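As a rough illustration of steps 2 and 7 above, protobuf's descriptor and DynamicMessage APIs can build a message type from a class at runtime without running protoc or generating bytecode; a minimal sketch, assuming only flat classes whose fields are int or String:

```java
import com.google.protobuf.DescriptorProtos.DescriptorProto;
import com.google.protobuf.DescriptorProtos.FieldDescriptorProto;
import com.google.protobuf.DescriptorProtos.FileDescriptorProto;
import com.google.protobuf.Descriptors.Descriptor;
import com.google.protobuf.Descriptors.FileDescriptor;
import com.google.protobuf.DynamicMessage;

import java.lang.reflect.Field;

public class ReflectiveProto {

  // Step 2: derive a message descriptor from the class via reflection.
  static Descriptor descriptorFor(Class<?> clazz) throws Exception {
    DescriptorProto.Builder message =
        DescriptorProto.newBuilder().setName(clazz.getSimpleName());
    int number = 1;
    for (Field field : clazz.getDeclaredFields()) {
      FieldDescriptorProto.Type type =
          field.getType() == int.class
              ? FieldDescriptorProto.Type.TYPE_INT32
              : FieldDescriptorProto.Type.TYPE_STRING; // only int/String handled in this sketch
      message.addField(FieldDescriptorProto.newBuilder()
          .setName(field.getName())
          .setNumber(number++)
          .setLabel(FieldDescriptorProto.Label.LABEL_OPTIONAL)
          .setType(type));
    }
    FileDescriptorProto file = FileDescriptorProto.newBuilder()
        .setName(clazz.getSimpleName() + ".proto")
        .addMessageType(message)
        .build();
    return FileDescriptor.buildFrom(file, new FileDescriptor[0])
        .findMessageTypeByName(clazz.getSimpleName());
  }

  // Step 7: copy the object's fields into a DynamicMessage and serialize it.
  static byte[] serialize(Object o, Descriptor descriptor) throws Exception {
    DynamicMessage.Builder builder = DynamicMessage.newBuilder(descriptor);
    for (Field field : o.getClass().getDeclaredFields()) {
      field.setAccessible(true);
      builder.setField(descriptor.findFieldByName(field.getName()), field.get(o));
    }
    return builder.build().toByteArray();
  }
}
```

The descriptor from descriptorFor() can be cached per class (step 6), and the receiving side can rebuild the same descriptor to parse the bytes back into a DynamicMessage.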

Sharing a java object across a cluster

My requirement is to share a Java object across a cluster.
I am unsure whether to write an EJB and share the Java objects across the cluster, or to use a third party such as Infinispan, Memcached, or Terracotta, or perhaps JCache,
with the following constraints:
I can't change any of my source code to be specific to an application server (such as implementing WebLogic's singleton services).
I can't offer two builds for clustered and non-clustered environments.
Performance should not be degraded.
I am looking only for open source third parties, if I need to use one.
It needs to work in WebLogic, WebSphere, JBoss, and Tomcat.
Can anyone come up with the best option with these constraints in mind?
It depends on the use case of the objects you want to share in the cluster.
I think it really comes down to the following options, from most complex to least complex:
Distributed caching
http://www.ehcache.org
Distributed caching is good if you need to ensure that an object is accessible from a cache on every node. I have used Ehcache to distribute quite successfully; there is no need to set up a Terracotta server unless you need the scale, as you can just point instances at each other via RMI. It also works synchronously or asynchronously depending on requirements. Cache replication is handy if nodes go down, so the cache is redundant and you don't lose anything. Good if you need to make sure that the object has been updated across all the nodes.
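For what it's worth, the Java side stays simple; a minimal sketch using the Ehcache 2.x API (the cache name and the RMI replication settings are assumed to live in ehcache.xml):

```java
import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class SharedCacheExample {
  public static void main(String[] args) {
    // Reads ehcache.xml from the classpath; replication (e.g. RMI peer
    // discovery) is configured there, not in code.
    CacheManager manager = CacheManager.create();
    Cache cache = manager.getCache("sharedObjects"); // assumed to be defined in ehcache.xml
    cache.put(new Element("key", "value"));
    Element hit = cache.get("key");
    System.out.println(hit == null ? "miss" : hit.getObjectValue());
    manager.shutdown();
  }
}
```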
Clustered Execution/data distribution
http://www.hazelcast.com/
Hazelcast is also a nice option, as it provides a way of executing Java classes across a cluster. This is more useful if you have an object that represents a unit of work that needs to be performed and you don't care so much where it gets executed.
It is also useful for distributed collections, e.g. a distributed map or queue.
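A minimal sketch of the distributed-map case (cluster discovery via Hazelcast's default configuration is assumed):

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

import java.util.Map;

public class HazelcastMapExample {
  public static void main(String[] args) {
    // Every JVM that starts an instance with compatible configuration joins the same cluster.
    HazelcastInstance hz = Hazelcast.newHazelcastInstance();
    // The map is partitioned across members; puts on one node are visible on the others.
    Map<String, String> shared = hz.getMap("shared-objects");
    shared.put("key", "value");
    System.out.println(shared.get("key"));
  }
}
```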
Roll your own RMI/JGroups
You can write your own client/server, but I think you will start to run into the issues that the bigger frameworks solve once the requirements of the objects you're dealing with get complex. Realistically, Hazelcast is simple enough that it should eliminate the need to roll your own.
It's not open source, but Oracle Coherence would easily solve this problem.
If you need an implementation of JCache, the only one that I'm aware of being available today is Oracle Coherence; see: http://docs.oracle.com/middleware/1213/coherence/develop-applications/jcache_part.htm
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.
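If JCache is the part you care about, the JSR-107 API itself is vendor-neutral; a minimal sketch that would run against any compliant provider on the classpath (cache name and types are placeholders):

```java
import javax.cache.Cache;
import javax.cache.CacheManager;
import javax.cache.Caching;
import javax.cache.configuration.MutableConfiguration;

public class JCacheExample {
  public static void main(String[] args) {
    // Picks up whichever JSR-107 provider is on the classpath.
    CacheManager manager = Caching.getCachingProvider().getCacheManager();
    Cache<String, String> cache = manager.createCache(
        "shared-objects", new MutableConfiguration<String, String>());
    cache.put("key", "value");
    System.out.println(cache.get("key"));
  }
}
```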
This is just an idea; you might want to check the exact implementation.
It will degrade performance, but I don't see how that can be avoided.
It is not easy to implement; maybe you should consider load balancing instead of clustering.
You might consider RMI and/or a dynamic proxy:
Extract an interface for your objects.
Use RMI to access the real object (from all cluster nodes, even the one that actually holds the object).
To add RMI on top of existing code you might use a dynamic proxy (again, not sure about the implementation); a minimal sketch follows this list.
A dynamic proxy can wrap any object and perform pre- and post-processing on each method invocation; in this case it might forward the call to the original object via RMI.
You will need connectivity between the cluster nodes in order to propagate the RMI stub.
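Here is that minimal dynamic-proxy sketch (the Counter interface is hypothetical, and the remote call is only indicated in comments):

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

// Hypothetical interface extracted from the object you want to share.
interface Counter {
  int increment();
}

public class ProxyExample {
  // Wraps a Counter so that every call goes through the handler, which could
  // forward it to a remote RMI stub instead of the local delegate.
  static Counter wrap(Counter delegate) {
    InvocationHandler handler = (proxy, method, args) -> {
      // Pre-invocation work (e.g. look up the RMI stub) would go here.
      Object result = method.invoke(delegate, args);
      // Post-invocation work (e.g. replicate state) would go here.
      return result;
    };
    return (Counter) Proxy.newProxyInstance(
        Counter.class.getClassLoader(), new Class<?>[] {Counter.class}, handler);
  }
}
```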

How to merge common parts of WSDL and XSD from different services?

I have to interact with a set of web services that each come with their own WSDL and XSD. The XSDs are sometimes merged into a single file, sometimes spread across multiple files (20-30). However, from experience I know that most of the message structures and data share a large common subset; perhaps only 20% differs among the different transactions.
Unfortunately I have no control over the server parts or the declaration of the services, so getting them fixed is out of the question. A first version of the client generated each service separately and then used them as individual facades to form a coherent higher-level service as an adapter for another system.
I used CXF with the default JAXB binding and imposed different generated packages for each service. I did this because most services use a common data model, but not all use the same version or customization, so I had conflicts and opted for the brute-force approach to get the system done.
However, this causes the memory requirements of the adapter to go through the roof, as each service loads its own context. Right now I have upwards of 500 MB of memory used just for the adapter that houses the service clients, even before I start sending requests and processing responses. Although I can run the system without problems in the current situation, this creates constraints that jeopardize the deployment of the solution; my client would like to reduce this dramatically (60% or more) so that the system can be installed alongside others without requiring hardware upgrades.
My question is as follows:
Is there a tool or technique that would allow me to put the common parts of each transaction together so that they can be generated once and referenced where needed?
I am not bound to CXF or JAXB, other than by the time required to refactor the system towards a different framework or data binding.
Thank you in advance for your help.
--- EDIT ---
Thank you Blaise. This points to a useful JAXB feature: episodes. Unfortunately I still need to extract the common base part of the different services. So what I now need is a means to extract these common parts through a structural diff, that is, a diff tool aware of the structure and type hierarchy the XSD describes, so that proper references can be put in place to connect the common sections with the specialized parts.
If you want to trim down a little, an alternative marshalling technology (in any framework) might do the trick: drop JAXB and try JiBX, which was added to the latest CXF release, or maybe just StAX.
Since you're looking to do something a little more custom than conventional JAX-WS/JAXB-style services, you may want to consider Spring-WS.
Spring-WS gives you control over all aspects of the web services stack. It can route messages in different ways (payload, XPath expressions, etc.), and you can use any marshalling/serialization technology you want (JiBX, JDOM, SAX, etc.).
Here is a table that illustrates the options:
http://static.springsource.org/spring-ws/sites/2.0/reference/html/server.html#d4e1062
If you really want to get fancy, you can take one of the lower-level APIs, start marshalling the message, and once you hit critical mass for one of your common areas, start a JAXB marshal right on the spot.
The ability to route messages to different 'endpoints' (in Spring-WS terms) means you can also do things like "accept any message" on one interface (exposed as DOM/SAX/etc.) and then have one big marshalling operation there.
The key thing Spring-WS buys you here is breaking out of the JAX-WS mold: play a little up-front game, and you can always marshal back to JAXB later, whether in interceptors, your app, etc. In theory you can do the same with a JAXB DOMSource, but in my opinion the Spring-WS stack gives you the finest-grained control for special situations like the one you have here.
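To make the routing idea concrete, a minimal sketch of a Spring-WS endpoint that accepts a raw DOM payload, routed by its root element (namespace and element names are placeholders):

```java
import org.springframework.ws.server.endpoint.annotation.Endpoint;
import org.springframework.ws.server.endpoint.annotation.PayloadRoot;
import org.springframework.ws.server.endpoint.annotation.RequestPayload;
import org.springframework.ws.server.endpoint.annotation.ResponsePayload;
import org.w3c.dom.Element;

@Endpoint
public class RawOrderEndpoint {

  // Routed purely on the payload's root element, not on a generated JAXB class.
  @PayloadRoot(namespace = "http://example.com/orders", localPart = "OrderRequest")
  @ResponsePayload
  public Element handleOrder(@RequestPayload Element request) {
    // Work with the raw DOM here; marshal into JAXB only for the parts you need.
    return request;
  }
}
```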
The best trick is to serve a static WSDL. Just open the WSDL, save it, upload it to the server, and point the client to the static one instead of the dynamically self-generated one.

Share file storage index with multiple open applications in Java

I'm writing an HTTP Cache library for Java, and I'm trying to use that library in the same application which is started twice. I want to be able to share the cache between those instances.
What is the best solution for this? I also want to be able to write to that same storage, and it should be available for both instances.
Now I have a memory-based index of the files available to the cache, and this is not shareable over multiple VMs. It is serialized between startups, but this won't work for a shared cache.
According to the HTTP spec, I can't just map files to URIs, as there might be variations of the same payload based on the request. I might, for instance, have a request that varies on the 'Accept-Language' header: in that case I would have a different file for each request that specifies a different language.
Any ideas?
First, are you sure you want to write your own cache when there are several around? Things like:
Ehcache
JBoss Cache
memcached
The first two are written in Java and the third can be accessed from Java. The first two also handle distributed caching, which I think is the general case of what you are asking for. When they start up, they look to connect to other members so that they maintain a consistent cache across instances; changes to one are reflected across instances. They can be set up to connect via multicast or with an explicit list of servers.
Memcached typically works in a slightly different manner in that it is running externally to the Java processes you are running, so that all Java instances that start up will be talking to a common service. You can set up memcached to work in a distributed manner, but it does so by hashing keys so that the server you want to connect to can be determined by what it is you are looking for.
Doing a true distributed cache with consistent content is very hard to do well, which is why I suggest looking at an existing library. If you want to do it yourself, it would still help to look at those listed to see how they go about it and consider using something like JGroups as your underlying mechanism.
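If you end up putting the shared index in memcached, the client side is small; a minimal sketch using the spymemcached client (host, port, and the key scheme are assumptions):

```java
import net.spy.memcached.MemcachedClient;

import java.net.InetSocketAddress;

public class CacheIndexClient {
  public static void main(String[] args) throws Exception {
    // Both application instances talk to the same external memcached daemon.
    MemcachedClient client = new MemcachedClient(new InetSocketAddress("localhost", 11211));
    // A key could combine the request URI with the values of the varying headers.
    client.set("GET /page accept-language=en", 3600, "/cache/files/abc123").get();
    Object path = client.get("GET /page accept-language=en");
    System.out.println(path);
    client.shutdown();
  }
}
```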
I think you should have a look at the WebDAV specification. It's an HTTP extension for sharing/editing/storing/versioning resources on a server. There is an implementation as an Apache module, which lets you get started quickly.
So instead of writing your own cache server, you might be better off with a local Apache + mod_dav instance that is available to both of your applications.
Extra bonus: since WebDAV is a specified protocol, you get interoperability with lots of tools for free.
