A Scalable Architecture for Reconstructing events

A Scalable Architecture for Reconstructing events - java

I have been tasked to develop the architecture for a data transformation pipeline.
Essentially, data comes in at one end and is routed through various internal systems acquiring different forms before ending up in its destination.
The main objectives are -
Fault Tolerant. The message should be recoverable if one of the intermediate systems were down.
Replay/ Resequence - The message can be replayed from any stage and it should be possible to recreate the events in an idempotent manner.
I have a few custom solutions in mind to address
Implement a checkpoint system where a message can be logged at both entry and exit points at each checkpoint so we know where failure happens.
Implement a recovery mechanism that can go to the logged storage ( database, log file etc.. ) and reconstruct events programmatically.
However, I have a feeling this is a fairly standard problem with well defined solutions.
So, I would welcome any thoughts on a suitable architecture to go with, any tools/packages/patterns to refer to etc..
Thanks

Akka is obvious choice. Of course Scala version is more powerful, but even with Java bindings you can achieve a lot.
I think you can follow CQRS approach and use Akka Persistence module. In this case it's easy to replay any sequence of events, because you always have a persistent journal.
Generally Actor Model provides you fault-tolerance using supervision.
Akka Clustering will give you scalability you need.
Really awesome example of using Akka Clustering with Akka Persistence and Cassandra - https://github.com/boldradius/akka-dddd-template (only Scala unfortunately).

One common solution is JMS, where a central component (the JMS Broker) keeps a transactional store of pending messages. Because it does nothing other than that, it can have a high uptime (uptime can further be increased with a failover cluster, in which case you'll likely its persistence store to be a failover cluster, too).
Sending a JMS message can be made transactional, as can consuming a message. These transaction can be synchronized with database transactions through XA-transactions, which does its utmost get as close to exactly-once delivery as possible, but is rather heavy machinery.
In many cases (idempotent receiver), at-least-once delivery is sufficient. This can be accomplished by sending the message with a synchronous transaction (that is, the sender only succeeds once the broker has acknowledged receipt of the message), and consuming a message only after it has been processed.

Related

Asynchronous Message-Passing and Microservices

I am planning the develop of a microservice based architecture application and I decided to use kafka for the internal communicaton while I was reading the book Microservice Architecture by Ronnie Mitra; Matt McLarty; Mike Amundsen; Irakli Nadareishvili where they said:
letting microservices directly interact with message brokers (such as
RabbitMQ, etc.) is rarely a good idea. If two microservices are
directly communicating via a message-queue channel, they are sharing a
data space (the channel) and we have already talked, at length, about
the evils of two microservices sharing a data space. Instead, what we
can do is encapsulate message-passing behind an independent
microservice that can provide message-passing capability, in a loosely
coupled way, to all interested microservices.
I am using Netflix Eureka for Service registration and discovery, Zuul as edge server and Hystrix.
Said so, in practice, how can I implement that kind of microservice? How can I make my microservices indipendent from the communcation channel ( in this case Kafka)?
Actually I'm directly interacting with the channel, so I don't have an extra layer between my publishers/subscribers and kafka.
UPDATE 06/02/2018
to be more precise, we have a couple of microservices: one is publishing news on a topic (activemq, kafka...) and the other microservice is subscribed on that topic and doing some operations on the messages that are coming through. So we have these services that are coupled to the message broker (to the channel)... we have the the message broker's apis "embedeed" on our code and for example, if we want to change the message broker we have to change all the microservices that made use of the message broker's api. So, they are suggesting to use a microservice(in the picture I assume is the Events Hub) that is the "dispatcher" of the various messages. In this way it is the only component that interacts with the channel.

A general foreword - Don't do it if you don't need it. Introducing a queue system can be a big improvement if you are dealing with high number of events and events backing up issues etc. But if you don't face any issues you are probably better off with the lower complexity of a direct service communication.
Back to your question - It sounds like you want to abstract your communication with the queue because you are worried about the effort for replacing the queue with a different system - Is that correct?
In this case you can either do what you proposed - Develop a new service in the middle. This comes with all the baggage of a physical service (including deployment, scaling, etc).
Or the second alternative is to write a client library that abstracts the queue the way you want and allows you to reuse it in all services requiring to participate in the queue. This way you don't have to physically deploy another service for this purpose but you are still in full control of what your interface to the queue should look like and you have a single piece of code to incorporate changes (at least toward the direction of the queue). This would work given you are sure the app-facing side of the library can be stable enough.
But, again, don't do any of those in the first iteration when you are not sure you need all the complexity. (Over-engineering is a dangerous thing)

You should create a Interface lets say "Queue" which provide all functionalities which you want from Kafka or RabbitMQ, the create diff. impl like KafkaQueue and RabbitMQQueue of the Queue interface and inject the right impl which you want to use in your system.
In this your if new queue system is used , your existing code will not be changed
Creating another microservice is an extra overhead in this case

In a service architecture proper way to make your code independent out of constraints of communication channel is by having properly modeled self-sufficient messages. Historic examples would be WSDL in document mode, EDIFACT, HATEOAS etc. From this point of view microservices with spring-boot and kafka are just different implementation of same old thing done since mainframes ruled the world.
Essentially if you take a view of your app as blackbox asynchronous server; everything app does is receives events and produces new ones. It should not matter how events are raised within app. Http requests, xml within jms messages, json in kafka, whatever - all those things are just a way to pass events and business layer of application should respond only to a content of events.
So business layer is usually structured around some custom model/domain which are delivered as payload. Business layer is invoked/triggered by listener/producer layer which talks to communcation channel (kafka listener, http listener etc..). Aside from logging and enforcing security you should not have communication channel logic in app. I have seen unfortunate examples of business logic driven by by originating jms connection or parsing url of request. If you ever have this in your code you have failed to properly structure your code.
However that is easier to say than to implement. Some people are good at this level of modeling, and some never learn.
And there is no other way to learn but to try and fail.

Consume message only once from Topic per listeners running in cluster

I'm implementing an Domain Event infrastructure, but the project doesn't allow any messaging infra(financial services client) so found an alternative in Hazelcast Topics and ExecutorService,
But the problem is when running in cluster the message shall be delivered to listeners which is going to be running in cluster, so for a cluster of 2, we have same listener running in 2 jvm, and message consume twice and acted upon, suppose the Domain event is supposed to perform some non idempotent operation like credit some loyalty points, unless I explicitly maintain a trace of domain event acted upon and check against that everytime I receive an event, I will end up crediting it twice, "any suggestion implementing this without having to write those boiler plates possibly at the infralayer", or is there a known patter for such implementation.
Edit: Meanwhile I'm also evaluating Hazelcast ExecutorService as suggested Here

The use case you described can be solved by using Hazelcast's Queues instead of Topics. The main reason to use topics is if you are interested that multiple (possibly independent) consumers get the same message. Your requirement sounds like you are interested that only one of the consumers gets the message, and that's what queues are for, see the Hazelcast documentation for Queues.

Local message queue for sharing data between two processes

I have a server application A that produces records as requests arrive. I want these records to be persisted in a database. However, I don't want to let application A threads spend time persisting the records by communicating directly with the database. Therefore, I thought about using a simple producers-consumers architecture where application A threads produce records and, another application B threads are the consumers that persist the records to the database.
I'm looking for the "best" way to share these records between applications A and B. An important requirement is that application A threads will always be able to send records to the IPC system (e.g. queue but that may be some other solution). Therefore, I think the records must always be stored locally so that application A threads will be able to send record event if network is down.
The initial idea that came to my mind was to use a local message queue (e.g. ActiveMQ). Do you think a local message queue is appropriate? If yes, do you recommend a specific message queue implementation? Note that both applications are written in Java.
Thanks, Mickael

For this type of needs Queueing solution seems to be the best fit as the producer and consumer of the events can work in isolation. There are many solutions out there, and I have personally worked with RabbitMQ and ActiveMQ. Both are equally good. I don't wish to compare their performance characteristics here but RabbitMQ is written in Erlang which a language tailer-made for building real time applications.
Since you're already on Java platform ActiveMQ might be a better option and is capable producing high throughput. With a solution like this, the consumer does not have to be online all the time. Based on how critical your events data are, you may also want to have persistent queues and messages so that in the event of a message broker failure, you can still recover important "event" messages your application A produced.
If there are many applications producing events and later if you wish to scale out(or horizontally scale) the broker service because it's getting a bottleneck, both of the above solutions provide clustering services.
Last but not least, if you want to share these events between different platforms you may wish to share messages in AMQP format, which is a platform-independent wire-level protocol to share messages between heterogenous systems, and I'm not sure if this is requirement for you. RabbitMQ and ActiveMQ both support AMQP. Both of these solutions also support MQTT which is a lightweight messaging protocol but it seems that you don't wish to use MQTT.
There are other products such as HornetQ and Apache Qpid which are also production ready solutions but I have not used them personally.
I think queueing solution is a the best approach in terms of maintainability, loose coupling nature of participating applications and performance.

Java internal message queue /JMS

I have a web application i am rewriting that currently performs a large amount of audit data sql writes. Every step of user interaction results in a method being executed that writes some information to a database.
This has the potential to impact users by causing the interaction to stop due to database problems.
Ideally I want to move this is a message based approach where if data needs to be written it is fired off too a queue, where a consumer picks these up and writes them to the database. It is not essential data, and loss is acceptable if the server goes down.
I'm just a little confused if I should try and use an embedded JMS queue and broker, or a Java queue. Or something I'm not familiar with (suggestions?)
What is the best approach?
More info:
The app uses spring and is running on websphere 6. All message communication is local, it will not talk to another server.

I think logging with JMS is overkill, and especially if loggin is the only reason for using JMS.
Have a look at DBAppender, you can log directly to the database. If performance is your concern you can log asynchronously using Logback.
If you still want to go JMS way then Logback has JMS Queue & Topic appenders

A plain queue will suffice based on your description of the problem. You can have a fixed size queue and discard messages if it fills too quickly since you say they are not critical.
Things to consider:
Is this functionality required by other apps too, now or in the
future.
Are the rate of producing messages so huge that it can start
consuming lot of heap memory when large number of users are logged
in. Important if messages should not be lost.

I'm not sure if that is best practice inside a Java EE container however.
Since you already run on a WebSphere machine, you do have a JMS broker going (SIBus). The easiest way to kick off asynchronous things are to send JMS messages and have a MDB reading them off - and doing database insertions. You might have some issues spawning own threads in WebSphere can still utilise the initial context for JNDI resources.
In a non Java EE case, I would have used a something like a plain LinkedBlockingQueue or any blocking queue, and just have a thread polling that queue for new messages to insert into a database.

I would uses JMS queue only if there are different servers involved. So in your case I would do it in simple plain pure java with some Java queue.

what is JMS good for? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
I'm looking for (simple) examples of problems for which JMS is a good solution, and also reasons why JMS is a good solution in these cases. In the past I've simply used the database as a means of passing messages from A to B when the message cannot necessarily be processed by B immediately.
A hypothetical example of such a system is where all newly registered users should be sent a welcome e-mail within 24 hours of registration. For the sake of argument, assume the DB does not record the time when each user registered, but instead a reference (foreign key) to each new user is stored in the pending_email table. The e-mail sender job runs once every 24 hours, sends an e-mail to all the users in this table, then deletes all the pending_email records.
This seems like the kind of problem for which JMS should be used, but it's not clear to me what benefit JMS would have over the approach I've described. One advantage of the DB approach is that the messages are persistent. I understand that JMS message queues can also be persisted, but in that case there seems to be little difference between JMS and the "database as message queue" approach I've described?
What am I missing?
- Don

JMS and messaging is really about 2 totally different things.
publish and subscribe (sending a message to as many consumers as are interested - a bit like sending an email to a mailing list, the sender does not need to know who is subscribed
high performance reliable load balancing (message queues)
See more info on how a queue compares to a topic
The case you are talking about is the second case, where yes you can use a database table to kinda simulate a message queue.
The main difference is a JMS message queue is a high performance highly concurrent load balancer designed for huge throughput; you can send usually tens of thousands of messages per second to many concurrent consumers in many processes and threads. The reason for this is that a message queue is basically highly asynchronous - a good JMS provider will stream messages ahead of time to each consumer so that there are thousands of messages available to be processed in RAM as soon as a consumer is available. This leads to massive throughtput and very low latency.
e.g. imagine writing a web load balancer using a database table :)
When using a database table, typically one thread tends to lock the whole table so you tend to get very low throughput when trying to implement a high performance load balancer.
But like most middleware it all depends on what you need; if you've a low throughput system with only a few messages per second - feel free to use a database table as a queue. But if you need low latency and high throughput - then JMS queues are highly recommended.

In my opinion JMS and other message-based systems are intended to solve problems that need:
Asynchronous communications : An application need to notify another that an event has occurred with no need to wait for a response.
Reliability. Ensure once-and-only-once message delivery. With your DB approach you have to "reinvent the wheel", specially if you have several clients reading the messages.
Loose coupling. Not all systems can communicate using a database. So JMS is pretty good to be used in heterogeneous environments with decoupled systems that can communicate over system boundaries.

The JMS implementation is "push", in the sense that you don't have to poll the queue to discover new messages, but you register a callback that gets called as soon as a new message arrives.

to address the original comment. what was originally described is the gist of (point-to-point) JMS. the benefits of JMS are, however:
you don't need to write the code yourself (and possibly screw up the logic so that it's not quite as persistent as you think it is). also, third-party impl might be more scalable than simple database approach.
jms handles publish/subscribe, which is a bit more complicated that the point-to-point example you gave
you are not tied to a specific implementation, and can swap it out if your needs change in the future, w/out messing w/ your java code.

One advantage of JMS is to enable asynchronous processing which can by done by database solution as well. However following are some other benefit of JMS over database solution
a) The consumer of the message can be in a remote location. Exposing database for remote access is dangerous. You can workaround this by providing additional service for reading messages from database, that requires more effort.
b) In the case of database the message consumer has to poll the database for messages where as JMS provides callback when a message is arrived (as sk mentioned)
c) Load balancing - if there are lot of messages coming it is easy to have pool of message processors in JMS.
d) In general implementation via JMS will be simpler and take less effort than database route

JMS is an API used to transfer messages between two or more clients. It's specs are defined under JSR 914.
The major advantage of JMS is the decoupled nature of communicating entities - Sender need not have information about the receivers. Other advantages include the ability to integrate heterogeneous platforms, reduce system bottlenecks, increase scalability, and respond more quickly to change.
JMS are just kind of interfaces/APIs and the concrete classes must be implemented. These are already implemented by various organizations/Providers. they are called JMS providers. Example is WebSphere by IBM or FioranoMQ by Fiorano Softwares or ActiveMQ by Apache, HornetQ, OpenMQ etc. .Other terminologies used are Admin Objects(Topics,Queues,ConnectionFactories),JMS producer/Publisher, JMS client and the message itself.
So coming to your question - what is JMS good for?
I would like to give a practical example to illustrate it's importance.
Day Trading
There is this feature called LVC(Last value cache)
In Trading share prices are published by a publisher at regular intervals. Each share has an associated Topic to which it is published to. Now if you know what a Topic is then you must know messages are not saved like queues. Messages are published to the subscribers alive at the time the message was published(Exception being Durables subscribers which get all the messages published from the time it was created but then again we don't want to get too old stock prices which discard the possibility of using it). So if a client want to know a stock price he create a subscriber and then he has to wait till next stock price is published(which is again not what we want). This is where LVC comes into picture. Each LVC message has an associated key. If a messages is sent with a LVC key(for a particular stock) and then another update message with same key them the later overrides the previous one. When ever a subscriber subscribes to a topic(which has LVC enabled) the subscriber will get all the messages with distinct LVC keys. If we keep a distinct key per listed company then when client subscribes to it it will get the latest stock price and eventually all the updates.
Ofcourse this is one of the factors other that reliability,security etc which makes JMS so powerful.

Guido has the full definition. From my experience all of these are important for a good fit.
One of the uses I've seen is for order distribution in warehouses. Imagine an office supply company that has a fair number of warehouses that supply large offices with office supplies. Those orders would come into a central location and then be batched up for the correct warehouse to distribute. The warehouses don't have or want high speed connections in most cases so the orders are pushed down to them over dialup modems and this is where asynchronous comes in. The phone lines are not really that important either so half the orders may get in and this is where reliability is important.

The key advantage is decoupling unrelated systems rather than have them share comon databases or building custom services to pass data around.
Banks are a keen example, with intraday messaging being used to pass around live data changes as they happen. It's very easy for the source system to throw a message "over the wall"; the downside is there's very little in the way of contract between these systems, and you normally see hospitalisation being implemented on the consumer's side. It's almost too loosly coupled.
Other advantages are down to the support for JMS out of the box for many application servers, etc. and all the tools around that: durability, monitoring, reporting and throttling.

There's a nice write-up with some examples here: http://www.winslam.com/laramee/jms/index.html

The 'database as message queue' solution may be heavy for the task. The JMS solution is less tightly coupled in that the message sender does not need to know anything about the recipient. This could be accomplished with some additional abstraction in the 'database as message queue' as well so it is not a huge win...Also, you can use the queue in a 'publish and subscribe' way which can be handy depending on what you are trying to accomplish. It is also a nice way to further decouple your components. If all of your communication is within one system and/or having a log that is immediately available to an application is very important, your method seems good. If you are communicating between separate systems JMS is a good choice.

JMS in combination with JTA (Java Transaction API) and JPA (Java persistence API) can be very useful. With a simple annotation you can put several database actions + message sending/receiving in the same transaction. So if one of them fails everything gets rolled back using the same transaction mechanism.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.