Preventing Mule servers from reprocessing same information from a database - java

I am working on a Mule application which reads a series of database records, generates reports, and posts them to a number of HTTP locations. Unfortunately, the servers are not clustered, so it is possible that more than one server could read the same records and post them multiple times, which is undesirable. Could someone suggest the simplest way to prevent all three Mule servers from reading the database, generating the reports and sending them off?

Short answer - use a cluster.
Long answer - there is no magic in this world. If you don't use a cluster, which coordinates this for you, then you have to do it yourself. Since the servers are not in a cluster, they have to communicate somehow to prevent duplication. A cluster is the best answer because it is designed for exactly this; without one, you do it "manually".
There are many ways to do that. The main point is that there should be a single place responsible for coordination (may I say a cluster? :). The best candidate, IMHO, is the database - it is the one place common to all these servers. The simplest approach is to mark processed records and process only unprocessed ones. How you do this - an extra table or an extra field - is up to you.
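A minimal sketch of that marking approach, assuming a plain JDBC DataSource and hypothetical table/column names (reports, processed, processed_by - none of these come from the question): the atomic UPDATE acts as the "claim", so whichever server updates the row first is the only one that generates and posts the report.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

public class RecordClaimer {

    private final DataSource dataSource;
    private final String serverId;   // e.g. the host name of this Mule server

    public RecordClaimer(DataSource dataSource, String serverId) {
        this.dataSource = dataSource;
        this.serverId = serverId;
    }

    /**
     * Atomically marks a record as processed. Only the server whose UPDATE
     * actually changes the row (returns 1) goes on to generate and post the
     * report; the others see 0 rows updated and skip it.
     */
    public boolean claim(long recordId) throws SQLException {
        String sql = "UPDATE reports SET processed = true, processed_by = ? "
                   + "WHERE id = ? AND processed = false";
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, serverId);
            ps.setLong(2, recordId);
            return ps.executeUpdate() == 1;
        }
    }
}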

Related

How to write Kafka consumers - single threaded vs multi threaded

I have written a single Kafka consumer (using Spring Kafka), that reads from a single topic and is a part of a consumer group. Once a message is consumed, it will perform all downstream operations and move on to the next message offset. I have packaged this as a WAR file and my deployment pipeline pushes this out to a single instance. Using my deployment pipeline, I could potentially deploy this artifact to multiple instances in my deployment pool.
However, I am not able to understand the following when I want multiple consumers as part of my infrastructure:
1. I can define multiple instances in my deployment pool and have this WAR running on all of them. That would mean they all listen to the same topic, are part of the same consumer group, and divide the partitions among themselves. The downstream logic works as is. This works perfectly fine for my use case; however, I am not sure if it is the optimal approach to follow.
2. Reading online, I came across resources here and here, where people define a single consumer thread but internally create multiple worker threads. There are also examples where we could define multiple consumer threads that do the downstream logic. Thinking about these approaches and mapping them to deployment environments, we could achieve the same result (as my theoretical solution above could), but with fewer machines.
Personally, I think my solution is simple and scalable but might not be optimal, while the second approach might be optimal. I wanted to hear your experiences, suggestions, or any other metrics/constraints I should consider. Also, I am thinking that with my theoretical solution I could employ bare-bones, simple machines as Kafka consumers.
While I know I haven't posted any code, please let me know if I need to move this question to another forum. If you need specific code examples, I can provide them too, but I didn't think they were important in the context of my question.
Your existing solution is best. Handing off to another thread will cause problems with offset management. Spring Kafka allows you to run multiple threads in each instance, as long as you have enough partitions.
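For reference, here is a hedged sketch of that built-in concurrency using Spring Kafka's ConcurrentKafkaListenerContainerFactory; the bean wiring and the concurrency value of 3 are assumptions for illustration, not taken from the question.

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;

@Configuration
public class KafkaConsumerConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        // Each thread becomes its own consumer in the group; this only helps if
        // the topic has at least this many partitions.
        factory.setConcurrency(3);
        return factory;
    }
}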
If your current approach works, just stick to it. It's the simple and elegant way to go.
You would only go to approach 2 in case you cannot, for some reason, increase the number of partitions but need a higher level of parallelism. But then you have ordering and race conditions to worry about. If you ever need to go that route, I'd recommend the akka-stream-kafka library, which provides facilities to handle offset commits correctly, do what you need in parallel, and then merge back into a single stream preserving the original ordering, etc. Otherwise, these things are error-prone to do yourself.

Distributed Metrics

I have been working on a single-box application which uses Codahale Metrics heavily for instrumentation. We are now moving to the cloud, and I have the questions below on how to monitor metrics when the application is distributed.
1. Is there a metrics reporter that can write metrics data to Cassandra?
2. When and how does the aggregation happen if there are records per server in the database?
3. Can I define the time interval at which the metrics data gets saved into the database?
4. Are there any built-in frameworks available to achieve this?
Thanks a bunch and appreciate all your help.
I am answering your questions first, but I think you are misunderstanding how to use Metrics.
1. You can google this fairly easily. I don't know of any (I also don't understand what you would do with it in Cassandra). You would normally use something like Graphite for that. In any case, a reporter implementation is very straightforward.
2. That question does not make much sense. Why would you aggregate over two different servers - they are independent. Each of your monitored instances should be standalone. Aggregation happens on the receiving side (e.g. Graphite).
3. You can - see 1. Write a reporter and configure it accordingly.
4. Not that I know of.
Now to metrics in general:
I think you have the wrong idea. You can monitor X servers, that is not a problem at all, but you should not aggregate on the client side (or database side). How would that even work? Restarts zero the clients, and essentially that means you would need to track the state of each of your servers just so the aggregation works. How do you manage outages?
The way you should monitor your servers with metrics:
Create a namespace:
io.my.server.{hostname}.my.metric
Now you have X different namespaces, but they all share a common prefix. That means you have grouped them.
Send them to your preferred monitoring solution.
There are heaps out there. I do not understand why you want this to be Cassandra - what kind of advantage do you gain from that? http://graphite.wikidot.com/ for example is a graphing solution. Your applications can automatically submit data there (Graphite comes with a Java reporter that you can use). See http://graphite.wikidot.com/screen-shots for how it looks.
The main point is that Graphite (and all or most providers) knows how to handle your namespaces. Also look at Zabbix, for example, which can do the same thing.
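To make the namespace-per-host idea above concrete, here is a minimal sketch using the Codahale/Dropwizard GraphiteReporter; the Graphite host graphite.example.com, port 2003, and the one-minute interval are placeholder assumptions. Note that the interval passed to start() is also what question 3 asks about.

import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.graphite.Graphite;
import com.codahale.metrics.graphite.GraphiteReporter;

public class MetricsSetup {

    public static GraphiteReporter startReporter(MetricRegistry registry) throws Exception {
        // Prefix every metric with this host's name, so each server reports
        // under its own namespace: io.my.server.{hostname}.…
        String host = InetAddress.getLocalHost().getHostName();
        Graphite graphite = new Graphite(new InetSocketAddress("graphite.example.com", 2003));

        GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                .prefixedWith("io.my.server." + host)
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build(graphite);

        // Report once per minute - this is the configurable save interval.
        reporter.start(1, TimeUnit.MINUTES);
        return reporter;
    }
}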
Aggregations
Now the aggregation happens on the receiving side. Your provider knows how to do that, and you can define rules.
For example, you could wildcard alerts like:
io.my.server.{hostname}.my.metric.count > X
Graphite (I believe) even supports operations, e.g.:
sum(io.my.server.{hostname}.my.metric.request) - which would sum up ALL your hosts' requests
That is where the aggregation happens. At that point, your servers are again standalone (as they should be) and have no dependency on each other or on any monitoring database. They simply report their own metrics (which is what they should do), and you - as the consumer of those metrics - are responsible for defining the right alerts/aggregations/formulas on the receiving end.
Aggregating this on the server side would involve:
Discovering all other servers
Monitoring their state
Receiving/sending metrics back and forth
Synchronising what they report, etc.
That just sounds like a maintenance nightmare :) I hope that gives you some insight/ideas.
(Disclaimer: neither a Metrics dev nor a Graphite dev - this is just how I did it in the past / the approach I still use)
Edit:
With your comment in mind, here are my two favourite solutions for what you want to achieve:
DB
You can use the DB and store dates, e.g. for the start message and the end message.
This is not really a metrics thing, so it may not be preferred. As per your question, you could write your own reporter for that, but it would get complicated with regard to upserts/updates etc. I think option 2 is easier and has more potential.
Logs
This is, I think, what you need. Your servers independently log Start/Stop/Pause events etc. - whatever it is you want to report on. You then set up Logstash and collect those logs.
Logstash allows you to track these events over time and create metrics from them; see:
https://www.elastic.co/guide/en/logstash/current/plugins-filters-metrics.html
Or:
https://github.com/logstash-plugins/logstash-filter-elapsed
The first one uses actual metrics. The second one is a different plugin that simply measures the time between start/stop events.
This is the option with the most potential because it does not rely on any particular format, data store, or anything else. You even get Kibana for plotting out of the box if you use the entire ELK stack.
Say you wanted to measure your messages. You can just look at the logs; there are no application changes involved. The solution does not even touch your application (storing your reporting data manually takes up threads and processing in your application, so if you need to stay real-time this will drag your overall performance down) - it is a completely standalone solution. Later on, when you want to measure other metrics, you can easily add to your Logstash configuration and start collecting them.
I hope this helps

multiple java spring app instances accessing the same DB resources

In my database, I have many records in a certain table that need to be processed from time to time by my Java Spring app.
There is a boolean flag on each row of that table saying whether a given record is currently being processed.
What I'm looking at is having my java spring app deployed multiple times on different servers, all accessing the same shared DB, the same app duplicated with some load balancer, etc.
But only one java app instance at a time can process a given DB record of that particular table.
What are the different approaches to enforce that constraint?
I can think of a single shared queue that would dispatch those processing tasks to the different running Java instances, making sure that the same DB record is not processed simultaneously by two different instances. But that sounds quite complicated for what it is. Maybe there is something simpler? Anything else? Thanks in advance.
You can use locking strategies to enforce exclusive access to particular records in your table. There are two different approaches that can be applied to meet this requirement: optimistic locking or pessimistic locking. Take a look at the Hibernate docs.
Additionally, there is another issue you should think about: with the current approach, if a server crashed while it was processing a certain record and never completed it, that record would stay in an "incomplete" state and would not be picked up by others. One possible solution that comes to mind is to store the 'node id' of the server that took responsibility for the processing instead of a plain state flag.
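To make the locking part concrete, here is a minimal, hedged JPA sketch; the entity and field names (Task, processing, processingNode) are made up for illustration. The @Version field gives you optimistic locking, while find() with PESSIMISTIC_WRITE shows the pessimistic variant for claiming a row.

import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.Id;
import javax.persistence.LockModeType;
import javax.persistence.Version;

@Entity
class Task {
    @Id
    Long id;

    boolean processing;

    String processingNode;   // "node id" of the server that claimed the row

    @Version
    Long version;            // optimistic locking: stale updates fail with OptimisticLockException
}

class TaskClaimer {

    // Pessimistic variant: the row is locked (SELECT ... FOR UPDATE) so only one
    // instance can flip the flag; others block until the transaction ends.
    boolean claim(EntityManager em, Long id, String nodeId) {
        Task task = em.find(Task.class, id, LockModeType.PESSIMISTIC_WRITE);
        if (task == null || task.processing) {
            return false;            // already taken by another instance
        }
        task.processing = true;
        task.processingNode = nodeId;
        return true;                 // flushed/committed by the surrounding transaction
    }
}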

Migrating A Java Application to Hadoop : Architecture/Design Roadblocks?

Alright, so here's the situation:
I am responsible for architecting the migration of a Java-based ETL (EAI, rather) software.
I'll have to migrate this to Hadoop (the Apache version). Technically this is more like a reboot than a migration, because I have no database to migrate. This is about leveraging Hadoop so that the Transformation phase (of 'ETL') is parallelized. This would make my ETL software:
Faster - with the transformation parallelized.
Scalable - handling more data / big data is just a matter of adding more nodes.
Reliable - Hadoop's redundancy and reliability will add to my product's features.
I've tested this configuration out - changed my transformation algorithms into a MapReduce model, tested it on a high-end Hadoop cluster, and benchmarked the performance. Now I'm trying to understand and document all the things that could stand in the way of this application redesign/re-architecture/migration. Here are a few I could think of:
The other two phases, Extraction and Load: my ETL tool can handle a variety of data sources. So do I redesign my data adapters to read data from these data sources, load it into HDFS, transform it, and then load it into the target data source? Could this step become a huge bottleneck for the entire architecture?
Feedback: say my transformation fails on a record - how do I let the end user know that the ETL hit an error on a particular record? In short, how do I keep track of what is actually going on at the application level with all the maps/reduces/merges and sorts happening? The default Hadoop web interface is not for the end user - it's for admins. So should I build a new web app that scrapes from the Hadoop web interface? (I know this is not recommended.)
Security: how do I handle authorization at the Hadoop level? Who can run jobs and who cannot - how do I support ACLs?
I look forward to hearing from you with possible answers to the above questions, and with any further questions/facts I should consider based on your experience with Hadoop / problem analysis.
As always, I appreciate your help and thank you all in advance.
I do not expect loading into HDFS to be a bottleneck, since the load is distributed among the datanodes - so the network interface will be the only bottleneck. Loading data back into the database might be a bottleneck, but I think it is no worse than it is now. I would design jobs to have their input and output sit in HDFS, and then run some kind of bulk load of the results into the database.
Feedback is a problematic point, since MapReduce really has only one result - the transformed data. Other tricks, like writing failed records into HDFS files, lack the "functional" reliability of MR because they are side effects. One way to mitigate this problem is to design your software to be ready for duplicate failed records. There is also Sqoop - a tool specifically for migrating data between SQL databases and Hadoop: http://www.cloudera.com/downloads/sqoop/
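One lightweight way to surface per-record failures (only a sketch - the class and counter names TransformMapper, "ETL" and "FAILED_RECORDS", and the transform() placeholder are assumptions, not from the post) is to count them with Hadoop job counters inside the mapper and read the counters from the Job after completion, rather than relying on the admin-oriented web UI:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TransformMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            // Apply the (assumed) transformation logic to one input record.
            String transformed = transform(value.toString());
            context.write(new Text(Long.toString(key.get())), new Text(transformed));
        } catch (Exception e) {
            // Count the failure; after the job finishes, the driver can read this
            // counter from the Job object and report it to the end user.
            context.getCounter("ETL", "FAILED_RECORDS").increment(1);
        }
    }

    private String transform(String record) {
        // Placeholder for the real transformation.
        return record.toUpperCase();
    }
}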
At the same time, I would consider using Hive - if your SQL transformations are not that complicated, it might be practical to create CSV files and do the initial pre-aggregation with Hive, thereby reducing data volumes before going to the (perhaps single-node) database.

Monitoring Changes of a Bean to build deltas?

I have several beans in my application which get updated regularly by the usual setter methods. I want to synchronize these beans with a remote application which has the same bean classes. In my case, bandwidth matters, so I have to keep the amount of transferred bytes as low as possible. My idea was to create deltas of the state changes and transfer them instead of the whole objects. Currently, I intend to write the protocol for transferring those changes myself, but I'm not bound to that and would prefer an existing solution.
Is there already a solution for this problem out there? And if not, how could I easily monitor those state changes in a generalized way? AOP?
Edit: This problem is not caching-related, even if it may seem so at first. The data must be replicated from a central server to several clients (about 4 to 10) over the internet. The client is a standalone desktop application.
This sounds remarkably similar to JBossCache running in POJO mode.
This is a distributed, delta-based cache that breaks down Java objects into a tree structure and only transmits changes to the bits of the tree that change.
Should be a perfect fit for you.
I like your idea of creating deltas and sending them.
A simple Map could hold the delta for one object; serializing it would give you an effective message to send.
To reduce the number of messages, which would otherwise kill your performance, you should group the deltas for all objects and send them as a whole. So you could have other collections or maps to contain them.
To monitor all changes across many beans, AOP seems like a good solution.
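As a minimal, non-AOP sketch of the Map-based delta idea (class names like MonitoredBean and DeltaCollector are purely illustrative), the standard java.beans property-change support can record each setter change into a map that is later serialized and sent:

import java.beans.PropertyChangeEvent;
import java.beans.PropertyChangeListener;
import java.beans.PropertyChangeSupport;
import java.util.HashMap;
import java.util.Map;

public class MonitoredBean {

    private final PropertyChangeSupport changes = new PropertyChangeSupport(this);
    private String name;

    public void addPropertyChangeListener(PropertyChangeListener l) {
        changes.addPropertyChangeListener(l);
    }

    public void setName(String name) {
        String old = this.name;
        this.name = name;
        changes.firePropertyChange("name", old, name);   // announce the change
    }
}

class DeltaCollector implements PropertyChangeListener {

    // property name -> latest value; serialize and send this instead of the bean
    private final Map<String, Object> delta = new HashMap<>();

    @Override
    public synchronized void propertyChange(PropertyChangeEvent evt) {
        delta.put(evt.getPropertyName(), evt.getNewValue());
    }

    /** Returns the accumulated delta and starts a fresh one. */
    public synchronized Map<String, Object> drain() {
        Map<String, Object> copy = new HashMap<>(delta);
        delta.clear();
        return copy;
    }
}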
EDIT : see Skaffmann's answer.
Using an existing cache technology could be better.
Many problems could already have solutions implemented...
