I have problem with counting responses from response queue. I mean, once per day we run a job which gather some data from db and send them to queue. When we receive all responses we should shutdown connection. The problem is how we can check if all responses arrived ? Keeping this in global variable is risky because of concurrence issue. Any idea ? I am quite new in JMS so maybe solution is obvious but I dont see it.
I don't know what your stack is or whatever tools you might be using to accomplish this but I've got this in mind and this might help you out (hopefully).
Generate a hash for each job you plan on queuing and store it in a concurrent list/map. (i.e: ConcurrentHashMap)
Send the job to the queue.
Once the job is done and sends back a response, reproduce the hash and store it a separate concurrent list/map that holds all the jobs that are done.
Now that you have two lists of all the jobs supposed to be executed and the jobs that you got a response from. There multiple ways to accomplish this. If you lookup Java Concurrency, you'd find plenty of tutorials and documentation. I like to use CyclicBarrierandCountDownLatch`. If plan on using any of these methods, take extra precautions to prevent your application from hanging or worse, a filthy memory leak.
OR, you could simply check on how many queuing requests and responses you've and if they are equal to each other, drop the connection.
Related
I need a solution for the following scenario which is similar to a queue:
I want to write messages to a queue continuously. My message is very big, containing a lot of data so I do want to make as few requests as possible.
So my queue will contain a lot of messages at some point.
My Consumer will read from the queue every 1 hour. (not whenever a new message is written) and it will read all the messages from the queue.
The problem is that I need a way to read ALL the messages from the queue using only one call (I also want the consumer to make as few requests to the queue as possible).
A close solution would be ActiveMQ but the problem is that you can only read one message at a time and I need to read them all in one request.
So my question is.. Would there be other ways of doing this more efficiently? The actual thing that I need is to persist in some way messages created continuously by some application and then consume them (also delete them) by the same application all at once, every 1 hour.
The reason I thought a queue would be fit is because as the messages are consumed they are also deleted but I need to consume them all at once.
I think there's some important things to keep in mind as you're searching for a solution:
In what way do you need to be "more efficient" (e.g. time, monetary cost, computing resources, etc.)?
It's incredibly hard to prove that there are, in fact, no other "more efficient" ways to solve a particular problem, as that would require one to test all possible solutions. What you really need to know is, given your specific use-case, what solution is good enough. This, of course, requires knowing specifically what kind of performance numbers you need and the constraints on acquiring those numbers (e.g. time, monetary cost, computing resources, etc.).
Modern message broker clients (e.g. those shipped with either ActiveMQ 5.x or ActiveMQ Artemis) don't make a network round-trip for every message they consume as that would be extremely inefficient. Rather, they fetch blocks of messages in configurable sizes (e.g. prefetchSize for ActiveMQ 5.x, and consumerWindowSize for ActiveMQ Artemis). Those messages are stored locally in a buffer of sorts and fed to the client application when the relevant API calls are made to receive a message.
Making "as few requests as possible" is rarely a way to increase performance. Modern message brokers scale well with concurrent consumers. Consuming all the messages with a single consumer drastically limits the message throughput as compared to spinning up multiple threads which each have their own consumer. Rather than limiting the number of consumer requests you should almost certainly be maximizing them until you reach a point of diminishing returns.
I am researching if it is possible to have multiple threads output to elasticsearch concurrently using the transport client and bulk upload apis. Specifically, I want to have multiple transport clients or bulk upload api instances run on their own threads and handle input to elasticsearch. My specific reason for wanting to do this is so I can create a load balancing algorithm to handle a very large number of json messages efficiently. I have been googling for some time and can't find any documentation on this type of thing, or anyone else asking similar questions. Additionally, I am new to elasticsearch. Does anyone have any insight on this, some literature they could share, or a good place to start? Thanks.
An idea on how you can achieve this is to have a static class that acts as a wrapper for an elastic Client object. You can then spawn several threads in whatever code you are executing using the ExecutorService. The ExecutorService includes many utility methods, detailed in the link, that might help you manage your processing. These threads would then call into the static class to get the client object when doing processing, prepare their bulk requests, and then send them.
If you are lazy, you can just have loops that execute indefinitely and have sleep calls to help prevent overloading.
A few caveats to watch out for:
1) Be very mindful of Elasticsearch's Thread pool and queue sizes. Do not submit data to ES faster than your hardware can handle. If you are submitting data to ES too fast such that you are overloading the queue, bulk requests will be aborted. Do not increase the bulk queue size unless you need to and know your hardware can keep up and prevent overload. Increasing the queue size if you are running into roadblocks will only delay the inevitable. If you are overloading the bulk, include a way to throttle requests in your code.
2) Partition up your bulk requests by type/index. I am not 100% sure how ES handles bulk requests under the hood, but I have noticed some inconsistent behavior in the queue size when shoving tons requests to different indexes in one bulk request. It would make sense that Elasticsearch partitions up the requests to prevent tons of useless seqs and optimize shard/node traversal, but I have noticed that the queue size goes up much quicker if you mix.
Just start scaling APNS provider program unfortunately I am really new to networking protocol implementation.
The provider now only runs on one thread and it's just handling a tiny amount of notifications. Now I want to increase its capability to send significantly more than before.
My questions are:
According to Apple doc I can maintain multiple connections to gateways. So my understanding is that I run multithreads in the provider program and maintain a separate connection in each. Is this right?
It first one is right the real difficulty for me comes: my program polls a queue database every 5 seconds to check new message that's to be sent. I do not think it's a good idea for all the threads to poll this same database because there should be duplicate message same to users. How to solve this problem?
I have seen the connections pooling but I do not really understand what that is. Is that the thing I need to study and use? If it is can someone offer an brief explanation regarding what it is and how to use it?
Thanks guys!
Your first assumption is reasonable. Each thread should have its own connection.
As for the second point, the access to the DB that contains the new messages should be synchronized. For example, you can access that DB by a synchronized method that fetches a message or several messages that haven't been processed yet, and marks them as being processed. Two threads can't access that method at the same time, and therefore won't get the same messages.
Another option is to put the messages in memory in a blocking quoue (with the DB only serving for backup in case of a crash). The threads can request an item from the queue, which would block them until an item is available.
I have a problem which I believe is the classic master/worker pattern, and I'm seeking advice on implementation. Here's what I currently am thinking about the problem:
There's a global "queue" of some sort, and it is a central place where "the work to be done" is kept. Presumably this queue will be managed by a kind of "master" object. Threads will be spawned to go find work to do, and when they find work to do, they'll tell the master thing (whatever that is) to "add this to the queue of work to be done".
The master, perhaps on an interval, will spawn other threads that actually perform the work to be done. Once a thread completes its work, I'd like it to notify the master that the work is finished. Then, the master can remove this work from the queue.
I've done a fair amount of thread programming in Java in the past, but it's all been prior to JDK 1.5 and consequently I am not familiar with the appropriate new APIs for handling this case. I understand that JDK7 will have fork-join, and that that might be a solution for me, but I am not able to use an early-access product in this project.
The problems, as I see them, are:
1) how to have the "threads doing the work" communicate back to the master telling them that their work is complete and that the master can now remove the work from the queue
2) how to efficiently have the master guarantee that work is only ever scheduled once. For example, let's say this queue has a million items, and it wants to tell a worker to "go do these 100 things". What's the most efficient way of guaranteeing that when it schedules work to the next worker, it gets "the next 100 things" and not "the 100 things I've already scheduled"?
3) choosing an appropriate data structure for the queue. My thinking here is that the "threads finding work to do" could potentially find the same work to do more than once, and they'd send a message to the master saying "here's work", and the master would realize that the work has already been scheduled and consequently should ignore the message. I want to ensure that I choose the right data structure such that this computation is as cheap as possible.
Traditionally, I would have done this in a database, in sort of a finite-state-machine manner, working "tasks" through from start to complete. However, in this problem, I don't want to use a database because of the high volume and volatility of the queue. In addition, I'd like to keep this as light-weight as possible. I don't want to use any app server if that can be avoided.
It is quite likely that this problem I'm describing is a common problem with a well-known name and accepted set of solutions, but I, with my lowly non-CS degree, do not know what this is called (i.e. please be gentle).
Thanks for any and all pointers.
As far as I understand your requirements, you need ExecutorService. ExecutorService have
submit(Callable task)
method which return value is Future. Future is a blocking way to communicate back from worker to master. You could easily expand this mechanism to work is asynchronous manner. And yes, ExecutorService also maintaining work queue like ThreadPoolExecutor. So you don't need to bother about scheduling, in most cases. java.util.concurrent package already have efficient implementations of thread safe queue (ConcurrentLinked queue - nonblocking, and LinkedBlockedQueue - blocking).
Check out java.util.concurrent in the Java library.
Depending on your application it might be as simple as cobbling together some blocking queue and a ThreadPoolExecutor.
Also, the book Java Concurrency in Practice by Brian Goetz might be helpful.
First, why do you want to hold the items after a worker started doing them? Normally, you would have a queue of work and a worker takes items out of this queue. This would also solve the "how can I prevent workers from getting the same item"-problem.
To your questions:
1) how to have the "threads doing the
work" communicate back to the master
telling them that their work is
complete and that the master can now
remove the work from the queue
The master could listen to the workers using the listener/observer pattern
2) how to efficiently have the master
guarantee that work is only ever
scheduled once. For example, let's say
this queue has a million items, and it
wants to tell a worker to "go do these
100 things". What's the most efficient
way of guaranteeing that when it
schedules work to the next worker, it
gets "the next 100 things" and not
"the 100 things I've already
scheduled"?
See above. I would let the workers pull the items out of the queue.
3) choosing an appropriate data
structure for the queue. My thinking
here is that the "threads finding work
to do" could potentially find the same
work to do more than once, and they'd
send a message to the master saying
"here's work", and the master would
realize that the work has already been
scheduled and consequently should
ignore the message. I want to ensure
that I choose the right data structure
such that this computation is as cheap
as possible.
There are Implementations of a blocking queue since Java 5
Don't forget Jini and Javaspaces. What you're describing sounds very like the classic producer/consumer pattern that space-based architectures excel at.
A producer will write the jobs into the space. 1 or more consumers will take out jobs (under a transaction) and work on that in parallel, and then write the results back. Since it's under a transaction, if a problem occurs the job is made available again for another consumer .
You can scale this trivially by adding more consumers. This works especially well when the consumers are separate VMs and you scale across the network.
If you are open to the idea of Spring, then check out their Spring Integration project. It gives you all the queue/thread-pool boilerplate out of the box and leaves you to focus on the business logic. Configuration is kept to a minimum using #annotations.
btw, the Goetz is very good.
This doesn't sound like a master-worker problem, but a specialized client above a threadpool. Given that you have a lot of scavenging threads and not a lot of processing units, it may be worthwhile simply doing a scavaging pass and then a computing pass. By storing the work items in a Set, the uniqueness constraint will remove duplicates. The second pass can submit all of the work to an ExecutorService to perform the process in parallel.
A master-worker model generally assumes that the data provider has all of the work and supplies it to the master to manage. The master controls the work execution and deals with distributed computation, time-outs, failures, retries, etc. A fork-join abstraction is a recursive rather than iterative data provider. A map-reduce abstraction is a multi-step master-worker that is useful in certain scenarios.
A good example of master-worker is for trivially parallel problems, such as finding prime numbers. Another is a data load where each entry is independant (validate, transform, stage). The need to process a known working set, handle failures, etc. is what makes a master-worker model different than a thread-pool. This is why a master must be in control and pushes the work units out, whereas a threadpool allows workers to pull work from a shared queue.
I am building an application that reaches out to a FHIR API that implements paging, and only gives me a maximum of 100 results per page. However, our app requires the aggregation of these pages in order to hand over metadata to the UI about the entire result set.
When I loop through the pages of a large result set, I get HTTP status 429 - Too many requests. I am wondering if handing off these requests to a kafka service will help me get around this issue and maybe increase performance. I've read through the Intro and Use Cases sections of the Kafka documentation, but am still unclear as to whether implementing this tool will help.
You're getting 429 errors because you're making too many requests too quickly; you need to implement rate limiting.
As far as whether to use Kafka, a big part of that is whether your result set can fit in memory. If you can fit it in memory, then I would really suggest avoiding bringing in a separate service (KISS). If not, then yes, you can use Kafka. But I'd suggest taking a long think about whether you can use a relational datastore, because they're much more flexible. Or maybe even reading/writing directly to the disk
I were you, before I look into Kafka, I would try to solve why you are getting a 429 error. I would not leave that unnoticed. I would try to see how I am going to solve that.
I would looking into the following:
1) Sleep your process. The server response usually includes a Retry-after header in the response with the number of seconds you are supposed to wait before retrying.
2) Exponential backoff If the server's response does not tell you how long to wait, you can retry your request by inserting pauses by yourself in between.
Do keep it mind, before implementing sleep, it warrants extensive testing. You would have to make sure that your existing functionality does not get impacted.
To answer your question if Kafka would help you or not, the answer is it may or may not, with the limited info I can get from your question. Do understand that implementing Kafka would change your network architecture. You are bringing in a streaming platform to the equation. You would most probably implement caching which would aggregate your results. But at the moment all these concepts are at a very holistic level. I would suggest that you first ought to solve the 429 error and then warrant if a proper technical reason is present to implement Kafka which would improve your website's performance.