My application needs to work as middleware: it receives orders (as XML) from various customers, and each order contains a supplier id. Once it gets an order, it needs to send an order request to the corresponding supplier, also as XML. I am undecided about three aspects of the design. Here they are:
Questions:
What I am planning at a high level is: as soon as a request comes in, put it on a JMS queue. (Now I am not sure whether I should create a queue per supplier or whether one queue is sufficient. I think one queue will be sufficient, as maintaining a large number of queues would be overhead. The advantage of maintaining a separate queue per supplier is that messages could be processed faster, because each supplier's messages would be handled independently.)
Before putting the object on the queue I need to do some business validations. Also, the structure of the input XML I receive and the output XML I need to send to the supplier are different. For this I am planning to convert the input XML to a Java object before putting it on the queue, so that validation can be done with ease on the consumer side. Another thought is to not convert the XML into a Java object at all: just get the element values via XPath/XStream, validate them, and put the XML string on the queue as it is; then on the consumer side convert the XML to a Java object and then to the different XML format. Is there a way of doing it?
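Roughly what I have in mind for the XPath option (element names like order/supplierId are just placeholders, not my real schema):

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class OrderPreValidator {

    public boolean isValid(String orderXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(orderXml.getBytes(StandardCharsets.UTF_8)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        String supplierId = xpath.evaluate("/order/supplierId", doc);
        String quantity   = xpath.evaluate("/order/quantity", doc);

        // business validations on the extracted values; the original XML string
        // stays untouched and can be put on the queue as-is
        return !supplierId.isEmpty() && Integer.parseInt(quantity) > 0;
    }
}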
Now my requirement is that the consumer processes the messages on the queue every 5 hours and sends the XML requests to the suppliers. I am planning to use the Quartz scheduler here, with a job that picks the messages one by one and sends each one to the corresponding supplier based on supplierId. Here is my question: if my job picks the messages one by one and sends them to the suppliers, it will be too slow. I am planning to have the Quartz job create a thread pool of, say, ten threads which concurrently process the messages from the queue. (So there would be multiple consumers on the queue; I think that's not valid for a queue. Do I need a topic here instead of a queue?) Is the second approach better, or is there something better than both?
I am expecting a load of 50k requests per hour, which means around 15 requests per second.
Your basic requirement is:
Get orders from customers as XML (you have not told us how you receive them).
Do basic business validation.
Send the orders to the suppliers.
And you are expecting 50k requests (you haven't provided the approximate order size).
Assuming an average order size of 10 KB, around 500 MB would be required just to hold the orders in the queue (irrespective of the number of queues). I am not sure which environment you are running in.
For Point #1
I would choose a single queue instead of multiple queues.
- Choose an appropriate persistent store.
I am assuming you would be using a distributed queue, so that it can easily scale as you add cluster nodes.
For Point #2
I would convert to a POJO (your own format) and perform the business validation on it, so that if you later want to move the validation to a rule engine or add any other conversion, it will be easy to maintain.
- Basically, accept the input in any form (XML / POJO / JSON ...) and convert it into a middle format (you can write custom validators / conversion utilities on top of the middle format). Keep mappings between the common format and the input as well as the output formats, so that you can write formatters and reuse them; changing the format for any specific supplier then has no impact elsewhere. Try to externalize the format mapping.
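A rough sketch of the middle-format idea (field names and the formatter interface are illustrative assumptions, not your actual model):

class Order {                                  // the common middle format
    private String supplierId;
    private java.math.BigDecimal amount;
    // ... other common fields, setters omitted

    String getSupplierId() { return supplierId; }
    java.math.BigDecimal getAmount() { return amount; }
}

interface OrderFormatter {
    Order parse(String rawPayload);            // input XML/JSON/... -> middle format
    String format(Order order);                // middle format -> supplier-specific XML
}

class OrderValidator {                         // business validation sees only the middle format
    void validate(Order order) {
        if (order.getSupplierId() == null || order.getSupplierId().isEmpty()) {
            throw new IllegalArgumentException("supplierId is mandatory");
        }
        if (order.getAmount() == null || order.getAmount().signum() <= 0) {
            throw new IllegalArgumentException("amount must be positive");
        }
    }
}

Each supplier (and each input channel) gets its own OrderFormatter implementation, and the mapping from supplierId to formatter can live in external configuration.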
For Point #3
In your case an order needs to be processed only once, so I would go with a queue, and you can have multiple message listeners on it. Message listeners receive the orders asynchronously, so you can have several listeners on one queue, and each listener runs in its own thread.
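Here is a minimal sketch of that setup using the plain JMS 2.0 API (the queue lookup and the number of listeners are assumptions):

import javax.jms.ConnectionFactory;
import javax.jms.JMSContext;
import javax.jms.Queue;

public class OrderConsumers {

    public void start(ConnectionFactory factory, Queue orderQueue, int listenerCount) {
        for (int i = 0; i < listenerCount; i++) {
            // each listener gets its own context/session; the broker distributes the
            // messages between them, so an order is still delivered exactly once
            JMSContext context = factory.createContext();
            context.createConsumer(orderQueue).setMessageListener(message -> {
                // convert the XML to the supplier format and send the request here
            });
        }
    }
}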
Is there a problem with sending the orders as soon as they are received? It would be good for you as well as the suppliers to avoid a heavy load at one particular time.
Since you are the middleware, you should handle data quickly at the point of contact, to keep your hands free for more incoming requests. Therefore you must find a way to distinguish the incoming data as quickly and with as little memory as possible. Leave the processing of the data to modules more specific to the problem: a receptionist just directs the guests to the right spot.
If you really have to read and understand the received data in your specialized worker later on, use a thread pool. This way you can process the data in parallel without worrying too much about running out of memory. Just choose your pool size smartly and use only one pool. You could use a listener pattern to signal new incoming data to the worker multiton. You should avoid JAXB, or better, avoid fully deserializing the data at all if possible; it eats up memory like hell.
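Something along these lines (the pool size and method names are only placeholders):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OrderDispatcher {

    // one bounded pool for the whole application keeps memory use predictable
    private final ExecutorService workers = Executors.newFixedThreadPool(10);

    public void onIncomingOrder(String rawXml) {
        // the receptionist: hand the payload off immediately and return
        workers.submit(() -> processOrder(rawXml));
    }

    private void processOrder(String rawXml) {
        // parse only what you need, validate, transform, forward to the supplier
    }

    public void shutdown() {
        workers.shutdown();
    }
}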
I would not use JMS because your "messages" are relevant for only one listener.
If possible, send the outgoing request as soon as the worker is done with its work. If not, use storage. This way you can later prove you processed the data, and if something went wrong or you have to update your software, you do not have to worry about volatile data.
Related
I have a small question about Kafka group IDs. I can use this annotation in Java:
@KafkaListener(topics = "insert", groupId = "user")
There I can set the groupId that the listener consumes with, but it does not consume only messages meant for that group id; maybe that is because I can't send to a specific group id in the first place. How can I send to just one special groupId? What is the groupId actually for, or do I need to set up a dedicated topic for sending specific Kafka messages?
I already tried to find an answer online, but I found nothing; maybe I'm using Google wrong, haha.
I hope you all understand me; if not, please ask :)
Thanks a lot already!
Welcome to Kafka! First of all: You can't send to a consumer group, you send to a Topic.
Too much text below. Be aware of possible drowsiness while trying to read the entire answer.
If you are still reading this, I assume you truly want to know how to direct messages to specific clients, or you really need to get some sleep ASAP.
Maybe both. Do not drive afterwards.
Back to your question.
From that topic, multiple consumer groups can read. Every CG is independent from the others, so each one will read the topic from start to end on its own. Think of a CG as a union of endophobic consumers: they won't care about other groups, they won't ever talk to another group, they don't even know if the others exist.
I can think of three different ways to achieve your goal, by using different methodologies and/or architectures. The only one using Consumer Groups is the first one, but the other two may also be helpful:
subscribe
assign
Multiple Topics
The first two are based on mechanisms to divide messages within a single topic. The third one is only justified in certain cases. Let's get into these options.
1. Subscribe and Consumer Groups
You could create a new topic, fill it with messages, and add some metadata in order to recognize who needs to process each message (who the message is directed to).
Messages stored in Kafka contain, among other fields, a KEY and a VALUE (the message itself).
So let's say you want only GROUP-A to process some specific messages. One simple solution could be including an identifier on the key, such as a suffix. One of your keys could look like: key#GA.
On the consumer side, you poll() the messages from that topic and add a little extra conditional logic before processing them: you just read the key and check the suffix. If it corresponds to the specified consumer group (in this case, if it contains GA), then the consumer from GROUP-A knows that it must process the message.
For example, your Topic stores messages of two different natures, and you want them to be directed to two groups: GROUP-A and GROUP-Z.
key value
- [11#GA][MESSAGE]
- [21#GZ][MESSAGE]
- [33#GZ][MESSAGE]
- [44#GA][MESSAGE]
Both consumer groups will poll those 4 messages, but only some of them will be processed by each group.
Group-A will discard the 2nd and 3rd messages. It will process the 1st and 4th.
Group-Z will discard the 1st and 4th messages. It will process the 2nd and 3rd.
This is basically what you are aiming for, but using some extra logic and playing with Kafka's architecture: the messages with a certain suffix will be "directed" to a specific consumer group, and ignored by the other ones.
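A quick sketch of the consumer side of this idea ("orders" as the topic name, String serdes, and the "#GA" suffix are assumptions that simply follow the example above):

import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupAConsumerLoop {

    public void run(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("orders"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                if (record.key() != null && record.key().endsWith("#GA")) {
                    process(record.value());   // directed to GROUP-A
                }
                // everything else is simply ignored by this group
            }
        }
    }

    private void process(String value) { /* ... */ }
}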
2. Assign
The above solution is focused on consumer groups and Kafka's subscribe methodology. Another possible solution, instead of subscribing consumer groups, would be to use Kafka's assign method. No ConsumerGroup is involved here, so references to the previous groups will be quoted in order to avoid any confusion.
Assign allows you to directly specify the topic/partition from which your consumer must read.
On the producer side, you should partition your messages in order to divide them between the partitions within your topic, using your own logic. There is some deeper info about custom partitioners here (yeah, the author of that link seems like a complete douche).
For example, let's say you have 5 different types of consumers. So you create a topic with 5 partitions, one for each "group". Your producer's custom partitioner identifies the corresponding partition for each message, so after producing the messages from the previous example, each "group's" messages end up in their own partition.
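A hypothetical partitioner for that producer (the suffixes and partition numbers simply mirror the earlier example and are otherwise made up):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class GroupPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        String k = (String) key;
        if (k.endsWith("#GA")) return 0;    // 1st partition: "Group-A" messages
        if (k.endsWith("#GZ")) return 4;    // 5th partition: "Group-Z" messages
        return 2;                           // everything else goes somewhere in between
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}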
In order to direct the messages to their corresponding "groups" :
"Group-Z" is assigned the 5th partition.
"Group-A" is assigned the 1st partition.
The advantage of this solution is that fewer resources are wasted: each "group" just polls its own messages, and as every message is known to be directed to the consumer that polled it, you avoid the discard/accept logic: less traffic on the wire, fewer objects in memory, less CPU work.
The disadvantage is a more complex Kafka producer mechanism, which involves a custom partitioner that will almost surely need constant updating as your data or topic structures change. Moreover, every time the producer side is altered, the defined assignments of your consumers will have to be updated as well.
Personal note:
Assign offers better performance, but carries a high price: manual and constant control of producers, topics, partitions and consumers, hence being (possibly) more error-prone. I would call it the efficient solution.
Subscribe makes the whole process much simpler, and will probably involve fewer problems/errors in the system, hence being more reliable. I would call it the effective solution.
Anyway, this is a totally subjective opinion.
Not finished yet
3. Multi-topic solution
The previously proposed solutions assume that the messages share the same nature, hence will be produced in the same Topic.
In order to explain what I'm trying to say here, let's say a Topic is represented as a storage building.
(a warehouse storing laptops, tablets, smartphones, ...)
The previous solutions assume that you store similar elements there, for example electronic devices: their end of life is similar, the storage method is similar regardless of the specific device type, the machinery you use is the same, etc. With this in mind, it's completely logical to store all those elements in the same warehouse, divided into different sections (into the same topic, divided into different partitions).
There is no real reason to build a new warehouse for each electronic-device family (one for TVs, another for mobile phones, ... unless you are rolling in money). The previous solutions assume that your messages are different types of "electronic devices".
But time passes by and you are doing well, so you decide to start a new business: fruit storage.
Fruit has a shorter life (log.retention.ms, anyone?), must be stored within a certain temperature range, and your storage equipment and techniques from the first warehouse will probably differ by a lot. Moreover, your fruit business could be closed during certain periods of the year, while electronic devices are received 24/365. Even if you open your devices warehouse daily, maybe the fruit storage only operates on Mondays and Tuesdays (and, with luck, is not temporarily closed for the season).
As fruit and electronic devices need different types of storage management, you decide to build a new warehouse: your new fruits topic.
(a second warehouse storing bananas, kiwis, apples, chicozapotes, ...)
Creating a second topic is justified here, since each one could need different configuration values, and each one stores content of a very different nature. This also leads to consumers with very different processing logic.
So, is this a 3rd possible solution?
Well, it does make you forget about consumer groups, partitioning mechanisms, manual assignations, etc. You only have to decide which consumers subscribe to which Topic, and you're done: you effectively directed the messages to specific consumers.
But if you build a warehouse and start storing computers, would you really build another warehouse to store the phones that just arrived? In real life you would have to pay for the construction of the second building, as well as pay two sets of taxes, pay for the cleaning of two buildings, and so on.
(two warehouses side by side: laptops in one, tablets in the other)
In Kafka's world, this would be represented as extra work for the Kafka cluster (twice the replication requests, ZooKeeper has a newborn with new ACLs and controllers, ...), and extra time for the human assigned to this job, who is now responsible for managing two topics: a worker spending time on something that could be avoided is a synonym of €€€ lost by the company. Also, I am not aware if they already do this or ever will, but cloud providers are somewhat fond of adding small fees for certain operations, such as creating a topic (but this is just a possibility, and I may be wrong here).
To sum up, this is not necessarily a bad idea: it just needs a justifying context. Use it if you are working with bananas and Qualcomm chips.
If you are working with Laptops and Tablets, go for the consumer group and partition solutions previously shown.
I learnt from this blog and this tutorial that in order to test suppression with event-time semantics, one should send dummy records to advance stream time.
I've tried to advance time by doing just that. But this does not seem to work unless time is advanced for a particular key.
I have a custom TimestampExtractor which associates my preferred "stream-time" with the records.
My stream topology pseudocode is as follows (I use the Kafka Streams DSL API):
source.mapValues(someProcessingLambda)
.flatMap(flattenRecordsLambda)
.groupByKey(Grouped.with(Serdes.ByteArray(), Serdes.ByteArray()))
.windowedBy(TimeWindows.of(Duration.ofMinutes(10)).grace(Duration.ZERO))
.aggregate(()->null, aggregationLambda)
.suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()));
My input is of the following format:
1 - {"stream_time":"2019-04-09T11:08:36.000-04:00", id:"1", data:"..."}
2 - {"stream_time":"2019-04-09T11:09:36.000-04:00", id:"1", data:"..."}
3 - {"stream_time":"2019-04-09T11:18:36.000-04:00", id:"2", data:"..."}
4 - {"stream_time":"2019-04-09T11:19:36.000-04:00", id:"2", data:"..."}
.
.
Now records 1 and 2 belong to a 10 minute window according to stream_time and 3 and 4 belong to another.
Within that window, records are aggregated as per id.
I expected that record 3 would signal that stream time has advanced and cause suppress to emit the data corresponding to the 1st window.
However, the data is not emitted until I send a dummy record with id:1 to advance the stream time for that key.
Have I understood the testing instruction incorrectly? Is this expected behavior? Does the key of the dummy record matter?
I’m sorry for the trouble. This is indeed a tricky problem. I have some ideas for adding some operations to support this kind of integration testing, but it’s hard to do without breaking basic stream processing time semantics.
It sounds like you’re testing a “real” KafkaStreams application, as opposed to testing with TopologyTestDriver. My first suggestion is that you’ll have a much better time validating your application semantics with TopologyTestDriver, if it meets your needs.
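For reference, a minimal sketch with the test-utils module (the topic name, String serdes, and JSON payloads are placeholders; also note that if your custom TimestampExtractor reads the time from the payload, the advanced time has to go into the value rather than the record timestamp):

import java.time.Instant;
import java.util.Properties;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.streams.TestInputTopic;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.TopologyTestDriver;

public class SuppressionTest {

    public void drivesSuppression(Topology topology, Properties props) {
        try (TopologyTestDriver driver = new TopologyTestDriver(topology, props)) {
            TestInputTopic<String, String> input = driver.createInputTopic(
                    "input-topic", new StringSerializer(), new StringSerializer());

            input.pipeInput("1", "{...}", Instant.parse("2019-04-09T15:08:36Z"));
            input.pipeInput("1", "{...}", Instant.parse("2019-04-09T15:09:36Z"));
            // any key works here: the driver behaves like a single partition, so this
            // record advances stream time for everything and lets suppress() emit window 1
            input.pipeInput("2", "{...}", Instant.parse("2019-04-09T15:18:36Z"));
        }
    }
}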
It sounds to me like you might have more than one partition in your input topic (and therefore your application). In the event that key 1 goes to one partition, and key 3 goes to another, you would see what you’ve observed. Each partition of your application tracks stream time independently.
TopologyTestDriver works nicely because it only uses one partition, and also because it processes data synchronously. Otherwise, you’ll have to craft your “dummy” time advancement messages to go to the same partition as the key you’re trying to flush out.
This is going to be especially tricky because your “flatMap().groupByKey()” is going to repartition the data. You’ll have to craft the dummy message so that it goes into the right partition after the repartition. Or you could experiment with writing your dummy messages directly into the repartition topic.
If you do need to test with KafkaStreams instead of TopologyTestDriver, I guess the easiest thing is just to write a “time advancement” message per key, as you were suggesting in your question. Not because it’s strictly necessary, but because it’s the easiest way to meet all these caveats.
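A sketch of that per-key "time advancement" record, using the plain producer API (the topic name and dummy payload are assumptions; since you use a custom TimestampExtractor, the payload must also carry a stream_time past the window end plus grace):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class StreamTimeAdvancer {

    public void advance(KafkaProducer<String, String> producer, String key,
                        long timestampAfterWindowClose, String dummyPayload) {
        // use the same key as the window you want to flush, per the caveats above,
        // so the dummy record lands in the same partition after the repartition
        producer.send(new ProducerRecord<>("input-topic", null,
                timestampAfterWindowClose, key, dummyPayload));
    }
}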
I’ll also mention that we are working on some general improvements to stream time handling in Kafka Streams that should simplify the situation significantly, but that doesn’t help you right now, of course.
A client and a server application need to be implemented in Java. The scenario requires reading a large number of small objects from a database on the server side and sending them to the client.
This is not about transferring large files; rather, it requires streaming a large number of small objects to the client.
The number of objects that needs to be sent from server to client in a single request could be one or one million (let's assume the number of clients is limited for the sake of discussion, and ignore throttling).
The total size of the objects will in most cases be too big to hold in memory, so a way to defer the read-and-send operation on the server side until the client requests the objects is needed.
Based on my previous experience, the WCF framework in .NET supports the scenario above with:
transferMode of StreamedResponse
ability to return IEnumerable of objects
deferred serialization with the help of yield
Is there a Java framework that can stream objects as they requested while keeping the connection open with the client?
NOTE: This may sound like a very general question, but I am hoping that the specific details given here will lead to a clear answer benefiting me and possibly others.
A standard approach is to use a form of pagination and fetch the results in chunks which can be temporarily accommodated in memory. How to do that specifically depends on the database, but a basic JDBC approach would be to first execute a statement to find out the number of records and then fetch them in chunks. For example, Oracle has a ROWNUM pseudo-column that you can use to manage the ranges of records to return; other databases have other options.
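A rough JDBC sketch of the chunking idea, assuming an Oracle-style ROWNUM query and a hypothetical orders table (other databases would use LIMIT/OFFSET or FETCH FIRST instead):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ChunkedReader {

    private static final String PAGE_QUERY =
            "SELECT * FROM (SELECT t.*, ROWNUM rn FROM orders t WHERE ROWNUM <= ?) WHERE rn > ?";

    public void streamInChunks(Connection conn, int pageSize) throws Exception {
        int offset = 0;
        while (true) {
            int rows = 0;
            try (PreparedStatement ps = conn.prepareStatement(PAGE_QUERY)) {
                ps.setInt(1, offset + pageSize);
                ps.setInt(2, offset);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        rows++;
                        // map the row to an object and write it to the client here
                    }
                }
            }
            if (rows < pageSize) {
                break;                     // last chunk reached
            }
            offset += pageSize;
        }
    }
}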
You could use ObjectOutputStream / ObjectInputStream to do this.
The key to making this work would be to periodically call reset() on the output stream. If you don't do that, the sending and receiving ends will build a massive map that contains references to all objects sent / received over the stream.
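A minimal sketch of that periodic reset (the interval is an arbitrary choice):

import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.util.Iterator;

public class ObjectStreamer {

    private static final int RESET_INTERVAL = 1_000;

    public void send(Iterator<?> objects, OutputStream socketOut) throws Exception {
        try (ObjectOutputStream out = new ObjectOutputStream(socketOut)) {
            long count = 0;
            while (objects.hasNext()) {
                out.writeObject(objects.next());
                if (++count % RESET_INTERVAL == 0) {
                    out.reset();   // clears the handle table on both ends, freeing memory
                }
            }
        }
    }
}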
However, there may be issues with keeping a single request / response (or database cursor) open for a long time. And resuming a stream that failed could be problematic. So your solution should probably combine the above with some kind of pagination.
The other thing to note is that a scalable solution needs to avoid network latency from becoming the bottleneck. It may be worth implementing a receiver thread that eagerly pulls objects from the stream and buffers them in a (bounded) queue.
I have a priority queue class that I implemented in Java as an array of queues. I need a good way (without using serialization) of recording and storing the contents of the priority queue after each "transaction", i.e. each enqueue()/dequeue() of an object. It should serve as a backup in the event that the priority queue needs to be rebuilt by the program from the text file.
Some ideas I had and my problems with each:
After each "transaction", loop through the queues and write each one to a line in the file using delimiters between objects.
-- My problem with this is that it would require dequeueing and re-enqueueing all the objects and this seems highly inefficient.
After each enqueue or dequeue simply write that object or remove that object from the file.
-- My problem with this is: if this is the approach I should be taking, I am having a hard time coming up with a way to easily find and delete the object after being dequeued.
Any hints/tips/suggestions would be greatly appreciated!
To loop through a queue you can just iterate over it. This is non-destructive (but only loosely thread-safe).
Writing the contents of the queue to disk every time is likely to be very slow. For a typical hard drive, even a small queue will take about 20 ms to write, i.e. 50 writes per second at best. If you use an SSD this will be much faster for a small queue, but you still have to marshal your data even if you don't use serialization.
An alternative is to use a JMS server which is designed to support transactions, queues and persistence. A typical JMS server can handle about 10,000 messages per second. There are a number of good free servers available.
I would implement your requirements as a log pattern: at the end of your file, append every enqueue (with its priority) and append every dequeue. If your messaging server crashes, you can replay the log file and you'll end up in the appropriate state.
Obviously, your log file will grow huge over time. To combat this, you'll want to rotate log files every so often. To do that, serialize your queue at a point in time and then begin logging to a new file. You can even accomplish this without locking the state (freezing queue requests) by simultaneously logging transactions to the old and new logs while a snapshot of the data structure is written to disk. When the snapshot is complete, write a pointer to it to disk and you can delete your old log.
Write time and space are linear in the number of transactions; replays should be rare and are relatively fast.
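A bare-bones sketch of the log pattern (the line format and the String-based PriorityQueue are assumptions; your real elements and priority function would slot in here):

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.PriorityQueue;

public class LoggedPriorityQueue {

    private final PriorityQueue<String> queue = new PriorityQueue<>();
    private final BufferedWriter log;

    public LoggedPriorityQueue(Path logFile) throws IOException {
        // replay an existing log to rebuild the queue after a crash
        if (Files.exists(logFile)) {
            for (String line : Files.readAllLines(logFile)) {
                if (line.startsWith("ENQ ")) {
                    queue.add(line.substring(4));
                } else if (line.startsWith("DEQ")) {
                    queue.poll();
                }
            }
        }
        log = Files.newBufferedWriter(logFile,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public synchronized void enqueue(String item) throws IOException {
        log.write("ENQ " + item);
        log.newLine();
        log.flush();               // append before mutating, so the log never lags behind
        queue.add(item);
    }

    public synchronized String dequeue() throws IOException {
        log.write("DEQ");
        log.newLine();
        log.flush();
        return queue.poll();
    }
}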
To find objects easily with the second approach, I have a couple of suggestions:
You can use your priority function to keep the objects sorted in the file.
To manage newly added objects at different positions, keep some space between the inserted objects in the text file; when an object is inserted, you can use a pointer-like mechanism to record its offset, or something else that can be easily managed.
Use a buffer, since writing the content every time can be very slow.
Deletion will be trivial if you use your priority function carefully.
Also, sorting within small buckets referenced by pointers will be very fast, and you can always use garbage-collection-like behavior by compacting all the objects from time to time.
One more suggestion (to consider if using exactly one file is not a must):
If the number of objects is not very large, store each object in a separate file. Of course, you will need to create a unique identifier for each object, and you can use this identifier as the file name too. This way you always add or delete a single file, based on the identifier stored in the object. If the objects are of various classes that can't be modified, you can simply keep a hashmap that maps identifiers to objects: before you add an object to a queue, create an identifier, add the object and the identifier to the map as a pair, and write a new file named after the identifier and containing the object. I leave what to do on delete and reload as an exercise, as it is nothing more than practice.
Personally, I favour what Robert Harvey suggested in his comment on the question: consider using a database, especially if your project already has one. This will make storing and deleting objects easier and faster than locating positions within a file, because even if you find the location of an object in a file, you will most probably have to rewrite the whole file (just without that object), and that is no different from looping over everything. Using a database, you avoid all of this trouble.
I challenge you :)
I have a process that someone already implemented. I will try to describe the requirements, and I was hoping to get some input on the "best way" to do this.
It's for a financial institution.
I have a routing framework that allows me to receive files and send requests to other systems. I have a database I can use as I wish, but only me and my software have access to this database.
The facts
Via the routing framework I receive a file.
Each line in this file follows a fixed-length format with the identification of a person and an amount (plus lots of other stuff).
This file is, 99% of the time, well below 100 MB (around 800 bytes per line, i.e. 2.2 MB = 2,600 lines).
Once a year we get 1-3 GB of data instead.
Running on an "appserver".
I can fork subprocesses as I like (within reason).
I cannot ensure consistency when running for more than two days: subprocesses may die, the connection to the DB/framework might be lost, files might move.
I can NOT send reliable messages via the framework. The call is synchronous, so I must wait for the answer.
It's possible/likely that sending these getPerson requests will crash my "process" when sending LOTS of them.
We're using Java.
Requirements
I must return a file with all the data, and I must add some more info for some of the lines (about 25-50% of the lines: at least 25,000).
This info I can only get by doing a getPerson request via the framework to another system, one request per person. Each takes between 200 and 400 ms.
It must be able to complete within two days.
Nice to have
Checkpointing. If it's going to run for a long time, I sure would like to be able to restart the process without starting from the top.
...
How would you design this?
I will add the current "hack" and my rough idea later.
========== Current solution ================
It's running on BEA/Oracle WebLogic Integration, not by choice but by definition.
When the file is received, each line is read into a database row with id, line, batch file name and status 'Needs processing'.
When all lines are in the database, the rows are separated by mod 4 and a process is started for each quarter of the rows; each line that needs it is enriched by the getPerson call and its status is set to 'Processed' (38,000 in the current batch).
When all 4 quarters of the rows have been processed, a writer process starts by selecting 100 rows at a time from the database, writing them to the file and updating their status to 'Written'.
When all is done, the new file is handed back to the routing framework, and an "I'm done" email is sent to the operations crew.
The 4 processing processes can and will fail, so it's possible to restart them with an HTTP GET to a servlet on WLI.
Simplify as much as possible.
The batches (trying to process them as units, with their various sizes) appear to be discardable in terms of the simplest process. It sounds like the rows are atomic, not the batches.
Feed all the lines as separate atomic transactions through an asynchronous FIFO message queue, with a good mechanism for detecting (and appropriately logging and routing) failures. Then you can deal with problems strictly on an exception basis. (A queue table in your database can probably work.)
Maintain batch identity only with a column in the message record, and summarize batches by that means however you need, whenever you need.
When you receive the file, parse it and put the information in the database.
Make one table with a record per line that will need a getPerson request.
Have one or more threads get records from this table, perform the request and put the completed record back in the table.
Once all records are processed, generate the complete file and return it.
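A rough sketch of that worker loop (table/column names, status values and the GetPersonClient interface are assumptions standing in for the real framework call; a production version should claim rows with SELECT ... FOR UPDATE SKIP LOCKED or similar so two workers cannot grab the same line):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.sql.DataSource;

interface GetPersonClient {
    String getPerson(String personId);         // the 200-400 ms framework call
}

class EnrichmentWorkers {

    private final DataSource ds;
    private final GetPersonClient client;

    EnrichmentWorkers(DataSource ds, GetPersonClient client) {
        this.ds = ds;
        this.client = client;
    }

    void start(int workerCount) {
        ExecutorService pool = Executors.newFixedThreadPool(workerCount);
        for (int i = 0; i < workerCount; i++) {
            pool.submit(this::workLoop);
        }
        pool.shutdown();
    }

    private void workLoop() {
        try (Connection conn = ds.getConnection()) {
            while (true) {
                long rowId;
                String personId;
                try (PreparedStatement claim = conn.prepareStatement(
                        "SELECT id, person_id FROM batch_line " +
                        "WHERE status = 'NEEDS_PROCESSING' FETCH FIRST 1 ROWS ONLY");
                     ResultSet rs = claim.executeQuery()) {
                    if (!rs.next()) {
                        return;                // nothing left; a restart simply resumes here
                    }
                    rowId = rs.getLong("id");
                    personId = rs.getString("person_id");
                }
                String extraInfo = client.getPerson(personId);
                try (PreparedStatement update = conn.prepareStatement(
                        "UPDATE batch_line SET extra_info = ?, status = 'PROCESSED' WHERE id = ?")) {
                    update.setString(1, extraInfo);
                    update.setLong(2, rowId);
                    update.executeUpdate();
                }
            }
        } catch (Exception e) {
            e.printStackTrace();               // log; the restart will pick up the remaining rows
        }
    }
}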
If the processing of the file takes two days, then I would start by implementing some sort of resume feature. Split the large file into smaller ones and process them one by one; if for some reason the whole processing is interrupted, you will not have to start all over again.
By splitting the large file into smaller files, you could also use more servers to process the files.
You could also use a bulk loader (Oracle's SQL*Loader, for example) to load the large amount of data from the file into a table, again adding a column to mark whether a line has been processed, so you can pick up where you left off if the process crashes.
The return value could be many small files which at the end would be combined into one large file. If the database approach is chosen, you could also save the results in a table, which could then be extracted to a CSV file.