I'm in the process of migrating an ACID-based monolith to an event-based microservice architecture. In the monolith potentially large files are stored in a database and I want to share this information (including the file content) with the microservices.
My approach would be to split the file into numbered blocks and send several messages (e.g. one FileCreatedMessage with metadata and an id, followed by n FileContentMessages, each containing a block and its sequence number). On the receiving side, messages may not arrive in order, so I'd store the blocks from the messages, order and join them, and store the result.
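To make that layout concrete, here is a minimal sketch of the producing side (Java, kafka-clients). The topic names, the chunk size, and the fact that the sequence number is left out of the record itself are simplifications on my part; this is exactly the splitting bookkeeping I'd rather not have to maintain:

```java
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FileSplitter {
    private static final int CHUNK_SIZE = 512 * 1024; // 512 KB per FileContentMessage

    public static void send(Producer<String, byte[]> producer, String fileId, byte[] content) {
        int chunkCount = (content.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
        // FileCreatedMessage: metadata plus the id (here just the expected chunk count)
        producer.send(new ProducerRecord<>("file-created", fileId, String.valueOf(chunkCount).getBytes()));
        for (int seq = 0; seq < chunkCount; seq++) {
            int from = seq * CHUNK_SIZE;
            int to = Math.min(from + CHUNK_SIZE, content.length);
            byte[] block = java.util.Arrays.copyOfRange(content, from, to);
            // FileContentMessage: the sequence number would travel in a header or a small
            // payload envelope (omitted here); keying by fileId keeps the chunks on one partition
            producer.send(new ProducerRecord<>("file-content", fileId, block));
        }
    }
}
```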
Is there an approach that lets me stream the data through Kafka in a single message, or some other approach that avoids the overhead of implementing the splitting, ordering, and joining logic across several messages?
I noticed Kafka Streams, but it seems to solve a different problem than this one.
Kafka is not the right approach for sending large files. First, you need to ensure that all chunks of one file go to the same partition, so that they are processed by a single consumer instance. The weak point here is that your consumer may fail in the middle, losing the chunks it has already gathered. If you store the chunks in some storage (a database) until all of them arrive, then you will need a separate process to assemble them. You will also need to think about what happens if you lose a chunk or hit an error while processing one. We thought about this question in our company and decided not to send files through Kafka at all, but to keep them in storage and send a reference to them inside the message.
This article summarizes pros and cons.
Kafka Streams will not help you here, as it is a framework that provides high-level constructs for working with streams, but it still just runs on top of Kafka.
I try not to use Kafka to hold large file content. Instead, I store the file on a distributed file system (usually HDFS, but there are other good ones) and then put the URI into the Kafka message along with any other meta data I need. You do need to be careful of replication times within the distributed file system if processing your Kafka topic on a distributed streaming execution platform (e.g. Storm or Flink). There may be instances where the Kafka message is processed before the DFS can replicate the file for access by the local system, but that's easier to solve than the problems caused by storing large file content in Kafka.
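As a rough illustration of the producing side under this approach (Java, kafka-clients); the topic name, the JSON layout, and the HDFS URI below are placeholders, not a prescribed scheme:

```java
// The file is written to the distributed file system first; only a small reference
// message (URI + metadata) is published to Kafka.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class FileReferenceProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String fileId = "scan-42";
            // URI of the file already stored on the DFS (HDFS here), not the file content itself
            String payload = "{\"fileId\":\"" + fileId + "\","
                    + "\"uri\":\"hdfs://namenode:8020/ingest/scan-42.pdf\","
                    + "\"sizeBytes\":1048576}";
            producer.send(new ProducerRecord<>("file-events", fileId, payload));
        }
    }
}
```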
I have developed a Document Management System (DMS) with an OCR feature. However, it takes too much time to process and causes high CPU usage.
My current process is synchronous, as below:
User uploads his file
OCR process
Store document information in DB
Considering the real-time production load, I want to make the second step above asynchronous and run it on a dedicated, separate file-processing server.
My questions are:
Is it the right way to do it?
How do I send/retrieve that file to another server for processing? I also found out that I could use a message queue, but I cannot put the whole file in it.
Is there any way we can acknowledge process completion?
Just to close this question: I have successfully separated the OCR process onto a separate file-processing server using a FIFO approach, which really helped me resolve the high CPU usage.
I followed the steps below:
User uploads file
The document's OCR status is set to pending
A separate server processes pending files one at a time, in FIFO order
The OCR status is updated in the database once processing completes
Processing servers can be added later, as the need and load require.
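For anyone curious, here is a rough sketch of the kind of polling worker the processing server runs (Java/JDBC); the table, column names, and the runOcr placeholder are illustrative, not my exact code:

```java
// The worker picks the oldest pending document (FIFO), runs OCR, and updates its status.
import java.sql.*;

public class OcrWorker {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://dbhost/dms", "dms", "secret")) {
            while (true) {
                try (PreparedStatement pick = conn.prepareStatement(
                        "SELECT id, file_path FROM documents " +
                        "WHERE ocr_status = 'PENDING' ORDER BY uploaded_at LIMIT 1");
                     ResultSet rs = pick.executeQuery()) {
                    if (!rs.next()) { Thread.sleep(5000); continue; } // nothing pending yet
                    long id = rs.getLong("id");
                    String text = runOcr(rs.getString("file_path")); // placeholder for the OCR call
                    try (PreparedStatement done = conn.prepareStatement(
                            "UPDATE documents SET ocr_status = 'DONE', ocr_text = ? WHERE id = ?")) {
                        done.setString(1, text);
                        done.setLong(2, id);
                        done.executeUpdate();
                    }
                }
            }
        }
    }

    private static String runOcr(String path) {
        return "...extracted text for " + path; // stand-in for a real OCR engine
    }
}
```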
Many people suggest that a good way of organizing IPC (inter-microservice communication) is asynchronous communication via queues like Kafka and JMS.
But what if I need to pass large data files between services?
Suppose I have a Video Microservice and a Publisher Microservice. The first one receives videos from the user, verifies them, and sends them to the Publisher for converting and publishing. It's obvious that a video can be a very large file and could overload the messaging system (Kafka is not suitable for big messages at all). Of course, I could share one database between them and send a video_id via Kafka, but that couples these services, and it's not a real microservices architecture anymore.
Do you have similar situations in practice? How do you handle them?
Thanks
There is an Enterprise Integration Pattern from the book by Hohpe/Woolf called the Claim Check Pattern that addresses these concerns.
Essentially, the big blob is removed from the message and stored somewhere that both sender and receiver can access, whether that be a common file share, an FTP server, an Amazon S3 object, whatever. It leaves a "claim check" behind: some sort of address that describes how to retrieve the blob.
The tiny message can then be transmitted over Kafka/JMS, or some other message queue system, most of which are fairly bad at dealing with large data blobs.
Of course, a very simple implementation is to leave the files on a file share and only refer to them by file path.
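A minimal sketch of that simple variant (Java NIO), assuming a shared directory both services can reach; the directory path, the class name, and the UUID-as-claim-check choice are just illustrative assumptions:

```java
// Sender parks the blob on the share and sends only a small token; receiver redeems it.
import java.nio.file.*;
import java.util.UUID;

public class FileShareClaimCheck {
    private final Path sharedDir = Paths.get("/mnt/shared/claim-checks");

    // Sender side: store the blob and return the "claim check" to put in the Kafka/JMS message.
    public String store(Path largeFile) throws Exception {
        String claimCheck = UUID.randomUUID().toString();
        Files.copy(largeFile, sharedDir.resolve(claimCheck), StandardCopyOption.REPLACE_EXISTING);
        return claimCheck;
    }

    // Receiver side: redeem the claim check to get the blob back.
    public Path retrieve(String claimCheck) {
        return sharedDir.resolve(claimCheck);
    }
}
```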
It's more complex when it's preferable to have the blob integrated with the rest of the message, requiring a true Claim Check implementation. This can be handled at an infrastructure level so the message sender and receiver don't need to know any of the details behind how the data is transmitted.
I know that you're in the Java landscape, but in NServiceBus (I work for Particular Software, the makers of NServiceBus) this pattern is implemented with the Data Bus feature in a message pipeline step. All the developer needs to do is identify what type of message properties apply to the data bus, and (in the default file share implementation) configure the location where files are stored. Developers are also free to provide their own data bus implementation.
One thing to keep in mind is that with the blobs disconnected from the messages, you have to provide for cleanup. If the messages are one-way, you could clean them up as soon as the message is successfully processed. With Kafka (which I'm not terribly familiar with) there's a possibility of processing messages from a stream multiple times, correct? If so, you'd want to wait until it was no longer possible to process that message. Or, if the Publish/Subscribe pattern is in use, you would not want to clean up the files until you were sure all subscribers had had a chance to process them. To accomplish that, you'd need to set an SLA (a timespan within which each message must be processed) on the message and clean up the blob storage after that timespan has elapsed.
In any case, lots of things to consider, which make it much more useful to implement at an infrastructure level rather than try to roll your own in each instance.
I have an app which will generate 5 - 10 new database records in one host each second.
The records don't need any checks. They just have to be recorded in a remote database.
I'm using Java for the client app.
The database is behind a server.
Sending the data can't make the app wait, so sending each single record to the remote server, at least synchronously, is probably not a good idea.
Sending data must not fail. My app doesn't need an answer from the server, but it has to be 100% sure that the data arrives at the server correctly (which should be guaranteed by using, for example, an HTTP URL connection over TCP...?).
I thought about a few approaches for this:
Run the send data code in a separate thread.
Store the data only in memory and send it to the database after a certain count.
Store the data in a local database and have it sent / pulled by the server on request.
All of this makes sense, but I'm a noob at this, and maybe there's some standard approach I'm missing that makes things easier. I'm not sure which way to go.
Your requirements aren't very clear. My best answer is to go through your question, and try to point you in the right direction on a point-by-point basis.
"The records don't need any checks," and "My app doesn't need an answer, but it has to be 100% secure that it arrives at the server correctly."
How exactly are you planning on the client knowing that the data was received without sending a response? You should always plan to write exception handling into your app, and deal with a situation where the client's connection, or the data it sends, is dropped for some reason. These two statements you've made seem to be in conflict with one another; you don't need a response, but you need to know that the data arrives? Is your app going to use a crystal ball to divine confirmation of the data being received? (If so, please send me such a crystal ball - I'd like to use it to short the stock market.)
"Run the send data code in a separate thread," and "store the data in memory and send later," and "store the data locally and have it pulled by the server", and "sending data can't make my app wait".
Ok, so it sounds like you want non-blocking I/O. But the reality is, even with non-blocking I/O, it still takes some amount of time to actually send the data. My question is: why are you asking for non-blocking and/or fast I/O? If data transfers were simply extremely fast, would it really matter if they weren't also non-blocking? This is a design decision on your part, but it's not clear from your question why you need this, so I'm just throwing it out there.
As far as putting the data in memory and sending it later, that's not really non-blocking, or multi-tasking; that's just putting off the work until some future time. I consider that software procrastination. This method doesn't reduce the amount of time or work your app needs to do in order to process that data, it just puts it off to some future date. This doesn't gain you anything unless there's some benefit to "batching" data sending into large chunks.
The in-memory idea also sounds like a temporary buffer. Many of the I/O stream implementations are going to have a buffer built in, as well as the buffer on your network card, as well as the buffer on your router, etc., etc. Adding another buffer in your code doesn't seem to make any sense on the surface, unless you can justify why you think this will help. That is, what actual, experienced problem are you trying to solve by introducing a buffer? Also, depending on how you're sending this data (i.e. which network I/O classes you choose) you might get non-blocking I/O included as part of the class implementation.
Next, as for sending the data on a separate thread: that's fine if you need non-blocking I/O, but (1) you need to justify why that's a good idea in terms of the design of your software before you go down that route, because it adds complication to your app; unless it solves a specific, real problem (e.g. your app has a UI that shouldn't freeze or become unresponsive due to pending I/O operations), it's just added complication, and you won't get any added performance out of it. (2) There's a common temptation to use threads to, again, basically procrastinate work. Putting the work off onto another thread doesn't reduce the total amount of work needing to be done, or the total amount of I/O your app will consume in order to accomplish its function; it just puts it off on another thread. There are times when this is highly beneficial, and maybe it's the right decision for your app, but from your description I see a lot of requested features without the justification (or explanation of the problem you're trying to solve) to back up these feature/design choices, and that is what should ultimately drive the direction you choose to go.
Finally, as far as having the server "pull" it instead of it being pushed to the server, well, all you're doing here is flipping the roles, and making the server act as a client, and the client the server. Realize that "client" and "server" are relative terms, and the server is the thing that's providing the service. Simply flipping the roles around doesn't really change anything - it just flips the client/server roles from one part of the software to the other. The labels themselves are just that - labels - a convenient way to know which piece is providing the service, and which piece is consuming the service (the client).
"I have an app which will generate 5 - 10 new database records in one host each second."
This shouldn't be a problem. Any decent DB server will treat this sort of work as extremely low load. The bigger concern in terms of speed/responsiveness from the server will be things like network latency (assuming you're transferring this data over a network) and other factors regarding your I/O choices that will affect whether or not you can write 5-10 records per second - that is, your overall throughput.
The canonical, if unfortunately enterprisey, answer to this is to use a durable message queue. Your app would send messages to the queue, and a backend app would receive and store them in a database. Once the queue has accepted a message, it guarantees that it will be made available to the receiver, even if the sender, the receiver, or the queue broker itself crashes.
On my machine, using HornetQ, it takes ~1 ms to construct and send a short text message to a durable queue. That's quick enough that you can do it as part of handling a web request without adding any noticeable additional delay. Any good message queue will support your 10 messages per second throughput. HornetQ has been benchmarked as handling 8.2 million messages per second.
I should add that message queues are not that hard to set up and use. I downloaded HornetQ, and had it up and running in a few minutes. The code needed to create a queue (using the native HornetQ API) and send and receive messages (using the JMS API) is less than a hundred lines.
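To give a flavor, here is a minimal sketch of sending one record as a persistent text message over the JMS API; the JNDI names, the queue name, and the message body are placeholders, and the broker-specific setup (HornetQ configuration and so on) is omitted:

```java
import javax.jms.*;
import javax.naming.InitialContext;

public class RecordSender {
    public static void main(String[] args) throws Exception {
        InitialContext ctx = new InitialContext();
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("ConnectionFactory");
        Queue queue = (Queue) ctx.lookup("queue/records");

        Connection connection = cf.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(queue);
            // PERSISTENT delivery: the broker stores the message before acknowledging the send,
            // so it survives a broker restart.
            producer.setDeliveryMode(DeliveryMode.PERSISTENT);

            TextMessage message = session.createTextMessage("{\"record\":\"...\"}");
            producer.send(message);
        } finally {
            connection.close();
        }
    }
}
```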
If you queue the data and send it in a thread, it should be fine if your rate is 5-10 per second and there's only one client. If you have multiple clients, to the point where your database inserts begin to get slow, you could have a problem, given your requirement that "sending data must not fail", which is a much more difficult requirement, especially in the face of machine or network failure.
Consider the following scenario. You have more clients than your database can handle efficiently, and one of your users is a fast typist. Inserts begin to back up in memory in their app. They finish their work and shut it down before the last ones are actually uploaded to the database. Or the machine crashes before the data is sent, or while it's sending; or worse yet, the database crashes while it's sending, and due to network issues the client can't really tell that its transaction has not completed.
The easy way to avoid these problems (most of them, anyway) is to make the user wait until the data is committed somewhere before allowing them to continue. If you can make the database inserts fast enough, then you can stick with a simpler scheme. If not, then you have to be more creative.
For example, you could write the data to local disk when the user hits submit, and then upload it from another thread. This scheme needs to be smart enough to mark anything that has been persisted as sent (deleting it would work), to re-scan at startup for unsent work, and to keep retrying in the case of network or central server failure.
There also needs to be a way for the server side to detect duplicates, because the client machine could send the data and crash before it can mark it as sent, and then upon restart it would send it again. The same situation could occur with a bad network connection: the client could send the data, never receive confirmation from the server, time out, and then end up retrying.
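A rough sketch of that local-spool approach (Java); the spool directory, the idempotency-key naming, and sendToServer are stand-ins for whatever transport and server-side duplicate detection you actually use:

```java
// Records are persisted to disk on submit; a background thread uploads them and deletes the
// local copy only after the server confirms. The file name doubles as an idempotency key.
import java.nio.file.*;
import java.util.UUID;
import java.util.concurrent.*;

public class SpoolingSender {
    private final Path spoolDir = Paths.get("spool");
    private final ScheduledExecutorService uploader = Executors.newSingleThreadScheduledExecutor();

    public SpoolingSender() throws Exception {
        Files.createDirectories(spoolDir);
        uploader.scheduleWithFixedDelay(this::drain, 0, 1, TimeUnit.SECONDS);
    }

    // Called on submit: data is durable locally before the user moves on.
    public void submit(String record) throws Exception {
        String idempotencyKey = UUID.randomUUID().toString();
        Files.write(spoolDir.resolve(idempotencyKey + ".rec"), record.getBytes());
    }

    // Background pass: retry unsent records; deleting the file marks it as sent.
    private void drain() {
        try (DirectoryStream<Path> pending = Files.newDirectoryStream(spoolDir, "*.rec")) {
            for (Path file : pending) {
                String key = file.getFileName().toString().replace(".rec", "");
                if (sendToServer(key, Files.readAllBytes(file))) {
                    Files.delete(file);
                }
            }
        } catch (Exception e) {
            // Network or server failure: leave the files in place and retry on the next pass.
        }
    }

    private boolean sendToServer(String idempotencyKey, byte[] payload) {
        // Placeholder for the real upload (HTTP, JMS, JDBC...); the server should ignore
        // any idempotencyKey it has already stored.
        return true;
    }
}
```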
If you don't want the client app to block, then yes, you need to send the data from a different thread.
Once you've done that, then the only thing that matters is whether you're able to send records to the database at least as fast as you're generating them. I'd start off by getting it working sending them one-by-one, then if that isn't sufficient, put them into an in-memory queue and update in batches. It's hard to say more, since you don't give us any idea what is determining the rate at which records are generated.
You don't say how you're writing to the database... JDBC? ORM like Hibernate? But the principles are the same.
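If the one-by-one writes turn out to be too slow, a minimal sketch of the in-memory queue plus batched inserts (plain JDBC, since the principles are the same either way) might look like this; the connection string, table, and column are placeholders:

```java
import java.sql.*;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class BatchingWriter implements Runnable {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Called by the app thread: never blocks on the database.
    public void enqueue(String record) {
        queue.add(record);
    }

    @Override
    public void run() {
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://dbhost/app", "app", "secret")) {
            while (!Thread.currentThread().isInterrupted()) {
                List<String> batch = new ArrayList<>();
                batch.add(queue.take());  // wait for at least one record
                queue.drainTo(batch, 99); // then grab up to 99 more
                try (PreparedStatement ps = conn.prepareStatement("INSERT INTO records (payload) VALUES (?)")) {
                    for (String record : batch) {
                        ps.setString(1, record);
                        ps.addBatch();
                    }
                    ps.executeBatch();    // one round trip for the whole batch
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (SQLException e) {
            e.printStackTrace();          // in a real app: retry or spool to disk
        }
    }
}
```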
I have a table in a central database that gets constantly appended to. I want to batch these appends every couple of minutes and have them sent to a bunch of "slave" servers. These servers will, in turn, process that data and then discard it (think distributed warehousing). Each "slave" server needs a different subset of the data and there is a natural partitioning key I can use for that.
Basically, I need this to be eventually consistent: every batch of data must eventually be delivered to every "slave" server (reliability), even if the "slave" was down at the moment the batch was ready to be delivered (durability). I don't care about the order in which the batches are delivered.
Possible solutions I have considered:
MySQL replication does not fit my requirement because I would have to replicate the whole table on each server.
ETL & ESB products are too bloated for this; I am not doing any data processing.
Plain JMS: I could use it, but I'm looking for something even simpler.
JGroups is interesting, but members that have left the group will not get the messages once they rejoin.
Pushing files & ack files across servers: doable, but I don't know of any framework for it, so I would need to write my own.
Note : This question is about how to move the data from the central server to the N others with reliability & durability; not how to create or ingest it.
(Edited on Aug 24 to add durability requirement)
You may use JGroups for this. It's a toolkit for reliable multicast communication.
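A bare-bones sketch (JGroups 3.x-style API; the cluster name is made up) of joining a group and multicasting to its members. Note that, out of the box, members that are down when a message is sent will not receive it when they rejoin, which is the durability gap mentioned in the question:

```java
import org.jgroups.JChannel;
import org.jgroups.Message;
import org.jgroups.ReceiverAdapter;

public class BatchMulticaster extends ReceiverAdapter {
    public static void main(String[] args) throws Exception {
        JChannel channel = new JChannel();          // default UDP protocol stack
        channel.setReceiver(new BatchMulticaster());
        channel.connect("warehouse-batches");       // join (or create) the cluster
        channel.send(new Message(null, "batch payload for all members")); // null dest = multicast
        channel.close();
    }

    @Override
    public void receive(Message msg) {
        System.out.println("received: " + msg.getObject());
    }
}
```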
I ended up finding Spring Integration, which includes plugins to poll directories via SFTP, for example.
http://www.springsource.org/spring-integration