I have developed a Document Management System (DMS) with an OCR feature. However, processing takes too long and causes high CPU usage.
My current process is synchronous, as below:
User uploads a file
OCR process runs
Document information is stored in the DB
Considering the real-time production load, I want to make the second step above asynchronous and run it on a dedicated file-processing server.
My questions are:
Is this the right way to do it?
How do I send/retrieve the file to another server for processing? I have also read about using a message queue, but I cannot put a whole file in it.
Is there any way to acknowledge process completion?
Just to close this question: I have successfully moved the OCR process to a separate file-processing server using a FIFO approach, which resolved the high CPU usage.
I followed these steps:
User uploads a file
OCR status is set to pending
A separate server processes pending files one at a time, in FIFO order (a worker sketch follows this list).
The OCR status is updated in the database.
More processing servers can be added later, as the load requires.
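A minimal sketch of such a worker loop, assuming the uploaded files live on storage both servers can reach (NFS, S3, or similar) so only the document ID and path travel through the database, and assuming a documents table with status, file_path, and ocr_text columns; the connection string and runOcr are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class OcrWorker {
        public static void main(String[] args) throws Exception {
            // Assumed connection string; point it at your DMS database.
            try (Connection db = DriverManager.getConnection("jdbc:postgresql://db-host/dms")) {
                while (true) {
                    // FIFO: oldest pending document first
                    try (PreparedStatement select = db.prepareStatement(
                            "SELECT id, file_path FROM documents WHERE status = 'PENDING' ORDER BY id LIMIT 1");
                         ResultSet rs = select.executeQuery()) {
                        if (!rs.next()) { Thread.sleep(2000); continue; } // nothing pending, poll again
                        long id = rs.getLong("id");
                        String text = runOcr(rs.getString("file_path")); // placeholder OCR call
                        try (PreparedStatement update = db.prepareStatement(
                                "UPDATE documents SET status = 'DONE', ocr_text = ? WHERE id = ?")) {
                            update.setString(1, text);
                            update.setLong(2, id);
                            update.executeUpdate();
                        }
                    }
                }
            }
        }

        private static String runOcr(String path) { /* invoke your OCR engine here */ return ""; }
    }

Polling the table keeps things simple; a message queue carrying just the document ID (not the file itself) would remove the polling delay, and the completion acknowledgement is simply the status flipping to DONE.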
I have the most basic problem ever. The user wants to export some data, around 20-70k records; the export can take 20-40 seconds to execute, and the file can be around 5-15 MB.
Currently my code is as such:
User clicks a button which makes an API call to a Java Lambda
The AWS Lambda handler calls a method to fetch the data from the DB and generate an Excel file using Apache POI
Response headers are set and the file is sent as XLSX in the response body
I am now faced with two bottlenecks:
API Gateway times out after 29 seconds; if the file takes longer to generate, the request fails and the user gets a 504 in the browser
A Lambda response can only be 6 MB; if the file is bigger, the user gets a 413/502 in the browser
What should my approach be to download a file generated at runtime (not pre-built in S3) using AWS?
If you want to keep it simple (no additional queues or async processing) this is what I'd recommend to overcome the two limitations you describe:
Use the new AWS Lambda Function URLs. Since that option doesn't go through API Gateway, you shouldn't be restricted to the 29-second timeout (not 100% sure about this).
Write the file to S3, then get a temporary presigned URL to the file and return a redirect (HTTP 302) to the client. This way you won't be restricted to the 6MB response size.
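A hedged sketch of the second option, assuming the AWS SDK for Java v2 and an API Gateway proxy integration; the bucket, key, and class names are illustrative:

    import java.time.Duration;
    import java.util.Map;
    import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyResponseEvent;
    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.GetObjectRequest;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;
    import software.amazon.awssdk.services.s3.presigner.S3Presigner;
    import software.amazon.awssdk.services.s3.presigner.model.GetObjectPresignRequest;

    public class ExportRedirect {
        // bucket and key are assumptions; xlsxBytes is the workbook you generated
        static APIGatewayProxyResponseEvent redirectToFile(byte[] xlsxBytes) {
            String bucket = "my-export-bucket", key = "exports/report.xlsx";
            try (S3Client s3 = S3Client.create(); S3Presigner presigner = S3Presigner.create()) {
                s3.putObject(PutObjectRequest.builder().bucket(bucket).key(key).build(),
                             RequestBody.fromBytes(xlsxBytes));
                String url = presigner.presignGetObject(GetObjectPresignRequest.builder()
                        .signatureDuration(Duration.ofMinutes(10)) // link valid for 10 minutes
                        .getObjectRequest(GetObjectRequest.builder().bucket(bucket).key(key).build())
                        .build()).url().toString();
                return new APIGatewayProxyResponseEvent()
                        .withStatusCode(302)                       // redirect the browser to S3
                        .withHeaders(Map.of("Location", url));
            }
        }
    }

The browser follows the 302 straight to S3, so the large payload never passes through Lambda or API Gateway.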
Here are the possible options for you.
Use your JavaScript skills to the rescue. Accept the request from the browser/client and immediately respond that file preparation is in progress. Meanwhile, keep preparing the file in the background (a separate job). Using JavaScript, poll the file's status with a separate request, and return the file once it is ready (a status-endpoint sketch follows below).
Smarter front-end clients use WebSockets to solve such problems.
In case the DB query is the culprit, cache the data on the server side, if that is possible for you.
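One possible shape for the polled status endpoint, written here as a plain servlet for illustration (the question itself uses Lambda, and newer containers use jakarta.servlet instead of javax.servlet); ExportJob and ExportJobs are hypothetical names for whatever job registry you keep:

    import java.io.IOException;
    import javax.servlet.annotation.WebServlet;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Hypothetical status endpoint the browser polls, e.g. GET /export/status?id=...
    @WebServlet("/export/status")
    public class ExportStatusServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
            ExportJob job = ExportJobs.lookup(req.getParameter("id")); // hypothetical job registry
            resp.setContentType("application/json");
            resp.getWriter().write(job.isDone()
                    ? "{\"status\":\"ready\",\"url\":\"" + job.downloadUrl() + "\"}"
                    : "{\"status\":\"in-progress\"}");
        }
    }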
When your script takes more than 30 s to run on your server, implement queues; you can get help from this tutorial on how to implement queues using SQS or any other service:
https://mikecroft.io/2018/04/09/use-aws-lambda-to-send-to-sqs.html
Once you implement queues, your timeout issue is solved, because you are now fetching the big data records in a background job on your server.
Once the Excel file is ready in the background, save it to your S3 bucket (or to disk on your server) and create a downloadable link for your user.
Once the download link is created, send it to your user via email. In this case, you need to have the user's email address.
So the summary is: apply a queue -> send a mail with the downloadable file (a minimal enqueue sketch follows).
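A minimal enqueue sketch with the AWS SDK for Java v2; the queue URL and message fields are assumptions, and the worker consuming the queue would generate the file and email the link:

    import software.amazon.awssdk.services.sqs.SqsClient;
    import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

    public class ExportEnqueuer {
        // queueUrl is an assumption; the worker behind the queue generates the file
        static void enqueueExportJob(String queueUrl, String userEmail) {
            try (SqsClient sqs = SqsClient.create()) {
                sqs.sendMessage(SendMessageRequest.builder()
                        .queueUrl(queueUrl)
                        .messageBody("{\"email\":\"" + userEmail + "\"}") // worker emails the link here
                        .build());
            }
        }
    }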
Instead of some sophisticated solution (though that would be interesting), consider splitting the work:
Inventory: split the Excel into portions of, say, 10k rows each (sketch below), and calculate the number of documents needed.
Every Excel-generation call then has a reduced workload.
Whether you deliver via e-mail, a page with links, or a queue is up to you.
The advantage is staying below e-mail limits and response timeouts, and avoiding denial of service.
(In Excel one could also create a master document linking the parts, but I have no experience with that.)
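A hedged sketch of that chunked generation using Apache POI's streaming SXSSF API; the record shape, chunk size, and file names are illustrative:

    import java.io.FileOutputStream;
    import java.util.List;
    import org.apache.poi.ss.usermodel.Row;
    import org.apache.poi.ss.usermodel.Sheet;
    import org.apache.poi.xssf.streaming.SXSSFWorkbook;

    public class ChunkedExporter {
        static void exportInChunks(List<String[]> records) throws Exception {
            final int CHUNK = 10_000; // portion size from the answer above
            for (int from = 0, part = 0; from < records.size(); from += CHUNK, part++) {
                int to = Math.min(from + CHUNK, records.size());
                try (SXSSFWorkbook wb = new SXSSFWorkbook(100)) { // streaming: ~100 rows in memory
                    Sheet sheet = wb.createSheet("data");
                    for (int i = from; i < to; i++) {
                        Row row = sheet.createRow(i - from);
                        String[] rec = records.get(i);
                        for (int c = 0; c < rec.length; c++) {
                            row.createCell(c).setCellValue(rec[c]);
                        }
                    }
                    try (FileOutputStream out = new FileOutputStream("export-part-" + part + ".xlsx")) {
                        wb.write(out);
                    }
                    wb.dispose(); // delete POI's temporary backing files
                }
            }
        }
    }

The streaming workbook keeps memory flat regardless of row count, which matters as much as the chunking itself in a constrained Lambda environment.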
I have a huge text file that is continuously being appended to from a common place. I need to read it line by line from my Java application and write updates to a SQL RDBMS, such that if the Java application crashes, it starts from where it left off and not from the beginning.
It's a plain text file. Each row contains:
<Datatimestamp> <service name> <paymentType> <success/failure> <session ID>
Also, the data retrieved from the database should be available in real time, without any performance or availability issues in the web application.
Here is my approach:
Deploy the application on two boxes, each containing a heartbeat that pings the other system for service availability.
A successful heartbeat response also carries the timestamp of the last successfully read line.
When the next heartbeat response fails, the application on the other system can take over, based on:
1. The failed response
2. The last successful timestamp
Also, since the need for data retrieval is very real-time and the data is huge, can I index the database into Solr or Elasticsearch for faster retrieval, instead of making database calls?
There are various ways to do this; what is the best way?
I would put a messaging system between the text file and the DB-writing applications (for example RabbitMQ). In this case, the messaging system functions as a queue: one application constantly reads the file and publishes the rows as messages to the broker; on the other side, multiple "DB-writing applications" read from the queue and write to the DB.
The advantage of the messaging system is its support for multiple clients reading from the queue. The messaging system takes care of synchronizing the clients, dealing with errors, dead letters, etc. The clients don't need to care which payloads were processed by other instances.
Regarding maintaining multiple instances of the "DB-writing applications": I would go for ready-made cluster solutions, perhaps a Docker cluster managed by Kubernetes?
Another viable alternative is a streaming platform, like Apache Kafka (a publisher sketch follows).
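A minimal publisher sketch with the RabbitMQ Java client; the broker host, queue name, and log path are assumptions, and a production reader would additionally tail the growing file and checkpoint its offset:

    import java.io.BufferedReader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class LogLinePublisher {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost"); // assumed broker host
            try (Connection conn = factory.newConnection();
                 Channel channel = conn.createChannel();
                 BufferedReader reader = Files.newBufferedReader(Paths.get("/var/log/payments.log"))) {
                channel.queueDeclare("log-lines", true, false, false, null); // durable queue
                String line;
                while ((line = reader.readLine()) != null) {
                    // one message per log row; consumers insert each into the DB
                    channel.basicPublish("", "log-lines", null, line.getBytes(StandardCharsets.UTF_8));
                }
            }
        }
    }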
You can use software like Filebeat to read the file and direct its output to RabbitMQ or Kafka. From there, a Java program can subscribe to / consume the data and put it into an RDBMS.
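On the consuming side, a hedged sketch of a Kafka consumer writing rows to the RDBMS; the broker address, topic, table, and connection string are assumptions, and manual offset commits give at-least-once delivery:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class DbWriter {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker
            props.put("group.id", "db-writers");              // consumer group => multiple writers
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("enable.auto.commit", "false");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
                 Connection db = DriverManager.getConnection("jdbc:postgresql://db-host/payments")) {
                consumer.subscribe(List.of("payment-log-lines")); // assumed topic fed by Filebeat
                PreparedStatement insert = db.prepareStatement(
                    "INSERT INTO payment_events (ts, service, payment_type, status, session_id) VALUES (?,?,?,?,?)");
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> r : records) {
                        // assumes well-formed rows: <timestamp> <service> <paymentType> <status> <sessionId>
                        String[] f = r.value().split(" ", 5);
                        for (int i = 0; i < 5; i++) insert.setString(i + 1, f[i]);
                        insert.executeUpdate();
                    }
                    consumer.commitSync(); // commit offsets only after rows are written
                }
            }
        }
    }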
I have a servlet that accepts large (up to 4GB) binary file uploads. The submitted file is transmitted as the body of an HTTP POST.
The servlet has to perform some time-consuming processing as it receives the file, and it has to finish doing that before sending the response. As a result, it can appear to a fast client that the server has hung, because the client can be waiting for a minute or two after sending the last few bytes before getting the response.
Is there a way either within Tomcat or within the servlet API to throttle back the speed at which the server accepts the file? I would like it to appear to the client that the server is accepting the file at (for example) 10MB/second rather than it accepting the file at 50MB/second and then taking a few minutes after receiving the body to return a response.
Thanks.
I'm expanding on the comment of Mark Thomas here because I feel this is worth being an answer (or the answer), rather than a comment. Mark, let me know if you want to convert the comment yourself and I'll happily delete mine.
John, you're trying to solve your problem in a way that imposes severe limitations: What's the bandwidth that you want to throttle to? What happens when the server is upgraded to a beefier CPU and can process more quickly? What if multiple uploads happen at the same time?
You probably want a 4 GB upload to complete in as short a time as possible - imagine the connection going down in the middle; in a web application this typically means you'll have to restart the upload from the beginning. Thus you should decouple your processing from the upload procedure as much as possible.
You also don't mention the file format that gets uploaded: if it happens to be a zip file, note that the server can't do anything with the file until it's fully transmitted, as zip files have their directory of contents at the end. (This might be old knowledge, but at least the old spec had it this way. Someone correct me if this changed.)
So, the proper way: accept the file for processing and signal that you received it and are processing it. If you like, implement Ajax updates for when you're done; in the simplest case, a "click here to see if processing finished" link, or frequently reloading the page. Anything works, and everything is better than throttling throughput at this layer.
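A rough sketch of that decoupling in servlet terms (javax.servlet APIs assumed, as in older Tomcat versions; the job id handling and process method are placeholders):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import java.util.UUID;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import javax.servlet.ServletException;
    import javax.servlet.annotation.MultipartConfig;
    import javax.servlet.annotation.WebServlet;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import javax.servlet.http.Part;

    @WebServlet(urlPatterns = "/upload")
    @MultipartConfig
    public class UploadServlet extends HttpServlet {
        private static final ExecutorService WORKERS = Executors.newFixedThreadPool(2);

        @Override
        protected void doPost(HttpServletRequest req, HttpServletResponse resp)
                throws IOException, ServletException {
            Part file = req.getPart("file");
            Path stored = Files.createTempFile("upload-", ".bin");
            Files.copy(file.getInputStream(), stored, StandardCopyOption.REPLACE_EXISTING);

            String jobId = UUID.randomUUID().toString();
            WORKERS.submit(() -> process(stored, jobId)); // heavy work off the request thread

            resp.setStatus(HttpServletResponse.SC_ACCEPTED);         // 202: received, processing
            resp.getWriter().write("{\"jobId\":\"" + jobId + "\"}"); // client polls with this id
        }

        private void process(Path file, String jobId) { /* time-consuming work here */ }
    }

The upload completes at full speed, the client gets an immediate acknowledgement, and the slow processing no longer holds the HTTP connection open.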
The theme of my project is to implement a distributed server that provides several files for several clients to download. The server hosts several files, and we want the server to implement good algorithms to let clients download data from it quickly.
My idea for implementing the project:
Just as a client generally downloads a file using a download manager, there should similarly exist server-side managers/code/algorithms that upload/seed the file quickly to let the client download it. There must not be any action required of the client except selecting the file to be downloaded!
How should I write the code for such a server on the back end, analogous to the multi-threaded download managers clients use on the front end?
How should the server seed/make the file available to the client if the client only sends the path as a String to the server in Java?
Or, if I am missing something or my idea is totally wrong, please enlighten me with an alternative process/algorithm that I should implement on the server side. Please remember that the whole purpose of this question is the back-end server seeding algorithm, or equivalent algorithms/methods.
I assume this server of yours has a good internet connection with broad upstream. If that is the case, then the limiting factor when only a few clients are downloading a few files is the bandwidth of those clients. You will at most get as fast as your clients' downstream bandwidth, so simply taking an off-the-shelf HTTP server library to serve the downloads should be sufficient.
Where your backend implementation really matters, and is able to improve download performance, is when many users are connecting to your server and downloading many files. First off, there are the following points to consider:
TCP has a startup time (slow start): when you first open a connection, the download rate slowly increases until it hits the maximum. To minimize this time, when downloading multiple files, the connection opened for one file download should be reused for the next file.
Downloading many files at once (on the client side) is not reasonable when bandwidth is the limiting factor, because the client has to start up many TCP connections, and the data will either be fragmented when written to disk, or (when allocating beforehand) the disk will be pretty busy jumping between sectors.
Your server should generally use a non-blocking I/O library (e.g. java.nio) and refrain from creating a thread per incoming connection, since this leads to thrashing, which again decreases your server's performance drastically.
If you have a really large number of clients simultaneously downloading from your server, the limit you hit will probably be either:
The upstream limit of your provider
The read speed of your hard drive (SSDs reach ~500 MB/s, as far as I'm informed)
Your server can try to hold the most commonly requested files in memory and serve the content from there (DDR3 RAM reaches speeds of 17 GB/s). I doubt you have few enough files on your server that you could cache them all in your server's RAM.
So the main engineering task lies in the clever selection of which content should be cached and which not. This could be done on a priority basis, by assigning higher priorities to certain files, or by a metric that encodes the probability of a single file being downloaded in the next few minutes, or simply by caching the files being downloaded by the most clients at this point in time (a small cache sketch follows).
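A minimal in-memory cache sketch showing only the mechanism: keep the most recently requested files and evict the least recently used. A real server would budget by bytes and layer on the popularity metrics described above; the cap here is illustrative:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class FileCache extends LinkedHashMap<String, byte[]> {
        private static final int MAX_FILES = 256; // illustrative cap; a real cache budgets bytes

        public FileCache() {
            super(16, 0.75f, true); // accessOrder=true: iteration order becomes LRU order
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
            return size() > MAX_FILES; // evict the least recently used file
        }
    }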
With such considerations you are able to push the limits of your download server up to a certain point, from which the only further improvement is achieved by distributing or replicating your files onto many servers.
If you are going in a direction where serving millions of clients simultaneously must be possible, you should consider buying such a service from CDNs. They are specialized in fast delivery and have many upstream servers in most ASes, so every client can download files from a regional CDN server.
I know I haven't given any algorithm or code examples, but I didn't intend to answer this question completely. I just wanted to give you some important guidelines and thoughts on the topic. I hope you can at least use some of these thoughts for your project.
We have implemented a web application as a scheduler which sends email campaigns for the configured mailing lists. It processes contacts one by one. How can I recover the crash point so that my campaign process restarts from where it stopped?
Example: I have configured 100 email IDs in the mailing list.
After processing 50 email IDs, the server shuts down or a crash occurs.
When I restart the server, it starts again from the 1st email ID instead of the 51st.
We have tried some solutions based on our application logic, but they created performance issues. Is there any common solution that can be handled at the server level?
Can you please suggest a solution?
For example, you can save the number to a file, update it after each processed mail, and read it on start-up. But what you really need to ask yourself is why the server is crashing.
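A minimal sketch of that checkpoint idea; the checkpoint file name and the mail-sending helper are illustrative:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;

    public class CampaignRunner {
        // Resumes from a checkpoint file if one exists from a previous crash.
        static void runCampaign(List<String> emailIds) throws IOException {
            Path checkpoint = Paths.get("campaign.checkpoint");
            int start = Files.exists(checkpoint)
                    ? Integer.parseInt(Files.readString(checkpoint).trim())
                    : 0;
            for (int i = start; i < emailIds.size(); i++) {
                sendCampaignMail(emailIds.get(i)); // assumed mail-sending helper
                Files.writeString(checkpoint, Integer.toString(i + 1)); // a crash resumes at i + 1
            }
            Files.deleteIfExists(checkpoint); // campaign finished cleanly
        }

        private static void sendCampaignMail(String emailId) { /* send via your mailer */ }
    }

Writing the checkpoint after every send trades a little I/O for crash safety; persisting it every N mails is the usual compromise if that becomes a bottleneck.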
Try saving the email IDs in a text file and reading them one by one; to my knowledge that's the best approach. Otherwise, store them in an XML file and read that. An XML parser is not heavyweight, so your server should not hang.