Understanding of back end file seeding to provide fast client downloads

Understanding of back end file seeding to provide fast client downloads - java

The theme of my project is to implement a distributed server which provides several clients several files to download. The server is hosting several files and we want that the server should implement some best algorithms to quickly let the clients download data from it.
My idea of implementation of project:
Like the client generally downloads the file using some download managers, similarly there must exist some server side managers/codes/algorithms which upload/seed the file quickly to let client download the file. There must not be any action of client except the selection of the file to be downloaded!
How should I write the code for such a server on the back end, analogous to multi-threading based downloaded managers for clients on the front-end?
How should server seed/make avail the file to the client if the client only sends the path as a String to the server in Java for downloading?
Or, if I am missing something/my idea is totally wrong, please enlighten me with an alternative process/algorithm which I must implement on the server side. Please remember that the whole purpose of asking this question is the back end server seeding algorithm OR equivalent algorithms/methods.

I assume, this server of yours has a good internet connection with a broad upstream. If that is the case then the limiting factor when only few clients are downloading few files is the bandwith of these clients. So you will at most get as fast as the downstream bandwith of your clients. So simply taking an off-the-shelf HTTP server library to serve the downloads should be sufficient.
Where your backend implementation really matters and is able to improve download performance is then many users are connecting to your server and downloading many files. First off there are following points to consider:
TCP has a startup-time. When you first open an connection, the download rate slowly starts to increase until it hits the maximum. To minimize this time, when downloading multiple files the connection opened for one file download should be reused for the next file.
Downloading many files at once(on clientside) is not reasonable when bandwidth is the limiting factor, because the client has to start up many TCP connections and the data will be either fragmented, when written to Disk, or (when allocating beforehand) the disk will be pretty busy while jumping between sectors.
Your server should generally use a non-blocking IO library (eg. java.nio) and refrain from creating a thread per incomming connection since this leads to thrashing which again decreases your server's performance drastically.
If you have a really big amount of clients simultaneously downloading from your server, the limit you will probably hit will be either:
The upstream limit of your provider
The read speed of your Harddrive (SSD have ~ 500MB/s as far as I'm informed)
Your server can try to hold the most commonly requested files in his memory and serve the content from there (DDR3 RAM reaches speeds of 17GB/s). I doubt that you have only as few files on your server that you could cache them all in your server's RAM.
So the main engineering task lays in the clever selection of which content should be cached and which not. This could be done on a priority base by assigning higher priorities to certain files or by a metric which encodes the probability of a single file to be downloaded in the next few minutes. Or simply the files which are downloaded by the most clients at this point of time.
With such considerations you are able to push the limits of your download server until a certain point from which the only improvement can be achieved by distributing or replicating your files onto many servers.
If you are going into such a direction where serving millions of clients simultaneously must be possible, you should consider buying such a service from CDNs. They are specialized in fast delivery and have many upstream server in most ASes so that every client can download his files from the regional CDN server.
I know, I haven't given any algorithm or code examples, but I didn't intend to answer this question completely. I just wnated to give you some important guidelines and thoughts to that topic. I hope, you can at least use some of these thoughts for your project.

Related

Large File Transfer between Java Servers over HTTPS

I have a central server, to which many distributed servers need to transmit data in the form of somewhat large files, 500MB - 10GB+. The servers are not on the same physical network and can't be connected to one another via a VPN. While we're trying to get other ports opened, currently we can only talk over 443, HTTPS, which is great for our REST services but terrible for file transfer between servers.
I know this isn't as specific a question as one would like for Stackoverflow, but I would like to know: what methods might work better than the ones I've tried?
Server A -> generate file -> transfer over https -> DMZ -> proxy pass -> receive at Server B
Both servers use Java 1.8, Tomcat, and Spring 4.1.4.RELEASE. The DMZ is just Apache and pretty much out of our control.
Things I've tried...
Make RPC calls to a service using Spring's HttpInvokerProxyFactoryBean (this works fine for smaller sites, but the larger sites often drop connections while transferring data)
Multipart form post using Apache HttpPost (this also works, but we have to configure file limits in apache/tomcat, plus its connection is unreliable as well)
Using a library called RMIIO, which basically simulates RMI over HTTP if configured properly. This seemed promising as it requests a stream from the server and writes to the stream from the remote server. I haven't really gotten this to work over HTTPS yet, and the library was written in 2007 (with some updates up through July 2016), but it feels very dated, not highly maintained and I suspect there are better ways to do this sort of thing now-a-days (not that I can find them)
Looked at gRPC but realized it's just a binary protocol and I'd have to basically handle chunking the files if I wanted to get a streaming effect.
Read an article about Developing non-blocking REST services with Spring MVC, http://callistaenterprise.se/blogg/teknik/2014/04/22/c10k-developing-non-blocking-rest-services-with-spring-mvc/ again looked interesting if we were receiving a lot of files at the same time, but I don't see how it helps with a single file transfer.
I've looked at a lot of other things and tried a few more, but it all seems wrong. When I read about big data and Spark streams or any of the million streaming options that I see, I feel like there should be something similar for transferring a single file from one server to another without a broker in the middle. Maybe there are, just not over HTTPS.
It would be nice to know the progress of the transfer (on both ends) and be able to recover should there be connectivity issues or transfer errors.
But any direction or thoughts would be immensely helpful. Thanks for your time and input.

Best approach for large file transfer to multiple clients using java sockets

I have researched a lot but could not find any thing proper on the topic and hence asking question here.
I want to build an application like dropbox using java sockets, with a social media website where the files we upload on our shared folder gets automatically downloaded or synced onto all of our friends system which we added in the social media website.
Till now what I have thought is I will have a server running and every time a client connects(logs in) I will start 2 handlers, 1 will be for uploads and 1 will be for download. The DOWNLOAD handler will check for a new files from all my friends(meaning they added new files in their shared directory) every 5 minutes and will sync it. and UPLOAD handler will upload the files on the server sending it as a byte array when the handler receives it from the client. Client sends the data to the server using directory watcher to track changes in the directory.
Now the question is starting 2 threads per client, is it feasible? I think it will slow down the server badly as I will imagine to have like 100 clients let say and it means 200 threads. Can you guys just point me in the right direction on as to what approach I should take, I read about NIO and IO and got confused. Also is there any particular library which can be helpful? I looked at Netty, apache mina but don't understand how they can be helpful.
Thanks in advance :)

I recommend you to take a look to this article about NIO: https://today.java.net/pub/a/today/2007/02/13/architecture-of-highly-scalable-nio-server.html. Also try to think about scalability, if your server send files... what's the speed of your hard drives? It's more important, I think, than number of threads; but keep an eye over thread lock.
Why you want to implement something that web makes so well? If I were you I try to think about a secure proxy better than all that stuff about dealing with bytes. Even If you want to transfer files in multiple parts, you may use multipart zip file and download each part programatically, and then rebuild the file. With this approach you can reuse your infrastructure for web and client; also you can benefit of high IO throughtput of modern web servers.

When you think you will have a large number of clients, using the standard Socket and ServerSocket isn't going to work. These require 2 threads per client, as you already pointed out. Eventually, this will eat up all your server resources. What you need is the java.nio package. In there you will find the SocketChannel and the ServerSocketChannel. Through these you can set up non-blocking socket communication. This type of communication is event based. Meaning you can have multiple clients using the same 2 threads on the server for reading and writing.
If you want to know more, check out my Socket programming tutorial. The third part goes into setting up a non-blocking variant and should give you everything you need to get started. If you still have questions, let me know, and I'll see if I can help you out further.

Write data fast to a remote database

I have an app which will generate 5 - 10 new database records in one host each second.
The records don't need any checks. They just have to be recorded in a remote database.
I'm using Java for the client app.
The database is behind a server.
The sending data can't make the app wait. So probably sending each single record to the remote server, at least synchronously, it's not good.
Sending data must not fail. My app doesn't need an answer from the server, but it has to be 100% secure that it arrives at the server correctly (which should be guaranteed using for example http url connection (TCP) ...?).
I thought about few approaches for this:
Run the send data code in separate thread.
Store the data only in memory and send to database after certain count.
Store the data in a local database and send / pulled by the server by request.
All of this makes sense, but I'm a noob on this, and maybe there's some standard approach which I'm missing and makes things easier. Not sure about way to go.

Your requirements aren't very clear. My best answer is to go through your question, and try to point you in the right direction on a point-by-point basis.
"The records don't need any checks," and "My app doesn't need an answer, but it has to be 100% secure that it arrives at the server correctly."
How exactly are you planning on the client knowing that the data was received without sending a response? You should always plan to write exception handling into your app, and deal with a situation where the client's connection, or the data it sends, is dropped for some reason. These two statements you've made seem to be in conflict with one another; you don't need a response, but you need to know that the data arrives? Is your app going to use a crystal ball to devine confirmation of the data being received (if so, please send me such a crystal ball - I'd like to use it to short the stock market).
"Run the send data code in a separate thread," and "store the data in memory and send later," and "store the data locally and have it pulled by the server", and "sending data can't make my app wait".
Ok, so it sounds like you want non-blocking I/O. But the reality is, even with non-blocking I/O it still takes some amount of time to actually send the data. My question is, why are you asking for non-blocking and/or fast I/O? If data transfers were simply extremely fast, would it really matter if it wasn't also non-blocking? This is a design decision on your part, but it's not clear from your question why you need this, so I'm just throwing it out there.
As far as putting the data in memory and sending it later, that's not really non-blocking, or multi-tasking; that's just putting off the work until some future time. I consider that software procrastination. This method doesn't reduce the amount of time or work your app needs to do in order to process that data, it just puts it off to some future date. This doesn't gain you anything unless there's some benefit to "batching" data sending into large chunks.
The in-memory idea also sounds like a temporary buffer. Many of the I/O stream implementations are going to have a buffer built in, as well as the buffer on your network card, as well as the buffer on your router, etc., etc. Adding another buffer in your code doesn't seem to make any sense on the surface, unless you can justify why you think this will help. That is, what actual, experienced problem are you trying to solve by introducing a buffer? Also, depending on how you're sending this data (i.e. which network I/O classes you choose) you might get non-blocking I/O included as part of the class implementation.
Next, as for sending the data on a separate thread, that's fine if you need non-blocking I/O, but (1) you need to justify why that's a good idea in terms of the design of your software before you go down that route, because it adds complication to your app, so unless it solves a specific, real problem (i.e. you have a UI in your app that shouldn't get frozen/unresponsive due to pending I/O operations), then it's just added complication and you won't get any added performance out of it. (2) There's a common temptation to use threads to, again, basically procrastinate work. Putting the work off onto another thread doesn't reduce the total amount of work needing to be done, or the total amount of I/O your app will consume in order to accomplish its function - it just puts it off on another thread. There are times when this is highly beneficial, and maybe it's the right decision for your app, but from your description I see a lot of requested features, but not the justification (or explanation of the problem you're trying to solve) that backup these feature/design choices, which is what should ultimately drive the direction you choose to go.
Finally, as far as having the server "pull" it instead of it being pushed to the server, well, all you're doing here is flipping the roles, and making the server act as a client, and the client the server. Realize that "client" and "server" are relative terms, and the server is the thing that's providing the service. Simply flipping the roles around doesn't really change anything - it just flips the client/server roles from one part of the software to the other. The labels themselves are just that - labels - a convenient way to know which piece is providing the service, and which piece is consuming the service (the client).
"I have an app which will generate 5 - 10 new database records in one host each second."
This shouldn't be a problem. Any decent DB server will treat this sort of work as extremely low load. The bigger concern in terms of speed/responsiveness from the server will be things like network latency (assuming you're transferring this data over a network) and other factors regarding your I/O choices that will affect whether or not you can write 5-10 records per second - that is, your overall throughput.

The canonical, if unfortunately enterprisey, answer to this is to use a durable message queue. Your app would send messages to the queue, and a backend app would receiver and store them in a database. Once the queue has accepted a message, it guarantees that it will be made available to the receiver, even if the sender, receiver, or the queue broker itself crash.
On my machine, using HornetQ, it takes ~1 ms to construct and send a short text message to a durable queue. That's quick enough that you can do it as part of handling a web request without adding any noticeable additional delay. Any good message queue will support your 10 messages per second throughput. HornetQ has been benchmarked as handling 8.2 million messages per second.
I should add that message queues are not that hard to set up and use. I downloaded HornetQ, and had it up and running in a few minutes. The code needed to create a queue (using the native HornetQ API) and send and receive messages (using the JMS API) is less than a hundred lines.

If you queue the data and send it in a thread, it should be fine if your rate is 5-10 per second and there's only one client. If you have multiple clients, to the point where your database inserts begin to get slow, you could have a problem; given your requirement of "sending data must not fail." Which is a much more difficult requirement, especially in the face of machine or network failure.
Consider the following scenario. You have more clients than your database can handle efficiently, and one of your users is a fast typist. Inserts begin to back up in-memory in their app. They finish their work and shut it down before the last ones are actually uploaded to the database. Or, the machine crashes before the data is sent - or while its sending; or worse yet, the database crashes while its sending, and due to network issues the client can't really tell that its transaction has not completed.
The easy way avoid these problems (most of them anyway), is to make the user wait until the data is committed somewhere before allowing them to continue. If you can make the database inserts fast enough then you can stick with a simpler scheme. If not, then you have to be more creative.
For example, you could locally write the data to disk when the user hits submit, and then upload it from another thread. This scenario needs to be smart enough to mark something that is persisted as sent (deleting it would work); and have the ability to re-scan at startup and look for unsent work to send. It also needs the ability to keep trying in the case of network or centralized server failure.
There also needs to be a way for the server side to detect duplicates. Because the client machine could send the data and crash before it can mark it as sent; and then upon restart it would send it again. The same situation could occur if there is a bad network connection. The client could send it and never receive confirmation from the server; time out and then end up retrying it.

If you don't want the client app to block, then yes, you need to send the data from a different thread.
Once you've done that, then the only thing that matters is whether you're able to send records to the database at least as fast as you're generating them. I'd start off by getting it working sending them one-by-one, then if that isn't sufficient, put them into an in-memory queue and update in batches. It's hard to say more, since you don't give us any idea what is determining the rate at which records are generated.
You don't say how you're writing to the database... JDBC? ORM like Hibernate? But the principles are the same.

Mechanism to maintain service when internet connection is lost with application server

We currently have a centralised web app and database (running on glassfish and oracle) which is accessed from multiple stations distributed about the country.
At the stations there is data entered into and read from the system (through the browser).
When the (external) connection goes down between the station and the centralized web app we would like for the stations to continue to function - store and present data, then when the connection returns the data is pushed back into the central server maintaining database integrity.
Given that we would be willing to change our app server or database if it was worth it, how is this best handled, is there any out of the box solution for this?

Install the servers at the individual locations, replicate what you want to share across them "routinely", and leave all of the other centralized, but non-vital tasks (like, say, reporting) on the central system.
There is not "out of the box" solution. You system is centralized for whatever reason it's centralized. You're asking for it to be decentralized. By doing so you need to reconsider why it's centralized in the first place, and what dependencies there are because of that centralization (such as each site having instant access to data at all of the other sites).
Address those issues of what you can do without, for how long, and how to share it, and then you can set up autonomous sites. The magnitude and complexity of this process is dependent upon you application and the services it supplies to the remote users.

If you can tolerate losing the current sessions I would point you to look for a distributed database (replication). Oracle probably supports it. In each office you would have a glassfish server
But it is going to cost a lot:
Licences
Hardware (servers)
Properly securing the server
(Lots of) tuning/rewritting to avoid new bottlenecks
Maybe it would be easier / cheaper if you chose to just use redundant internet access for all of your offices.

If you are willing to go cutting edge, then look into HTML 5 with Local Storage. Note that the local storage specification in HTML 5 is still in transition. The second link I included has a good fallback option for when HTML 5 local storage is unavailable. With the fallback option of Store.js, you won't even need to require your clients to use a modern browser, though it definitely helps.
Another option, if you are open to moving in that direction, is to use Adobe Flex 3 for your UI, talking through LiveCycle to your application hosted on Glassfish. There will be more moving parts and a steeper learning curve though.

Availability Issue

Architecture:
A bunch of clients send out messages to a server which is behind a VIP. Obviously this server poses an availability risk.
The client monitors a resource and the server is responsible to take action based on the what status the majority of the clients report to it and hence the need for only 1 server/leader.
I am thinking of adding another server as a backup on the VIP, which gets turned on only when the first server fails. However when the backup comes up it would have no information to process and would lose time waiting for clients to report and waiting for the required thresholds etc.
Problem:
What is the best and easiest way to have two servers share client state information with only one receiving client traffic?
Solution1:
I thought of have having the server forward client state information to backup server and in the event of a failure when the backup server comes up, it can take it from there.
Is there any other way to do this? I thought of having a common/shared place to store state information where both servers can read client state information from. But this doesn't work well as the shared space is a single point of failure too.

One option is to use a write-ahead log. Essentially, any modification you make to your state gets sent over to the backup server, which replays the change on its own copy of the state. As long as it can keep up with the streaming log, the backup is always up-to-date.
This is the approach generally used by most databases; if you use one as your backend, you may be able to get support for this with little work.
Be careful to have a plan to recover from communication failure - either save the log to disk and resend the missing portion, or send a snapshot of the state, plus all log entries since the snapshot on reconnect.

There are various distributed caching products which do the kind of thing you're talking about here. Some are supplied with App Servers, such as WebSphere's dynacache and Object Grid. In fact ObjectGrid can be used in JSE, no need for an App Server.
Those distributed cache products use various push and pull models with pub-sub messaging to achieve consistency across the instances. Working for IBM I'm a fan of ObjectGrid, but more impartant, I'm fan of not reinventing wheels. My take is that this stuff can get quite complex and hence finding something off-the shelf might save a load of work - there are links to various Open Source solutions here.

The is very much dependent on how available your solution needs to be (how many 9's). There is a spectrum of solution.
A lightweight one could be crafted around Memcache: extremely fast distributed state facility. As example, it is used extensively on Google AppEngine.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.