What are the recommended steps for creating scalable web/enterprise applications in Java and .NET? I'm more interested in what it takes to go from a low volume app to a high volume one (assuming no major faults in the original architecture). For example, what are the options in each platform for clustering, load-balancing, session management, caching, and so on.
Unfortunately, a lot of aspects of your question are going to be context dependent. Scalability can mean different things depending on your requirements and the application. The questions that need answering as you come up with your architecture (regardless of the technology you are using) include:
For example, do you want to support users working with very large amounts of data, or do you simply need to support a large number of users each working with relatively modest amounts of data?
Does your application do more reading than writing? Since reads are cheap and writes are expensive, a read-heavy application can be simpler to scale than a write-heavy one.
Does the data in your application have to be consistent at all times, or will eventual consistency suffice? (Think posting a message to a social networking site versus withdrawing money from a bank account.)
How available does your application need to be? Strict high-availability requirements will probably require smooth fail-over to other servers when one server crashes.
There are many other questions like this that need to be answered about your specific application before a proper discussion of your application architecture can take place.
However, so I don't just leave you with questions and not a single answer, here is what I think is the single most important thing to take into account when designing a web application for scalability: reduce (down to zero if possible) the amount of shared session state in your application (global application counters, cached primary key blocks and the like). Replicating shared state across a cluster has a very high cost when you are trying to scale up.
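To illustrate the idea, here is a minimal sketch contrasting in-JVM shared state with state pushed out to a shared store (plain JDBC; the table name and schema are assumptions for illustration, not anything prescribed):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class SharedCounter {
    // In-JVM shared state like this must be replicated to every node in a cluster:
    // static final java.util.concurrent.atomic.AtomicLong pageViews = new AtomicLong();

    // Externalizing the state means every node talks to the same store,
    // so there is nothing to replicate between cluster members.
    public static void increment(String jdbcUrl, String counterName) throws Exception {
        try (Connection con = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = con.prepareStatement(
                     "UPDATE app_counters SET value = value + 1 WHERE name = ?")) {
            ps.setString(1, counterName);
            ps.executeUpdate();
        }
    }
}
```

The same principle applies whether the shared store is a database, a distributed cache, or a memcached-style service: the point is that the state lives outside the individual application server.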
There are so many options that I'm not sure anyone has considered them all in full. Usually, the choice of language comes first and then you decide how to achieve the results you want.
My guess is you have good options in just about any major general purpose language.
This would have been an issue 10 years ago, but not today IMHO.
A good comparison of available open source technologies in Java is http://java-sources.org/. As you drill down into each category, you can see the list is fairly extensive, and it doesn't even consider commercial products or C#.
For any scalable system, you need to consider hardware solutions as well, such as network components.
My suggestion for turning a low volume app into a high volume one is to start with something fairly high volume in the first place and expect to re-engineer it as you go.
What sort of volumes are you talking about? Everyone has different expectations of high and low volume. E.g. when Google have lower than average volumes they switch off whole data centres to save power. ;)
Just came across this explanation of how MySpace scaled up; really interesting and exactly the kind of information I'm looking for. However, there is no discussion of the technologies used to create the partitioning or caching and so on...
Short answer - it depends. What (if any) business goals are involved? What is the project budget and timeframe? Which industry (is it regulated)?
Hello, I am going to develop an EHR (Electronic Health Record) system. I am new to this field and want to discuss and get suggestions about what tools and technologies I should use for this purpose.
Here is my research about EHRs and the available tools:
1) First, Java. As an EHR is a web-based system, J2EE would be the solution at the core level, and since I am going to globalize my system I need a standard protocol for it; the most useful and appreciated one is HL7 CDA 2.0. One thing I like about Java is that it provides Java CAPS with a full implementation of the HL7 protocol, which makes my work a bit easier. The second thing about Java is that it is very efficient for data-centred applications, as mine is. The problem is with the scalability of the system, which is expensive and time-consuming, and Java is a bit slower on the client side, which can affect downtime; downtime should be very close to zero for my system. Finally, I need an attractive user interface, and the most wanted things are privacy and security.
2) The other option is PHP. For doing everything described above it is less expensive and less time-consuming to scale, and it may help achieve a good interface and a faster client side, but there is a question mark over it for a data-centric environment and for security.
3) The last one is Microsoft's ASP.NET. There is no doubt about security and privacy, but it is very expensive to develop and maintain and has no platform independence. And what about speed, i.e. response and down times?
I have discussed the possibilities to the best of my knowledge. I hope you can advise me which one will be the best to attain privacy, security, speed and scalability at the best cost.
Thanks in advance.
As an EHR implementor since 1983, I'd suggest you look for a language that provides a user interface that allows multiple inputs including keyboard, mouse, touch, voice and stylus, and potentially runs on multiple devices including phones and PCs. The server side, if written correctly, should not present scalability issues.
As for HL7, you're going to use that server side anyway so I don't think it's relevant to how you write your clients.
I'll start by answering your actual question. Any of those 3 languages could in principle be used for a large scale data-intensive application such as what you are considering. You may find it marginally more expensive in PHP, but it will make little difference at the scale you are describing.
If I were developing an EHR system I would start by looking at the legal aspects of it, then build the actual requirements, which I imagine would be a massive undertaking. Finally, I would just build it using whatever technology gave me access to a wide pool of skilled talent. The language will only really affect what talent you can hire in on short notice.
I hope I'm not being presumptuous here, but it would appear, just from your question, that you have little experience with programming, the design of large systems, healthcare, the applicable legal framework or running an ISV. Do you have a compelling reason for entering this market?
It seems the hype about cloud computing cannot be avoided, but the actual transition to that new platform is subject to many discussions...
From a theoretical viewpoint, the following can be said:
Cloud:
architectural change (you might not install anything you want)
learning curve (because of the above)
no failover (since failure is taken care of)
granular cost (pay per GHz or GB)
instantaneous scalability (not so instantaneous, but at least transparent?)
lower latency
Managed:
failover (depends on provider)
manual scalability (requires maintenance)
static cost (you pay for the package, whether you use it fully or not)
lower cost (for entry-level packages only)
data ownership (you own your data)
liberty (you have it)
lower latency (depends on provider)
Whether or not the above is correct, a logical position is "it depends..." on the application itself.
Now comes the hidden question: how would you profile your J2EE app to determine whether it is a candidate for the cloud, knowing that it:
is quite a big app in terms of number of services/functions (i.e. servlets)
relies on a complex database (i.e. number of tables)
doesn't need many media resources, being mostly text based
"Now comes the hidden question: how would you profile your j2ee app to determine if it is a candidate to cloud or not; knowing that it is"
As an aside, make that the Explicit question. Make it the TITLE of this question. Put it in the beginning of the question. If possible, delete all of your assumptions, and focus on the question.
Here's what we do.
Call some vendors for your "cloud" or "managed service" arrangement. Not too many. One or two of each.
Ask them what they support. More importantly, what they don't support.
Then, given a short list of features that aren't supported, look at your code for those features. If they don't support things you need, you have some architecture work to do. Or cross them off your preferred vendor list.
For good vendors, write a pilot contract that gives you free (or cheap) access for a few months to install and test. If it doesn't work, you haven't paid much.
"But why go through the expense of trying to host it when it may not work?"
What expense? You can spend months "studying" your code. Or you can try to host it. Usually, the "try to host it" will turn up an answer within a few days. It's less effort to just do it.
What sort of cloud service are you talking about? IaaS, PaaS, DaaS?
architectural change (you might not install anything you want)
Depends: moving from a "managed server" to a Platform (e.g. GAE) might be.
learning curve (because of the above)
Amazon EC2 might not be a big learning curve if you are used to running your own server
no failover (since failure is taken care of)
Depends: EC2 -> you have to roll your own
instantaneous scalability (not so instantaneous, but at least transparent?)
Depends: EC2 -> you have to plan for this / use an adjunct service
There are numerous cloud providers, and as far as I've seen there are two main types of them:
Ones that provide a cloud computing platform (e.g. Amazon EC2, MS Azure)
Virtual instance providers that give you the ability to create numerous running instances (e.g. RightScale)
As far as the platforms go, be aware that relational database support is still quite poor and the learning curve is quite long.
As for virtual instance providers, the learning curve is really small (you just have to fire up your instances), but the instances need some way of synchronizing... for a complex application this might not work.
As for your original question: I don't think there's any standard way you could profile whether an application should/could be moved to the cloud. You probably need to familiarize yourself with the options, narrow them down to a few providers and see if the benefits they would give you would be a significant win over managed hosting (which is probably what you're currently doing).
In some ways, comparing Google App Engine (GAE) and Amazon EC2 is like comparing apples and oranges.
With EC2 you get an operating system, with or without installed server software (Tomcat, database, etc.; your choice, depending on which AMI you choose). With EC2 you need a (or need to be a) system administrator to keep things running smoothly. Load balancing on EC2 is something you'll have to figure out and implement; I've never done that part. The big advantage with EC2 is that you can spin new instances up and down programmatically, and compared to a regular web server provider, only pay for when your instance is up and running. You use this "auto spin up/down" to implement your load balancing and failover. But you have to do the implementation (again, I have no experience with this part).
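For illustration, spinning an instance up and down with the AWS SDK for Java might look roughly like this. The AMI ID and instance type are placeholders, and credentials/region are assumed to be configured in the environment; treat this as a sketch, not a full solution:

```java
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.RunInstancesRequest;
import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

public class SpinUp {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

        // Launch one instance from an AMI you have prepared (IDs are placeholders).
        RunInstancesRequest run = new RunInstancesRequest()
                .withImageId("ami-00000000")
                .withInstanceType("m1.small")
                .withMinCount(1)
                .withMaxCount(1);
        String instanceId = ec2.runInstances(run)
                .getReservation().getInstances().get(0).getInstanceId();

        // ...later, when load drops, stop paying for it:
        ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(instanceId));
    }
}
```

Your load-balancing/failover logic would sit around calls like these, deciding when to launch and terminate based on whatever load metrics you collect.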
With google app engine (gae), all of the system administration is taken care of for you. It also automatically scales out as needed, both on the app side and the database side. You also only pay for what you use; an idle app that gets no hits incurs no costs. The downsides to gae are that you're restricted in the languages you can use; python and java (or things that run on the jvm, like jruby). An even bigger downside is that the database is not sql (they don't call it a database; they call it the datastore) and will likely require reworking your ddl if you have an existing database; it's a definite cost in programmer time to understand how it works and how to use it effectively.
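To give a feel for how different the datastore is from SQL, here is a hedged sketch using GAE's low-level datastore API. The entity kind and properties are made up for illustration, and this only runs inside the App Engine runtime or dev server:

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;

public class DatastoreExample {
    public static void main(String[] args) throws EntityNotFoundException {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

        // A "kind" plays the role of a table, but entities are schemaless property bags.
        Entity order = new Entity("Order");
        order.setProperty("customerName", "Acme Ltd"); // denormalized: no join to a Customer table
        order.setProperty("total", 129.95);
        Key key = ds.put(order);

        // Reads are by key (or by indexed query); ad-hoc SQL joins are not available.
        Entity fetched = ds.get(key);
        System.out.println(fetched.getProperty("total"));
    }
}
```

The absence of joins is what usually forces the DDL rework mentioned above: data a relational design would normalize tends to get denormalized into the entities that read it.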
So if you're starting from scratch (or willing to rewrite) and you have the resources and time to learn its ways, GAE may be the way to go. If you have a sysadmin or sysadmin skills and have the time or know-how to set up load balancing and failover, then EC2 could be the way to go. If your app is going to be idle a lot then EC2 is expensive; last time I checked it was something like $70 a month to leave a small instance up.
I think the points you have to focus are:
granular cost (pay per GHz or GB)
instantaneous scalability (not so instantaneous, but at least transparent?)
Changing your application to run on a cloud will take a good amount of time, but that won't really matter if the cloud doesn't lower your costs and/or you don't really need instantaneous/fast scalability (the classic example is an eCommerce app).
After considering these two points, the one IMO you should think about is "relies on a complex database (i.e. number of tables)", since depending on its complexity, changing to a cloud environment can be really troublesome.
What's a good method for assigning work to a set of remote machines? Consider an example where the task is very CPU and RAM intensive, but doesn't actually process a large dataset. The language of choice would be Java. I was thinking Hadoop would be a good option, but the dataset passed between remote machines is fairly small, and Hadoop seems to focus mainly on the distribution of data rather than distribution of work.
What are some good technologies that can help?
EDIT: I'm mainly interested in load balancing. There will be a series of jobs with a small (< 3MB) dataset, but significant processing and memory needs.
MPI would probably be a good choice; there's even a Java implementation.
MPI may be part of your answer, but looking at the question, I'm not sure if it addresses the portion of the problem you care about.
MPI provides a communication layer between processing components. It is low level, requiring you to do a fair amount of work, but from what I saw in an introductory presentation, it also comes with some common matrix data manipulation functions.
In your question, you seem to be more interested in the load balancing/job processing aspects of the problem. If that really is your focus, maybe a small program hosted in a servlet or an RMI server might be sufficient. Let each program go to the server for its next unit of work and then submit the results back (you might even be able to use a database/file share, but pay attention to locking issues). In other words, a pull mechanism versus a push mechanism.
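A minimal sketch of the pull model over RMI might look like the following. The WorkService interface, the WorkUnit/WorkResult types and the registry name are all hypothetical; the server side would export an implementation of WorkService via UnicastRemoteObject and bind it under the same name:

```java
import java.io.Serializable;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;

// Hypothetical job/result types; they just need to be Serializable for RMI.
class WorkUnit implements Serializable { long id; byte[] payload; }
class WorkResult implements Serializable { long id; byte[] output; }

// The server publishes this interface; workers pull jobs instead of being pushed to.
interface WorkService extends Remote {
    WorkUnit takeWork() throws RemoteException;      // returns null when no work is queued
    void submitResult(WorkResult r) throws RemoteException;
}

public class Worker {
    public static void main(String[] args) throws Exception {
        Registry registry = LocateRegistry.getRegistry(args[0]); // server host
        WorkService service = (WorkService) registry.lookup("workService");
        while (true) {
            WorkUnit unit = service.takeWork();      // pull the next job
            if (unit == null) { Thread.sleep(1000); continue; }
            WorkResult result = process(unit);       // the CPU/RAM-heavy part
            service.submitResult(result);
        }
    }

    static WorkResult process(WorkUnit unit) {
        // ...the actual number crunching goes here...
        return new WorkResult();
    }
}
```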
This approach is fairly simple to implement and gives you the advantage of scaling up by just running more distributed clients. Load balancing isn't too important if you intend to allow your process to take full control of the machine. You can experiment with running multiple clients on a machine that has multiple cores to see if you can improve overall throughput for the node. A multi-threaded client would be more efficient, but can increase complexity depending on the structure of the code you are using to solve the problem.
I am often asked to perform sizing and capacity planning for our clients. When our clients buy our products (basically J2EE web applications), they often ask what hardware they will need to run those products. Our recommendations often result in high-cost hardware acquisitions.
So far, the best heuristic I have developed is to compare the utilization projections (the number of registered and concurrent users the application should serve) with the data gathered at our existing installations. Something like: if installation A serves 100 concurrent users with X hardware, then installation B will need 2*X hardware to serve 200 concurrent users.
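Written down with an explicit safety margin, the heuristic looks like this; the linear-scaling assumption and the 1.5 headroom factor are illustrative, not measured values:

```java
public class CapacityEstimate {
    /**
     * Linear extrapolation from a reference installation, padded with headroom.
     * Real applications often scale sub-linearly past some point, so treat the
     * result as a starting point, not a guarantee.
     */
    static double estimateHardwareUnits(double referenceUsers, double referenceUnits,
                                        double targetUsers) {
        double headroom = 1.5; // assumed safety factor
        return (targetUsers / referenceUsers) * referenceUnits * headroom;
    }

    public static void main(String[] args) {
        // Installation A serves 100 concurrent users on 4 cores; size B for 200 users:
        System.out.println(estimateHardwareUnits(100, 4, 200)); // 12.0 cores (8 scaled + 50%)
    }
}
```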
This approach, however, has a number of problems. Clients often use different hardware and software platforms. The set of products they buy from us is generally never the same, and parts of the application are generally built to order for a specific client. Add to that the fact that software versions change, etc., and there are so many parameters that the task of sizing becomes very difficult.
I studied some books on the subject, and some propose using complex mathematical models. The number of parameters these approaches require as input (for example, a detailed classification of application features) makes me think they are hardly useful. Hardware is generally ordered before even the basic requirements are defined, not to mention that these will vary throughout the application's development and lifecycle.
So, how do you go about sizing and capacity planning? Any tips and how-tos appreciated.
There is no easy mathematical formula to predict scale in the scenario you describe. If you (or your company) are serious about this, the best way is to build a performance & scalability test environment where you can easily set up and tear down various client configurations and send load at them to see how they will do. Because you are building custom components, one poorly written component or missing index can mess everything up, so an environment like this is where you can iron those things out before handing over to the client. Once you have this type of environment, you can add memory & CPU to the app servers & databases to see how your application scales.
I would suggest a VM environment where they can easily add cpu & memory based on needs of the application, coupled with some realistic external load/scale testing using a service like watchmouse or browsermob.
If hardware must be ordered before the basic requirements are defined, then about the best you can do is to ballpark the capacity by looking at your installed base for a similar set of projects (as you are doing now). Keep track of your existing customers' experience of scaling and capacity needs as they grow their installations; if you have a large enough base, you can probably do rough curve fitting by grouping similar projects with similar hardware and looking at capacity needs. Watch how existing customers' capacity needs change during growth as well, for additional data points.
Ideally, the initial HW/SW buy is for a pilot installation, and you benchmark the pilot setup once it is up and meeting spec. Use those results to project capacity needs for the move from pilot to production. Of course, this requires time in the pilot-to-production schedule to benchmark the app then order and take delivery of the equipment. But it will give a more accurate capacity estimate than doing it all upfront.
If the app scales horizontally in a graceful way, a rough initial estimate is OK as a starting point. Adding or removing boxes as required should be easy once the app runs in production.
I have an application that's a mix of Java and C++ on Solaris. The Java aspects of the code run the web UI and establish state on the devices that we're talking to, and the C++ code does the real-time crunching of data coming back from the devices. Shared memory is used to pass device state and context information from the Java code through to the C++ code. The Java code uses a PostgreSQL database to persist its state.
We're running into some pretty severe performance bottlenecks, and right now the only way we can scale is to increase memory and CPU counts. We're stuck on the one physical box due to the shared memory design.
The really big hit here is being taken by the C++ code. The web interface is fairly lightly used to configure the devices; where we're really struggling is in handling the data volumes that the devices deliver once configured.
Every piece of data we get back from the device has an identifier in it which points back to the device context, and we need to look that up. Right now there's a series of shared memory objects that are maintained by the Java/UI code and referred to by the C++ code, and that's the bottleneck. Because of that architecture we cannot move the C++ data handling off to another machine. We need to be able to scale out so that various subsets of devices can be handled by different machines, but then we lose the ability to do that context lookup, and that's the problem I'm trying to resolve: how to offload the real-time data processing to other boxes while still being able to refer to the device context.
I should note we have no control over the protocol used by the devices themselves, and there is no possible chance that situation will change.
We know we need to move away from this to be able to scale out by adding more machines to the cluster, and I'm in the early stages of working out exactly how we'll do this.
Right now I'm looking at Terracotta as a way of scaling out the Java code, but I haven't got as far as working out how to scale out the C++ to match.
As well as scaling for performance, we need to consider high availability. The application needs to be available pretty much the whole time -- not absolutely 100%, which isn't cost effective, but we need to do a reasonable job of surviving a machine outage.
If you had to undertake the task I've been given, what would you do?
EDIT: Based on the data provided by @john channing, I'm looking at both GigaSpaces and Gemstone. Oracle Coherence and IBM ObjectGrid appear to be Java-only.
The first thing I would do is construct a model of the system to map the data flow and try to understand precisely where the bottleneck lies. If you can model your system as a pipeline, then you should be able to use the theory of constraints (most of the literature is about optimising business processes but it applies equally to software) to continuously improve performance and eliminate the bottleneck.
Next I would collect some hard empirical data that accurately characterises the performance of your system. It is something of a cliché that you cannot manage what you cannot measure, but I have seen many people attempt to optimise a software system based on hunches and fail miserably.
Then I would use the Pareto Principle (80/20 rule) to choose the small number of things that will produce the biggest gains and focus only on those.
To scale a Java application horizontally, I have used Oracle Coherence extensively. Although some dismiss it as a very expensive distributed hashtable, the functionality is much richer than that and you can, for example, directly access data in the cache from C++ code.
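As a rough sketch of the Java side (the cache name and the key/value used here are assumptions; the C++ side would use Coherence's C++ client against the same named cache):

```java
import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;

public class DeviceContextCache {
    public static void main(String[] args) {
        // Device context lives in the data grid instead of local shared memory,
        // so any node in the cluster can resolve an identifier.
        NamedCache deviceContext = CacheFactory.getCache("device-context");

        // The Java/UI tier publishes context into the grid...
        deviceContext.put("device-42", "context blob for device 42");

        // ...and any other node (including a C++ client) looks it up by identifier.
        Object ctx = deviceContext.get("device-42");
        System.out.println(ctx);

        CacheFactory.shutdown();
    }
}
```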
Other alternatives for horizontally scaling your Java code would be GigaSpaces, IBM ObjectGrid or GemStone GemFire.
If your C++ code is stateless and is used purely for number crunching, you could look at distributing the process using IceGrid, which has bindings for all of the languages you are using.
You need to scale sideways (out) rather than up. Maybe something like a message queue could be the back end between the front end and the crunching.
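As a hedged sketch of what the Java side of such a queue might look like (standard JMS against ActiveMQ, whose broker also has native C++ client libraries; the broker URL, queue name and message shape are all assumptions):

```java
import javax.jms.BytesMessage;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Destination;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

public class DeviceDataProducer {
    public static void main(String[] args) throws JMSException {
        // Broker URL and queue name are placeholders.
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Destination queue = session.createQueue("device.data");

        MessageProducer producer = session.createProducer(queue);
        BytesMessage msg = session.createBytesMessage();
        msg.writeBytes(new byte[0]); // raw device payload would go here
        msg.setStringProperty("deviceId", "device-42"); // lets consumers route/look up context
        producer.send(msg);

        connection.close();
    }
}
```

The C++ crunching processes would consume from the same queue on whatever boxes you add, which decouples them from the machine running the front end.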
Andrew (in addition to modelling as a pipeline etc.), measuring things is important. Have you run a profiler over the code and got metrics of where most of the time is spent?
For the database code, how often does it change ? Are you looking at caching at the moment ? I assume you have looked at indexes etc over the data to speed up the Db ?
What levels of traffic do you have on the front end? Are you caching web pages? (It isn't too hard to use a JMS-type API to communicate between components. You can then put the web page component on one machine (or more), and the integration code (C++) on another; for many JMS products there are usually native C++ APIs, ActiveMQ comes to mind.) But it really helps to know how much of the time is spent in the web tier (JSP?), C++, and database ops.
Is the database storing business data, or is it also being used to pass data between Java and C++? You say you are using shared memory, not JNI? What level of multi-threading currently exists in the app? Would you describe the code as synchronous or asynchronous in nature?
Is there a physical relationship between the Solaris code and the devices that must be maintained (i.e. do all the devices register with the C++ code, or can that be specified)? I.e. if you were to put a web load balancer on the front end and just put up two machines today, is the relationship of which devices are managed by which box initialized up front or in advance?
What are the HA requirements? I.e. just state info? Can the HA be done just in the web tier by clustering session data?
Is the DB running on another machine ?
How big is the DB? Have you optimized your queries? E.g. using explicit inner/outer joins sometimes helps versus nested sub-queries. (Again, look at the SQL stats.)