We have a web application that receives incoming data via RESTful web services running on Jersey/Tomcat/Apache/PostgreSQL. Separately from this web-service application, we have a number of repeating and scheduled tasks that need to be carried out. For example, purging different types of data at different intervals, pulling data from external systems on varying schedules, and generating reports on specified days and times.
So, after reading up on Quartz Scheduler, I see that it seems like a great fit.
My question is: should I design my Quartz-based scheduling application to run in Tomcat (via QuartzInitializerListener), or build it as a standalone application that runs as a Linux daemon (e.g., via Apache Commons Daemon or the Tanuki Java Service Wrapper)?
On the one hand, it strikes me as counterintuitive to use Tomcat to host an application that is not geared towards receiving http calls. On the other hand, I haven't used Apache Commons Daemon or the Java Service Wrapper before, so maybe running inside Tomcat is the path of least resistance.
Are there any significant benefits or dangers with either approach that I should be aware of? Our core modules already take care of data access, logging, etc., so I don't see that those services are much of a factor either way.
Our scheduling will be data driven, so our Quartz-based scheduler will read the relevant data from PostgreSQL. However, if we run the scheduling application within Tomcat, is it possible/reasonable to send messages to our application via http calls to Tomcat? Finally, fwiw, since our jobs will be driven by our existing application data, I don't see any need for the Quartz JDBCJobStore.
To run a standalone Java application as a Linux daemon, simply end the java command with an & sign so that it runs in the background, and put it in an Upstart script, for example.
As for the design: in this case I would go for whatever is easier to maintain. And it looks like running an app in Tomcat is already familiar. One benefit that comes to mind is that configuration files (for the database for example) can be shared/re-used so that only one set of configuration files needs to be maintained.
However, if you think the scheduled tasks can have a significant impact on resource usage, then you might want to run the tasks on a separate (virtual) machine. Since the timing of the tasks is data driven, it is hard to predict the exact load. E.g. all the different tasks could end up executing at the same time (worst case/highest load scenario). Also consider the complexity of the software for the scheduled tasks and the related risk of nasty bugs: if you think there is a low chance of nasty bugs, then running the tasks in Tomcat next to the web service is a good option; if not, run the tasks as a separate application. Lastly, consider the infrastructure in general: production-line systems (those providing a continuous flow of data processing that is critical to the business) should be separate from non-production-line systems. E.g. if the reports are created an hour later than usual and the business is largely unaffected, then this is non-production line. But if the web service goes down and the business is immediately affected, then this is production line. Purging data and pulling updates is a bit gray: it depends on what happens if these tasks are performed late or not at all.
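To make the worst-case load point concrete, here is a minimal sketch of one way to cap how many scheduled tasks run at once, so that a burst of simultaneous jobs cannot starve the web service. It uses only java.util.concurrent rather than Quartz, and the class and method names (BoundedTaskRunner, runAll) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: runs tasks on a fixed-size pool so that at most
// maxConcurrent of them execute at the same time.
class BoundedTaskRunner {

    /** Runs every task and returns the highest number of tasks seen running at once. */
    static int runAll(List<Runnable> tasks, int maxConcurrent) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (Runnable task : tasks) {
                futures.add(pool.submit(() -> {
                    // Track concurrency so the cap can be observed.
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        task.run();
                    } finally {
                        running.decrementAndGet();
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for completion; rethrows task failures
            }
        } catch (Exception e) {
            throw new IllegalStateException("task execution failed", e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        List<Runnable> tasks = new ArrayList<>();
        for (int i = 0; i < 6; i++) {
            tasks.add(() -> System.out.println("running on " + Thread.currentThread().getName()));
        }
        System.out.println("peak concurrency: " + runAll(tasks, 2));
    }
}
```

If I remember correctly, Quartz offers a comparable knob through its thread pool size setting, but check the Quartz configuration docs rather than taking my word for it.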
I developed jobs in Spring Batch to replace a data loading process that was previously done using bash scripts.
My company's scheduler of choice is control-m. The old bash scripts were triggered from control-m on file arrival using a file watcher.
For reasons beyond my control, we still need to use control-m. Using Spring Boot or any other framework tied to a webserver is not a possibility.
The safe approach seems to be to package the spring batch application as a jar and trigger from control-m the job using "java -jar", but this doesn't seem the right way considering we have 20+ jobs.
I was wondering if it's possible to start the app once (like a daemon) and communicate with it using JMS or some other approach. That way we wouldn't need to spawn multiple JVMs (considering jobs might run simultaneously).
I'm open to different solutions. Feel free to teach me the best way to solve this use case.
The safe approach seems to be to package the spring batch application as a jar and trigger from control-m the job using "java -jar", but this doesn't seem the right way considering we have 20+ jobs.
IMO, running jobs on demand is the way to go, because it is the most efficient way to use resources. Having a JVM running 24/7 just to run a batch job once in a while is a waste of resources, as the JVM will be idle between scheduled runs. If your concern is the packaging side of things, you can package all jobs in a single jar and use the Spring Boot property spring.batch.job.names at launch time to specify which job to run.
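The "one jar, pick the job by name at launch" idea can be sketched without any framework. This is not Spring Batch code, just a plain-Java stand-in to show the shape: the JobDispatcher class, the job names, and the exit-code convention (2 for an unknown job) are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical dispatcher: all jobs live in one jar and the scheduler
// (e.g. Control-M) passes the job name on the command line.
class JobDispatcher {
    private final Map<String, Supplier<Integer>> jobs = new LinkedHashMap<>();

    /** Registers a job; the supplier returns the job's exit code. */
    void register(String name, Supplier<Integer> job) {
        jobs.put(name, job);
    }

    /** Runs the named job and returns its exit code, or 2 if the name is unknown. */
    int run(String name) {
        Supplier<Integer> job = jobs.get(name);
        if (job == null) {
            System.err.println("Unknown job: " + name);
            return 2;
        }
        return job.get();
    }

    public static void main(String[] args) {
        JobDispatcher dispatcher = new JobDispatcher();
        dispatcher.register("purgeJob", () -> 0);
        dispatcher.register("reportJob", () -> 0);
        String name = args.length > 0 ? args[0] : "purgeJob";
        System.out.println(name + " finished with exit code " + dispatcher.run(name));
    }
}
```

With something like this, the scheduler keeps one job definition per batch job, and each definition differs only in the name argument it passes.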
I was wondering if it's possible to start the app once (like a daemon) and communicate with it using JMS or some other approach. That way we wouldn't need to spawn multiple JVMs (considering jobs might run simultaneously).
I would recommend exposing a REST endpoint in your JVM and implementing a controller that launches batch jobs on demand. You can find an example in the Running Jobs from within a Web Container section. In this case, the job name and its parameters could be passed in as request parameters.
Another way is to combine Spring Batch and Spring Integration to launch jobs using JMS requests (JobLaunchRequest). This approach is explained in detail, with code examples, in the Launching Batch Jobs through Messages section.
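For readers who want to see the long-running-JVM variant without pulling in Spring, here is a rough sketch using the HTTP server that ships with the JDK (com.sun.net.httpserver). It is not the Spring Batch controller from the docs. The /jobs path, the name query parameter, the pool size of 4, and the JobLaunchEndpoint class are all assumptions for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical endpoint: GET /jobs?name=<jobName> queues the named job
// on a worker pool and returns immediately.
class JobLaunchEndpoint {
    private final HttpServer server;
    private final ExecutorService jobPool = Executors.newFixedThreadPool(4);

    JobLaunchEndpoint(int port, Map<String, Runnable> jobs) {
        try {
            // Port 0 asks the OS for any free port.
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        } catch (IOException e) {
            throw new IllegalStateException("could not bind HTTP server", e);
        }
        server.createContext("/jobs", exchange -> {
            String query = exchange.getRequestURI().getQuery(); // e.g. "name=reportJob"
            String name = (query != null && query.startsWith("name=")) ? query.substring(5) : "";
            Runnable job = jobs.get(name);
            if (job != null) {
                jobPool.submit(job); // run asynchronously; the caller does not wait
            }
            int status = (job == null) ? 404 : 202;
            byte[] body = (job == null ? "unknown job" : "accepted").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(status, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
    }

    void start() {
        server.start();
    }

    int port() {
        return server.getAddress().getPort();
    }

    void stop() {
        server.stop(0);
        jobPool.shutdown();
    }
}
```

A caller such as Control-M would then only need an HTTP client (curl, for instance) to request /jobs?name=reportJob. Note that this sketch reports "accepted" rather than the job's outcome, so the scheduler would need a separate way to learn whether the job succeeded.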
In addition to the helpful answer from Mahmoud, Control-M isn't great with daemons. Sure, you can launch a daemon but anything running for a substantial length of time (i.e. into several weeks and beyond) is prone to error and you can often end up with daemons that are running that Control-M is no longer "aware" of (e.g. if the system has issues that cause the Control-M Agent to assume the daemon job has failed and then launches another one).
When I had no other methods available I used to add daemons as batch jobs in Control-M but there was an overhead in additional jobs that checked for multiple daemons, did the stop/starts and various housekeeping tasks. Best avoided if possible.
The problem is that you need to launch a separate JVM to create a separate session with a different amount of RAM per job.
How can I submit several Spark applications simultaneously without manually spawning separate JVMs?
My app runs on a single server, within a single JVM. That is a problem given Spark's one-session-per-JVM paradigm, which says:
1 JVM => 1 app => 1 session => 1 context => 1 RAM/executors/cores config
I'd like to have different configurations per Spark application without launching extra JVMs manually. Configurations:
spark.executor.cores
spark.executor.memory
spark.dynamicAllocation.maxExecutors
spark.default.parallelism
Usecase
You have started a long-running job that takes, say, 4-5 hours to complete. The job runs within a session configured with spark.executor.memory=28GB and spark.executor.cores=2. Now you want to launch a 5-10 second job on user demand, without waiting 4-5 hours. This tiny job needs 1GB of RAM. What would you do? Submit the tiny job on behalf of the long-running job's session? Then it will claim 28GB.
What I've found
Spark allows you to configure the number of CPUs and executors only at the session level. Spark scheduling pools allow you to slice and dice only the number of cores, not RAM or executors, right?
Spark Job Server. But it doesn't support Spark newer than 2.0, so it's not an option for me, although it does solve the problem for versions older than 2.0. The Spark JobServer feature list mentions "Separate JVM per SparkContext for isolation (EXPERIMENTAL)", which means spawning a new JVM per context.
Mesos fine-grained mode is deprecated
This hack, but it's too risky to use it in production.
Hidden Apache Spark REST API for job submission, read this and this. There is definitely a way to specify executor memory and cores there, but what is the behavior when submitting two jobs with different configs? As I understand it, this is a Java REST client for it.
Livy. I'm not familiar with it, but it looks like they have a Java API only for batch submission, which is not an option for me.
With a use case, this is much clearer now. There are two possible solutions:
If you require shared data between those jobs, use the FAIR-scheduler and a (REST-)frontend (as does SparkJobServer, Livy, etc.). You don't need to use SparkJobServer either, it should be relatively easy to code, if you have a fixed scope. I've seen projects go in that direction. All you need is an event loop and a way to translate your incoming queries into Spark queries. In a way, I would expect there to be demand for a library to cover this use case, since it's pretty much always the first thing you have to build, when you work on a Spark-based application/framework.
In this case, you can size your executors according to your hardware, Spark will manage scheduling of your jobs. With Yarn's dynamic resource allocation, Yarn will also free resources (kill executors), should your framework/app be idle.
For more information, read here: http://spark.apache.org/docs/latest/job-scheduling.html
If you don't need shared data, use YARN (or another resource manager) to assign resources in a fair manner to both jobs. YARN has a fair scheduling mode, and you can set the resource demands per application. If you think this suits you, but you need shared data, then you might want to think about using Hive or Alluxio to provide a data interface. In this case you would run two spark-submits, and maintain multiple drivers in the cluster. Building additional automation around spark-submit can help you make this less annoying and more transparent to end users. This approach is also high-latency, since resource allocation and SparkSession initialization take up a more or less constant amount of time.
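As a rough sketch of the "additional automation around spark-submit" idea: the helper below only assembles a spark-submit command line with per-application settings passed as --conf key=value pairs, so the tiny job and the long job can each get their own executor memory and cores. The class name, main class, and jar name are placeholders, and I'm assuming a YARN cluster in cluster deploy mode.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper that builds one spark-submit invocation per application.
class SparkSubmitCommand {

    /** Builds the argument list; conf entries are emitted in sorted key order. */
    static List<String> build(String mainClass, String appJar, Map<String, String> conf) {
        List<String> cmd = new ArrayList<>();
        cmd.add("spark-submit");
        cmd.add("--master");
        cmd.add("yarn");
        cmd.add("--deploy-mode");
        cmd.add("cluster");
        cmd.add("--class");
        cmd.add(mainClass);
        for (Map.Entry<String, String> e : new TreeMap<>(conf).entrySet()) {
            cmd.add("--conf");
            cmd.add(e.getKey() + "=" + e.getValue());
        }
        cmd.add(appJar); // the application jar comes after the options
        return cmd;
    }

    public static void main(String[] args) {
        // Placeholder class and jar names; a 1g executor for the short job.
        List<String> tiny = build("com.example.TinyJob", "jobs.jar",
                Map.of("spark.executor.memory", "1g", "spark.executor.cores", "1"));
        System.out.println(String.join(" ", tiny));
    }
}
```

The resulting list could be handed to new ProcessBuilder(cmd).start() to launch each application as its own driver. That is still one JVM per application, just spawned by code rather than by hand, and it carries the startup latency mentioned above.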
tl;dr I'd say it's not possible.
A Spark application is at least one JVM and it's at spark-submit time when you specify the requirements of the single JVM (or a bunch of JVMs that act like executors).
If however you want to have different JVM configurations without launching separate JVMs, that does not seem possible (even outside Spark but assuming JVM is in use).
I'm working on a Ruby on Rails app, currently hosted on Heroku.
We have about 5 web dynos and about 2 worker processes running on average. But because we're using adeptscale, these numbers can change a lot, and the cost is increasing from month to month.
We're thinking about changing the process and the infrastructure (using our own, off of amazon/google etc). And also because of the performance, access to java libraries and other gains we're planning to go with jRuby.
I haven't got much experience with jRuby at all, but I do have Java experience. So I have a few questions:
Question intro: The Rails philosophy/approach differs from Java's, i.e. a Ruby web server uses far less memory but can only process one request at a time, so running multiple servers compensates for the inability to process multiple requests.
If we go with jRuby (and have our Rails project packaged as a war file and deployed to a servlet container, e.g. Tomcat, or JBoss, which is more than just a container), will we be able to process multiple requests then?
Question intro: Currently we have some application logic running in the workers (instead of blocking the web server and being unable to serve other clients/browsers). E.g. when a user submits a form and our app needs to contact a 3rd party service to get the response, we simply let a worker do the work of calling the 3rd party service and then updating the UI (which shows a waiting status) via websockets with whatever status the 3rd party service returned.
If we switch to jRuby, how will we achieve the similar logic? I mean do we go with the java code which has some kind of thread pool of workers and then free workers do the workload of contacting the 3rd party service etc? How would we go about this if we decide to go with jRuby?
1) You can serve multiple requests at a time in jruby with nearly any container, but you can also serve multiple requests at a time with mri-ruby. You only have to have a threadsafe app (config.threadsafe! is default in rails4). Different rack servers have different approaches to serve multiple requests at a time. For example unicorn uses multiple processes while passenger or puma go for a multi-threaded approach.
In my experience, JRuby containers like JBoss or Tomcat are more complicated to configure properly, but there are things like TorqueBox and Trinidad that help you with this. You can also still go for one of the Ruby servers (e.g. puma) that don't use C extensions.
2) If I understand you correctly, you are looking for a background-processing library? You can use sidekiq or resque with Ruby or JRuby (JRuby will be faster in general, and it's easier to debug memory leaks). You can even use Ruby for your rack servers and JRuby for your workers (they can even be run in parallel with things like rvm/rbenv).
In general I would only go for the JRuby option if you know what you are doing and need better performance for your app servers, or if you want to speed up your worker servers. If I were you, I would probably stay in the Ruby world and use puma for your app and sidekiq as a background service. Both are very elegant and don't need much configuration.
Yes, JRuby uses Java threads and is really multithreaded. And I can say that it's really good in integration with Java, even using classes for JNI.
I can recommend next servers (some have already been mentioned):
puma (https://github.com/puma/puma)
any servlet container (even IBM WebSphere Application Server!) - just use warbler (https://github.com/jruby/warbler)
The 'simplest' way to run an application on a servlet container is to make a .war with warbler. The resulting .war file usually includes all dependencies and the JRuby interpreter, so it is usually around 30 MB. But I think warbler is not so easy to set up, so I wouldn't recommend this route unless you really need to run Rails in an enterprise Java environment.
And I would just remind you that Rails uses a DB connection for every request, so the default DB connection pool size of 5 isn't enough. Don't forget to increase it before load testing :) (e.g. the default thread pool for puma is 16, for IBM WAS 50, and for Tomcat 200 threads).
I agree with smallbutton.com that puma is a good choice. Finally, with puma you can switch between JRuby and another interpreter fairly easily (in my experience there is one difference: gem names).
First of all, I have a conceptual question: does the word "distributed" only mean that the application runs on multiple machines? Or are there other ways in which an application can be considered distributed (for example, if there are many independent modules interacting together but on the same machine, is this distributed?).
Second, I want to build a system which executes four types of tasks, there will be multiple customers and each one will have many tasks of each type to be run periodically. For example: customer1 will have task_type1 today , task_type2 after two days and so on, there might be customer2 who has task_type1 to be executed at the same time like customer1's task_type1. i.e. there is a need for concurrency. Configuration for executing the tasks will be stored in DB and the outcomes of these tasks are going to be stored in DB as well. the customers will use the system from a web browser (html pages) to interact with system (basically, configure tasks and see the outcomes).
I thought about using a rest webservice (using JAX-RS) where the html pages would communicate with and on the backend use threads for concurrent execution.
Questions:
1. This sounds simple, but am I going in the right direction? Or should I be using other technologies or concepts, like Java Beans for example?
2. If my approach is fine, do I need to use a scripting language like JSP, or can I submit HTML forms directly to the REST URLs and get the result (using JSON for example)?
3. If I want to make the application distributed, is it possible with my idea? If not, what would I need to use?
Sorry for having many questions , but I am really confused about this.
I just want to add one point to the already posted answers. Please take my remarks with a grain of salt, since all the web applications I have ever built have run on one server only (aside from applications deployed to Heroku, which may "distribute" your application for you).
If you feel that you may need to distribute your application for scalability, the first thing you should think about is not web services and multithreading and message queues and Enterprise JavaBeans and...
The first thing to think about is your application domain itself and what the application will be doing. Where will the CPU-intensive parts be? What dependencies are there between those parts? Do the parts of the system naturally break down into parallel processes? If not, can you redesign the system to make it so? IMPORTANT: what data needs to be shared between threads/processes (whether they are running on the same or different machines)?
The ideal situation is where each parallel thread/process/server can get its own chunk of data and work on it without any need for sharing. Even better is if certain parts of the system can be made stateless -- stateless code is infinitely parallelizable (easily and naturally). The more frequent and fine-grained data sharing between parallel processes is, the less scalable the application will be. In extreme cases, you may not even get any performance increase from distributing the application. (You can see this with multithreaded code -- if your threads constantly contend for the same lock(s), your program may even be slower with multiple threads+CPUs than with one thread+CPU.)
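Here is a small sketch of the "each worker gets its own chunk" idea, using a parallel sum as a stand-in for real work. Each task touches only its own slice of the array and keeps a local accumulator, so there is no shared lock to contend for, and the partial results are combined only once at the end. The ChunkedSum name and the ceiling-division chunking scheme are just my choices for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example: share-nothing partitioning of work across threads.
class ChunkedSum {

    /** Sums data using the given number of workers (must be at least 1). */
    static long sum(int[] data, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int chunk = (data.length + workers - 1) / workers; // ceiling division
            List<Future<Long>> parts = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int from = Math.min(data.length, w * chunk);
                final int to = Math.min(data.length, from + chunk);
                Callable<Long> task = () -> {
                    long local = 0; // private to this task, so no locking needed
                    for (int i = from; i < to; i++) {
                        local += data[i];
                    }
                    return local;
                };
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // the only point where results are combined
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException("parallel sum failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] data = new int[100];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        System.out.println(sum(data, 4)); // 1 + 2 + ... + 100 = 5050
    }
}
```

The same breakdown would carry over to separate processes or machines: hand each node its chunk, and exchange only the small partial results.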
The conceptual breakdown of the work to be done is more important than what tools or techniques you actually use to distribute the application. If your conceptual breakdown is good, it will be much easier to distribute the application later if you start with just one server.
The term "distributed application" means that parts of the application system will execute on different computational nodes (which may be different CPU/cores on different machines or among multiple CPU/cores on the same machine).
There are many different technological solutions to the question of how the system could be constructed. Since you were asking about Java technologies, you could, for example, build the web application using Google's Web Toolkit, which will give you a rich browser based client user experience. For the server deployed parts of your system, you could start out using simple servlets running in a servlet container such as Tomcat. Your servlets will be called from the browser using HTTP based remote procedure calls.
Later, if you run into scalability problems, you can start to migrate parts of the business logic to EJB3 components that can ultimately be deployed on many computational nodes within the context of an application server, like Glassfish, for example. I don't think you need to tackle this problem until you run into it. It is hard to say whether you will without knowing more about the nature of the tasks the customers will be performing.
To answer your first question - you could get the form to submit directly to the rest urls. Obviously it depends exactly on your requirements.
As @AlexD mentioned in the comments above, you don't always need to distribute an application. However, if you wish to do so, you should probably consider looking at JMS, a messaging API that lets you run almost any number of worker machines, each reading messages from the message queue and processing them.
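To show the worker pattern in miniature, here is an in-process stand-in that uses a BlockingQueue where a real deployment would use a JMS queue and broker. It is not JMS code. The QueueWorkers class and the stop-marker convention are assumptions for illustration. The point is only that any number of identical workers can pull from one queue, and each message is handled by exactly one of them.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical in-process model of competing consumers on a message queue.
class QueueWorkers {
    // Marker telling a worker to exit; real messages must not equal this value.
    static final String STOP = "__STOP__";

    /** Processes all messages with workerCount workers and returns how many were handled. */
    static int process(List<String> messages, int workerCount) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        AtomicInteger handled = new AtomicInteger();
        Thread[] workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++) {
            workers[i] = new Thread(() -> {
                try {
                    while (true) {
                        String message = queue.take(); // blocks until a message arrives
                        if (STOP.equals(message)) {
                            return;
                        }
                        handled.incrementAndGet(); // stand-in for real task execution
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        queue.addAll(messages);
        for (int i = 0; i < workerCount; i++) {
            queue.add(STOP); // one stop marker per worker
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return handled.get();
    }

    public static void main(String[] args) {
        int count = process(List.of("task-1", "task-2", "task-3"), 2);
        System.out.println("handled " + count + " messages");
    }
}
```

With a real broker the workers would be separate processes on separate machines, and adding capacity would mean starting another worker pointed at the same queue.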
If you wanted to produce a dynamically distributed application, to run on say, multiple low-resourced VMs (such as Amazon EC2 Micro instances) or physical hardware, that can be added and removed at will to cope with demand, then you might wish to consider integrating it with Project Shoal, which is a Java framework that allows for clustering of application nodes, and having them appear/disappear at any time. Project Shoal uses JXTA and JGroups as the underlying communication protocol.
Another route could be to distribute your application using EJBs running on an application server.
I am working on a Java server side application that needs to provide a SOAP service. For this, we are using Axis2 and deploy in a Tomcat 6 installation.
We have the following issue: we need to run a couple of background threads; one to periodically query another web service for changes in provided data and a second one to monitor and consume data in an MQ.
My question is, what is the best, Java EE, practice to run these background tasks? Should we just run those as background threads that we'll somehow need to tell Tomcat to run at startup? Is there a better way than spawning threads from the web app container?
The system is not large enough to justify breaking it into smaller parts (e.g. running the background tasks in a system daemon, with the web-service part being a separate stateless component querying that system daemon). For the same reason we do not have the option to run within a full app server like JBoss (would that make any difference?).
Thanks!
UPDATE:
On a supplementary question, if we just spawned new threads for these tasks (and assuming that this is not common practice), would Tomcat (or Axis) be made more unstable or have any other issues?
I would suggest using quartz-scheduler for this kind of thing. It's simpler than managing threads yourself and of course more flexible. There are startup hooks (listeners) in Tomcat and Axis2, so you can start the scheduler there.
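To give an idea of the lifecycle involved, here is a minimal sketch that uses the JDK's ScheduledExecutorService as a stand-in for Quartz. The BackgroundTasks class and its method names are made up. In a web app, the assumption is that start() would be called from ServletContextListener.contextInitialized and stop() from contextDestroyed, so the threads do not outlive the application on redeploy.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical holder for the two background tasks from the question:
// polling another web service and consuming from a message queue.
class BackgroundTasks {
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);

    /** Schedules both tasks to repeat with the given delay between runs. */
    void start(Runnable pollWebService, Runnable consumeQueue, long delayMillis) {
        scheduler.scheduleWithFixedDelay(safe(pollWebService), 0, delayMillis, TimeUnit.MILLISECONDS);
        scheduler.scheduleWithFixedDelay(safe(consumeQueue), 0, delayMillis, TimeUnit.MILLISECONDS);
    }

    /** Stops scheduling and waits briefly; returns true if all threads finished. */
    boolean stop() {
        scheduler.shutdownNow();
        try {
            return scheduler.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // An exception escaping a scheduled task cancels its future runs,
    // so each task is wrapped to log the failure and keep going.
    private static Runnable safe(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (RuntimeException e) {
                System.err.println("background task failed: " + e);
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        BackgroundTasks tasks = new BackgroundTasks();
        tasks.start(() -> System.out.println("polling"), () -> System.out.println("consuming"), 50);
        Thread.sleep(120);
        System.out.println("stopped cleanly: " + tasks.stop());
    }
}
```

On the supplementary question: threads you start yourself are invisible to Tomcat, so the main risk is forgetting to stop them on undeploy, which is why the stop() half matters as much as the start() half.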