I am developing an application on Google App Engine and was just checking out cron jobs.
Looking at this document, it seems pretty easy to schedule the jobs with config files and so on. My question is about what I should actually put at the URL the scheduled task triggers.
I was thinking of a JSP that triggers a servlet that does whatever I need done, but since I don't have a lot of experience with this technology, I was wondering if there is a standard or better way of achieving this.
How are people doing things such as this?
Any help, pointers appreciated!
Your approach is just fine, and in line with Google's App Engine Cron Service documentation.
Google App Engine is a somewhat unusual place to run code, with all of its restrictions. The other 'normal' approaches to scheduling tasks in Java (an external cron system, jcrontab, a never-ending loop with Thread.sleep(), etc.) are not possible there.
Related
I am starting a web crawler and have chosen to use crawler4j. Due to a specific customer requirement, all code must run inside a Java EE war. Since most crawlers in Java run in standalone mode, I would like to know the best approaches for running typical long-running processing jobs such as crawlers (specifically crawler4j, which has multi-threading capabilities) in a web context.
I've seen possible solutions like Spring Batch and Quartz, but I could not figure out which one fits this scenario better.
edit: I found a similar question: How to schedule crawler4j crawl control to run periodically? that was not answered either
How do I get started with Compute Engine and set up a Java batch job that runs continuously at very small intervals (constantly), reads from Google Datastore, processes the data, and writes back to Google Datastore?
Right now I have a game application running on GAE. When a user starts a game, an entity is stored in the Datastore. The game is partly time-based, and I want to be able to check the games frequently and efficiently and send notifications if necessary. At the moment this is done by a task queue job that runs for 10 minutes and reschedules itself when it finishes. However, I do not feel that this is the correct way to handle this, and I will therefore migrate to GCE for better performance and scaling opportunities.
I have read the GCE “get-started-guide”, but it only explains how to connect via SSH, install programs, and make a very simple website. Where can I find a guide that explains how to create an initial Java project aimed at GCE that uses Google APIs such as Datastore? Any advice on how to get started is highly appreciated.
Google Cloud DevRel has started some guides to provide some clarification on this exact topic, like http://cloud.google.com/python, http://cloud.google.com/nodejs, etc, but Java won't be finished for a few months.
If you like fully controlling your infrastructure, you can definitely use GCE, but if I were you, I would stick to App Engine, since it automates a lot of scaling you would have to do manually. GCE provides auto-scaling features, but they are more involved than App Engine. But if you want to see what they look like, the Python GCE section isn't especially specific to Python:
https://cloud.google.com/python/getting-started/run-on-compute-engine#multiple_instances
If you're finding App Engine limiting, you can look into migrating to Managed VMs instead, which are similar to App Engine but let you do things like install custom libraries using a Dockerfile.
As for Task Queues, they are still officially supported, but if you are interested in massive scalability, you can check out Cloud Pub/Sub as well and see if it fits your needs.
If your data size is getting large, Cloud Dataflow lets you run real-time streaming or batch jobs that read from Datastore and do some calculations on it. Cloud Dataflow can read from both Datastore and Pub/Sub queues.
If you want to interact with APIs like Pub/Sub or Datastore outside of the context of App Engine, the traditional client library is here:
https://developers.google.com/api-client-library/java/
There is also a newer project that provides friendlier, easier-to-use client libraries. They are still in an early state, but you can check them out here:
https://github.com/googlecloudplatform/gcloud-java
Overall, if your current App Engine and Task Queue solution works, I would stick with it. Based on what you're telling me, the biggest change I would make is instead of your batch job polling every ten minutes, I would have the code that stores the entity in Datastore immediately kick off a Task Queue job or a Pub/Sub message that starts the background processing job.
If you're interested in where the platform is heading, you can check out some of the links here. While you can roll your own solutions on GCE, to me the more interesting parts of the platform are products like Managed VMs and Cloud Dataflow, since they allow you to solve a lot of these problems at a much higher level and save you a lot of headaches setting up your infrastructure. However, most of these are still in a Beta stage, so they might have a few rough edges for a little while.
If this doesn't answer your question, comment any more questions and I will try to edit in the answers. And stay tuned for a much better guide to the whole platform for Java.
I need to develop a Java platform to download and process information from Twitter. The basic idea is to have a centralized controller that generates tasks (basically an id and keywords) and sends these tasks to remote workers (one per computer). I need to receive a status report periodically to know the status of both the task and the worker. I'll have at least 60 workers (ten times more in the near future).
My initial idea was to use RMI, but I need to communicate in both directions and I don't feel comfortable with RMI. The other approach was to use SSLSockets to send serialized objects, but I would have to handle a lot of errors and add a lot of code to monitor tasks and workers. Some people suggested using a framework like Spring Batch, Gigaspaces or Quartz.
What do you think would be the best option for this project? So far I've read a lot of good things about Gigaspaces, but I can't find a good tutorial on how to implement it, and Quartz seems promising. What do you think? Is it worth using any of them?
It's not easy to tell you to go for a technology based on your question. GigaSpaces is certainly up to the job but so is Spring Batch. Quartz is just the scheduling part of your question and not so much the remoting and the distribution of workload.
GigaSpaces is a fully fledged application platform for scenarios where parallelism, high throughput and scalability are factors. Spring Batch can definitely also do the job, but unlike GigaSpaces, it is not an application platform, so you would still need to deploy your application somewhere.
However, GigaSpaces is a commercial product (a free version is available). Other frameworks that could help you, such as Storm Project (http://storm-project.net/) and Hazelcast (www.hazelcast.com), also come to mind.
So without clarifying your use case it's hard to give a single answer. It all depends on what exactly you want and how you want to use it, now and in the future.
This app must connect to a web service, grab data, and save it in the database.
Every hour 24/7.
What's the most effective way to create such an app in Java?
How should it be run - as a system application or as a web application?
Keep it simple: use cron (or task scheduler)
If that's all you want to do, namely to probe some web service once an hour, write it as a console app and run it with cron.
An app that starts and stops every hour:
cannot leak resources
cannot hang (at worst you lose one cycle)
consumes 0 resources 99% of the time
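A minimal sketch of such a run-once console app, assuming Java 11+ for `java.net.http`. The class name `HourlyFetchJob`, the default URL, and the print-only "save" step are all hypothetical placeholders; a real job would write to the database through JDBC instead. The process exits after a single cycle, so cron (or the Windows task scheduler) is the only thing that has to keep running.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical one-shot job. An assumed crontab entry to run it hourly:
//   0 * * * * java -jar /path/to/fetch-job.jar
final class HourlyFetchJob {

    /** Runs one fetch-and-save cycle and returns a process exit code (0 = success). */
    static int runOnce(Supplier<String> fetch, Consumer<String> save) {
        try {
            save.accept(fetch.get());
            return 0;
        } catch (RuntimeException e) {
            // A failed cycle only costs this run; cron starts a fresh process next hour.
            System.err.println("Cycle failed: " + e.getMessage());
            return 1;
        }
    }

    /** Fetches a URL body, with timeouts so a slow service cannot stall the run forever. */
    static String httpGet(String url) {
        try {
            HttpClient client = HttpClient.newBuilder()
                    .connectTimeout(Duration.ofSeconds(10))
                    .build();
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .timeout(Duration.ofSeconds(30))
                    .GET()
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 200) {
                throw new IllegalStateException("HTTP " + response.statusCode());
            }
            return response.body();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder URL; pass the real endpoint as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com/data";
        int exitCode = runOnce(
                () -> httpGet(url),
                // Placeholder for the database write (e.g. a JDBC insert).
                body -> System.out.println("Would save " + body.length() + " characters"));
        System.exit(exitCode);
    }
}
```

Returning a non-zero exit code on failure lets cron's own mail or logging report a bad cycle without any extra monitoring code in the app.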
Look at Quartz; it's a scheduling library for Java, and it comes with sample code to get you started.
You'd need that and the JDBC driver for your database of choice.
No web container is required; this can easily be done as a standalone application.
Try the ScheduledExecutorService.
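As a rough sketch of that suggestion, the following keeps one long-running JVM and fires the job at a fixed rate. The class name `PeriodicRunner` and the printed message are placeholders for the real fetch-and-save logic. The try/catch matters: if a task scheduled with `scheduleAtFixedRate` throws, its later executions are suppressed.

```java
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper around ScheduledExecutorService for a periodic job.
final class PeriodicRunner {

    /** Starts the job immediately and then repeats it every period; returns the scheduler. */
    static ScheduledExecutorService start(Runnable job, long period, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                job.run();
            } catch (RuntimeException e) {
                // Swallow and log: an exception escaping here would cancel all future runs.
                System.err.println("Run failed: " + e.getMessage());
            }
        }, 0, period, unit);
        return scheduler;
    }

    public static void main(String[] args) {
        // The scheduler thread is non-daemon, so the JVM stays alive after main returns.
        start(() -> System.out.println("Polling web service at " + Instant.now()),
                1, TimeUnit.HOURS);
    }
}
```

Call `shutdown()` on the returned scheduler to stop cleanly, for example from a JVM shutdown hook.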
Why not use cron to start the Java application every hour? There is no need to soak up server resources keeping the Java application active if it's not doing anything the rest of the time; just start it when needed.
If you are intent on doing it in Java, a simple Timer would be more than sufficient.
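A minimal sketch of the Timer approach, with a hypothetical class name `TimerPoller` and a placeholder job body. One caveat worth knowing: an exception that escapes a `TimerTask` kills the timer thread, so the task body is wrapped in a try/catch.

```java
import java.time.Instant;
import java.util.Timer;
import java.util.TimerTask;

// Hypothetical wrapper around java.util.Timer for a periodic job.
final class TimerPoller {

    /** Runs the job now and then every periodMillis; returns the Timer so it can be cancelled. */
    static Timer start(Runnable job, long periodMillis) {
        Timer timer = new Timer("poller"); // non-daemon thread keeps the JVM alive
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                try {
                    job.run();
                } catch (RuntimeException e) {
                    // An uncaught exception would terminate the timer thread for good.
                    System.err.println("Run failed: " + e.getMessage());
                }
            }
        }, 0L, periodMillis);
        return timer;
    }

    public static void main(String[] args) {
        long oneHourMillis = 60L * 60L * 1000L;
        start(() -> System.out.println("Polling web service at " + Instant.now()),
                oneHourMillis);
    }
}
```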
Create a web page and schedule its execution with one of the many online scheduling services. The majority of them are free, very simple to use and very reliable. Some allow you to create schedules of any complexity, just like in cron, the SqlServer job UI, etc. This saves you a LOT of headaches creating, debugging and maintaining your own scheduling engine, even if it's based on some framework like Ncron, Quartz, etc. I'm speaking from my own experience.
I'm drawing up a design for a system to perform daily business functions for my company. It will consist of an Oracle 10g database with PL/SQL packages and a Java-based web application. All of this is running on a Solaris 10 server. Aside from handling transactions from the web interface, scheduled tasks need to run on the database to run calculations, load data, etc.
This is a redesign of a legacy system that currently controls everything with a plethora of cron jobs. Given the task of redesigning it, would you do it differently? I know Oracle has its own task scheduler, but the DBA argues that he would rethink using it because if the database is down or offline for some reason, it can't send alerts or log errors of any kind. The cron jobs currently have the ability to send SMS messages or emails should one of the tasks fail. Another option would be to have the web application do it somehow.
What do you suggest?
Are all the scheduled tasks related to the database? If so, then your DBA's objection is irrelevant: you don't want to run the jobs when the database is offline for planned downtime, and the DBA ought to have something in place to alert them if the database is down for unplanned reasons, rather than relying on a signal from a failing cron job.
If you have jobs which run on other parts of the architecture without touching the database then certainly an external scheduler makes sense. There are plenty of commercial products, but if you want to go for FOSS then you probably ought to look at Quartz.
Having used both cron and the Oracle job scheduler, I have always found cron a lot more reliable and easier to use and understand. It can also do more (it interfaces with the entire OS, not just Oracle). I would choose cron.
My rule of thumb for scheduled jobs is consistency. Since you've already got a lot of infrastructure in place like alerting I'd stick with cron.