Batch processing on Google Compute Engine in Java

How do I get started with Compute Engine and set up a Java batch job that runs continuously at very small intervals (effectively constantly), reads from Google Datastore, processes the data, and writes back to Google Datastore?
Right now I have a game application running on GAE. When users initiate a game, an entity is stored in the Datastore. The game is somewhat time-based, and I want to be able to check the games frequently and efficiently and send notifications if necessary. At the moment this is done by a task queue that runs for 10 minutes and reschedules itself when it finishes. However, I do not feel that this is the correct way to handle this, and I will therefore migrate to GCE for better performance and scaling opportunities.
I have read the GCE getting-started guide, but it only explains how to connect via SSH, install programs, and make a very simple website. Where can I find a guide that explains how to create an initial Java project aimed at GCE that uses some of the Google APIs, like Datastore? Any advice on how to get started is highly appreciated.

Google Cloud DevRel has started publishing guides on exactly this topic, like http://cloud.google.com/python and http://cloud.google.com/nodejs, but the Java one won't be finished for a few months.
If you like fully controlling your infrastructure, you can definitely use GCE, but if I were you, I would stick with App Engine, since it automates a lot of the scaling you would otherwise have to do manually. GCE provides auto-scaling features, but they are more involved than App Engine's. If you want to see what they look like, the Python GCE section isn't especially specific to Python:
https://cloud.google.com/python/getting-started/run-on-compute-engine#multiple_instances
If you're finding App Engine limiting, you can look into migrating instead to Managed VMs, which is similar to App Engine but lets you do things like install custom libraries using a Dockerfile.
As far as Task Queues go, they are still officially supported, but if you are interested in massive scalability, you can check out Cloud Pub/Sub as well and see if it fits your needs.
If your data size is getting large, Cloud Dataflow lets you run real-time streaming or batch jobs that read from Datastore and do some calculations on it. Cloud Dataflow can read from both Datastore and Pub/Sub queues.
If you want to interact with APIs like Pub/Sub or Datastore outside of the context of App Engine, the traditional client library is here:
https://developers.google.com/api-client-library/java/
There is also a newer project that aims to provide friendlier, easier-to-use client libraries. They are still in an early state, but you can check them out here:
https://github.com/googlecloudplatform/gcloud-java
Overall, if your current App Engine and Task Queue solution works, I would stick with it. Based on what you're telling me, the biggest change I would make is this: instead of having your batch job poll every ten minutes, have the code that stores the entity in Datastore immediately kick off a Task Queue job or publish a Pub/Sub message that starts the background processing job.
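To make that concrete, here is a minimal sketch using the App Engine Java Datastore and Task Queue APIs. The Game kind, its deadline property, and the /tasks/checkGame worker servlet are all made up for illustration; the point is that the follow-up work is scheduled at write time (with a countdown matching the game's deadline) rather than discovered by a polling job:

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class GameService {

    public void startGame(String gameId, long deadlineMillis) {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

        // Store the new game entity, as the app already does today.
        Entity game = new Entity("Game", gameId);          // hypothetical kind
        game.setProperty("deadline", deadlineMillis);
        datastore.put(game);

        // Immediately schedule the check instead of waiting for a polling job.
        Queue queue = QueueFactory.getDefaultQueue();
        queue.add(TaskOptions.Builder
                .withUrl("/tasks/checkGame")               // hypothetical worker servlet
                .param("gameId", gameId)
                .countdownMillis(Math.max(0, deadlineMillis - System.currentTimeMillis())));
    }
}
```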
If you're interested in where the platform is heading, you can check out some of the links here. While you can roll your own solutions on GCE, to me the more interesting parts of the platform are products like Managed VMs and Cloud Dataflow, since they let you solve a lot of these problems at a much higher level and save you the headaches of setting up your own infrastructure. However, most of these are still in beta, so they might have a few rough edges for a while.
If this doesn't answer your question, post any further questions in the comments and I will try to edit in the answers. And stay tuned for a much better guide to the whole platform for Java.

Related

Back-end suggestions for an Android application

I am creating an image/graphics-intensive application on Android, so I have decided to keep images on the server side and fetch them in batches as needed for each user. Apart from this, I would like to manage some minor user data on the backend for any future extension of the app or dynamic loading of some content.
For this I am looking for the easiest, but not overly rigid, back-end solution. After some research I have narrowed it down to the options below (in order of priority):
Amazon SDK for Android: It looks like this provides a lot of pre-built components, but I am not sure how flexible it is for custom back-end coding/feature implementation.
Parse: Easy to understand and use, but not flexible when it comes to custom feature development.
Amazon EC2 Java backend: I will have to do all the server-side coding from scratch here, but this gives complete independence in feature implementation. Though I would love to find some code samples related to user management, backend DB management, and Java RESTful web services.
Any suggestions or pointers you have on the above choices would be great.
Thanks in advance.
I have been using Parse but haven't explored the other two, so this may not be a comprehensive answer, but I will try to give you some pointers based on my experience with Parse.
I have been doing Android development for quite some time now, but I have no significant expertise (I would say very minimal) on the backend. Also, you mentioned you wish to build a graphics/image-intensive application, whereas the application I use Parse for deals mostly with user data and minimal images (though it requires an extensive relational database).
Parse makes it really simple to create the backend structure, and the client SDK is also very powerful. Its APIs are very straightforward and don't require you to worry about writing complex queries, caching them, or saving the data. Given my background as mentioned above, I would say there is no learning curve involved in getting started: you can simply start building your app right away!
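To give a flavor of how little ceremony the client SDK needs, here is a minimal sketch of saving an object with the Parse Android SDK. The GameScore class name and its fields are invented for illustration, and Parse.initialize(...) is assumed to have been called in your Application class:

```java
import com.parse.ParseObject;

public class ScoreSaver {

    public void saveScore(String playerName, int score) {
        // Each ParseObject belongs to a named class; "GameScore" is hypothetical.
        ParseObject gameScore = new ParseObject("GameScore");
        gameScore.put("playerName", playerName);
        gameScore.put("score", score);

        // Persists asynchronously to the Parse backend; no schema setup required.
        gameScore.saveInBackground();
    }
}
```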
Also, Parse uses AWS S3 on the backend along with MongoDB, so computation on the server side should not be a problem. Server-side logic can be implemented using Parse Cloud Code (which requires some JavaScript), but if you plan to write complex algorithms, I am not sure how much of that can be done.
Parse's documentation for Android is good enough to get you through most of the development, and the docs for iPhone development are extensive.
As far as the cost structure goes, Parse allows 1 million free API requests per month, which is quite sufficient for a good number of users. In your case, storage should be the bigger concern: Parse allows 1 GB for free and charges around 20 cents per GB above that.
Hope this helps!
I am looking out for the easiest but not a very rigid back-end solution
Have you considered App Engine? Here's a tutorial about how to get App Engine working for you fast.
You can store up to 5 GB of blob storage for free, which should be more than enough for experimenting. If you go over, you pay an extra $0.13/GB/month for blob storage, which is more than reasonable.
I don't know what kind of app you are building, but I'll propose one approach.
Use https://imageshack.com/ for images.
Create your user-data application with a lightweight web service (REST+JSON) and expose it on Heroku (https://www.heroku.com/) with your preferred language/platform; it could be Java or Ruby (a minimal Java sketch follows below).
Using ImageShack for images will save cloud space for you, and the service is quite fast.
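As a rough sketch of what such a lightweight service could look like in Java, here is a single JSON endpoint built on the JDK's own HTTP server, with no framework at all. The /users/42 path and the payload are invented; Heroku supplies the listening port through the PORT environment variable:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class UserService {

    public static void main(String[] args) throws Exception {
        // Heroku passes the port to bind in the PORT environment variable.
        int port = Integer.parseInt(System.getenv().getOrDefault("PORT", "8080"));
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);

        // One hypothetical endpoint returning user data as JSON.
        server.createContext("/users/42", exchange -> {
            byte[] body = "{\"id\":42,\"name\":\"alice\"}".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}
```

In practice you would likely reach for a small framework (JAX-RS, Spark, etc.) rather than hand-rolling routing, but the deployment story on Heroku is the same either way.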

Monitor Web application

I made a web-based application using Java, and I would like to monitor its performance periodically (e.g., response time). I also want to display this information on the homepage of my application. Is that possible? Could I get any ideas about how this can be done?
Thanks.
You can take a look at stagemonitor. It is an open source Java web application performance monitor. It captures response time metrics, JVM metrics, request details (including a call stack captured by the request profiler), and more. The overhead is very low.
Optionally, you can use the great time series database Graphite with it to store a long history of datapoints that you can view in fancy dashboards.
Take a look at the GitHub page to see example screenshots, feature descriptions, and documentation.
Note: I am the developer of stagemonitor
Depending on your environment, I would use a cron job or scheduled task that measures your app's response time by requesting it with something like HttpClient, and then drops that information into a database table accessible by your app (see the sketch below).
The answer here shows the simplest way to measure the time: How do I time a method's execution in Java?
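A minimal sketch of such a probe, combining java.net.http.HttpClient (Java 11+) with the timing approach from that answer; the /health URL is a placeholder, and where you store the measurement is up to you:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ResponseTimeProbe {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/health"))   // placeholder endpoint
                .GET()
                .build();

        long start = System.nanoTime();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;

        // Run this from cron and write elapsedMillis to a table your homepage can read.
        System.out.println("Status " + response.statusCode() + " in " + elapsedMillis + " ms");
    }
}
```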
Why not check out Munin monitoring? The website says:
"Munin the monitoring tool surveys all your computers and remembers what it saw. It presents all the information in graphs through a web interface. Its emphasis is on plug and play capabilities. After completing a installation a high number of monitoring plugins will be playing with no more effort."
SLAC at Stanford University also keeps a large, quite well-sorted list of various solutions for network monitoring, among other things: SLAC's list of Network Monitoring Tools. Check, for instance, "Public domain or free network monitoring tools".
You can also consider creating your own custom web application monitor. To do this, use the proxy pattern to create a concrete monitor (see the sketch below). By using the Spring framework you can easily switch the monitor on and off at runtime without redeploying or restarting the web application. Furthermore, you can create many different specific monitors yourself and control exactly what is being monitored. This gives you maximum flexibility, but requires a bit of work.
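Here is a minimal sketch of that proxy idea using the JDK's dynamic proxies; it wraps any interface-based service and logs each call's duration. In a Spring application you would typically register such a wrapper as a bean (or use Spring AOP), which is what makes the runtime on/off switching possible:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

public final class MonitorProxy {

    // Wraps an interface-based service so every call's duration is reported.
    @SuppressWarnings("unchecked")
    public static <T> T monitor(T target, Class<T> iface) {
        InvocationHandler handler = (proxy, method, args) -> {
            long start = System.nanoTime();
            try {
                return method.invoke(target, args);
            } finally {
                long ms = (System.nanoTime() - start) / 1_000_000;
                System.out.println(method.getName() + " took " + ms + " ms");
            }
        };
        return (T) Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[]{iface}, handler);
    }
}
```

Usage would be something like UserService monitored = MonitorProxy.monitor(realService, UserService.class), where both names are hypothetical.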
It is possible.
The clearest way to go about it, and the one that provides true numbers, is to simulate a client that performs some sort of activity mimicking real usage, then have that client use the website periodically.
This presupposes that your website has a means of accepting inputs that do not impact the real back-end business. Crafting such interfaces requires some thought, but it is not beyond the ability of a person who could put together the website in the first place. The key is to emulate as much of the real website as possible while guarding against real business impact; basically, you are designing for a special user (the tester).
So you might have a special user whose purchases, when logged in, are bound to a special account that is filtered out so it appropriately neither demands payment nor ships goods. Provided the systems you integrate with all share an understanding of this live testing account, you can test alongside real production traffic post-deployment.
Such a structure provides a huge benefit. You get performance of the real, live running system. Performance tends to change over time, and is subject to the environment. By fetching your performance numbers on the live system, in the same environment, you get a much better view of what real users might be encountering. Also, you can differentiate and track performance for different activities.
Yes, it is a lot more to design and set up; however, if you are in it for the long run, the benefits are huge.
I guess JavaMelody is the most appropriate solution for you. It can be built into a Java application, and thanks to this it monitors functionality inside the app. Using this platform, it's possible to get much more specific parameters for your Java app than via external monitoring. In addition, it allows you to display some statistics on your app's homepage. Moreover, you can embed JavaMelody's graphs in the app, which essentially simplifies monitoring it.
Take a look at the detailed overview of JavaMelody: http://cases.azoft.com/enterprise-system-monitoring-solutions-business-apps/

Hadoop, Mahout real-time processing alternative

I intended to use Hadoop as a "computation cluster" in my project. However, I then read that Hadoop is not intended for real-time systems because of the overhead associated with starting a job. I'm looking for a solution that can be used this way: jobs that can be easily scaled across multiple machines but do not require much input data. Furthermore, I want to run machine learning jobs, e.g. applying a previously created neural network, in real time.
What libraries/technologies can I use for these purposes?
You are right, Hadoop is designed for batch-type processing.
Reading the question, I thought of the Storm framework, very recently open-sourced by Twitter, which can be considered "Hadoop for real-time processing".
Storm makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it's fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language.
(from: InfoQ post)
However, I have not worked with it yet, so I really cannot say much about it in practice (the sketch below is drawn from its documentation only).
Twitter Engineering Blog Post: http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html
Github: https://github.com/nathanmarz/storm
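To give a feel for the programming model (sketched from Storm's documentation only, per the caveat above), here is a tiny topology: an invented spout emits feature values and a bolt scores them, with a sigmoid standing in for the pre-trained neural network the question mentions. It uses the backtype.storm package names from the pre-Apache releases:

```java
import java.util.Map;
import java.util.Random;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class ScoringTopology {

    // Hypothetical spout: emits random feature values as a stand-in for live events.
    public static class EventSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Random random = new Random();

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100);                              // throttle the demo stream
            collector.emit(new Values(random.nextDouble()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("feature"));
        }
    }

    // Hypothetical bolt: a sigmoid stands in for a pre-trained neural network.
    public static class ScoringBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            double feature = input.getDoubleByField("feature");
            System.out.println("score=" + 1.0 / (1.0 + Math.exp(-feature)));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt, no output stream
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", new EventSpout(), 1);
        builder.setBolt("score", new ScoringBolt(), 2).shuffleGrouping("events");
        new LocalCluster().submitTopology("realtime-scoring", new Config(), builder.createTopology());
    }
}
```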
Given that you want a real-time response in the "seconds" range, I recommend something like this:
Set up a batch processing model to pre-compute as much as possible. Essentially, try to do everything that does not depend on the "last second" data. Here you can use a regular Hadoop/Mahout setup and run these batches daily or (if needed) every hour or even every 15 minutes.
Use a real-time system to do the last few things that cannot be precomputed. For this you should look at either the aforementioned S4 or the recently announced Twitter Storm.
Sometimes it pays to go really simple: store the precomputed values all in memory and simply do the last aggregation/filter/sort steps in memory (see the sketch below). If you can do that, you can really scale, because each node runs completely independently of all the others.
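A minimal sketch of that "precompute in batch, finish in memory" split; the class and the scoring scheme are invented for illustration:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

public class InMemoryScorer {

    // Precomputed scores, e.g. loaded from the nightly Hadoop/Mahout batch run.
    private final Map<String, Double> precomputed = new ConcurrentHashMap<>();

    public void load(Map<String, Double> batchResults) {
        precomputed.putAll(batchResults);
    }

    // Real-time step: the last sort/filter happens purely in memory, so each
    // node can serve requests completely independently of the others.
    public List<String> topItems(int limit) {
        return precomputed.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```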
Perhaps a NoSQL backend for your real-time component would help. There are lots of those available: MongoDB, Redis, Riak, Cassandra, HBase, CouchDB, ... It all depends on your actual application.
Also try S4, initially released by Yahoo! and now an Apache Incubator project. It has been around for a while, and I found it good for some basic stuff when I did a proof of concept. I haven't used it extensively, though.
What you're trying to do would be a better fit for HPCC, as it has both a back-end data processing engine (equivalent to Hadoop) and a front-end real-time data delivery engine, eliminating the need to add complexity through third-party components. A nice thing about HPCC is that both components are programmed using exactly the same language and programming paradigms.
Check them out at: http://hpccsystems.com

Is it easier to scrape data for a GAE app in dev and upload it to prod, or should you scrape in prod?

I have to run a scraping task to collect data for my App Engine (Java) app.
I'm not sure which is better: scrape data in development mode and upload it to prod, or scrape it while the app is running in production.
Does it make a difference?
Are there any difficulties with bringing large quantities of data from one environment to the other (dev->prod or prod->dev)?
The dev server itself probably isn't a great scraping tool; it's single-threaded, and (at least for Python; the Java implementation might be drastically different) the datastore is fairly horrible at storing large amounts of data.
However, depending on what you're scraping, the production servers might not be well suited to the task either; if the sites can take longer than 10 seconds to respond to a request, the urlfetch API will time out. If you can be sure this won't be a problem, it's probably more convenient to do the scraping in production and write directly to the datastore.
If not, it might make sense to do the scraping with a standalone tool and then put the data into the production datastore either with a RESTful web service or the remote API.
EDIT: The production servers can now set a 10-minute timeout on urlfetches initiated from the task queue or cron jobs, so these objections might no longer apply.
I find that spiders running in production often time out. Your solution of using the dev server is a good one, but also consider running each fetch through the task queue.
Look at this question on how to configure the remote API for Java to use the Python bulk data loader. You can also write a custom loader.
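For reference, writing to the production datastore from a standalone Java tool looks roughly like this with the App Engine Remote API. The app id and the ScrapedPage kind are placeholders, and note that older SDK versions used username/password credentials instead of useApplicationDefaultCredential():

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.tools.remoteapi.RemoteApiInstaller;
import com.google.appengine.tools.remoteapi.RemoteApiOptions;

public class BulkUploader {

    public static void main(String[] args) throws Exception {
        RemoteApiOptions options = new RemoteApiOptions()
                .server("your-app-id.appspot.com", 443)    // placeholder app id
                .useApplicationDefaultCredential();
        RemoteApiInstaller installer = new RemoteApiInstaller();
        installer.install(options);
        try {
            // Once installed, the standard datastore API talks to production.
            DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
            Entity page = new Entity("ScrapedPage");       // hypothetical kind
            page.setProperty("url", "http://example.com");
            datastore.put(page);
        } finally {
            installer.uninstall();
        }
    }
}
```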

Performance monitoring tools for a multi-tenant web application

We need to monitor the performance of our Java web app and are looking for tools that can help us with this task. The major difficulty is that we are a SaaS provider with a multi-tenant server architecture and hundreds of customers running on the same hardware. So far we have tried commercial products like DynaTrace and Coradiant, but unfortunately they haven't gotten the job done. What we need is a simple report that tells us whether we had performance problems on each customer's site in a specified period of time. Mostly this will be response time per customer, but we will also need some more specifics based on the URLs.
Please let me know if you have any experience setting up such monitoring.
Thanks!
Take a look at stagemonitor. It is an open source Java web application performance monitoring library capable of multi-tenancy. It captures response time metrics, JVM metrics, request details, and more. The overhead is very low. It uses the great time series database Graphite, which automatically downsamples historical datapoints, leading to low storage overhead.
You can find screenshots and more on the project site.
Note: I am the developer of stagemonitor
HypericHQ is nice for this because, being written in Java itself, it integrates quite nicely with all the MBean properties already exposed by your app server (see the sketch after this answer for the kind of JVM MBeans involved). You can set up administrator alerts/charts based on JVM properties and app-server MBean properties that most non-Java tools can't get at.
On the downside, it does like to run a relatively heavy (as these things go) agent on your server.
-I am not in any way affiliated with Hyperic Inc ;)
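For the curious, the JVM-level numbers such Java-aware monitors collect come from the platform MBeans, which you can also query yourself; a minimal sketch:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class JvmStats {

    public static void main(String[] args) {
        // The same JVM MBeans a Java-aware monitor reads via JMX.
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();

        System.out.println("Heap used: " + heap.getUsed() / (1024 * 1024) + " MB of "
                + heap.getMax() / (1024 * 1024) + " MB");
        System.out.println("Live threads: "
                + ManagementFactory.getThreadMXBean().getThreadCount());
    }
}
```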
