How to run crawler4j in a web container? - java

I am starting a web crawler and chose crawler4j. Due to a specific customer requirement, all code must run inside a Java EE WAR. Since most Java crawlers run in standalone mode, I would like to know the best approaches to running typical long-running processing jobs such as crawlers - specifically crawler4j, which has multi-threading capabilities - in a web context.
I've seen possible solutions like Spring Batch and Quartz, but I could not figure out which one fits this scenario better.
edit: I found a similar question: How to schedule crawler4j crawl control to run periodically? which has not been answered either
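One common pattern (not specific to crawler4j) is to tie the crawl to the web application's lifecycle with a ServletContextListener and let crawler4j's non-blocking start method run the crawler threads. The sketch below is only an outline under that assumption: it uses the crawler4j 4.x-style constructors, and the storage folder, seed URL, thread count and SiteCrawler class are placeholders you would replace.

```java
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import javax.servlet.annotation.WebListener;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

// Starts the crawl when the WAR is deployed and stops it when the WAR is undeployed.
@WebListener
public class CrawlerBootstrapListener implements ServletContextListener {

    private CrawlController controller;

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        try {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/crawler4j-storage"); // placeholder: any writable path

            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtServer robotstxtServer =
                    new RobotstxtServer(new RobotstxtConfig(), pageFetcher);

            controller = new CrawlController(config, pageFetcher, robotstxtServer);
            controller.addSeed("http://www.example.com/"); // placeholder seed

            // startNonBlocking returns immediately, so deployment is not held up
            controller.startNonBlocking(SiteCrawler.class, 5 /* crawler threads */);
        } catch (Exception e) {
            throw new RuntimeException("Could not start crawler", e);
        }
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) {
        if (controller != null) {
            controller.shutdown();        // ask the crawler threads to stop
            controller.waitUntilFinish(); // wait for them before the container unloads classes
        }
    }

    // Minimal crawler: override visit() with your own page handling.
    public static class SiteCrawler extends WebCrawler {
        @Override
        public void visit(Page page) {
            // process the fetched page (store it, index it, etc.)
        }
    }
}
```

If the crawl has to run on a schedule rather than once per deployment, the same listener can start a Quartz scheduler or a ManagedExecutorService instead of kicking the controller off directly.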

Related

Should I be using a distributed system like Mesos?

I have a project which briefly is as follows: Create an application that can accept tasks written in Java that perform some kind of computation and run the tasks on multiple machines* (the tasks are separate and have no dependency on one another).
*the machines could be running different OSs (mainly Windows and Ubuntu)
My question is, should I be using a distributed system like Apache Mesos for this?
The first thing I looked into was Java P2P libraries/frameworks, and the only one I could find was JXTA (https://jxta.kenai.com/), which has been abandoned by Oracle.
Then I looked into Apache Mesos (http://mesos.apache.org/), which seems to me like a good fit: an underlying system that can run on multiple machines and lets them share resources while processing tasks. I have spent a little while trying to get it running locally as an example, but it seems slightly complicated and takes forever to get working.
If I should use Mesos, would I then have to develop a framework for my project that runs all of my Java tasks, or are there existing solutions out there?
To test it on a small scale locally, would you install it on your machine, set that up as a master, create a VM, install it on that and make it a slave, somehow routing the slave to the master? The documentation and examples don't show exactly how to hook up a slave on the network to a master.
Thanks in advance, any help or suggestions would be appreciated.
You can definitely use Mesos for the task that you have described. You do not need to develop a framework from scratch; instead, you can use a scheduler like Marathon if you have long-running tasks, or Chronos for one-off or recurring tasks.
For a real-life setup you would definitely want more than one machine, but you might as well run everything (the Mesos master, a Mesos slave, and the frameworks) off a single machine if you're only interested in experimenting. The Examples section of the Mesos Getting Started Guide demonstrates how to do that.

Ruby on Rails with JRuby

I'm working on a Ruby on Rails app, currently hosted on Heroku.
We have about 5 web dynos and about 2 worker processes running on average. But because we're using adeptscale these numbers can change a lot, and the cost is increasing from month to month.
We're thinking about changing the process and the infrastructure (running our own on Amazon/Google etc.). Also, because of the performance, access to Java libraries and other gains, we're planning to go with JRuby.
I haven't got much experience with JRuby at all, but I do have Java experience. So I have a few questions:
Question intro: Rails' philosophy/approach differs from Java's, i.e. a Ruby web server uses far less memory but can only process one request at a time, so having multiple servers sort of compensates for the inability to process multiple requests concurrently.
If we go with JRuby (and have our Rails project packaged as a WAR file and deployed to any servlet container, e.g. Tomcat or JBoss (more than just a container)), will we be able to process multiple requests then?
Question intro: Currently we have some application logic running in the workers (instead of blocking the web server and being unable to serve other browser clients). E.g. when a user submits a form and our app needs to contact a 3rd-party service before returning the response, we simply let the worker do the work of calling the 3rd-party service and then update the UI (which shows a waiting status) via WebSockets with whatever status the 3rd-party service returned.
If we switch to JRuby, how will we achieve similar logic? Do we go with Java code that has some kind of thread pool of workers, where free workers do the work of contacting the 3rd-party service, etc.? How would we go about this if we decide to go with JRuby?
1) You can serve multiple requests at a time in JRuby with nearly any container, but you can also serve multiple requests at a time with MRI Ruby. You only need a thread-safe app (config.threadsafe! is the default in Rails 4). Different Rack servers have different approaches to serving multiple requests at a time. For example, Unicorn uses multiple processes while Passenger or Puma go for a multi-threaded approach.
In my experience, containers like JBoss or Tomcat are more complicated to configure properly for JRuby, but there are things like TorqueBox and Trinidad that help you with this. You can also still go for some of the Ruby servers (e.g. Puma) that don't use C extensions.
2) If I understand you correctly, you are looking for a background-processing library? You can use Sidekiq or Resque with Ruby or JRuby (JRuby will generally be faster, and it's easier to debug memory leaks). You can even use MRI Ruby for your Rack servers and JRuby for your workers (they can even run in parallel with things like rvm/rbenv).
In general, I would only go for the JRuby option if you know what you are doing and need better performance for your app servers, or if you want to speed up your worker servers. If I were you, I would probably stay in the Ruby world and use Puma for your app and Sidekiq as a background service. Both are very elegant and don't need much configuration.
Yes, JRuby uses Java threads and is really multithreaded. And I can say that it's really good at integrating with Java, even when calling Java classes directly.
I can recommend the following servers (some have already been mentioned):
Puma (https://github.com/puma/puma)
any servlet container (even IBM WebSphere Application Server!) - just use Warbler (https://github.com/jruby/warbler)
The 'simplest' way to run an application on a servlet container is to make a .war with Warbler. The resulting .war file usually includes all dependencies and the JRuby interpreter, so it typically comes to about 30 MB. But I don't think Warbler is that easy to set up, so I wouldn't recommend this route unless you really need to run Rails in an enterprise Java environment.
And I would just note that Rails opens a DB connection for every request, so the default DB connection pool size of 5 isn't enough - don't forget to increase it before load testing :) (e.g. the default thread pool for Puma is 16, IBM WAS is 50, and Tomcat is 200 threads).
I agree with smallbutton.com that Puma is a good choice. Finally, with Puma you can switch between JRuby and other interpreters almost seamlessly (in my experience there is one difference - gem names).

How to build a distributed Java application?

First of all, I have a conceptual question: does the word "distributed" only mean that the application runs on multiple machines? Or are there other ways an application can be considered distributed (for example, if there are many independent modules interacting together but on the same machine, is this distributed?).
Second, I want to build a system that executes four types of tasks. There will be multiple customers, and each one will have many tasks of each type to be run periodically. For example: customer1 will have task_type1 today, task_type2 after two days and so on, and there might be a customer2 whose task_type1 has to be executed at the same time as customer1's task_type1, i.e. there is a need for concurrency. The configuration for executing the tasks will be stored in the DB, and the outcomes of these tasks will be stored in the DB as well. The customers will use the system from a web browser (HTML pages) to interact with it (basically, to configure tasks and see the outcomes).
I thought about using a REST web service (using JAX-RS) that the HTML pages would communicate with, and using threads on the backend for concurrent execution.
Questions:
1. This sounds simple, but am I going in the right direction? Or should I be using other technologies or concepts, like Java Beans for example?
2. If my approach is fine, do I need to use a scripting language like JSP, or can I submit HTML forms directly to the REST URLs and get the result (using JSON for example)?
3. If I want to make the application distributed, is it possible with my idea? If not, what would I need to use?
Sorry for asking so many questions, but I am really confused about this.
I just want to add one point to the already posted answers. Please take my remarks with a grain of salt, since all the web applications I have ever built have run on one server only (aside from applications deployed to Heroku, which may "distribute" your application for you).
If you feel that you may need to distribute your application for scalability, the first thing you should think about is not web services and multithreading and message queues and Enterprise JavaBeans and...
The first thing to think about is your application domain itself and what the application will be doing. Where will the CPU-intensive parts be? What dependencies are there between those parts? Do the parts of the system naturally break down into parallel processes? If not, can you redesign the system to make it so? IMPORTANT: what data needs to be shared between threads/processes (whether they are running on the same or different machines)?
The ideal situation is where each parallel thread/process/server can get its own chunk of data and work on it without any need for sharing. Even better is if certain parts of the system can be made stateless -- stateless code is infinitely parallelizable (easily and naturally). The more frequent and fine-grained data sharing between parallel processes is, the less scalable the application will be. In extreme cases, you may not even get any performance increase from distributing the application. (You can see this with multithreaded code -- if your threads constantly contend for the same lock(s), your program may even be slower with multiple threads+CPUs than with one thread+CPU.)
The conceptual breakdown of the work to be done is more important than what tools or techniques you actually use to distribute the application. If your conceptual breakdown is good, it will be much easier to distribute the application later if you start with just one server.
The term "distributed application" means that parts of the application system will execute on different computational nodes (which may be different CPU/cores on different machines or among multiple CPU/cores on the same machine).
There are many different technological solutions to the question of how the system could be constructed. Since you were asking about Java technologies, you could, for example, build the web application using Google's Web Toolkit, which will give you a rich browser based client user experience. For the server deployed parts of your system, you could start out using simple servlets running in a servlet container such as Tomcat. Your servlets will be called from the browser using HTTP based remote procedure calls.
Later, if you run into scalability problems, you can start to migrate parts of the business logic to EJB3 components, which can themselves ultimately be deployed on many computational nodes within the context of an application server, like Glassfish, for example. I don't think you need to tackle this problem until you run into it, and it is hard to say whether you will without knowing more about the nature of the tasks the customers will be performing.
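To make the question's JAX-RS-plus-threads idea a bit more concrete, here is a minimal sketch of what a task-submission endpoint could look like. TaskResource, the pool size and the JSON handling are illustrative assumptions only; in a full Java EE server you would normally use a container-managed executor rather than a static pool.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import javax.ws.rs.Consumes;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

// Hypothetical resource: the browser posts a JSON task configuration,
// the resource persists it and hands execution off to a shared pool.
@Path("/tasks")
public class TaskResource {

    // A static pool keeps the sketch short; a managed executor is preferable in practice.
    private static final ExecutorService POOL = Executors.newFixedThreadPool(8);

    @POST
    @Consumes(MediaType.APPLICATION_JSON)
    public Response submit(String taskJson) {
        POOL.submit(() -> {
            // 1. parse taskJson and load the full configuration from the DB
            // 2. run the task (one of the four types)
            // 3. write the outcome back to the DB
        });
        // 202 Accepted: the task was queued, the outcome shows up in the DB later
        return Response.accepted().build();
    }
}
```

The HTML page can then poll a GET resource (or simply reload the outcomes page) to display results, which also keeps the web tier stateless and easier to distribute later.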
To answer your first question - you could get the form to submit directly to the REST URLs. Obviously it depends on your exact requirements.
As @AlexD mentioned in the comments above, you don't always need to distribute an application. However, if you wish to do so, you should probably consider looking at JMS, which is a messaging API that allows you to run almost any number of worker machines, reading messages from the message queue and processing them.
If you wanted to produce a dynamically distributed application, to run on, say, multiple low-resource VMs (such as Amazon EC2 Micro instances) or physical hardware that can be added and removed at will to cope with demand, then you might wish to consider integrating it with Project Shoal, which is a Java framework that allows for clustering of application nodes and having them appear/disappear at any time. Project Shoal uses JXTA and JGroups as the underlying communication protocol.
Another route could be to distribute your application using EJBs running on an application server.
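If the JMS route sounds interesting, a worker could be as small as the sketch below. This is only an illustration of the classic javax.jms API: the JNDI names, the queue and the message format are placeholders, and the ConnectionFactory would come from whatever broker you deploy (ActiveMQ, HornetQ, etc.).

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.naming.InitialContext;

// Worker process: reads task messages from a queue and processes them.
// Any number of these workers can run on any number of machines.
public class TaskWorker {

    public static void main(String[] args) throws Exception {
        InitialContext jndi = new InitialContext();
        // JNDI names are provider-specific placeholders
        ConnectionFactory factory = (ConnectionFactory) jndi.lookup("jms/ConnectionFactory");
        Queue queue = (Queue) jndi.lookup("jms/taskQueue");

        Connection connection = factory.createConnection();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageConsumer consumer = session.createConsumer(queue);
        connection.start();

        while (true) {
            Message message = consumer.receive();          // blocks until a task arrives
            if (message instanceof TextMessage) {
                process(((TextMessage) message).getText()); // application-specific work
            }
        }
    }

    private static void process(String payload) {
        System.out.println("Processing task: " + payload);
    }
}
```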

Background tasks in Axis2 - Tomcat stack

I am working on a Java server side application that needs to provide a SOAP service. For this, we are using Axis2 and deploy in a Tomcat 6 installation.
We have the following issue: we need to run a couple of background threads; one to periodically query another web service for changes in provided data and a second one to monitor and consume data in an MQ.
My question is, what is the best Java EE practice for running these background tasks? Should we just run them as background threads that we somehow tell Tomcat to start at startup? Is there a better way than spawning threads from inside the web app container?
The system is not large enough to justify breaking it into smaller parts (e.g. running the background tasks in a system daemon, with the web service part being a separate stateless component querying that daemon). For the same reason we do not have the option to run within a full app server like JBoss (would that make any difference?).
Thanks!
UPDATE:
On a supplementary question, if we just spawned new threads for these tasks (and assuming that this is not common practice), would Tomcat (or Axis) be made more unstable or have any other issues?
I would suggest using Quartz Scheduler for this kind of thing. It's simpler than managing threads yourself and, of course, more flexible to use. There are hooks that run during the startup of Tomcat or Axis2, so you can start the scheduler there.
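As a rough illustration of that suggestion, the sketch below starts Quartz from a ServletContextListener, which is the usual Tomcat startup hook (on Tomcat 6 / Servlet 2.5 you register the listener in web.xml rather than with an annotation). It assumes the Quartz 2.x builder API; PollingJob and the five-minute interval are placeholders.

```java
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.SimpleScheduleBuilder;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

// Register this class as a <listener> in web.xml so Tomcat starts and stops the scheduler.
public class SchedulerBootstrap implements ServletContextListener {

    private Scheduler scheduler;

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        try {
            scheduler = StdSchedulerFactory.getDefaultScheduler();

            JobDetail job = JobBuilder.newJob(PollingJob.class)
                    .withIdentity("pollRemoteService")
                    .build();

            Trigger trigger = TriggerBuilder.newTrigger()
                    .startNow()
                    .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                            .withIntervalInMinutes(5)   // placeholder polling interval
                            .repeatForever())
                    .build();

            scheduler.scheduleJob(job, trigger);
            scheduler.start();
        } catch (Exception e) {
            throw new RuntimeException("Failed to start Quartz", e);
        }
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) {
        try {
            if (scheduler != null) {
                scheduler.shutdown(true); // wait for running jobs before Tomcat unloads the app
            }
        } catch (Exception e) {
            // the container is shutting down anyway; just log in a real application
        }
    }

    // The actual background work: query the other web service / drain the MQ.
    public static class PollingJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            // call the remote service, consume the MQ, etc.
        }
    }
}
```

Shutting the scheduler down in contextDestroyed also addresses the stability worry from the update: orphaned threads that outlive a redeploy are what typically make Tomcat unstable.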

Daemon in Java: simple schedule application?

This app must connect to a web service, grab data, and save it in the database.
Every hour 24/7.
What's the most effective way to create such an app in Java?
How should it be run - as a system application or as a web application?
Keep it simple: use cron (or task scheduler)
If that's all you want to do, namely probe some web service once an hour, do it as a console app and run it with cron (a minimal sketch follows below).
An app that starts and stops every hour
cannot leak resources
cannot hang (maybe you lose one cycle)
consumes 0 resources 99% of the time
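Here is a minimal sketch of that console-app approach. The service URL, the JDBC URL/credentials and the table are placeholders, and the crontab line in the comment assumes the app is packaged as an executable jar.

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Scanner;

// Fetch once, store once, exit. Cron (or Task Scheduler) handles the "every hour" part,
// e.g. with a crontab line such as:  0 * * * * java -jar fetcher.jar
public class HourlyFetcher {

    public static void main(String[] args) throws Exception {
        // 1. Grab data from the web service (placeholder URL)
        String payload;
        try (InputStream in = new URL("https://example.com/api/data").openStream();
             Scanner scanner = new Scanner(in, StandardCharsets.UTF_8.name()).useDelimiter("\\A")) {
            payload = scanner.hasNext() ? scanner.next() : "";
        }

        // 2. Save it to the database (placeholder JDBC URL and table)
        try (Connection db = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/mydb", "user", "password");
             PreparedStatement stmt = db.prepareStatement(
                     "INSERT INTO snapshots(fetched_at, payload) VALUES (now(), ?)")) {
            stmt.setString(1, payload);
            stmt.executeUpdate();
        }
        // 3. Exit: no long-lived process, nothing to leak.
    }
}
```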
Look at Quartz; it's a scheduling library for Java, and they have sample code to get you started.
You'd need that and the JDBC driver for your database of choice.
No web container is required - this can easily be done with a standalone application.
Try the ScheduledExecutorService.
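A minimal sketch of that approach is below: the process stays alive and the executor runs the fetch-and-store work once an hour. The fetchAndStore body is a placeholder. One caveat worth a comment: scheduleAtFixedRate silently cancels the task if it ever throws, so swallow exceptions inside the job.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Keeps one long-lived process alive and runs the fetch-and-store job every hour.
public class HourlyScheduler {

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        scheduler.scheduleAtFixedRate(() -> {
            try {
                fetchAndStore(); // call the web service and save the result to the database
            } catch (Exception e) {
                // never let an exception escape, or the scheduled task is silently cancelled
                e.printStackTrace();
            }
        }, 0, 1, TimeUnit.HOURS);
    }

    private static void fetchAndStore() {
        // application-specific work goes here
    }
}
```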
Why not use cron to start the Java application every hour? There's no need to soak up server resources keeping the Java application active if it's not doing anything the rest of the time; just start it when needed.
If you are intent on doing it in Java, a simple Timer would be more than sufficient.
Create a web page and schedule its execution with one of the many online scheduling services. The majority of them are free, very simple to use and very reliable. Some allow you to create schedules of any complexity, just like cron, the SQL Server job UI, etc. This saves you a LOT of headache creating/debugging/maintaining your own scheduling engine, even if it's based on a framework like NCron, Quartz, etc. I'm speaking from my own experience.
