I'm currently working on a web application which needs to import data and do some processing. This can take some time (probably in the "several minutes" range, once the data sets grow), so we're running it in the background - and now the time has come to show status in the frontend, instead of tailing log files :)
The frontend is using Angular, hooked up to REST endpoints (JAX-RS) calling services in EJBs that do persistence via JPA. Running on JBoss EAP 6.4 / AS 7.5 (EE6). Standard stuff, but this is the first time I'm dealing with Java EE.
With regard to querying status, polling a REST endpoint periodically is fine - we don't need fancy stuff like websockets. We do need to support multiple background jobs, though, with status information consisting of run state (running/done/error), progress and a list of errors.
So, I currently have two questions:
1) Is there a more suitable way of launching a background task than calling an @Asynchronous EJB method?
2) Which options do I have for keeping track of the background tasks, and which is most suitable?
My first idea was to keep a HashMap, but that quickly ended up looking like too much manual (and fragile-looking) code with concurrency and lifetime concerns - and I prefer not reinventing the wheel. The safe choice seems to be persisting it via JPA, but that seems somewhat clumsy for volatile status information.
I'm obviously not the first person facing these issues, but my google-fu seems to be lacking at the moment :)
The tasks could be launched using @Asynchronous, or by using a JMS @MessageDriven bean.
From Java EE 7 on, ManagedExecutorService is also an option.
The tasks would then update their own state that is stored in a ConcurrentHashMap inside a @Singleton EJB.
If you are in a clustered environment, task state is better stored using JPA, as a @Singleton is per server instance, not shared across the whole cluster.
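A minimal sketch of that approach, assuming a hypothetical JobStatusRegistry and plain String statuses (two classes shown together for brevity; in a real project they would be separate files):

// JobStatusRegistry.java - one application-wide map of job id -> status
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import javax.ejb.Singleton;

@Singleton
public class JobStatusRegistry {

    private final ConcurrentMap<String, String> statusByJobId =
            new ConcurrentHashMap<String, String>();

    public void update(String jobId, String status) {
        statusByJobId.put(jobId, status);
    }

    public String get(String jobId) {
        return statusByJobId.get(jobId);
    }
}

// ImportService.java - launched from the REST layer, runs in the background
import javax.ejb.Asynchronous;
import javax.ejb.EJB;
import javax.ejb.Stateless;

@Stateless
public class ImportService {

    @EJB
    private JobStatusRegistry registry;

    @Asynchronous
    public void runImport(String jobId) {
        registry.update(jobId, "RUNNING");
        try {
            // ... import and process the data, updating progress as you go ...
            registry.update(jobId, "DONE");
        } catch (Exception e) {
            registry.update(jobId, "ERROR: " + e.getMessage());
        }
    }
}

The REST status endpoint then just reads from the registry; for multiple jobs you would extend the value type to carry progress and a list of errors.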
I am trying to determine the best way to implement handling of long-running batch jobs in Spring MVC. I came across Akka in my searching, as a non-blocking framework for async processing, which is preferred because I don't want the batch processing to eat up all the threads from the thread pool.
Essentially, what I will be doing is having a job that runs on a set schedule, goes out and calls various web services, processes the data, and persists it.
I have seen some code examples of using it with Spring, but I've never seen it used with a cron-type scheduler; it always seems to use a fixed time period.
I'm not sure if this is even the best approach to handling large scale batch processing within Spring. Any suggestions or links to good Akka Spring resources are welcome.
I would suggest you look into the Spring Integration and Spring Batch projects. The first one allows you to configure chains of services using EIP. We used it in our project to fetch files from FTP, deserialize and process them, import them into the DB, send emails if required, etc. - all on a schedule. The second one is more straightforward and basically provides a framework for working on rows of data. Both are configurable with Quartz and integrate into a Spring MVC project nicely.
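On the cron point from the question: a minimal sketch of launching a Spring Batch job on a cron schedule using Spring's own @Scheduled support (Quartz can be substituted). Here jobLauncher and importJob are assumed to be beans you define elsewhere:

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// @EnableScheduling must be present on one of your @Configuration classes.
@Component
public class NightlyImportScheduler {

    @Autowired
    private JobLauncher jobLauncher;

    @Autowired
    private Job importJob; // the Spring Batch job defined elsewhere

    // Full cron expression: run every day at 02:30.
    @Scheduled(cron = "0 30 2 * * *")
    public void launchImport() throws Exception {
        jobLauncher.run(importJob, new JobParametersBuilder()
                .addLong("runAt", System.currentTimeMillis())
                .toJobParameters());
    }
}

Because each run gets a fresh "runAt" parameter, Spring Batch treats it as a new job instance rather than a restart of the previous one.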
Context
I'm in the process of drawing up a solution to migrate a huge PL/SQL system to Java. The initial step is migrating some ETL jobs that:
Read CSV, XML, XLS (which is a new requirement) and positional files from several FTP / SFTP sources
Process the files according to rules stored in the database and write the results to a database table.
Currently this is done by several stored procedures and jobs.
My company is open to suggestions (if it can run in GlassFish 4 and share its logging and connection pool mechanisms, as well as the admin console, it is a plus).
I've done a little bit of research and the following options caught my eye:
Java EE 7 Batch Processing, sounds simple and particularly well fitted for GlassFish 4.
Spring Batch somewhat more mature and very similar to the Java EE 7 standard (which was probably based on it).
Apache Camel, sounds powerful and would spare us a lot of fiddling with libraries such as Apache POI, but it also looks somewhat complex. Also, I'm not sure if it is the best fit for the job (ETL over huge files).
Cook everything up myself. I could create an Application Client to run a Quartz / Spring scheduler, or even EJB Timers.
While I'm still open to suggestions (recommendations would be nice), the best fit so far seems to be Java EE 7 Batch Processing.
One more thing: the infrastructure team has a solution for moving files from every FTP source to a local directory, so FTP is really not an issue.
Problem
I've read several tutorials about Java EE Batch Processing and, in all of them, some kind of Servlet or EJB Timer is responsible for starting the Jobs:
// Start the job named "job" (defined in META-INF/batch-jobs/job.xml)
JobOperator jobOperator = BatchRuntime.getJobOperator();
jobOperator.start("job", properties);
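(Expanded into a timer, that typically looks something like the following hedged sketch; the directory path, job name, and property key are placeholders of mine, not from any tutorial.)

import java.io.File;
import java.util.Properties;
import javax.batch.operations.JobOperator;
import javax.batch.runtime.BatchRuntime;
import javax.ejb.Schedule;
import javax.ejb.Singleton;

@Singleton
public class InputDirectoryPoller {

    // Fires once a minute; persistent=false so missed runs are not replayed.
    @Schedule(hour = "*", minute = "*", persistent = false)
    public void pollForFiles() {
        File[] files = new File("/data/incoming").listFiles();
        if (files == null) {
            return;
        }
        JobOperator jobOperator = BatchRuntime.getJobOperator();
        for (File f : files) {
            Properties props = new Properties();
            props.setProperty("inputFile", f.getAbsolutePath());
            jobOperator.start("job", props);
            // A real job would move or rename f so it is not picked up again.
        }
    }
}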
I could easily upload a web / ejb project and keep polling for changes. But I was thinking about a push model:
Application client console application
Main class watches directories for new files
When there is a new file it would start a new job.
My doubts are:
Is this strategy possible/ advisable?
Will I need a JMS queue or some kind of producer / consumer strategy in the middle, or should I just call jobOperator.start for every file and trust the batch processing layer to manage the application resources? In other words, if a thousand files are delivered to my folder at once and I call jobOperator.start a thousand times, will GlassFish 4 do some kind of smart enqueuing, or should I create some kind of gate so that no more than n jobs run simultaneously?
I've already implemented a project with Batch Processing on WildFly (JBoss AS). I'm not familiar with the configuration details on GlassFish (not using it anymore because they've dropped enterprise support), but I can give you some insights and guidelines based on my experience. Also, please note that Spring and the Batch spec in EE 7 are quite similar, and your decision to use either technology should depend on "what else" you want to achieve with your application besides the batching. Do you want an easily maintained web interface? Do you want to develop a REST API? Etc.
The ETL jobs you're describing fit perfectly with the steps-and-chunks model in the EE 7 spec, so if you've already tried to develop some tests, you may have noticed that you still need to code the file readers and mappers for each file specification. Your reading sources are quite standard, and you will easily find a library to read/stream them and process their data.
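To make the chunk model concrete, a minimal reader sketch against the EE 7 API, assuming a hypothetical one-item-per-line CSV and a reader property named inputFile (the processor and writer follow the same pattern):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.Serializable;
import javax.batch.api.BatchProperty;
import javax.batch.api.chunk.AbstractItemReader;
import javax.inject.Inject;
import javax.inject.Named;

@Named("csvLineReader") // referenced by this name from the job XML
public class CsvLineReader extends AbstractItemReader {

    @Inject
    @BatchProperty(name = "inputFile")
    private String inputFile; // set in the job XML, e.g. from #{jobParameters['inputFile']}

    private BufferedReader reader;

    @Override
    public void open(Serializable checkpoint) throws Exception {
        reader = new BufferedReader(new FileReader(inputFile));
    }

    @Override
    public Object readItem() throws Exception {
        return reader.readLine(); // returning null ends the chunk loop
    }

    @Override
    public void close() throws Exception {
        if (reader != null) {
            reader.close();
        }
    }
}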
The project I've implemented is quite simple. Customers upload files that need to be processed in order to feed a data warehouse. This service is in the "cloud". Files have a defined spec and must be in CSV format. Most processing results are dimensional "upserts" and fact-table "delete before insert" operations. The user has a Web interface on which files and batch processing metadata must be shown (processing state, dates, rejected items, etc.). Because it is a cloud service, the files must not reside locally on each server (we use S3).
So the first thing to design is the chunk steps. I didn't want to have an implementation for each file spec, so what I did was design a "fits all cases" implementation that processes files according to the metadata contained in them and the job configuration itself. This is the easy part. The second thing to think about is the processing and metadata administration. Here, I developed a REST API and a Web interface that uses it. After all this, will it scale? WildFly has thread configuration parameters for batch processing, and you can increase or decrease the thread availability for the JobOperator. Jobs are not submitted if there are not enough threads available. So what happens to those requests? Well, they can reside in memory, a backed-up stateful session can be developed, or you can implement an MQ listener for queued processing requests. What I did was much simpler: the company doesn't have the resources to maintain a cluster, so we did an elastic configuration that expands according to CPU consumption and request volume. So far, the application has processed 10 TB of data from 15 customers, and at the peak request/processing load, 3 elastic instances have fired up.
A file listener is an interesting idea. You can listen to a directory and drop a processing request onto a queue, or hand it immediately to the BatchRuntime. It will depend on how you want to scale it, your required response time, the available resources, etc.
Feel free to ask me anything.
Regards.
EDIT: forgot to mention. I don't really recommend using the Application Client unless you've already got something deployed in your organization. The recent security constraints and the Java SE update mechanism have made it a real hassle to maintain those kinds of deployments. Think web.
I would approach it this way.
My hammers for this use case would be the Java Watch Service, a Servlet, a JMS queue, and the Batch service.
First, the Watch Service is the Java 7 go-to facility for file system monitoring.
I would write a Watch Service implementation, and I would run it on a thread.
Where does the thread run you ask?
Officially, you should probably be using JCA for this. But JCA is flat out a pain to work with, underutilized, and thus under-documented. There are solid examples, but it's simply not a common technology in the Java EE stack.
Another place is an asynchronous Session Bean invocation. There's nothing that says these cannot be long-lived invocations. You could stand up a @Singleton Session Bean with @Startup, call the async method from a @PostConstruct method, and let it go. Then, in @PreDestroy, signal the long-running method to stop, so it can cleanly shut down. This should all be to spec, portable, and according to Hoyle.
The third place is to use a ServletContextListener, which is the pre-Java EE 6 go-to place for tying code into the life cycle of the application. Here, you would create the thread yourself in the contextInitialized method, and then tear it down in the contextDestroyed method.
Creating threads here is "less defined", but I've done it for years and never had a problem.
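To make the @Singleton / @Startup route above concrete, here is a minimal sketch (directory path and names are mine; note the self-injection so the @Asynchronous call actually goes through the container proxy, which matters, as noted further down):

import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.concurrent.TimeUnit;
import javax.annotation.PostConstruct;
import javax.annotation.PreDestroy;
import javax.ejb.Asynchronous;
import javax.ejb.ConcurrencyManagement;
import javax.ejb.ConcurrencyManagementType;
import javax.ejb.EJB;
import javax.ejb.Singleton;
import javax.ejb.Startup;

@Singleton
@Startup
@ConcurrencyManagement(ConcurrencyManagementType.BEAN) // the long-running loop manages its own state
public class IncomingFileWatcher {

    @EJB
    private IncomingFileWatcher self; // call through the proxy so @Asynchronous applies

    private volatile boolean running = true;

    @PostConstruct
    public void start() {
        self.watch();
    }

    @PreDestroy
    public void stop() {
        running = false; // signals the loop below to exit cleanly
    }

    @Asynchronous
    public void watch() {
        Path incoming = Paths.get("/data/incoming");
        try (WatchService watcher = incoming.getFileSystem().newWatchService()) {
            incoming.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            while (running) {
                WatchKey key = watcher.poll(1, TimeUnit.SECONDS);
                if (key == null) {
                    continue;
                }
                for (WatchEvent<?> event : key.pollEvents()) {
                    Path newFile = incoming.resolve((Path) event.context());
                    // ... handle newFile here: move it to "processing" and post it
                    // to the JMS queue, as described below ...
                }
                key.reset();
            }
        } catch (Exception e) {
            // log and exit; either the container is shutting down or the directory is gone
        }
    }
}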
Now that you have your service running, the service (IMHO), will do two things.
1) It'll sense when a new file has arrived in the directory, and when it does, it will MOVE (mv, rename) the file to a parallel "processing" directory. The reason is that this tells you that a file has moved from incoming to processing, that the file is a work in progress. It's obvious from a directory listing, regardless of what the backend thinks it's doing. Remember, the system can go down mid way through a file.
2) Once moved, post the file name and any other metadata onto a JMS queue and have an MDB tool up the batch job.
Why add the JMS queue? It brings a couple of features to the party. First, it's a great way to get stuff "from outside" the happy transactional context that EJB likes to inside one. Second, it's transactional. You can, depending on your ETL use case, have the MDB directly process the job. And by doing so, you simply do not acknowledge the message from the queue until the processing is done (and the file is deleted or moved from the "processing" directory). In an ideal world, the message queue has messages matching the files in the processing directory. When the processing is done, the method returns, the message fetch "commits", and you're done. If the system crashes, this will restart from the beginning automatically (since the message is still on the queue and was never removed).
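A hedged sketch of that MDB gateway; the queue name, job name, and property key are placeholders of mine:

import java.util.Properties;
import javax.batch.runtime.BatchRuntime;
import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.TextMessage;

@MessageDriven(activationConfig = {
    @ActivationConfigProperty(propertyName = "destinationType",
                              propertyValue = "javax.jms.Queue"),
    @ActivationConfigProperty(propertyName = "destinationLookup",
                              propertyValue = "java:/jms/queue/etlFiles")
})
public class EtlFileListener implements MessageListener {

    @Override
    public void onMessage(Message message) {
        try {
            String fileName = ((TextMessage) message).getText();
            Properties props = new Properties();
            props.setProperty("inputFile", fileName);
            BatchRuntime.getJobOperator().start("etl-job", props);
        } catch (Exception e) {
            // Rethrowing rolls back the receive, so the message is redelivered later.
            throw new RuntimeException(e);
        }
    }
}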
The MDB, by configuring its instances, can also gate the number of simultaneous jobs. Configure 10 instances, and only 10 files can be processed at the same time. But this can be a little too simple, too coarse. There's no priority, for example (it's first come, first served). But it might work for you.
But either way, the MDB is a great gateway into the system, since each one starts with its own little bit of transactional context. Unlike the long-running servlet thread or the long-running async thread: the servlet thread has a questionable (if any) transactional status, and the long-running thread inherits its state from the @Startup method and retains it for its lifetime. The MDB gets a new one each time. Much of this can be shenaniganed away by calling methods with new transactions.
But I like the demarcation of the MDB. Even if its entire task is to create the Batch entry for a file name, the MDB is a good gatekeeper.
And that's pretty much it.
The key parts are being a good citizen and tearing down your thread properly tied to the lifecycle of the application, understanding your transactional state at the various components, and understanding how all the moving parts fit together.
If you use the @Startup technique, make sure you invoke your async method through an injected instance of your session bean. Otherwise the invocation will be a local call, and not asynchronous. You'll stare at it wondering why your server is hanging and not starting up. All of the EJB annotations only work when invoked through an injected or looked-up proxy.
Have fun, share and enjoy.
Addenda to the question:
There's really no value in having an external process manage the watch service. One tied to the lifecycle of the server is easier to maintain. Two things come to mind. If the server is down, files will simply stack up in the file system until the server is started again, so you don't lose data. If you have an external service, then you either have it sending messages to a dead server, or you have to stage and manage the JMS server separately from the app server. In that case you now have 3 processes to manage: watch service, JMS server, and app server, rather than just the app server.
I agree with the other poster that, should you decide to go with an external service anyway, a simple Java SE app posting simple messages to a JAX-RS REST service on the server, or even a trivial Servlet, is much, MUCH easier to maintain, stage and deploy than an app client. If you do it that way, you could write the watch service in something completely different.
But since the server (ostensibly) has direct access to the file system with the file, there's really no motivation to break this service out of the container. Put the whole kit into an EAR and have at it. Just flat-out easier management.
I read somewhere about the use of web services in apps. After a lot of research, I am able to create a web service which accepts both JSON and JSONP formats for request and response. I developed the web services using Java, Apache Axis2, Hibernate, and MySQL as the database. There are a few problems and I don't know how to solve them:
Insert or delete operations: sometimes, if more than two users call the service at the same time to insert or delete a row, the queries go into sleep mode, and the next time someone tries to call that service they can't. According to the server log, the error is an SQL lockout state. If I check the process list in MySQL, it shows the query as Sleep, and I have to kill it to resume.
The performance of the web service doesn't seem to be up to the mark; it takes more time than, in my experience, it should. In simple words, how do I obtain better performance from the services?
How do I implement a security feature such that when a user logs in, he/she is provided an ID, and that ID is validated so that unauthorized access is prevented?
Or just guide me: what would be the most appropriate and optimized web service methodology to use with Java?
The answer to this question is not specific to Android. Below are my investigations, which might be useful for you.
For the point about MySQL connections going to sleep mode, you can do the following.
Debug the datasource used by Hibernate, try to increase the pool size & check for any issues in it.
Define a timeout period for connections. JBoss has several configurations related to this like blocking-timeout-millis, idle-timeout-minutes etc.
Declare a mechanism to periodically validate the connections in the pool for activeness. You can explore OracleStaleConnectionChecker for options.
Configure minimum connections in the pool. This is important because when all the stale connections are discarded, the empty pool needs to be pre-filled & ready with active connections. (A pool-configuration sketch follows this list.)
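As an illustration of those same knobs (sketch only: your pool is most likely a container-managed datasource, so the real settings belong in its -ds.xml / datasource configuration, and all names and values below are placeholders), here is roughly what they look like if you manage the pool yourself with something like HikariCP:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolFactory {

    public static HikariDataSource create() {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl("jdbc:mysql://localhost:3306/mydb");
        cfg.setUsername("app");
        cfg.setPassword("secret");
        cfg.setMaximumPoolSize(20);             // pool size
        cfg.setMinimumIdle(5);                  // keep the pool pre-filled
        cfg.setIdleTimeout(60_000);             // ms before an idle connection is retired
        cfg.setConnectionTimeout(30_000);       // ms to wait for a free connection before failing
        cfg.setConnectionTestQuery("SELECT 1"); // staleness / activeness check
        return new HikariDataSource(cfg);
    }
}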
Coming to the performance of insert/delete operations & the SQL lockout state, try to re-order the sequence of the queries you fire at the DB on every request. This may not be a deadlock situation, but sequencing DB queries consistently will definitely lead to less lockout time and better performance.
This answer may be of use for you. Hibernate: Deadlock found when trying to obtain lock
The web services you have developed may require some performance optimization to bring them up to the mark. Below are the first few steps you can take to improve performance.
Avoid nested loops. Every extra level of nesting over the iterated list increases the order of the loop.
Avoid early initialization of objects; it may lead to long, unwanted GC cycles.
Apart from the above optimizations, there are several frameworks & tools at your service for evaluating code quality & performance. PMD, FindBugs, JMeter and Java profilers are a few of them, to name some.
Shishir
You are going to have to profile your server and see where the time is spent. I really like YourKit for doing thread profiling. VisualVM, which comes with the JDK, can also help.
There are all sorts of reasons your web service can be slow:
Latency from client to server
Handling the HTTP request on the server
Handling the HTTP response on the client
Making the database call (sounds like you already have some kind of locking / blocking going on there)
You are going to have to get markers to tell you how long it took to go from A to B to C to D back to C back to B back to A kind of thing. We would be speculating heavily from here on what is exactly going on in your program, but we can give you the ideas / tools to figure it out.
If you use YourKit, connect it to your server process. Have nothing else connecting to your server (for instance, your client is not sending requests). When a request does come in, you should see your accepting threads receive the HTTP request and then either delegate to a processing thread or do the processing themselves. You can use YourKit to see how much time is spent in different functions during that call.
Try it with your client making the call.
Try it using a simple HTTP request tool like wget, or maybe your IDE has a web service test tool (IntelliJ does, for instance), or you can download a simple HTTP test tool.
By testing it in a simple tool that just outputs the response, you can eliminate any client processing issues. You can also achieve a similar test in Chrome or Firefox and use the developer tools to see the time to fulfill the request.
In my experience, the framework for handling the requests and delegating can introduce some performance issues. I ripped Grails out of a production environment because of its performance issues (before any Grails / Groovy flames come my way, we were operating at a much higher rate than typical web applications, and I am sure Grails has made some headway in the last couple years... alas, it was not for my need at that time)
BTW, I doubt you are operating at a load where you will be critiquing the web service framework you chose to use. I have been happy with Spring MVC and DropWizard (Jersey JAX-RS), and Grails is easy to use too.
You should make a simple static-content response in your web service and see how quickly that returns vs. a request that makes a database call.
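For instance, a minimal static endpoint to time against (shown here with JAX-RS for brevity; the Axis2 equivalent is just a service method that returns a constant):

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

@Path("/ping")
public class PingResource {

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public String ping() {
        // No database work at all; compare this response time with a DB-backed call.
        return "{\"status\":\"ok\"}";
    }
}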
Also, what kind of table are you using in MySQL? InnoDB? MyISAM? They have different locking schemes. That could be causing your MySQL issue.
The key to all of it: break the problem up into parts, measure each, and eliminate parts one by one until you can say "every time I do X it is slower" (for example, every time I make a database call it's slower).
In Java, the way you will find the most support online (documentation/forums) is to develop the web service as a REST web service using Spring MVC.
You can base yourself on this resource and take it from there:
Spring MVC REST Hello World Web Service
Using Spring you can create a RESTful web service easily, and Spring does all the groundwork you need. As others have mentioned, you can consume the web service from any type of client - including Android.
A detailed guide is available here:
https://spring.io/guides/gs/rest-service/
Here are my suggestions:
Make each API either read from or write to the database, not both. If an API combines reading and writing, it can cause deadlocks;
Use a lightweight HTTP server. A heavyweight HTTP server will likely consume more resources.
Make use of threads. Having more threads could be helpful when you are facing a ton of users.
Make more things static. You could avoid unnecessary queries.
I think mhoglan's answer is detailed enough.
I'm building a web service that executes a database process (SQL code that runs several queries, then moves data between two really large tables). I'm assuming some processes might take 2 to 10 hours to execute.
What are the best practices for executing a long running database process from within a Java web service (it's actually REST-based using JAX-RS and Spring)? The process would be executed upon 1 web service call. It is expected that this execution would be done once a week.
Thanks in advance!
It's gotta be asynchronous.
Since your web service call is an RPC, best to have the implementation validate the request, put it on a queue for processing, and immediately send back a response that has a token or URL to check on progress.
Set up a JMS queue and register a listener that takes the message off the queue and persists it.
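A rough sketch of that accept-and-acknowledge shape with JAX-RS (the resource name, token scheme, and in-process executor are placeholders; swap in the JMS queue and listener as described above):

import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.core.Response;

@Path("/migrations")
public class MigrationResource {

    // In a container, prefer a ManagedExecutorService or the JMS listener over a raw executor.
    private static final ExecutorService WORKERS = Executors.newSingleThreadExecutor();

    @POST
    public Response start() {
        final String token = UUID.randomUUID().toString();
        WORKERS.submit(new Runnable() {
            @Override
            public void run() {
                runDatabaseProcess(token); // the multi-hour SQL work, recording progress under 'token'
            }
        });
        // 202 Accepted plus a URL the client polls for progress.
        return Response.accepted()
                       .header("Location", "/migrations/" + token)
                       .build();
    }

    private void runDatabaseProcess(String token) {
        // ... execute the long-running queries and persist status keyed by 'token' ...
    }
}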
If this is really taking 2-10 hours, I'd recommend looking at your schema and queries to see if you can speed it up. There's an index missing somewhere, I'd bet.
Where I work, I am currently evaluating different strategies for this exact situation, only the times are different.
With the times you state, you may be better served by using Publish/Subscribe message queuing (ActiveMQ).