We have configured a Storm cluster with one Nimbus server and three supervisors, and published three topologies that perform different calculations, as follows:
Topology1: Reads raw data from MongoDB, does some calculations, and stores the result back.
Topology2: Reads the result of Topology1, does some calculations, and publishes the results to a queue.
Topology3: Consumes the output of Topology2 from the queue, calls a REST service, gets the reply, updates the result in a MongoDB collection, and finally sends an email.
As a newbie to Storm, I am looking for expert advice on the following questions:
Is there a way to externalize all configuration, for example into a config.json that can be referenced by all topologies?
Currently the configuration to connect to MongoDB, MySQL, the MQ, and the REST URLs is hard-coded in Java files. It is not good practice to customize source files for each customer.
I want to log at each stage (spouts and bolts). Where should I put the log4j.xml so the cluster can use it?
Is it right to execute a blocking call, like a REST call, from a bolt?
Any help would be much appreciated.
Since each topology is just a Java program, simply pass the configuration into the Java Jar, or pass a path to a file. The topology can read the file at startup, and pass any configuration to components as it instantiates them.
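A minimal sketch of that approach, assuming a JSON library such as org.json is on the classpath; the key names, and the MongoSpout/RestBolt classes, are hypothetical stand-ins for your own components. Note that Storm serializes spouts and bolts, so anything you pass into their constructors must be Serializable.

import java.nio.file.Files;
import java.nio.file.Paths;
import org.json.JSONObject;

// inside the topology's main method; args[0] is the path to config.json
String json = new String(Files.readAllBytes(Paths.get(args[0])));
JSONObject conf = new JSONObject(json);
String mongoUri = conf.getString("mongoUri"); // key names are illustrative
String restUrl = conf.getString("restUrl");

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("raw-data", new MongoSpout(mongoUri)); // your own spout/bolt classes
builder.setBolt("rest-call", new RestBolt(restUrl)).shuffleGrouping("raw-data");
// ... submit the topology as usual

This way the same jar ships to every customer, and only the external config.json differs per deployment.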
Storm uses slf4j out of the box, and it should be easy to use within your topology as such. If you use the default configuration, you should be able to see logs either through the UI, or dumped to disk. If you can't find them, there are a number of guides to help, e.g. http://www.saurabhsaxena.net/how-to-find-storm-worker-log-directory/.
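For example, a bolt can grab its own slf4j logger in the usual way. A minimal sketch; the package prefix is org.apache.storm in 1.x and backtype.storm in older releases, and the class name is illustrative:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class CalculationBolt extends BaseBasicBolt {
    private static final Logger LOG = LoggerFactory.getLogger(CalculationBolt.class);

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        LOG.info("processing tuple: {}", input); // ends up in the worker logs / UI
        // ... calculation and emit
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // declare emitted fields here
    }
}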
With Storm, you have the flexibility to push concurrency out to the component level and get multiple executors by instantiating multiple bolts. This is likely the simplest approach, and I'd advise you to start there, and only later introduce the complexity of an executor inside your topology for making HTTP calls asynchronously.
See http://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html for the canonical overview of parallelism in storm. Start simple, and then tune as necessary, as with anything.
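Concretely, the parallelism hint on setSpout/setBolt is all you need to get multiple executors per component. A sketch; the component names, classes, and counts are illustrative:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("queue-spout", new QueueSpout(), 2);     // 2 executors
builder.setBolt("rest-bolt", new RestCallBolt(), 8)       // 8 executors soak up the blocking REST latency
       .shuffleGrouping("queue-spout");
builder.setBolt("mongo-writer", new MongoWriterBolt(), 4)
       .shuffleGrouping("rest-bolt");

With enough executors on the REST-calling bolt, blocking calls are usually tolerable, since other executors keep processing while one waits.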
I have a question related to Apache Spark. I write my client code in Java, but the question can be answered in any language.
The title may look like a generic question that a quick search would answer, but my problem is different, and every time I searched I found nothing on this topic and this requirement. The similar questions that usually turn up in a search, but are not my question, are:
Multiple SparkSession for one SparkContext
Multiple SparkSessions in single JVM
...
My question is none of the above, although it may seem similar. I will first explain the question itself, and then describe the higher-level requirement behind it. My goal is met if either the question is answered or another solution to the requirement is provided.
The problem I am trying to solve
I wrote a REST server component using the Spark Java library. The server receives requests in a specific format, forms a query from them, and submits a job to my Spark cluster through the Spark library functions. It then returns the query result as an asynchronous response (when it is ready and the user requests it).
I create the Spark session with code like this (abridged):
SparkConf sparkConf = new SparkConf()
.setMaster("spark://localhost:7077")
.setAppName("test");
SparkSession session = SparkSession.builder()
.config(sparkConf)
.getOrCreate();
...
As far as I know, when I run the code above, Spark creates an application named test and allocates some of my cluster's resources to it. (I use Spark in standalone mode.) For example, assume it uses all my resources, so there is nothing left for an additional application.
Right now I have just one REST server; it cannot be scaled at all, and if it goes down, users can no longer use the REST API. So I want to scale it to (at least) two instances on different machines, in different JVMs. (This is where my question differs from the others.)
If I bring up another instance of my REST server with the same code, it will create a new Spark session (because it is a different JVM on another machine), and with it another application named test in Spark. But since, as I said, all my resources are already used by the first Spark session, this second application is on standby and can do nothing (until resources become free).
Notes about the problem:
I do not want to split the cluster resources, giving some to the first REST server and some to the second.
I want both instances (or however many there are) to share a single Spark application; in other words, I want the same SparkContext across different JVMs. Note also that I submit my Spark queries in cluster mode, so my application is not the driver; one of the nodes in the cluster becomes the driver.
Requirement
As the description above makes clear, I want my REST server to be active-active HA, so that both Spark clients are connected to the same application and requests can be routed to either of them. That is my need at a higher level, and there may be another way to meet it.
I would be very grateful for any similar application, documentation, or experience, because my searches always ended at the questions listed above, which have nothing to do with my problem. Apologies for any typos due to my weak English. Thanks.
I like your idea a lot (probably because I had to implement quite a few similar things in the past).
In short, I am 95% sure that there is no way to share a JVM or a SparkContext between machines, executions, etc. I once tried to share dataframes between SparkContexts and it was a huge fiasco ;).
The way I would approach that:
If your REST server connects to a cluster, then once the Spark session is available, register the server with a load balancer.
If you submit your REST server as a Spark job, you can likewise have it register with the load balancer.
You can submit multiple jobs/start multiple servers. Each can pick any available port and advertise it to the load balancer.
Your REST clients would interact with the load balancer, not directly with the Spark REST servers. Each REST server will need a healthcheck endpoint so the load balancer can do its job (see the sketch after this list).
If one of your REST servers goes down, the load balancer could start a new one. You would lose that application's dataframes, but not the other applications.
If multiple REST servers need to exchange data, I would use Delta as a "cache" or staging zone.
Does that make sense? It should not be too hard to implement, and it provides good HA.
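For illustration, the healthcheck endpoint can be as small as this sketch using the JDK's built-in HTTP server. Here session is assumed to be the SparkSession your server created at startup, and the port and path are arbitrary; in practice you would use your REST framework's equivalent.

import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;

HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
server.createContext("/health", exchange -> {
    // report UP only while the SparkContext is still alive
    boolean up = session != null && !session.sparkContext().isStopped();
    byte[] body = (up ? "UP" : "DOWN").getBytes();
    exchange.sendResponseHeaders(up ? 200 : 503, body.length);
    exchange.getResponseBody().write(body);
    exchange.close();
});
server.start(); // then register host:8081 with the load balancer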
I have a requirement to create around 10 Spring Batch jobs, each consisting of a reader and a writer. All readers read data from one Oracle DB and write into a different Oracle DB (the source and destination servers are different). The jobs are implemented using Spring Boot, and all 10+ jobs are packaged into a single jar file. So far, fine.
Now the client also wants a UI to monitor job status and act as a job organizer. I went through the Spring Cloud Data Flow Server documentation for the UI requirement, but I'm not sure whether it will serve the purpose, or whether there is another option for monitoring job status and stopping and starting jobs from a UI.
Also, how could I separate the 10+ jobs inside a single jar in the Spring Cloud Data Flow Server, if that is the only option for a UI?
Thanks in advance.
I don't have enough reputation to add a comment, so I am posting an answer here, although I know this is not the way to share a reference link as an answer.
This might help you:
spring-batch-job-monitoring-with-angular-front-end-real-time-progress-bar
Spring Batch gives you observability through the metadata it persists in a relational database: job instances, executions, timestamps, read counts, write counts, and so on.
You have different ways to exploit this data: a SQL client, JMX, the Spring Batch API (JobExplorer, JobOperator), or Spring Batch Admin (deprecated in favor of Spring Cloud Data Flow Server).
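As a sketch of the API route: jobExplorer would be injected by Spring (@EnableBatchProcessing wires one up), and the job name here is made up.

import java.util.List;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobInstance;
import org.springframework.batch.core.StepExecution;

List<JobInstance> instances = jobExplorer.getJobInstances("oracleCopyJob", 0, 10);
for (JobInstance instance : instances) {
    for (JobExecution execution : jobExplorer.getJobExecutions(instance)) {
        System.out.println(execution.getStatus() + " started " + execution.getStartTime());
        for (StepExecution step : execution.getStepExecutions()) {
            System.out.println("  read=" + step.getReadCount() + " written=" + step.getWriteCount());
        }
    }
}

A custom monitoring endpoint or UI can be built on exactly this data.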
Data Flow is an orchestrator that lets you execute data pipelines made of streams and tasks (finite, short-lived, monitored services). For your jobs, you could wrap each job in a task and create a multi-task pipeline. Data Flow gives you the status of each execution.
You can also expose your monitoring data by pushing it as metrics to, for instance, an InfluxDB.
We are planning to create a new processing mechanism that listens to a few directories, e.g. /opt/dir1, /opt/dirN, and for each document created in these directories starts a routine to process it, persists its records in a database (via REST calls to an existing CRUD API), and generates a protocol file in another directory.
For testing purposes I am not using any modern (or even decent) framework/approach, just a regular Spring Boot app with a WatchService implementation that listens to these directories and polls for files to process as soon as they are created. It works, but I will clearly run into performance problems once I move to production and start receiving dozens of files to be processed in parallel, which isn't the case in my example.
After some research and tips from a few colleagues, I found Spring Batch + Spring Cloud Data Flow to be the best combination for my needs. However, I have never dealt with either Batch or Data Flow before, and I'm somewhat confused about what blocks I should build, and how, to get this routine going in the simplest and most performant manner. I have a few questions regarding its added value and architecture, and would really appreciate hearing your thoughts!
I managed to create and run a sample batch file ingest task based on this section of the Spring docs. How can I launch a task every time a file is created in a directory? Do I need a stream for that?
If I do, how can I create a stream application that launches my task programmatically for each new file, passing its path as an argument? Should I use RabbitMQ for this purpose?
How can I keep some variables externalized for my task, e.g. directory paths? Can these streams and tasks read an application.yml from somewhere other than inside their jar?
Why should I use Spring Cloud Data Flow alongside Spring Batch, and not just a batch application? Is it only because it spawns parallel tasks for each file, or do I get other benefits?
Talking purely about performance, how would this solution compare to my WatchService + plain processing implementation in the purely sequential scenario, where I'd receive only one file per hour or so?
Also, if any of you have a guide or sample on how to launch a task programmatically, I would really thank you! I am still searching for that, but it doesn't seem I'm doing it right.
Thank you for your attention and any input is highly appreciated!
UPDATE
I managed to launch my task via the SCDF REST API, so I could keep my original Spring Boot app using WatchService and launch a new task via Feign or XXX. I still know this is far from what I should be doing here. After some more research, I think creating a stream using a file source and sink would be the way to go, unless someone has another opinion. But I can't get the inbound channel adapter to poll from multiple directories, and I can't have multiple streams, because this platform is supposed to scale to the point where we have thousands of participants (or directories to poll files from).
Here are a few pointers.
I managed to create and run a sample batch file ingest task based on this section of the Spring docs. How can I launch a task every time a file is created in a directory? Do I need a stream for that?
If you have to launch it automatically upon an upstream event (e.g. a new file), yes, you could do that via a stream (see example). If the events come off of a message broker, you can also consume them directly in the batch job (e.g. AmqpItemReader).
If I do, how can I create a stream application that launches my task programmatically for each new file, passing its path as an argument? Should I use RabbitMQ for this purpose?
Hopefully the above example clarifies it. If you want to launch the task programmatically (not via the DSL/REST/UI), you can do so with the new Java DSL support, which was added in 1.3.
How can I keep some variables externalized for my task, e.g. directory paths? Can these streams and tasks read an application.yml from somewhere other than inside their jar?
The recommended approach is to use Config Server. Depending on the platform where this is orchestrated, you'd have to provide the config-server credentials to the task and its sub-tasks, including batch jobs. In Cloud Foundry, we simply bind the config-server service instance to each of the tasks, and at runtime the externalized properties are resolved automatically.
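If Config Server is not available to you, a plain Spring Boot sketch also works: bind the externalized values with @ConfigurationProperties and point the task at a file outside the jar with --spring.config.location=file:/etc/ingest/application.yml at launch time. The "ingest" prefix and the fields below are assumptions for illustration; register the class with @EnableConfigurationProperties(IngestProperties.class).

import java.util.List;
import org.springframework.boot.context.properties.ConfigurationProperties;

@ConfigurationProperties(prefix = "ingest")
public class IngestProperties {
    private List<String> directories; // ingest.directories in the external application.yml
    private String protocolDir;       // ingest.protocol-dir

    public List<String> getDirectories() { return directories; }
    public void setDirectories(List<String> directories) { this.directories = directories; }
    public String getProtocolDir() { return protocolDir; }
    public void setProtocolDir(String protocolDir) { this.protocolDir = protocolDir; }
}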
Why should I use Spring Cloud Data Flow alongside Spring Batch, and not just a batch application? Is it only because it spawns parallel tasks for each file, or do I get other benefits?
As a replacement for Spring Batch Admin, SCDF provides monitoring and management for tasks and batch jobs. The executions, steps, step progress, and stack traces upon errors are persisted and available to explore from the dashboard. You can also use SCDF's REST endpoints directly to examine this information.
Talking purely about performance, how would this solution compare to my WatchService + plain processing implementation in the purely sequential scenario, where I'd receive only one file per hour or so?
This is implementation specific, and we do not have any benchmarks to share. However, if performance is a requirement, you could explore the remote-partitioning support in Spring Batch. You can partition the ingest or data-processing tasks across "n" workers to achieve parallelism.
Context
I'm in the process of designing a solution to migrate a huge PL/SQL system to Java. The initial step is migrating some ETL jobs that:
Read CSV, XML, XLS (a new requirement), and positional files from several FTP/SFTP sources
Process the files according to rules stored in the database and write the results to a database table.
Currently this is done by several stored procedures and jobs.
My company is open to suggestions (if it can run on GlassFish 4 and share its logging and connection-pool mechanisms, as well as the admin console, that is a plus).
I've done a bit of research, and the following options caught my eye:
Java EE 7 Batch Processing: sounds simple and particularly well suited to GlassFish 4.
Spring Batch: somewhat more mature and very similar to the Java EE 7 standard (which was probably based on it).
Apache Camel: sounds powerful and would spare us a lot of fiddling with libraries such as Apache POI, but it also looks somewhat complex, and I'm not sure it is the best fit for the job (ETL over huge files).
Cook everything myself: I could create an application client to run a Quartz/Spring scheduler, or even EJB timers.
While I'm still open to suggestions (recommendations would be nice), the best fit so far seems to be Java EE 7 Batch Processing.
One more thing: the infrastructure team has a solution to move files from every FTP source to a local directory, so FTP is really not an issue.
Problem
I've read several tutorials about Java EE Batch Processing, and in all of them some kind of servlet or EJB timer is responsible for starting the jobs:
JobOperator jobOperator = BatchRuntime.getJobOperator();
jobOperator.start("job", properties);
I could easily deploy a web/EJB project and keep polling for changes. But I was thinking about a push model:
An application client console application
Its main class watches directories for new files
When there is a new file, it starts a new job.
My doubts are:
Is this strategy possible/advisable?
Will I need a JMS queue or some kind of producer/consumer strategy in the middle, or should I just call jobOperator.start for every file and trust the batch-processing layer to manage the application's resources? In other words, if a thousand files are delivered to my folder at once and I call jobOperator.start a thousand times, will GlassFish 4 do some kind of smart enqueuing, or should I create some kind of gate so that no more than n jobs run simultaneously?
I've already implemented a project with batch processing on WildFly (JBoss AS). I'm not familiar with the configuration details on GlassFish (not using it anymore because they've dropped enterprise support), but I can give you some insights and guidelines from my experience. Also, note that Spring Batch and the batch spec in EE 7 are quite similar, and your decision to use either technology should depend on "what else" you want your application to do besides the batching. Do you want an easily maintained web interface? Do you want to develop a REST API? Etc.
The ETL jobs you're describing fit perfectly with the steps-and-chunks model in the EE 7 spec, so if you've already developed some tests, you may have noticed that you still need to code the file readers and mappers for each file specification. Your input sources are quite standard, and you will easily find a library to read/stream them and process their data.
The project I implemented is quite simple: customers upload files that need to be processed to feed a data warehouse. The service is in the "cloud". Files have a defined spec and must be in CSV format. Most processing results are dimensional "upserts" and fact "erase prior to inserting". The user has a web interface on which the files and the batch-processing metadata are shown (processing state, dates, rejected items, etc.). Because it is a cloud service, the files must not reside locally on each server (we use S3).
So the first thing to design is the chunk steps. I didn't want an implementation for each file spec, so what I did was design a "fits all cases" implementation that processes files according to the metadata contained in them and in the job configuration itself. That is the easy part. The second thing to think about is the processing and metadata administration; here I developed a REST API and a web interface that uses it.
After all this, will it scale? WildFly has thread-configuration parameters for batch processing, and you can increase or decrease the thread availability for the JobOperator. Jobs are not submitted if there are not enough threads available. So what happens to those requests? They can reside in memory, a backed-up stateful session can be developed, or you can certainly implement an MQ listener for queued processing requests. What I did was much simpler: the company doesn't have the resources to maintain a cluster, so we built an elastic configuration that expands according to CPU consumption and request volume. So far, the application has processed 10 TB of data from 15 customers, and at the peak of requests/processing, three elastic instances have fired up.
A file listener is an interesting idea. You can listen to a directory and drop a processing request onto a queue, or immediately into the BatchRuntime. It depends on how you want to scale it, your required response time, the available resources, etc.
Feel free to ask me anything.
Regards.
EDIT: I forgot to mention that I don't really recommend using the application client unless you've already got something like that deployed in your organization. The recent security constraints and the Java SE update mechanism have made those kinds of deployments a real hassle to maintain. Think web.
I would approach it this way.
My hammers for this use case would be the Java Watch Service, a Servlet, a JMS queue, and the Batch service.
First, the Watch Service is the Java 7 go-to mechanism for file-system monitoring.
I would write a Watch Service implementation, and I would run it on a thread.
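Something along these lines; this is a sketch, the handling inside the loop is a placeholder, and a production version might prefer poll with a timeout so the stop flag is checked promptly:

import java.nio.file.*;

public class DirectoryWatcher implements Runnable {
    private final Path dir;
    private volatile boolean running = true;

    public DirectoryWatcher(Path dir) { this.dir = dir; }

    public void stop() { running = false; }

    @Override
    public void run() {
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            while (running) {
                WatchKey key = watcher.take(); // blocks until something happens
                for (WatchEvent<?> event : key.pollEvents()) {
                    Path created = dir.resolve((Path) event.context());
                    // do the two steps described below: move to "processing", then post to JMS
                }
                key.reset(); // re-arm the key or no further events arrive
            }
        } catch (InterruptedException | java.io.IOException e) {
            // log it; the container is most likely shutting down
        }
    }
}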
Where does the thread run, you ask?
Officially, you should probably be using JCA for this. But JCA is flat out a pain to work with, and underutilized, thus under-documented. There are solid examples, but it's simply not a common technology in the Java EE stack.
Another place is an asynchronous session bean invocation. Nothing says these cannot be long-lived invocations. You could stand up a @Singleton session bean with @Startup, call the async method from a @PostConstruct method, and let it go. Then, in @PreDestroy, signal the long-running method to stop so it can shut down cleanly. This should all be to spec, portable, and according to Hoyle.
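A sketch of that pattern; note the async method is invoked through the container proxy via getBusinessObject, for the reason given at the end of this answer:

import javax.annotation.PostConstruct;
import javax.annotation.PreDestroy;
import javax.annotation.Resource;
import javax.ejb.Asynchronous;
import javax.ejb.SessionContext;
import javax.ejb.Singleton;
import javax.ejb.Startup;

@Singleton
@Startup
public class WatcherBean {

    @Resource
    private SessionContext ctx;

    private volatile boolean running = true;

    @PostConstruct
    public void begin() {
        // go through the proxy so @Asynchronous is honored
        ctx.getBusinessObject(WatcherBean.class).watchForever();
    }

    @PreDestroy
    public void end() {
        running = false; // signal the loop so shutdown is clean
    }

    @Asynchronous
    public void watchForever() {
        while (running) {
            // run the WatchService loop: move files, post JMS messages
        }
    }
}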
The third place is to use a ServletContextListener, which is the pre-Java EE 6 go-to place for tying code into the life cycle of the application. Here you would create the thread yourself in the contextInitialized method and tear it down in the contextDestroyed method.
Creating threads here is "less defined", but I've done it for years and never had a problem.
Now that you have your service running, the service (IMHO) will do two things:
1) It will sense when a new file arrives in the directory, and when one does, it will MOVE (mv, rename) the file to a parallel "processing" directory. The move tells you the file has gone from incoming to processing, that it is a work in progress. That's obvious from a directory listing, regardless of what the back end thinks it's doing. Remember, the system can go down midway through a file.
2) Once the file is moved, post its name and any other metadata to a JMS queue, and have an MDB tool up the batch job.
Why add the JMS queue? It brings a couple of features to the party. First, it's a great way to get stuff from outside the happy transactional context that EJB likes into one. Second, it's transactional. Depending on your ETL use case, you can have the MDB process the job directly, and by doing so you simply do not acknowledge the message from the queue until the processing is done (and the file is deleted or moved out of the "processing" directory). In an ideal world, the messages on the queue match the files in the processing directory. When the processing is done, the method returns, the message fetch "commits", and you're done. If the system crashes, processing restarts from the beginning automatically (since the message is still on the queue and was never removed).
The MDB, by configuring its instances, can also gate the number of simultaneous jobs. Configure 10 instances, and only 10 files can be processed at the same time. But this can be a little too simple, too coarse; there's no priority, for example (just first come, first served). Still, it might work for you.
Either way, the MDB is a great gateway into the system, since each message starts with its own little bit of transactional context, unlike the long-running servlet thread or the long-running async thread. The servlet thread has a questionable (if any) transactional status, and the long-running thread inherits its state from the @Startup method and retains it for its lifetime, while the MDB gets a fresh one each time. Much of this can be shenaniganed away by calling methods with new transactions.
But I like the demarcation of the MDB. Even if its entire task is to create the batch entry for a file name, the MDB is a good gatekeeper.
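A sketch of that gatekeeper MDB; the queue name, message property key, and job XML name are all illustrative. Note that jobOperator.start is itself asynchronous, so if you want the "don't acknowledge until done" behavior described above, the MDB would do the processing inline instead.

import java.util.Properties;
import javax.batch.operations.JobOperator;
import javax.batch.runtime.BatchRuntime;
import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.jms.Message;
import javax.jms.MessageListener;

@MessageDriven(activationConfig = {
    @ActivationConfigProperty(propertyName = "destinationLookup",
                              propertyValue = "jms/etlQueue")
})
public class EtlJobStarterMdb implements MessageListener {

    @Override
    public void onMessage(Message message) {
        try {
            Properties props = new Properties();
            props.setProperty("inputFile", message.getStringProperty("fileName"));
            JobOperator jobOperator = BatchRuntime.getJobOperator();
            jobOperator.start("etl-job", props);
        } catch (Exception e) {
            throw new RuntimeException(e); // roll back so the message is redelivered
        }
    }
}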
And that's pretty much it.
The key parts are being a good citizen and tearing your thread down properly, tied to the life cycle of the application; understanding your transactional state in the various components; and understanding how all the moving parts fit together.
If you use the @Startup technique, make sure you invoke your async method through an injected proxy of your session bean (or via SessionContext.getBusinessObject, as in the sketch above). Otherwise the invocation will be a local call, and not asynchronous, and you'll stare at it wondering why your server hangs and never finishes starting up. All of the EJB annotations only work when invoked through an injected or looked-up proxy.
Have fun, share and enjoy.
Addenda to the question:
There's really no value in having an external process manage the watch service; one tied to the life cycle of the server is easier to maintain. Two things come to mind. If the server is down, files will simply stack up in the file system until the server starts again, so you don't lose data. If you have an external service, then you either have it sending messages to a dead server, or you have to stage and manage the JMS server separately from the app server. In that case you now have three processes to manage (watch service, JMS server, and app server) rather than just the app server.
I agree with the other poster that, should you decide to go with an external service anyway, a simple Java SE app posting simple messages to a JAX-RS REST service on the server, or even to a trivial servlet, is much, MUCH easier to maintain, stage, and deploy than an application client. If you do it that way, you could write the watch service in something completely different.
But since the server (ostensibly) has direct access to the file system with the files, there's really no motivation to break this service out of the container. Put the whole kit into an EAR and have at it. It's just flat easier to manage.
I currently have a Tomcat container with a servlet running on it, listening for requests. I need the result of an HTTP request to be a submission to a job queue, which will then be processed asynchronously. I want each "job" to be persisted as a row in a DB for tracking and for recovery in case of failure. I've been doing a lot of reading; here are my options (note that I have to use open-source stuff for everything).
1) JMS: use ActiveMQ (but then who is the consumer of the job, another servlet?)
2) Have my request create a row in the DB. Have a separate servlet inside my Tomcat container that always runs; it uses the Quartz scheduler, or the utilities provided in java.util.concurrent, to continuously process the rows as jobs (using thread pooling).
I am leaning towards the latter because looking at the JMS documentation gives me a headache, and while I know it's the more robust solution, I need to implement this relatively quickly. I'm not anticipating huge amounts of load in the early days of deploying this server in any case.
A lot of people say Spring might be good for either 1 or 2. However, I've never used Spring, and I wouldn't even know how to start using it to solve this problem. Any pointers on how to dive in without having to rewrite my entire project would be useful.
Otherwise if you could weigh in on option 1 or 2 that would also be useful.
Clarification: the asynchronous process will screen-scrape a third-party web site and send a message notification to the original requester. The third-party site is a bit flaky and slow, which is why it will be handled as an asynchronous process (with several retry attempts built in). I will also be pulling files from that site and storing them in S3.
Your Quartz job doesn't need to be a servlet! You can persist incoming jobs in the DB and start Quartz when your main servlet starts up. The Quartz job can be a simple POJO that checks the DB for pending jobs periodically.
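A minimal sketch of that shape, assuming Quartz 2.x; the names and the 30-second interval are illustrative, and schedule() could be called from a ServletContextListener at startup:

import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;

public class PendingJobsPoller implements Job {

    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        // query the jobs table for pending rows and process them
    }

    public static void schedule() throws SchedulerException {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        JobDetail job = JobBuilder.newJob(PendingJobsPoller.class)
                                  .withIdentity("pendingJobsPoller")
                                  .build();
        Trigger trigger = TriggerBuilder.newTrigger()
                .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                        .withIntervalInSeconds(30)
                        .repeatForever())
                .build();
        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}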
However, I would suggest taking a look at Spring. It's not hard to learn and is easy to set up within Tomcat. You can find a lot of good information in the Spring reference documentation. It has Quartz integration, which is much easier than doing it manually.
A suitable solution that will not require a lot of design and programming is to create the object you will need later in the servlet and serialize it to a byte array. Then put that in a BLOB field in the database and be done with it.
Your processing thread can then just read the contents, deserialize them, and work with the resurrected object.
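A sketch of the serialize-then-store half; the table and column names are made up, and the job object just needs to implement Serializable:

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.sql.Connection;
import java.sql.PreparedStatement;

void enqueue(Connection conn, Serializable job) throws Exception {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(job);
    }
    try (PreparedStatement ps = conn.prepareStatement(
            "INSERT INTO job_queue (payload, status) VALUES (?, 'NEW')")) {
        ps.setBytes(1, bytes.toByteArray()); // the worker reads and deserializes this later
        ps.executeUpdate();
    }
}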
But you may get better answers by describing what you need your system to actually DO :)