Getting hadoop cluster information from within a job - java

I am writing a Hadoop job that should collect the start and finish times of all jobs that have run in a cluster and upload this data to a blob. However, I'm not sure how to get this information, as a job doesn't seem to have access to the JobTracker. Any ideas?

You could make use of the getLaunchTime() and getFinishTime() methods provided by the JobInProgress class. The API also has a JobTracker class whose getJobsFromQueue(String queue) method returns all the jobs submitted to a particular queue.
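If you need this information from client-side code rather than from inside the JobTracker process (JobInProgress and JobTracker live server-side), a minimal sketch using the JobClient API might look like the following; this is an assumption on my part, not the exact classes named above:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;

public class ClusterJobTimes {
    public static void main(String[] args) throws Exception {
        // Connects to the JobTracker configured in mapred-site.xml.
        JobClient client = new JobClient(new JobConf());

        // getAllJobs() returns a JobStatus for every job the JobTracker knows about.
        for (JobStatus status : client.getAllJobs()) {
            // On newer Hadoop versions, status.getFinishTime() is available as well.
            System.out.printf("%s started at %d%n",
                    status.getJobID(), status.getStartTime());
        }
        client.close();
    }
}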
Apart from these, both classes expose a number of other methods you might find helpful.
HTH

Related

Approaches to split scheduled jobs from microservices without code duplication

Here are the goals I'm trying to achieve:
Take the scheduled jobs out of the microservice, because they can and would harm timings/performance
Execute jobs in a separate computation cluster, aka workers
Avoid code duplication: I want to keep all my business logic in one Service and all DB-related operations in one Dao, and not write additional services/daos for jobs
Avoid dependency-management problems: different jobs may require different libs/versions/etc. For instance, a job from ServiceA may use javax.annotation-api while a job originating from ServiceB may use jakarta.annotation-api. Making a worker depend on both ServiceA and ServiceB will cause build or runtime problems.
Are there any approaches/libraries/solutions that achieve all of these goals at the same time?
UPD:
Both Temporal.io and Quartz are not quite what I need: they both require the worker to depend on the workflow tasks.
I may be approaching the issue I face in the wrong way, so architectural advice is also appreciated.
From an architectural perspective, expose the service (business logic) via an API.
Have the schedulers run on a separate instance, or, if you are using one of the popular cloud platforms, have their FaaS (function as a service, in your case a scheduler) trigger the service API via HTTP (any or a dedicated instance).
Azure -> Azure Functions
AWS -> Lambda functions
Google Cloud -> Google Cloud Functions
All of the above have comprehensive guides on how to create a scheduled function, aka trigger.
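As a rough sketch of that setup, assuming AWS: a Lambda handler, fired by a cron-style schedule (e.g. an EventBridge rule), that calls a hypothetical job endpoint on the service over HTTP. The URL and endpoint name are illustrative, not from the question:

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Scheduled trigger: all business logic stays in the service; the function
// only tells the service's API to start the job.
public class JobTrigger implements RequestHandler<Object, String> {

    private static final HttpClient HTTP = HttpClient.newHttpClient();
    // Hypothetical endpoint exposed by the service's API.
    private static final String JOB_URL = "https://service-a.internal/api/jobs/nightly-report";

    @Override
    public String handleRequest(Object event, Context context) {
        try {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(JOB_URL))
                    .POST(HttpRequest.BodyPublishers.noBody())
                    .build();
            HttpResponse<String> response =
                    HTTP.send(request, HttpResponse.BodyHandlers.ofString());
            return "Job triggered, status " + response.statusCode();
        } catch (Exception e) {
            throw new RuntimeException("Failed to trigger job", e);
        }
    }
}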
Hope this helps and I'm not off topic.
From my perspective you have three possible solutions:
Most straightforward: ensure that the service logic required by the jobs also implements a local (programming) API.
It can then act as, and be imported as, a library and be reused in jobs without code duplication.
If you have a larger development organization, you also want to make sure that such libraries are properly version-managed and that version releases are pre-planned, which allows the teams using them to treat them like third-party libraries.
Also, there is no magic: you would still have to work through any build/dependency conflicts. (Since your question sounds like this is a deal breaker, let's look at the other solutions.)
The second solution is to provide a wrapper for each piece of service logic that exposes its functionality via a CLI. That means you don't import the libraries, but rather execute them as jars/executables through the CLI, as sketched below. This lets you use the same code while avoiding dependency problems.
(You will still have to deal with version management, version upgrades, etc.)
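A minimal sketch of a worker invoking such a CLI wrapper as a separate JVM process, so the job's classpath never mixes with the worker's (the jar name and argument are hypothetical):

import java.io.IOException;

public class CliJobRunner {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Each service ships its job logic as a self-contained executable jar.
        Process process = new ProcessBuilder(
                "java", "-jar", "service-a-jobs.jar", "nightly-report")
                .inheritIO() // forward the job's stdout/stderr to the worker's
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IllegalStateException("Job failed with exit code " + exitCode);
        }
    }
}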
If you use containerized deployments/hosting, you can also consider bundling multiple containers together just for your jobs, where each job gets its own private service container instances for use during the job. Kubernetes and Docker Compose, for example, have options to run such multi-container deployments/jobs.
That solution lets you reuse the same services as they run for other purposes, but you have to make sure they are configurable enough to work in this scenario.
One problem all of these approaches share is that you have to make sure there are no runtime conflicts between your jobs and the deployed regular services (for example, state conflicts).
How to execute the jobs depends on your deployment scenario. Kubernetes can run containers as jobs natively, which makes it easy to bundle multiple jars, etc. But it is always an option to deploy a dedicated scheduler or workflow tool such as Apache Airflow to run your jobs.

How to monitor Executors or other Task Executing Threads in Spring MVC

I am creating a web application in which I will be creating many services and executors to do some tasks. I have extended DispatcherServlet and started the executors and other threads in its init method. Is this the right approach?
Now suppose a request comes in and that executor, or a similar task-executing thread, dies after throwing an exception.
1. I suppose this will affect other requests as well. What should I do in such cases?
2. How can I create a monitor thread that checks whether all critical task-executing threads and executors are running properly?
3. Should I keep another backup executor prepared and on standby to take over from the failed executor in such situations? If so, how?
It is an old one, but maybe it will help someone :)
For ExecutorService there is a nice example of how to approach the problem in Codahale Metrics: https://github.com/dropwizard/metrics/blob/master/metrics-core/src/main/java/com/codahale/metrics/InstrumentedExecutorService.java
I did not find anything as good for the Spring AsyncTaskExecutors :/
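For illustration, a minimal sketch of wrapping an executor with InstrumentedExecutorService and dumping the collected metrics; the "worker" name and the console reporter are just for the example (in production you'd report via JMX, Graphite, etc.):

import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.InstrumentedExecutorService;
import com.codahale.metrics.MetricRegistry;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MonitoredExecutorExample {
    public static void main(String[] args) throws InterruptedException {
        MetricRegistry registry = new MetricRegistry();

        // Wraps a plain executor; submitted/running/completed counts and
        // task durations are recorded under the "worker" prefix.
        ExecutorService executor = new InstrumentedExecutorService(
                Executors.newFixedThreadPool(4), registry, "worker");

        executor.submit(() -> System.out.println("task ran"));

        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.SECONDS);

        // Dump all collected metrics to stdout.
        ConsoleReporter.forRegistry(registry).build().report();
    }
}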
It's been a while since I have used an Executor. Are you using one of the built-in executors from Java or Spring, or are you rolling your own? Spring has a bunch, and I think Java gives you two or three concrete implementations.
Anyhow, I think the answer is to roll some sort of monitoring service, perhaps using JMX if you have it available. If you want to wire up your Executors auto-magically, you can use the ApplicationContextAware interface to get a reference to the ApplicationContext, which has a method called getBeansOfType(). If you prefer the more straightforward approach, simply write your monitoring service and inject the executors into it directly.
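A sketch of the getBeansOfType() idea, assuming the executors are registered as ThreadPoolExecutor beans and that @EnableScheduling is present on a config class; the bean type and the 30-second interval are assumptions for the example:

import java.util.Map;
import java.util.concurrent.ThreadPoolExecutor;
import org.springframework.beans.BeansException;
import org.springframework.context.ApplicationContext;
import org.springframework.context.ApplicationContextAware;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Periodically discovers all ThreadPoolExecutor beans in the context
// and reports their health; replace System.out with real alerting.
@Component
public class ExecutorMonitor implements ApplicationContextAware {

    private ApplicationContext context;

    @Override
    public void setApplicationContext(ApplicationContext context) throws BeansException {
        this.context = context;
    }

    @Scheduled(fixedDelay = 30_000) // requires @EnableScheduling on a config class
    public void checkExecutors() {
        Map<String, ThreadPoolExecutor> executors =
                context.getBeansOfType(ThreadPoolExecutor.class);
        executors.forEach((name, executor) ->
                System.out.printf("%s: active=%d queued=%d shutdown=%b%n",
                        name, executor.getActiveCount(),
                        executor.getQueue().size(), executor.isShutdown()));
    }
}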
Another option is an external monitoring framework/app: something like Dynatrace, which attaches to the JVM process and monitors things, or, if you don't mind switching app servers, SpringSource's tcServer, which has optional instrumented Spring JARs and provides a ton of out-of-the-box monitoring.

Scheduled Tasks with Quartz - JDBC

I'm building a mini web application for reminders and am using Quartz Scheduler to fire the reminder events. I understand that the tasks (Jobs) and schedulers can be configured from a database via JDBC, but I have searched and cannot find an example showing what information I should put in the tables and what Java code I should run to start the scheduled tasks. If anyone has an example, or anything else that serves this purpose, I'd be grateful.
You have understood wrong. You can use any JobStore (including the JDBC job store) to store your jobs/triggers/etc., but creating them manually in the database is a bad idea™.
Depending on how you are using Quartz, you can set it up either via Spring or using the fluent syntax (which I believe is the preferred method these days).
Further reading: http://quartz-scheduler.org/documentation/quartz-2.1.x/tutorials/tutorial-lesson-09
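To make the point concrete, a minimal sketch of the fluent syntax. The JDBC job store is selected in quartz.properties (sketched in the comments); Quartz then writes the rows into its QRTZ_* tables itself, which is exactly why you never insert them by hand:

import static org.quartz.JobBuilder.newJob;
import static org.quartz.SimpleScheduleBuilder.simpleSchedule;
import static org.quartz.TriggerBuilder.newTrigger;

import org.quartz.Job;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.impl.StdSchedulerFactory;

public class QuartzJdbcExample {

    // The job class Quartz instantiates on every firing.
    public static class ReminderJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            System.out.println("Reminder fired at " + new java.util.Date());
        }
    }

    public static void main(String[] args) throws Exception {
        // quartz.properties selects the JDBC store, e.g.:
        //   org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
        //   org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.StdJDBCDelegate
        //   org.quartz.jobStore.dataSource = myDS   (plus the myDS connection settings)
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = newJob(ReminderJob.class)
                .withIdentity("reminder", "reminders")
                .build();

        Trigger trigger = newTrigger()
                .withIdentity("reminderTrigger", "reminders")
                .startNow()
                .withSchedule(simpleSchedule().withIntervalInMinutes(30).repeatForever())
                .build();

        // Quartz persists the job and trigger to the database for you.
        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}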

Asynchronous Scheduling in Java using Quartz

I need a mechanism for implementing asynchronous job scheduling in Java and was looking at the Quartz Scheduler, but it seems that it does not offer the necessary functionality.
More specifically, my application, which runs across different nodes, has a web-based UI, via which users schedule several different jobs. When a job is completed (sometime in the future), it should report back to the UI, so that the user is informed of its status. Until then, the user should have the option of editing or cancelling a scheduled job.
An implementation approach would be to have a Scheduler thread running constantly in the background in one of the nodes and collecting JobDetail definitions for job execution.
In any case, there are two questions (applicable to either a single-node or a multi-node scenario):
Does Quartz allow a modification or a cancellation of an already scheduled job?
How can a "callback" mechanism be implemented so that a job execution result is reported back to the UI?
Any code examples, or pointers, are greatly appreciated.
Does Quartz allow a modification or a cancellation of an already scheduled job?
You can "unschedule" a job:
scheduler.unscheduleJob(triggerKey("trigger1", "group1"));
Or delete a job:
scheduler.deleteJob(jobKey("job1", "group1"));
As described in the docs:
http://quartz-scheduler.org/documentation/quartz-2.x/cookbook/UnscheduleJob
Note: the main difference between the two is that unscheduling a job removes the given trigger, while deleting the job removes the job and all of its triggers.
How can a "callback" mechanism be implemented so that a job execution result is reported back to the UI?
Typically, in the web world, you will poll the web server for changes. Depending on your web framework, there may be a component available (push, pull, poll?) that makes this easy. You will also need to store some state about your job on the server. Once the job finishes, you may update a value in the database or possibly in memory. This, in turn, would be picked up by the polling and displayed to the user.
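A sketch of the server-side half of that idea: a Quartz JobListener that records each job's outcome so the web layer can poll it. The in-memory map is an assumption for the example; in a multi-node deployment you would write to the database instead:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import org.quartz.listeners.JobListenerSupport;

public class JobStatusListener extends JobListenerSupport {

    // Shared job-key -> status map the UI's polling endpoint can read.
    public static final Map<String, String> STATUS = new ConcurrentHashMap<>();

    @Override
    public String getName() {
        return "jobStatusListener";
    }

    @Override
    public void jobWasExecuted(JobExecutionContext context, JobExecutionException e) {
        STATUS.put(context.getJobDetail().getKey().toString(),
                e == null ? "COMPLETED" : "FAILED: " + e.getMessage());
    }
}

Register it with scheduler.getListenerManager().addJobListener(new JobStatusListener()); before starting the scheduler.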
To complete the answer of @johncarl, Quartz also has:
handling of events before and after the execution, by implementing the SchedulerListener interface
grouping of your jobs
support with the Enterprise Edition
most importantly, cron expressions; here's the documentation

How to run multiple jobs with spring quartz when the jobs are fetched from a database

How do I run multiple jobs with Spring Quartz when the jobs are fetched from a database?
Please provide any example code.
There are several parts to this question:
How to run Quartz.
How to connect to a database.
How to create a schema to describe "jobs".
How to create and execute a "job" from the schema.
Marry all these together and you'll answer your own question.
Computer science is about decomposition: breaking large problems into smaller ones that you can handle. I'd recommend taking that approach here.
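For the last two parts, a hedged sketch of what "create and execute a job from the schema" could look like: read rows from a hypothetical job_definitions table (name, cron_expression, job_class) and schedule each one with Quartz. The table, the JDBC URL, and the column names are all assumptions for illustration:

import static org.quartz.CronScheduleBuilder.cronSchedule;
import static org.quartz.JobBuilder.newJob;
import static org.quartz.TriggerBuilder.newTrigger;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.quartz.Job;
import org.quartz.Scheduler;
import org.quartz.impl.StdSchedulerFactory;

public class DatabaseDrivenScheduler {
    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        scheduler.start();

        // Hypothetical table holding one row per job definition.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:jobs");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT name, cron_expression, job_class FROM job_definitions")) {
            while (rs.next()) {
                // The job_class column stores the fully qualified class name
                // of a class implementing org.quartz.Job.
                Class<? extends Job> jobClass =
                        Class.forName(rs.getString("job_class")).asSubclass(Job.class);
                scheduler.scheduleJob(
                        newJob(jobClass).withIdentity(rs.getString("name")).build(),
                        newTrigger()
                                .withIdentity(rs.getString("name") + "-trigger")
                                .withSchedule(cronSchedule(rs.getString("cron_expression")))
                                .build());
            }
        }
    }
}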
Well, you can start by reading the documentation here and here. If you then have a more specific question, come back and ask it.
