I have to load txt files into Oracle tables. Today the process is done with bash scripts, SQL*Loader, and command-line tools for validation.
I'm trying to find more robust alternatives. The two options I came up with are Luigi (Python framework) and Spring Batch.
I made a little POC using Spring Batch, but I believe it has a lot of boilerplate code and might be overkill. I also prefer Python over Java. The good thing about Spring Batch is the job tracking schema that comes out of the box with the framework.
Files contain from 200k to 1M records. No transformations are performed, only datatype and length validations. The first steps of the job consist of checking the header and trailer, validating some dates, querying a parameters table, and truncating the staging table.
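For reference, the per-record checks involved are nothing more complicated than this (the field layout here is made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the per-record checks described above:
// maximum lengths plus datatype (numeric/date) validation, collecting
// errors instead of aborting on the first bad field. The assumed layout
// is: id (numeric, max 10), name (max 30), birthDate (yyyyMMdd).
public class RecordValidator {
    static List<String> validate(String[] fields) {
        List<String> errors = new ArrayList<>();
        if (fields.length != 3) {
            errors.add("expected 3 fields, got " + fields.length);
            return errors;
        }
        if (fields[0].length() > 10 || !fields[0].matches("\\d+"))
            errors.add("id must be numeric, max length 10: " + fields[0]);
        if (fields[1].length() > 30)
            errors.add("name exceeds 30 chars");
        try {
            java.time.LocalDate.parse(fields[2],
                java.time.format.DateTimeFormatter.BASIC_ISO_DATE);
        } catch (java.time.format.DateTimeParseException e) {
            errors.add("invalid date: " + fields[2]);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(validate(new String[]{"42", "ACME", "20240131"}));  // []
        System.out.println(validate(new String[]{"x42", "ACME", "20241340"})); // two errors
    }
}
```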
Could you give me some pros and cons of each framework for this use case?
I would argue they are not equivalent technologies. Luigi is more of a workflow/process-management framework that can help organize and orchestrate many different batch jobs.
The purpose of Luigi is to address all the plumbing typically associated with long-running batch processes. You want to chain many tasks, automate them, and failures will happen. These tasks can be anything, but are typically long running things like Hadoop jobs, dumping data to/from databases, running machine learning algorithms, or anything else. https://luigi.readthedocs.io/en/stable/
Spring Batch gives you a reusable framework for structuring a batch job. It gives you a lot of things out of the box, like being able to read input from text files and write output to databases.
A lightweight, comprehensive batch framework designed to enable the development of robust batch applications vital for the daily operations of enterprise systems.
Spring Batch provides reusable functions that are essential in processing large volumes of records, including logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management. https://spring.io/projects/spring-batch
You could theoretically run Spring Batch jobs with Luigi.
Based on the brief description of your use case, it sounds like the bread and butter of what inspired Spring Batch in the first place. In fact, their 15 minute demo application covers the use case of reading from a file and loading records into a JDBC database https://spring.io/guides/gs/batch-processing/.
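To give a feel for what the framework handles for you, here is a stdlib-only sketch of its chunk-oriented model: read a chunk of records, write them as one unit, then record the committed position so a restarted job resumes where it left off. All names here are illustrative, not the Spring Batch API.

```java
import java.util.ArrayList;
import java.util.List;

// Stdlib-only illustration of the chunk-oriented model: read `chunkSize`
// records, write them as one unit, then advance the committed position.
// In Spring Batch the sink is typically a JDBC batch insert and the
// committed position lives in the job repository's metadata tables.
public class ChunkRunner {
    static int process(List<String> records, int chunkSize, int restartFrom,
                       List<String> sink) {
        int committed = restartFrom;
        while (committed < records.size()) {
            int end = Math.min(committed + chunkSize, records.size());
            sink.addAll(records.subList(committed, end)); // stand-in for a batch write
            committed = end;                              // stand-in for metadata update
        }
        return committed;
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        // resume from position 2, as if a previous run had committed one chunk of 2
        process(List.of("r1", "r2", "r3", "r4", "r5"), 2, 2, out);
        System.out.println(out); // [r3, r4, r5]
    }
}
```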
Related
We are about to create a standard Java project, which is actually a batch process that runs from the console.
Every "batch" uses only select statements on multiple tables from different DBs, but we'll be doing thousands of selects.
I'm not really familiar with the whole of Hibernate - is it worth using it in this situation?
Have you taken a look at Spring Batch:
Spring Batch is a lightweight, comprehensive batch framework designed to enable the development of robust batch applications vital for the daily operations of enterprise systems. Spring Batch builds upon the productivity, POJO-based development approach, and general ease of use capabilities people have come to know from the Spring Framework, while making it easy for developers to access and leverage more advanced enterprise services when necessary.
Not necessary. From your description, your DB operations are quite simple, so why not just use JDBC directly, or a simple library such as Spring's JdbcTemplate (http://docs.spring.io/spring/docs/3.0.x/spring-framework-reference/html/jdbc.html)?
There's no need to import a huge dependency like Hibernate, in my opinion. The time needed to learn and configure Hibernate is hard to predict, so why not just focus on the main project requirements and keep it simple from the beginning.
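For comparison, plain JDBC for this kind of select-only work is only a few lines. The URL, credentials, and table are placeholders, and a real run needs the Oracle driver on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Minimal plain-JDBC sketch of select-only batch work. Connection
// details and the query are placeholders; not runnable without a
// JDBC driver and a reachable database.
public class PlainJdbcSelect {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:oracle:thin:@//dbhost:1521/SERVICE"; // placeholder
        try (Connection con = DriverManager.getConnection(url, "user", "pass");
             PreparedStatement ps = con.prepareStatement(
                 "SELECT id, status FROM orders WHERE created > ?")) {
            ps.setDate(1, java.sql.Date.valueOf("2024-01-01"));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + " " + rs.getString("status"));
                }
            }
        }
    }
}
```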
If you will perform selects on many different tables and/or need to manipulate the data, the easiest way is to retrieve rows from the database as object instances, so I strongly recommend Hibernate.
We are to replace a legacy scheduler implementation written in PL/SQL in a large enterprise environment and I'm evaluating Spring Batch for the job.
Although the domain is pretty close to the one provided by Spring Batch, there are few subtleties we have to address before deciding to embark on Spring Batch.
One of them has to do with jobs. In our domain, a job can be one of two things: a shell script or a stored procedure. Job and step definitions are stored in a table in the database. Before each execution, they are instantiated into another table similar to the job_execution table in Spring Batch.
There are hundreds of jobs and thousands of steps. Keeping and managing these in XML files doesn't seem like a viable option; we'd like to continue managing job definitions in tables.
So what are our options to CRUD jobs definitions at runtime and in a database?
AFAIK you're out of luck. The following suggestions are hardly worth the effort for you, in my opinion:
Reduce your jobs to a limited set of parameterized templates that are predefined and filled from the DB on (parallel) execution.
Develop a tool that creates the configuration from the DB on request,
e.g. Spring configuration from database
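As a rough sketch of the second option, a tool could assemble Job objects at runtime from your definition tables. Everything here is hypothetical: the table and column names, the JdbcTemplate query, and the dispatcher method; only the builder API is real (Spring Batch 5 style). This is a sketch, not working code:

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.job.builder.JobBuilder;
import org.springframework.batch.core.job.builder.SimpleJobBuilder;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.transaction.PlatformTransactionManager;

// Hedged sketch: read your own job/step definition tables and assemble
// Spring Batch Job objects at runtime instead of static XML.
public class DynamicJobFactory {
    private final JobRepository repo;
    private final PlatformTransactionManager tx;
    private final JdbcTemplate jdbc;

    DynamicJobFactory(JobRepository repo, PlatformTransactionManager tx, JdbcTemplate jdbc) {
        this.repo = repo; this.tx = tx; this.jdbc = jdbc;
    }

    Job build(String jobName) {
        // hypothetical definition table: one row per step, ordered by seq
        var defs = jdbc.queryForList(
            "SELECT step_name, command FROM my_step_defs WHERE job_name = ? ORDER BY seq",
            jobName);
        JobBuilder builder = new JobBuilder(jobName, repo);
        SimpleJobBuilder jb = null;
        for (var row : defs) {
            Step step = new StepBuilder((String) row.get("step_name"), repo)
                .tasklet((contribution, ctx) -> {
                    runShellOrProcedure((String) row.get("command"));
                    return RepeatStatus.FINISHED;
                }, tx)
                .build();
            jb = (jb == null) ? builder.start(step) : jb.next(step);
        }
        return jb.build(); // assumes at least one step definition exists
    }

    // hypothetical dispatcher: run a shell script or call a stored procedure
    void runShellOrProcedure(String command) { }
}
```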
I am supposed to design a component which should achieve the following tasks by using multiThreading in Java as the files are huge/multiple and the task has to happen in a very short window:
Read multiple CSV/XML files and save all the data in a database.
Read the database and write the data to separate CSV and XML files per transaction type. (Each file may contain different record types: file header, batch header, batch footer, file footer, different transactions, and a checksum record.)
I am very new to multithreading & doing some research on Spring Batch in order to use it for the above tasks.
Please let me know whether you suggest using traditional Java multithreading or Spring Batch. The input sources are multiple here, and so are the output sources.
I would recommend going with something from a framework rather than writing the whole threading part yourself. I've quite successfully used Spring's tasks and scheduling for scheduled jobs that involved reading data from a DB, doing some processing, sending emails, and writing data back to the database.
Spring Batch is ideal for implementing your requirement. First of all, you can use the built-in readers and writers to simplify your implementation - there is very good support for parsing CSV files, XML files, reading from a database via JDBC, etc. You also get features like retrying on failure, skipping invalid input, and restarting the whole job if something fails in between - the framework tracks the status and restarts from where it left off. Implementing all this by yourself is very complex, and doing it well requires a lot of effort.
Once you implement your batch jobs with Spring Batch, it gives you simple ways of parallelizing them. A single step can run in multiple threads - it is mostly a configuration change. If you have multiple steps to be performed, you can configure that as well. There is also support for distributing the processing over multiple machines if required. Most of the work to achieve parallelism is done by Spring Batch.
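For illustration, the single-step parallelization mentioned above is roughly this (Spring Batch 5 style builders; the step name is made up, the reader/writer are placeholders, and the reader must be thread-safe, e.g. a synchronized delegate):

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.core.task.SimpleAsyncTaskExecutor;
import org.springframework.transaction.PlatformTransactionManager;

// Hedged sketch: the same chunk-oriented step, parallelized by attaching
// a TaskExecutor - chunks are then read/processed/written concurrently.
public class ParallelStepConfig {
    Step parallelStep(JobRepository repo, PlatformTransactionManager tx,
                      ItemReader<String> reader, ItemWriter<String> writer) {
        return new StepBuilder("loadStep", repo)
            .<String, String>chunk(1000, tx)
            .reader(reader)   // must be thread-safe in a multi-threaded step
            .writer(writer)
            .taskExecutor(new SimpleAsyncTaskExecutor("load-"))
            .build();
    }
}
```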
I would strongly suggest that you prototype a couple of your most complex scenarios with Spring Batch. If that works out, you can go ahead with it. Implementing it on your own, especially when you are new to multithreading, is a sure recipe for disaster.
Alright... so here's the situation:
I am responsible for architecting the migration of an ETL (EAI, rather) software that is Java-based.
I'll have to migrate this to Hadoop (the Apache version). Now, technically this is more like a reboot than a migration, because I've got no database to migrate. This is about leveraging Hadoop so that the Transformation phase (of "ETL") is parallelized. This would make my ETL software:
Faster - with the transformation parallelized.
Scalable - handling more data / big data is just a matter of adding nodes.
Reliable - Hadoop's redundancy and reliability will add to my product's features.
I've tested this configuration out - converted my transformation algorithms into a MapReduce model, tested it on a high-end Hadoop cluster, and benchmarked the performance. Now I'm trying to understand and document everything that could stand in the way of this application redesign/re-architecture/migration. Here are a few things I could think of:
The other two phases, Extract and Load: my ETL tool can handle a variety of data sources. So do I redesign my data adapters to read data from these sources, load it into HDFS, transform it, and then load it into the target data source? Could this step become a huge bottleneck for the entire architecture?
Feedback: Say my transformation fails on a record - how do I let the end user know that the ETL hit an error on that particular record? In short, how do I keep track of what is actually going on at the app level with all the maps/reduces/merges and sorts happening? The default Hadoop web interface is not for the end user - it's for admins. So should I build a new web app that scrapes the Hadoop web interface? (I know this is not recommended.)
Security: How do I handle authorization at the Hadoop level? Who can run jobs and who cannot - how do I support ACLs?
I look forward to hearing from you with possible answers to above questions and more questions/facts I'd need to consider, based on your experiences with Hadoop / problem analysis.
Like always, I appreciate your help and thank ya all in advance.
I do not expect loading into HDFS to be a bottleneck, since the load is distributed among data nodes - the network interface will be the only bottleneck. Loading data back into the database might be a bottleneck, but I think it is no worse than now. I would design jobs to have their input and output sit in HDFS, and then run some kind of bulk load of the results into the database.
Feedback is a problematic point, since MapReduce has only one result - the transformed data. Tricks like writing failed records into HDFS files lack the "functional" reliability of MapReduce, because they are side effects. One way to mitigate this problem is to design your software to be ready for duplicated failed records. There is also Sqoop - a tool specifically for migrating data between SQL databases and Hadoop. http://www.cloudera.com/downloads/sqoop/
At the same time, I would consider using Hive - if your SQL transformations are not that complicated, it might be practical to create CSV files and do the initial pre-aggregation with Hive, thereby reducing data volumes before going to the (perhaps single-node) database.
We have an issue where a database table has to be updated with the status of a particular entity. Presently, it's all Java code with a lot of if conditions and an update to the status. I was thinking along the lines of using a workflow engine, since there can be multiple flows in the future. Is it overkill to use a workflow engine here... where do you draw the line?
It depends on the complexity of your use case.
In a simple use case, we have a database column updated by multiple consumers for each stage in an Order lifecycle. This is done by a web service calling into the database.
The simple lifecycle goes from ACKNOWLEDGED > ACCEPTED/REJECTED > FULFILLED > CLOSED. All of these are in the same table on the same column. This is executed in java classes with no workflow.
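As a sketch of how far plain Java can go before a workflow engine pays off, a lifecycle like the one above fits in a small enum transition table (stdlib only; the structure here is illustrative, not the actual implementation):

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// The simple lifecycle as a plain enum transition table: legal moves
// live in one map instead of scattered if-conditions.
public class OrderLifecycle {
    enum Status { ACKNOWLEDGED, ACCEPTED, REJECTED, FULFILLED, CLOSED }

    private static final Map<Status, Set<Status>> LEGAL = Map.of(
        Status.ACKNOWLEDGED, EnumSet.of(Status.ACCEPTED, Status.REJECTED),
        Status.ACCEPTED,     EnumSet.of(Status.FULFILLED),
        Status.REJECTED,     EnumSet.of(Status.CLOSED),
        Status.FULFILLED,    EnumSet.of(Status.CLOSED),
        Status.CLOSED,       EnumSet.noneOf(Status.class));

    static boolean canMove(Status from, Status to) {
        return LEGAL.get(from).contains(to);
    }

    public static void main(String[] args) {
        System.out.println(canMove(Status.ACKNOWLEDGED, Status.ACCEPTED)); // true
        System.out.println(canMove(Status.ACKNOWLEDGED, Status.CLOSED));   // false
    }
}
```

Once the transitions need side effects across systems (emails, forks, parallel branches), this table stops being enough - which is roughly where a workflow engine starts to earn its keep.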
A workflow engine is suited to a more complex use case involving actions across multiple data providers (e.g. a database, content management, document management, or a search engine), multiple parallel processes, forking based on the success/failure of a previous step, sending an email at a certain step, or offline error alerting.
You can look at Apache ODE to implement this.
We have an issue where a database table has to be updated with the status of a particular entity. Presently, it's all Java code with a lot of if conditions and an update to the status.
Sounds like a one-off fix, with no need to orchestrate actions among workflow participants.
Maybe a rule engine is better suited for this. Drools could be a good candidate. When X then Y.
If you're using Spring, this is a good article on how to implement your requirement
http://www.javaworld.com/javaworld/jw-04-2005/jw-0411-spring.html
I think you should consider a workflow engine. Workflow should be separated from application logic.
Reasons:
Maintainable: easier to modify, to add new flows, and even to replace with another workflow engine.
Business process management: workflows are mostly software representations of BPM, so they are usually designed by process designers (non-tech people). It is therefore not a good idea to code them inside the application; instead, BPM products such as ALBPM or jBPM should be used, which support graphical workflow design.
Monitoring business flows: they are often monitored by top-level managers and used to make strategic decisions.
Easier for Data mining/Reports/Statistics.
ALBPM (now Oracle BPM): a commercial tool from Oracle, suitable for large-scope projects.
My recommendation is jBPM, an open-source tool from JBoss. Unlike ALBPM, which requires a separate DB and application server, it can be packaged with your application and run as another module. I think it is suitable for your project.