ETL architecture - Java

I've been asked to build an ETL-style application that transfers information from one data source to another. At the moment, I've decided on a three-layer architecture, but I would like to find out more about best practices as well as the life cycle described on this Wikipedia page:
http://en.wikipedia.org/wiki/Extract,_transform,_load
Four-layered approach for ETL architecture design
Functional layer: Core functional ETL processing (extract, transform, and load).
Operational management layer: Job-stream definition and management, parameters, scheduling, monitoring, communication and alerting.
Audit, balance and control (ABC) layer: Job-execution statistics, balancing and controls, rejects- and error-handling, codes management.
Utility layer: Common components supporting all other layers.
Real-life ETL cycle
The typical real-life ETL cycle consists of the following execution steps:
Cycle initiation
Build reference data
Extract (from sources)
Validate
Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
Stage (load into staging tables, if used)
Audit reports (for example, on compliance with business rules. Also, in case of failure, helps to diagnose/repair)
Publish (to target tables)
Archive
Clean up

I don't know what your situation is or what your requirements are, but you're likely overthinking the problem.
The name alone is "the" architecture:
Extract
Transform
Load
Exporting a DB table to a CSV can be considered "ET" while loading the CSV is the "L". Most ETL problems are simply not complicated.
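As a rough illustration of how little is needed, here is a minimal sketch of that DB-to-CSV idea using plain JDBC; the connection details, table, and column names are all made up:

    // SimpleEtl.java - extract rows, apply a trivial transform, write a CSV (the "ET"); loading is a separate step
    import java.io.FileWriter;
    import java.io.PrintWriter;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SimpleEtl {
        public static void main(String[] args) throws Exception {
            try (Connection src = DriverManager.getConnection(
                     "jdbc:postgresql://source-host/sales", "etl", "secret");   // illustrative source
                 Statement st = src.createStatement();
                 ResultSet rs = st.executeQuery("SELECT id, name, amount FROM orders");
                 PrintWriter out = new PrintWriter(new FileWriter("orders.csv"))) {
                while (rs.next()) {
                    // "transform": normalise the name and format the amount (no CSV escaping, kept simple)
                    String name = rs.getString("name").trim().toUpperCase();
                    out.printf("%d,%s,%.2f%n", rs.getLong("id"), name, rs.getDouble("amount"));
                }
            }
            // "load": hand orders.csv to the target database's bulk loader or batch INSERTs
        }
    }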
Beyond that, you can grab any of the countless ETL and ESB packages already available in Java, free and commercial, from small libraries to full-blown processing systems, and simply adopt the one you like best.
Get a whiteboard, string some bubbles together with lines, and turn that into code.

To answer the question, "What's the best practice?" the answer depends on what you are trying to accomplish.
To simplify, let's assume you are doing one of the following:
You are building a data warehouse that will restructure the data in some way
You are moving data from point A to point B, but you are not restructuring the data
When I use the word "restructuring", I mean changing the grain or lowest level of detail of a table.
For 1, the ten steps outlined in your question are generally followed. General best practices:
Push as much transformation logic as possible onto database resources rather than the ETL software (ETL software is generally slower); see the sketch after this list
Use the Validate, Transform, and Audit steps to apply whatever Master Data Management (MDM) standards your organization follows
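To make the first point concrete, here is a hedged sketch of pushing a set-based transformation down to the database instead of looping over rows in Java; the connection details and table names (stg_orders, fact_sales) are invented:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class PushDownTransform {
        public static void main(String[] args) throws Exception {
            // one set-based statement lets the database engine do the aggregation work
            try (Connection con = DriverManager.getConnection(
                     "jdbc:oracle:thin:@dwh-host:1521/DWH", "etl", "secret");   // illustrative target
                 Statement st = con.createStatement()) {
                st.executeUpdate(
                    "INSERT INTO fact_sales (day_id, product_id, total_amount) " +
                    "SELECT TRUNC(order_date), product_id, SUM(amount) " +
                    "  FROM stg_orders " +
                    " GROUP BY TRUNC(order_date), product_id");
            }
        }
    }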
For 2, this is much more straightforward, so either method outlined in your question can be used.

Related

Server design suggestions to support multiple workflows managed by multiple teams

Looking for input on a design problem. We are redesigning our existing server, which has served us well so far but won't scale in the future.
Current design: It is one server (we run multiple instances of the same server) which handles many workflows. Let's call these workflows A, B, C, and D, all handled inside the server. Until now, we have had one development team working on this server, which made handling releases easy. Performance is also decent because we leverage in-memory caching.
Future design: We now have multiple teams (each team handling one workflow: Team A handling workflow A, Team B handling workflow B, and so on). With this new team structure and the current design, we are unable to scale our releases (since it's one server, only one team can release at any given time, reducing overall team efficiency). There is a need for isolation so teams can release their changes to workflows independently of each other. Also, we expect more workflows to be onboarded into this server.
Any design ideas on how we can solve this problem of ever-increasing workflows?
My current solution: Split the server into four servers so each team can manage its workflows individually. The disadvantage of this approach is code management: most of these workflows share a common code base. Also, splitting the server causes us to lose out on the cache (which is not an issue with the current design).
Look forward to hearing your suggestions.
Splitting into separate workflow servers along team lines makes complete sense. Some advantages that come to mind are:
Independent releases like you mentioned.
Crashes/Memory leaks/resource hogging from workflow A won't affect workflow B
Each workflow server can be scaled independently. A popular workflow A could be scaled out to more servers, while a rarely used workflow B could run on just one server.
There could be more, just pointing out the obvious ones supporting the split.
How to handle the disadvantages - let's work through them with the example of a library management system. Say we need workflows for a member borrowing a book, a member returning a book, and registering a new member.
Most of these workflows share a common code base
To resolve this, we identify the core common part; in my example I will take the definitions of book (id, name, field) and member (id, name, email). Besides the definitions, I can also have common functions that work on them, like serialisers, parsers, and validators.
Now my workflows will depend on this common repo. The borrow-a-book workflow will be completely different from the add-a-member workflow, but they will use the same building blocks.
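A minimal sketch of what such a shared module might contain; the class and method names are just illustrative:

    // Book.java - part of a shared "common-model" module that every workflow server depends on
    public class Book {
        private long id;
        private String name;
        private String field;
        // getters/setters omitted for brevity
    }

    // Member.java
    public class Member {
        private long id;
        private String name;
        private String email;
        public String getEmail() { return email; }
        // other getters/setters omitted for brevity
    }

    // MemberValidator.java - shared helper reused by the borrow-book and add-member workflows alike
    public final class MemberValidator {
        public static boolean isValid(Member m) {
            return m.getEmail() != null && m.getEmail().contains("@");
        }
    }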
the server causes us to lose out on cache
Exactly what needs to be cached and how the cache behaves is very important.
A fairly static cache (say, the member cache) can be set up on a distributed cache like Redis. Say there is a workflow that identifies approaching due dates for borrowed books and sends reminder emails to those members. Once the member IDs are identified, their emails can be looked up in the Redis cache.
A workflow can have a private cache as well. For example, when searching for books by name, the result can be cached in memory on that workflow server only, with a TTL, and served if the same query is asked again in the near future.
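A rough sketch of both kinds of cache, assuming Jedis as the Redis client; the key format, host, and TTL are illustrative:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import redis.clients.jedis.Jedis;

    public class WorkflowCaches {
        // shared, fairly static data lives in Redis so every workflow server sees the same view
        public String lookupMemberEmail(long memberId) {
            try (Jedis jedis = new Jedis("redis-host", 6379)) {
                return jedis.get("member:email:" + memberId);
            }
        }

        // per-server cache for search results: a plain map with a TTL check
        private static final long TTL_MILLIS = 60_000;
        private final Map<String, CachedResult> searchCache = new ConcurrentHashMap<>();

        public String searchBooks(String query) {
            CachedResult hit = searchCache.get(query);
            if (hit != null && System.currentTimeMillis() - hit.storedAt() < TTL_MILLIS) {
                return hit.payload();                    // served from local memory
            }
            String result = runSearchAgainstDatabase(query);
            searchCache.put(query, new CachedResult(result, System.currentTimeMillis()));
            return result;
        }

        private String runSearchAgainstDatabase(String query) {
            return "";                                   // placeholder for the real search
        }

        private record CachedResult(String payload, long storedAt) {}
    }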
To conclude, the disadvantages you have are nothing but design challenges. I hope that with this random example I was able to give you a few points to ponder. Depending on your actual use case, my answer might be completely irrelevant. If so, sincere apologies. :)

Is there a mature Java Workflow Engine for BPM backed by NoSQL?

I am researching how to build a general application or microservice to enable building workflow-centric applications. I have done some research on frameworks (see below), and the most promising candidates share a hard reliance upon RDBMSes to store workflow and process state, combined with JPA-annotated entities. In my opinion, this damages the possibility of designing a general, data-driven workflow microservice. It seems that a truly general workflow system could be built upon NoSQL solutions like MongoDB or Cassandra by storing data objects and rules in JSON or XML. These would allow executing code to enforce types or schemas while using one or two simple Java objects to retrieve and save entities. As I see it, this could enable a single application to be deployed as a Controller for different domains' Model-View pairs without modification (admittedly given a very clever interface).
I have tried to find a workflow engine/BPM framework that supports NoSQL backends. The closest I have found is Activiti-Neo4J, which appears to be an abandoned project providing a connector between Activiti and Neo4j.
Is there a Java workflow engine/BPM framework that supports NoSQL backends and generalizes data objects without requiring specific POJO entities?
If I were to give up on my ideal, magically general solution, I would probably choose a framework like jBPM or Activiti, since they have great feature sets and are mature. In trying to find other candidates, I have found a veritable graveyard of abandoned projects like this one on Java-Source.net.
Yes, Temporal Workflow has pluggable persistence and runs on Cassandra as well as on SQL databases. It has been tested with up to 100 Cassandra nodes and can support tens of thousands of events per second and hundreds of millions of open workflows.
It allows you to model your workflow logic as plain old Java classes and ensures that the code is fully fault tolerant and durable across all sorts of failures. This includes local variables and threads.
See this presentation that goes into more details about the programming model.
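For a feel of the programming model, here is a minimal sketch of a workflow with Temporal's Java SDK; the OrderWorkflow name and its logic are invented for illustration:

    import io.temporal.workflow.WorkflowInterface;
    import io.temporal.workflow.WorkflowMethod;

    // OrderWorkflow.java - the workflow contract
    @WorkflowInterface
    public interface OrderWorkflow {
        @WorkflowMethod
        String process(String orderId);
    }

    // OrderWorkflowImpl.java - plain Java; local variables and control flow survive worker
    // restarts because Temporal persists the event history and replays it
    public class OrderWorkflowImpl implements OrderWorkflow {
        @Override
        public String process(String orderId) {
            int attempts = 0;
            while (attempts < 3) {
                attempts++;
                // calls to external systems would go through activity stubs (omitted here)
            }
            return "processed " + orderId + " after " + attempts + " attempts";
        }
    }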
I think the reason why workflow engines are often based on an RDBMS is not the database schema but rather the coupling to a transaction-safe data store.
Transactional robustness is an important factor for workflow engines, especially for long-running or nested transactions which are typical for complex workflows.
So maybe this is one reason why most engines (like Activiti) did not focus on a data-driven approach. (I am not talking about data replication here, which is covered by NoSQL databases in most cases.)
If you take a look at the Imixs-Workflow project you will find a different approach based on Java Enterprise. This engine uses a generic data object which can consume any kind of serializable data values. The problem of data retrieval is solved with Lucene search technology. Each object is translated into a virtual document with name/value pairs for each item. This makes it easy to search through the processed business data as well as to query structured workflow data like the status information or the process owners. So this is one possible solution.
Apart from that, you always have the option to store your business data in a NoSQL database. This is independent of the workflow data of a running process instance, as long as you link both objects together.
Going back to the aspect of transactional robustness, it's a good idea to store the reference to your NoSQL data storage in the process instance, which is transaction-aware. Also take a look here.
The only problem you can run into is the fact that it's very hard to synchronize a transaction context from EJB/JPA to an 'external' NoSQL database. For example: what will you do when your data was successfully saved into your NoSQL data storage (e.g. Cassandra), but the transaction of the workflow engine fails and a rollback is triggered?
The designers of the Activiti project have also been aware of the problem you have stated, but knew it would be quite a re-write to implement such flexibility which, arguably, should have been designed into the project from the beginning. As you'll see in the link provided below, the problem has been a lack of interfaces toward which to code different implementations other than that of a relational database. With version 6 they went ahead and ripped off the bandaid and refactored the framework with a set of interfaces for which different implementations (think Neo4J, MongoDB or whatever other persistence technology you fancy) could be written and plugged in.
In the linked article below, they provide some code examples for a simple in-memory implementation of the aforementioned interfaces. Looks pretty cool and sounds like it may be precisely what you're looking for.
https://www.javacodegeeks.com/2015/09/pluggable-persistence-in-activiti-6.html

Data Driven Rules Engine - Drools

I have been evaluating Drools as a Rules Engine for use in our Business Web Application.
My use case is an Order Management application.
And the rules are of the following kind:
- If User Type is "SPECIAL" give an extra 5% discount.
- If User has made 10+ Purchases already, give an extra 3% discount.
- If Product Category is "OLD", give a Gift Hamper to the user worth $5.
- If Product Category is "NEW", give a Gift Hamper to the user worth $1
- If User has made purchases of over $1000 in the past, Shipping is Free
The immediate challenges I see are:
- There is no meaningful UI that I can offer to end users to modify the rules.
- The Guvnor UI, or any editor for modifying .drl files, is just not acceptable from an end-user point of view.
- Most of these rules will operate on often-huge data sets in the database.
So,
- I want a way for admin users to specify these rules from within my web app UI.
- Could I store these "rules" in the database and then operate on them via Drools? At least that would allow me to "modify" these rules via my "own" UI. So this would be something like a decision table in the DB.
- What is the best way to go about this?
You asked me to give an answer to your question, given my answer to Data driven business rules. My answer to that question was that SQL is a bad solution for executing business rules stored in the database. The person who asked that question wanted to generate SQL expressions from their stored business rules, and I cautioned against doing that, because it would lead to problems in security, testability, performance, and maintenance.
I have not used Drools, but I gather from documentation that it includes Guvnor, a business rules manager that supports using an RDBMS as a repository for user-defined rules.
[Drools] Guvnor uses the JCR standard for storing assets such as rules. The default implementation is Apache Jackrabbit, http://jackrabbit.apache.org. This includes an out of the box storage engine/database, which you can use as is, or configure to use an existing RDBMS if needed. (http://docs.jboss.org/drools/release/5.2.0.Final/drools-guvnor-docs/html/chap-database_configuration.html)
Apache Jackrabbit is not an RDBMS; it is a content repository, that is, "a hierarchical content store with support for structured and unstructured content, full text search, versioning, transactions, observation, and more." This seems like a more appropriate repository for Drools.
But Drools doesn't say it tries to use SQL to execute those business rules. It has a separate component, Drools Expert (the rules engine), to do that.
Drools Expert is a declarative, rule based, coding environment. This allows you to focus on "what it is you want to do", and not the "how to do this".
(http://www.jboss.org/drools/drools-expert.html)
SQL is also a declarative programming language, but it's designed to perform relational operations on table-structured data. A language to implement a rules engine has different goals, and can probably do things that SQL can't (and vice-versa).
So I would suggest that if you use Drools, feel free to use an RDBMS as a repository as they document (use their JCR-compliant content repository implementation; do not try to design your own). Then use Drools Expert as a specialized language designed for executing rules.
There is no meaningful UI that i can offer to the end users to modify the rules.
Out of the box, Guvnor provides web-based decision tables (and Excel if you prefer), which is what you say you would like to provide. It provides guided editors for more complex rules, but your rules would appear to be very simple.
Guvnor UI or any Editor to modify drl files is just not acceptable from end user point of view
As mentioned, Guvnor supports decision tables. If you don't like the layout of the Guvnor web application, then you can just embed the Guvnor editors into your own web application.
Most of these Rules will operate on often huge data available in db
The size of your database is irrelevant to the use of Guvnor. Guvnor is for editing rules, not runtime evaluation. Drools Expert is the runtime rules engine. It's fast. It can deal with very large volumes of data and very large volumes of rules. All you need to do is write database queries to get relevant chunks of that data into the rules engine at runtime. You need to do that, whatever solution you try to implement.
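To illustrate that last point, here is a rough sketch of what runtime evaluation might look like with the Drools API; the session name comes from a kmodule.xml you would define, and the fact objects stand in for your own domain classes (User, Order):

    import org.kie.api.KieServices;
    import org.kie.api.runtime.KieContainer;
    import org.kie.api.runtime.KieSession;

    public class DiscountEvaluator {
        // 'user' and 'order' would be your own domain objects, loaded from the database beforehand
        public void applyDiscounts(Object user, Object order) {
            KieServices services = KieServices.Factory.get();
            KieContainer container = services.getKieClasspathContainer();
            KieSession session = container.newKieSession("ordersSession"); // name defined in kmodule.xml
            try {
                // only the relevant chunk of data is inserted as facts
                session.insert(user);
                session.insert(order);
                session.fireAllRules();
            } finally {
                session.dispose();
            }
        }
    }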
On a side-note, if what you're really after is an explanation of when rules engines are good (and bad) solutions to a problem, then I would recommend reading the Why use a Rule engine? section of the Drools Expert manual.
Generally, I've found it is easier to work at a more abstract level, such as a domain model, and have some sort of programmatic conversion from that to Drools rules, instead of dealing with Drools rules directly. That way, you can store your domain model however you like, build UIs around it, and so on, and still have the option to generate Drools rules on demand. The challenge with this is creating a programmatic transformation from your model to Drools rules, but templating tools will help here. I've used Groovy templating for this, and it has worked well.
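As a rough illustration of that approach in plain Java (the RuleDefinition shape, the Order fact, and its addDiscountPercent method are all invented):

    // RuleDefinition.java - a rule stored in the database as plain data,
    // e.g. ("SPECIAL user discount", "userType", "SPECIAL", 5)
    public record RuleDefinition(String name, String field, String value, int discountPercent) {}

    // DrlGenerator.java - turn the stored definition into DRL text that Drools can compile at runtime
    public class DrlGenerator {
        public String toDrl(RuleDefinition rule) {
            return "rule \"" + rule.name() + "\"\n"
                 + "when\n"
                 + "    $o : Order(" + rule.field() + " == \"" + rule.value() + "\")\n"
                 + "then\n"
                 + "    $o.addDiscountPercent(" + rule.discountPercent() + ");\n"
                 + "end\n";
        }
    }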

Program Design - Package by Feature vs. Layer or Both?

I am in the design stage of a web application that allows users to create requests for work and allows workers to put time against those requests. The application will also have reporting capabilities for supervisors to get daily totals, reports, and an accounting of time spent ("cost allocation").
Applications I've worked on in the past have been designed using the package by layer approach. I'm thinking it would be more efficient to use a package by feature design and I have a question about this design.
What I am currently thinking for the packages by feature:
Requests - CRUD the requests, assign them, add invoice numbers, etc...
Work Time - CRUD daily time for users against requests, holiday, training, or meetings
Cost Allocation - create reports, accounting things that accountants want ...
The front end will be a Tomcat server with JSP, and the back end will be an Oracle database with EclipseLink handling persistence.
My question:
In my understanding of package by feature, the entities and DAOs would go into the package associated with them, spreading the persistence layer across several packages and leaving packages to call entities from other packages. With all of the overlap, is this really functional? There would be no isolation between the packages. What are the pros and cons to using package by feature? Would it be good design to go with an additional persistence layer? Or do I have this understanding totally wrong?
5 years later...
(Suspenseful music in the background)
Imagine this ridiculous situation:
A Managers company, a Programmers company, a Human Resources company, and a Marketing company, where the Programmers company has only programmers and no managers, marketers, or human resources;
We wouldn't want to split co-workers by their profession instead of organizing them into (self-coordinating) teams, or would we?
Packaging things together by what they are, rather than by what they do, will only make you jump through ten packages to find the place you are looking for.
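For the application in the question, a package-by-feature layout might look something like this (package and class names are illustrative):

    com.example.worktracker.requests        // Request, RequestDao, RequestService, RequestController
    com.example.worktracker.worktime        // TimeEntry, TimeEntryDao, WorkTimeService, WorkTimeController
    com.example.worktracker.costallocation  // CostReport, CostAllocationService, ReportController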
Now doesn't that just look sexy? By looking at the structure, you can already tell what the app is all about. Not satisfied? Read the full article.
I would suggest starting to package things based on business entities. Within each of those packages, you can divide things based on layers.
With all of the overlap, is this really functional?
I have been practising it for a long time, and I don't see any major issues with this approach. You must work out what to decouple and how much it should be decoupled. For example, calling a persistence method of the orders package from the customer package, using the API provided by orders, is fine by me.
What are the pros and cons to using package by feature?
I find it simpler, more direct, more understandable, and easier to work with than strict layer-oriented packaging. It also helps when you want to split and distribute things to different places.
Would it be good design to go with an additional persistence layer?
Look at this SO thread; I found that JPA and the like don't encourage the DAO pattern.
Further Reading
Generic Repository and DDD
If I had to choose between the two, package by feature vs. package by layer, I would choose package by layer.
For several reasons:
In a layered architecture the interfaces/dependencies between layers should be clearly defined, and appropriate packaging will quickly highlight whether you are introducing unwanted dependencies.
It isolates dependencies in one layer (for example, persistence using Oracle) from your other layers.
I find it cleaner to think of each layer in isolation
But to answer your question, features vs. layers or both: I would say both; package primarily by layers, then by features.

Use case for a Workflow Engine

We have an issue where a database table has to be updated with the status of a particular entity. Presently, it's all Java code with a lot of if conditions and an update to the status. I was thinking along the lines of using a workflow engine, since there can be multiple flows in the future. Is it overkill to use a workflow engine here... where do you draw the line?
It depends on the complexity of your use case.
In a simple use case, we have a database column updated by multiple consumers for each stage in an Order lifecycle. This is done by a web service calling into the database.
The simple lifecycle goes from ACKNOWLEDGED > ACCEPTED/REJECTED > FULFILLED > CLOSED. All of these are in the same table, in the same column. This is implemented in Java classes with no workflow engine.
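As a sketch of how simple that can stay in plain Java (the class names and transition rules are illustrative):

    // OrderStatus.java
    public enum OrderStatus { ACKNOWLEDGED, ACCEPTED, REJECTED, FULFILLED, CLOSED }

    // OrderStatusService.java - validate the transition, then update the single status column
    public class OrderStatusService {
        public void advance(long orderId, OrderStatus current, OrderStatus next) {
            boolean allowed = switch (current) {
                case ACKNOWLEDGED     -> next == OrderStatus.ACCEPTED || next == OrderStatus.REJECTED;
                case ACCEPTED         -> next == OrderStatus.FULFILLED;
                case FULFILLED        -> next == OrderStatus.CLOSED;
                case REJECTED, CLOSED -> false;
            };
            if (!allowed) {
                throw new IllegalStateException(current + " -> " + next + " is not a valid transition");
            }
            // e.g. UPDATE orders SET status = ? WHERE id = ?  (JDBC/JPA call omitted)
        }
    }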
A workflow engine is suited to more complex use cases involving actions on multiple data providers (e.g. a database, content management, document management, or a search engine), multiple parallel processes, forking based on the success/failure of a previous step, sending an email at a certain step, or offline error alerting.
You can look at Apache ODE to implement this.
We have an issue where a database table has to be updated with the status of a particular entity. Presently, it's all Java code with a lot of if conditions and an update to the status.
Sounds like a one-off job; there is no need for orchestrating actions among workflow participants.
Maybe a rule engine is better suited for this. Drools could be a good candidate. When X then Y.
If you're using Spring, this is a good article on how to implement your requirement
http://www.javaworld.com/javaworld/jw-04-2005/jw-0411-spring.html
I think you should consider a workflow engine. Workflow should be separated from application logic.
Reasons:
Maintainability: easier to modify, easier to add new flows, and even easier to replace with another workflow engine.
Business process management: workflows are mostly software representations of business processes, so they are usually designed by process designers (non-technical people). That makes it a bad idea to code them inside the application. Instead, BPM products such as ALBPM or jBPM, which support graphical workflow design, should be used.
Monitoring business flows: they are often monitored by top-level managers and used to make strategic decisions.
Easier data mining, reporting, and statistics.
ALBPM (now Oracle BPM) is a commercial tool from Oracle suitable for large-scope projects.
My recommendation is jBPM, an open-source tool from JBoss. Unlike ALBPM, which requires a separate DB and application server, it can be packaged with your application and run as another module inside it. I think it is suitable for your project.
