I want to get started with ETL in Java, using IntelliJ. I wanted to know how the integration can be done, or which tools are compatible with IntelliJ.
Also, are there any tutorials on the basics of ETL with Java?
What exactly will I need if I want to do data transformation?
It can be basic, like just taking random input from a file and transforming the data based on particular logic.
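For the basic case, no special IntelliJ integration is needed - a plain Java project is enough. A minimal sketch of the kind of file-based transform described above (input.csv and output.csv are made-up names, and the transformation steps are just stand-ins for your real logic):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.stream.Collectors;

    // Minimal extract-transform-load in plain Java: read lines from a file,
    // apply some transformation logic, and write the result to a target file.
    public class SimpleEtl {
        public static void main(String[] args) throws IOException {
            // Extract: read all lines from the (made-up) source file.
            List<String> lines = Files.readAllLines(Paths.get("input.csv"));

            // Transform: trim whitespace, drop blank lines, remove duplicates,
            // and upper-case everything.
            List<String> transformed = lines.stream()
                    .map(String::trim)
                    .filter(line -> !line.isEmpty())
                    .distinct()
                    .map(String::toUpperCase)
                    .collect(Collectors.toList());

            // Load: write the transformed rows to the target file.
            Files.write(Paths.get("output.csv"), transformed);
        }
    }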
Creating code to extract (query different sources like databases, XML, web services, etc.), to transform (you know, make everything compatible, remove duplicates, create dimensions and facts), and to load it into targets (databases and more)...
None of this is new. Java is great, but creating ETLs with it means creating a non-standard app... It is going to become legacy, and then you will need to build a scheduler to run the loads and to integrate with several other components.
So, instead of creating a Java app, I strongly recommend taking a look at products like Informatica PowerCenter and/or Oracle Data Integrator.
These solutions are the industry-wide standard for ETL worldwide; they provide objects and methods that spare you apps that are hard to maintain, and they sit on top of any application... They are also used for integration, migration, B2B, BI... you name it...
Good luck!
You would be re-inventing the wheel if you tried to create a Java-based ETL product.
Talend is a Java-based open-source ETL tool which gives you the features of an ETL tool and lets you write Java code to integrate...
Pentaho is another Java-based ETL tool.
Both of them are popular and have good UIs.
I know we can use the save function and load the model in a Spark application. But that works only inside Spark applications (Java, Scala, Python).
We can also use PMML to export the model to other types of applications.
Is there any way to use a Spark model in a standalone Java application?
I am one of the creators of MLeap. Check us out - it is meant for exactly your use case. If there is a transformer you need that is not currently supported, get in touch with me and we will get it in there.
Our serialization format is solely JSON/Protobuf right now, so it is very portable and supports large models like RandomForest. You can serialize your model to a zip file and then load it up wherever you need it.
Take a look at our demo for a use case:
https://github.com/TrueCar/mleap-demo
Currently no; your options are to use PMML for those models that support it, or to write your own framework for using models outside of Spark.
There is movement towards enabling this (see this issue). You could also check out MLeap.
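For the PMML route, a rough sketch using the JPMML-Evaluator library (the builder API shown is from newer JPMML-Evaluator releases, so check your version; the feature names here are made up, and some Spark mllib models can be exported via toPMML):

    import java.io.File;
    import java.util.HashMap;
    import java.util.LinkedHashMap;
    import java.util.Map;

    import org.dmg.pmml.FieldName;
    import org.jpmml.evaluator.Evaluator;
    import org.jpmml.evaluator.FieldValue;
    import org.jpmml.evaluator.InputField;
    import org.jpmml.evaluator.LoadingModelEvaluatorBuilder;

    public class PmmlScoring {
        public static void main(String[] args) throws Exception {
            // Load the PMML file exported from Spark and self-check the model.
            Evaluator evaluator = new LoadingModelEvaluatorBuilder()
                    .load(new File("model.pmml"))
                    .build();
            evaluator.verify();

            // Raw input record; feature names are made up for illustration.
            Map<String, Object> inputRecord = new HashMap<>();
            inputRecord.put("feature1", 1.0);
            inputRecord.put("feature2", 2.0);

            // Bind the raw values to the model's declared input fields.
            Map<FieldName, FieldValue> arguments = new LinkedHashMap<>();
            for (InputField inputField : evaluator.getInputFields()) {
                FieldName name = inputField.getName();
                arguments.put(name, inputField.prepare(inputRecord.get(name.getValue())));
            }

            // Evaluate and print the target and output fields.
            Map<FieldName, ?> results = evaluator.evaluate(arguments);
            System.out.println(results);
        }
    }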
Our task is to create many statistical analyses from census data (a lot of data, but easy analysis - mostly (sub)sums of the data). The analyses are to be presented as tables and charts (on the web - in 2 languages - and as PDF).
Let's assume the problem of storing the data is solved (SQL, good structure). The web application (GWT) and PDF (iText) software is mostly done. We "only" have to change the data backend.
What is a good strategy to efficiently create those analyses and their representations (tables, charts)?
Two different ways come to mind:
simple Java programming: JDBC or JPA, plus JFreeChart (here we have experience; boring programming - see the sketch below)
BI tools: BIRT, Jasper, Pentaho, Palo... (learning to use them; boring pointing and clicking)
But is there perhaps a third way? A way between those two: using the BI tools' APIs to program the reports?
Is it worth learning to use a BI tool? (I think that with one it would be much easier to create additional reports or adjust existing ones.)
What do you think?
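For concreteness, a minimal sketch of the first option: a plain JDBC aggregation rendered with JFreeChart. The census table, column names, and connection URL are made up, and note that ChartUtilities was renamed ChartUtils in JFreeChart 1.5:

    import java.io.File;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    import org.jfree.chart.ChartFactory;
    import org.jfree.chart.ChartUtilities;
    import org.jfree.chart.JFreeChart;
    import org.jfree.chart.plot.PlotOrientation;
    import org.jfree.data.category.DefaultCategoryDataset;

    public class CensusChart {
        public static void main(String[] args) throws Exception {
            DefaultCategoryDataset dataset = new DefaultCategoryDataset();

            // Let the database do the aggregation; only the sums travel to Java.
            try (Connection con = DriverManager.getConnection(
                         "jdbc:mysql://localhost/census", "user", "password");
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT region, SUM(population) AS total FROM census GROUP BY region")) {
                while (rs.next()) {
                    dataset.addValue(rs.getLong("total"), "population", rs.getString("region"));
                }
            }

            // Render the aggregated values as a bar chart and save it as PNG.
            JFreeChart chart = ChartFactory.createBarChart(
                    "Population by region", "Region", "Population",
                    dataset, PlotOrientation.VERTICAL, false, false, false);
            ChartUtilities.saveChartAsPNG(new File("population.png"), chart, 800, 600);
        }
    }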
I've been in a similar situation recently. I've evaluated Pentaho BI server, so my remarks are based on that:
Pros of a custom system implemented in Java:
Possibility to exactly customise according to user needs
No need for a star schema (Pentaho analytics only works on star schemas! Pentaho reporting does not need one; it can work from simple SQL statements)
Because no star schema is needed, you can integrate it more easily with the rest of the system. By that I mean you can reuse existing data sources. Also, with a self-made UI, you can easily integrate it into the rest of the enterprise infrastructure (web portals, etc.)
Speed: custom SQL tuning, custom UI that is fast for your dataset, etc.
Pros of using a ready-made solution (eg. Pentaho BI Server):
After the initial setup, even non-technical users can create new analytics or modify existing ones. Easy exporting to Excel, PDF, etc.
Pentaho comes with many supporting tools (an ETL tool to import data, a scheduler to create reports periodically, etc.). It would be a huge cost to re-implement all of these. Also, the new Pentaho server has a dashboard feature, which means you can have a screen with charts and tables that updates as new data comes in.
If you can settle for the built-in features and need no extra customization, deployment time is only a fraction of what it takes to develop new software.
Pentaho has extensive Java-based APIs; you can create reports entirely in Java code, etc. Most of the core is open source; AFAIK only the parts in the Enterprise server (dashboard, analytics view, etc.) are closed source.
As far as I know, in the case of Pentaho there are APIs for the following:
Create and modify reports, generate reports in various formats
Access the scheduler
Create custom widgets for the Dashboard
Access the OLAP engine (e.g. create an MDX expression and get the results)
Since the BI server is a Spring container, you can more or less integrate it like any Spring app (e.g. you can access the Spring Security settings and plug in your custom enterprise security, etc.)
Though not an API, there are ways to integrate Pentaho's web-based report viewer into other web applications (the easiest way is to use an IFRAME and customize the report via URL parameters)
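As an illustration of the report API, a rough sketch of rendering a .prpt report definition to PDF with the classic reporting engine. The class and package names follow the reporting SDK examples and may differ between versions, and census-report.prpt is a made-up resource name:

    import java.io.FileOutputStream;
    import java.net.URL;

    import org.pentaho.reporting.engine.classic.core.ClassicEngineBoot;
    import org.pentaho.reporting.engine.classic.core.MasterReport;
    import org.pentaho.reporting.engine.classic.core.modules.output.pageable.pdf.PdfReportUtil;
    import org.pentaho.reporting.libraries.resourceloader.Resource;
    import org.pentaho.reporting.libraries.resourceloader.ResourceManager;

    public class ReportToPdf {
        public static void main(String[] args) throws Exception {
            // Boot the reporting engine once per JVM.
            ClassicEngineBoot.getInstance().start();

            // Load a report definition created with the Pentaho Report Designer.
            ResourceManager manager = new ResourceManager();
            manager.registerDefaults();
            URL reportUrl = ReportToPdf.class.getResource("/census-report.prpt");
            Resource resource = manager.createDirectly(reportUrl, MasterReport.class);
            MasterReport report = (MasterReport) resource.getResource();

            // Render the report to PDF entirely from Java code.
            try (FileOutputStream out = new FileOutputStream("census-report.pdf")) {
                PdfReportUtil.createPDF(report, out);
            }
        }
    }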
You might have a look at icCube:
Java-based back-end
easy to create a cube model from your SQL data structure
the front end (charts) is a pure JavaScript API (www, www)
possibility to use your own charting library
No pointing and clicking should be required, I guess. Once set up, you would have great potential for providing analysis that goes beyond simple reporting (e.g., histogram comparison).
Some questions:
Why, when I generate a new Play app with Scala, is there no model folder?
Can I use JPA instead of Anorm?
I saw some similarities between Ruby on Rails and Play. So are there any helper methods in the Play Framework (form helpers, link helpers, etc.)?
Is it possible to use the Play Java Framework with Scala?
Why, when I generate a new Play app with Scala, is there no model folder?
For simple apps, you just define one models.scala file. You don't necessarily need a folder, even when you're using the models package. Less visual clutter, so to speak. When your app grows bigger, you can refactor and put everything in a separate folder.
Can I use JPA instead of Anorm?
Of course. But you should definitely check out Anorm, or Squeryl, or ...
I saw some similarities between Ruby on Rails and Play. So are there any helper methods in the Play Framework (form helpers, link helpers, etc.)?
There are some special shortcut tags, especially in the Groovy markup (check out the cheat sheet). Creating your own partial components is simple enough, however. The concept of a helper doesn't really exist, AFAIK.
Is it possible to use the Play Java Framework with Scala?
Haven't tried it myself, but I've read that you can mix Java and Scala classes; the compiler will compile everything you throw at it.
I would recommend working through the tutorial on the Play website. Also, you may want to try the scalagen module - it gives you an easy way to generate code while you are learning, at least. A quick disclaimer - I wrote the module :-)
I have created an ontology. Now I want to create an application, but how can I perform CRUD operations on the OWL file? I came across different APIs like dotNetRDF, Jena, etc.; they all support RDF/RDFS, but there is no support for OWL files.
http://www.semanticoverflow.com/questions/2704/using-jena-to-query-owl-files
Problem of reading OWL/XML
Also, most of the APIs are available in Java, and I don't even know how to write a simple hello-world program in Java. I am confused by servlets, JSP, and .java files, and lots of configuration is required. So I prefer PHP.
So is there any API, or any alternative way, to query an OWL file in PHP?
Regards,
anas anjaria
The only libraries I know of that support Semantic Web standards in PHP are rdfapi [1] and the Redland PHP binding [2], but they work at the RDF level (i.e. the building block of RDFS and OWL), so you will need to add CRUD operations at the triple level (i.e. simple axioms like foaf:knows).
[1] http://www4.wiwiss.fu-berlin.de/bizer/rdfapi/
[2] http://librdf.org/docs/php.html
So, it looks like you're talking about the Web Ontology Language, an XML/RDF dialect.
A few moments in Google show pretty much zero interest in this in the world of PHP.
But, being XML, you can use one of the PHP XML extensions to read and work with the XML directly without a problem. How well this will actually work for you, I can't say. OWL looks freakishly complex, and working with it at the DOM-node level will very likely stretch your sanity far worse than working with mature, established libraries in Java.
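For comparison, basic CRUD on an OWL file with Jena in Java takes only a few lines. A rough sketch (the namespace and file name are made up; older Jena releases use com.hp.hpl.jena.* packages instead of org.apache.jena.*):

    import java.io.FileOutputStream;

    import org.apache.jena.ontology.Individual;
    import org.apache.jena.ontology.OntClass;
    import org.apache.jena.ontology.OntModel;
    import org.apache.jena.rdf.model.ModelFactory;

    public class OwlCrud {
        static final String NS = "http://example.org/onto#"; // made-up namespace

        public static void main(String[] args) throws Exception {
            // Read: load the ontology from an OWL file.
            OntModel model = ModelFactory.createOntologyModel();
            model.read("file:myontology.owl");

            // Create: add a class and an individual to the model.
            OntClass person = model.createClass(NS + "Person");
            Individual anas = person.createIndividual(NS + "anas");

            // Update: attach a property value to the individual.
            anas.addLabel("Anas", "en");

            // Delete would be anas.remove(), which drops all its statements.

            // Persist the changed model back to the file.
            try (FileOutputStream out = new FileOutputStream("myontology.owl")) {
                model.write(out, "RDF/XML");
            }
        }
    }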
I made my final project at university using Jena. The research group where I work develops an ontology generator tool that is capable of all CRUD operations. They have also developed an Eclipse plug-in for this project.
You just create your OWL data model in the editor and right-click the data model to create everything; it creates the OWL files, a CRUD class, and its test code for you.
Check it out:
Download
The name of the plug-in is "SEAGENT Ontology Generator Plugin (Beta)".
I hope it will be as beneficial for you as it was for me.
We have a Java application with its own OR mapper. Within this system we have what can be compared to Hibernate's interceptors (we call them triggers): they perform specific actions just before data is saved to the database, after it is deleted, and so on. The underlying database is MySQL.
Now we would like to use tools such as Pentaho Data Integration or Talend to convert data and put it into our system. It's no problem to do that directly at the SQL level, but by doing so we lose the built-in power of our triggers.
Is there a way to somehow integrate any of the data integration solutions into our existing application? It would be great if there were a way to write into instances of our classes instead of writing into the database directly.
Any hints welcome :-)
I'd prefer Talend, which is a Java code generator tool. (You can see my blog post at http://www.robertomarchetto.com/www/talend_studio_vs_kettle_pentao_pdi_comparison)
You could use a tJavaRow component so you can write Java code for each processed row. In tJavaRow you can call Hibernate code, for example using a custom class defined in a new routine.
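For illustration, the body of a tJavaRow component might look roughly like this; input_row and output_row are generated by Talend from the component's schema, and MyRoutines.save(...) stands in for a hypothetical routine wrapping your Hibernate code:

    // Body of a tJavaRow component; Talend generates input_row/output_row
    // from the component's schema (id and name are made-up columns).
    output_row.id = input_row.id;
    output_row.name = input_row.name.trim().toUpperCase();

    // Hypothetical routine that wraps your Hibernate/OR-mapper save logic,
    // so your triggers fire instead of a raw SQL write.
    MyRoutines.save(output_row.id, output_row.name);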
Two ways with Pentaho Data Integration come to mind straight off:
Simply create a plugin which adds/deletes data - you could copy the existing Salesforce insert/update plugins; they would be a good start. Rip out all the Salesforce code and replace it with yours.
Perhaps harder, but maybe more satisfying: write a JDBC driver which uses your code!
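The second option relies only on the standard java.sql.Driver interface. A minimal skeleton (the jdbc:myapp: URL prefix is made up, and the actual Connection that routes writes through your OR mapper is left as a stub):

    import java.sql.Connection;
    import java.sql.Driver;
    import java.sql.DriverManager;
    import java.sql.DriverPropertyInfo;
    import java.sql.SQLException;
    import java.util.Properties;
    import java.util.logging.Logger;

    // Skeleton of a JDBC driver that PDI/Talend could load like any other
    // driver, while writes are routed through your OR mapper and triggers.
    public class TriggeringDriver implements Driver {

        static {
            try {
                DriverManager.registerDriver(new TriggeringDriver());
            } catch (SQLException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        @Override
        public boolean acceptsURL(String url) {
            return url != null && url.startsWith("jdbc:myapp:"); // made-up prefix
        }

        @Override
        public Connection connect(String url, Properties info) throws SQLException {
            if (!acceptsURL(url)) {
                return null; // JDBC contract: decline URLs meant for other drivers
            }
            // Here you would return a Connection whose Statement/PreparedStatement
            // implementations translate INSERT/UPDATE/DELETE into calls on your
            // domain classes, so the built-in triggers still fire.
            throw new SQLException("OR-mapper-backed Connection not implemented yet");
        }

        @Override
        public DriverPropertyInfo[] getPropertyInfo(String url, Properties info) {
            return new DriverPropertyInfo[0];
        }

        @Override public int getMajorVersion() { return 1; }
        @Override public int getMinorVersion() { return 0; }
        @Override public boolean jdbcCompliant() { return false; }
        @Override public Logger getParentLogger() { return Logger.getLogger("TriggeringDriver"); }
    }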