I have this new challenge to load ~100M rows from an Oracle database and insert them in a remote MySQL database server.
I've divided the problem in two:
a server side REST server responsible for loading data into the MySQL server;
a client side Java app that is responsible for reading from the Oracle data source.
On the Java side I've used plain JDBC to load paginated content and transfer it over the wire to the server. This approach works, but it makes the code cumbersome and not very scalable, as I'm doing the pagination myself using Oracle's ROWNUM (WHERE ROWNUM > x AND ROWNUM < y).
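For reference, the pagination pattern I'm using looks roughly like this (the table, key column, MyRow type and mapRow helper are simplified placeholders):

// A sketch of one paginated read; 'conn', 'offset' and 'pageSize' come from the caller.
// SOURCE_TABLE and ID are placeholders for the real Oracle table and key column.
private List<MyRow> readPage(Connection conn, long offset, int pageSize) throws SQLException {
    String sql =
        "SELECT * FROM (" +
        "  SELECT t.*, ROWNUM rn FROM (SELECT * FROM SOURCE_TABLE ORDER BY ID) t" +
        "  WHERE ROWNUM <= ?" +
        ") WHERE rn > ?";
    List<MyRow> page = new ArrayList<>();
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setLong(1, offset + pageSize); // upper bound (inclusive)
        ps.setLong(2, offset);            // lower bound (exclusive)
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                page.add(mapRow(rs)); // mapRow turns a ResultSet row into my DTO
            }
        }
    }
    return page;
}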
I've now tried Hibernate's StatelessSession with my entities mapped through Annotations. The code is much more readable and clean but the performance is worse.
I've heard of ETL tools and SpringBatch but I don't know them very well.
Are there other approaches to this problem?
Thanks in advance.
UPDATE
Thank you for the invaluable suggestions.
I've opted for using Spring Batch to load data from the Oracle database, because the environment is pretty locked down and I don't have access to Oracle's toolset. Spring Batch is tried and true.
For the data writing step I opted for writing chunks of records using MySQL's LOAD DATA INFILE, as you all suggested. REST services sit in the middle because the two databases are hidden from each other for security reasons.
100M rows is quite a lot. You can design it in plenty of ways: REST servers, JDBC reading, Spring Batch, Spring integration, Hibernate, ETL. But the bottom line is: time.
No matter what architecture you choose, you eventually have to perform these INSERTs into MySQL. Your mileage may vary but just to give you an order of magnitude: with 2K inserts per second it'll take half a day to populate MySQL with 100M rows (source).
According to the same source LOAD DATA INFILE can handle around 25K inserts/second (roughly 10x more and about an hour of work).
That being said with such an amount of data I would suggest:
dump the Oracle table using native Oracle tools that produce human-readable output (or machine-readable, as long as you can parse it)
parse the dump file using the fastest tools you can. Maybe grep/sed/gawk/cut will be enough?
generate a target file compatible with MySQL's LOAD DATA INFILE (it is very configurable)
import the file into MySQL using the aforementioned command
Of course you can do this in Java with nice and readable code, unit tested and versioned. But with this amount of data you need to be pragmatic.
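For the final import step, a minimal sketch of issuing LOAD DATA LOCAL INFILE through plain JDBC (assuming MySQL Connector/J with allowLoadLocalInfile enabled on both ends; the file and table names are hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LoadDataInfileDemo {
    public static void main(String[] args) throws Exception {
        // LOCAL loads must be allowed on both the client and the server
        String url = "jdbc:mysql://localhost:3306/targetdb?allowLoadLocalInfile=true";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement()) {
            // chunk.csv is one of the generated chunks; TARGET_TABLE is a placeholder
            stmt.execute(
                "LOAD DATA LOCAL INFILE '/tmp/chunk.csv' " +
                "INTO TABLE TARGET_TABLE " +
                "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' " +
                "LINES TERMINATED BY '\\n'");
        }
    }
}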
That is for the initial load. After that, Spring Batch will probably be a good choice. If you can, try to connect your application directly to both databases - again, this will be faster. On the other hand this might not be possible for security reasons.
If you want to be very flexible and not tie yourself to the databases directly, expose both the input (Oracle) and output (MySQL) behind web services (REST is fine as well). Spring Integration will help you a lot.
You can use Scriptella to transfer data between databases. Here is an example of an XML transformation file:
<!DOCTYPE etl SYSTEM "http://scriptella.javaforge.com/dtd/etl.dtd">
<etl>
    <connection id="in" url="jdbc:oracle:thin:@localhost:1521:ORCL"
                classpath="ojdbc14.jar" user="scott" password="tiger"/>
    <connection id="out" url="jdbc:mysql://localhost:3306/fromdb"
                classpath="mysql-connector.jar" user="user" password="password"/>
    <!-- Copy all table rows from one database to another -->
    <query connection-id="in">
        SELECT * FROM Src_Table
        <!-- For each row, execute an insert -->
        <script connection-id="out">
            INSERT INTO Dest_Table(ID, Name) VALUES (?id, ?name)
        </script>
    </query>
</etl>
Related
I have a third-party system that generates a large amount of data each day (these are CSV files stored on FTP). There are 3 types of files being generated:
every 15 minutes (2 files); these files are pretty small (~2 MB)
every day at 5 PM (~200-300 MB)
every midnight (this CSV file is about 1 GB)
Overall, the four CSVs add up to about 1.5 GB per day, and we should take into account that some of the files are generated every 15 minutes. The data also needs to be aggregated (not a hard process, but it will definitely take time), and I need fast responses.
I am thinking about how to store this data and about the overall implementation.
We have a Java stack. The database is MS SQL Standard. From my measurements, MS SQL Standard shared with the other applications won't handle such a load. What comes to my mind:
Upgrade to MS SQL Enterprise on a separate server.
Use PostgreSQL on a separate server. Right now I'm working on a PoC for this approach.
What would you recommend here? Perhaps there are better alternatives.
Edit #1
Those large files are new data for each day.
Okay. After spending some time with this problem (reading, consulting, experimenting, and doing several PoCs), I came up with the following solution.
Tl;dr
Database: PostgreSQL, as it handles CSV well and is free and open source.
Tool: Apache Spark is a good fit for this type of task, with good performance.
DB
Regarding the database, it is an important decision: what to pick, and how it will cope in the future with such an amount of data. It should definitely be a separate server instance, so it does not generate additional load on the main database instance or block other applications.
NoSQL
I thought about using Cassandra here, but that solution would be too complex right now. Cassandra does not have ad-hoc queries; its storage layer is basically a key-value system, which means you must "model" your data around the queries you need, rather than around the structure of the data itself.
RDBMS
I didn't want to over-engineer here, so I settled on a relational database.
MS SQL Server
It is a viable way to go, but the big downside is pricing: the Enterprise edition costs a lot of money given our hardware. For details, you could read this pricing policy document.
Another drawback was CSV support, and CSV will be our main data source. MS SQL Server can neither import nor export CSV reliably:
MS SQL Server silently truncating a text field.
MS SQL Server's text encoding handling going wrong.
MS SQL Server throwing an error message because it doesn't understand quoting or escaping.
More on that comparison can be found in the article PostgreSQL vs. MS SQL Server.
PostgreSQL
This database is a mature, battle-tested product. I've heard a lot of positive feedback about it from others (of course, there are some trade-offs too). It has a more classic SQL syntax and good CSV support, and moreover it is open source.
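To illustrate that CSV support, here is a minimal sketch using the PostgreSQL JDBC driver's CopyManager (the table name and file path are hypothetical):

import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class PgCsvLoadDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/reports", "user", "password")) {
            CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
            // Stream the CSV straight into the table with COPY (much faster than row-by-row INSERTs)
            try (FileReader reader = new FileReader("/data/daily_export.csv")) {
                long rows = copy.copyIn(
                        "COPY daily_data FROM STDIN WITH (FORMAT csv, HEADER)", reader);
                System.out.println("Loaded " + rows + " rows");
            }
        }
    }
}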
It is worth mentioning that SSMS is way better than pgAdmin: SSMS has autocomplete and multiple result sets (when you run several queries you get all of the results at once, whereas in pgAdmin you only get the last one).
Anyway, right now I'm using DataGrip from JetBrains.
Processing Tool
I've looked at Spring Batch and Apache Spark. Spring Batch is a bit too low-level for this task, and Apache Spark makes it easier to scale if that is needed in the future. That said, Spring Batch could also do the job.
For an Apache Spark example, the code can be found in the learning-spark project.
My choice is Apache Spark for now.
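For flavor, a rough sketch of the pipeline in Spark's Java API (the paths, the aggregation key and the connection details are all hypothetical):

import java.util.Properties;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class CsvAggregationJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-aggregation")
                .master("local[*]") // or a real cluster URL
                .getOrCreate();

        // Read all CSVs dropped by the third-party system
        Dataset<Row> raw = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("/data/incoming/*.csv");

        // Hypothetical aggregation: row counts per key column
        Dataset<Row> aggregated = raw.groupBy("account_id").count();

        // Write the result into PostgreSQL over JDBC
        Properties props = new Properties();
        props.setProperty("user", "report_user");
        props.setProperty("password", "secret");
        aggregated.write()
                .mode(SaveMode.Append)
                .jdbc("jdbc:postgresql://localhost:5432/reports", "daily_aggregates", props);

        spark.stop();
    }
}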
You might consider looking into the Apache Spark project. After validating and curating the data, you could use Presto to run queries.
You could use uniVocity-parsers to process the CSV as fast as possible, as this library comes with the fastest CSV parser around. I'm the author of this library and it is open source and free (Apache 2.0 license).
Now, for loading the data into a database, you could try the univocity framework (commercial). We use it to load massive amounts of data into databases such as SQL Server and PostgreSQL very quickly - from 25K to 200K rows/second, depending on the database and its configuration.
Here's a simple example of how the code to migrate your CSVs might look:
public static void main(String... args) {
    // Configure the CSV input directory
    CsvDataStoreConfiguration csv = new CsvDataStoreConfiguration("csv");
    csv.addEntitiesFromDirectory(new File("/path/to/csv/dir/"), "ISO-8859-1");

    // Grab column names from the CSV headers
    csv.getDefaultEntityConfiguration().setHeaderExtractionEnabled(true);

    javax.sql.DataSource dataSource = connectToDatabaseAndGetDataSource(); // specific to your environment

    // Configure the target database
    JdbcDataStoreConfiguration database = new JdbcDataStoreConfiguration("database", dataSource);

    // Use only for PostgreSQL - its JDBC driver requires us to convert the input Strings
    // from the CSV to the correct column types.
    database.getDefaultEntityConfiguration().setParameterConversionEnabled(true);

    DataIntegrationEngine engine = Univocity.getEngine(new EngineConfiguration(csv, database));

    // Create a mapping between the "csv" and "database" data stores
    DataStoreMapping mapping = engine.map(csv, database);

    // If the names of the CSV files and their columns match the database tables and their
    // columns, we can detect the mappings automatically
    mapping.autodetectMappings();

    // Load the database
    engine.executeCycle();
}
To improve performance, the framework lets you manage the database schema and perform operations such as dropping constraints and indexes, loading the data, and recreating them. Data and schema transformations are also very well supported if you need them.
Hope this helps.
Pentaho Data Integration (or a similar ETL tool) can handle importing the data into a SQL database and can do aggregation on the fly. PDI has a community edition and can be run stand-alone or via a Java API.
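If you go the Java API route with PDI, a minimal sketch of running an existing transformation (the .ktr path and its contents are hypothetical) might look like this:

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunPdiTransformation {
    public static void main(String[] args) throws Exception {
        // Initialize the Kettle/PDI environment (loads plugins, etc.)
        KettleEnvironment.init();

        // Load a transformation designed in Spoon, e.g. "CSV input -> aggregate -> table output"
        TransMeta transMeta = new TransMeta("/etl/aggregate_csv.ktr");
        Trans trans = new Trans(transMeta);

        trans.execute(null);        // start the transformation
        trans.waitUntilFinished();  // block until it completes

        if (trans.getErrors() > 0) {
            throw new RuntimeException("Transformation finished with errors");
        }
    }
}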
I need to develop an application that can fetch data from multiple data sources (Oracle, Excel, Microsoft SQL Server, and so on) using one SQL query. For example:
SELECT o.employeeId, count(o.orderId)
FROM employees#excel e, customers#microsoftsql c, orders#oracle o
WHERE o.employeeId = e.employeeId and o.customerId = c.customerId
GROUP BY o.employeeId;
The SQL and the data sources must be changeable dynamically by the Java program. My customers want to write and run SQL-like queries against different databases and storage at the same time, with GROUP BY, HAVING, COUNT, SUM and so on, in my application's web interface. Other requirements are performance and a light-weight solution.
I have found these ways to do it (with the drawbacks I see; please correct me if I'm wrong):
Apache Spark (drawbacks: a heavy solution, better suited for Big Data, and slow if you need up-to-date information without caching it in Spark),
Distributed queries in the SQL server itself (Oracle database links, Microsoft SQL Server linked servers, Excel Power Query) - drawbacks: hard to change data sources dynamically from the Java program, and problematic with Excel,
PrestoDB (drawbacks: a heavy solution, better suited for Big Data),
Apache Drill (drawbacks: quite a young solution, some problems with older ODBC drivers and some bugs),
Apache Calcite (a light framework used by Apache Drill; drawback: still quite a young solution),
Doing the joins from the data sources manually (drawbacks: a lot of work to implement the joins correctly, do the "group by" on the result set, find the best execution plan, and so on).
Maybe you know another way (using free open-source solutions), or can give me advice from your experience about the options above? Any help would be greatly appreciated.
UnityJDBC is a commercial JDBC driver that wraps multiple data sources and allows you to treat them as if they were all part of the same database. It works as follows:
You define a "schema file" to describe each of your databases. The schema file resembles something like:
...
<TABLE>
    <semanticTableName>Database1.MY_TABLE</semanticTableName>
    <tableName>MY_TABLE</tableName>
    <numTuples>2000</numTuples>
    <FIELD>
        <semanticFieldName>MY_TABLE.MY_ID</semanticFieldName>
        <fieldName>MY_ID</fieldName>
        <dataType>3</dataType>
        <dataTypeName>DECIMAL</dataTypeName>
...
You also have a central "sources file" that references all of your schema files and gives connection information, and it looks like this:
<SOURCES>
    <DATABASE>
        <URL>jdbc:oracle:thin:@localhost:1521:xe</URL>
        <USER>scott</USER>
        <PASSWORD>tiger</PASSWORD>
        <DRIVER>oracle.jdbc.driver.OracleDriver</DRIVER>
        <SCHEMA>MyOracleSchema.xml</SCHEMA>
    </DATABASE>
    <DATABASE>
        <URL>jdbc:sqlserver://localhost:1433</URL>
        <USER>sa</USER>
        <PASSWORD>Password123</PASSWORD>
        <DRIVER>com.microsoft.sqlserver.jdbc.SQLServerDriver</DRIVER>
        <SCHEMA>MySQLServerSchema.xml</SCHEMA>
    </DATABASE>
</SOURCES>
You can then use unity.jdbc.UnityDriver to allow your Java code to run SQL that joins across databases, like so:
String sql = "SELECT *\n" +
"FROM MyOracleDB.Whatever, MySQLServerDB.Something\n" +
"WHERE MyOracleDB.Whatever.whatever_id = MySQLServerDB.Something.whatever_id";
stmt.execute(sql);
So it looks like UnityJDBC provides the functionality you need. However, I have to say that any solution that lets users execute arbitrary SQL joining tables across different databases sounds like a recipe for bringing your databases to their knees. The solution I would actually recommend for your type of requirements is to run ETL processes from all of your data sources into a single data warehouse and let your users query that; how to define those processes and that data warehouse is definitely too broad for a Stack Overflow question.
One appropriate solution is the DataNucleus platform, which has JDO, JPA and REST APIs. It supports almost every RDBMS (PostgreSQL, MySQL, SQL Server, Oracle, DB2, etc.), NoSQL datastores (map-based, graph-based, document-based, etc.), database web services, LDAP, and documents such as XLS, ODF and XML.
Alternatively you can use EclipseLink, which also has support for RDBMS, NoSQL, database web services and XML.
By using JDOQL, which is part of the JDO API, the requirement of having one query access multiple datastores is met. Both solutions are open source, relatively lightweight and performant.
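For a flavor of JDOQL, a minimal sketch (the persistence unit name and the Product class are hypothetical):

import java.util.List;
import javax.jdo.JDOHelper;
import javax.jdo.PersistenceManager;
import javax.jdo.PersistenceManagerFactory;
import javax.jdo.Query;

public class JdoqlDemo {
    public static void main(String[] args) {
        // "MyUnit" would be configured to point at whichever datastore the customer chose
        PersistenceManagerFactory pmf = JDOHelper.getPersistenceManagerFactory("MyUnit");
        PersistenceManager pm = pmf.getPersistenceManager();
        try {
            // The same JDOQL works regardless of whether the datastore is an RDBMS, Excel, etc.
            Query query = pm.newQuery(
                    "SELECT FROM mydomain.Product WHERE price < 200 ORDER BY price ASC");
            List<?> results = (List<?>) query.execute();
            results.forEach(System.out::println);
        } finally {
            pm.close();
        }
    }
}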
Why did I suggest this solution?
From your requirements it's understood that the datastore will be your customer's choice and you are not looking for a Big Data solution.
You prefer open-source solutions that are lightweight and performant.
Considering your use case, you might require a data management platform with polyglot persistence behaviour, able to leverage multiple datastores based on your/your customer's use cases.
To read more about polyglot persistence:
https://dzone.com/articles/polyglot-persistence-future
https://www.mapr.com/products/polyglot-persistence
SQL dialects are tied to the database management system: SQL Server requires different SQL statements than an Oracle server.
My suggestion is to use JPA. It is completely independent of your database management system and makes development in Java much more efficient.
The downside is that you cannot combine several database systems with JPA out of the box (like a 1:1 relation between a SQL Server table and an Oracle table). You could, however, create several EntityManagerFactories (one for each database) and link them together in your code, as sketched below.
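A rough sketch of that idea, assuming two persistence units named "oraclePU" and "sqlserverPU" in persistence.xml and hypothetical PurchaseOrder/Customer entities:

import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public class TwoDatabasesDemo {
    public static void main(String[] args) {
        EntityManagerFactory oracleEmf = Persistence.createEntityManagerFactory("oraclePU");
        EntityManagerFactory sqlServerEmf = Persistence.createEntityManagerFactory("sqlserverPU");

        EntityManager oracleEm = oracleEmf.createEntityManager();
        EntityManager sqlServerEm = sqlServerEmf.createEntityManager();
        try {
            // Query each database with its own EntityManager...
            List<PurchaseOrder> orders = oracleEm
                    .createQuery("SELECT o FROM PurchaseOrder o", PurchaseOrder.class)
                    .getResultList();
            List<Customer> customers = sqlServerEm
                    .createQuery("SELECT c FROM Customer c", Customer.class)
                    .getResultList();

            // ...and combine the results in plain Java code (the "linking" happens here)
            for (PurchaseOrder order : orders) {
                customers.stream()
                        .filter(c -> c.getId().equals(order.getCustomerId()))
                        .findFirst()
                        .ifPresent(c -> System.out.println(order.getId() + " -> " + c.getName()));
            }
        } finally {
            oracleEm.close();
            sqlServerEm.close();
        }
    }
}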
Pros for JPA in this scenario:
write JPQL queries that are independent of the database management system
reduces the required Java code
Cons for JPA:
you cannot relate entities from different databases (like in a 1:1 relationship)
you cannot query several databases with one query (combining tables from different databases in a group by or similar)
More information:
Wikipedia
I would recommend Presto and Calcite.
Performance and light weight don't always go hand in hand.
Presto: quite a lot of proven usage, as you said, in "big data". It performs well and scales well. I don't know exactly what "lightweight" means to you; if requiring fewer machines is part of it, you can definitely scale down according to your needs.
Calcite: embedded in a lot of data analytics libraries like Drill, Kylin and Phoenix. It does what you need ("connecting to multiple DBs") and, most importantly, it is lightweight.
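To give an idea of the Calcite approach, a minimal sketch of querying across two sources through its JDBC driver (the model.json path, schema names and tables are hypothetical; the model file would define one JDBC-backed schema per data source):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class CalciteFederationDemo {
    public static void main(String[] args) throws Exception {
        Properties info = new Properties();
        // model.json declares e.g. an "oracle" schema and an "mssql" schema,
        // each backed by Calcite's JDBC adapter
        info.setProperty("model", "/etc/federation/model.json");

        try (Connection conn = DriverManager.getConnection("jdbc:calcite:", info);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT o.\"employeeId\", COUNT(o.\"orderId\") "
               + "FROM \"oracle\".\"orders\" o "
               + "JOIN \"mssql\".\"customers\" c ON o.\"customerId\" = c.\"customerId\" "
               + "GROUP BY o.\"employeeId\"")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}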
Having experience with some of the candidates (Apache Spark, PrestoDB, Apache Drill) makes me choose PrestoDB. Even though it is mostly used for big data, I think it is easy to set up and it has support for (almost) everything you are asking for. There are plenty of resources available online (including running it in Docker), and it also has excellent documentation, an active community, and backing from two companies (Facebook and Netflix).
Multiple Databases on Multiple Servers from Different Vendors
The most challenging case is when the databases are on different servers and some of the servers run different database software. For example, the customers database may be hosted on machine X on Oracle, and the orders database may be hosted on machine Y with Microsoft SQL Server. Even if both databases are hosted on machine X but one is on Oracle and the other on Microsoft SQL Server, the problem is the same: somehow the information in these databases must be shared across the different platforms. Many commercial databases support this feature using some form of federation, integration components, or table linking (e.g. IBM, Oracle, Microsoft), but support in the open-source databases (HSQL, MySQL, PostgreSQL) is limited.
There are various techniques for handling this problem:
Table linking and federation - link tables from one source into another for querying
Custom code - write code and multiple queries to manually combine the data
Data warehousing/ETL - extract, transform, and load the data into another source
Mediation software - write one query that is translated by a mediator to extract the data required
Maybe this is a vague idea, but try Apache Solr. Take the different data sources and import the data into Apache Solr. Once the data is indexed, you can write different queries against it.
It is an open-source search platform that makes sure your searches are fast.
That's what the Hibernate framework is for: Hibernate has its own query language, HQL, which is mostly identical to SQL. Hibernate acts as middleware, converting HQL queries into database-specific queries.
I'm using Spring, connecting to SQL Server 2008 R2 via JDBC.
All I need is to insert a large amount of data into a table in the database as fast as possible. I'm wondering which way is better:
Use Spring's batch insert, as mentioned here
Create a stored procedure in the database and call it from the Java side
Which one is better?
It depends on two things: a stored procedure takes up time on the database side, whereas a batch takes up time on the program side, so it really comes down to what you are more concerned with. I would prefer the batch, to keep the database free and reduce the errors that might occur. Hope this helps!
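If you go with the batch approach, a minimal sketch using Spring's JdbcTemplate (the table, columns and MyRecord type are hypothetical) could look like this:

import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import org.springframework.jdbc.core.BatchPreparedStatementSetter;
import org.springframework.jdbc.core.JdbcTemplate;

public class BatchInsertDao {
    private final JdbcTemplate jdbcTemplate;

    public BatchInsertDao(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public void insertAll(final List<MyRecord> records) {
        jdbcTemplate.batchUpdate(
            "INSERT INTO my_table (id, name) VALUES (?, ?)",
            new BatchPreparedStatementSetter() {
                @Override
                public void setValues(PreparedStatement ps, int i) throws SQLException {
                    ps.setLong(1, records.get(i).getId());
                    ps.setString(2, records.get(i).getName());
                }

                @Override
                public int getBatchSize() {
                    return records.size();
                }
            });
    }
}

Either way, sending the rows in batches avoids a network round trip per insert.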
Spring Batch is an excellent framework and it can be used as an ETL (Extract, Transform, Load) tool with respect to databases.
Spring Batch divides any import job into 3 steps (a minimal sketch of such a step follows the list):
1. Read: read data from any source. It can be another database, a file (XML, CSV or any other format) or anything else
2. Process: process the input data, validate it, and possibly convert it to your required objects
3. Save: save the data into a database or any custom file format
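A minimal sketch of such a chunk-oriented step, assuming Spring Batch's Java configuration and a hypothetical Person item type (the reader, processor and writer beans are defined elsewhere):

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ImportJobConfig {

    @Bean
    public Step importStep(StepBuilderFactory stepBuilderFactory,
                           ItemReader<Person> reader,
                           ItemProcessor<Person, Person> processor,
                           ItemWriter<Person> writer) {
        return stepBuilderFactory.get("importStep")
                .<Person, Person>chunk(1000)   // read/process 1000 items, then write them in one go
                .reader(reader)                // step 1: read
                .processor(processor)          // step 2: process/validate
                .writer(writer)                // step 3: save
                .build();
    }
}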
Spring Batch is useful when you need long-running jobs with restart/resume capabilities.
It is also a lot slower than any direct DB import tool like impdp for Oracle. Spring Batch saves its state in the database, which is an overhead and takes time. You can configure Spring Batch not to save its state in the DB, but that costs you the restart/resume capabilities.
So if speed is your prime requirement, you should choose some database specific option.
But if you need to do some validation and/or processing, Spring Batch is an excellent option; you just need to configure it properly. Spring Batch also provides scalability and database independence.
I'm currently working on a simple Java application that calculates and graphs the different types of profit for a company. A company can have many branches, and each branch can have many years, and each year can have up to 12 months.
The hierarchy looks as follows:
-company
+branch
-branch
+year
-year
+month
-month
My intention was to have the data storage as simple as possible for the user. The structure I had in mind was an XML file that stored everything to do with a single company. Either as a single XML file or have multiple XML files that are linked together with unique IDs.
Both of these options would also allow the user to easily transport the data, as opposed to using a database.
The problem with a database that is stopping me right now is that the user would have to set up a database themselves, which would be very difficult if they aren't the technical type.
What do you think I should go for: an XML file, a database, or something else?
It will be more complicated to use XML; XML is more of an interchange format, not a substitute for a DB.
You can use an embeddable database such as H2 or Apache Derby / JavaDB; in that case the user won't have to set up a database. The data will only be stored locally, though, so if that is OK for your application you can consider it.
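For example, a minimal sketch with embedded H2 (the file path, table and columns are hypothetical); the database file is created automatically on first connect, so there is nothing for the user to install:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class EmbeddedH2Demo {
    public static void main(String[] args) throws SQLException {
        // Creates a local database file (e.g. companydata.mv.db) next to the application on first use
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./companydata", "sa", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE IF NOT EXISTS branch (id BIGINT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(100))");
            stmt.execute("INSERT INTO branch (name) VALUES ('Head Office')");
            try (ResultSet rs = stmt.executeQuery("SELECT id, name FROM branch")) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + ": " + rs.getString("name"));
                }
            }
        }
    }
}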
I would definitely go for the DB:
you have relational data, a thing DBs are very good at
you can query relational data much more easily than XML
the CRUD operations (create, read, update, delete) are much easier in a DB than in XML
You can avoid the need for the user to install a DB engine by embedding SQLite with your app for example.
If it's a single-user application and the amount of data is unlikely to exceed a couple of megabytes, then using an XML file for the persistent storage might well make sense in that it reduces the complexity of the package and its installation process. But you're limiting the scalability: is that wise?
I am currently working on a project that I did not write; it uses a lot of XML files where MySQL could have been used instead.
That makes me wonder whether there is really any benefit to using XML over MySQL here.
The situation is: the XML files are loaded only ONCE and used by the server for the N things it does.
The XML is only reloaded if the admin issues a command to the server to reload it.
All the XML files together add up to at most about 100 MB.
A brief overview of the trade-offs of using XML over MySQL in this setup would also be appreciated.
What should I consider in order to know when XML would be a better option than a simple InnoDB or MyISAM table?
If your data is read-only and brought into memory only at the command of the admin, then I don't think either technology has much of an advantage.
MySQL would have the advantage of SQL queries if you have to search the data. Even in that case it's the type of data that matters. If you have long reference chains/object graphs, then a relational database may be slow because of all the JOINs.
But XML has its own issues. You can easily parse it into a DOM object, but then you only have XPath to search it.
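To make that concrete, a small sketch of the DOM-plus-XPath approach with the JDK's built-in APIs (the file name and XPath expression are hypothetical):

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XmlLookupDemo {
    public static void main(String[] args) throws Exception {
        // Parse the whole file into a DOM tree (fine for small files, costly for large ones)
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("data.xml");

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Hypothetical query: all <order> elements for a given customer
        NodeList orders = (NodeList) xpath.evaluate(
                "/orders/order[@customerId='42']", doc, XPathConstants.NODESET);
        System.out.println("Matching orders: " + orders.getLength());
    }
}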
XML is one way of storing data. One advantage of XML is that it keeps the data easily readable. Use MySQL if a lot of users need access to the data at the same time; MySQL also supports transactional processing of data, whereas XML has no such features.
Just adding an in-between option: you could also use an XML database such as eXist (http://exist-db.org/index.html) or Sedna (http://modis.ispras.ru/sedna/).
XML sits on local storage and is readable only by the local server (yes, you could use memcached or replicate it via rsync, but those are workarounds).
No doubt you can serve the XML over an HTTP server, but it will be slow.
MySQL, on the other hand, supports network access and replication; it basically has no boundaries if you expand to multiple servers.
And as of 5.1, MySQL even has XML support.