I need to develop an application that can get data from multiple data sources (Oracle, Excel, Microsoft SQL Server, and so on) using one SQL query. For example:
SELECT o.employeeId, count(o.orderId)
FROM employees#excel e, customers#microsoftsql c, orders#oracle o
WHERE o.employeeId = e.employeeId and o.customerId = c.customerId
GROUP BY o.employeeId;
This SQL and the data sources must be changeable dynamically by the Java program. My customers want to write and run SQL-like queries against different databases and storage at the same time, with GROUP BY, HAVING, COUNT, SUM and so on, in the web interface of my application. Other requirements are performance and a light-weight solution.
I found these ways to do it (and the drawbacks I see; please correct me if I am wrong):
Apache Spark (drawbacks: heavy solution, better suited for Big Data; slow if you need up-to-date information without caching it in Spark),
Distributed queries in the database server (database links in Oracle, linked servers in Microsoft SQL Server, Power Query for Excel) - drawbacks: it is hard to change data sources dynamically from a Java program, and working with Excel is problematic,
Prestodb (drawbacks: heavy solution, better suited for Big Data),
Apache Drill (drawbacks: quite a young solution, some problems with non-latest ODBC drivers and some bugs in operation),
Apache Calcite (a light framework used by Apache Drill; drawbacks: still quite a young solution),
Doing the joins across data sources manually (drawbacks: a lot of work to develop correct joins, a "group by" over result sets, finding the best execution plan, and so on).
Maybe you know another way (using free open-source solutions), or can give me advice from your experience about the ways above? Any help would be greatly appreciated.
UnityJDBC is a commercial JDBC driver that wraps multiple data sources and allows you to treat them as if they were all part of the same database. It works as follows:
You define a "schema file" to describe each of your databases. The schema file resembles something like:
...
<TABLE>
<semanticTableName>Database1.MY_TABLE</semanticTableName>
<tableName>MY_TABLE</tableName>
<numTuples>2000</numTuples>
<FIELD>
<semanticFieldName>MY_TABLE.MY_ID</semanticFieldName>
<fieldName>MY_ID</fieldName>
<dataType>3</dataType>
<dataTypeName>DECIMAL</dataTypeName>
...
You also have a central "sources file" that references all of your schema files and gives connection information, and it looks like this:
<SOURCES>
<DATABASE>
<URL>jdbc:oracle:thin:@localhost:1521:xe</URL>
<USER>scott</USER>
<PASSWORD>tiger</PASSWORD>
<DRIVER>oracle.jdbc.driver.OracleDriver</DRIVER>
<SCHEMA>MyOracleSchema.xml</SCHEMA>
</DATABASE>
<DATABASE>
<URL>jdbc:sqlserver://localhost:1433</URL>
<USER>sa</USER>
<PASSWORD>Password123</PASSWORD>
<DRIVER>com.microsoft.sqlserver.jdbc.SQLServerDriver</DRIVER>
<SCHEMA>MySQLServerSchema.xml</SCHEMA>
</DATABASE>
</SOURCES>
You can then use unity.jdbc.UnityDriver to allow your Java code to run SQL that joins across databases, like so:
String sql = "SELECT *\n" +
             "FROM MyOracleDB.Whatever, MySQLServerDB.Something\n" +
             "WHERE MyOracleDB.Whatever.whatever_id = MySQLServerDB.Something.whatever_id";
stmt.execute(sql);
So it looks like UnityJDBC provides the functionality you need. However, I have to say that any solution that allows users to execute arbitrary SQL joining tables across different databases sounds like a recipe for bringing your databases to their knees. The solution I would actually recommend for your type of requirements is to run ETL processes from all of your data sources into a single data warehouse and let your users query that; how to define those processes and your data warehouse is definitely too broad for a Stack Overflow question.
One appropriate solution is the DataNucleus platform, which has JDO, JPA and REST APIs. It supports almost every RDBMS (PostgreSQL, MySQL, SQL Server, Oracle, DB2, etc.), NoSQL datastores (map-based, graph-based, document-based, etc.), database web services, LDAP, and documents like XLS, ODF and XML.
Alternatively you can use EclipseLink, which also has support for RDBMS, NoSQL, database web services and XML.
By using JDOQL, which is part of the JDO API, the requirement of having one query access multiple datastores will be met. Both solutions are open-source, relatively lightweight and performant.
Why did I suggest this solution?
From your requirements it is understood that the datastore will be your customer's choice, and that you are not looking for a Big Data solution.
You prefer open-source solutions that are lightweight and performant.
Considering your use case, you might require a data-management platform with polyglot persistence behaviour, i.e. the ability to leverage multiple datastores based on your/your customer's use cases.
To read more about polyglot persistence
https://dzone.com/articles/polyglot-persistence-future
https://www.mapr.com/products/polyglot-persistence
SQL dialects are tied to the database management system: SQL Server requires different SQL statements than an Oracle server.
My suggestion is to use JPA. It is completely independent of your database management system and makes development in Java much more efficient.
The downside is that you cannot combine several database systems with JPA out of the box (e.g. a 1:1 relation between a SQL Server and an Oracle database). You could, however, create several EntityManagerFactories (one for each database) and link them together in your code.
Pros for JPA in this scenario:
write database-management-system-independent JPQL queries
reduces the required Java code
Cons for JPA:
you cannot relate entities from different databases (like in a 1:1 relationship)
you cannot query several databases with one query (combining tables from different databases in a group by or similar)
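The "link them together in your code" part could look roughly like the following sketch. The entity queries shown in the comments, the persistence-unit names and the array-based row shape are all assumptions for illustration; the in-memory GROUP BY itself is plain Java, so it works with any JPA provider.

```java
import java.util.*;
import java.util.stream.*;

// Sketch: combine result lists fetched from two separate persistence units.
// In real code the two lists would come from two EntityManagers, e.g.:
//   List<Object[]> employees = oracleEm.createQuery("SELECT e.id, e.name FROM Employee e").getResultList();
//   List<long[]>   orders    = sqlServerEm.createQuery(...).getResultList();
// (persistence-unit and entity names are hypothetical).
public class CrossDbCombine {

    // Count orders per employee: the GROUP BY that no single database can do here,
    // applied in memory to rows of the form {employeeId, orderId}.
    public static Map<Long, Long> countOrdersPerEmployee(List<long[]> orders) {
        return orders.stream()
                     .collect(Collectors.groupingBy(o -> o[0], Collectors.counting()));
    }

    public static void main(String[] args) {
        List<long[]> orders = List.of(new long[]{1, 10}, new long[]{1, 11}, new long[]{2, 12});
        System.out.println(countOrdersPerEmployee(orders)); // {1=2, 2=1}
    }
}
```

Note that this trades the database's optimizer for a full fetch of both result sets, so it only makes sense for result sizes that fit comfortably in memory.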
More information:
Wikipedia
I would recommend Presto and Calcite.
Performance and lightweight don't always go hand in hand.
Presto: quite a lot of proven usage, and, as you said, "Big Data". It performs well and scales well. I don't know exactly what lightweight means here; if requiring fewer machines is part of it, you can definitely scale down according to your needs.
Calcite: embedded in a lot of data-analytics libraries like Drill, Kylin and Phoenix. It does what you need ("connecting to multiple DBs") and, most importantly, it is lightweight.
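For reference, Calcite can federate several JDBC sources through a single model file; a minimal sketch might look like the following (the connection details are placeholders, not a tested configuration):

```json
{
  "version": "1.0",
  "defaultSchema": "ORA",
  "schemas": [
    {
      "name": "ORA",
      "type": "jdbc",
      "jdbcDriver": "oracle.jdbc.driver.OracleDriver",
      "jdbcUrl": "jdbc:oracle:thin:@localhost:1521:xe",
      "jdbcUser": "scott",
      "jdbcPassword": "tiger"
    },
    {
      "name": "MSSQL",
      "type": "jdbc",
      "jdbcDriver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
      "jdbcUrl": "jdbc:sqlserver://localhost:1433",
      "jdbcUser": "sa",
      "jdbcPassword": "Password123"
    }
  ]
}
```

You would then connect from Java with the URL `jdbc:calcite:model=/path/to/model.json` and could join `ORA.SOME_TABLE` with `MSSQL.OTHER_TABLE` in one SQL statement, which Calcite decomposes into per-source queries.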
Having experience with some of the candidates (Apache Spark, Prestodb, Apache Drill) makes me choose Prestodb. Even though it is mostly used in Big Data settings, I think it is easy to set up and it has support for (almost) everything you are asking for. There are plenty of resources available online (including running it in Docker), and it also has excellent documentation and an active community, plus support from two companies (Facebook & Netflix).
Multiple Databases on Multiple Servers from Different Vendors
The most challenging case is when the databases are on different servers and some of the servers run different database software. For example, the customers database may be hosted on machine X on Oracle, and the orders database may be hosted on machine Y with Microsoft SQL Server. Even if both databases are hosted on machine X but one is on Oracle and the other on Microsoft SQL Server, the problem is the same: somehow the information in these databases must be shared across the different platforms. Many commercial databases support this feature using some form of federation, integration components, or table linking (e.g. IBM, Oracle, Microsoft), but support in the open-source databases (HSQL, MySQL, PostgreSQL) is limited.
There are various techniques for handling this problem:
Table Linking and Federation - link tables from one source into another for querying
Custom Code - write code and multiple queries to manually combine the data
Data Warehousing/ETL - extract, transform, and load the data into another source
Mediation Software - write one query that is translated by a mediator to extract the data required
This may be a vague idea, but try Apache Solr: take your different data sources and import the data into Apache Solr. Once the data is available, you can write different queries against its index.
It is an open-source search platform that makes sure your searches are fast.
That's what the Hibernate framework is for: Hibernate has its own query language, HQL, which is mostly identical to SQL. Hibernate acts as middleware, converting HQL queries into database-specific queries.
I am working on a Spring-MVC application in which we are seeing that the database is growing big. The space is consumed by chat messages history mostly, and other stuff like old notifications, which are not that useful.
Because of this, we thought of moving that data out to text/XML files, to give the DB some room to breathe and thereby increase query performance. Indexes are not that useful, as there are too many insertions.
I wanted to know if PostgreSQL or Hibernate has any support for such a task, where data is picked out of the DB and saved in plain files which can still be accessed, resulting in at least decent performance gains.
I have only started looking up some stuff, so I don't have much in hand to show. Kindly let me know if there are any questions you guys have.
Thanks a lot.
I would use the PostgreSQL JSON storage and have two databases:
the current operations DB, the one where you are moving data away to slim it
the archive database where old data is aggregated to save storage
This way you can move data from the current database into the archive database without compromising ACID attributes and you can aggregate the old data to simplify retrieval, by grouping various related entities based on some common root entity, which you'll then use to access your old data.
This way the current operations database remains small enough, while the archive database can be shared. It is then easier to configure the current operations database for high performance and the archive one for scalability.
Anyway, Hibernate doesn't support this out of the box, but you can implement it using custom Hibernate types and JTA transactions.
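The move-to-archive step itself can be sketched with plain JDBC as below. The table and column names (chat_message, created_at, etc.) are hypothetical, and without JTA the two commits at the end are not atomic, which is exactly why the answer suggests JTA for production use.

```java
import java.sql.*;
import java.time.LocalDate;

// Sketch: copy chat messages older than a cutoff from the live DB to the
// archive DB, then delete them from the live DB. Schema names are assumptions.
public class ChatArchiver {

    // Pure helper: everything created before this date gets archived.
    public static LocalDate archiveCutoff(LocalDate today, int retentionDays) {
        return today.minusDays(retentionDays);
    }

    public static void archiveOldMessages(Connection live, Connection archive, LocalDate cutoff)
            throws SQLException {
        live.setAutoCommit(false);
        archive.setAutoCommit(false);
        try (PreparedStatement select = live.prepareStatement(
                 "SELECT id, sender, body, created_at FROM chat_message WHERE created_at < ?");
             PreparedStatement insert = archive.prepareStatement(
                 "INSERT INTO chat_message(id, sender, body, created_at) VALUES (?, ?, ?, ?)");
             PreparedStatement delete = live.prepareStatement(
                 "DELETE FROM chat_message WHERE id = ?")) {
            select.setDate(1, Date.valueOf(cutoff));
            try (ResultSet rs = select.executeQuery()) {
                while (rs.next()) {
                    insert.setLong(1, rs.getLong("id"));
                    insert.setString(2, rs.getString("sender"));
                    insert.setString(3, rs.getString("body"));
                    insert.setDate(4, rs.getDate("created_at"));
                    insert.executeUpdate();
                    delete.setLong(1, rs.getLong("id"));
                    delete.executeUpdate();
                }
            }
            archive.commit();
            live.commit();  // without JTA, this two-step commit is not atomic
        }
    }

    public static void main(String[] args) {
        // Demo of the pure part only; archiveOldMessages needs two real Connections.
        System.out.println(archiveCutoff(LocalDate.of(2015, 6, 1), 90)); // 2015-03-03
    }
}
```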
I have one simple table in one Oracle database that needs to be joined with a group of tables in another Oracle database. They reside on the same server (different ports). I am using JDBC and want to keep it simple. I could connect to both DBs and join the result sets in Java. But I am wondering if there is a better/easier way.
I cannot easily use new tools or frameworks since I work in a rigid corporate environment, so I want to know if it can be accomplished with just JDBC.
There is no way to do it in pure JDBC as far as I am aware, but you could use the Oracle database link facility. It makes the tables from one database available in another, allowing you to carry out joins etc. as if they were in the same database. JDBC works nicely with tables accessed through these links.
Setting them up is an administrative function, so you'll need some DBA involvement.
http://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_5005.htm
Other than that, if the "one" table isn't too big, you may have to read it into a Map and then perform the join parts of the query in your code (which is not ideal).
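The read-into-a-Map approach could be sketched like this. The row shapes and names are made up for illustration, and the JDBC fetch is reduced to plain collections so that the join logic itself is visible:

```java
import java.util.*;

// Sketch: load the small "one" table into a Map keyed by its id, then resolve
// each row of the larger result set against that Map in memory.
public class ClientSideJoin {

    // Join account lookups {accountId -> accountName} with order rows {accountId, amount}.
    public static List<String> join(Map<Long, String> accounts, List<long[]> orders) {
        List<String> out = new ArrayList<>();
        for (long[] o : orders) {
            String name = accounts.get(o[0]);
            if (name != null) {               // inner-join semantics: skip unmatched rows
                out.add(name + ":" + o[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // In real code, 'accounts' would be filled from a SELECT over the small table.
        Map<Long, String> accounts = Map.of(1L, "ACME", 2L, "Globex");
        List<long[]> orders = List.of(new long[]{1, 100}, new long[]{3, 50});
        System.out.println(join(accounts, orders)); // [ACME:100]
    }
}
```

The Map lookup keeps the join at O(n) over the large result set, but the small table must fit in memory, which is the "not ideal" caveat above.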
I see two options.
Get the data from the two data sources manually and then combine it (which you have already thought about).
Create a DB link from one database to the other, create a synonym in your schema for the remote object, and directly execute the SQL statement with joins from Java.
I want to, or need to (without the use of other databases), set up entities (database tables) in memory that have relationships, like one-to-many, many-to-many, etc.
I saw something related here on this forum:
Map SQL (not JPQL) to a collection of simple Java objects?
I need to query these entities that have relationships and get the result sets from this, in order to push the resulting data into an Access database. I am using Jackcess, which is not a JDBC driver.
So far I have looked at MetaModel and jOOQ.
Is there anything else out there? I have a little bit of exposure to ORMs; do they query the in-memory collections, or do they just pass the SQL query to the database?
Any help or suggestions are greatly appreciated.
Apparently, you're looking for something like .NET's LINQ-to-Objects in the Java ecosystem. There's nothing as sophisticated as LINQ-to-Objects, but there are a couple of ways to "query" collections in Java as well. You might be interested in any of these libraries:
Quaere: http://quaere.codehaus.org
Coollection: https://github.com/wagnerandrade/coollection
Lambdaj: https://code.google.com/p/lambdaj
JXPath: http://commons.apache.org/proper/commons-jxpath
JoSQL: http://josql.sourceforge.net
All of the above projects are open source and may not be actively maintained anymore, as Java 8 introduces a much better collections API along with language-level lambda expressions, which renders these non-SQL-focused LINQesque Java APIs obsolete.
Note, you were asking specifically about MetaModel and jOOQ. These provide you with a querying API for querying databases. I think that will not help you much for your use-cases.
Hibernate will query the object cache, but only if you query using Criteria or HQL. If you query straight SQL, it'll get run directly against the database.
Your problem description sounds like it's more than Jackcess can handle natively, but what if at program startup you read the full Access DB into an in-memory database (one that has a JDBC driver), run Hibernate queries against that in-memory database, and then at program exit just flush all Hibernate changes to the in-memory database and then write the in-memory database's contents into the Access database? You get all the complicated querying capability of Hibernate, and all you have to do is write Jackcess-to-JDBC code to load the Access DB into an equivalent schema in the in-memory database and then the inverse code to copy it back, which is way easier than writing the full JDBC driver for Jackcess.
We are designing a fairly large brownfield application and have run into a bit of an issue.
We have a fairly large amount of information in a DB2 database from a legacy application that is still loading data. We also have information in an Oracle database that we control.
We have to do a 'JOIN' type of operation on the tables. Right now, I was thinking of pulling the information out of the DB2 table into a List<> and then using it in a SQL statement against the Oracle database, such as:
select * from accounts where accountnum in (...)
Is there any easier way to interact between the databases, or at least, what is the best practice for this sort of action?
I've done this two ways.
With two Sybase databases on different boxes, I set up stored procedures and called them like functions to send data back and forth. This additionally allowed the sprocs to audit/log, to convince the customer that no data was being lost in the process.
On a one-way Oracle-to-Sybase setup, I used a view to marshal the data, with each vendor's C libraries called from a C++ program that gave the C APIs a common interface.
On a MySQL and DB2 setup, where, like your situation, the DB2 was "legacy but live", I employed a setup similar to what you're describing: pulling the data out into a (Java) client program.
If the join is always one-to-one, and each box's resultset has the same key, you can pull them both with the same ordering and trivially connect them in the client. Even if they're one-to-many, stitching them together is just a one-way iteration of both of your lists.
If it gets to be many-to-many, then I might fall back to processing one item at a time (though you could use HashSet lookup).
Basically, though, your choices are sprocs (for which you'd need to add a client layer) or just doing it all in the client.
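The one-way iteration over two key-sorted result lists mentioned above can be sketched as a classic merge join; the array-based record shapes here are invented for illustration:

```java
import java.util.*;

// Sketch: stitch two lists, each sorted ascending by the same key, in a single pass.
// left holds {key, leftValue}, right holds {key, rightValue}; emits matched triples.
public class MergeStitch {

    public static List<long[]> merge(List<long[]> left, List<long[]> right) {
        List<long[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            long lk = left.get(i)[0], rk = right.get(j)[0];
            if (lk == rk) {
                out.add(new long[]{lk, left.get(i)[1], right.get(j)[1]});
                i++; j++;              // one-to-one: advance both sides
            } else if (lk < rk) {
                i++;                   // unmatched left row
            } else {
                j++;                   // unmatched right row
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<long[]> db2Rows    = List.of(new long[]{1, 10}, new long[]{2, 20}, new long[]{4, 40});
        List<long[]> oracleRows = List.of(new long[]{2, 200}, new long[]{3, 300}, new long[]{4, 400});
        for (long[] row : merge(db2Rows, oracleRows)) {
            System.out.println(Arrays.toString(row)); // [2, 20, 200] then [4, 40, 400]
        }
    }
}
```

For the one-to-many case, the matched branch would instead advance only the "many" side until its key changes, as the answer describes.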
You can export data from DB2 in flat-file format and use this flat file as an external table, or load it with SQL*Loader; this is a batch process.
There is also something called heterogeneous connectivity, where you create a database link from Oracle to DB2. This makes it possible to query your DB2 database in real time, and you can join an Oracle table with a DB2 table.
You can also use this database link in combination with materialized views.
There are different kinds of heterogeneous connectivity so read the documentation carefully.
Does it have to be real-time data? If so, there are products available for heterogeneous connectivity, especially DB2 Relational Connect, which is part of the federated server. If some lag is acceptable, you can set up scripts to replicate the data to Oracle, after which you can do a native join.
You will get poor performance if you pull the data into the client application. If this is the only option, try to create a DB2 stored procedure to return the data, which will make the performance slightly better.
If it is possible to copy the data from the legacy database to the database you control, you can consider a data-extraction job that copies the new records from the legacy DB to the Oracle DB once per day (or as often as possible). It might not be so simple if you can't identify which records have been produced in the legacy database since the last load.
Then, you can do the joins in your Oracle instance.
If you ask the vendors, probably the best practice would be to buy another product.
From the IBM side, there is IBM Federation Server, which can "Combine data from disparate sources such as DB2, Oracle, and SQL Server into a single virtual view." I imagine there is also one from Oracle but I'm less familiar with their products.
Oracle Transparent Gateway for DRDA http://www.oracle.com/technetwork/database/gateways/index.html
IBM Infosphere Federation Server
http://www-03.ibm.com/software/products/en/ibminfofedeserv/
Note if you have DB2 Advanced Enterprise Server Edition (AESE), Infosphere Federation Server is included.
Both products would allow you to use a single join query sent to one DB that returns data from both DBs. The Oracle product is really nice in that it allows Oracle to see the DB2 database as another Oracle DB and for DB2 to see the Oracle database as another DB2 database. (Thanks to IBM publishing the specs for both the client and server side of the DRDA protocol DB2 uses. Too bad no other vendor is willing to do so, though they have no trouble taking advantage of the fact IBM did so.)
Neither product is what I would call cheap.
For cheap, you could take advantage of Oracle Database Gateway for ODBC
http://docs.oracle.com/cd/E16655_01/gateways.121/e17936/toc.htm