I am building an application at work and need some advice. I have a somewhat unusual problem: I need to gather data housed in an MS SQL Server and transplant it to a MySQL server every 15 minutes.
I have done this previously in C# with a DataGrid, but now I am trying to build a Java version that I can run on an Ubuntu server, and I cannot find a similar model for Java.
Just to give a little background:
When I pull the data from the MS SQL Server, it always has 9 columns but could have anywhere from 0 to 1000 rows.
Before inserting into the MySQL server blindly, I manipulate some of the data:
I convert a time column to CST based on a STATE column.
I strip some characters to prevent SQL injection.
I tried using the ResultSet, but I am having issues with the "forward-only result set" rules.
What would be the best data structure to hold that information, manipulate it, and then use it to insert into MySQL later?
This sounds like a job for PreparedStatements!
Defined here: http://download.oracle.com/javase/6/docs/api/java/sql/PreparedStatement.html
Quick example: http://download.oracle.com/javase/tutorial/jdbc/basics/prepared.html
PreparedStatements allow you to batch up sets of data before pushing them into the target database. They also let you use the PreparedStatement.setString method, which handles escaping characters for you.
For the time conversion thing, I would retrieve the STATE value from the row and then retrieve the time value. Before calling PreparedStatement.setDate, convert the time to CST if necessary.
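To make that concrete, here is a rough sketch of the read/convert/batch loop, reading the source rows forward once (which keeps you clear of the forward-only ResultSet restriction). The table and column names (source_table, target_table, state, event_time, payload) and the convertToCst helper are invented for illustration, and I use setTimestamp since the column presumably carries both date and time:

import java.sql.*;

void copyBatch(Connection mssql, Connection mysql) throws SQLException {
    String select = "SELECT state, event_time, payload FROM source_table";
    String insert = "INSERT INTO target_table (state, event_time, payload) VALUES (?, ?, ?)";
    try (Statement read = mssql.createStatement();
         ResultSet rs = read.executeQuery(select);
         PreparedStatement write = mysql.prepareStatement(insert)) {
        while (rs.next()) {
            String state = rs.getString("state");
            Timestamp time = rs.getTimestamp("event_time");
            write.setString(1, state);                        // setString escapes the value for you
            write.setTimestamp(2, convertToCst(time, state)); // hypothetical helper with your CST rule
            write.setString(3, rs.getString("payload"));
            write.addBatch();                                 // queue the row
        }
        write.executeBatch();                                 // push the whole batch to MySQL
    }
}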
I don't think you would need all the overhead that an ORM tool requires.
You could consider using an ORM technology like Hibernate. This might seem a little heavyweight at first, but it means you can maintain the table mappings for the various databases with ease, and you still have the power of Java's regex library for any manipulation requirements.
So you'd have a Java class that represents the source table (with its Hibernate mapping) and another Java class that represents the target table and lastly a conversion utility class that does any manipulation of that data. Hibernate takes care of the CRUD SQL for you, so no need to worry about Database specific SQL (as long as you get the mapping correct).
It also lessens the SQL injection problem.
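For instance, a minimal sketch of a JPA/Hibernate-annotated class for the source table (the table and column names are invented; a classic .hbm.xml mapping works the same way):

import javax.persistence.*;

@Entity
@Table(name = "source_table")            // invented table name
public class SourceRow {
    @Id
    @Column(name = "id")
    private Long id;

    @Column(name = "state")
    private String state;

    @Column(name = "event_time")
    private java.sql.Timestamp eventTime;

    // getters and setters omitted
}

A second class mapped to the target table plus a small converter between the two is all the plumbing needed; Hibernate generates the SELECT/INSERT SQL for both databases.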
We have an application that runs with any of IBM Informix, MySQL and Oracle, and we are using Java with Hibernate to connect to the database. We will store XML, CSV and other text-based files inside the database (clob column). The entities in Java are byte[] objects.
One feature request to the application is now to "grep" content inside the data. So I need to find all files with a specific content.
On regular char/varchar fields I can use like '%xyz%', but this is not working on byte[] / blobs.
The first approach was to load each entity, cast the byte[] into a String and use the contains method in Java. If the user enters any filter parameters on other (non-clob) columns, I will apply those filters before testing the clob in order to reduce the number of blobs I have to scan.
That worked quite well for 100 files (clobs) as long as the application and database are on the same server. But I think it will get really slow if I have 1,000,000 files inside the database and the database is not always on the same network. So I think that is not a good idea.
My next thought was to create a database procedure, but I am not quite sure whether this is possible for all of Informix, MySQL and Oracle.
The last, and least favored, option is to not store the content inside a clob at all. Maybe I can use a different datatype for that?
Does anyone have a good idea how to realize this? I need a solution for all three DBMSs. The application knows what kind of DBMS it is connected to, so it would be okay to have three different solutions (one for each DBMS).
I am completely open to changing what kind of datatype I use (BLOB, CLOB ...) — I can modify that as I want.
Note: the clobs will range from about 5 KiB to about 500 KiB, with a maximum of 1 MiB.
Look into Apache Lucene or another text-indexing library.
https://en.wikipedia.org/wiki/Lucene
http://en.wikipedia.org/wiki/Full_text_search
If you go with a DB specific solution like Oracle Text Search you will have to implement a custom solution for each database. I know from experience that Oracle Text search takes significant time to learn and involves a lot of tweaking to get just right.
Also, if you use a DB solution you would receive different results in each DB even if the data sets were the same (each DB has its own methods of indexing and retrieving the data).
By going with a third-party solution like Lucene, you only have to learn one solution, and results will be consistent regardless of the DB.
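As a rough idea of what the indexing side looks like in Java (assuming a reasonably recent Lucene release; class names have moved around a little between versions, and the field names here are invented):

import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Index the text content of one clob; searches then hit the index instead of the database.
void indexClob(String rowId, String clobAsString) throws IOException {
    try (IndexWriter writer = new IndexWriter(
            FSDirectory.open(Paths.get("/path/to/index")),
            new IndexWriterConfig(new StandardAnalyzer()))) {
        Document doc = new Document();
        doc.add(new StringField("id", rowId, Field.Store.YES));          // primary key, stored for lookup
        doc.add(new TextField("content", clobAsString, Field.Store.NO)); // full text, searchable only
        writer.addDocument(doc);
    }
}

You index each clob once (and re-index on update), regardless of which DBMS it lives in.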
I am considering using the long data type to store a timestamp generated via System.currentTimeMillis().
I want to store this value as a BIGINT in a relational database.
Is this a good idea?
My main goal is to keep the application DBMS-independent, so I can move to MySQL or H2 without changing my code, and secondly to make it timezone-independent.
If this is a bad idea, what is a better way?
This is a fine idea and has been used successfully on many projects.
One issue you might think about is whether you want to generate the timestamp on the app server or on the database server. The issue is that your various app servers will have slightly different ideas about what time it is - if you want time-order consistency across the whole application, you might decide to use the database as the single source of truth on your timestamps. It depends on how your timestamps are used and how tolerant of time-order noise your application is.
Of course, it seems like each database vendor has a different function to get the equivalent of System.currentTimeMillis, which clearly violates one of your stated goals.
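If you do generate the value in Java, storing and reading it is just setLong/getLong through JDBC, which behaves identically on MySQL and H2. A minimal sketch, with an invented table and column name:

// created_at is a BIGINT column holding epoch milliseconds (UTC by definition)
void insertEvent(Connection conn, long id) throws SQLException {
    try (PreparedStatement ps = conn.prepareStatement(
            "INSERT INTO events (id, created_at) VALUES (?, ?)")) {
        ps.setLong(1, id);
        ps.setLong(2, System.currentTimeMillis());
        ps.executeUpdate();
    }
}

Reading it back is rs.getLong("created_at"); any time-zone handling or formatting then stays in Java, e.g. new java.util.Date(millis).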
I would not store UTC milliseconds as a BIGINT. You will not be able to use the column in any meaningful way in SQL expressions. A TIMESTAMP column is portable if you import/export in ISO format.
I have a problem with SQL Server performance because of a heavy calculation query,
so we decided to put Solr in between as an intermediary and index all the data either from Hibernate or directly from SQL Server.
Can anybody tell me whether this is possible?
Please also suggest a tutorial link for this.
You can use DataImportHandler to transfer data, which you can schedule using DataImportScheduler.
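Alternatively, if you would rather push documents from your Java/Hibernate layer instead of having Solr pull them with DataImportHandler, the SolrJ client can do it. A minimal sketch, assuming a recent SolrJ version and a core named mycore (the field names are invented and must exist in your Solr schema):

import java.io.IOException;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

void indexEntity(String entityId, String entityName) throws SolrServerException, IOException {
    try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", entityId);      // values copied from your Hibernate entity
        doc.addField("name", entityName);
        solr.add(doc);
        solr.commit();                     // make the document searchable
    }
}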
I had a similar problem where a SQL Server stored procedure took 12 hours to update relationships between objects (rows), so we ended up using Neo4j (an open-source graph database), which exactly matched our data model.
We needed object relationships to be reflected in Solr searches, e.g. give me all objects whose name starts with "obj" and whose parent is of type "typ".
What's the fastest way to get a large volume of data from an Oracle database into Java objects?
Are there any Oracle tricks as to the way the data should be organised?
I was thinking of using plain JDBC rather than any Hibernate-style libraries.
Would it be better to get Oracle to produce a file and then read from the file, although this would have to be done programmatically?
All thoughts appreciated.
I am not a Java or JDBC expert, but if you plan on pulling a lot of rows down from a database, you will likely benefit from increasing the row prefetch setting on the connection.
import java.sql.*;
import oracle.jdbc.OracleConnection;

Connection conn = DriverManager.getConnection("jdbc:oracle:", "user", "password"); // fill in your full Oracle JDBC URL
// Set the default row prefetch setting for this connection
((OracleConnection) conn).setDefaultRowPrefetch(100);
The Oracle JDBC driver's documented default is to prefetch only 10 rows per round trip, so you are paying for a trip to the database every few rows fetched. Setting the prefetch to a larger number fetches more rows per round trip; the speed increase can be dramatic depending on the number of rows and the performance of your network.
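If you would rather avoid the Oracle-specific cast, the standard JDBC Statement.setFetchSize hint does the same job with most drivers (the query here is just a placeholder):

// Portable equivalent: ask the driver to fetch rows in larger batches
Statement stmt = conn.createStatement();
stmt.setFetchSize(100);
ResultSet rs = stmt.executeQuery("SELECT * FROM my_table");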
Depending on how far you want to go with this, I'd imagine that dropping JDBC and writing a custom application residing on the same machine as the DB, using the Oracle Call Interface (OCI) and JNI, would be the fastest...
It's probably much simpler to just use a plain prepared statement via JDBC, and then, if that's not enough (and depending on where the bottleneck is), try a stored procedure. The caching done by ORMs like Hibernate should not be discounted though, so I guess you'd have to do some benchmarks. Also, if the bottleneck is the database and you write a stored procedure which improves the read performance, then you could still use Hibernate to marshal the data into Java objects. See Using stored procedures for querying.
Whatever you wind up doing, design for/implement "lazy initialization" [this really only applies to complex object hierarchies/networks; you said Java objects (plural), so I'm imagining something more than just a single table that maps to a single object]. Basically, you only read in the objects that are needed at that time; when you run a getter method, it makes additional DB calls for just that data.
Another trick sometimes overlooked in the Java world: if you have some complex SQL coming from the code, you can instead create a view on the Oracle side, embedding that complexity there, and then map your object to the view; if you can flatten your object to match the view, you're in business.
I have a database with a lot of web pages stored.
I will need to process all the data I have, so I have two options: retrieve the data into the program, or process it directly in the database with some functions I will create.
What I want to know is:
Is doing some processing in the database, rather than in the application, a good idea?
When is this recommended and when not?
Are there pros and cons?
Is it possible to extend the language with new features (external APIs/libraries)?
I tried retrieving the content into the application (it worked), but it was too slow and dirty. My concern is that I can't do in the database what I can do in Java, but I don't know if this is true.
Just one example: I have a table called Token. At the moment, it has 180,000 rows, but this will increase to over 10 million rows. I need to do some processing to know whether a word between two tokens classified as `Proper Name` is part of a name or not.
I will need to process all the data. In this case, is doing it directly in the database better than retrieving it into the application?
My concern is that I can't do in the database what I can do in Java, but I don't know if this is true.
No, that is not a correct assumption. There are valid circumstances for using the database to process data. For example, if the work involves a lot of disparate SQL statements that can be combined into a stored procedure, then you should do the processing in the stored procedure and call it from your Java application. This way you avoid making several network trips to the database server.
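Calling such a stored procedure from Java is then a one-liner with CallableStatement; the procedure name and parameter below are invented:

// The heavy lifting runs inside the database; only the result comes back over the network
try (CallableStatement cs = conn.prepareCall("{call classify_tokens(?)}")) {
    cs.setInt(1, batchId);   // hypothetical input parameter
    cs.execute();
}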
I do not know what you are processing, though. Are you parsing XML data stored in your database? Then perhaps you should use XQuery; many modern databases support it.
Just one example: I have a table called Token. At the moment, it has 180,000 rows, but this will increase to over 10 million rows. I need to do some processing to know whether a word between two tokens classified as `Proper Name` is part of a name or not.
Is there some indicator in the data that tells you it's a proper name? Fetching 10 million rows (highly susceptible to an OutOfMemoryError) and then going through them is not a good idea. If there are certain properties of the data that can go into a WHERE clause to limit how much is fetched, that is the way to go in my opinion. You will certainly need to run explains on your SQL, check that the correct indices are in place, and check the index clustering ratio and the type of index; all of that will make a difference. If you can't fully eliminate all the "improper names" in SQL, get rid of as many as you can there and process the rest in your application. I am assuming this is a batch application, right? If it is a web application, then you definitely want a batch application to stage the data before the web application queries it.
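If you do end up post-processing rows in Java, avoid loading all 10 million at once. Something along these lines (the table and column names are invented, and processToken stands in for your own logic) lets the database filter first and then streams the remainder in chunks:

PreparedStatement ps = conn.prepareStatement(
        "SELECT id, word FROM token WHERE token_class = ? ORDER BY id");
ps.setString(1, "Proper Name");
ps.setFetchSize(1000);   // a hint to the driver; exact streaming behaviour is driver-dependent
try (ResultSet rs = ps.executeQuery()) {
    while (rs.next()) {
        // application-side check that is hard to express in SQL
        processToken(rs.getLong("id"), rs.getString("word"));
    }
}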
I hope my explanation makes sense. Please let me know if you have questions.
Directly interacting with the DB for every single thing is tedious and hurts performance. There are several ways around this: you can use indexing, caching, or a tool such as Hibernate, which can cache data in memory so that you don't need to query the DB for every operation. There are also tools such as Lucene indexers, which are very popular and could solve your problem of hitting the DB every time.