I am considering using the long data type to store a timestamp generated via System.currentTimeMillis().
I want to store this value as a BIGINT in a relational database.
Is this a good idea?
My main goal is to keep the application DBMS-independent, so I can move to MySQL or H2 without changing my code, and secondly to make it timezone-independent.
If this is a bad idea, what is a better way?
This is a fine idea and has been used successfully on many projects.
One issue to think about is whether you want to generate the timestamp on the app server or on the database server. Your various app servers will have slightly different ideas about what time it is; if you want time-order consistency across the whole application, you might decide to use the database as the single source of truth for your timestamps. It depends on how your timestamps are used and how tolerant of time-order noise your application is.
Of course, it seems like each database vendor has a different function to get the equivalent of System.currentTimeMillis(), which clearly violates one of your stated goals.
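To make the approach concrete, here is a minimal JDBC sketch of writing and reading such a value; the events table and its column names are invented for illustration:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class MillisTimestamp {
        // Hypothetical table: CREATE TABLE events (id INT, created_millis BIGINT)
        static void insert(Connection conn, int id) throws Exception {
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO events (id, created_millis) VALUES (?, ?)")) {
                ps.setInt(1, id);
                ps.setLong(2, System.currentTimeMillis()); // epoch millis: timezone-independent
                ps.executeUpdate();
            }
        }

        static java.util.Date createdAt(Connection conn, int id) throws Exception {
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT created_millis FROM events WHERE id = ?")) {
                ps.setInt(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    rs.next();
                    return new java.util.Date(rs.getLong(1)); // apply a time zone only for display
                }
            }
        }
    }

Because the column is just a long to the driver, the same code runs unmodified against MySQL, H2 or any other JDBC-accessible database.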
I would not store the UTC milliseconds as a BIGINT. You will not be able to use the column in any meaningful way in SQL expressions. A TIMESTAMP column is portable if you import/export in ISO format.
We have an application that runs on any of IBM Informix, MySQL and Oracle, and we are using Java with Hibernate to connect to the database. We will store XML, CSV and other text-based files inside the database (in a CLOB column). The entities in Java are byte[] objects.
One feature request for the application is to "grep" content inside the data, so I need to find all files containing specific content.
On regular char/varchar fields I can use LIKE '%xyz%', but this does not work on byte[] / BLOBs.
The first approach was to load each entity, convert the byte[] into a String, and use the contains method in Java. If the user enters any filter parameters on other (non-CLOB) columns, I apply those filters first in order to reduce the number of CLOBs I have to scan.
That worked quite well for 100 files (CLOBs), as long as the application and database are on the same server. But I think it will get really slow once I have 1,000,000 files in the database and the database is not always on the same network. So I think that is not a good idea.
My next thought was creating a database procedure, but I am not quite sure whether that is even possible across Informix, MySQL and Oracle.
The last, but not favored, method is to store the content of the data outside of a CLOB. Maybe I can use a different datatype for that?
Does anyone have a good idea how to accomplish this? I need a solution for all three DBMSs. The application knows what kind of DBMS it is connected to, so it would be okay to have three different solutions (one for each DBMS).
I am completely open to changing what kind of datatype I use (BLOB, CLOB ...) — I can modify that as I want.
Note: the CLOBs will range from about 5 KiB to about 500 KiB, with a maximum of 1 MiB.
Look into Apache Lucene or another text-indexing library.
https://en.wikipedia.org/wiki/Lucene
http://en.wikipedia.org/wiki/Full_text_search
If you go with a DB-specific solution like Oracle Text, you will have to implement a custom solution for each database. I know from experience that Oracle Text takes significant time to learn and involves a lot of tweaking to get just right.
Also, if you use a DB solution you would receive different results in each DB even if the data sets were the same (each DB has its own methods of indexing and retrieving the data).
By going with a third-party solution like Lucene, you only have to learn one solution, and results will be consistent regardless of the DB.
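To make that concrete, here is a rough sketch of indexing the file contents with Lucene (assuming a reasonably recent Lucene version, and that the text has already been read out of the CLOB; the field names are made up):

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class ClobIndexer {
        static void indexFile(IndexWriter writer, String primaryKey, String fileText) throws Exception {
            Document doc = new Document();
            doc.add(new StringField("pk", primaryKey, Field.Store.YES)); // stored, so a hit leads back to the DB row
            doc.add(new TextField("content", fileText, Field.Store.NO)); // tokenized and searchable, not stored
            writer.addDocument(doc);
        }

        public static void main(String[] args) throws Exception {
            try (FSDirectory dir = FSDirectory.open(Paths.get("clob-index"));
                 IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                // In the real application, loop over the rows fetched via Hibernate/JDBC here.
                indexFile(writer, "42", "...file text...");
            }
        }
    }

A search over the content field then returns the stored primary keys, which you use to fetch the matching rows from whichever DBMS you are connected to.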
Extending this thread - I would just like to know why it's faster to retrieve files from a file system, rather than a MySQL database. If one were to benchmark the two to see which would retrieve the most data (multiple types of data) over 10 minutes - which one would win?
If a file system is truly faster, then why not just store everything in a file system and replace the database with CSV or XML files?
EDIT 1:
I found a good resource for alternate storage options for Java.
EDIT 2:
I'm looking for a Java API/Jar that has the functionality of a SQL Database Server Engine (or at least some of it) that uses XML for data storage (preferably). If you know of something, please leave a comment below.
At the end of the day the database does just store the data in the file system. It's all the useful stuff on top of just the raw data that makes you decide to use a database.
If you can replicate the functionality, scalability, robustness, integrity, etc., etc. of a database system using CSV and still make it perform faster than a relational database, then yes, I'd suggest doing it your way.
It'd take you a few years to get there though.
Of course, relational systems are not the only way to store data. There are object-oriented database systems (db4o, InterSystems Cache) and document-based systems (RavenDB).
Performance is also relative to the style and volume of data you are working with and what you intend to do with it - I'm not even going to try to discuss that; it's too open-ended.
I will also not start the follow on discussion: if memory is truly faster than the file system, why not just store everything in memory? :-)
This also seems similar to another question I answered a long while ago:
Is C# really slower than say C++?
Basically stuff isn't always done just for performance.
MySQL uses the file system the same as everything else on a computer. To retrieve a single piece of data, or a table of data, there is no faster way than directly from the file system. MySQL just adds a small bit of overhead on top of that file system pull.
If you need to do some intelligent selecting, match some rows, or filter that data, MySQL is going to do that faster than most other options. The database server gives you calculation and data-manipulation power that a filesystem can't.
When you have mixed/structured data, a DBMS is the only solution. For example, try to get the name, surname and country of all the customers stored in your DB, but only those born in 1981 and living in Rome. If you have this data in files on the filesystem, how do you easily get only the required data without scanning all your files, and how do you join the returned data?
A DBMS gives you much more than that.
Many DBMSs store their data in files.
This abstraction layer lets you retrieve data in an easy, standard and structured way.
The difference is in how the desired data is located.
In a file system, locating the desired data means searching through all existing data until you find it.
Databases provide indexing, which locates the desired data almost immediately regardless of the amount of data: lookup cost grows only logarithmically, so even a million rows can be narrowed down in roughly 20 comparisons.
What we want is an indexed file system - lucky for us, we have them. They are called databases.
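A toy Java illustration of the difference, using a sorted array as a stand-in for a database index:

    import java.util.Arrays;

    public class IndexLookupDemo {
        public static void main(String[] args) {
            int n = 1_000_000;
            long[] keys = new long[n];
            for (int i = 0; i < n; i++) keys[i] = 2L * i; // a sorted "index" of a million keys

            // Without an index: scan every entry until found -- up to n comparisons.
            // With an index: binary search -- about log2(n), roughly 20 comparisons here.
            int pos = Arrays.binarySearch(keys, 999_998L);
            System.out.println("found at position " + pos);
        }
    }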
I have a database with a lot of web pages stored.
I will need to process all the data I have, so I have two options: pull the data into the program, or process it directly in the database with some functions I will create.
What I want to know is:
is doing some processing in the database, rather than in the application, a good idea?
when is this recommended, and when not?
are there pros and cons?
is it possible to extend the language with new features (external APIs/libraries)?
I tried retrieving the content into the application (it worked), but it was too slow and dirty. My concern is that I can't do in the database what I can do in Java, but I don't know if this is true.
ONLY an example: I have a table called Token. At the moment it has 180,000 rows, but this will increase to over 10 million rows. I need to do some processing to know whether a word between two tokens classified as 'Proper Name' is part of a name or not.
I will need to process all the data. In this case, is doing it directly in the database better than retrieving it into the application?
"My concern is that I can't do in the database what I can do in Java, but I don't know if this is true."
No, that is not a correct assumption. There are valid circumstances for processing data in the database. For example, if the processing involves a lot of disparate SQL statements that can be combined into a stored procedure, then you should do the processing in the stored procedure and call the stored proc from your Java application. This way you avoid making several network round trips to the database server.
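As a sketch, calling such a stored procedure from JDBC looks roughly like this; the procedure name and parameters are hypothetical:

    import java.sql.CallableStatement;
    import java.sql.Connection;
    import java.sql.Types;

    public class StoredProcCall {
        // Hypothetical procedure: PROCESS_TOKENS(IN batch_size INT, OUT rows_processed INT)
        static int processTokens(Connection conn, int batchSize) throws Exception {
            try (CallableStatement cs = conn.prepareCall("{call process_tokens(?, ?)}")) {
                cs.setInt(1, batchSize);
                cs.registerOutParameter(2, Types.INTEGER);
                cs.execute();
                return cs.getInt(2); // one round trip instead of many per-row queries
            }
        }
    }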
I do not know what you are processing, though. Are you parsing XML data stored in your database? Then perhaps you should use XQuery; many modern databases support it.
"ONLY an example: I have a table called Token. At the moment it has 180,000 rows, but this will increase to over 10 million rows. I need to do some processing to know whether a word between two tokens classified as 'Proper Name' is part of a name or not."
Is there some indicator in the data that tells you it's a proper name? Fetching 10 million rows (highly susceptible to an OutOfMemoryError) and then going through them is not a good idea. If there are parameters about the data that can be put into a WHERE clause to limit the amount of data being fetched, that is the way to go in my opinion. You will surely need to run EXPLAIN on your SQL and check that the correct indices are in place, the index clustering ratio, the type of index; all of that will make a difference. Now, if you can't fully eliminate all "improper names" in SQL, you should try to get rid of as many as you can and then process the rest in your application. I am assuming this is a batch application, right? If it is a web application, then you definitely want to create a batch application to stage the data before web applications query it.
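Here is a sketch of that filter-then-process approach, with the WHERE clause doing the heavy lifting and a modest fetch size so the driver fetches rows in chunks rather than materializing all 10 million (table and column names are invented):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class TokenScan {
        // Hypothetical schema: TOKEN(id BIGINT, word VARCHAR, token_class VARCHAR, position INT)
        static void scan(Connection conn) throws Exception {
            String sql = "SELECT id, word FROM token WHERE token_class = ? ORDER BY position";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "Proper Name"); // let the database filter first
                ps.setFetchSize(1000);          // fetch in chunks, not all rows at once
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        classify(rs.getLong("id"), rs.getString("word")); // the rest stays in Java
                    }
                }
            }
        }

        static void classify(long id, String word) { /* application-side logic */ }
    }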
I hope my explanation makes sense. Please let me know if you have questions.
Directly interacting with the DB for every single operation is tedious and hurts performance. There are several ways around this: you can use indexing, caching, or tools such as Hibernate, which can keep data in memory so that you don't need to query the DB for every operation. Tools such as a Lucene indexer are very popular and could solve your problem of hitting the DB every time.
I am building an application at work and need some advice. I have a somewhat unique problem: I need to gather data housed in an MS SQL Server and transplant it to a MySQL server every 15 minutes.
I have done this previously in C# with a DataGrid, but now I am trying to build a Java version that I can run on an Ubuntu server, and I cannot find a similar model for Java.
Just to give a little background:
When I pull the data from the MS SQL Server, it always has 9 columns, but could have anywhere from 0 to 1000 rows.
Before blindly inserting into the MySQL server, I manipulate some of the data:
I convert a time column to CST based on a STATE column
I strip some characters to prevent SQL injection
I tried using a ResultSet, but I am having issues with the "forward-only result set" rules.
What would be the best data structure to hold that information, manipulate it, and then parse it to insert later into mySQL?
This sounds like a job for PreparedStatements!
Defined here: http://download.oracle.com/javase/6/docs/api/java/sql/PreparedStatement.html
Quick example: http://download.oracle.com/javase/tutorial/jdbc/basics/prepared.html
PreparedStatements allow you to batch up sets of data before pushing them into the target database. They also allow you to use the PreparedStatement.setString method, which handles escaping characters for you.
For the time conversion thing, I would retrieve the STATE value from the row and then retrieve the time value. Before calling PreparedStatement.setDate, convert the time to CST if necessary.
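A rough sketch of that flow, assuming an invented target table and a simple holder class for the rows pulled from MS SQL:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.Timestamp;
    import java.util.List;

    public class BatchTransfer {
        // Hypothetical target table: readings(state CHAR(2), reading_time TIMESTAMP, val DOUBLE)
        static void push(Connection mysql, List<Row> rows) throws Exception {
            String sql = "INSERT INTO readings (state, reading_time, val) VALUES (?, ?, ?)";
            try (PreparedStatement ps = mysql.prepareStatement(sql)) {
                for (Row r : rows) {                            // at most ~1000 rows per pull
                    ps.setString(1, r.state);                   // driver escapes values: no injection risk
                    ps.setTimestamp(2, toCst(r.time, r.state)); // convert before binding
                    ps.setDouble(3, r.value);
                    ps.addBatch();
                }
                ps.executeBatch();                              // one round trip for the whole batch
            }
        }

        static Timestamp toCst(Timestamp t, String state) { /* zone shift keyed on STATE */ return t; }

        static class Row { String state; Timestamp time; double value; }
    }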
I don't think you would need all the overhead that an ORM tool requires.
You could consider using an ORM technology like Hibernate. This might seem a little heavyweight at first, but it means you can maintain the various table mappings for various databases with ease as well as having the power of Java's RegEx lib for any manipulation requirements.
So you'd have a Java class that represents the source table (with its Hibernate mapping), another Java class that represents the target table, and lastly a conversion utility class that does any manipulation of that data. Hibernate takes care of the CRUD SQL for you, so there is no need to worry about database-specific SQL (as long as you get the mapping correct).
It also lessens the SQL injection problem
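For illustration, the source-side mapping might look like this with JPA/Hibernate annotations; the table and column names are invented:

    import javax.persistence.Column;
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.Table;

    @Entity
    @Table(name = "source_readings") // hypothetical MS SQL source table
    public class SourceReading {
        @Id
        @Column(name = "id")
        private long id;

        @Column(name = "state")
        private String state;

        @Column(name = "reading_time")
        private java.sql.Timestamp readingTime;

        // Getters/setters omitted. A small converter class would copy these
        // fields into the target entity, shifting readingTime to CST as needed.
    }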
I've been writing a Java app on my machine and it works perfectly using the DB I set up, but when I install it on site it blows up because the DB is slightly different.
So I'm in the process of writing some code to verify that:
A: I've got the DB details correct
B: The database has all the tables I expect, and they have the right columns.
I've got A down, but I've got no idea where to start with B. Any suggestions?
The target DB for the current client is Oracle, but the app can be configured to run on SQL Server as well. So a generic solution would be appreciated, but is not necessary, as I'm sure I can figure out how to do one from the other.
You'll want to query the database's system catalog (information_schema or its equivalent); here are some examples for Oracle. Every platform I am aware of has something similar.
http://www.alberton.info/oracle_meta_info.html
You might be able to use a database migration tool like LiquiBase for this -- most of these tools have some way of checking the database. I don't have first-hand experience using it, so that's a guess.
I use DbUnit to test databases. It is a Java-based solution that integrates well with JUnit. It is possible to use it with almost no Java. I haven't used it in exactly the situation you described, but it should be close enough to work.
The most generic solution would be to execute queries, inside a try-catch block, with a SELECT clause naming the expected columns and a FROM clause naming the table. You can add a WHERE clause of 1=2 so as not to fetch any data. If the query executes without throwing an exception, then you have the expected table and columns.
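A sketch of that check in plain JDBC; the table and column names passed in are whatever your application expects to find:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class SchemaCheck {
        // True if the table exists and has all the named columns.
        static boolean hasColumns(Connection conn, String table, String... columns) {
            StringBuilder cols = new StringBuilder();
            for (String c : columns) {
                if (cols.length() > 0) cols.append(", ");
                cols.append(c);
            }
            String sql = "SELECT " + cols + " FROM " + table + " WHERE 1 = 2"; // returns no rows, only validates names
            try (Statement st = conn.createStatement()) {
                st.executeQuery(sql);
                return true;
            } catch (SQLException e) {
                return false; // missing table or column
            }
        }
    }

For example, hasColumns(conn, "CUSTOMERS", "ID", "NAME") validates the names without ever touching the data.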
The "slightly different" piece might be better handled by scripting the creation of the database in the first place. An automated process gives you a better chance of making the two identical.
Another point worth making is that you minimize your risk by making your dev and prod environments identical - same database schema and vendor for both. Change the circumstances that make the two different.
Lastly, you don't say what is "slightly" different, but sometimes these are unavoidable (e.g. Oracle uses sequences, SQL Server uses identities). Maybe Hibernate can help you to switch between vendors more reliably. It abstracts details in such a way that changing databases can mean modifying a single value in a configuration file.
What you need are basically unit tests for your database: "a column named FOOBAR must exist, its type must be Integer, no foreign keys may exist", etc.
This is doable with plain JUnit and JDBC (ask the table for its metadata). You may want that route if you need to be absolutely certain of what is being checked, which can be harder when using e.g. DbUnit.
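A sketch of such a test with JUnit and JDBC metadata; the connection details and table/column names are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;

    import org.junit.Assert;
    import org.junit.Test;

    public class SchemaTest {
        @Test
        public void fooBarColumnExists() throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:oracle:thin:@//host:1521/db", "user", "pw");
                 ResultSet rs = conn.getMetaData().getColumns(null, null, "MYTABLE", "FOOBAR")) {
                Assert.assertTrue("column FOOBAR is missing", rs.next());
                // DATA_TYPE holds the java.sql.Types code; the exact code is vendor-specific,
                // so assert against what your target database actually reports.
                System.out.println("FOOBAR type code: " + rs.getInt("DATA_TYPE"));
            }
        }
    }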
You can check for the presence of tables, columns, views, etc. using these data dictionary views in Oracle:
USER_TABLES
USER_VIEWS
USER_PROCEDURES
(or for everything)
USER_OBJECTS WHERE OBJECT_TYPE = '??'
To keep going... USER_TAB_COLS for table columns
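For example, a small Java helper that reads USER_TAB_COLS (the table name argument is whatever you expect to find):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class OracleDictionary {
        // Prints each column of one table straight from Oracle's data dictionary.
        static void printColumns(Connection conn, String table) throws Exception {
            String sql = "SELECT column_name, data_type FROM user_tab_cols WHERE table_name = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, table.toUpperCase()); // the dictionary stores names in upper case
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + " : " + rs.getString(2));
                    }
                }
            }
        }
    }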
Regards
K
I use MigrateDB for this. It lets you build queries that do things like check for the existence of given tables, columns, rows, indexes, etc. in a given database, and use those as "tests." If a test fails, it triggers an "action" (which is just another query that knows how to remedy the problem).
MigrateDB supports multiple database platforms (you can specify the "check for table existence query" for each platform, for example), completely configurable tests (you can make your own up), comes with fairly complete Oracle tests, and can be run in "audit only" mode so that it only tells you what the differences are.
It's a nice, robust solution.
If you're using plain JDBC, you should try this method: DatabaseMetaData.getTables, along with the other similar methods available in the DatabaseMetaData class.
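A minimal example of that approach, which works unchanged on Oracle and SQL Server:

    import java.sql.Connection;
    import java.sql.DatabaseMetaData;
    import java.sql.ResultSet;

    public class TableLister {
        // Prints every table visible to the current connection.
        static void listTables(Connection conn) throws Exception {
            DatabaseMetaData md = conn.getMetaData();
            try (ResultSet rs = md.getTables(null, null, "%", new String[] {"TABLE"})) {
                while (rs.next()) {
                    System.out.println(rs.getString("TABLE_NAME"));
                }
            }
        }
    }

DatabaseMetaData.getColumns works the same way when you need to verify individual columns.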