I currently have a web application that runs with all of its data in Oracle. At a high level, the application consists of a Java applet, some Java servlets, some Ajax, and the Oracle database. I was wondering what converting the whole suite to Hadoop instead would cost in terms of work. Below are some questions that can help me get a grasp on it.
Is there any software that can take SQL database schema creation scripts and queries and convert them to appropriate calls in Hadoop?
How different are the Java APIs for communicating with Hadoop from those for Oracle SQL?
There's a bit of Ajax in there too; how different would that part be when moving from SQL to Hadoop?
Please consider me a beginner when explaining anything to do with Hadoop. I don't need to drill down into specifics (unless you want to), just a high-level discussion.
Thanks!
Hadoop is not suitable for use cases that need real-time querying and processing. Hadoop is best used for offline batch processing and data analysis. You can refer to the following link, Common Questions, to get some of your questions answered. You don't have a schema concept in HDFS, which is the filesystem in Hadoop. Data is stored in blocks on disk as regular files.
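To make the "no schema, just files" point concrete, here is a minimal sketch of reading a file straight off HDFS with the standard Hadoop FileSystem API; the namenode address and path are made up, and this only shows that HDFS hands your application raw bytes, not rows:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // hypothetical namenode address -- adjust for your cluster
            conf.set("fs.default.name", "hdfs://namenode:9000");

            FileSystem fs = FileSystem.get(conf);
            // HDFS just hands back bytes; any record structure is up to you
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/data/example.txt"))));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
            reader.close();
        }
    }

Compare that with a JDBC ResultSet from Oracle: there is no query planner or schema between you and the data, which is why a straight port of your SQL calls is not really possible.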
I would suggest you visit Apache Hadoop to learn what Hadoop is and which use cases it fits best.
If you are looking for an SQL-on-Hadoop solution that is performant, then check out InfiniDB.
http://infinidb.co
We are a 4th generation columnar MPP engine behind MySQL. We can sit on top of HDFS, GlusterFS, or your local filesystem, so we can be on Hadoop or not, your choice. We are fully open source under the GPLv2; there is no difference between the open source version and the enterprise version. Use it as you want, and scale as you need.
We operate in the interactive SQL space; many people use us for analytical queries against their data. Hadoop MapReduce is great at batch work and transformations, but it falls short on the interactive side of things, and that is where solutions like InfiniDB come in.
While you are on Oracle and using Oracle SQL, there may not be much difference between that and the MySQL syntax we support; it depends on which Oracle features you are using. Many people use us as a drop-in replacement for their existing MySQL database to get the performance of a clustered MPP database. Transitioning to Hadoop, as you mentioned, is another use case, since we can provide the SQL interface so your applications don't even realize they are working on top of a Hadoop cluster.
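Since we present a MySQL front end, plain JDBC with the stock MySQL driver should work unchanged against InfiniDB; here is a rough sketch (the host, schema, and table names are made up):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class InfiniDbQueryExample {
        public static void main(String[] args) throws Exception {
            // InfiniDB speaks the MySQL wire protocol, so the ordinary
            // MySQL JDBC driver is all the application needs
            Class.forName("com.mysql.jdbc.Driver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://dbhost:3306/analytics", "user", "password");
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery(
                    "SELECT viewer_id, COUNT(*) FROM views GROUP BY viewer_id");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
            conn.close();
        }
    }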
Feel free to contact me if you have any questions / comments.
Related
I am looking for a database which I can use to store data about certain stocks over a number of years. There will probably be a few thousand records. I am writing an application in Java and Clojure which will pull data out of this local database when required, to display it.
I was wondering if anyone knew of a good database to work with for this purpose? I only have experience with MySQL running on the server side.
Which database would be easiest to work with in Clojure and Java for local storage?
Thanks,
Adam
JDK 6 and greater come bundled with Java DB, which is good enough for your use case.
For this kind of small-scale application it will almost certainly be easiest if you pick one of the many good embedded Java databases.
My personal top choices would probably be:
H2 - probably the best-performing pure Java database overall; if you believe their benchmarks, it is considerably faster than MySQL and indeed most other databases when run in a single-machine environment.
Apache Derby - a good all-rounder, mature and well supported (Oracle has included a version branded as Java DB in recent JDKs).
After that, you should be able to use them pretty easily using the standard JDBC toolset, so not much different from MySQL.
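To make the "standard JDBC toolset" point concrete, here is a minimal embedded H2 sketch; the file and table names are made up, and Derby works the same way with its own driver class and JDBC URL:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class EmbeddedH2Example {
        public static void main(String[] args) throws Exception {
            Class.forName("org.h2.Driver");
            // "stockdb" is just an illustrative file name; H2 creates it on demand
            Connection conn = DriverManager.getConnection("jdbc:h2:./stockdb", "sa", "");
            Statement stmt = conn.createStatement();
            stmt.execute("CREATE TABLE IF NOT EXISTS quotes(" +
                    "symbol VARCHAR(10), quote_date DATE, price DECIMAL(10,2))");
            stmt.execute("INSERT INTO quotes VALUES('ACME', '2012-01-03', 12.34)");
            ResultSet rs = stmt.executeQuery(
                    "SELECT symbol, price FROM quotes WHERE quote_date = '2012-01-03'");
            while (rs.next()) {
                System.out.println(rs.getString("symbol") + " @ " + rs.getBigDecimal("price"));
            }
            conn.close();
        }
    }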
If you're after a really nice DSL for interfacing with SQL databases with Clojure, you should definitely also take a look at Korma.
I have used Apache Derby for a similar application (although written mostly in Java). The users have been running it for almost four years now and have performed more than 60,000 transactions with it, with no major problems. Only the occasional bug on my part.
Derby is the same database as Java DB; however, with Derby it's easier to keep up with releases, as you can just include it as a dependency rather than wait on the whim of when the next JDK rev comes out.
Also, IIRC, Java DB is only included with the JDK, not the JRE.
Depending on the nature of your data and application and your willingness and/or constraints in working with a new database modality, you might also want to consider one of the document-oriented databases, MongoDB or CouchDB. If your data and application are SQL oriented, use one of the databases suggested.
I'm looking for the best database software for a new open source application. The primary criterion is that it has to be lightning fast for searching among tens of thousands of entries. Ideally it would be entirely Java based, but simply having a Java API is OK. I'm looking to license under the GPL, so the project would have to be compatible with that. So far SQLite seems to be the most ubiquitous solution, but I don't want to overlook something else if it could turn out to be better.
When I search the general internet, most results seem to be for object databases. I don't care if the database is object-based or relational, and I don't think I care if it's "NoSQL". I have lots of experience with MySQL, but I'm not terribly afraid of learning a new query language or interface if it's faster that way. The main kind of data this will be managing is filenames with at least 20 metadata fields attached; I'd want to have multiple datasets with the same fields, and it would be nice to also store some application preferences in the database.
I see from some responses that there may be confusion about my (former) use of "embedded" in the title. I want to clarify that I mean "embedded in the application and redistributed" and not "in use on an embedded device." The application is currently targeting full scale computers, although one reason for "ideally it would be entirely java based" is a dreamy aspiration of creating an Android version.
Ultimately it really depends on your application. SQLite is not designed to be as robust as standard client/server databases like Oracle and MySQL. From the FAQ for SQLite, they say the following on the subject:
However, client/server database engines (such as PostgreSQL, MySQL, or Oracle) usually support a higher level of concurrency and allow multiple processes to be writing to the same database at the same time. This is possible in a client/server database because there is always a single well-controlled server process available to coordinate access. If your application has a need for a lot of concurrency, then you should consider using a client/server database. But experience suggests that most applications need much less concurrency than their designers imagine.
That being said, SQLite is very fast, but then again this depends on how you'll be using it and on what platforms. If you are running on an embedded device you may see significantly different performance than when running on a regular desktop/server, which is why it's hard to give an exact answer. SQLite does see significant performance gains from not adhering to the standard client/server model.
Your best bet is to pick a few, like SQLite, PostgreSQL, and MySQL, and see the performance implications of each by running some tests which simulate common scenarios you will encounter in your application.
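For instance, a rough timing harness over the Xerial SQLite JDBC driver might look like the sketch below; all names are illustrative, and the table loosely mirrors your filename-plus-metadata use case:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class SqliteTimingSketch {
        public static void main(String[] args) throws Exception {
            Class.forName("org.sqlite.JDBC");
            Connection conn = DriverManager.getConnection("jdbc:sqlite:files.db");
            conn.createStatement().execute(
                    "CREATE TABLE IF NOT EXISTS files(name TEXT, size INTEGER)");

            // batch the inserts inside one transaction -- SQLite is dramatically
            // slower if every insert commits on its own
            conn.setAutoCommit(false);
            PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO files(name, size) VALUES(?, ?)");
            long start = System.nanoTime();
            for (int i = 0; i < 50000; i++) {
                ps.setString(1, "file" + i + ".dat");
                ps.setLong(2, i);
                ps.addBatch();
            }
            ps.executeBatch();
            conn.commit();
            System.out.println("50k inserts in "
                    + (System.nanoTime() - start) / 1000000 + " ms");
            conn.close();
        }
    }

Swap the URL and driver class for PostgreSQL or MySQL and re-run the same harness to get a like-for-like comparison.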
Take a look at http://www.polepos.org/; there is a benchmark there which claims that http://www.db4o.com/ is one of the fastest embedded DBs.
I have personally worked with db4o; it's very nice, and it's licensed under the GPL, so it should fit your needs.
I am working on a project that logs a lot of information about viewers on an online streaming platform. The problem today with the MySQL solution is that it is too slow to query.
Even with scaling and better performance tuning, it will not work, because there is simply too much data being written and read in real time.
What would be a good (ideally the best) NoSQL solution for me?
Extra:
We are currently also using Amazon Web Services, where we store our data.
A Java API and an open source solution are preferred.
Object oriented.
Not exactly a NoSQL solution, but have you looked at Scribe (from Facebook)? You can use http://code.google.com/p/scribe-log4j/ to write to it from Java.
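From the application's point of view it is just ordinary log4j, with the Scribe appender wired in through configuration. A rough sketch (the appender class and property names in the comment are assumptions from memory; check the scribe-log4j project page for the exact ones):

    import org.apache.log4j.Logger;

    public class ViewerEventLogger {
        private static final Logger log = Logger.getLogger(ViewerEventLogger.class);

        public static void main(String[] args) {
            // log4j.properties would route this to the Scribe appender,
            // along the lines of (names assumed, verify against the docs):
            //   log4j.appender.scribe=org.apache.log4j.scribe.ScribeAppender
            //   log4j.appender.scribe.scribeHost=127.0.0.1
            //   log4j.appender.scribe.scribePort=1463
            log.info("viewer=42 stream=news-live event=play");
        }
    }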
I would spend some time looking at these options:
Cassandra
MongoDB
Hadoop
All of these solutions have their pros and cons, but their wikis should provide enough information to get you started.
The first challenge you may have is how to collect a huge amount of data reliably and with ease of management. There are some open-source log collector implementations such as syslog, Fluentd, Scribe, and Flume :)
The big problem is how to store and process the data. As you pointed out, using a NoSQL solution works really well, but you need to choose among them depending on your data volume.
At first, you can use MongoDB to store all of your data, but at some point you will end up using Apache Hadoop to build a massively scalable architecture.
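The MongoDB stage is very little code with the official Java driver; a sketch with made-up database, collection, and field names:

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.Mongo;

    public class ViewerEventStore {
        public static void main(String[] args) throws Exception {
            Mongo mongo = new Mongo("localhost", 27017);
            DB db = mongo.getDB("streaming");
            DBCollection events = db.getCollection("viewer_events");

            // documents are schemaless, so new metadata fields can be added later
            BasicDBObject event = new BasicDBObject("viewerId", 42)
                    .append("stream", "news-live")
                    .append("action", "play")
                    .append("ts", System.currentTimeMillis());
            events.insert(event);
            mongo.close();
        }
    }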
The point here is that you should have a distributed logging layer which abstracts away the storage backend, so you can choose the right NoSQL solution for your data volume.
Here are some links on putting Apache logs into MongoDB, or into Hadoop HDFS, with Fluentd.
Store Apache Logs into MongoDB
Fluentd + HDFS: Instant Big Data Collection
Recently, I came across a blog where the author mentioned integrating HBase and Hive. Is this possible, and if so, what is the advantage of using both (in terms of performance and scalability)? Kindly correct me if I have it wrong.
I think it will be possible, though not trivial to set up, for a while yet; maybe CDH3 final will include the integration when it comes out.
Advantages: Hive queries over HBase. Think joins and an easy way to do aggregates and simple operations on your HBase data.
Why not just use Hive and not bother with HBase? HBase gives you a scalable storage infrastructure that keeps data online. StumbleUpon uses HBase for their live website. Hive is not a real-time query engine, so its data store could not be used for similar purposes. Hive over HBase gives you the best of both worlds.
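To give a flavour of what "Hive queries over HBase" looks like in practice, here is a sketch that issues the table-mapping DDL from the HBaseIntegration wiki through Hive's JDBC driver; the host, table, and column family names are made up:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveOverHBaseSketch {
        public static void main(String[] args) throws Exception {
            // Hive's JDBC driver, talking to a running Hive server
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive://hiveserver:10000/default", "", "");
            Statement stmt = conn.createStatement();

            // map a Hive table onto an existing HBase table
            stmt.execute("CREATE EXTERNAL TABLE viewer_counts(key STRING, views INT) "
                    + "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
                    + "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,stats:views') "
                    + "TBLPROPERTIES ('hbase.table.name' = 'viewer_counts')");

            // now plain HiveQL aggregates run over the live HBase data
            ResultSet rs = stmt.executeQuery(
                    "SELECT key, views FROM viewer_counts WHERE views > 100");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getInt(2));
            }
            conn.close();
        }
    }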
There is currently a patch which enables loading data between HBase and Hive. You can find it here:
http://wiki.apache.org/hadoop/Hive/HBaseIntegration
The implementation overhead looks to be pretty high.
It might be easier to run a scan on the HBase table and save the results to an external file, then import that into Hive for data manipulation. (This is also pretty cumbersome, but if you are doing it on a regular basis it can be scripted.) This is the solution I am currently working on; I'll let you know how it goes.
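Roughly, the scan-and-export step looks like this with the plain HBase client API; the table and column names are made up, and the output is tab-separated text that Hive can LOAD directly:

    import java.io.FileWriter;
    import java.io.PrintWriter;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExportSketch {
        public static void main(String[] args) throws Exception {
            HTable table = new HTable(HBaseConfiguration.create(), "viewer_events");
            Scan scan = new Scan();
            scan.addColumn(Bytes.toBytes("stats"), Bytes.toBytes("views"));

            // dump each row as one tab-separated line
            PrintWriter out = new PrintWriter(new FileWriter("export.tsv"));
            ResultScanner scanner = table.getScanner(scan);
            for (Result row : scanner) {
                String key = Bytes.toString(row.getRow());
                String views = Bytes.toString(
                        row.getValue(Bytes.toBytes("stats"), Bytes.toBytes("views")));
                out.println(key + "\t" + views);
            }
            scanner.close();
            out.close();
            table.close();
        }
    }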
As for why you would choose HBase over Hive, they aren't really interchangeable. HBase is meant as a highly scalable data store built on top of Hadoop, with little support for data analysis. Hive on the other hand isn't used for storing data in a production environment, but rather makes it very easy to run specific queries over large amounts of data.
I have to do a class project for my data mining subject. My topic will be mining Stack Overflow's data for trending topics.
So, I have downloaded the data from here, but the data set is so huge (posts.xml is 3 GB in size) that I cannot process it on my machine.
So, what do you suggest: is going with AWS for the data processing a good option, or is it not worth it?
I have no prior experience on AWS, so how can AWS help me with my school project? How would you have gone about it?
UPDATE 1
So, my data processing will be in 3 stages:
Convert the XML (from the so.com dump) to .ARFF (for the Weka jar),
Mine the data using the algorithms in Weka,
Convert the output to GraphML format, which will be read by the prefuse library for visualization.
So, where does AWS fit in here? I suppose there are two features in AWS which can help me:
EC2 and
Elastic MapReduce,
but I am not sure how MapReduce works or how I can use it in my project. Can I?
You can consider EC2 (the part of AWS you would be using for doing the actual computations) as nothing more than a way to rent computers programmatically or through a simple web interface. If you need a lot of machines and you intend to use them for a short period of time, then AWS is probably good for you. However, there's no magic bullet. You will still have to pick the right software to install on them, load the data either in EBS volumes or S3 and all the other boring details.
Also be advised that EC2 instances and storage are relatively expensive. Be prepared to pay 5-10x more than you would if you actually owned the machine/disks and used them for, say, 3 years.
Regarding your problem, I sincerely doubt that a modern computer is unable to process a 3 GB XML file. In fact, I just indexed all of Stack Overflow's posts.xml in Solr on my workstation and it all went swimmingly. Are you using a SAX-like parser? If not, that will help you more than all the cloud services combined.
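For reference, a streaming SAX pass over the dump looks like the sketch below. The rows in posts.xml are flat, self-closing <row .../> elements, so you never need the whole 3 GB in memory; which attributes you pull out for your .ARFF conversion is up to you:

    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;

    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class PostsDumpParser {
        public static void main(String[] args) throws Exception {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse("posts.xml", new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attrs) {
                    // each post arrives as one <row .../> element at a time
                    if ("row".equals(qName)) {
                        String id = attrs.getValue("Id");
                        String tags = attrs.getValue("Tags");
                        // emit one ARFF line (or whatever stage 1 needs) here
                        System.out.println(id + " " + tags);
                    }
                }
            });
        }
    }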
Sounds like an interesting project, or at least a great excuse to get in touch with new technology -- I wish there had been stuff like that when I went to school.
In most cases AWS offers you a bare-bones server, so the obvious question is: have you decided how you want to process your data? E.g., do you just want to run a shell script on the .xml files, or do you want to use Hadoop, etc.?
The beauty of AWS is that you can get all the capacity you need -- on demand. E.g., in your case you probably don't need multiple instances, just one beefy instance. And you don't have to pay for a root server for an entire month, or even a week, if you only need it for a few hours.
If you let us know a little bit more on how you want to process the data, maybe we can help further.