The best NoSQL solution for logging - Java

I am working on a project that logs a lot of information about viewers of an online streaming platform. The problem with the current MySQL solution is that it is too slow to query.
Even with scaling and better performance tuning, it will not work because there is simply too much data being written and read in real time.
What would be a good (ideally the best) NoSQL solution for me?
Extra:
We are already using Amazon Web Services, where we store our data.
A Java API and an open-source solution are preferred.
Object oriented.

Not exactly a NoSQL solution, but have you looked at Scribe (from Facebook)? You can use http://code.google.com/p/scribe-log4j/ to write to it from Java.
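The application side would be ordinary log4j calls; the Scribe-specific part lives entirely in the appender configuration described in that project's documentation, which is not shown here. A minimal sketch (the class and event names are made up for illustration):

    import org.apache.log4j.Logger;

    public class ViewerEventLogger {
        private static final Logger LOG = Logger.getLogger(ViewerEventLogger.class);

        public void onViewerJoined(String viewerId, String streamId) {
            // The appender (e.g. the one from scribe-log4j) is wired up in
            // log4j.properties/log4j.xml; this call stays the same either way.
            LOG.info("viewer_joined viewer=" + viewerId + " stream=" + streamId);
        }
    }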

I would spend some time looking at these options:
Cassandra
MongoDB
Hadoop
All of these solutions have their pros and cons, but their wikis should provide enough information to get you started.

The first challenge you may have is collecting a huge amount of data reliably and with ease of management. There are some open-source log collector implementations such as syslog, Fluentd, Scribe, and Flume :)
The bigger problem is how to store and process the data. As you pointed out, a NoSQL solution works really well here, but you need to choose among them depending on your data volume.
At first you can use MongoDB to store all of your data, but at some point you may end up using Apache Hadoop to build a massively scalable architecture.
The point here is that you should have a distributed logging layer that abstracts away the storage backend, so you can choose the right NoSQL solution for your data volume.
Here are some links on putting Apache logs into MongoDB or Hadoop HDFS with Fluentd:
Store Apache Logs into MongoDB
Fluentd + HDFS: Instant Big Data Collection
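To complement the links above, here is a rough sketch of emitting structured events from Java, assuming the fluent-logger-java client and a Fluentd forward input listening on the default port 24224; the tag and field names are only examples:

    import java.util.HashMap;
    import java.util.Map;

    import org.fluentd.logger.FluentLogger;

    public class ViewerEventEmitter {
        // Tag prefix, host, and port are illustrative; match them to your Fluentd config.
        private static final FluentLogger LOG =
                FluentLogger.getLogger("streaming", "localhost", 24224);

        public static void viewerJoined(String viewerId, String streamId) {
            Map<String, Object> event = new HashMap<String, Object>();
            event.put("viewer", viewerId);
            event.put("stream", streamId);
            // Sends the record to Fluentd, which routes it to MongoDB, HDFS, etc.
            LOG.log("viewer.joined", event);
        }
    }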

Related

Change application from Oracle to Hadoop

I currently have a web application that keeps all of its data in Oracle. At a high level, the application consists of a Java applet, some Java servlets, some Ajax, and the Oracle database. I was wondering what converting the whole suite to Hadoop would cost in terms of work. Below are some questions that can help me get a grasp on it.
Is there any software that can take SQL database schema creation scripts and queries and convert them to appropriate calls in Hadoop?
How different are the Java APIs for communicating with Hadoop from those for Oracle SQL?
There's a bit of Ajax in there too; how different is that when moving from SQL to Hadoop?
Please consider me a beginner when explaining anything having to do with Hadoop. I don't need to drill down into specifics (unless you want to), just high level talks.
Thanks!
Hadoop is not suitable for use cases that need real-time querying and processing. Hadoop is best used for offline batch processing and data analysis. You can refer to the following link - Common Questions - to get some of your questions answered. There is no schema concept in HDFS, the filesystem in Hadoop; data is stored in blocks on disk as regular files.
I would suggest you visit the Apache Hadoop site to learn what Hadoop is and which use cases it fits best.
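To illustrate that point (HDFS is a plain filesystem, not a schema-based store), here is a minimal sketch using the HDFS Java API; the namenode URI and path are placeholders:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRoundTrip {
        public static void main(String[] args) throws Exception {
            // fs.defaultFS normally comes from core-site.xml on the classpath;
            // the URI here is just a placeholder.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/data/example.txt");

            // Write: HDFS exposes files and directories, not tables or schemas.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back as a plain byte stream.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }

            fs.close();
        }
    }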
If you are looking for SQL on Hadoop solution that is performant, then check out InfiniDB.
http://infinidb.co
We are a fourth-generation columnar MPP engine that runs behind MySQL. We can sit on top of HDFS, GlusterFS, or your local filesystem, so we can run on Hadoop or not, your choice. We are fully open source (GPLv2); there is no difference between the open-source version and the enterprise version, so use it as you want and scale as you need.
We operate in the interactive SQL space; many people use us for analytical queries against their data. Hadoop MapReduce is great at batch work and transformations, but it falls short on the interactive side of things, and that is where solutions like InfiniDB come in.
Since you are on Oracle and using Oracle SQL, there may not be much difference between that and the MySQL syntax we support, depending on which Oracle features you are using. Many people use us as a drop-in replacement for their existing MySQL database to get the performance of a clustered MPP database. Transitioning to Hadoop, as you mentioned, is another use case, since we can provide the SQL interface so that your applications do not even realize they are working on top of a Hadoop cluster.
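As a hedged illustration of what "drop-in" means in practice: because the engine speaks the MySQL protocol, existing JDBC code using the standard MySQL Connector/J driver should work unchanged. The host, schema, and table below are hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class AnalyticsQuery {
        public static void main(String[] args) throws Exception {
            // Standard MySQL Connector/J URL; the MPP engine sits behind the MySQL front end.
            String url = "jdbc:mysql://analytics-host:3306/metrics";
            try (Connection conn = DriverManager.getConnection(url, "user", "password");
                 PreparedStatement ps = conn.prepareStatement(
                         "SELECT page, COUNT(*) AS views FROM pageviews " +
                         "WHERE viewed_at >= ? GROUP BY page ORDER BY views DESC LIMIT 10")) {
                ps.setString(1, "2014-01-01");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("page") + "\t" + rs.getLong("views"));
                    }
                }
            }
        }
    }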
Feel free to contact me if you have any questions / comments.

Analysis of MongoDB and Badges system

We are developing a system that does some statistical analysis based on social networking data, e.g. tweets, status updates, etc. I was thinking of storing user-related information in a relational database (MySQL) and social network data in a NoSQL database (MongoDB). Is this a correct approach, or is it better to use MongoDB for the whole system? Please share your thoughts on the use of NoSQL databases for such a system.
I also need a badges system integrated with this one to award badges as users contribute more. Are there any open-source or commercial badge systems available? So far, based on my searches, I have found only the Mozilla Open Badges project, which I don't think is a perfect fit for us.
Thanks.
I just finished spending a solid year with Mongo, and I'm not sure it would be a good fit for your statistical analysis.
If I were you, I'd want to use only one database technology: all MySQL or all Mongo. Doing both will create a lot of headaches.
MongoDB is great for quick and dirty data modeling and having heterogeneous documents living in one collection. In other words, you don't have to manage the schema so actively, which can be really nice.
The problem with MongoDB is in the analysis you would want to do. While I believe the new aggregation framework solves a lot of the problems Mongo used to have with ad hoc reports and queries, it runs incredibly slowly compared to a normal relational database like MySQL.
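For context, this is roughly what such an ad hoc aggregation looks like with the MongoDB Java driver; the connection string, collection, and field names are assumptions made for illustration:

    import java.util.Arrays;

    import org.bson.Document;

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;

    public class TweetCounts {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> tweets =
                        client.getDatabase("social").getCollection("tweets");

                // Count tweets per user and keep the ten most active users.
                for (Document doc : tweets.aggregate(Arrays.asList(
                        new Document("$group", new Document("_id", "$userId")
                                .append("count", new Document("$sum", 1))),
                        new Document("$sort", new Document("count", -1)),
                        new Document("$limit", 10)))) {
                    System.out.println(doc.toJson());
                }
            }
        }
    }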
Lots of people scale MySQL to very large systems, so I would recommend sticking with MySQL due to the query language flexibility and the speed of running more complex queries.

Hadoop, Mahout real-time processing alternative

I intended to use Hadoop as a "computation cluster" in my project. However, I then read that Hadoop is not intended for real-time systems because of the overhead involved in starting a job. I'm looking for a solution that can be used this way: jobs that can easily be scaled across multiple machines but that do not require much input data. What's more, I want to run machine learning jobs in real time, e.g. using a previously trained neural network.
What libraries/technologies can I use for this purpose?
You are right, Hadoop is designed for batch-type processing.
Reading the question, I thought of the Storm framework, very recently open-sourced by Twitter, which can be considered "Hadoop for real-time processing".
Storm makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it's fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language.
(from: InfoQ post)
However, I have not worked with it yet, so I really cannot say much about it in practice.
Twitter Engineering Blog Post: http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html
Github: https://github.com/nathanmarz/storm
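For a feel of the programming model, here is a minimal, untested topology sketch using the original backtype.storm packages from that era (newer releases use org.apache.storm); the spout is a stand-in for whatever real event source you have, and the in-memory counts are illustrative only:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Random;

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;
    import backtype.storm.utils.Utils;

    public class ViewerCountTopology {

        // Stand-in spout; in practice this would read from a queue or log collector.
        public static class ViewerEventSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            private final Random random = new Random();

            @Override
            public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
                this.collector = collector;
            }

            @Override
            public void nextTuple() {
                Utils.sleep(100);
                collector.emit(new Values("stream-" + random.nextInt(3)));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("streamId"));
            }
        }

        // Counts events per stream id as tuples flow through in real time.
        public static class CountBolt extends BaseBasicBolt {
            private final Map<String, Long> counts = new HashMap<String, Long>();

            @Override
            public void execute(Tuple input, BasicOutputCollector collector) {
                String streamId = input.getStringByField("streamId");
                long count = counts.containsKey(streamId) ? counts.get(streamId) + 1 : 1L;
                counts.put(streamId, count);
                collector.emit(new Values(streamId, count));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("streamId", "count"));
            }
        }

        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("events", new ViewerEventSpout(), 2);
            builder.setBolt("counts", new CountBolt(), 4)
                   .fieldsGrouping("events", new Fields("streamId"));

            new LocalCluster().submitTopology("viewer-counts", new Config(), builder.createTopology());
        }
    }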
Given that you want a real-time response in the "seconds" range, I recommend something like this:
Set up a batch processing model for pre-computing as much as possible. Essentially, try to do everything that does not depend on the "last second" data. Here you can use a regular Hadoop/Mahout setup and run these batches daily or (if needed) every hour or even every 15 minutes.
Use a real-time system to do the last few things that cannot be precomputed.
For this you should look at using either the aforementioned S4 or the recently announced Twitter Storm.
Sometimes it pays to go really simple: store all the precomputed values in memory and do the last aggregation/filter/sorting/... steps in memory as well (a minimal sketch of this idea follows below). If you can do that, you can really scale, because each node can run completely independently of all the others.
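A sketch of that last idea, with made-up names: precomputed per-item scores from the batch layer are held in memory, and the real-time layer only adjusts and re-sorts them:

    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.stream.Collectors;

    public class LastMileAggregator {
        // Baseline scores produced by the batch layer (e.g. a nightly Hadoop/Mahout job).
        private final Map<String, Double> scores = new ConcurrentHashMap<String, Double>();

        // The real-time path only nudges the precomputed baseline.
        public void applyRealtimeBoost(String itemId, double boost) {
            scores.merge(itemId, boost, Double::sum);
        }

        // The final aggregation/filter/sort step runs entirely in memory on one node.
        public List<Map.Entry<String, Double>> topN(int n) {
            return scores.entrySet().stream()
                    .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                    .limit(n)
                    .collect(Collectors.toList());
        }
    }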
Perhaps having a NoSQL backend for your realtime component helps.
There are lots of those available: MongoDB, Redis, Riak, Cassandra, HBase, CouchDB, ...
It all depends on your real application.
Also try S4, initially released by Yahoo! and now an Apache Incubator project. It has been around for a while, and I found it to be good for some basic stuff when I did a proof of concept. I haven't used it extensively though.
What you're trying to do would be a better fit for HPCC, as it has both a back-end data processing engine (equivalent to Hadoop) and a front-end real-time data delivery engine, eliminating the need to increase complexity through third-party components. A nice thing about HPCC is that both components are programmed using the exact same language and programming paradigms.
Check them out at: http://hpccsystems.com

What is the advantage of integrating HBase and Hive

Recently, I came across a blog where the author mentioned integrating HBase and Hive. Is this possible, and if so, what is the advantage of using both (in terms of performance and scalability)? Kindly correct me if I have it wrong.
I think it will be possible but not trivial to set up for a bit -- maybe CDH3 final will include integration when it comes out.
Advantages: Hive queries over HBase. Think joins and an easy way to do aggregates and simple operations on your HBase data.
Why not just use Hive and not bother with HBase? HBase gives you a scalable storage infrastructure that keeps data online. StumbleUpon uses HBase for their live website. Hive is not a real-time query engine, so its data store could not be used for similar purposes. Hive over HBase gives you the benefit of both worlds.
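Once the integration is in place, querying the HBase-backed table from Java looks like any other Hive query. A sketch assuming HiveServer2, the Hive JDBC driver, and a hypothetical page_views table that was created with the HBaseStorageHandler:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveOverHBaseQuery {
        public static void main(String[] args) throws Exception {
            // Host, port, credentials, and table/column names are placeholders.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://hive-host:10000/default", "hive", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT page, SUM(views) FROM page_views GROUP BY page")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }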
There is currently a patch which enables loading data between HBase and Hive. You can find it here:
http://wiki.apache.org/hadoop/Hive/HBaseIntegration
The implementation overhead looks to be pretty high.
It might be easier to run a scan on the HBase table and save the results to an external file, then import it into Hive for data manipulation. (This is also pretty cumbersome, but if you are doing it on a regular basis it can be scripted.) This is the solution I am currently working on. I'll let you know how it goes.
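For reference, the scan-and-export step might look roughly like this with the HBase client API (shown with the newer Connection/Table interface; table, column family, and qualifier names are placeholders):

    import java.io.PrintWriter;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseToTsvExport {
        public static void main(String[] args) throws Exception {
            try (Connection connection =
                         ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = connection.getTable(TableName.valueOf("page_views"));
                 ResultScanner scanner = table.getScanner(new Scan());
                 PrintWriter out = new PrintWriter("page_views.tsv", "UTF-8")) {
                for (Result row : scanner) {
                    // Dump row key and one column as tab-separated text for Hive to load.
                    String key = Bytes.toString(row.getRow());
                    String views = Bytes.toString(
                            row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("views")));
                    out.println(key + "\t" + views);
                }
            }
        }
    }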
As for why you would choose HBase over Hive, they aren't really interchangeable. HBase is meant as a highly scalable data store built on top of Hadoop, with little support for data analysis. Hive on the other hand isn't used for storing data in a production environment, but rather makes it very easy to run specific queries over large amounts of data.

How is AWS for data mining for a school project?

I have to do a class project for my data mining course. My topic will be mining Stack Overflow's data for trending topics.
So, I have downloaded the data from here, but the data set is so huge (posts.xml is 3 GB in size) that I cannot process it on my machine.
So, what do you suggest, is going for AWS for data processing a good option or not worth it?
I have no prior experience on AWS, so how can AWS help me with my school project? How would you have gone about it?
UPDATE 1
So, my data processing will be in 3 stages:
Convert the XML (from the stackoverflow.com dump) to .ARFF (for the Weka jar),
Mine the data using algorithms in Weka (see the sketch after this list),
Convert the output to GraphML format, which will be read by the Prefuse library for visualization.
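For stage 2, plugging a Weka algorithm into Java code is fairly compact. A sketch assuming the ARFF file produced in stage 1 and an arbitrary classifier choice (J48), with the class attribute last:

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TrendingTopicsMiner {
        public static void main(String[] args) throws Exception {
            // "posts.arff" is the output of stage 1; adjust the path as needed.
            Instances data = DataSource.read("posts.arff");
            // Assumes the class attribute is the last column of the ARFF file.
            data.setClassIndex(data.numAttributes() - 1);

            J48 tree = new J48();
            tree.buildClassifier(data);
            System.out.println(tree);
        }
    }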
So, where does AWS fit in here? I suppose there are two features in AWS that can help me:
EC2 and
Elastic MapReduce,
but I am not sure how MapReduce works or how I can use it in my project. Can I?
You can consider EC2 (the part of AWS you would be using for doing the actual computations) as nothing more than a way to rent computers programmatically or through a simple web interface. If you need a lot of machines and you intend to use them for a short period of time, then AWS is probably good for you. However, there's no magic bullet. You will still have to pick the right software to install on them, load the data either in EBS volumes or S3 and all the other boring details.
Also be advised that EC2 instances and storage are relatively expensive. Be prepared to pay 5-10x more than you would pay if you actually owned the machine/disks and used it for say 3 years.
Regarding your problem, I sincerely doubt that a modern computer is not able to process a 3 gigabyte XML file. In fact, I just indexed all of Stack Overflow's posts.xml in Solr on my workstation and it all went swimmingly. Are you using a SAX-like parser? If not, that will help you more than all the cloud services combined.
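A streaming SAX pass over posts.xml might look like this; it assumes the dump's flat <row .../> format and never loads the whole file into memory:

    import java.io.File;

    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;

    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    public class PostsXmlStreamer {
        public static void main(String[] args) throws Exception {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new File("posts.xml"), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName,
                                         Attributes attributes) throws SAXException {
                    if ("row".equals(qName)) {
                        String tags = attributes.getValue("Tags");
                        if (tags != null) {
                            // Hand each post's tags to the next stage (tag counting,
                            // ARFF conversion, etc.) instead of keeping the file in memory.
                            System.out.println(attributes.getValue("Id") + "\t" + tags);
                        }
                    }
                }
            });
        }
    }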
Sounds like an interesting project, or at least a great excuse to get in touch with new technology -- I wish there had been stuff like that when I went to school.
In most cases AWS offers you a bare-bones server, so the obvious question is: have you decided how you want to process your data? E.g., do you just want to run a shell script on the .xml files, or do you want to use Hadoop, etc.?
The beauty of AWS is that you can get all the capacity you need, on demand. E.g., in your case you probably don't need multiple instances, just one beefy instance. And you don't have to pay for a root server for an entire month or even a week if you only need the server for a few hours.
If you let us know a little bit more on how you want to process the data, maybe we can help further.
