We are developing a system that does statistical analysis on social networking data, e.g. tweets, status updates, etc. I was thinking of storing user-related information in a relational database (MySQL) and the social network data in a NoSQL database (MongoDB). Is this a sound approach, or would it be better to use MongoDB for the whole system? Please share your thoughts on using NoSQL databases for such a system.
Also, I need a badges system integrated with this one, to award badges as users contribute more. Are there any open source or commercial badge systems available? So far, based on my searches, I have found only the Mozilla Open Badges project, which I don't think is a perfect fit for us.
Thanks.
I just finished spending a solid year with Mongo, and I'm not sure it would be a good fit for your statistical analysis.
If I were you I'd use only one database technology: all MySQL or all Mongo. Doing both will create a lot of headaches.
MongoDB is great for quick and dirty data modeling and having heterogeneous documents living in one collection. In other words, you don't have to manage the schema so actively, which can be really nice.
The problem with MongoDB is in the analysis you would want to do. While I believe the new aggregation framework solves a lot of the problems Mongo used to have with ad hoc reports and queries, it runs incredibly slowly compared to a normal relational database like MySQL.
Lots of people scale MySQL to very large systems, so I would recommend sticking with MySQL due to the query language flexibility and the speed of running more complex queries.
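To make the contrast concrete, here is a rough sketch (the collection, table, and field names are invented, not from your system) of the same "posts per user" report written once against MongoDB's aggregation framework with the official Java driver and once as plain SQL over JDBC:

```java
import com.mongodb.client.AggregateIterable;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import org.bson.Document;

import java.sql.*;
import java.util.Arrays;

public class PostsPerUserReport {
    public static void main(String[] args) throws Exception {
        // --- MongoDB aggregation pipeline over a hypothetical "tweets" collection ---
        try (MongoClient mongo = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> tweets = mongo.getDatabase("social").getCollection("tweets");
            AggregateIterable<Document> perUser = tweets.aggregate(Arrays.asList(
                    Aggregates.group("$userId", Accumulators.sum("posts", 1))));
            for (Document row : perUser) {
                System.out.println(row.toJson());
            }
        }

        // --- The same ad hoc report against a hypothetical MySQL "tweets" table ---
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/social", "app", "secret");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT user_id, COUNT(*) AS posts FROM tweets GROUP BY user_id")) {
            while (rs.next()) {
                System.out.println(rs.getLong("user_id") + " -> " + rs.getLong("posts"));
            }
        }
    }
}
```

Either form is easy at this scale; the difference the answer points at shows up as the reports and joins get more complex.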
I am a Java guy: I can work with Oracle Database and I know PL/SQL and SQL, but I am not good at managing database servers. I think that is a completely different area.
My question is about database replication. I googled it and found millions of answers, but I am still confused.
I have seen many times in my professional career that developers create entire (complicated) applications just to keep a source database schema in sync with a target one. These sync apps take time to develop and are very hard to maintain, especially when the data structures change, for example in the tables.
I have seen such apps built with JPA, JDBC, Spring, MyBatis, and also PL/SQL. Usually they sync the databases during the night, scheduled by cron, Quartz, Spring, etc. During the sync the source DB is usually available only for querying data, not for inserting, and DB constraints and triggers are disabled.
These kinds of custom applications have always scared me. I cannot believe there is no general, easy, and official way to keep two databases in sync without developing a new application.
Now I have been given a similar task and, honestly, I would like to write zero lines of code for it. I believe there are recommended, existing solutions covering this topic, offered by the database vendors.
It would be great if you could push me in the right direction. I feel that writing yet another new DB sync application is not the right way.
I need to focus on Oracle Database sync, but I would be happy to know a general, database vendor-independent way.
There are many ways to perform replication in an Oracle database. Oracle provides two replication technologies in the database, "Advanced Replication" and "GoldenGate". GoldenGate is the newer, preferred method of replication and works from the database's redo log files. Both methods are geared toward an Oracle DBA.
Often application developers will create an "interface" that moves data from one database to another. An interface is a program (PL/SQL, bash, C, etc.) that runs on a cron (database or system) and wakes on an event to move the data. Interfaces are useful when data needs to be processed during replication.
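As a hedged illustration of such an interface (the connection strings, table, and watermark handling below are invented), a minimal JDBC-based job might look roughly like this:

```java
import java.sql.*;

/**
 * Minimal sketch of the kind of "interface" described above: a scheduled job
 * that copies rows changed since the last run from a source Oracle schema to
 * a target one over plain JDBC. A real job would also deal with batching
 * limits, conflicts, and error recovery.
 */
public class NightlySyncJob {
    public static void main(String[] args) throws SQLException {
        try (Connection src = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//source-host:1521/SRC", "app", "secret");
             Connection dst = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//target-host:1521/DST", "app", "secret");
             PreparedStatement read = src.prepareStatement(
                     "SELECT id, payload, updated_at FROM orders WHERE updated_at > ?");
             PreparedStatement write = dst.prepareStatement(
                     "INSERT INTO orders_copy (id, payload, updated_at) VALUES (?, ?, ?)")) {

            dst.setAutoCommit(false);
            read.setTimestamp(1, lastSuccessfulSync());   // watermark from the previous run
            try (ResultSet rs = read.executeQuery()) {
                while (rs.next()) {
                    write.setLong(1, rs.getLong("id"));
                    write.setString(2, rs.getString("payload"));
                    write.setTimestamp(3, rs.getTimestamp("updated_at"));
                    write.addBatch();
                }
            }
            write.executeBatch();
            dst.commit();
        }
    }

    // Hypothetical helper: in practice this would come from a control table or file.
    private static Timestamp lastSuccessfulSync() {
        return Timestamp.valueOf("1970-01-01 00:00:00");
    }
}
```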
I need to choose between an ad-hoc solution with JSON and an embedded NoSQL DB (probably OrientDB).
Scenario:
Open-source desktop software in Java (free as in beer)
Single connection
Continuous Delivery (will change)
Really easy client installation (copy and paste)
about 20,000 records
polyglot persistence
The problem:
setting up a NoSQL DB is hard
one build for every environment; interoperability (Linux and Windows)
lack of embedded document NoSQL DBs for Java
complexity
So is ad-hoc JSON the right option? Any recommendations for a truly embedded NoSQL database? Or another approach?
Thanks.
One of the main motivations behind the development, and adoption, of NoSQL databases is the possibility of scaling horizontally, which is needed when your database reaches a size so large that it may require more nodes processing its operations to stay responsive.
If improving performance is the motivation, one should only move a database to a NoSQL approach when it is reaching a huge amount of data. As a side note, it is even interesting to think about the etymology behind the name of one of the most successful NoSQL databases so far: MongoDB gets the prefix "mongo" as a reference to humongous, i.e. enormous. This clearly states the purpose of such tools.
That being said, considering that in your scenario you are dealing with only 20,000 records, you have many other alternatives that are easier to manage. You can go for ad-hoc JSON, or even use more traditional, solid and stable tools like embedded Firebird or the most obvious and widely used option for embedded databases: SQLite.
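For instance, a minimal sketch of the SQLite route with the xerial sqlite-jdbc driver (the file name and schema are invented) shows how little setup an embedded database needs; the whole store is a single file shipped alongside the application, which also fits the "copy and paste" installation requirement:

```java
import java.sql.*;

/** Minimal sketch of an embedded, zero-install store using sqlite-jdbc.
 *  Storing a JSON string in a TEXT column also covers the "ad-hoc JSON" idea. */
public class EmbeddedStore {
    public static void main(String[] args) throws SQLException {
        try (Connection db = DriverManager.getConnection("jdbc:sqlite:app-data.db");
             Statement st = db.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, body TEXT)");
            st.execute("INSERT INTO records (body) VALUES ('{\"name\":\"example\",\"tags\":[\"a\",\"b\"]}')");
            try (ResultSet rs = st.executeQuery("SELECT id, body FROM records")) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + ": " + rs.getString("body"));
                }
            }
        }
    }
}
```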
Solr/Lucene's inverted index and query support a subset of RDBMS functionality, i.e. filtering, sorting, group-by, and paging. In this sense it is very close to a NoSQL database, as it also does not support transactions or joins.
With a framework like Hibernate Search, it is possible to map even complex objects to the index and perform basic CRUD operations on them while supporting full-text search.
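As a rough sketch (the entity and field names are invented), this is the index-then-search flow in plain Lucene; frameworks like Hibernate Search essentially generate and manage this kind of code for annotated entities:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class LuceneAsStore {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("data-index"));

        // "Create/update": updateDocument replaces any existing doc with the same id (an upsert).
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("id", "42", Field.Store.YES));          // exact-match key
            doc.add(new TextField("body", "full-text searchable content", Field.Store.YES));
            writer.updateDocument(new Term("id", "42"), doc);
            // "Delete" would be: writer.deleteDocuments(new Term("id", "42"));
        }

        // "Read": a full-text query over the stored documents.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(
                    new QueryParser("body", new StandardAnalyzer()).parse("searchable"), 10);
            for (ScoreDoc hit : hits.scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("body"));
            }
        }
    }
}
```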
Considerations:
1) Write throughput
From my past experience, a Lucene index's write throughput is much lower than an RDBMS's.
2) Query Speed
Query speed against a Lucene index should be comparable, if not faster, thanks to the inverted index.
3) Scalability
Can be addressed using replication or SolrCloud.
4) Ability to handle large data set
I have used a Lucene index with 15M+ documents on a single JVM without any performance issues.
Background:
I am currently using MongoDB with Solr and it is working well enough. However, it is not as "simple" as I would like it to be, due to:
Keeping the Mongo and Solr indexes in sync (not a trivial task; a sketch of this glue appears after this list)
Transformation between Java objects <-> Mongo <-> Solr (Spring Data and SolrJ help, but it's still not great)
Why use two "persistence" technologies if one will do?
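The sync glue referred to in the first point above might look roughly like this with the MongoDB Java driver and SolrJ (the database, core, and field names are invented; a real job would be incremental rather than a full re-index):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.bson.Document;

/** Rough sketch of re-pushing Mongo documents into a Solr core. */
public class MongoToSolrSync {
    public static void main(String[] args) throws Exception {
        try (MongoClient mongo = MongoClients.create("mongodb://localhost:27017");
             HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build()) {
            MongoCollection<Document> articles = mongo.getDatabase("app").getCollection("articles");
            for (Document article : articles.find()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", article.getObjectId("_id").toHexString());
                doc.addField("title", article.getString("title"));
                doc.addField("body", article.getString("body"));
                solr.add(doc);
            }
            solr.commit();   // make the re-indexed documents searchable
        }
    }
}
```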
From the small-scale tests I have done so far, I haven't found any technical roadblock that would prevent me from using Solr/Lucene for persistence. However, I also don't want to commit to such a drastic refactoring without more information. I am also aware of projects like Solandra that attempt to bring NoSQL and Solr together, but they don't seem mature enough.
Question
So, for applications where full-text search is a major (but not the only) requirement, is it feasible to forgo both traditional (RDBMS) and contemporary (NoSQL) data stores?
Great reference (thanks to raticulin):
Atlassian (Jira) - Lucene Generic Data Indexing
I think I remember watching a presentation from Atlassian where they explained that Jira nowadays uses just Lucene: they had dropped their previous DB (whatever it was) and were using Lucene as storage too. They were happy.
If someone can confirm it was them, that would be cool.
Edit:
http://blogs.atlassian.com/rebelutionary/downloads/tssjs2007-lucene-generic-data-indexing.pdf
Lucene - Full Text Search/Information Retrieval Library.
Solr - Enterprise Search Server built on top of Lucene.
Lucene/Solr should not be used in place of persistence: it will not be able to replace an RDBMS, and it is not a good idea to compare them to an RDBMS at all; you are comparing apples and oranges.
Comparing Lucene's indexing throughput directly with an RDBMS will not help, and it is not a fair comparison; a number of factors can affect Lucene throughput, depending on your search schema configuration.
Lucene has one of the best-known and most effective data structures for information retrieval; the query speed you get depends on a number of factors, from configuration to hardware.
Obviously, that's the way to go.
Handling 15M+ documents on a single JVM is great, but that figure does not say much without knowing the document size, the feature set used, the JVM memory, the CPU cores, etc.
Now, if your problem is that the RDBMS is a real scalability bottleneck, you could pick a NoSQL data store based on your persistence needs and then integrate Solr/Lucene with it to provide full-text search capability. Since NoSQL is rapidly evolving and fairly new, you might not find very stable adapters for integrating Solr/Lucene with a NoSQL store.
Edit:
Now that the question is updated, this is already well debated in the question "NoSQL (MongoDB) vs Lucene (or Solr) as your database". It can be a pain to have too many moving parts, and Lucene/Solr could very well replace MongoDB, depending on the app. But you have to consider that NoSQL data stores are built from the ground up to be fully distributed, so you don't lose functionality, or end up with limited functionality, as you scale, while Solr was not built with distributed computing in mind, so there are Distributed Search limitations when it comes to horizontal scaling. SolrCloud may be the answer to that.
I'm looking for the best database software for a new open source application. The primary criterion is that it has to be lightning fast for searching among tens of thousands of entries. Ideally it would be entirely Java based, but simply having a Java API is OK. I'm looking to license under the GPL, so the project would have to be compatible with that. So far SQLite seems to be the most ubiquitous solution, but I don't want to overlook something else that could turn out to be better.
When I search the general internet, most results seem to be for object databases. I don't care whether the database is object-based or relational, and I don't think I care whether it's "NoSQL". I have lots of experience with MySQL, but I'm not terribly afraid of learning a new query language or interface if it's faster that way. The main kind of data this will manage is filenames with at least 20 metadata fields attached; I'd want multiple datasets with the same fields, and it would be nice to also store some application preferences in the database.
I see from some responses that there may be confusion about my (former) use of "embedded" in the title. I want to clarify that I mean "embedded in the application and redistributed", not "in use on an embedded device". The application currently targets full-scale computers, although one reason for "ideally it would be entirely Java based" is a dreamy aspiration of creating an Android version.
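To make that data model concrete, here is a hedged sketch of the schema described above, expressed as plain JDBC DDL (the JDBC URL and column names are placeholders; any embedded engine with a JDBC driver would do):

```java
import java.sql.*;

/** Sketch of the catalog described above: file name plus ~20 metadata columns,
 *  grouped into named datasets, with a small key/value preferences table. */
public class CatalogSchema {
    public static void main(String[] args) throws SQLException {
        try (Connection db = DriverManager.getConnection("jdbc:sqlite:catalog.db");
             Statement st = db.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS datasets (id INTEGER PRIMARY KEY, name TEXT UNIQUE)");
            st.execute("CREATE TABLE IF NOT EXISTS files ("
                     + " id INTEGER PRIMARY KEY,"
                     + " dataset_id INTEGER REFERENCES datasets(id),"
                     + " filename TEXT NOT NULL,"
                     + " title TEXT, author TEXT, created_year INTEGER"   // ...and the rest of the ~20 metadata fields
                     + ")");
            st.execute("CREATE INDEX IF NOT EXISTS idx_files_filename ON files(filename)");
            st.execute("CREATE TABLE IF NOT EXISTS preferences (key TEXT PRIMARY KEY, value TEXT)");
        }
    }
}
```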
Ultimately it really depends on your application. SQLite is not designed to be as robust as standard client/server databases like Oracle and MySQL. The SQLite FAQ says the following on the subject:
However, client/server database engines (such as PostgreSQL, MySQL, or Oracle) usually support a higher level of concurrency and allow multiple processes to be writing to the same database at the same time. This is possible in a client/server database because there is always a single well-controlled server process available to coordinate access. If your application has a need for a lot of concurrency, then you should consider using a client/server database. But experience suggests that most applications need much less concurrency than their designers imagine.
That being said, SQLite is very fast, but again this depends on how you'll be using it and on what platforms. If you are running on an embedded device you may see significantly different performance than on a regular desktop/server, which is why it's hard to give an exact answer. SQLite does see significant performance gains from not abiding by the standard client/server model.
Your best bet is to pick a few, like SQLite, PostgreSQL, and MySQL, and see the performance implications of each by running some tests that simulate common scenarios you will encounter in your application.
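A throwaway harness along those lines can be as simple as the sketch below (the JDBC URL and workload are placeholders); swap in the URL of each candidate engine and compare timings for an access pattern that resembles your application:

```java
import java.sql.*;

/** Tiny benchmark sketch: time a bulk insert against whichever JDBC URL is passed in.
 *  Numbers from anything this simple are only a rough first signal. */
public class QuickDbBenchmark {
    public static void main(String[] args) throws SQLException {
        String url = args.length > 0 ? args[0] : "jdbc:sqlite:bench.db";
        try (Connection db = DriverManager.getConnection(url);
             Statement ddl = db.createStatement()) {
            ddl.execute("CREATE TABLE IF NOT EXISTS bench (id INTEGER, name TEXT)");
            db.setAutoCommit(false);
            long start = System.nanoTime();
            try (PreparedStatement insert = db.prepareStatement("INSERT INTO bench (id, name) VALUES (?, ?)")) {
                for (int i = 0; i < 50_000; i++) {
                    insert.setInt(1, i);
                    insert.setString(2, "file-" + i + ".dat");
                    insert.addBatch();
                }
                insert.executeBatch();
            }
            db.commit();
            System.out.printf("%s: 50,000 inserts in %d ms%n", url,
                    (System.nanoTime() - start) / 1_000_000);
        }
    }
}
```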
Take a look at http://www.polepos.org/; there is a benchmark there which claims that http://www.db4o.com/ is one of the fastest embedded DBs.
I have personally worked with db4o and it's very nice, and it's licensed under the GPL, so it could well fit your needs.
I am working on a project that logs a lot of information about viewers of an online streaming platform. The problem with the current MySQL solution is that it is too slow to query, and so on.
Even with scaling and better performance tuning, it will not work, because there is simply too much data being written and read in real time.
What would be a good (or the best) NoSQL solution for me?
Extra:
We are currently also using Amazon Web services, where we store our data.
A Java API and an open-source solution are preferred.
Object oriented.
Not exactly a NoSQL solution, but have you looked at Scribe (from Facebook)? You can use http://code.google.com/p/scribe-log4j/ to write to it from Java.
I would spend some time looking at these options:
Cassandra
MongoDB
Hadoop
All of these solutions have their pros and cons, but their wikis should provide enough information to get you started.
The first challenge you may have is how to collect a huge amount of data reliably and with ease of management. There are some open-source log collector implementations such as syslog, Fluentd, Scribe, and Flume :)
The big problem is how to store and process the data. As you pointed out, using a NoSQL solution works really well, but you need to choose among them depending on your data volume.
At first, you can use MongoDB to store all of your data, but at some point you will end up using Apache Hadoop to build a massively scalable architecture.
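For the MongoDB side, writing a viewer event from Java is a one-liner with the official driver (the database, collection, and field names below are invented); at your volume you would batch these or put a collector such as Fluentd in front, as the links below show:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.Date;

/** Minimal sketch of writing one viewer event into MongoDB with the official Java driver. */
public class ViewerEventLogger {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> views = client.getDatabase("analytics").getCollection("views");
            views.insertOne(new Document("viewerId", "abc-123")
                    .append("streamId", "live-42")
                    .append("event", "play")
                    .append("ts", new Date()));
        }
    }
}
```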
The point here is that you should have a distributed logging layer that abstracts away the storage backend, and then choose the right NoSQL solution for your data volume.
Here are some links on putting Apache logs into MongoDB or Hadoop HDFS with Fluentd.
Store Apache Logs into MongoDB
Fluentd + HDFS: Instant Big Data Collection