Why to use Hadoop? [closed]

Closed 7 years ago.
I am a little confused about the usage of Hadoop. I don't understand when and where to use it.
Hadoop is an open-source framework that allows to store and process
big data in a distributed environment across clusters of computers
using simple programming models. It is designed to scale up from
single servers to thousands of machines, each offering local
computation and storage.
According to the definition, this job (storing and processing data across clusters) also gets done by other databases like Oracle, MSSQL, etc. So what is the advantage of using Hadoop?

Hadoop is basically a distributed file system (HDFS) - it lets you store large amounts of file data on a cluster of machines, handling data redundancy etc.
On top of that distributed file system, Hadoop provides an API for processing all that stored data - Map-Reduce.
The basic idea is that since the data is stored in many nodes, you're better off processing it in a distributed manner where each node can process the data stored on it rather than spend a lot of time moving it over the network.
Unlike an RDBMS, which you can query in real time, a map-reduce job takes time and doesn't produce immediate results.
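To make the "process the data where it is stored" idea concrete, here is a minimal single-JVM sketch of the map and reduce phases for a word count. It deliberately does not use the Hadoop API; the class name `MapReduceSketch`, the method names and the two hypothetical "nodes" are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Single-JVM illustration of the map/reduce idea; this is NOT the Hadoop API.
final class MapReduceSketch {

    // "Map" phase: each node counts words only in the lines it stores locally.
    static Map<String, Integer> mapPartition(List<String> localLines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : localLines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // "Reduce" phase: only the small per-node summaries are combined,
    // so the raw data never has to cross the network.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Two hypothetical "nodes", each holding its own slice of the data.
        List<List<String>> nodes = List.of(
                List.of("big data big cluster"),
                List.of("data moves slowly", "big jobs"));
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (List<String> node : nodes) {
            partials.add(mapPartition(node));
        }
        System.out.println(reduce(partials));
    }
}
```

In a real cluster the framework would run the map step on the machines that hold each block and ship only the partial results, which is where the batch-style latency comes from.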
On top of this basic scheme you can build a Column Database, like HBase.
A column-database is basically a hashtable that allows realtime queries on rows.
As far as I know, there are a lot of differences. Please see them below.
Hadoop is not a database. HBase or Impala may be considered databases, but Hadoop is just a file system (HDFS) with built-in redundancy and parallelism.
Traditional databases/RDBMS have ACID properties - Atomicity, Consistency, Isolation and Durability. You get none of these out of the box with Hadoop. So if, for example, you have to write code to take money from one bank account and put it into another one, you have to (painfully) code all the scenarios, like what happens if money is taken out but a failure occurs before it's moved into the other account.
Hadoop offers massive scale in processing power and storage at a very low cost compared to an RDBMS.
Hadoop offers tremendous parallel processing capabilities. You can run jobs in parallel to crunch large volumes of data.
Some people argue that traditional databases do not work well with unstructured data, but it's not as simple as that. I have come across many applications built on a traditional RDBMS that use a lot of unstructured data, video files or PDFs and work well.
Typically RDBMS will manage a large chunk of the data in its cache for faster processing while at the same time maintaining read consistency across sessions. I would argue Hadoop does a better job at using the memory cache to process the data without offering any other items like read consistency.
Hive SQL is almost always an order of magnitude slower than the SQL you can run in traditional databases. So if you are thinking SQL in Hive is faster than in a database, you are in for a sad disappointment. It will not scale at all for complex analytics.
Hadoop is very good for parallel processing problems - like finding a set of keywords in a large set of documents (this operation can be parallelized). However typically RDBMS implementations will be faster for comparable data sets.
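As a small illustration of why keyword search parallelizes so well, the sketch below checks each document independently using a Java parallel stream. It runs on the cores of one machine rather than on a cluster, and the class and method names (`KeywordSearchSketch`, `findMatches`) are hypothetical; Hadoop applies the same divide-and-check idea across many machines.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class KeywordSearchSketch {

    // Returns the documents that contain at least one of the keywords.
    // Each document is checked independently of the others, so the work
    // can be split across threads (or, at Hadoop scale, across machines).
    static List<String> findMatches(List<String> documents, Set<String> keywords) {
        return documents.parallelStream()
                .filter(doc -> keywords.stream().anyMatch(doc::contains))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> docs = List.of("hadoop stores files", "oracle runs sql", "hive on hadoop");
        System.out.println(findMatches(docs, Set.of("hadoop", "hive")));
    }
}
```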

An RDBMS is not capable of processing big data in a cost-effective way. As the size of the data increases, RDBMS systems that rely on vertical scaling stop working well. This is where big data processing frameworks such as Hadoop work well in a cost-effective way.
Most of the big data processing frameworks are open source and designed to run on commodity hardware, so the cost will be much lower than that of an RDBMS for the same setup.
In simple words, big data starts where the RDBMS stops due to data size and complexity.
Another point is RDBMS mainly deals with structured data. But most of the big data frameworks can deal with Structured, Unstructured and Semi-structured data. Most of the big data frameworks are designed to process any kind of large data.

Hadoop distributes both data and computation. Keeping the computation local to the data prevents network overload. Tasks are independent, so it is easy to handle partial failure: entire nodes can fail and restart.
It avoids the crawling horrors of failure-tolerant synchronous distributed systems. It scales linearly in the ideal case. It is designed for cheap, commodity hardware. The programming model is simple: the end-user programmer only writes map-reduce tasks.

Related

Which is the fastest NoSql database accessed from the same machine?

In my use case the data is relatively small (~1,000,000 strings), but I have to access it as fast as possible (every nanosecond counts) from a multithreaded environment (implemented in pure Java).
Currently I'm using Redis (on localhost) and I'm basically happy with it, but I want to know if there is a better alternative, since Redis has all the network overhead and is not designed for multithreaded use. Persistence is also very low priority for my use case.
I want to run in the same machine (no networking at all)
I want to be as fast as possible
Relatively small data (my current Redis instance is about 20MB max in memory)
I don't want to:
use any solution other than a NoSQL database.

There are lots of great NoSQL databases that function as a key-value store. Each have unique capabilities.
Redis is great on a single server and is dead easy to install and use. But Redis becomes difficult to shard and manage when your data outgrows a single server.
Thumbtack Technologies (of NYC) published two white papers comparing performance and reliability of MongoDB, Cassandra and Aerospike. The papers are very objective; the benchmarks were done using the YCSB benchmarking tool and were conducted on the same hardware.
Which one to use depends on what you need.
MongoDB is a feature-rich key-value store with lots of nice programmer features. It offers queries on secondary indexes and is a very good document store. It leans heavily on memory, so for good performance your working set should fit into RAM. Mongo can be clustered, and I have heard that it becomes tricky to manage if you have a big cluster.
CouchBase is great for storing large amounts of data, with a portion of that data cached in RAM. So it's very quick if the value you are after is in the cached working set. This is great if your use case mostly works with hot data and accesses cold data less often.
Cassandra is really good for a 'write heavy' use case. It's easy to use and is a good programmer experience. It is written in Java and periodically pauses while it does GC, so you need to tune your GC parameters.
Aerospike is good for storing large amounts of data in a small number of servers. It boasts single digit millisecond (or better) latencies, high availability and high reliability, and it is probably (IMHO) the easiest to maintain and scale. It is multi-core aware, NUMA node aware and has a self-healing zero touch cluster technology. It's great for "real-time" use cases where access to any record needs to be fast and predictable. Aerospike is my favorite.
Cassandra, CouchBase, MongoDB and Aerospike all have an "analytics" capability, and which one you choose depends on the use case and your performance envelope.

You have 1 million strings?
That's a tiny amount of data. If you want speed, then nothing will be faster than just using an in-memory data structure inside your application code itself. Just store all the data in a file, load it into a list on program startup, then serialize it back to the file if you need to save it.
Avoid all the overhead of running and interacting with a database - especially since you don't care about persistence.
A simple flat file with each line being a separate string will take about 100ms to read and parse.
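A minimal sketch of that flat-file approach might look like the following. It loads into a set rather than a list, on the assumption that the strings are looked up by value; the class name `InMemoryStrings` and its methods are made up for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class InMemoryStrings {

    // Load one string per line into a read-only set for constant-time lookups.
    static Set<String> load(Path file) {
        try {
            return Collections.unmodifiableSet(
                    new HashSet<>(Files.readAllLines(file, StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Write the strings back, one per line, if they ever need to be saved.
    static void save(Path file, Set<String> strings) {
        try {
            Files.write(file, strings, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the loaded set is never modified after startup, it can be read from many threads without locking as long as it is shared safely (for example through a `final` field). If the data has to change at runtime, a `ConcurrentHashMap` would be the more natural choice.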

Smaller scale Java distributed programming

I'm learning a bit more about Hadoop and its applications, and I understand it is geared toward massive datasets and large files. Let's say I had an application in which I was processing a relatively small number of files (say 100k), which isn't a huge number for something like Hadoop/HDFS. However, it does take a significant amount of time to run on a single machine, so I'd like to distribute the process.
The problem can be broken down into a map reduce style problem (e.g. each of the files can be processed independently and then I can aggregate the results). I'm open to using infrastructure such as Amazon EC2, but I'm not so sure about what technologies to be exploring for actually aggregating the results of the process. Seems like hadoop might be a bit overkill here.
Can anyone provide guidance on this type of problem?

First off, you may want to reconsider your assumption that you can't combine files. Even images can be combined- you just need to figure out how to do that in a way that allows you to break them out again in your mappers. Combining them with some sort of sentinel value or magic number between them might make it possible to turn them into one giant file.
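One way to combine many small files so that they can be split apart again is sketched below. Note that it swaps the sentinel-value idea for a length prefix in front of each payload, which avoids having to pick a marker that can never occur inside the data. The class name `FilePacker` and the exact layout are assumptions for illustration, not a Hadoop format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class FilePacker {

    // Concatenate many small payloads into one blob: [length][bytes][length][bytes]...
    static byte[] pack(List<byte[]> files) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (byte[] file : files) {
                out.writeInt(file.length); // 4-byte big-endian length prefix
                out.write(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Split the blob back into the original payloads, e.g. inside a mapper.
    static List<byte[]> unpack(byte[] blob) {
        List<byte[]> files = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(blob))) {
            while (in.available() > 0) {
                byte[] file = new byte[in.readInt()];
                in.readFully(file);
                files.add(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }
}
```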
Other options include HBase, where you could store the images in cells. HBase also has a built-in TableMapper and TableReducer, and can store the results of your processing alongside the raw data in a semi-structured way.
EDIT: As for the "is Hadoop overkill" question, you need to consider the following:
Hadoop adds at least one machine of overhead (the HDFS NameNode). You typically don't want to store data or run jobs on that machine, since it is a SPOF.
Hadoop is best suited for processing data in batch, with relatively high latency. As @Raihan mentions, there are several other FOSS distributed compute architectures that may serve your needs better if you need realtime or low-latency results.
100k files isn't so very few. Even if they are only 100 KB each, that's 10 GB of data.
Other than the above, Hadoop is a relatively low-overhead way of approaching distributed computing problems. It has a huge, helpful community behind it, so you can get help quickly if you need it. And it is focused on running on cheap hardware and a free OS, so there really isn't any significant overhead.
In short, I'd try it before you discard it for something else.

Is "Adopting MapReduce model" = Universal answer to scalability?

I have been trying to understand the MapReduce concept and apply it to my current situation. What is my situation? Well, I have an ETL tool here, in which data transformation happens outside of the source and destination data sources (databases). Hence, the source data source is purely used for extract and the destination for load.
So, this act of transformation today takes, say, about X hours for a million records. I would like to address a scenario where I would have a billion records, but I would want the work done in the same X hours. So, here is the need for my product to scale out (adding more commodity machines) based on the scale of data. As you can see, I am only worried about the ability to distribute my product's transformation functionality to different machines, thereby leveraging CPU power from all these machines.
I started looking for options and I came across Apache Hadoop and then eventually the concept of MapReduce. I was pretty successful in setting up Hadoop quickly in cluster mode without running into issues, and was happy to run a wordcount demo too. Soon, I realized that to implement my own MapReduce model, I would have to redefine my product's transformation functionality as MAP and REDUCE functions.
Here's when trouble began. I read a copy of Hadoop: Definitive Guide, and I understood that many of the common use cases of Hadoop are in scenarios where one is faced with:
Unstructured data and one would like to perform aggregation/ sort/ or something of that kind.
Unstructured text and there is a need to perform mining
etc!
Here is my scenario where I extract from a database and load to a database (which has structured data), and my sole purpose is to bring more CPUs into play, in a reliable manner, and thereby distribute my transformation. And redefining my transformation to fit a Map and Reduce model is a huge challenge in itself. So here are my questions:
Have you used Hadoop in ETL scenarios? If yes, could you be specific about how you handled MapReducing of your transformation? Have you used Hadoop purely for leveraging extra CPU power?
Is the MapReduce concept the universal answer to distributed computing? Are there other equally good options?
My understanding is that MapReduce applies to large datasets for sorting/analytics/grouping/counting/aggregation/etc, is my understanding correct?

If you want to scale-out a processing problem over a lot of systems you must do two things:
Make sure you can process the information in independent parts.
There should be NO shared resource that is needed among these parts.
If there are dependencies then these will be the limit in your horizontal scalability.
So if you are starting from a relational model then the main obstruction is the fact that you have relationships. Having these relationships is a great asset in relational databases but is a pain in the ... when trying to scale-out.
The simplest way to go from relational to independent parts is to make a jump and de-normalize your data into records that have everything in them and are focused around the part you want to do the processing around. Then you can distribute them over a huge cluster, and after the processing has been completed you use the results.
If you cannot do such a jump you're in trouble.
So coming back to your questions:
# Have you used Hadoop in ETL scenarios?
Yes, the input being Apache logfiles and the loading and transformation consisted of parsing, normalizing and filtering these loglines. The result wasn't put in a normal RDBMS!
# Is the MapReduce concept the universal answer to distributed computing? Are there other equally good options?
MapReduce is a very simple processing model that will work great for any processing problem you are able to split into a lot of smaller 100% independent parts. The MapReduce model is so simple that, as far as I know, any problem that can be split into independent parts can be written as a series of mapreduce steps.
HOWEVER: It is important to note that at this moment only BATCH oriented processing can be done with Hadoop. If you want "realtime" processing you are currently out of luck.
I don't know of a better model at this moment for which an actual implementation exists.
# My understanding is that MapReduce applies to large datasets for sorting/analytics/grouping/counting/aggregation/etc, is my understanding correct?
Yep, that is the most common application.

MapReduce is "one" solution for "some" class of problems. It does not solve all distributed systems problems - think about large TPS systems such as the ones in banks or in telecom signaling - there MR might be ineffective. But for non-real-time data processing MR performs very well, and you might consider it for massive ETL.

I cannot answer #1, as I haven't used MapReduce in ETL scenarios. However, I can say that MapReduce is not a "universal answer" for distributed computing; it's a useful tool for handling certain types of situations, where data is structured in a certain way. Think of it like a hashtable; very useful for certain situations, but not an "ultimate algorithm" by any definition of terms.
My personal understanding is that MapReduce is particularly useful for large quantities of "understructured" data; that is, it's useful for imposing some structure (basically, effectively providing a "first order" operation on large unstructured datasets). However, for datasets that are very large and relatively "tightly bound" (i.e. strong association between disparate data elements), it's (in my understanding) not a great solution.

Persistence strategy for low latency reads and writes

I am building an application that includes a feature to bulk tag millions of records, more or less interactively. The user interaction is very similar to Gmail where users can tag individual emails, or bulk tag large amounts of emails. I also need quick read access to these tag memberships as well, and where the read pattern is more or less random.
Right now we're using MySQL and inserting one row for every tag-document pair. Writing millions of rows to MySQL takes a while (high I/O), even with bulk insertions and heavy optimization. We need this to be an interactive process, not a batch process.
For the data that we're storing and reading, consistency and availability of the data are not as important as performance and scalability. So in the event of system failure while the writes are occurring, I can deal with some data loss. However, the data definitely needs to be persisted to secondary storage at some point.
So, to sum up, here are the requirements:
Low latency bulk writes of potentially tens of millions of records
Data needs to be persisted in some way
Low latency random reads
Durable writes not required
Eventual consistency is okay
Here are some solutions I've looked at:
Write behind caches (Terracotta, Gigaspaces, Coherence) where records are written to memory and drained to the database asynchronously. These scare me a little because they appear to add a certain amount of complexity to the app that I'd want to avoid.
Highly scalable key-value stores, like MongoDB, HBase, Tokyo Tyrant

If you have the budget to use Coherence for this, I highly recommend doing so. There is direct support for write-behind, eventual consistency behavior in Coherence and it is very survivable to both a database outage and Coherence cluster node outages (if you use >= 3 Coherence nodes on separate JVMs, preferably on separate hosts). I have implemented this for doing high-volume CRM for a Fortune 100 company's e-commerce site and it works fantastically.
One of the best aspects of this architecture is that you write your Java application code as if none of the write-behind behavior were taking place, and then plug in the Coherence topology and configuration that makes it happen. If you need to change the behavior or topology of Coherence later, no change in your application is required. I know there are probably a handful of reasonable ways to do this, but this behavior is directly supported in Coherence rather than having to invent or hand-roll a way of doing it.
To make a really fine point - your worry about adding application complexity is a good one. With Coherence, you simply write updates to the cache (or if you're using Hibernate it can be the L2 cache provider). Depending upon your Coherence configuration and topology, you have the option to deploy your application to use write-behind, distributed, caches. So, your application is no more complex (and, frankly unaware) due to the features of the cache.
Finally, I implemented the solution mentioned above from 2005-2007 when Coherence was made by Tangosol and they had the best possible support. I'm not sure how things are now under Oracle - hopefully still good.

I've worked on a large project that used asynchronous writes, although in that case it was just hand-written using background threads. You could also implement something like that by offloading the db write process to a JMS queue.
One thing that will certainly speed up db writes is to do them in batches. JDBC batch updates can be orders of magnitude faster than individual writes, and if you're doing them asynchronously you can just write them 500 at a time.
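The batching idea can be sketched with plain JDBC as follows. The table and column names (`document_tag`, `document_id`, `tag_id`) and the class name `TagBatchWriter` are hypothetical; only the `PreparedStatement.addBatch()` and `executeBatch()` calls are standard JDBC. How much a batch actually helps depends on the driver and its settings, so treat this as a starting point rather than a tuned implementation.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

final class TagBatchWriter {

    // Hypothetical table and column names, for illustration only.
    private static final String INSERT_SQL =
            "INSERT INTO document_tag (document_id, tag_id) VALUES (?, ?)";

    // Inserts {documentId, tagId} pairs, sending them to the database in
    // groups of batchSize instead of one round trip per row.
    // Returns the number of batches sent.
    static int insertInBatches(Connection conn, List<long[]> pairs, int batchSize)
            throws SQLException {
        int batchesSent = 0;
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            int pending = 0;
            for (long[] pair : pairs) {
                ps.setLong(1, pair[0]);
                ps.setLong(2, pair[1]);
                ps.addBatch();
                if (++pending == batchSize) {
                    ps.executeBatch();
                    batchesSent++;
                    pending = 0;
                }
            }
            if (pending > 0) { // flush the final, partially filled batch
                ps.executeBatch();
                batchesSent++;
            }
        }
        return batchesSent;
    }
}
```

Calling `insertInBatches(conn, pairs, 500)` from a background thread would match the "500 at a time" suggestion above.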

Depending on how your data is organized, perhaps you would be able to use sharding.
If the read latency isn't low enough you can also try to add caching. Memcached is one popular solution.

Berkeley DB has a very high performance disk-based hash table that supports transactions, and integrates with a Java EE environment if you need that. If you're able to model the data as key/value pairs, this can be a very scalable solution.
http://www.oracle.com/technology/products/berkeley-db/je/index.html
(Note: Oracle bought Berkeley DB about 5-10 years ago; the original product has been around for 15-20 years).

Java Fast Data Storage & Retrieval

I need to store records in persistent storage and retrieve them on demand. The requirements are as follows:
Extremely fast retrieval and insertion
Each record will have a unique key. This key will be used to retrieve the record
The data stored should be persistent i.e. should be available upon JVM restart
A separate process would move stale records to RDBMS once a day
What do you guys think? I cannot use a standard database because of latency issues. Memory databases like HSQLDB/H2 have performance constraints. Moreover, the records are simple string objects and do not qualify for SQL. I am thinking of some kind of flat file based solution. Any ideas? Any open source project? I am sure there must be someone who has solved this problem before.

There are a lot of diverse tools and methods, but I think none of them can shine in all of the requirements.
For low latency, you can only rely on in-memory data access - disks are physically too slow (and SSDs too). If the data does not fit in the memory of a single machine, we have to distribute it over more nodes that together provide enough memory.
For persistency, we have to write our data to disk after all. Supposing optimal organization, this can be done as a background activity, not affecting latency.
However, for reliability (failover, HA or whatever), disk operations cannot be totally independent of the access methods: we have to wait for the disks when modifying data to make sure our operation will not disappear. Concurrency also adds some complexity and latency.
Data model is not restricting here: most of the methods support access based on a unique key.
We have to decide,
if data fits in the memory of one machine, or we have to find distributed solutions,
if concurrency is an issue, or there are no parallel operations,
if reliability is strict and we cannot lose modifications, or we can live with the fact that an unplanned crash would result in data loss.
Solutions might be
self implemented data structures using standard java library, files etc. may not be the best solution, because reliability and low latency require clever implementations and lots of testing,
Traditional RDBMSs have a flexible data model, durable, atomic and isolated operations, caching etc. - they actually know too much, and are mostly hard to distribute. That's why they are too slow if you cannot turn off the unwanted features, which is usually the case.
NoSQL and key-value stores are good alternatives. These terms are quite vague, and cover lots of tools. Examples are
BerkeleyDB or Kyoto Cabinet as one-machine persistent key-value stores (using B-trees): can be used if the data set is small enough to fit in the memory of one machine.
Project Voldemort as a distributed key-value store: uses BerkeleyDB java edition inside, simple and distributed,
ScalienDB as a distributed key-value store: reliable, but not too slow for writes either.
MemcacheDB, Redis and other caching databases with persistency,
popular NoSQL systems like Cassandra, CouchDB, HBase etc: used mainly for big data.
A list of NoSQL tools can be found eg. here.
Voldemort's performance tests report sub-millisecond response times, and these can be achieved quite easily, however we have to be careful with the hardware too (like the network properties mentioned above).

Have a look at LinkedIn's Voldemort.

If all the data fits in memory, MySQL can run in memory instead of from disk (MySQL Cluster, Hybrid Storage). It can then handle storing itself to disk for you.

What about something like CouchDB?

I would use a BlockingQueue for that. Simple, and built into Java.
I do something similar using realtime data from the Chicago Mercantile Exchange.
The data is sent to one place for realtime use... and to another place (via TCP),
using a BlockingQueue (Producer/Consumer) to persist the data to a database (Oracle,H2).
The Consumer uses a time delayed commit to avoid fdisk sync issues in the database.
(H2 type databases use asynchronous commit by default and avoid that issue)
I log the persisting in the Consumer to keep track of the queue size to be sure
it is able to keep up with the Producer. Works pretty well for me.
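A rough sketch of that producer/consumer arrangement is shown below. The class name `WriteBehindQueue`, the `sink` callback (standing in for the actual database write) and the 100 ms poll interval are assumptions for the example, not details from the setup described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class WriteBehindQueue implements AutoCloseable {

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Consumer<List<String>> sink; // e.g. a batched database insert
    private final int maxBatch;
    private final Thread consumer;
    private volatile boolean running = true;

    WriteBehindQueue(Consumer<List<String>> sink, int maxBatch) {
        this.sink = sink;
        this.maxBatch = maxBatch;
        this.consumer = new Thread(this::drainLoop, "write-behind-consumer");
        this.consumer.start();
    }

    // Producer side: returns immediately, so the realtime path never waits on the database.
    void submit(String record) {
        queue.add(record);
    }

    // Records still waiting; worth logging to see whether the consumer keeps up.
    int backlog() {
        return queue.size();
    }

    // Consumer side: take whatever has accumulated (up to maxBatch) and persist it in one go.
    private void drainLoop() {
        List<String> batch = new ArrayList<>();
        try {
            while (running || !queue.isEmpty()) {
                String first = queue.poll(100, TimeUnit.MILLISECONDS);
                if (first == null) {
                    continue;
                }
                batch.add(first);
                queue.drainTo(batch, maxBatch - 1);
                sink.accept(new ArrayList<>(batch));
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Stop the consumer once everything queued so far has been handed to the sink.
    @Override
    public void close() {
        running = false;
        try {
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Anything still in the queue is lost if the JVM crashes, so this only suits cases where losing the last few records is acceptable.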

MySQL with shards may be a good idea. However, it depends on the data volume, transactions per second and latency you need.
In memory databases are also a good idea. In fact MySQL provides memory-based tables as well.

Would a Tuple space / JavaSpace work? Also check out other enterprise data fabrics like Oracle Coherence and Gemstone.

MapDB provides highly performant HashMaps/TreeMaps that are persisted to disk. It's a single library that you can embed in your Java program.

Have you actually proved that using an out-of-process SQL database like MySQL or SQL Server is too slow, or is this an assumption?
You could use a SQL database approach in conjunction with an in-memory cache to ensure that retrievals do not hit the database at all. Despite the fact that the records are plaintext I would still advise using SQL over a flat file solution (e.g. using a text column in your table schema) as the RDBMS will perform optimisations that a file system cannot (e.g. caching recently accessed pages, etc).
However, without more information about your access patterns, expected throughput, etc. I can't provide much more in the way of suggestions.

If you are looking for a simple key-value store and don't need complex sql querying, Berkeley DB might be worth a look.
Another alternative is Tokyo Cabinet, a modern DBM implementation.

How bad would it be if you lose a couple of entries in case of a crash?
If it isn't that bad the following approach might work for you:
Create flat files for each entry, with the name of the file equal to the id. Possibly use one file for a not-so-big number of consecutive entries.
Make sure your controller has a good cache and/or use one of the existing caches implemented in Java.
Talk to a file system expert how to make this really fast
It is simple and it might be fast.
Of course you lose transactions including the ACID principles.

Sub-millisecond r/w means you cannot depend on disk, and you have to be careful about network latency. Just forget about standard SQL based solutions, main-memory or not. In one millisecond, you cannot get much more than about 100 KByte over a Gbit network. Ask a telecom engineer; they are used to solving these kinds of problems.

How much does it matter if you lose a record or two? Where are they coming from? Do you have a transactional relationship with the source?
If you have serious reliability requirements then I think you may need to be prepared to pay some DB Overhead.
Perhaps you could separate the persistence problem from the in-memory problem. Use a pub-sub approach: one subscriber looks after the in-memory copy, the other persists the data ready for the subsequent startup.
Distributed caching products such as WebSphere eXtreme Scale (no Java EE dependency) might be relevant if you can buy rather than build.

Chronicle Map is a ConcurrentMap implementation which stores keys and values off-heap, in a memory-mapped file. So you have persistence on JVM restart.
ChronicleMap.get() is consistently faster than 1 us, sometimes as fast as 100 ns / operation. It's the fastest solution in the class.

Will all the records and keys you need fit in memory at once? If so, you could just use a HashMap<String,String>, since it's Serializable.
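As a minimal sketch of that idea, the map can be written out and read back with standard Java serialization. The class name `MapSnapshot` is made up for the example, and the approach assumes it is acceptable to lose whatever changed since the last snapshot if the JVM crashes.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

final class MapSnapshot {

    // Write the whole map to disk, e.g. periodically or on shutdown.
    static void save(HashMap<String, String> map, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(map);
        }
    }

    // Read the map back after a JVM restart.
    @SuppressWarnings("unchecked")
    static HashMap<String, String> load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (HashMap<String, String>) in.readObject();
        }
    }
}
```

Note that `HashMap` itself is not thread-safe, so concurrent writers would need a `ConcurrentHashMap` (also `Serializable`) or external locking.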
