What are the differences between Dataframe, Dataset, and RDD in Apache Spark? - java

In Apache Spark, what are the differences between these APIs? Why and when should we choose one over the others?

First, let's define what Spark does.
Simply put, Spark executes operations on distributed data, so the operations themselves also need to be distributed. Some operations are simple, such as filtering out all items that don't respect some rule. Others are more complex, such as groupBy, which needs to move data around, and join, which needs to associate items from two or more datasets.
Another important fact is that input and output are stored in various formats, and Spark has connectors to read and write them. But that means serializing and deserializing the data. While transparent, serialization is often the most expensive operation.
Finally, Spark tries to keep data in memory for processing, but it will [ser/deser]ialize the data locally on each worker when it doesn't fit in memory. Once again, this is done transparently, but it can be costly. Interesting fact: even estimating the data size can take time.
The APIs
RDD
It's the first API provided by Spark. To put it simply, it is an unordered sequence of Scala/Java objects distributed over a cluster. All operations executed on it are JVM methods (passed to map, flatMap, groupBy, ...) that need to be serialized, sent to all workers, and applied to the JVM objects there. This is pretty much the same as using a Scala Seq, but distributed. It is strongly typed, meaning that "if it compiles then it works" (if you don't cheat). However, lots of distribution issues can arise, especially if Spark doesn't know how to [de]serialize the JVM classes and methods.
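A minimal Java sketch of this style, assuming a local master and an illustrative word list (none of these names come from the question):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import java.util.Arrays;
    import java.util.List;

    public class RddExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("rdd-example");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> words =
                        sc.parallelize(Arrays.asList("spark", "rdd", "dataframe", "dataset"));

                // Plain JVM lambdas: they are serialized and shipped to every worker.
                JavaRDD<Integer> lengths = words
                        .filter(w -> w.length() > 3)   // keep the "long" words
                        .map(String::length);          // typed transformation, checked at compile time

                List<Integer> result = lengths.collect();
                System.out.println(result);
            }
        }
    }

A typo inside the lambdas is caught by the compiler, which is the "if it compiles then it works" property mentioned above.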
Dataframe
It came after RDD and is semantically very different. The data are treated as tables, and SQL-like operations can be applied to them. It is not typed at all, so errors can arise at any time during execution. However, I see two pros: (1) many people are used to table/SQL semantics and operations, and (2) Spark doesn't need to deserialize the whole row to process one of its columns, provided the data format offers suitable column access. And many formats do, such as Parquet, the most commonly used file format.
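A minimal Dataframe (Dataset<Row>) sketch in Java; the Parquet path and column names are made up for illustration:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    import static org.apache.spark.sql.functions.col;

    public class DataFrameExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .master("local[*]").appName("dataframe-example").getOrCreate();

            // Columnar source: only the columns actually used need to be read and decoded.
            Dataset<Row> events = spark.read().parquet("events.parquet");

            Dataset<Row> perUser = events
                    .filter(col("status").equalTo("OK"))   // untyped: a typo in "status" only fails at runtime
                    .groupBy(col("userId"))
                    .count();

            perUser.show();
            spark.stop();
        }
    }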
Dataset
It is an improvement of Dataframe that brings some type safety. A Dataset is a dataframe to which we associate an "encoder" tied to a JVM class, so Spark can check that the data schema is correct before executing the code. Note that you can sometimes read that Datasets are strongly typed, but they are not: they bring some type safety, in that you cannot compile code that uses a Dataset with a type other than the one that was declared. But it is very easy to write code that compiles and still fails at runtime, because many Dataset operations lose the type (pretty much everything apart from filter). Still, it is a huge improvement, because even when we make a mistake, it fails fast: the failure happens when Spark interprets the DAG (i.e. at start) instead of during data processing.
Note: Dataframes are now simply untyped Datasets (Dataset<Row>).
Note 2: Datasets provide the main RDD API, such as map and flatMap. From what I know, these are shortcuts that convert to an RDD, apply map/flatMap, then convert back to a Dataset. That's practical, but it also hides the conversion, making it hard to realize that a possibly costly ser/deser-ialization happened.
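A small Java sketch of the Dataset API, using a hypothetical Event bean and an illustrative Parquet path, showing both the type safety you get and how quickly it is lost:

    import org.apache.spark.api.java.function.FilterFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import java.io.Serializable;

    public class DatasetExample {
        // Illustrative bean; the encoder derives the schema from its getters/setters.
        public static class Event implements Serializable {
            private String userId;
            private long amount;
            public String getUserId() { return userId; }
            public void setUserId(String userId) { this.userId = userId; }
            public long getAmount() { return amount; }
            public void setAmount(long amount) { this.amount = amount; }
        }

        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .master("local[*]").appName("dataset-example").getOrCreate();

            // The input schema is checked against Event when the plan is analyzed, i.e. at start.
            Dataset<Event> events = spark.read().parquet("events.parquet")
                    .as(Encoders.bean(Event.class));

            // filter keeps the static type...
            Dataset<Event> big = events.filter(
                    (FilterFunction<Event>) e -> e.getAmount() > 100);

            // ...but most relational operations fall back to an untyped Dataset<Row>.
            Dataset<Row> perUser = big.groupBy("userId").count();

            perUser.show();
            spark.stop();
        }
    }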
Pros and cons
Dataset:
pros: has operations optimized for column-oriented storage
pros: many operations don't need deserialization
pros: provides table/SQL semantics if you like them (I don't ;)
pros: Dataset operations come with an optimization engine, "Catalyst", that improves the performance of your code. I'm not sure, however, that it is really that great: if you know what you code, i.e. what is done to the data, your code should be optimized by itself.
cons: most operations lose typing
cons: Dataset operations can become too complicated for complex algorithms that don't suit them. The two main limits I know of are managing invalid data and complex mathematical algorithms.
Dataframe:
pros: required in between Dataset operations that lose the type (those return a Dataframe, i.e. Dataset<Row>)
cons: otherwise just use Dataset, it has all the advantages and more
RDD:
pros: (really) strongly typed
pros: Scala/Java semantics. You can design your code pretty much as you would for a single-JVM app that processes in-memory collections. Well, with functional semantics :)
cons: full JVM deserialization is required to process the data, at any of the steps mentioned before: after reading the input, and between all processing steps that require data to be moved between workers or stored locally to stay within memory bounds.
Conclusion
Just use Dataset by default:
read input with an Encoder; if the data format allows it, this validates the input schema at start
use Dataset operations, and when you lose the type, go back to a typed Dataset. Typically, use typed Datasets as the input and output of all methods.
There are cases where what you want to code would be too complex to express using Dataset operations. Most apps don't hit this, but it happens often in my work, where I implement complex mathematical models. In this case:
start with a Dataset
filter and shuffle (groupBy, join) the data as much as possible with Dataset ops
once you have only the required data and no longer need to move them around, convert to an RDD and apply your complex computation (see the sketch below)
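Here is a minimal sketch of that workflow; the Measure/Result beans and the "complex model" are made-up placeholders, not a definitive implementation:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.SparkSession;
    import java.io.Serializable;

    import static org.apache.spark.sql.functions.col;

    public class HybridPipeline {
        public static class Measure implements Serializable {
            private String sensorId; private double value;
            public String getSensorId() { return sensorId; }
            public void setSensorId(String s) { this.sensorId = s; }
            public double getValue() { return value; }
            public void setValue(double v) { this.value = v; }
        }

        public static class Result implements Serializable {
            private String sensorId; private double score;
            public String getSensorId() { return sensorId; }
            public void setSensorId(String s) { this.sensorId = s; }
            public double getScore() { return score; }
            public void setScore(double s) { this.score = s; }
        }

        public static Dataset<Result> run(SparkSession spark, Dataset<Measure> input) {
            // 1) Filter and shuffle with Dataset operations while columnar access still helps.
            Dataset<Measure> prepared = input
                    .filter("value >= 0")              // drop invalid readings (illustrative rule)
                    .repartition(col("sensorId"));

            // 2) Drop to the RDD API only for the part that doesn't fit the relational model.
            //    This is where the rows get deserialized into plain JVM objects.
            JavaRDD<Result> computed = prepared
                    .toJavaRDD()
                    .map(HybridPipeline::complexModel);

            // 3) Come back to a typed Dataset for the output.
            return spark.createDataset(computed.rdd(), Encoders.bean(Result.class));
        }

        private static Result complexModel(Measure m) {
            // placeholder for the complex mathematical model
            Result r = new Result();
            r.setSensorId(m.getSensorId());
            r.setScore(Math.log1p(Math.abs(m.getValue())));
            return r;
        }
    }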

In short:
RDDs come from the early versions of Spark. They are still used "under the hood" by Dataframes.
Dataframes were introduced in late Spark 1.x and really matured in Spark 2.x. They are the preferred storage now. In Java they are implemented as Dataset<Row>.
Datasets are the generic implementation: you could have, for example, a Dataset of your own Java class.
I use dataframes and highly recommend them: Spark's optimizer, Catalyst, understands Datasets (and hence Dataframes) better, and a Row is a better storage container than a pure JVM object. You will find a lot of blog posts (including Databricks') on the internals.

Related

How to process 1.64 million records in MongoDB?

I have a huge amount of data in MongoDB that I want to compare, but the process needs some logic on the application side to achieve the goal. I know that using Java for this case seems complicated in terms of memory usage.
USE CASE:
I want to compare two documents ("tables") in MongoDB that hold a large amount of data (>1.46M lines). We need an approach or some piece of logic that allows us to do it efficiently, without exhausting memory or hitting the worst cases.
I've tried to put the data into Java collections and, in order to find the discrepancies between them, implemented the comparison logic on the code side using threads, but I got a lot of memory leaks and the process takes a lot of time.

Spark Encoders: when to use beans()

I came across a memory management problem while using Spark's caching mechanism. I am currently utilizing Encoders with Kryo and was wondering if switching to beans would help me reduce the size of my cached dataset.
Basically, what are the pros and cons of using beans over Kryo serialization when working with Encoders? Are there any performance improvements? Is there a way to compress a cached Dataset apart from caching with SER option?
For the record, I have found a similar topic that tackles the comparison between the two. However, it doesn't go into the details of this comparison.
Whenever you can. Unlike generic binary encoders, which use general-purpose binary serialization and store whole objects as opaque blobs, Encoders.bean[T] leverages the structure of the object to provide a class-specific storage layout.
This difference becomes obvious when you compare the schemas created using Encoders.bean and Encoders.kryo.
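For example, with a small hypothetical Sensor bean (a sketch, not from the question), the two schemas look roughly like this:

    import org.apache.spark.sql.Encoders;
    import java.io.Serializable;

    public class EncoderSchemaDemo {
        // Hypothetical bean used only for illustration.
        public static class Sensor implements Serializable {
            private String id;
            private double value;
            public String getId() { return id; }
            public void setId(String id) { this.id = id; }
            public double getValue() { return value; }
            public void setValue(double value) { this.value = value; }
        }

        public static void main(String[] args) {
            // Bean encoder: one column per field, usable from the SQL/columnar API.
            Encoders.bean(Sensor.class).schema().printTreeString();
            // root
            //  |-- id: string (nullable = true)
            //  |-- value: double (nullable = false)

            // Kryo encoder: the whole object is stored as a single opaque binary column.
            Encoders.kryo(Sensor.class).schema().printTreeString();
            // root
            //  |-- value: binary (nullable = true)
        }
    }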
Why does it matter?
You get efficient field access using SQL API without any need for deserialization and full support for all Dataset transformations.
With transparent field serialization you can fully utilize columnar storage, including built-in compression.
So when should you use the kryo Encoder? In general, when nothing else works. Personally, I would avoid it completely for data serialization. The only really useful application I can think of is serialization of an aggregation buffer (see for example How to find mean of grouped Vector columns in Spark SQL?).

Non associative aggregations in apache spark streaming

I am trying to build a utility layer in Java over Apache Spark Streaming where users can aggregate data over a period of time (using window functions in Spark), but it seems all the available options require associative functions (taking two arguments). However, some fairly common use cases, like averaging temperature sensor values over an hour, don't seem possible with the Spark API.
Is there any alternative for achieving this kind of functionality? I am thinking of implementing repeated interactive queries to achieve it, but that would be too slow.
Statistical aggregates (average, variance) are actually associative and can be computed online; see here for a good numerical way of doing this.
In terms of the number of arguments, remember that the type of what you put in the arguments is your choice: you can nest several values into one argument using a Tuple, as in the sketch below.
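A sketch of that trick in Java, assuming the input arrives as "sensorId,temperature" lines on a local socket (all names here are illustrative): the non-associative average becomes an associative reduction over (sum, count) pairs.

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import scala.Tuple2;

    public class WindowedAverage {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("windowed-average");
            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));

            // Hypothetical input: lines of "sensorId,temperature" on a local socket.
            JavaPairDStream<String, Tuple2<Double, Long>> sumAndCount = ssc
                    .socketTextStream("localhost", 9999)
                    .mapToPair(line -> {
                        String[] parts = line.split(",");
                        // (sensorId, (temperature, count)); (sum, count) pairs combine associatively
                        return new Tuple2<>(parts[0],
                                new Tuple2<Double, Long>(Double.parseDouble(parts[1]), 1L));
                    })
                    .reduceByKeyAndWindow(
                            (a, b) -> new Tuple2<>(a._1 + b._1, a._2 + b._2), // merge partial sums and counts
                            Durations.minutes(60),   // window length: one hour
                            Durations.minutes(5));   // slide interval

            // Average = sum / count, computed once per window per key.
            sumAndCount.mapValues(p -> p._1 / p._2).print();

            ssc.start();
            ssc.awaitTermination();
        }
    }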
Finally, you can also use stateful information with something like updateStateByKey.

Bitcask ok for simple and high performant file store?

I am looking for a simple way to store and retrieve millions of xml files. Currently everything is done in a filesystem, which has some performance issues.
Our requirements are:
Ability to store millions of xml-files in a batch-process. XML files may be up to a few megs large, most in the 100KB-range.
Very fast random lookup by id (e.g. document URL)
Accessible by both Java and Perl
Available on the most important Linux-Distros and Windows
I did have a look at several NoSQL platforms (e.g. CouchDB, Riak and others), and while those systems look great, they seem almost like overkill:
No clustering required
No daemon ("service") required
No clever search functionality required
Having delved deeper into Riak, I have found Bitcask (see the intro), which seems like exactly what I want. The basics described in the intro are really intriguing. But unfortunately there is no way to access a Bitcask repo via Java (or is there?).
So my question boils down to:
is the following assumption right: the Bitcask model (append-only writes, in-memory key management) is the right way to store/retrieve millions of documents
are there any viable alternatives to Bitcask available via Java? (BerkeleyDB comes to mind...)
(for Riak specialists) Is Riak much overhead implementation-, management- and resource-wise compared to "naked" Bitcask?
I don't think that Bitcask is going to work well for your use case. It looks like the Bitcask model is designed for use cases where the size of each value is relatively small.
The problem is Bitcask's data file merging process. This involves copying all of the live values from a number of "older data files" into the "merged data file". If you've got millions of values in the region of 100KB each, that is an insane amount of data copying.
Note that the above assumes the XML documents are updated relatively frequently. If updates are rare and/or you can cope with a significant amount of wasted space, then merging may only need to be done rarely, or not at all.
Bitcask can be appropriate for this case (large values), depending on whether or not there is a great deal of overwriting. In particular, there is no reason to merge files unless there is a great deal of wasted space, which only occurs when new values arrive with the same key as old values.
Bitcask is particularly good for this batch-load case, as it will sequentially write the incoming data stream straight to disk. Lookups will take one seek in most cases, although the file cache will help you if there is any temporal locality.
I am not sure about the status of a Java version/wrapper.

Java: Serializing a huge amount of data to a single file

I need to serialize a huge amount of data (around 2 GB) of small objects into a single file, to be processed later by another Java process. Performance is kind of important. Can anyone suggest a good method to achieve this?
Have you taken a look at Google's protocol buffers? Sounds like a use case for them.
I don't know why Java Serialization got voted down, it's a perfectly viable mechanism.
It's not clear from the original post, but is all 2G of data in the heap at the same time? Or are you dumping something else?
Out of the box, serialization isn't the "perfect" solution, but if you implement Externalizable on your objects, serialization can work just fine. Serialization's big expense is figuring out what to write and how to write it. By implementing Externalizable, you take those decisions out of its hands, thus gaining quite a boost in performance and a space saving.
While I/O is a primary cost of writing large amounts of data, the incidental costs of converting the data can also be very expensive. For example, you don't want to convert all of your numbers to text and then back again; it is better to store them in a more native format if possible. The ObjectStream classes have methods to read/write the native Java types.
If all of your data is designed to be loaded in to a single structure, you could simply do ObjectOutputStream.writeObject(yourBigDatastructure), after you've implemented Externalizable.
However, you could also iterate over your structure and call writeObject on the individual objects.
Either way, you're going to need some "objectToFile" routine, perhaps several. And that's effectively what Externalizable provides, as well as a framework to walk your structure.
The other issue, of course, is versioning, etc. But since you implement all of the serialization routines yourself, you have full control over that as well.
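As a rough sketch of that approach (the class name and fields are made up for illustration), an Externalizable record writes exactly its own fields and nothing else:

    import java.io.BufferedOutputStream;
    import java.io.Externalizable;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectInput;
    import java.io.ObjectOutput;
    import java.io.ObjectOutputStream;

    public class Measurement implements Externalizable {
        private long timestamp;
        private double value;

        public Measurement() { }                       // public no-arg constructor required by Externalizable
        public Measurement(long timestamp, double value) {
            this.timestamp = timestamp;
            this.value = value;
        }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeLong(timestamp);                  // native types only, no field names or reflection
            out.writeDouble(value);
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            timestamp = in.readLong();
            value = in.readDouble();
        }

        public static void main(String[] args) throws IOException {
            try (ObjectOutputStream oos = new ObjectOutputStream(
                    new BufferedOutputStream(new FileOutputStream("data.ser")))) {
                for (int i = 0; i < 1_000_000; i++) {
                    oos.writeObject(new Measurement(i, i * 0.5));
                    oos.reset();                       // keep the stream's back-reference table from growing unbounded
                }
            }
        }
    }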
The simplest approach coming immediately to my mind is using NIO's memory-mapped buffers (java.nio.MappedByteBuffer). Use a single buffer (approximately) corresponding to the size of one object and flush/append buffers to the output file when necessary. Memory-mapped buffers are very efficient.
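A minimal sketch of the memory-mapped idea; for simplicity it maps the whole output region once and writes fixed-size (id, value) records, rather than one buffer per object as suggested above:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class MappedWriteSketch {
        public static void main(String[] args) throws IOException {
            int records = 1_000_000;
            int recordSize = Long.BYTES + Double.BYTES;   // assumed fixed-size record: (id, value)

            try (FileChannel channel = FileChannel.open(Path.of("data.bin"),
                    StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                // Map the whole target region once; the OS pages it out to disk lazily.
                MappedByteBuffer buffer = channel.map(
                        FileChannel.MapMode.READ_WRITE, 0, (long) records * recordSize);

                for (long i = 0; i < records; i++) {
                    buffer.putLong(i);          // record id
                    buffer.putDouble(i * 0.5);  // record payload
                }
                buffer.force();                 // flush mapped pages to disk
            }
        }
    }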
Have you tried Java serialization? You would write the objects out using an ObjectOutputStream and read them back in using an ObjectInputStream. Of course the classes would have to be Serializable. It would be the low-effort solution and, because the objects are stored in binary, it would be compact and fast.
I developed JOAFIP as a database alternative.
Apache Avro might also be useful. It's designed to be language independent and has bindings for the popular languages.
Check it out.
Protocol buffers: makes sense. Here's an excerpt from their wiki: http://code.google.com/apis/protocolbuffers/docs/javatutorial.html
Getting More Speed
By default, the protocol buffer compiler tries to generate smaller files by using reflection to implement most functionality (e.g. parsing and serialization). However, the compiler can also generate code optimized explicitly for your message types, often providing an order of magnitude performance boost, but also doubling the size of the code. If profiling shows that your application is spending a lot of time in the protocol buffer library, you should try changing the optimization mode. Simply add the following line to your .proto file:
option optimize_for = SPEED;
Re-run the protocol compiler, and it will generate extremely fast parsing, serialization, and other code.
You should probably consider a database solution: all a database does is optimize the storage of your information, and if you use Hibernate, you keep your object model as is and don't really even have to think about your DB (I believe that's why it's called Hibernate: just store your data off, then bring it back).
If performance is very important then you need to write it yourself. You should use a compact binary format, because with 2 GB the disk I/O operations matter a lot. If you use any human-readable format like XML or other text formats, you inflate the data by a factor of 2 or more.
Depending on the data, it can also be sped up if you compress the data on the fly with a low compression rate.
A total no-go is Java serialization, because on reading, Java checks for every object whether it is a reference to an existing object.
