Spark Encoders: when to use beans()

I came across a memory management problem while using Spark's caching mechanism. I am currently utilizing Encoders with Kryo and was wondering if switching to beans would help me reduce the size of my cached dataset.
Basically, what are the pros and cons of using beans over Kryo serialization when working with Encoders? Are there any performance improvements? Is there a way to compress a cached Dataset apart from caching with SER option?
For the record, I have found a similar topic that tackles the comparison between the two. However, it doesn't go into the details of this comparison.

Whenever you can. Unlike generic binary Encoders, which use general-purpose binary serialization and store whole objects as opaque blobs, Encoders.bean[T] leverages the structure of an object to provide a class-specific storage layout.
This difference becomes obvious when you compare the schemas created using Encoders.bean and Encoders.kryo: the bean encoder yields one typed column per bean property, whereas the Kryo encoder yields a single binary column (named value) holding the whole serialized object.
Why does it matter?
You get efficient field access using SQL API without any need for deserialization and full support for all Dataset transformations.
With transparent field serialization you can fully utilize columnar storage, including built-in compression.
So when should you use the kryo Encoder? In general, when nothing else works. Personally, I would avoid it completely for data serialization. The only really useful application I can think of is serialization of an aggregation buffer (see, for example, How to find mean of grouped Vector columns in Spark SQL?).

Related

What are the differences between Dataframe, Dataset, and RDD in Apache Spark?

In Apache Spark, what are the differences between those API? Why and when should we choose one over the others?

First, let's define what Spark does.
Simply put, it executes operations on distributed data. Thus, the operations also need to be distributed. Some operations are simple, such as filtering out all items that don't satisfy some rule. Others are more complex, such as groupBy, which needs to move data around, and join, which needs to associate items from 2 or more datasets.
Another important fact is that input and output are stored in different formats, and Spark has connectors to read and write them. But that means serializing and deserializing the data. While transparent, serialization is often the most expensive operation.
Finally, Spark tries to keep data in memory for processing, but it will serialize and deserialize data locally on each worker when the data doesn't fit in memory. Once again, this is done transparently but can be costly. Interesting fact: estimating the data size can itself take time.
The APIs
RDD
It's the first API provided by Spark. To put it simply, it is an unordered sequence of Scala/Java objects distributed over a cluster. All operations executed on it are JVM methods (passed to map, flatMap, groupBy, ...) that need to be serialized, sent to all workers, and applied to the JVM objects there. This is pretty much the same as using a Scala Seq, but distributed. It is strongly typed, meaning that "if it compiles then it works" (if you don't cheat). However, lots of distribution issues can arise, especially if Spark doesn't know how to serialize and deserialize the JVM classes and methods.
Dataframe
It came later and is semantically very different from RDD. The data is treated as tables, and SQL-like operations can be applied to it. It is not typed at all, so errors can arise at any time during execution. However, I think there are 2 pros: (1) many people are used to table/SQL semantics and operations, and (2) Spark doesn't need to deserialize a whole row to process one of its columns, if the data format provides suitable column access. And many do, such as Parquet, the most commonly used file format.
Dataset
It is an improvement on Dataframe that brings some type safety. A Dataset is a dataframe with which we associate an "encoder" tied to a JVM class, so Spark can check that the data schema is correct before executing the code. Note, however, that although you can sometimes read that Datasets are strongly typed, they are not: they bring some type safety, in that you cannot compile code that uses a Dataset with a type other than the one declared. But it is very easy to write code that compiles and still fails at runtime, because many Dataset operations lose the type (pretty much everything apart from filter). Still, it is a huge improvement, because even when we make a mistake, it fails fast: the failure happens when Spark interprets the DAG (i.e. at start) instead of during data processing.
Note: Dataframes are now simply untyped Datasets (Dataset<Row>).
Note 2: Dataset provides the main API of RDD, such as map and flatMap. From what I know, it is a shortcut for converting to an RDD, applying map/flatMap, then converting back to a Dataset. It's practical, but it also hides the conversion, making it difficult to realize that possibly costly serialization and deserialization happened.
Pros and cons
Dataset:
pros: has optimized operations over column-oriented storage
pros: many operations also don't need deserialization
pros: provides table/SQL semantics if you like them (I don't ;)
pros: Dataset operations come with an optimization engine, "Catalyst", that improves the performance of your code. I'm not sure, however, that it is really that great. If you know what your code does, i.e. what is done to the data, your code should already be optimized.
cons: most operations lose typing
cons: Dataset operations can become too complicated for complex algorithms that don't suit them. The 2 main limits I know of are handling invalid data and complex math algorithms.
Dataframe:
pros: required between Dataset operations that lose the type
cons: just use Dataset; it has all the advantages and more
RDD:
pros: (really) strongly typed
pros: Scala/Java semantics. You can design your code pretty much as you would for a single-JVM app that processes in-memory collections. Well, with functional semantics :)
cons: full JVM deserialization is required to process data at every step mentioned before: after reading input, and between all processing steps that require data to be moved between workers or stored locally to stay within memory bounds.
Conclusion
Just use Dataset by default:
read input with an Encoder; if the data format allows it, the input schema is validated at start
use Dataset operations, and when you lose the type, go back to a typed Dataset. Typically, use typed Datasets as the input and output of all methods.
There are cases where what you want to code would be too complex to express using Dataset operations. Most apps don't hit them, but it often happens in my work, where I implement complex mathematical models. In this case:
start with a Dataset
filter and shuffle (groupBy, join) data as much as possible with Dataset ops
once you have only the required data and no longer need to move it, convert to an RDD and apply your complex computation.

In short:
RDDs are coming from the early versions of Spark. Still used "under the hood" by the Dataframes.
Dataframes were introduced in late Spark 1.x and really matured in Spark 2.x. They are the preferred storage now. They are implemented as a Dataset<Row> in Java.
Datasets are the generic implementation, as you could have a Dataset for example.
I use dataframes and highly recommend them: Spark's optimizer, Catalyst, understands datasets (and as such, dataframes) better, and the Row is a better storage container than a pure JVM object. You will find a lot of blog posts (including Databricks') on the internals.

File backed Java map

Is there a simple way of having a file backed Map?
The contents of the map are updated regularly, with some being deleted, as well as some being added. To keep the data that is in the map safe, persistence is needed. I understand a database would be ideal, but sadly due to constraints a database can't be used.
I have tried:
Writing the whole contents of the map to file each time it gets updated. This worked, but obviously has the drawback that the whole file is rewritten each time, the contents of the map are expected to be anywhere from a couple of entries to ~2000. There are also some concurrency issues (i.e. writing out of order results in loss of data).
Using a RandomAccessFile and keeping a pointer to each file's start byte so that each entry can be looked up using seek(). Again this had a similar issue as before, changing an entry would involve updating all of the references after it.
Ideally, the solution would involve some sort of caching, so that only the most recently accessed entries are kept in memory.
Is there such a thing? Or is it available via a third party jar? Someone suggested Oracle Coherence, but I can't seem to find much on how to implement that, and it seems a bit like using a sledgehammer to crack a nut.

You could look into MapDB which was created with this purpose in mind.
MapDB provides concurrent Maps, Sets and Queues backed by disk storage
or off-heap-memory. It is a fast and easy to use embedded Java
database engine.

Yes, Oracle Coherence can do all of that, but it may be overkill if that's all you're doing.
One way to do this is to "overflow" from RAM to disk:
BinaryStore diskstore = new BerkeleyDBBinaryStore("mydb", ...);
SimpleSerializationMap mapDisk = new SimpleSerializationMap(diskstore);
LocalCache mapRAM = new LocalCache(100 * 1024 * 1024); // 100MB in RAM
OverflowMap cache = new OverflowMap(mapRAM, mapDisk);
Starting in version 3.7, you can also transparently overflow from RAM journal to flash journal. While you can configure it in code (as per above), it's generally just a line or two of config and then you ask for the cache to be configured on your behalf, e.g.
// simplest example; you'd probably use a builder pattern or a configurable cache factory
NamedCache cache = CacheFactory.getCache("mycache");
For more information, see the doc available from http://coherence.oracle.com/
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.

jdbm2 looks promising, never used it but it seems to be a candidate to meet your requirements:
JDBM2 provides HashMap and TreeMap which are backed by disk storage. It is very easy and fast way to persist your data. JDBM2 also have minimal hardware requirements and is highly embeddable (jar have only 145 KB).
You'll find many more solutions if you look for key/value stores.

Key-Value Database with Java client

I basically want to store a hashtable on disk so I can query it later. My program is written in Java.
The hashtable maps from String to List.
There are a lot of key-value stores out there, but after doing a lot of research/reading, it's not clear which one is the best for my purposes. Here are some things that are important to me.
Simple key-value store which allows you to retrieve a value with a single key.
Good Java client that is documented well.
Dataset is small and there is no need for advanced features. Again, I want it to be simple.
I have looked into Redis and MongoDB. Both look promising but not ideal for my purposes.
Any info would be appreciated.

If your dataset is small and you want it to be SIMPLE, why don't you serialize your hashmap to a file or RDBMS and load it in your application?
How do you want to "query" your hashmap? Key approximation? Value 'likeness'? I don't know, it seems overkill to me to maintain a key-value store just for the sake of it.
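The "serialize your hashmap to a file" suggestion could look roughly like the sketch below. It relies only on built-in Java serialization; the class name MapFileStore, the List<String> value type, and the temp-file name are assumptions made for illustration and do not come from the question.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: persists a small String -> List<String> table as one file.
class MapFileStore {

    // Writes the whole map, replacing any previous file content.
    static void save(Map<String, List<String>> map, Path file) throws IOException {
        // Copy into HashMap/ArrayList so the written object graph is known to be Serializable.
        HashMap<String, ArrayList<String>> copy = new HashMap<>();
        map.forEach((key, values) -> copy.put(key, new ArrayList<>(values)));
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(copy);
        }
    }

    // Reads the map back; the cast is unchecked because the stream only yields Object.
    @SuppressWarnings("unchecked")
    static Map<String, List<String>> load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (Map<String, List<String>>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("table", ".ser");
        Map<String, List<String>> table = new HashMap<>();
        table.put("fruits", List.of("apple", "pear"));
        save(table, file);
        System.out.println(load(file).get("fruits"));
        Files.delete(file);
    }
}
```

This rewrites the whole file on every save, which is usually acceptable for a small dataset but not for one that changes often.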

What you are looking for is a library that supports object prevalence. These libraries are designed to be simple and fast, providing a collection-like API. Below are a few such libraries that let you work with collections but use disk storage behind the scenes.
space4j
Advagato
Prevayler

Before providing any sort of answers, I'd start by asking myself why do I need to store this hashtable on disk as according to your description the data set is small and so I assume it can fit into memory. If it is just to be able to reuse this structure after restarting your application, then you can probably use any sort of format to persist it.
Second, you don't provide any reasons for Redis or MongoDB not being ideal. Based on your (short) 3 requirements, I would have said Redis is probably your best bet:
good Java clients
not only able to store lists, but also supports operations on the list values (so data is not opaque)
The only reason I can think of for eliminating Redis is that you are looking for strict ACID characteristics. If that's what you are looking for, then you could take a look at Berkeley DB JE. It has been around for a while and the documentation is good.

Check out JDBM2 - http://code.google.com/p/jdbm2/
I worked on the JDBM 1 code base, and have been impressed with what I've seen in jdbm2

Chronicle Map should be a perfect fit. It's an embeddable key-value store written in pure Java, so it acts as the best possible "client" (though actually there is no "client" or "server": you just open your database and have full in-process read/update access to it).
Chronicle Map resides in a single file. This file can be moved around the filesystem, and even sent to another machine with a different OS and/or architecture, and still be opened as a Chronicle Map database.
To create or open a data store (if the database file is non-existent, it is created, otherwise an existing store is accessed):
ChronicleMap<String, List<Point>> map = ChronicleMap
.of(String.class, (Class<List<Point>>) (Class) List.class)
.averageKey("range")
.averageValue(asList(of(0, 0), of(1, 1)))
.entries(10_000)
.createPersistedTo(myDatabaseFile);
Then you can work with the created ChronicleMap object just as with a simple HashMap, without worrying about key and value serialization.

Are there any decent on-disk implementations of Java's Map?

I'm looking for an on-disk implementation of java.util.Map. Nothing too fancy, just something that I can point at a directory or file and have it store its contents there, in some way it chooses. Does anyone know of such a thing?

You could have a look at the Disk-Backed-map project.
A library that implements a disk backed map in Java
A small library that provides a disk-backed map implementation for storing a large number of key/value pairs. The map implementations (HashMap, HashTable) max out around 3-4 million keys/GB of memory for very simple key/value pairs, and in most cases the limit is much lower. DiskBacked map, on the other hand, can store between 16 million (64-bit JVM) and 20 million (32-bit JVM) keys/GB, regardless of the size of the key/value pairs.

If you are looking for key-object based structures to persist data, then NoSQL databases are a very good choice. You'll find that some of them, such as MongoDB or Redis, scale and perform well for big datasets, and apart from hash-based lookups they provide interesting query and transactional features.
In essence, these types of systems are a Map implementation, and it shouldn't be too complicated to write your own adapter that implements java.util.Map to bridge them.
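An adapter of that kind might be sketched as follows, by extending AbstractMap. The KeyValueClient interface and the InMemoryClient stand-in are purely hypothetical; a real Redis or MongoDB client exposes different calls, so treat this only as an outline of the bridging idea.

```java
import java.util.AbstractMap;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical, deliberately tiny client API; real store clients look different.
interface KeyValueClient {
    String get(String key);              // returns null when the key is absent
    void set(String key, String value);
    void delete(String key);
    Set<String> keys();
}

// Stand-in "store" so the sketch runs without any server.
class InMemoryClient implements KeyValueClient {
    private final Map<String, String> data = new TreeMap<>();
    public String get(String key) { return data.get(key); }
    public void set(String key, String value) { data.put(key, value); }
    public void delete(String key) { data.remove(key); }
    public Set<String> keys() { return new LinkedHashSet<>(data.keySet()); }
}

// Bridges the client to java.util.Map; point lookups go straight to the store.
class KeyValueMapAdapter extends AbstractMap<String, String> {
    private final KeyValueClient client;

    KeyValueMapAdapter(KeyValueClient client) { this.client = client; }

    @Override
    public String get(Object key) {
        return key instanceof String ? client.get((String) key) : null;
    }

    @Override
    public String put(String key, String value) {
        String previous = client.get(key);
        client.set(key, value);
        return previous;
    }

    @Override
    public String remove(Object key) {
        if (!(key instanceof String)) return null;
        String previous = client.get((String) key);
        client.delete((String) key);
        return previous;
    }

    @Override
    public boolean containsKey(Object key) { return get(key) != null; }

    // Read-only snapshot of the store; AbstractMap derives size(), keySet(), toString() from it.
    // Bulk mutators such as clear() are therefore unsupported in this sketch.
    @Override
    public Set<Map.Entry<String, String>> entrySet() {
        Set<Map.Entry<String, String>> snapshot = new LinkedHashSet<>();
        for (String key : client.keys()) {
            snapshot.add(new SimpleImmutableEntry<>(key, client.get(key)));
        }
        return Collections.unmodifiableSet(snapshot);
    }

    public static void main(String[] args) {
        Map<String, String> map = new KeyValueMapAdapter(new InMemoryClient());
        map.put("a", "1");
        map.put("b", "2");
        map.remove("a");
        System.out.println(map); // prints {b=2}
    }
}
```

Building the entry set as a snapshot keeps the adapter short, at the cost of one store call per key whenever the map is iterated or sized.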

MapDB (mapdb.org) does exactly what you are looking for. Besides disk backed TreeMap and HashMap it gives you other collection types.
Its maps are also thread-safe and have really good performance.

Chronicle Map is a modern and the fastest solution to this problem. It implements ConcurrentMap interface and persists the data to disk (under the hood, it is done by mapping Chronicle Map's memory to a file).

You could use a simple EHCache implementation? The nice thing about EHCache being that it can be very simple to implement :-)
I take it you've ruled out serialising / deserialising an actual Map instance?

This seems like a relatively new open source solution to the problem, I've used it, and like it so far
https://github.com/jankotek/JDBM4

Java: Serializing a huge amount of data to a single file

I need to serialize a huge amount of data (around 2gigs) of small objects into a single file in order to be processed later by another Java process. Performance is kind of important. Can anyone suggest a good method to achieve this?

Have you taken a look at google's protocol buffers? Sounds like a use case for it.

I don't know why Java Serialization got voted down, it's a perfectly viable mechanism.
It's not clear from the original post, but is all 2G of data in the heap at the same time? Or are you dumping something else?
Out of the box, Serialization isn't the "perfect" solution, but if you implement Externalizable on your objects, Serialization can work just fine. Serialization's big expense is figuring out what to write and how to write it. By implementing Externalizable, you take those decisions out of its hands, thus gaining quite a boost in performance as well as space savings.
While I/O is a primary cost of writing large amounts of data, the incidental costs of converting the data can also be very expensive. For example, you don't want to convert all of your numbers to text and then back again; it is better to store them in a more native format if possible. ObjectOutputStream and ObjectInputStream have methods to write and read Java's primitive types.
If all of your data is designed to be loaded in to a single structure, you could simply do ObjectOutputStream.writeObject(yourBigDatastructure), after you've implemented Externalizable.
However, you could also iterate over your structure and call writeObject on the individual objects.
Either way, you're going to need some "objectToFile" routine, perhaps several. And that's effectively what Externalizable provides, as well as a framework to walk your structure.
The other issue, of course, is versioning, etc. But since you implement all of the serialization routines yourself, you have full control over that as well.
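As a rough illustration of the Externalizable approach described above, here is a minimal sketch that writes objects one at a time with writeObject. The Sample class, its two fields, and the SampleFile helper are hypothetical, since the question does not describe the real objects; the periodic reset() assumes the objects do not reference one another.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical small object; the real classes are not given in the question.
class Sample implements Externalizable {
    private static final long serialVersionUID = 1L;
    long id;
    double value;

    public Sample() { }   // Externalizable requires a public no-arg constructor

    Sample(long id, double value) { this.id = id; this.value = value; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeLong(id);        // primitives are written in binary, not as text
        out.writeDouble(value);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        id = in.readLong();       // must read in the same order as written
        value = in.readDouble();
    }
}

class SampleFile {
    static void writeAll(List<Sample> samples, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(samples.size());
            int written = 0;
            for (Sample s : samples) {
                out.writeObject(s);
                // Assumption: samples are independent, so the stream's table of
                // already-written objects can be cleared periodically to bound memory.
                if (++written % 10_000 == 0) out.reset();
            }
        }
    }

    static List<Sample> readAll(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            int count = in.readInt();
            List<Sample> samples = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                samples.add((Sample) in.readObject());
            }
            return samples;
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("samples", ".bin");
        writeAll(List.of(new Sample(1, 0.5), new Sample(2, 1.5)), file);
        System.out.println(readAll(file).size());
        Files.delete(file);
    }
}
```

Versioning could be handled in the same two methods, for example by writing a format number first and branching on it in readExternal.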

The simplest approach that comes immediately to mind is using a memory-mapped NIO buffer (java.nio.MappedByteBuffer). Use a single buffer (approximately) corresponding to the size of one object and flush/append the objects to the output file when necessary. Memory-mapped buffers are very efficient.
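A minimal sketch of the memory-mapped idea follows. It differs slightly from the description above in that it maps one region covering all records instead of one buffer per object, and it assumes a hypothetical fixed-size record (a long id plus a double value). A single mapping cannot exceed Integer.MAX_VALUE bytes, so a file around 2 GB would have to be mapped in several windows.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical fixed-size record layout: 8-byte id followed by 8-byte value.
class MappedRecordFile {
    static final int RECORD_BYTES = Long.BYTES + Double.BYTES;

    // Writes all records through a read-write mapping of the file.
    static void write(Path file, long[] ids, double[] values) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long size = (long) ids.length * RECORD_BYTES;
            // Mapping in READ_WRITE mode grows the file to the requested size.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
            for (int i = 0; i < ids.length; i++) {
                buffer.putLong(ids[i]);
                buffer.putDouble(values[i]);
            }
            buffer.force(); // ask the OS to flush the mapped pages to disk
        }
    }

    // Reads the records back through a read-only mapping and sums the values.
    static double sumValues(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            double sum = 0;
            while (buffer.remaining() >= RECORD_BYTES) {
                buffer.getLong();           // skip the id
                sum += buffer.getDouble();
            }
            return sum;
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("records", ".bin");
        file.toFile().deleteOnExit();
        write(file, new long[] {1, 2, 3}, new double[] {0.5, 1.5, 2.0});
        System.out.println(sumValues(file));
    }
}
```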

Have you tried java serialization? You would write them out using an ObjectOutputStream and read 'em back in using an ObjectInputStream. Of course the classes would have to be Serializable. It would be the low effort solution and, because the objects are stored in binary, it would be compact and fast.

I developed JOAFIP as a database alternative.

Apache Avro might also be useful. It's designed to be language independent and has bindings for the popular languages.
Check it out.

protocol buffers : makes sense. here's an excerpt from their wiki : http://code.google.com/apis/protocolbuffers/docs/javatutorial.html
Getting More Speed
By default, the protocol buffer compiler tries to generate smaller files by using reflection to implement most functionality (e.g. parsing and serialization). However, the compiler can also generate code optimized explicitly for your message types, often providing an order of magnitude performance boost, but also doubling the size of the code. If profiling shows that your application is spending a lot of time in the protocol buffer library, you should try changing the optimization mode. Simply add the following line to your .proto file:
option optimize_for = SPEED;
Re-run the protocol compiler, and it will generate extremely fast parsing, serialization, and other code.

You should probably consider a database solution--all databases do is optimize their information, and if you use Hibernate, you keep your object model as is and don't really even think about your DB (I believe that's why it's called hibernate, just store your data off, then bring it back)

If performance is very important, then you need to write it yourself. You should use a compact binary format, because with 2 GB the disk I/O matters a great deal. If you use a human-readable format like XML or another text format, you inflate the data by a factor of 2 or more.
Depending on the data, it can be sped up further if you compress the data on the fly with a low compression level.
Java serialization is a total no-go, because on reading, Java checks for every object whether it is a reference to an existing object.
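A compact binary format compressed on the fly at a low level could be sketched like this, using DataOutputStream over a DeflaterOutputStream set to Deflater.BEST_SPEED. The int-array payload and the class name CompressedIntFile are assumptions for illustration; whether compression actually helps depends on the data and should be measured.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical format: a count followed by that many 4-byte ints, zlib-compressed.
class CompressedIntFile {

    static void write(Path file, int[] values) throws IOException {
        Deflater fast = new Deflater(Deflater.BEST_SPEED); // lowest-effort compression level
        try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(
                new DeflaterOutputStream(Files.newOutputStream(file), fast)))) {
            out.writeInt(values.length);
            for (int v : values) {
                out.writeInt(v);   // binary, 4 bytes per value before compression
            }
        } finally {
            fast.end();            // a caller-supplied Deflater is not released by close()
        }
    }

    static int[] read(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(
                new InflaterInputStream(Files.newInputStream(file))))) {
            int[] values = new int[in.readInt()];
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readInt();
            }
            return values;
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("ints", ".bin.z");
        int[] data = new int[100_000];
        for (int i = 0; i < data.length; i++) data[i] = i % 100;
        write(file, data);
        System.out.println("bytes on disk: " + Files.size(file)
                + ", values read back: " + read(file).length);
        Files.delete(file);
    }
}
```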
