I am working on an application that processes (and merges) several large Java serialized objects (on the order of GBs) using the Hadoop framework. Hadoop distributes the blocks of a file across different hosts, but since deserialization requires all the blocks to be present on a single host, this is going to hit performance drastically. How can I deal with a situation where the different blocks cannot be processed individually, unlike text files?
There are two issues: one is that each file must (in the initial stage) be processed as a whole: the mapper that sees the first byte must handle all the rest of that file. The other problem is locality: for best efficiency, you would like all the blocks of each such file to reside on the same host.
Processing files in whole:
One simple trick is to have the first-stage mapper process a list of filenames, not their contents. If you want 50 map jobs to run, make 50 files, each with that fraction of the filenames. This is easy and works with Java or streaming Hadoop.
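A rough sketch of that trick, assuming the newer org.apache.hadoop.mapreduce API (the class name and output types are only illustrative): the mapper's input records are HDFS paths, one per line, and the mapper opens and consumes each file itself, so a whole file is always handled by exactly one task.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: a text file whose lines are HDFS paths; each map task gets a slice of the list.
public class FileListMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text fileName, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        Path path = new Path(fileName.toString().trim());
        FileSystem fs = path.getFileSystem(conf);
        try (FSDataInputStream in = fs.open(path)) {
            // Deserialize / merge the whole object graph from 'in' here,
            // then emit whatever intermediate records the next stage needs.
            context.write(fileName, new Text("processed"));
        }
    }
}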
Alternatively, use a non-splittable input format such as NonSplitableTextInputFormat.
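If you go the input-format route, a minimal sketch against the newer org.apache.hadoop.mapreduce API is a TextInputFormat subclass that simply refuses to split (the class name here is illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Every input file becomes exactly one split, so one mapper sees it end to end.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}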
For more details, see "How do I process files, one per map?" and "How do I get each of my maps to work on one complete input-file?" on the hadoop wiki.
Locality:
This leaves a problem, however: the blocks you are reading from are distributed all across HDFS, which is normally a performance gain but here a real problem. I don't believe there is any way to constrain certain blocks to travel together in HDFS.
Is it possible to place the files in each node's local storage? This is actually the most performant and easiest way to solve this: have each machine start jobs to process all the files in e.g. /data/1/**/*.data (being as clever as you care to be about efficiently using local partitions and number of CPU cores).
If the files originate from a SAN or from, say, S3 anyway, try just pulling from there directly: it's built to handle the swarm.
A note on using the first trick: If some of the files are much larger than others, put them alone in the earliest-named listing, to avoid issues with speculative execution. You might turn off speculative execution for such jobs anyway if the tasks are dependable and you don't want some batches processed multiple times.
It sounds like your input file is one big serialized object. Is that the case? Could you make each item its own serialized value with a simple key?
For example, if you wanted to use Hadoop to parallelize the resizing of images, you could serialize each image individually and give it a simple index key. Your input file would be a text file with key-value pairs: the index key, and then the serialized blob as the value.
I use this method when doing simulations in Hadoop. My serialized blob is all the data needed for the simulation and the key is simply an integer representing a simulation number. This allows me to use Hadoop (in particular Amazon Elastic Map Reduce) like a grid engine.
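As a hedged illustration of that layout, here is a sketch that writes numbered blobs into a Hadoop SequenceFile; serializeItem() is a placeholder for whatever per-item serialization you use (an image, a simulation input, etc.). Each map task then receives independent (key, blob) records, so no single host ever needs the whole data set.

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;

public class BlobWriter {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path out = new Path(args[0]);
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (int i = 0; i < 1000; i++) {
                byte[] blob = serializeItem(i);            // one item per record
                writer.append(new IntWritable(i), new BytesWritable(blob));
            }
        }
    }

    // Placeholder: serialize one image / simulation input / object here.
    private static byte[] serializeItem(int i) {
        return ("item-" + i).getBytes(StandardCharsets.UTF_8);
    }
}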
I think the basic (unhelpful) answer is that you can't really do this, since this runs directly counter to the MapReduce paradigm. Units of input and output for mappers and reducers are records, which are relatively small. Hadoop operates in terms of these, not file blocks on disk.
Are you sure your process needs everything on one host? Anything that I'd describe as a merge can be implemented pretty cleanly as a MapReduce where there is no such requirement.
If you mean that you want to ensure certain keys (and their values) end up on the same reducer, you can use a Partitioner to define how keys are mapped onto reducer instances. Depending on your situation, this may be what you really are after.
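As a sketch, a custom Partitioner is only a few lines; here all records sharing the same key are guaranteed to reach the same reducer instance (the Text key/value types are assumptions). Wire it up with job.setPartitionerClass(KeyGroupPartitioner.class).

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class KeyGroupPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // Equal keys always hash to the same reducer.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}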
I'll also say it sounds like you are trying to operate on HDFS files rather than write a Hadoop MapReduce job. So maybe your question is really about how to hold open several SequenceFiles on HDFS, read their records, and merge them manually. That isn't a Hadoop question then, but it still doesn't need blocks to be on one host.
I have the following class:
public class BdFileContent {
String filecontent;
}
E.g. file1.txt has the following content:
This is test
"This" represents a single file content object.
"is" represents another file content object.
"test" represents another file content object.
Suppose the following is the folder structure:
lineage
|
+-folder1
| |
| +-file1.txt
| +-file2.txt
|
+-folder2
| |
| +-file3.txt
| +-file4.txt
+-...
|
+-...+-fileN.txt
...
N > 1000 files (N will be a very large value)
The BdFileContent class represents each string in each file in the directory.
I have to do a lot of data manipulation and need to build and work with a complex data structure. I have to perform computation both in memory and on disk.
ArrayList<ArrayList<ArrayList<BdFileContent>>> filecontentallFolderFileAsSingleStringToken = new ArrayList<>();
For example, the object above represents all the file contents of the directory. I have to add this object as a tree node in a BdTree.
I am writing my own tree and adding filecontentallFolderFileAsSingleStringToken as a node.
To what extent are the collection framework's data structures appropriate for huge data sets?
At this point I want some insight into how big companies use data structures to manipulate the huge data sets generated every day.
Are they using the collection framework?
Do they use their own custom data structures?
Are they using multi-node data structures, with each node running on a separate JVM?
So far, a collection object runs on a single JVM and cannot dynamically use another JVM when memory overflows or processing resources run short.
What approach do other developers normally take to data structures for big data?
How are other developers handling it?
I want to get some hints from real use cases and experience.
When you're dealing with big data you must change your approach. First of all, you have to assume that your data will not all fit into the memory of a single machine, so you need to split the data among several machines, let them compute what you need, and then re-assemble everything. So you can use Collections, but only for part of the whole job.
I suggest you take a look at:
Hadoop: the first framework for dealing with big data
Spark: another framework for big data, often faster than Hadoop
Akka: a framework for writing distributed applications
While Hadoop and Spark are the de facto standards of the big data world, Akka is just a framework that is used in a lot of contexts, not only with big data: that means you'll have to write a lot of the things Hadoop and Spark already give you; I put it in the list just for the sake of completeness.
You can read about the WordCount example, which is the "Hello, World" equivalent of the big data world, to get an idea of how the MapReduce programming paradigm works in Hadoop, or you can take a look at the quick start guide for obtaining the equivalent transformation with Spark.
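For reference, the heart of WordCount in Hadoop's Java API looks roughly like this (driver setup omitted): map emits (word, 1) for every token, the framework groups by word, and reduce sums the counts.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);          // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();                    // add up all the 1s for this word
            }
            context.write(key, new IntWritable(sum));
        }
    }
}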
When it comes to Big Data, the leading technologies available are the Hadoop Distributed File System aka HDFS (a variant of the Google File System), Hadoop, Spark/MapReduce, and Hive (originally developed by Facebook). Since you are asking mainly about the data structures used in Big Data processing, you need to understand the role of these systems.
Hadoop Distributed File System - HDFS
In very simple words, this is a file storage system that uses a cluster of cheap machines to store files in a 'highly available' and 'fault tolerant' manner. So this becomes the data input source in Big Data processing. The data can be structured (say, comma-delimited records) or unstructured (the content of all the books in the world).
How to deal with structured data
One prominent technology used for structured data is Hive. It gives a relational-database-like view of the data. Note that it is not a relational database itself. The source of this view is, again, the files stored on disk (or HDFS, which the big companies use). When you process the data with Hive, the logic is applied to the files (internally via one or more MapReduce programs) and the result is returned. If you wish to store this result, it lands on disk (or HDFS) again in the form of a structured file.
Thus a sequence of Hive queries helps you refine a big data set into the desired data set via step-wise transformations. Think of it like extracting data from a traditional DB system using joins and then storing the result in a temp table.
How to deal with unstructured data
When it comes to dealing with unstructured data, the MapReduce approach is one of the popular ones, along with Apache Pig (which is ideal for semi-structured data). The MapReduce paradigm mainly takes data from disk (or HDFS), processes it on multiple machines, and writes the result back to disk.
If you read the popular book on Hadoop (O'Reilly's Hadoop: The Definitive Guide), you will find that a MapReduce program fundamentally works on a key-value type of data structure (like a Map), but it never keeps all the values in memory at one point in time. It is more like:
Get the Key-Value data
Do some processing
Write the data out to disk via the context
Do this for all the key-values, processing one logical unit at a time from the Big Data source.
At the end, the output of one MapReduce program is written to disk, and you now have a new data set for the next level of processing (which again might be another MapReduce program).
Now, to answer your specific queries:
At this point I want some insight into how big companies use data structures to manipulate the huge data sets generated every day.
They use HDFS (or a similar distributed file system) to store Big Data. If the data is structured, Hive is a popular tool to process it. Because a Hive query to transform the data is close to SQL (syntax-wise), the learning curve is really low.
Are they using the collection framework?
While processing Big Data, the whole content is never kept in memory (not even on the cluster nodes). It is more like a chunk of data being processed at a time. That chunk might be represented as a collection (in memory) while it is being processed, but in the end the whole set of output data is dumped back to disk in structured form.
Do they use their own custom data structures?
Since not all the data is kept in memory, there is no particular need for a custom data structure. However, the data movement within MapReduce or across the network does happen in the form of data structures, so yes, there are data structures, but they are not an important consideration from an application developer's perspective. The logic inside MapReduce or other Big Data processing is written by the developer, and you can always use any API (or custom collection) to process the data; but the data has to be written back to disk in the data structure expected by the framework.
Are they using multi-node data structures, with each node running on a separate JVM?
The big data in files is processed across multiple machines in blocks, e.g. 10 TB of data is processed in 64 MB blocks across the cluster by multiple nodes (separate JVMs, and sometimes multiple JVMs on one machine). But again it is not a data structure shared across JVMs; rather it is data input (in the form of file blocks) distributed across JVMs.
So far, a collection object runs on a single JVM and cannot dynamically use another JVM when memory overflows or processing resources run short.
You are right.
What approach do other developers normally take to data structures for big data?
From the data input/output perspective, it is always a file on HDFS. For the processing of the data (the application logic), you can use any normal Java API that can run in a JVM. Now, since the JVMs in the cluster run in the Big Data environment, they also have resource constraints. So you must devise your application logic to work within those resource limits (just as we do for a normal Java program).
How are other developers handling it?
I would suggest reading the Definitive Guide (mentioned in the section above) to understand the building blocks of Big Data processing. The book is excellent and touches on many aspects/problems and their solution approaches in Big Data.
I want to get some hints from real use cases and experience.
There are numerous use cases of Big Data processing, especially in financial institutions. Google Analytics is one of the prominent ones: it captures users' behavior on a website in order to determine the best position on a webpage to place the Google ad block. I am working with a leading financial institution that loads users' transaction data into Hive in order to do fraud detection based on their behavior.
These are the answers to your queries (addressed with Hadoop in mind):
Are they using the collection framework?
No. The HDFS file system is used in the case of Hadoop.
Do they use their own custom data structures?
You have to understand HDFS, the Hadoop Distributed File System. Refer to the O'Reilly book Hadoop: The Definitive Guide, 3rd Edition. If you want to know the fundamentals without buying the book, try this link: HDFS Basics, or Apache Hadoop.
The HDFS file system is reliable and fault tolerant.
Are they using multi-node data structures, with each node running on a separate JVM?
Yes. Refer to the Hadoop 2.0 YARN architecture.
What approach do other developers normally take to data structures for big data?
There are many. Refer to: Hadoop Alternatives
How are other developers handling it?
Through the frameworks the respective technologies provide: the MapReduce framework in the case of Hadoop.
I want to get some hints from real use cases and experience.
Big Data technologies are useful where an RDBMS fails: data analytics, data warehousing (a system used for reporting and data analysis). Some of the use cases: recommendation engines (LinkedIn), ad targeting (YouTube), processing large volumes of data (find the hottest/coldest day of a place over 100+ years of weather data), share price analysis, market trending, etc.
Refer to this collection of real-life examples: Big Data Use Cases
What options are there for processing large files quickly, multiple times?
I have a single file (min 1.5 GB, but can be upwards of 10-15 GB) that needs to be read multiple times - on the order of hundreds to thousands of times. The server has a large amount of RAM (64+ GB) and plenty of processors (24+).
The file will be sequential, read-only. Files are encrypted (sensitive data) on disk. I also use MessagePack to deserialize them into objects during the read process.
I cannot store the objects created from the file in memory: the expansion is too large (a 1.5 GB file turns into a 35 GB in-memory object array). The file also can't be stored as a single byte array (limited by Java's maximum array length of 2^31-1).
My initial thought is to use a memory mapped file, but that has its own set of limitations.
The idea is to get the file off the disk and into memory for processing.
The large volume of data is for a machine learning algorithm that requires multiple reads. During each pass over the file, the algorithm itself uses a considerable amount of heap, which is unavoidable, hence the requirement to read the file multiple times.
The problem you have here is that you cannot mmap() the way the system call of the same name does; the syscall can map up to 2^64 bytes, while FileChannel#map() cannot map more than 2^31-1 bytes (Integer.MAX_VALUE) per mapping.
However, what you can do is wrap a FileChannel in a class and create several "map ranges" covering the whole file.
I have done "nearly" such a thing, except more complicated: largetext. More complicated because I had to do the decoding process as well, and the text which is loaded must be loaded into memory, unlike your case, where you read bytes. Less complicated because I had a defined JDK interface to implement and you don't.
You can however use nearly the same technique using Guava and a RangeMap<Long, MappedByteBuffer>.
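A rough sketch of that technique, assuming Guava's RangeMap and read-only access (reads that straddle a chunk boundary are left out for brevity):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

import com.google.common.collect.Range;
import com.google.common.collect.RangeMap;
import com.google.common.collect.TreeRangeMap;

public class ChunkedByteMapping implements AutoCloseable {
    private static final long CHUNK = Integer.MAX_VALUE;   // FileChannel#map() limit per buffer

    private final FileChannel channel;
    private final RangeMap<Long, MappedByteBuffer> mappings = TreeRangeMap.create();

    public ChunkedByteMapping(Path file) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.READ);
        long size = channel.size();
        for (long start = 0; start < size; start += CHUNK) {
            long length = Math.min(CHUNK, size - start);
            mappings.put(Range.closedOpen(start, start + length),
                    channel.map(FileChannel.MapMode.READ_ONLY, start, length));
        }
    }

    // Read one byte at an absolute file offset.
    public byte get(long offset) {
        Range<Long> range = mappings.getEntry(offset).getKey();
        return mappings.get(offset).get((int) (offset - range.lowerEndpoint()));
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}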
I implement CharSequence in the largetext project; I suggest that you implement a LargeByteMapping interface instead, from which you can read whatever parts you want; or, well, whatever suits you. Your main problem will be defining that interface. I suspect what CharSequence does is not what you want.
Meh, I may even have a go at it some day; largetext is quite an exciting project, and this looks like the same kind of thing, except less complicated, ultimately!
One could even imagine a LargeByteMapping implementation where a factory creates such mappings with only a small part of the file in memory and the rest written to a file; such an implementation would also use the principle of locality: the most recently queried part of the file would be kept in memory for faster access.
See also here.
EDIT I feel some more explanation is needed here... A MappedByteBuffer will NOT EAT HEAP SPACE!!
It will eat address space only; it is nearly the equivalent of a ByteBuffer.allocateDirect(), except it is backed by a file.
And a very important distinction needs to be made here; all of the text above supposes that you are reading bytes, not characters!
Figure out how to structure the data. Get a good book about NoSQL and find the appropriate database (wide-column, graph, etc.) for your scenario. That's what I'd do. You'd not only have sophisticated query methods for your data, you could also mangle the data using distributed map-reduce implementations doing whatever you want. Maybe that's what you want (you even dropped the big-data bomb).
How about creating a "dictionary" as the bridge between your program and the target file? Your program calls the dictionary, and the dictionary refers it to the big fat file.
I'm looking for the fastest approach, in Java, to store ~1 billion records of ~250 bytes each (storage will happen only once) and then be able to read them multiple times in a non-sequential order.
The source records are generated as simple Java value objects, and I would like to read them back in the same format.
For now my best guess is to store these objects, using a fast serialization library such as Kryo, in a flat file and then use Java's FileChannel to make direct random-access reads of the records at specific positions in the file (when storing the data, I will keep a HashMap, also to be saved on disk, with the position of each record in the file so that I know where to read it).
Also, there is no need to optimize disk space. My key concern is to optimize read performance, while having a reasonable write performance (that, again, will happen only once).
One last precision: while the records are all of the same type (the same Java value object), their size in bytes is variable (e.g. they contain strings).
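A rough sketch of that plan, assuming Kryo and positional FileChannel reads (Record, the file handling and the fixed-size read buffer are simplifications; Kryo instances are not thread-safe, so each reader thread would need its own):

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

public class KryoRecordStore {
    private static final int MAX_RECORD_BYTES = 64 * 1024;  // safe upper bound per record

    private final Kryo kryo = new Kryo();
    private final Map<Long, Long> offsets = new HashMap<>(); // record id -> file offset
    private FileChannel channel;

    public KryoRecordStore() {
        kryo.register(Record.class);
    }

    // Write phase (happens once): remember the offset of every record.
    public void writeAll(Iterable<Record> records, String fileName) throws IOException {
        try (Output out = new Output(new FileOutputStream(fileName))) {
            for (Record r : records) {
                offsets.put(r.id, out.total());
                kryo.writeObject(out, r);
            }
        }
        channel = FileChannel.open(Paths.get(fileName), StandardOpenOption.READ);
    }

    // Random-access read: positional read into a buffer large enough for one record.
    public Record read(long id) throws IOException {
        long offset = offsets.get(id);
        ByteBuffer buf = ByteBuffer.allocate(MAX_RECORD_BYTES);
        channel.read(buf, offset);
        return kryo.readObject(new Input(buf.array(), 0, buf.position()), Record.class);
    }

    public static class Record {     // placeholder value object
        public long id;
        public String payload;
    }
}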
Is there any better approach than what I mentioned above? Any hint or suggestion would be greatly appreciated !
Many thanks,
Thomas
You can use Apache Lucene; it will take care of everything you have mentioned above :)
It is super fast; you can search results more quickly than ever.
Apache Lucene persists objects in files and indexes them. We have used it in a couple of apps and it is super fast.
You could just use an embedded Derby database. It's written in Java and you can actually run it embedded within your process, so there is no overhead of inter-process or networked communication. It will store the data and allow you to query it, handling all the complexity and indexing for you.
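A minimal embedded Derby sketch (the table and column names are assumptions; the exact schema depends on your records): the database runs inside the same JVM, and the primary-key index gives fast random lookups by id.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class DerbyStore {
    public static void main(String[] args) throws SQLException {
        // ";create=true" creates the database directory on first use.
        try (Connection conn =
                     DriverManager.getConnection("jdbc:derby:recordsDB;create=true")) {
            conn.createStatement().execute(
                    "CREATE TABLE records (id BIGINT PRIMARY KEY, payload BLOB)");

            try (PreparedStatement insert =
                         conn.prepareStatement("INSERT INTO records VALUES (?, ?)")) {
                insert.setLong(1, 42L);
                insert.setBytes(2, new byte[]{1, 2, 3});   // your serialized record
                insert.executeUpdate();
            }

            try (PreparedStatement query =
                         conn.prepareStatement("SELECT payload FROM records WHERE id = ?")) {
                query.setLong(1, 42L);
                try (ResultSet rs = query.executeQuery()) {
                    if (rs.next()) {
                        byte[] payload = rs.getBytes(1);   // deserialize as needed
                        System.out.println("read " + payload.length + " bytes");
                    }
                }
            }
        }
    }
}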
Here's the situation.
In a Java web app I was assigned to maintain, I've been asked to improve the general response time for the stress tests during QA. This web app doesn't use a database, since it was supposed to be light and simple. (And I can't change that decision.)
To persist configuration, I've found that every time you make a change to it, a general object containing lists of config objects is serialized to a file.
Using JMeter I've found that in the given test case there are two requests taking up most of the time. Both of these requests add or change some configuration objects. Since access to the file must be synchronized, when many users are changing config the file must be fully written several times within a few seconds, and requests sit waiting for the file write to happen.
I think all of these serializations are unnecessary: we are rewriting most of the objects again and again, and the changes in each request affect only a single object, yet the file is written as a whole every time.
So, is there a way to reduce the number of real file writes but still guarantee that all changes are eventually serialized?
Any suggestions appreciated
One option is to make the changes in memory and keep one background thread running at given intervals to flush the changes to disk. Keep in mind that in the case of a crash you'll lose any data that wasn't flushed.
The background thread could be scheduled with a ScheduledExecutorService.
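A sketch of that approach (class and method names are assumptions): request threads mutate the in-memory config and only mark it dirty; a single scheduled task serializes the whole object graph at a fixed interval, so at most one file write happens per interval regardless of how many requests changed something.

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class ConfigFlusher {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final AtomicBoolean dirty = new AtomicBoolean(false);
    private final Object lock = new Object();
    private final Serializable config;     // the big object holding all the config lists
    private final String fileName;

    public ConfigFlusher(Serializable config, String fileName) {
        this.config = config;
        this.fileName = fileName;
        scheduler.scheduleWithFixedDelay(this::flushIfDirty, 5, 5, TimeUnit.SECONDS);
    }

    // Call this from request threads after mutating the config in memory.
    public void markDirty() {
        dirty.set(true);
    }

    private void flushIfDirty() {
        if (!dirty.getAndSet(false)) {
            return;                        // nothing changed since the last flush
        }
        synchronized (lock) {              // still serialize writers to the file
            try (ObjectOutputStream out =
                         new ObjectOutputStream(new FileOutputStream(fileName))) {
                out.writeObject(config);
            } catch (IOException e) {
                dirty.set(true);           // try again on the next tick
            }
        }
    }

    public void shutdown() {
        scheduler.shutdown();
        flushIfDirty();                    // final flush; otherwise pending changes are lost
    }
}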
IMO, it would be a better idea to use a DB. Can't you use an embedded DB like Java DB, H2 or HSQLDB? These databases support concurrent access and can also guarantee the consistency of the data in case of a crash.
If you absolutely cannot use a database, the obvious solution is to break your single file into multiple files, one file for each config object. It would speed up serialization and the output process, and it would reduce lock contention (requests that change different config objects may write their files simultaneously, though it may become IO-bound).
One way is to do what Lucene does and not actually overwrite the old file at all, but to write a new file that only contains the "updates". This relies on your updates being associative, but that is usually the case anyway.
The idea is that if your old file contains "8" and you have 3 updates, you write "3" to the new file, and the new state is "11"; next you write "-2" and you now have "9". Periodically you can aggregate the old file and the updates. Any physical file you write is never updated, but it may be deleted once it is no longer used.
To make this idea a bit more concrete, consider that the numbers above could be records of some kind: "3" could translate to "add these three new records" and "-2" to "delete these two records".
Lucene is an example of a project that uses this style of additive update strategy very successfully.
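A hedged sketch of that append-only style (the file naming and the use of plain Java serialization for each delta are assumptions): every change becomes its own small file, and a periodic compaction job folds the deltas back into a full snapshot.

import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicLong;

public class DeltaLog {
    private final Path dir;
    private final AtomicLong sequence = new AtomicLong();

    public DeltaLog(Path dir) {
        this.dir = dir;
    }

    // Write one small update as its own numbered file; never touch old files.
    public void append(Serializable delta) throws IOException {
        Path file = dir.resolve(String.format("delta-%019d.bin", sequence.incrementAndGet()));
        try (ObjectOutputStream out =
                     new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(delta);
        }
    }

    // A periodic compaction job would replay the delta files in order onto the last
    // full snapshot, write a new snapshot, then delete the consumed delta files.
}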
I am looking for a simple way to store and retrieve millions of xml files. Currently everything is done in a filesystem, which has some performance issues.
Our requirements are:
Ability to store millions of XML files in a batch process. XML files may be up to a few megabytes large; most are in the 100 KB range.
Very fast random lookup by id (e.g. document URL)
Accessible by both Java and Perl
Available on the most important Linux-Distros and Windows
I did have a look at several NoSQL platforms (e.g. CouchDB, Riak and others), and while those systems look great, they seem almost like overkill:
No clustering required
No daemon ("service") required
No clever search functionality required
Having delved deeper into Riak, I have found Bitcask (see the intro), which seems like exactly what I want. The basics described in the intro are really intriguing. But unfortunately there is no way to access a Bitcask repo via Java (or is there?).
So my question boils down to:
Is the following assumption right: the Bitcask model (append-only writes, in-memory key management) is the right way to store/retrieve millions of documents?
Are there any viable alternatives to Bitcask available via Java? (BerkeleyDB comes to mind...)
(For Riak specialists) Is Riak much overhead, implementation/management/resource-wise, compared to "naked" Bitcask?
I don't think that Bitcask is going to work well for your use-case. It looks like the Bitcask model is designed for use-cases where the size of each value is relatively small.
The problem is Bitcask's data file merging process. This involves copying all of the live values from a number of "older data files" into the "merged data file". If you've got millions of values in the region of 100 KB each, this is an insane amount of data copying.
Note that the above assumes the XML documents are updated relatively frequently. If updates are rare and/or you can cope with a significant amount of "wasted" space, then merging may only need to be done rarely, or not at all.
Bitcask can be appropriate for this case (large values) depending on whether or not there is a great deal of overwriting. In particular, there is no reason to merge files unless there is a great deal of wasted space, which only occurs when new values arrive with the same key as old values.
Bitcask is particularly good for this batch load case as it will sequentially write the incoming data stream straight to disk. Lookups will take one seek in most cases, although the file cache will help you if there is any temporal locality.
I am not sure on the status of a Java version/wrapper.