I have the following class:
public class BdFileContent {
    String filecontent;
}
E.g. file1.txt has the following content:
This is test
"This" represents a single instance of a file content object.
"is" represents another file content object.
"test" represents another file content object.
Suppose the following is the folder structure:
lineage
|
+-folder1
| |
| +-file1.txt
| +-file2.txt
|
+-folder2
| |
| +-file3.txt
| +-file4.txt
+-...
|
+-...+-fileN.txt
N > 1000 files, and N can be very large.
The BdFileContent class represents each string in each file in the directory.
I have to do a lot of data manipulation and need to work with a complex data structure. I have to perform computation both in memory and on disk.
ArrayList<ArrayList<ArrayList<BdFileContent>>> filecontentallFolderFileAsSingleStringToken = new ArrayList<>();
For example, the object above represents all file contents of the directory. I have to add this object to a tree node in BdTree.
I am writing my own tree and adding filecontentallFolderFileAsSingleStringToken as a node.
To what extent are the Collections Framework data structures appropriate for huge data?
At this point I want to get some insight into how big companies use data structures to manipulate the huge data sets generated every day.
Are they using the Collections Framework?
Do they use their own custom data structures?
Are they using multi-node data structures, with each node running on a separate JVM?
Until now a collection object runs on a single JVM and cannot dynamically use another JVM when there is a signal of memory overflow or a lack of resources for processing.
What data structures do other developers normally use for big data?
How are other developers handling it?
I want to get some hints from real use cases and experience.
When you're dealing with big data you must change your approach. First of all, you have to assume that all your data will not fit into the memory of a single machine, so you need to split the data among several machines, let them compute what you need, and then re-assemble the results. So you can use Collections, but only for a part of the whole job.
I suggest you take a look at:
Hadoop: the first framework for dealing with big data
Spark: another framework for big data, often faster than Hadoop
Akka: a framework for writing distributed applications
While Hadoop and Spark are the de facto standards of the big data world, Akka is just a framework that is used in a lot of contexts, not only with big data: that means you'll have to write a lot of the stuff that Hadoop and Spark already have. I put it in the list just for the sake of completeness.
You can read about the WordCount example, which is the "Hello World" equivalent of the big data world, to get an idea of how the MapReduce programming paradigm works in Hadoop, or you can take a look at the quick start guide for obtaining the equivalent transformation with Spark.
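To make that concrete, here is a minimal WordCount sketch against the Spark Java API (assuming Spark 2.x on the classpath; the local master and the input/output paths are placeholders for experimentation):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]"); // local run, for experimentation only
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///lineage/folder1/file1.txt"); // placeholder input path
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // split each line into tokens
                    .mapToPair(word -> new Tuple2<>(word, 1))                      // (word, 1) pairs
                    .reduceByKey(Integer::sum);                                    // sum the counts per word
            counts.saveAsTextFile("hdfs:///lineage/wordcount-output");             // result goes back to storage, not memory
        }
    }
}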
When it comes to Big Data, the leading technologies available are the Hadoop Distributed File System aka HDFS (a variant of the Google File System), Hadoop, Spark/MapReduce and Hive (originally developed by Facebook). Now, as you are asking mainly about the data structures used in Big Data processing, you need to understand the role of these systems.
Hadoop Distributed File System - HDFS
In very simple words, this is a file storage system which uses a cluster of cheap machines to store files and which is 'highly available' and 'fault tolerant' in nature. So this becomes the data input source in Big Data processing. The data can be structured (say, comma-delimited records) or unstructured (the content of all the books in the world).
How to deal with structured data
One prominent technology used for structured data is Hive. It gives a relational-database-like view of the data. Note that it is not a relational database itself. The source of this view is again the files stored on disk (or HDFS, which big companies use). When you process the data with Hive, the logic is applied to the files (internally via one or more Map-Reduce programs) and the result is returned. If you wish to store this result, it lands on disk (or HDFS) again in the form of a structured file.
Thus a sequence of Hive queries helps you refine a big data set into the desired data set via step-wise transformations. Think of it like extracting data from a traditional DB system using joins and then storing the data into a temp table.
How to deal with unstructured data
When it comes to dealing with unstructured data, the Map-Reduce approach is one of the popular ones, along with Apache Pig (which is ideal for semi-structured data). The Map-Reduce paradigm mainly uses data on disk (or HDFS), processes it on multiple machines and outputs the result to disk.
If you read the popular book on Hadoop - O'Reilly's Hadoop: The Definitive Guide - you will find that a Map-Reduce program fundamentally works on a key-value type of data structure (like a Map), but it never keeps all the values in memory at one point in time. It is more like:
Get the key-value data
Do some processing
Write the data to disk via the context
Do this for all the key-values, thus processing one logical unit at a time from the Big Data source.
At the end, the output of one Map-Reduce program is written to disk, and now you have a new set of data for the next level of processing (which again might be another Map-Reduce program).
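As a rough illustration of that loop, a minimal mapper with the Hadoop MapReduce Java API looks like the sketch below (WordCount-style; the class name is just an example). It receives one record at a time and emits key-value pairs through the context; the framework takes care of writing them to disk, so the whole data set is never held in memory.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 'value' is one line of the input split; only this record is in memory.
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE); // emitted to the framework, which spills/writes to disk
        }
    }
}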
Now, to answer your specific queries:
At this point I want to get some insight into how big companies use data structures to manipulate the huge data sets generated every day.
They use HDFS (or a similar distributed file system) to store Big Data. If the data is structured, Hive is a popular tool to process it. Because a Hive query to transform the data is close to SQL (syntax-wise), the learning curve is really low.
Are they using the Collections Framework?
While processing Big Data, the whole content is never kept in memory (not even on cluster nodes). It's more like a chunk of data is processed at a time. This chunk might be represented as a collection (in memory) while it is being processed, but at the end the whole set of output data is dumped back to disk in structured form.
Do they use their own custom data structures?
Since not all the data is stored in memory, there is no particular need for a custom in-memory data structure. However, the data movement within Map-Reduce or across the network does happen in the form of data structures, so yes - there are data structures, but they are not an important consideration from an application developer's perspective. Since the logic inside Map-Reduce or other Big Data processing is written by the developer, you can always use any API (or custom collection) to process the data; but the data has to be written back to disk in the data structure expected by the framework.
Are they using multi-node data structures, with each node running on a separate JVM?
The big data in the files is processed across multiple machines in blocks, e.g. 10 TB of data is processed in blocks of 64 MB across the cluster by multiple nodes (separate JVMs, and sometimes multiple JVMs on one machine as well). But again it is not a shared data structure across JVMs; rather it is distributed data input (in the form of file blocks) across JVMs.
Until now a collection object runs on a single JVM and cannot dynamically use another JVM when there is a signal of memory overflow or a lack of resources for processing.
You are right.
What data structures do other developers normally use for big data?
From the data input/output perspective, it is always a file on HDFS. For processing the data (the application logic), you can use any normal Java API that can run in the JVM. Now, since the JVMs in the cluster run in the Big Data environment, they also have resource constraints. So you must devise your application logic to work within those resource limits (as we do for a normal Java program).
How are other developers handling it?
I would suggest reading the Definitive Guide (mentioned in the section above) to understand the building blocks of Big Data processing. The book is awesome and touches on many aspects/problems and their solution approaches in Big Data.
I want to get some hints from real use cases and experience.
There are numerous use cases of Big Data processing, especially with financial institutions. Google Analytics is one of the prominent use cases: it captures users' behavior on a website in order to determine the best position on a web page to place the Google ad block. I am working with a leading financial institution which loads users' transaction data into Hive in order to do fraud detection based on user behavior.
These are the answers to your queries (these queries are addressed keeping Hadoop in mind):
Are they using the Collections Framework?
No. The HDFS file system is used in the case of Hadoop.
Do they use their own custom data structures?
You have to understand HDFS - the Hadoop Distributed File System. Refer to the O'Reilly book Hadoop: The Definitive Guide, 3rd Edition. If you want to know the fundamentals without buying the book, try HDFS Basics or Apache Hadoop.
The HDFS file system is a reliable and fault-tolerant system.
Are they using multi-node data structures, with each node running on a separate JVM?
Yes. Refer to the Hadoop 2.0 YARN architecture.
What data structures do other developers normally use for big data?
There are many. Refer to: Hadoop Alternatives.
How are other developers handling it?
Through the frameworks provided by the respective technologies - the MapReduce framework in the case of Hadoop.
I want to get some hints from real use cases and experience.
Big Data technologies are useful where an RDBMS fails - data analytics, data warehousing (a system used for reporting and data analysis). Some of the use cases: recommendation engines (LinkedIn), ad targeting (YouTube), processing large volumes of data - finding the hottest/coldest day of a place over 100+ years of weather data, share price analysis, market trending, etc.
For many real-life examples, refer to Big Data Use Cases.
We have a requirement to incorporate an Excel-based tool into a Java web application. This Excel tool has a set of master data and a couple of result outputs using formula calculations on the master data.
The master data can be captured in a database with relational tables. We are looking for the best way to provide the capability to capture, validate and evaluate formulas.
So far we have looked at using the Nashorn scripting engine and providing formula support using eval. We would like to know how people are doing this in other places.
I've searched and found two possible libraries that could be useful for you; please have a look:
http://mathparser.org/
http://mathparser.org/mxparser-hello-world/mxparser-hello-world-java/
https://lallafa.objecthunter.net/exp4j/
https://lallafa.objecthunter.net/exp4j/#Evaluating_an_expression_asynchronously
It depends on how big your data is and what your required SLA is, and also on what kind of formulas/other functions you want to support.
For example, consider a function like sum or max. Now, the master data is in some relational table containing 10K rows. You could pull all this data into a Java app and compute the sum (or run any function). However, imagine the table contained 500K rows. Streaming all 500K rows to the Java app would take some time and consume a lot of CPU and network bandwidth (database resources, local CPU resources). A better-optimized approach in that case would be to index that column in the database and let the database do all the hard work for you.
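A minimal sketch of pushing the aggregation into the database via JDBC (the in-memory H2 URL, table and column names are made up for illustration; any JDBC source works the same way):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SumInDatabase {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:demo"); // hypothetical embedded DB
             Statement st = con.createStatement()) {
            st.execute("CREATE TABLE master_data (id INT PRIMARY KEY, amount DECIMAL(12,2))");
            st.execute("INSERT INTO master_data VALUES (1, 10.50), (2, 4.25)");
            // SUM runs inside the database; only one row crosses the network.
            try (ResultSet rs = st.executeQuery("SELECT SUM(amount) FROM master_data")) {
                if (rs.next()) {
                    System.out.println("sum = " + rs.getBigDecimal(1));
                }
            }
        }
    }
}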
Personally, I don't like using eval. I would rather parse the user input to determine what actions to take.
I am assuming that the data is not big enough to need big data tools.
I fear I may not be truly understanding the utility of database software like MySQL, so perhaps this is an easy question to answer.
I'm writing a program that stores and accesses a bestiary for use in the program. It is a stand-alone application, meaning that it will not connect to the internet or a database (which I am under the impression requires a connection to a server). Currently, I have an enormous .txt file that it parses via a simple pattern (habitat is on every tenth line, starting with the seventh; name is on every tenth line, starting with the first; etc.). This is prone to parsing errors (problems with reading data that is unrecognizable with the specified encoding, as a lot of the data is copy/pasted by lazy data-entry-ists), and I just feel that parsing a giant .txt file every time I want data is horribly inefficient. Plus, I've never seen a deployed program that had a .txt lying around called "All of our important data.txt".
Are databases the answer? Can they be used simply in basic applications like this one? Writing a class for each animal seems silly. I've heard XML can help, too - but I know virtually nothing about it except that it's a markup language.
In summary, I just don't know how to store large amounts of data within an application. A good analogy would be: How would you store data for a dictionary/encyclopedia application?
So you are saying that a standalone application without internet access cannot have a database connection? Your basic assumption that a DB cannot exist in standalone apps is wrong. Even today's web applications use browser-assisted SQL databases to store data. All you need is to experiment rather than speculate. If you need direction, start with lightweight SQLite.
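A minimal sketch of an embedded SQLite database, assuming an SQLite JDBC driver (e.g. Xerial's sqlite-jdbc) is on the classpath; the table and column names are invented for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BestiaryDb {
    public static void main(String[] args) throws Exception {
        // Creates (or opens) a local bestiary.db file next to the application; no server involved.
        try (Connection con = DriverManager.getConnection("jdbc:sqlite:bestiary.db");
             Statement st = con.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS animal (name TEXT, habitat TEXT)");
            st.execute("INSERT INTO animal VALUES ('Basilisk', 'Caves')");
            try (ResultSet rs = st.executeQuery("SELECT name FROM animal WHERE habitat = 'Caves'")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name"));
                }
            }
        }
    }
}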
While databases are undoubtedly a good idea for the kind of application you're describing, I'll throw another suggestion your way, which might suit you if your data doesn't necessarily need to change at all, and there's not a "huge" amount of it.
Java provides the ability to serialise objects, which you could use to persist and retrieve object instance data directly to/from files (a minimal sketch follows the list below). Using this simple approach, you could:
Write code to parse your text file into a collection of serialisable application-specific object instances;
Serialise these instances to some file(s) which form part of your application;
De-serialise the objects into memory every time the application is run;
Write your own Java code to search and retrieve data from these objects yourself, for example using ordered collection structures with custom comparators.
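As promised, a minimal sketch of steps 1-3, assuming a hypothetical serialisable Animal type whose fields are invented for illustration:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

class Animal implements Serializable {
    private static final long serialVersionUID = 1L;
    String name;
    String habitat;
    Animal(String name, String habitat) { this.name = name; this.habitat = habitat; }
}

public class BestiaryStore {
    // Serialise the whole collection to a file shipped with the application.
    static void save(List<Animal> animals, String path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(new ArrayList<>(animals));
        }
    }

    // De-serialise the collection back into memory at application start-up.
    @SuppressWarnings("unchecked")
    static List<Animal> load(String path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(path))) {
            return (List<Animal>) in.readObject();
        }
    }
}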
This approach may suffice if you:
Don't expect your data to change;
Do expect it to always fit within memory on the JVMs you're expecting the application will be run on;
Don't require sophisticated querying abilities.
Even if one or more of the above things do not hold, it may still suit you to try this approach, so that your next step could be to use a so-called object-relational mapping tool like Hibernate or Castor to persist your serialisable data not in a file, but in a database (XML or relational). From there, you can use the power of the database to maintain and query your data.
I am writing a program in Java which tracks data about baseball cards. I am trying to decide how to store the data persistently. I have been leaning towards storing the data in an XML file, but I am unfamiliar with XML APIs. (I have read some online tutorials and started experimenting with the classes in the javax.xml hierarchy.)
The software has two major use cases: the user will be able to add cards and search for cards.
When the user adds a card, I would like to immediately commit the data to the persistent storage. Does the standard API allow me to insert data in a random-access way (or even appending might be okay)?
When the user searches for cards (for example, by a player's name), I would like to load a list from the storage without necessarily loading the whole file.
My biggest concern is that I need to store data for a large number of unique cards (in the neighborhood of thousands, possibly more). I don't want to store a list of all the cards in memory while the program is open. I haven't run any tests, but I believe that I could easily hit memory constraints.
XML might not be the best solution. However, I want to make it as simple as possible to install, so I am trying to avoid a full-blown database with JDBC or any third-party libraries.
So I guess I'm asking if I'm heading in the right direction and if so, where can I look to learn more about using XML in the way I want. If not, does anyone have suggestions about what other types of storage I could use to accomplish this task?
While I would certainly not discourage the use of XML, it does have some drawbacks in your context.
"Does the standard API allow me to insert data in a random-access way"
Yes, in memory. You will have to save the entire model back to file though.
"When the user searches for cards (for example, by a player's name), I would like to load a list from the storage without necessarily loading the whole file"
Unless you're expecting multiple users to be reading/writing the file, I'd probably pull the entire file/model into memory at load and keep it there until you want to save (doing periodic writes in the background is still a good idea).
I don't want to store a list of all the cards in memory while the program is open. I haven't run any tests, but I believe that I could easily hit memory constraints
That would be my concern too. However, you could use a SAX parser to read the file into a custom model. This would reduce the memory overhead (DOM parsers can be a little greedy with memory).
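For instance, a minimal SAX handler that pulls just the player names out of a hypothetical cards.xml (the element names are invented for illustration) might look like this:

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PlayerNameHandler extends DefaultHandler {
    private final List<String> playerNames = new ArrayList<>();
    private StringBuilder text;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("player".equals(qName)) {
            text = new StringBuilder(); // start collecting this element's text
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("player".equals(qName)) {
            playerNames.add(text.toString().trim()); // keep only what the model needs
            text = null;
        }
    }

    public static void main(String[] args) throws Exception {
        PlayerNameHandler handler = new PlayerNameHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(new File("cards.xml"), handler);
        System.out.println(handler.playerNames);
    }
}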
"However, I want to make it as simple as possible to install, so I am trying to avoid a full-blown database with JDBC"
I'd do some more research in this area. I (personally) use H2 and HSQLDB a lot for storage of large amounts of data. These are small, personal database systems that don't require any additional installation (just a JAR file linked to the program) or special server/services.
They make it really easy to build complex searches across the datastore that you would otherwise need to create yourself.
If you were to use XML, I would probably do one of three things
1 - If you're going to maintain the XML document in memory, I'd get familiar with XPath (simple tutorial & Java's API) for searching (a short sketch follows this list).
2 - I'd create a "model" of the data using objects to represent the various nodes, reading it in using SAX. Writing may be a little more tricky.
3 - Use a simple SQL DB (and object model) - it will simplify the overall process (IMHO)
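A short sketch of option 1 - searching an in-memory DOM with XPath (cards.xml, its element names and the player name are all hypothetical):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class CardSearch {
    public static void main(String[] args) throws Exception {
        // Parse the whole document into memory once.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("cards.xml"));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Find all <card> elements whose <player> child matches the name.
        NodeList hits = (NodeList) xpath.evaluate(
                "//card[player='Babe Ruth']", doc, XPathConstants.NODESET);
        System.out.println("Matching cards: " + hits.getLength());
    }
}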
Additional
As if I hadn't dumped enough on you ;)
If you really want to use XML (and again, I wouldn't discourage you from it), you might consider having a look at an XML-database-style solution
Apache Xindice (apparently retired)
Or you could have a look at what some other people think:
Use XML as database in Java
Java: XML into a Database, whats the simplest way?
For example ;)
I am working on the development of an application to process (and merge) several large Java serialized objects (on the order of GBs in size) using the Hadoop framework. Hadoop stores and distributes the blocks of a file on different hosts. But as deserialization will require all the blocks to be present on a single host, it is going to hit performance drastically. How can I deal with this situation, where different blocks cannot be individually processed, unlike text files?
There are two issues: one is that each file must (in the initial stage) be processed whole: the mapper that sees the first byte must handle all the rest of that file. The other problem is locality: for best efficiency, you'd like all the blocks for each such file to reside on the same host.
Processing files in whole:
One simple trick is to have the first-stage mapper process a list of filenames, not their contents. If you want 50 map jobs to run, make 50 files, each with that fraction of the filenames. This is easy and works with Java or streaming Hadoop.
Alternatively, use a non-splittable input format such as NonSplitableTextInputFormat.
For more details, see "How do I process files, one per map?" and "How do I get each of my maps to work on one complete input-file?" on the Hadoop wiki.
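A sketch of the second option with the newer MapReduce API - subclass an input format and report every file as non-splittable, so each file becomes exactly one split (the class name here is just an example; a job would then call job.setInputFormatClass(WholeFileTextInputFormat.class)):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split: one mapper sees the whole file
    }
}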
Locality:
This leaves a problem, however: the blocks you are reading from are distributed all across HDFS - normally a performance gain, here a real problem. I don't believe there's any way to chain certain blocks so that they travel together in HDFS.
Is it possible to place the files in each node's local storage? This is actually the most performant and easiest way to solve this: have each machine start jobs to process all the files in e.g. /data/1/**/*.data (being as clever as you care to be about efficiently using local partitions and number of CPU cores).
If the files originate from a SAN or from say s3 anyway, try just pulling from there directly: it's built to handle the swarm.
A note on using the first trick: If some of the files are much larger than others, put them alone in the earliest-named listing, to avoid issues with speculative execution. You might turn off speculative execution for such jobs anyway if the tasks are dependable and you don't want some batches processed multiple times.
It sounds like your input file is one big serialized object. Is that the case? Could you make each item its own serialized value with a simple key?
For example, if you wanted to use Hadoop to parallelize the resizing of images, you could serialize each image individually and give it a simple index key. Your input file would be a text file with key-value pairs: the index key, and then the serialized blob as the value.
I use this method when doing simulations in Hadoop. My serialized blob is all the data needed for the simulation and the key is simply an integer representing a simulation number. This allows me to use Hadoop (in particular Amazon Elastic Map Reduce) like a grid engine.
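One way to prepare such input is sketched below, using a Hadoop SequenceFile rather than the plain text file described above, since a SequenceFile holds binary values directly (the method and output path are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;

public class BlobPacker {
    // Packs each independently serialized blob under a small integer key,
    // so Hadoop can hand records to mappers without needing a whole file on one host.
    public static void pack(byte[][] blobs, String output) throws Exception {
        Configuration conf = new Configuration();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path(output)),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (int i = 0; i < blobs.length; i++) {
                writer.append(new IntWritable(i), new BytesWritable(blobs[i]));
            }
        }
    }
}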
I think the basic (unhelpful) answer is that you can't really do this, since this runs directly counter to the MapReduce paradigm. Units of input and output for mappers and reducers are records, which are relatively small. Hadoop operates in terms of these, not file blocks on disk.
Are you sure your process needs everything on one host? Anything that I'd describe as a merge can be implemented pretty cleanly as a MapReduce where there is no such requirement.
If you mean that you want to ensure certain keys (and their values) end up on the same reducer, you can use a Partitioner to define how keys are mapped onto reducer instances. Depending on your situation, this may be what you really are after.
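A rough sketch of such a Partitioner (the key layout, with a file id before a ':' separator, is purely an assumption for illustration):

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes all keys sharing the same file-id prefix to the same reducer instance.
public class FileIdPartitioner extends Partitioner<Text, BytesWritable> {
    @Override
    public int getPartition(Text key, BytesWritable value, int numPartitions) {
        String fileId = key.toString().split(":", 2)[0];
        return (fileId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}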
I'll also say it kind of sounds like you are trying to operate on HDFS files, rather than write a Hadoop MapReduce job. So maybe your question is really about how to hold open several SequenceFiles on HDFS, read their records and merge them manually. This isn't a Hadoop question then, but it still doesn't need blocks to be on one host.
Alright. So I have a very large amount of binary data (let's say, 10GB) distributed over a bunch of files (let's say, 5000) of varying lengths.
I am writing a Java application to process this data, and I wish to institute a good design for the data access. Typically what will happen is such:
One way or another, all the data will be read during the course of processing.
Each file is (typically) read sequentially, requiring only a few kilobytes at a time. However, it is often necessary to have, say, the first few kilobytes of each file simultaneously, or the middle few kilobytes of each file simultaneously, etc.
There are times when the application will want random access to a byte or two here and there.
Currently I am using the RandomAccessFile class to read into byte buffers (and ByteBuffers). My ultimate goal is to encapsulate the data access into some class such that it is fast and I never have to worry about it again. The basic functionality is that I will be asking it to read frames of data from specified files, and I wish to minimize the I/O operations given the considerations above.
Examples for typical access:
Give me the first 10 kilobytes of all my files!
Give me byte 0 through 999 of file F, then give me byte 1 through 1000, then give me 2 through 1001, etc, etc, ...
Give me a megabyte of data from file F starting at such and such byte!
Any suggestions for a good design?
Use Java NIO and MappedByteBuffers, and treat your files as a list of byte arrays. Then let the OS worry about the details of caching, reading, flushing, etc.
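A minimal sketch of that idea (the file path, offset and frame length are placeholders; the mapped region must of course lie within the file):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedFrameReader {
    // Memory-maps one frame of a file; the OS page cache handles buffering and read-ahead.
    public static byte[] readFrame(Path file, long offset, int length) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
            byte[] frame = new byte[length];
            map.get(frame);
            return frame;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] firstTenKb = readFrame(Paths.get("data/file1.bin"), 0, 10 * 1024); // hypothetical file
        System.out.println("Read " + firstTenKb.length + " bytes");
    }
}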
#Will
Pretty good results. A quick comparison for reading a large binary file:
Test 1 - Basic sequential read with RandomAccessFile.
2656 ms
Test 2 - Basic sequential read with buffering.
47 ms
Test 3 - Basic sequential read with MappedByteBuffers and further frame buffering optimization.
16 ms
Wow. You are basically implementing a database from scratch. Is there any possibility of importing the data into an actual RDBMS and just using SQL?
If you do it yourself you will eventually want to implement some sort of caching mechanism, so the data you need comes out of RAM if it is there, and you are reading and writing the files in a lower layer.
Of course, this also entails a lot of complex transactional logic to make sure your data stays consistent.
I was going to suggest that you follow up on Eric's database idea and learn how databases manage their buffers—effectively implementing their own virtual memory management.
But as I thought about it more, I concluded that most operating systems already do a better job of implementing file system caching than you can likely do without low-level access in Java.
There is one lesson from database buffer management that you might consider, though. Databases use an understanding of the query plan to optimize the management strategy.
In a relational database, it's often best to evict the most-recently-used block from the cache. For example, a "young" block holding a child record in a join won't be looked at again, while the block containing its parent record is still in use even though it's "older".
Operating system file caches, on the other hand, are optimized to reuse recently used data (and reading ahead of the most recently used data). If your application doesn't fit that pattern, it may be worth managing the cache yourself.
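If you do end up managing the cache yourself, a minimal LRU cache can be built on LinkedHashMap; this is only a sketch, and the String key encoding file-plus-offset is an assumption:

import java.util.LinkedHashMap;
import java.util.Map;

// Keeps the most recently used frames in memory, evicting the least recently used one.
public class FrameCache extends LinkedHashMap<String, byte[]> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    public FrameCache(int maxEntries) {
        super(16, 0.75f, true); // access-order: get() refreshes an entry's recency
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
        return size() > maxEntries; // evict once the cache exceeds its budget
    }
}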
You may want to take a look at an open source, simple object database called jdbm - it has a lot of this kind of thing developed, including ACID capabilities.
I've done a number of contributions to the project, and it would be worth a review of the source code if nothing else to see how we solved many of the same problems you might be working on.
Now, if your data files are not under your control (i.e. you are parsing text files generated by someone else, etc...) then the page-structured type of storage that jdbm uses may not be appropriate for you - but if all of these files are files that you are creating and working with, it may be worth a look.
#Eric
But my queries are going to be much, much simpler than anything I can do with SQL. And wouldn't a database access be much more expensive than a binary data read?
This is to answer the part about minimizing I/O traffic. On the Java side, all you can really do is wrap your input streams in a BufferedInputStream. Aside from that, your operating system will handle other optimizations like keeping recently-read data in the page cache and doing read-ahead on files to speed up sequential reads. There's no point in doing additional buffering in Java (although you'll still need a byte buffer to return the data to the client).
I had someone recommend hadoop (http://hadoop.apache.org) to me just the other day. It looks like it could be pretty nice, and might have some marketplace traction.
I would step back and ask yourself why you are using files as your system of record, and what gains that gives you over using a database. A database certainly gives you the ability to structure your data. Given the SQL standard, it might be more maintainable in the long run.
On the other hand, your file data may not be structured so easily within the constraints of a database. The largest search company in the world :) doesn't use a database for their business processing. See here and here.