I have an Excel sheet with a million rows. Each row has 100 columns.
Each row represents an instance of a class with 100 attributes, and the column values are the values of those attributes.
Which data structure is best suited here for storing the million instances of data?
Thanks
It really depends on how you need to access this data and what you want to optimize for – like, space vs. speed.
If you want to optimize for space, well, you could just serialize and compress the data, but that would likely be useless if you need to read/manipulate the data.
If you access by index, the simplest thing is an array of arrays.
If you instead use an array of objects, where each object holds your 100 attributes, you get a better way to structure your code (encapsulation!).
If you need to query/search the data, it really depends on the kind of queries. You may want to have a look at BST data structures...
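For illustration, here is a minimal sketch of the array-of-objects option mentioned above; the Record class and its handful of fields are hypothetical stand-ins for the 100 real attributes.

    // Array of objects: one Record per spreadsheet row, accessed by row index.
    public class RowStore {
        static class Record {
            long id;            // a few illustrative attributes; the real class
            String name;        // would declare all 100
            double price;
            boolean active;
        }

        public static void main(String[] args) {
            Record[] rows = new Record[1_000_000];  // one object per row
            rows[0] = new Record();
            rows[0].id = 1L;
            rows[0].name = "example";

            Record r = rows[0];                     // O(1) access by row index
            System.out.println(r.id + " " + r.name);
        }
    }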
One million rows with 100 values, where each value uses 8 bytes of memory, is only 800 MB, which will easily fit into the memory of most PCs, especially 64-bit ones. Try to make the type of each column as compact as possible.
A more efficient way of storing the data is by column, i.e. you have an array for each column with a primitive data type. I suspect you don't even need to do this.
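To illustrate the column-oriented layout, here is a minimal sketch with one primitive array per column; the column names and types are assumptions.

    // Column-oriented storage: one primitive array per column, so "row i" is
    // simply index i in every column array.
    public class ColumnStore {
        static final int ROWS = 1_000_000;

        static long[] ids = new long[ROWS];
        static double[] prices = new double[ROWS];
        static short[] quantities = new short[ROWS];   // pick the most compact type per column
        static boolean[] active = new boolean[ROWS];

        public static void main(String[] args) {
            ids[42] = 12345L;      // write row 42
            prices[42] = 9.99;
            quantities[42] = 3;
            active[42] = true;

            System.out.println(ids[42] + " " + prices[42]);   // read row 42 back
        }
    }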
If you have many more rows, e.g. billions, you can use off-heap memory, i.e. memory-mapped files and direct memory. This can efficiently store more data than you have main memory while keeping your heap relatively small (e.g. 100s of GB off-heap with 1 GB in heap).
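Here is a minimal sketch of the memory-mapped approach, assuming a fixed layout of 100 doubles per row and a hypothetical file name; the OS pages the data in and out, so the Java heap stays small.

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class OffHeapStore {
        static final int COLS = 100;
        static final int ROW_BYTES = COLS * Double.BYTES;   // 800 bytes per row

        public static void main(String[] args) throws Exception {
            long rows = 1_000_000L;
            try (RandomAccessFile file = new RandomAccessFile("data.bin", "rw");
                 FileChannel channel = file.getChannel()) {

                // Map the whole file into off-heap memory (800 MB here).
                MappedByteBuffer buf =
                    channel.map(FileChannel.MapMode.READ_WRITE, 0, rows * ROW_BYTES);

                // Write column 5 of row 42, then read it back.
                long offset = 42L * ROW_BYTES + 5 * Double.BYTES;
                buf.putDouble((int) offset, 3.14);
                System.out.println(buf.getDouble((int) offset));
            }
        }
    }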
If you want to store all the data in memory, you can use one of the implementations of Table from Guava, typically ArrayTable for dense tables or HashBasedTable if most cells are expected to be empty. Otherwise, a database (probably with some cache system like Ehcache or Terracotta) would be a better bet.
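For example, a minimal sketch using Guava's Table (assuming Guava is on the classpath; the row/column keys shown are hypothetical):

    import com.google.common.collect.ArrayTable;
    import com.google.common.collect.HashBasedTable;
    import com.google.common.collect.Table;

    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    public class GuavaTables {
        public static void main(String[] args) {
            List<Integer> rowKeys = IntStream.range(0, 1000).boxed().collect(Collectors.toList());
            List<String> colKeys = List.of("id", "name", "price");   // a few of the 100 columns

            // Dense table: every cell is allocated up front.
            Table<Integer, String, Object> dense = ArrayTable.create(rowKeys, colKeys);
            dense.put(42, "price", 9.99);

            // Sparse table: only non-empty cells consume memory.
            Table<Integer, String, Object> sparse = HashBasedTable.create();
            sparse.put(42, "name", "example");

            System.out.println(dense.get(42, "price") + " / " + sparse.get(42, "name"));
        }
    }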
Your best option would be to store them in a table in an actual database, like Postgres etc. These are optimised to work for what you are talking about!
For that kind of data I would prefer a MySQL database, because it is fast and can handle a data set that large.
The best option would be a database that can store a large amount of data and is fast enough for quick access, like Oracle, MS SQL Server, MySQL, or any other database that is fast and can store large amounts of data.
If you really have a million rows or more with 100 values each, I doubt it will all fit into your memory... or is there a special reason for it, for example poor performance using a database?
Since you want to have random access, I'd use a persistence provider like Hibernate and some database you like (for example MySQL).
But be aware that the way you use the persistence provider has a great impact on performance. For example you should use batch-inserts (which are incompatible with autogenerated ids).
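A minimal sketch of the batch-insert pattern, assuming mapped entities and a configured SessionFactory, with hibernate.jdbc.batch_size set to a matching value (e.g. 50):

    import org.hibernate.Session;
    import org.hibernate.SessionFactory;
    import org.hibernate.Transaction;

    public class BatchInsert {
        private static final int BATCH_SIZE = 50;   // should match hibernate.jdbc.batch_size

        public static void insertAll(SessionFactory sessionFactory, Iterable<?> entities) {
            Session session = sessionFactory.openSession();
            Transaction tx = session.beginTransaction();
            int i = 0;
            for (Object entity : entities) {
                session.persist(entity);
                if (++i % BATCH_SIZE == 0) {
                    session.flush();   // push the current batch of inserts to the database
                    session.clear();   // detach persisted objects so the session stays small
                }
            }
            tx.commit();
            session.close();
        }
    }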
Related
I am working on an application that might potentially get thousands and thousands of messages (perhaps millions). And I want to store these messages in a hashtable for easy lookup, since each message has an ID. Is this a good idea? If not, what's the best data structure or way to go about this? Thank you.
Is this a good idea?
Keeping an unbounded amount of data in an in-memory data structure is a bad idea. You will eventually run out of memory, and your application will crash.
If you are able to discard old "messages" so that you can place a reasonable bound on the amount of memory the application needs, then this could be a viable solution.
However, as the comments point out, there are other solutions (distributed memory caches, SQL databases, NoSQL databases, etcetera) that could well be better, depending on how much data there is and how fast access really needs to be.
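If a bounded in-memory store is acceptable, one minimal sketch is an LRU map built on LinkedHashMap; the Long/String message types and the size limit are assumptions.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class BoundedMessageCache extends LinkedHashMap<Long, String> {
        private final int maxEntries;

        public BoundedMessageCache(int maxEntries) {
            super(16, 0.75f, true);          // true = access order, i.e. LRU behaviour
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<Long, String> eldest) {
            return size() > maxEntries;      // evict the least recently used entry beyond the bound
        }

        public static void main(String[] args) {
            BoundedMessageCache cache = new BoundedMessageCache(100_000);
            cache.put(1L, "first message");
            System.out.println(cache.get(1L));
        }
    }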
Using a Map (data stored in main memory) is simple, but it should be the least preferred and least realistic option, as you need to implement/reinvent the logic for data expiration, clustering, etc. yourself.
Using a caching framework (data stored in main memory): choose this only if you have an idea of how much data there is and how long it needs to reside in the cache (i.e., when the data can expire and be removed). This option limits the data size to the maximum size of the JVM heap space.
Using a database (data stored on disk): this is the ideal option for holding millions of entries, but it comes at a cost, as disk operations take more time than in-memory operations.
I have a Java program which calculates turn-by-turn navigation data (route directions). The code starts by loading map information (lat-longs, paths, etc.) from flat files (generated from a database), which amounts to 6 GB. Once all the information is loaded, the route is calculated using the loaded data, in turn providing the turn-by-turn navigation.
Is there a better design for such applications, which involve calculations over a large number of objects, to lessen overall memory consumption?
It depends on how you're using the loaded data. I think you may be able to use java.io.RandomAccessFile http://docs.oracle.com/javase/7/docs/api/java/io/RandomAccessFile.html. It allows you to read through/modify files, without loading them into memory.
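A minimal sketch of that idea, assuming fixed-size records of two doubles (lat and long) in a hypothetical flat file:

    import java.io.RandomAccessFile;

    public class CoordinateFile {
        private static final int RECORD_BYTES = 2 * Double.BYTES;   // lat + long

        public static double[] readCoordinate(String path, long index) throws Exception {
            try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
                file.seek(index * RECORD_BYTES);   // jump straight to record i, no full load
                double lat = file.readDouble();
                double lon = file.readDouble();
                return new double[] { lat, lon };
            }
        }
    }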
You can index the map information in a graph database, e.g. Neo4J - this should make it easier to only load the relevant part of the map needed for the route calculation, rather than having to load the entire map into memory.
You can use a more traditional relational database instead of a graph database, but you'll probably need to come up with a reasonable schema for your data rather than relying on the database to take care of indexing etc for you.
I'm in the early stages of a web project which will require working with arrays containing around 500 elements of a custom object type. Objects will likely contain between 10 and 40 fields (based on user input), mostly booleans, strings and floats. I'm going to use PHP for this project, but I'm also interested to know how to treat this problem in Java.
I know that "premature optimization is the root of all evil", but I think I need to decide now how to handle those arrays. Do I keep them in the session object, or do I store them in the database (MySQL) and keep just a minimal set of keys in the session? Keeping the data in the session would make the application work faster, but when visitor numbers start growing I risk using up too much memory. On the other hand, reading from and writing to the database all the time will degrade performance.
I'd like to know where the line is between those two approaches. How do I decide when it's too much data to keep inside session?
When I face a problem like this, I try to estimate the size of the per-user data that I want to keep fast.
In your case, suppose for example you have 500 elements with 40 fields each, every field averaging 50 bytes (averaging over texts, numbers, dates, etc.). That is about 1 MB per user for this storage, so you will need about 1 GB for every 1,000 users just for this cache.
Depending on your server's resource availability you can find the bottlenecks: 1,000 users consume CPU, memory, DB and disk accesses; so in this scenario, is 1 GB the problem? If yes, keep the data in the DB; if not, keep it in memory.
Another option is to use an in-memory DB or a distributed cache solution that does it all for you, at some cost:
architectural complexity
possibly licence costs
I would be surprised if you had that amount of unique data for each user. Ideally, some of this data would be shared across users, and you could have some kind of application-level cache that stores the most recently used entries, and transparently fetches them from the database if they're missing.
This kind of design is relatively straightforward to implement in Java, but somewhat more involved (and possibly less efficient) with PHP since it doesn't have built-in support for application state.
I have a MySQL database on the local machine where I'm running the Java program.
I plan to create an ArrayList of all the entries of a particular table. From this point onwards I will not access the database to get a particular entry from the table; instead I will use the ArrayList. Is this going to be faster or slower than accessing the database to grab a particular entry from the table?
Please note that the table I'm interested has about 2 million entries.
Thank you.
More info: I need only two fields, one of type Long and one of type String. The index of the table is Long, not int.
No, it's going to be much slower, because to find an element in an ArrayList you have to scan the ArrayList sequentially until your element is found.
It can be faster for a few hundred entries, because you don't have the connection overhead, but with two million entries MySQL is going to win, provided that you create the correct indexes. Only retrieve the rows that you actually need each time.
Why are you thinking of doing this? Are you experiencing slow queries?
To find out, in your my.cnf activate the slow query log, by uncommenting (or adding) the following lines.
# Here you can see queries with especially long duration
log_slow_queries = /var/log/mysql/mysql-slow.log
long_query_time = 1
Then see which queries take a long time, and run them with EXPLAIN in front; consider adding an index where the EXPLAIN command tells you that it is not using indexes, or just post a new question with your CREATE TABLE statement and your example query to optimize.
This question is too vague, and can easily go either way depending on:
How many fields in each record, how big are the fields?
What kind of access are you going to perform? Text search? Sequential?
For example, if each record consists of a couple of bytes of data, it's much faster to store them all in memory (not necessarily an ArrayList though). You may want to put them into a TreeSet, for example.
It depends on what you will do with the data. If you just want a few rows, only those should be fetched from the DB. If you know that you need ALL the data, go ahead and load the whole table into Java if it can fit in memory. What will you do with it after? Sequential or random reading? Will the data be changed? A Map or Set could be a faster alternative, depending on how the collection will be used.
Whether it is faster or slower is measurable. Time it. It is definitely faster to work with structures stored in memory than it is to work with data tables located on the disk. That is if you have enough memory and if you do not have 20 users running the same process at the same time.
How do you access the data? Do you have an integer index?
First, accessing an ArrayList is much faster than accessing a database: accessing memory is much faster than accessing a hard disk.
If the number of entries in the array is big, and I guess it is, then you should consider using a "direct access" data structure such as a HashMap, which will act like a database table where values are referenced by their keys.
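For the two fields in question (a Long key and a String value), a minimal sketch of loading the table into a HashMap via JDBC might look like this; the JDBC URL, credentials and table/column names are hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.HashMap;
    import java.util.Map;

    public class TableCache {
        public static Map<Long, String> load() throws Exception {
            Map<Long, String> byId = new HashMap<>(2_000_000);
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/mydb", "user", "password");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT id, name FROM my_table")) {
                while (rs.next()) {
                    byId.put(rs.getLong("id"), rs.getString("name"));   // keyed by the Long index
                }
            }
            return byId;
        }
    }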
I'm developing a Java application that has performance at its core.
I have a list of some 40,000 "final" objects, i.e., I have initialization input data of 40,000 vectors. This data is unchanged throughout the program's run.
I am always performing lookups against a single ID property to retrieve the proper vectors.
Currently I am using a HashMap over a sub-sample of 1,000 vectors, but I'm not sure it will scale to production.
When is big actually big enough to warrant a DB?
One more thing: an SQLite DB is a viable option, as no concurrency is involved, so I guess the "threshold" for DB use is perhaps lower.
I think you're asking whether a HashMap with 40,000 entries in will be okay. The answer is yes - unless you really don't have enough memory, that should be absolutely fine. If you're writing a performance-sensitive app, then putting a large amount of fast memory in the machine running the app is likely to be an efficient way of boosting performance anyway.
There won't be very much overhead for each HashMap entry, so if you've got enough space to store the objects themselves in memory, it's unlikely that the overhead of the map would cause a problem.
Is there any reason why you can't just test this with a reasonable amount of data?
If you really have no more requirements than:
Read data at start-up
Put data in a map by a single ID (no need for joins, queries against different fields, substring matches etc)
Fetch data from map
... then using a full-blown database would be a huge amount of overkill, IMO.
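A minimal sketch of that read-at-start-up, look-up-by-ID pattern; the double[] vector representation is an assumption.

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    public class VectorLookup {
        private final Map<Long, double[]> vectorsById;

        public VectorLookup(Map<Long, double[]> initialData) {
            // Loaded once at start-up and never modified afterwards.
            this.vectorsById = Collections.unmodifiableMap(new HashMap<>(initialData));
        }

        public double[] byId(long id) {
            return vectorsById.get(id);   // O(1) lookup against the single ID property
        }
    }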
As long as you're loading the data set in a memory at the beginning of the program and keeping it in memory and you don't have any complex queries, some sort of serialization/deserialization seems to be more feasible than a full blown database.
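For example, a minimal sketch of plain Java serialization for the whole data set (the file name and the map-of-vectors representation are assumptions):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.util.HashMap;

    public class SnapshotStore {
        public static void save(File file, HashMap<Long, double[]> data) throws Exception {
            try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
                out.writeObject(data);   // write the whole map in one go
            }
        }

        @SuppressWarnings("unchecked")
        public static HashMap<Long, double[]> load(File file) throws Exception {
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
                return (HashMap<Long, double[]>) in.readObject();   // read it back at start-up
            }
        }
    }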
You could start using a DB with as few as 100 entries (or fewer). There is no general rule for when the amount of data is large enough to store in a database. It's more a question of whether storing the data in a database will give you any benefit (performance boost, easier programming, more flexible options for your users).
When the benefits are greater than the cost of implementation put it in a database.
There is no set size for a Collection vs a Database. It depends heavily on what you want to do with the data. Size is less important.
You can have a Map with a billion entries.
There's no such thing as 'big enough for a database'. The question is whether there are enough advantages in using a database to overcome the costs.
Having said that, 40,000 isn't 'big' ;-) Unless the objects are huge or you have complex query requirements I would start with an in-memory implementation. But if you expect to scale this number up over time it might be better to use the database from the beginning.
One option that you might want to consider is the Oracle Berkeley DB Java Edition library. It's a simple JAR file that can read/write data to persistent storage. Because of its small footprint and ease of use, it's used for applications running on small to very large data sets. It's designed to be linked into the application, so that it's embedded and doesn't require a complex client/server installation or protocol stacks.
What's even better is that it's extremely scalable (which works well if you end up with larger data sets than you expect), is very fast, and supports both a Java Collections API and a Direct Persistence Layer API (POJO-like). So you can use it seamlessly with Java Collections.
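For a flavour of the library, here is a minimal sketch of the base key/value API, assuming the je JAR is on the classpath and using a hypothetical environment directory and database name:

    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseConfig;
    import com.sleepycat.je.DatabaseEntry;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;
    import com.sleepycat.je.LockMode;

    import java.io.File;
    import java.nio.charset.StandardCharsets;

    public class BerkeleyExample {
        public static void main(String[] args) {
            File dir = new File("bdb-env");
            dir.mkdirs();                               // the environment directory must exist

            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setAllowCreate(true);
            Environment env = new Environment(dir, envConfig);

            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            Database db = env.openDatabase(null, "vectors", dbConfig);

            // Store one key/value pair, then read it back.
            DatabaseEntry key = new DatabaseEntry("42".getBytes(StandardCharsets.UTF_8));
            DatabaseEntry value = new DatabaseEntry("some payload".getBytes(StandardCharsets.UTF_8));
            db.put(null, key, value);

            DatabaseEntry found = new DatabaseEntry();
            db.get(null, key, found, LockMode.DEFAULT);
            System.out.println(new String(found.getData(), StandardCharsets.UTF_8));

            db.close();
            env.close();
        }
    }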
Berkeley DB Java Edition was designed specifically with Java application developers in mind. It's designed to be simple to use, light weight in terms of resources required, but very fast, scalable and reliable.
You can find more information about Oracle Berkeley DB Java Edition here
Regards,
Dave