Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I am studying Data Structures in a Fundamentals of Software Development course. I have encountered the following data structures:
Structs
Arrays
Lists
Queues
Hash Tables
... among others. I have a good understanding of how they work, but I'm struggling to understand when and where to use them.
I can identify the use of the Queue Data structure, as this would be helpful in printer and/or thread queuing and prioritizing.
Knowing the strengths and weaknesses of a data structure and implementing it in code are different things, and I am finding the former difficult.
What is a simple example of the use of each of the data structures listed above?
For example:
Queue: first-in, first-out → used for printer queue to queue docs
I had trouble understanding them when I first started programming, so I decided to give you a head start.
I am trying to keep this as simple as possible. See the Oracle docs for further details.
Struct: whenever you need an object-like structure that groups related data, use a struct. Java has no structs, though; you define a class and create objects instead.
Arrays: arrays are contiguous memory. Whenever you want constant-time access by index, use an array; unlike a linked list, indexed access is very fast.
The drawback of arrays is that you need to know the size at initialization time. Arrays also do not support higher-level methods such as add(), remove(), clear(), contains(), indexOf(), etc.
List: an interface that can be implemented on top of an array (ArrayList) or a linked list (LinkedList). Both support all the higher-level methods mentioned above.
Lists also resize themselves when they run out of space. For an ArrayList you can specify the initial capacity of the underlying array; when that limit is reached, it allocates a bigger array and copies the existing contents over. A LinkedList has no capacity to set: it simply allocates a new node for each added element.
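To make the array-versus-list contrast concrete, here is a minimal sketch. The class and method names are invented for this example; only the java.util calls are real.

```java
import java.util.ArrayList;
import java.util.List;

class ArrayVersusList {
    // A plain array: the length is fixed when it is created.
    static int[] fixedScores() {
        int[] scores = new int[3];
        scores[0] = 90;
        scores[1] = 75;
        scores[2] = 60;
        return scores;
    }

    // An ArrayList: grows on demand and offers add/remove/contains/indexOf.
    static List<String> growingNames() {
        List<String> names = new ArrayList<>(2); // initial capacity of the backing array
        names.add("Ann");
        names.add("Bob");
        names.add("Cy");      // exceeds the initial capacity, so the list grows itself
        names.remove("Bob");  // later elements shift left
        return names;
    }

    public static void main(String[] args) {
        System.out.println(fixedScores()[1]); // indexed access, prints 75
        System.out.println(growingNames());   // prints [Ann, Cy]
    }
}
```

The array is the better fit when the size is known up front and you mostly read by index; the list is the better fit when elements come and go.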
Queue or Stack: these are access disciplines (abstract data types) rather than concrete storage layouts. If you want FIFO behaviour, you implement a queue on top of either an array or a linked list (yes, both work); a stack is the same idea with LIFO order.
https://en.wikibooks.org/wiki/Data_Structures/Stacks_and_Queues
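As a small illustration of queue versus stack behaviour, the sketch below uses java.util.ArrayDeque (an array-backed structure) for both. The class name, method names, and sample data are made up for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class QueueAndStackDemo {
    // FIFO: documents are printed in the order they were submitted.
    static String printOrder() {
        Deque<String> queue = new ArrayDeque<>();
        queue.addLast("a.pdf");
        queue.addLast("b.pdf");
        queue.addLast("c.pdf");
        StringBuilder out = new StringBuilder();
        while (!queue.isEmpty()) {
            out.append(queue.pollFirst()).append(' '); // take from the front
        }
        return out.toString().trim();
    }

    // LIFO: the most recent action is undone first.
    static String undoOrder() {
        Deque<String> stack = new ArrayDeque<>();
        stack.push("type");
        stack.push("bold");
        stack.push("paste");
        StringBuilder out = new StringBuilder();
        while (!stack.isEmpty()) {
            out.append(stack.pop()).append(' '); // take from the top
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(printOrder()); // a.pdf b.pdf c.pdf
        System.out.println(undoOrder());  // paste bold type
    }
}
```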
HashMap: use a HashMap whenever you want to store key-value pairs. Arrays, linked lists, and the other structures mentioned above do not give you lookup by key directly. A key can be anything from a String to any other Object (note that it must be an object and cannot be a primitive, although primitives are autoboxed for you), and a value can likewise be any object.
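A minimal sketch of key-value storage follows. The phone-extension data and the class name are hypothetical; the point is only lookup by key and what happens on a repeated key.

```java
import java.util.HashMap;
import java.util.Map;

class PhoneBookDemo {
    static Map<String, Integer> extensions() {
        Map<String, Integer> byName = new HashMap<>();
        byName.put("alice", 101); // the int is autoboxed to an Integer
        byName.put("bob", 102);
        byName.put("alice", 111); // same key: the old value is replaced
        return byName;
    }

    public static void main(String[] args) {
        Map<String, Integer> byName = extensions();
        System.out.println(byName.get("alice")); // 111
        System.out.println(byName.get("carol")); // null, no such key
    }
}
```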
Search for each data structure for more details.
It depends on what you need. If you read and learn more about these data structures you will find convenient ways for their implementation.
Maybe read this book? http://www.amazon.com/Data-Structures-Abstraction-Design-Using/dp/0470128704
All these data structures are used according to what the program needs. Try to find the advantages of one data structure over another; that should make things clearer for you. What I say may not be very clear, but I'll give it a shot.
Like for example,
Structs are used to create a data type. Say you want a data type for Book that holds the name of the book.
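In Java, the closest thing to such a struct is a small class whose only job is to group fields. The sketch below is one way to write it; the extra author and pages fields are assumptions added for illustration.

```java
// A struct-like class: it only groups related data, with no behaviour beyond toString.
final class Book {
    final String title;
    final String author; // assumed extra field, for illustration
    final int pages;     // assumed extra field, for illustration

    Book(String title, String author, int pages) {
        this.title = title;
        this.author = author;
        this.pages = pages;
    }

    @Override
    public String toString() {
        return title + " by " + author + " (" + pages + " pages)";
    }

    public static void main(String[] args) {
        Book book = new Book("Example Title", "A. Writer", 200);
        System.out.println(book.title); // fields are read directly, as with a struct
        System.out.println(book);
    }
}
```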
Lists are easier to traverse in both directions if you use (doubly) linked lists, and are sometimes a better fit than arrays. Queues you can imagine as real-life queues: first in, first out. Use them when you need to process items in arrival order.
Like I said, looking for advantages of one over the other should get things clear for you.
I know priority queues tend to use heaps, but what is the point of a priority queue when they basically seem the same as heaps? I initially thought all priority queues used hash maps to keep track of each object's location in the heap, making finding and updating/deleting said object easier. However, I have used Java's priority queue and you have to manually iterate over it to update or delete objects not at the root. It seems odd to have a priority queue that appears to literally just be a heap with nothing else special about it.
It might help to reason by analogy here:
List is to dynamic array as PriorityQueue is to binary heap.
That is, the abstract idea of a list (a sequence of things starting at position zero where items can be inserted and removed) is a nice, high-level concept, while a dynamic array (an array along with a capacity that doubles or 1.5x’s in size if extra space is needed) is one possible way of implementing a list. If you’re using a list, you can just think “oh, it’s a sequence, and I can put things places” without worrying how that sequence is actually represented. On the other hand, working with a dynamic array requires you to track which array elements are valid versus which ones don’t actually get used, you need to manually transfer things over when there’s no more space and think carefully about your growth strategy, etc. The distinction here is at what level you’re viewing things. If you just need “a sequence,” think “list.” If you need to build a type from scratch representing a sequence, think “dynamic array.”
This is basically what’s going on with priority queues versus binary heaps. A priority queue abstractly represents the idea of “I can put things in and they’ll come back in sorted order.” A binary heap is one specific possible way of implementing a priority queue. When working with an abstract priority queue, you can focus your thoughts purely on questions like “what elements do I want to add?” and “how do I rank them?” When working with a binary heap, you have to think about things like “do I use one-indexing or zero indexing?” and “what’s the formula for identifying the children of a node at index k?” If all you need is the ability to put things in a bag and take them out in sorted order, you can use a priority queue without worrying about how it works. If you need to build one from scratch, you can use a binary heap.
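As a rough illustration of the bookkeeping a binary heap asks of you, here is a minimal sketch of the index arithmetic for the zero-indexed layout. The class and method names are invented for this example.

```java
class HeapIndexMath {
    // Zero-indexed array layout: the root lives at index 0.
    static int leftChild(int k)  { return 2 * k + 1; }
    static int rightChild(int k) { return 2 * k + 2; }
    static int parent(int k)     { return (k - 1) / 2; }

    // Checks the min-heap property: no element is smaller than its parent.
    static boolean isMinHeap(int[] a) {
        for (int k = 1; k < a.length; k++) {
            if (a[parent(k)] > a[k]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(leftChild(1) + " " + rightChild(1));     // 3 4
        System.out.println(isMinHeap(new int[] {1, 3, 2, 7, 4}));  // true
    }
}
```

None of this arithmetic is visible when you use a priority queue through its abstract interface, which is the point of the analogy.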
Going back to the list versus dynamic array analogy: there are many types you can use to represent lists. Dynamic arrays are one, but you could also use a circular buffer (good if you add or remove from the ends) or a linked list (good if items get moved around between lists). Similarly, there are many ways you can implement a priority queue. Binary heaps are one option, but there’s also pairing heaps, binomial heaps, etc. Keeping the relevant abstraction in focus - I just want a sequence of things, I just want a way to retrieve things in sorted order - means that you don’t need to worry as much about how things work when what you care about is what operations you want to do.
Personal opinion: you are right, Java's PriorityQueue is a heap. Java is a high-level programming language, so it is reasonable for it to provide implementations of the common, standard algorithms; most of the time we focus on business logic and on getting the job done faster. We don't want to spend too much time building a priority queue from the ground up, and doing it yourself is tedious and error-prone.
If you want to update an object without iterating over the queue manually, you can remove it and add it back:
Object updatedObject; // the element whose priority changed
priorityQueue.remove(updatedObject); // returns a boolean, not the element
priorityQueue.add(updatedObject); // re-insert so it is ordered by its new priority
This is not efficient when updates occur frequently, because remove(Object) scans the queue linearly. A Fibonacci heap is an alternative structure that handles priority updates better.
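A fuller sketch of the remove-then-add update is below. The Task class, its fields, and the sample priorities are hypothetical; the PriorityQueue calls are the standard java.util ones. Removing the element before changing its priority keeps the heap consistent at every step.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

class PriorityUpdateDemo {
    // Hypothetical element type with a mutable priority.
    static final class Task {
        final String name;
        int priority;

        Task(String name, int priority) {
            this.name = name;
            this.priority = priority;
        }
    }

    static String headAfterUpdate() {
        PriorityQueue<Task> queue =
                new PriorityQueue<>(Comparator.comparingInt((Task t) -> t.priority));
        Task a = new Task("a", 5);
        Task b = new Task("b", 3);
        Task c = new Task("c", 8);
        queue.add(a);
        queue.add(b);
        queue.add(c);

        queue.remove(c); // linear scan to find c, then the heap is repaired
        c.priority = 1;  // change the priority while c is outside the queue
        queue.add(c);    // re-insert; c is placed according to its new priority

        return queue.peek().name;
    }

    public static void main(String[] args) {
        System.out.println(headAfterUpdate()); // c
    }
}
```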
It seems odd to have a priority queue that appears to literally just be a heap with nothing else special about it.
Why?
Nothing about the name PriorityQueue promises anything more than the ability to put items in one end and get them out the other in sorted-by-priority order. That's also basically the definition of a heap, which is why a heap makes an ideal data structure to implement a priority queue.
So, essentially, the Java Collections Framework designers implemented a heap. Only instead of calling it a Heap, they called it a PriorityQueue. End of story. As the song lyric goes: "Who could ask for anything more?"
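Java's PriorityQueue can act as either a min-heap or a max-heap. By default it is a min-heap under natural ordering; constructed with a reversed comparator it behaves as a max-heap. Depending on how you construct it, the head is always the min or the max value.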
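For instance, a minimal sketch of both constructions (the class and method names are made up for the example):

```java
import java.util.Comparator;
import java.util.PriorityQueue;

class MinMaxHeapDemo {
    // Default construction: natural ordering, so the head is the smallest value.
    static int smallest(int... values) {
        PriorityQueue<Integer> minHeap = new PriorityQueue<>();
        for (int v : values) {
            minHeap.add(v);
        }
        return minHeap.peek();
    }

    // Reversed comparator: the head is the largest value.
    static int largest(int... values) {
        PriorityQueue<Integer> maxHeap =
                new PriorityQueue<>(Comparator.<Integer>reverseOrder());
        for (int v : values) {
            maxHeap.add(v);
        }
        return maxHeap.peek();
    }

    public static void main(String[] args) {
        System.out.println(smallest(5, 1, 4, 2)); // 1
        System.out.println(largest(5, 1, 4, 2));  // 5
    }
}
```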
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 6 years ago.
What I think I'm looking for is a NoSQL, library-embedded, on-disk (i.e. not in-memory) database that's accessible from Java (and preferably runs inside my instance of the JVM). That's not really much of a database, and I'm tempted to roll my own. Basically I'm looking for the "should we keep this in memory or put it on disk" portion of a database.
Our model has grown to several gigabytes. Right now this is all done in memory, meaning we're pushing the JVM for upward of several gigabytes. It's currently all stored in a flat XML file, serialized and deserialized with XStream and compressed with Java's built-in gzip libraries. That worked well while our model stayed under 100MB, but now that it's larger than that it's becoming a problem.
Loosely speaking, that model can be broken down as:
Project
  configuration component (directed acyclic graph), not at all database-friendly
  a list of a dozen "experiment" structures
    each containing a list of about a dozen "run-model" structures
      each run-model contains hundreds of megs of data. Once written, they are never edited.
What I'd like to do is have something that conforms to a map interface, of guid -> run-model. This mini-database would keep a flat table of these objects. On our experiment model, we would replace the list of run-models with a list of guids, and add, at the application layer, a get call to this map, which would pull it off the disk and into memory.
That means we can keep configuration of our program in XML (which I'm very happy with) and keep a table of the big data in a DBMS that will keep us from consuming multi-GB of memory. On program start and exit I could then load and unload the two portions of our model (the config section in XML, and the run-models in the database format) from an archiving format.
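For a sense of how small a roll-your-own version of that guid -> run-model store could be, here is a minimal sketch. It is built on assumptions that are not part of the description above: it uses plain Java serialization rather than XStream, writes one file per GUID, and the FileBackedStore name and its methods are invented. It ignores compaction, caching, and concurrent access.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Hypothetical write-once store: each value lives in its own file named by a GUID.
class FileBackedStore<V extends Serializable> {
    private final Path dir;

    FileBackedStore(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
    }

    // Serializes the value to disk and returns the GUID to keep in the in-memory model.
    UUID put(V value) throws IOException {
        UUID id = UUID.randomUUID();
        try (ObjectOutputStream out =
                     new ObjectOutputStream(Files.newOutputStream(dir.resolve(id.toString())))) {
            out.writeObject(value);
        }
        return id;
    }

    // Loads the value back into memory on demand; returns null for an unknown GUID.
    @SuppressWarnings("unchecked")
    V get(UUID id) throws IOException, ClassNotFoundException {
        Path file = dir.resolve(id.toString());
        if (!Files.exists(file)) {
            return null;
        }
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (V) in.readObject();
        }
    }

    // Convenience for the demo: store a payload in a temp directory and read it back.
    static String roundTrip(String payload) {
        try {
            FileBackedStore<String> store =
                    new FileBackedStore<>(Files.createTempDirectory("run-models"));
            return store.get(store.put(payload));
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("run-model data"));
    }
}
```

Because run-models are never edited after being written, a write-once file per GUID is enough for the sketch; a real library would add the parts left out here.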
I'm sort of feeling gung-ho about this, and think that I could probably implement it with some of XStream's XML inspection strategies and a custom map implementation, but a voice in the back of my head is telling me I should find a library to do it instead.
Should I roll my own or is there a database that's small enough to fit this bill?
Thanks guys,
-Geoff
http://www.mapdb.org/
Also take a look at this question: Alternative to BerkeleyDB?
Since MapDB is a possible solution for your problem, Chronicle Map is also worth considering. It's an embeddable Java key-value store, optionally persistent, offering a very similar programming model to MapDB: it also works via the vanilla java.util.Map interface, with transparent serialization of keys and values.
The major difference is that, according to third-party benchmarks, Chronicle Map is faster than MapDB.
Regarding stability, no bugs were reported about the Chronicle Map data storage for months now, while it is in active use in many projects.
Disclaimer: I'm the developer of Chronicle Map.
If this question were put to you about a program you wrote, what sort of response would they be looking for? I'm a coder with only a fairly basic understanding.
Your question is not very clear, but it seems that you are being asked about what data structures you used that fit the given requirements in order to implement your solution (e.g., a hashtable, tree, etc.)
I've currently got a spreadsheet type program that keeps its data in an ArrayList of HashMaps. You'll no doubt be shocked when I tell you that this hasn't proven ideal. The overhead seems to use 5x more memory than the data itself.
This question asks about efficient collections libraries, and the answer was use Google Collections. My follow up is "which part?". I've been reading through the documentation but don't feel like it gives a very good sense of which classes are a good fit for this. (I'm also open to other libraries or suggestions).
So I'm looking for something that will let me store dense spreadsheet-type data with minimal memory overhead.
My columns are currently referenced by Field objects, rows by their indexes, and values are Objects, almost always Strings.
Some columns will have a lot of repeated values.
Primary operations are updating or removing records based on the values of certain fields, and also adding/removing/combining columns.
I'm aware of options like H2 and Derby but in this case I'm not looking to use an embedded database.
EDIT: If you're suggesting libraries, I'd also appreciate it if you could point me to a particular class or two in them that would apply here. Whereas Sun's documentation usually includes information about which operations are O(1), which are O(N), etc, I'm not seeing much of that in third-party libraries, nor really any description of which classes are best suited for what.
Some columns will have a lot of repeated values
immediately suggests to me the possible use of the Flyweight pattern, regardless of the solution you choose for your collections.
Trove collections pay particular attention to the space they occupy (I think they also have tailored data structures if you stick to primitive types); take a look at them.
Otherwise you can try Apache Commons Collections. Just run your own benchmarks!
In any case, if you have many references to the same elements, try to design a suitable pattern (like flyweight).
Chronicle Map could have overhead of less than 20 bytes per entry (see a test proving this). For comparison, java.util.HashMap's overhead varies from 37-42 bytes with -XX:+UseCompressedOops to 58-69 bytes without compressed oops (reference).
Additionally, Chronicle Map stores keys and values off-heap, so it doesn't store Object headers, which are not accounted as HashMap's overhead above. Chronicle Map integrates with Chronicle-Values, a library for generation of flyweight implementations of interfaces, the pattern suggested by Brian Agnew in another answer.
So I'm assuming that you have a map of Map<ColumnName,Column>, where the column is actually something like ArrayList<Object>.
A few possibilities -
Are you completely sure that memory is an issue? If you're just generally worried about size it'd be worth confirming that this will really be an issue in a running program. It takes an awful lot of rows and maps to fill up a JVM.
You could test your data set with different types of maps in the collections. Depending on your data, you can also initialize maps with preset size/load factor combinations that may help. I've messed around with this in the past, you might get a 30% reduction in memory if you're lucky.
What about storing your data in a single matrix-like data structure (an existing library implementation or something like a wrapper around a List of Lists), with a single map that maps column keys to matrix columns?
Assuming all your rows have mostly the same columns, you can use an array for each row, plus a Map<ColumnKey, Integer> to look up which column refers to which cell index. This way you have only 4-8 bytes of overhead per cell.
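A minimal sketch of that layout is below. It uses String column names in place of the ColumnKey type, and the RowArrayTable class and its methods are invented for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RowArrayTable {
    // One shared map from column name to array position, instead of one map per row.
    private final Map<String, Integer> columnIndex = new HashMap<>();
    private final List<Object[]> rows = new ArrayList<>();

    RowArrayTable(String... columns) {
        for (int i = 0; i < columns.length; i++) {
            columnIndex.put(columns[i], i);
        }
    }

    void addRow(Object... cells) {
        if (cells.length != columnIndex.size()) {
            throw new IllegalArgumentException("expected " + columnIndex.size() + " cells");
        }
        rows.add(cells.clone()); // each row costs one array reference per cell
    }

    Object get(int row, String column) {
        return rows.get(row)[indexOf(column)];
    }

    void set(int row, String column, Object value) {
        rows.get(row)[indexOf(column)] = value;
    }

    private int indexOf(String column) {
        Integer index = columnIndex.get(column);
        if (index == null) {
            throw new IllegalArgumentException("unknown column: " + column);
        }
        return index;
    }

    public static void main(String[] args) {
        RowArrayTable table = new RowArrayTable("name", "city");
        table.addRow("Ann", "Oslo");
        table.addRow("Bob", "Rome");
        System.out.println(table.get(1, "city")); // Rome
    }
}
```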
If Strings are often repeated, you could use a String pool to reduce duplication of strings. Object pools for other immutable types may be useful in reducing memory consumed.
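One way to sketch such a String pool is a map from each value to its first-seen instance. The StringPool class and its canonical method are hypothetical names for this example.

```java
import java.util.HashMap;
import java.util.Map;

class StringPool {
    private final Map<String, String> pool = new HashMap<>();

    // Returns the first instance seen for this value, so equal cells share one object.
    String canonical(String value) {
        if (value == null) {
            return null;
        }
        String existing = pool.putIfAbsent(value, value);
        return existing != null ? existing : value;
    }

    int size() {
        return pool.size();
    }

    public static void main(String[] args) {
        StringPool pool = new StringPool();
        String first = pool.canonical(new String("red"));
        String second = pool.canonical(new String("red"));
        System.out.println(first == second); // true: both cells now reference one String
    }
}
```

Every cell value would be passed through canonical before being stored, so a column with many repeats holds many references but few String objects.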
EDIT: You can structure your data as either row-based or column-based. If it's row-based (one array of cells per row), adding or removing a row is just a matter of adding or removing that array. If it's column-based, you can have an array per column. This can make handling primitive types much more efficient, i.e. you can have one column which is an int[] and another which is a double[]; it's much more common for an entire column to have the same data type than for a whole row.
However, whichever way you structure the data, it will be optimised for either row or column modification, and performing an add/remove of the other kind will result in a rebuild of the entire dataset.
(Something I do is keep row-based data and add columns at the end, assuming that if a row isn't long enough the column has a default value; this avoids a rebuild when adding a column. Rather than removing a column, I have a means of ignoring it.)
Guava does include a Table interface and a hash-based implementation. Seems like a natural fit to your problem. Note that this is still marked as beta.
keeps its data in an ArrayList of HashMaps
Well, this part seems terribly inefficient to me. An empty HashMap will already allocate 16 * (size of a pointer) bytes (16 is the default initial capacity), plus some fields for the hash object itself (14 + psize). If you have a lot of sparsely filled rows, this could be a big problem.
One option would be to use a single large hash with a composite key (combining row and column), although that doesn't make operations on whole rows very efficient.
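A minimal sketch of the composite-key idea follows. The SparseGrid and CellKey names are invented; the key needs consistent equals and hashCode for the lookup to work.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class SparseGrid {
    // Composite key: a (row, column) pair with value-based equality.
    static final class CellKey {
        final int row;
        final int column;

        CellKey(int row, int column) {
            this.row = row;
            this.column = column;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof CellKey)) {
                return false;
            }
            CellKey that = (CellKey) other;
            return row == that.row && column == that.column;
        }

        @Override
        public int hashCode() {
            return Objects.hash(row, column);
        }
    }

    private final Map<CellKey, Object> cells = new HashMap<>();

    void set(int row, int column, Object value) {
        cells.put(new CellKey(row, column), value);
    }

    // Empty cells cost nothing: they are simply absent from the map.
    Object get(int row, int column) {
        return cells.get(new CellKey(row, column));
    }

    int storedCells() {
        return cells.size();
    }

    public static void main(String[] args) {
        SparseGrid grid = new SparseGrid();
        grid.set(1000, 3, "x");
        System.out.println(grid.get(1000, 3));  // x
        System.out.println(grid.storedCells()); // 1
    }
}
```

The trade-off mentioned above shows up here: reading or deleting a whole row means scanning every key in the map.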
Also, since you don't mention the operation of adding a cell, you can create hashes with only the necessary inner storage (the initialCapacity parameter).
I don't know much about google collections, so can't help there. Also, if you find any useful optimization, please do post here! It would be interesting to know.
I've been experimenting with using the SparseObjectMatrix2D from the Colt project. My data is pretty dense but their Matrix classes don't really offer any way to enlarge them, so I went with a sparse matrix set to the maximum size.
It seems to use roughly 10% less memory and loads about 15% faster for the same data, as well as offering some clever manipulation methods. Still interested in other options though.
From your description, it seems that instead of an ArrayList of HashMaps you rather want a (Linked)HashMap of ArrayLists (each ArrayList would be a column).
I'd add a double map from field name to column number, and some clever getters/setters that never throw IndexOutOfBoundsException.
You can also use an ArrayList<ArrayList<Object>> (basically a jagged, dynamically growing matrix) and keep the mapping to field (column) names outside.
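A rough sketch of the map-of-columns layout with forgiving accessors is below. The ColumnStore class and its methods are hypothetical, and padding short columns with null is one possible reading of "never throw".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ColumnStore {
    // LinkedHashMap keeps columns in the order they were added.
    private final Map<String, List<Object>> columns = new LinkedHashMap<>();

    void addColumn(String name) {
        columns.putIfAbsent(name, new ArrayList<>());
    }

    void removeColumn(String name) {
        columns.remove(name); // dropping a whole column is a single map operation
    }

    // Getter that returns null instead of throwing for unknown columns or short columns.
    Object get(String name, int row) {
        List<Object> column = columns.get(name);
        if (column == null || row < 0 || row >= column.size()) {
            return null;
        }
        return column.get(row);
    }

    // Setter that pads the column with nulls so any non-negative row index is valid.
    void set(String name, int row, Object value) {
        if (row < 0) {
            throw new IllegalArgumentException("row must be non-negative");
        }
        List<Object> column = columns.computeIfAbsent(name, key -> new ArrayList<>());
        while (column.size() <= row) {
            column.add(null);
        }
        column.set(row, value);
    }

    List<String> columnNames() {
        return new ArrayList<>(columns.keySet());
    }

    public static void main(String[] args) {
        ColumnStore store = new ColumnStore();
        store.set("name", 2, "Cy");
        System.out.println(store.get("name", 0)); // null (padded)
        System.out.println(store.get("name", 2)); // Cy
    }
}
```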
Some columns will have a lot of repeated values
I doubt this matters, especially if they are Strings that have been interned (as string literals are), since your collection only stores references to them.
Why don't you try a cache implementation like EHCache?
This turned out to be very effective for me when I hit the same situation.
You can simply store your collection within the EHCache implementation.
There are configuration options such as:
Maximum bytes to be used from the local heap.
Once the bytes used by your application exceed the limit configured in the cache, the cache implementation takes care of writing the data to disk. You can also configure how long objects stay in memory before being written to disk, using a least-recently-used algorithm.
Using this type of cache implementation helps you avoid out-of-memory errors.
It only increases the I/O operations of your application by a small degree.
This is just a bird's-eye view of the configuration. There are many settings you can tune to fit your requirements.
For me, Apache Commons Collections did not save any space. I compared two similar heap dumps taken just before an OOME, one using the Java 11 HashMap and one using the Apache Commons HashedMap:
The Apache Commons HashedMap doesn't appear to make any meaningful difference.
I need a data structure to store users which should be retrieved by id.
I noticed there are several classes that implement the Map interface. Which one should be my default choice? They all seem quite equivalent to me.
It probably depends on how many users you plan to have, and on whether you need them ordered or just need to fetch single items by id.
HashMap uses hash codes to store things, so put and get take constant time on average, but the iteration order is unspecified.
TreeMap instead uses a balanced binary tree (a red-black tree), so basic operations take O(log n) time, but entries are kept sorted by key.
I would use HashMap because it's the simpler one (remember to give it a suitable initial capacity). Remember that these data structures are not synchronized by default; if you plan to use one from more than one thread, use ConcurrentHashMap instead.
A middle approach is LinkedHashMap, which uses the same structure as HashMap (hashCode and equals) but also keeps a doubly linked list of the elements inserted into the map, maintaining insertion order. This hybrid gives you ordered iteration (ordered in the sense of insertion order) without the performance cost of TreeMap.
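The ordering differences are easy to see in a small sketch. The class name, method name, and sample users are made up; the same three entries are inserted into whichever map is passed in.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class MapOrderDemo {
    // Inserts the same entries in the same order and reports the iteration order of the keys.
    static String keysOf(Map<String, Integer> map) {
        map.put("carol", 3);
        map.put("alice", 1);
        map.put("bob", 2);
        return String.join(",", map.keySet());
    }

    public static void main(String[] args) {
        System.out.println(keysOf(new TreeMap<>()));       // alice,bob,carol (sorted by key)
        System.out.println(keysOf(new LinkedHashMap<>())); // carol,alice,bob (insertion order)
        System.out.println(keysOf(new HashMap<>()));       // some order; not specified
    }
}
```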
No concurrency: use java.util.HashMap
Concurrency: use java.util.concurrent.ConcurrentHashMap
If you want some control on the order used by iterators, use a TreeMap or a LinkedHashMap.
This is covered on the Java Collections Trail, Implementations page.
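For the concurrent case, a minimal sketch of a user lookup keyed by id might look like this. The UserDirectory class, its methods, and the generated user names are assumptions for illustration; only the ConcurrentHashMap calls are the real API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class UserDirectory {
    private final ConcurrentMap<Long, String> usersById = new ConcurrentHashMap<>();

    // Returns false if the id was already taken; safe to call from many threads.
    boolean register(long id, String name) {
        return usersById.putIfAbsent(id, name) == null;
    }

    String find(long id) {
        return usersById.get(id);
    }

    int size() {
        return usersById.size();
    }

    // Registers distinct ids from several threads and reports how many users ended up stored.
    static int registerInParallel(int threads, int perThread) {
        UserDirectory directory = new UserDirectory();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int offset = t * perThread;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    directory.register(offset + i, "user" + (offset + i));
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return directory.size();
    }

    public static void main(String[] args) {
        System.out.println(registerInParallel(4, 250)); // 1000
    }
}
```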
If they all seem equivalent then you haven't read the documentation. Sun's documentation is pretty much as terse as it gets and provides very important points for making your choices.
Start here.
Your choice could be modified by how you intend to use the data structure, and where you would rather have performance - reads, or writes?
In a user-login system, my guess is that you'll be doing more reads than writes.
(I know I already answered this once, but I feel this needs saying)
Have you considered using a database to store this information? Even if it's SQLite, it'd probably be easier than storing your user database in the program code or loading the entire dataset into memory each time.