I am writing a program to analyze some spreadsheet data. There are two columns: start time and duration (both variables are Doubles). The spreadsheet is not sorted. I need to sort the columns together by start time (that is, the durations have to stay with their matching start times). There are a few thousand rows, and analysis will happen periodically so I don't want to keep sorting the entire collection over and over again as more data gets added.
A TreeMap using start time as the key and duration as the value seemed perfect, because it inserts each entry into the correct position as the data is read in and keeps the two pieces of data together.
And it did work perfectly for 90% of my data. Unfortunately I realized tonight that sometimes two events have the same start time. Since a TreeMap doesn't keep duplicate keys, I lose a row whenever the new data overwrites the old entry.
There are many posts about this (see this, this, and sort of this), and two suggestions keep coming up:
A custom comparator that 'tricks' the TreeMap into allowing duplicates.
Using something like TreeMap<Double, List<Double>> to store multiple values for a single key.
The first suggestion is the easiest for me to implement, but I have read comments that it breaks the contract of the TreeMap and isn't a good idea. The second suggestion is doable, but it will make the analysis more complicated, since I'll have to iterate through each list as I iterate through the keys, instead of simply iterating through the keys alone.
What I need is a way to keep two lists sorted together and allow duplicate entries. I'm hoping someone can suggest the best way to do this. Thanks so much for your help.
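For reference, here is a minimal sketch of what the second suggestion would look like as far as I understand it (the class and variable names are mine):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class EventTable {
    // Start time -> all durations that share that start time, kept sorted by key.
    private final TreeMap<Double, List<Double>> events = new TreeMap<>();

    public void add(double startTime, double duration) {
        // computeIfAbsent keeps duplicates: a second event with the same start
        // time just lands in the same list instead of overwriting the first.
        events.computeIfAbsent(startTime, k -> new ArrayList<>()).add(duration);
    }

    public void analyze() {
        // Iteration is in ascending start-time order; the inner loop is the
        // extra step compared to a plain TreeMap<Double, Double>.
        for (Map.Entry<Double, List<Double>> entry : events.entrySet()) {
            double startTime = entry.getKey();
            for (double duration : entry.getValue()) {
                // ... analysis of (startTime, duration) goes here ...
            }
        }
    }
}
```

It works, but the nested loop is exactly the extra complication I'd like to avoid if there's a cleaner structure.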
When I have an array and I want to remove one value from it, I need to shift the following elements to the left, but the idea is to do the shifting only once, after a number of values in the array have been nulled out.
Of course this is micro-optimisation, and ArrayList (or maybe LinkedList) would be a production-quality data structure for dynamic arrays.
Here you might keep an extra list of the nulled entries. At a certain threshold you could then do a few System.arraycopy calls to remove the gaps. If there are many index-based inserts too, you might opt for keeping the gaps, maybe collecting small gaps together.
This is a traditional technique in text editors.
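A bare sketch of that idea, assuming removed slots are marked with null and the threshold is an arbitrary number (class and field names are placeholders):

```java
import java.util.Arrays;

public class GappedArray<T> {
    private Object[] data;
    private int nullCount;                    // how many gaps have accumulated
    private static final int THRESHOLD = 64;  // arbitrary compaction trigger

    public GappedArray(T[] initial) {
        this.data = Arrays.copyOf(initial, initial.length);
    }

    // Mark the entry as removed instead of shifting immediately.
    public void removeAt(int index) {
        data[index] = null;
        if (++nullCount >= THRESHOLD) {
            compact();
        }
    }

    // One left-to-right pass that closes all accumulated gaps at once.
    private void compact() {
        int write = 0;
        for (int read = 0; read < data.length; read++) {
            if (data[read] != null) {
                data[write++] = data[read];
            }
        }
        Arrays.fill(data, write, data.length, null);  // clear the tail
        nullCount = 0;
    }
}
```

The single sweep replaces the repeated per-removal shifting; with a sorted list of gap positions you could do the same with a few System.arraycopy calls instead.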
For richer data structures one might look through the Guava classes, for instance copy-on-write data structures, or do the compaction concurrently in the background.
For a specific data structure and algorithm, maybe someone else can give pointers.
The title does a terrible job of explaining what I mean, so let me explain.
I basically want to make a mini cryptocurrency blockchain as a project, and I'm trying to think of the best thing to use for the blocks in the chain. I'd need each block to hold the header information, which is the following:
Previous Block's Hash, Timestamp, Difficulty Target, Nonce, Version, Merkle Root.
And then the transactions that are also contained within each block. All in all the block would end up looking like this.
I thought a HashMap would be the best way to do this, since you can assign each of the above keys a value, but I'm not sure how I'd go about it. I would need a list that holds the HashMaps with their keys and values, but I'd also need that list to hold whatever the transactions end up being stored in. I'd also need the next block to be able to see the previous one (so it can get the hash from everything in its header).
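Roughly what I have in mind so far, just as a sketch (all the class and field names here are mine, nothing is settled):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Block {
    // Header fields keyed by name, following the HashMap idea above.
    private final Map<String, String> header = new HashMap<>();
    // The transactions carried by this block.
    private final List<String> transactions = new ArrayList<>();

    public Block(String previousHash, long timestamp, long difficultyTarget,
                 long nonce, String version, String merkleRoot) {
        header.put("previousHash", previousHash);
        header.put("timestamp", String.valueOf(timestamp));
        header.put("difficultyTarget", String.valueOf(difficultyTarget));
        header.put("nonce", String.valueOf(nonce));
        header.put("version", version);
        header.put("merkleRoot", merkleRoot);
    }

    public void addTransaction(String transaction) {
        transactions.add(transaction);
    }

    public Map<String, String> getHeader() {
        return header;
    }
}

// The chain itself would then just be a List<Block>, where each new block's
// previousHash is computed from the previous block's header.
```

But I'm not sure whether a map-per-block like this is better than just giving the block proper fields.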
Could anyone give me some advice on the best way to deal with this? Been thinking it over for the last few days but I don't think I'm coming any closer to an answer.
Thank you.
I have a very large set of key value pairs (TBs of data), read from some files.
For simplicity, let's assume the keys and values are both integers.
In the end, I am interested in keeping, for each key, the highest N values it was encountered with, and writing them, again as key-value pairs, to some different files.
There is no issue if the output file contains more than N entries for a given key, as long as the highest N values are among them.
Keeping the files as they are satisfies the above condition, but I'm trying to reduce the size, since some keys have lots of values in the input, which are not of interest.
Keeping all the data in memory is clearly not an option.
Thus I'm looking for some kind of cache. Something where I can keep a sorted list for each key I find, and once a specific size limit is reached, I'd just flush half of the entries from the cache to the output. Guava's LoadingCache does not seem to help me here, because the weights are computed at entry creation time, and are static thereafter.
Is there a specific data structure/algorithm I can use/implement that may help me here?
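To make it concrete, here is the rough shape of the cache I have in mind; N and the flush threshold are made-up numbers and this is only a sketch:

```java
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class TopNCache {
    private static final int N = 3;                   // values kept per key
    private static final int MAX_ENTRIES = 1_000_000; // flush trigger

    // Per key, a min-heap holding at most the N largest values seen so far.
    private final Map<Integer, PriorityQueue<Integer>> cache = new HashMap<>();
    private int totalEntries = 0;
    private final PrintWriter out;

    public TopNCache(PrintWriter out) {
        this.out = out;
    }

    public void put(int key, int value) {
        PriorityQueue<Integer> heap =
                cache.computeIfAbsent(key, k -> new PriorityQueue<>());
        if (heap.size() < N) {
            heap.add(value);
            totalEntries++;
        } else if (value > heap.peek()) {
            heap.poll();          // drop the smallest of the kept values
            heap.add(value);
        }
        if (totalEntries >= MAX_ENTRIES) {
            flush();
        }
    }

    // Write everything out and start over. Repeated keys across flushes are
    // fine, since the output may contain more than N entries per key.
    private void flush() {
        for (Map.Entry<Integer, PriorityQueue<Integer>> e : cache.entrySet()) {
            for (int value : e.getValue()) {
                out.println(e.getKey() + "\t" + value);
            }
        }
        cache.clear();
        totalEntries = 0;
    }
}
```

The part I'm unsure about is whether hand-rolling this is better than some existing cache/collection library.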
Simple approach:
Sort the original file. Your sorting criteria are: key in ascending order, value in descending order. The Linux sort utility makes quick work of this. (Well, quick as in it's quick to type the command. Sorting terabytes of data will take some time.)
Write a program that goes through the file sequentially and saves the top N values.
You're all done.
If the data is spread across multiple files, where values for a particular key can be in more than one file, then you sort each file individually and then merge the multiple files together. Again, sort can do this for you.
I can't guarantee that the above will execute faster than a custom solution, but I'm pretty confident in saying that it will execute faster than you can design, code, debug, test, and then run your custom solution.
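A minimal sketch of the "goes through the file sequentially" step, assuming the sorted file is tab-separated with one key/value pair per line (file names and N are placeholders):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TopNFilter {
    public static void main(String[] args) throws IOException {
        final int n = 3; // keep the N highest values per key
        try (BufferedReader in = Files.newBufferedReader(Paths.get("sorted.tsv"));
             PrintWriter out = new PrintWriter(
                     Files.newBufferedWriter(Paths.get("topn.tsv")))) {
            String currentKey = null;
            int kept = 0;
            String line;
            while ((line = in.readLine()) != null) {
                String key = line.substring(0, line.indexOf('\t'));
                if (!key.equals(currentKey)) {
                    currentKey = key;   // a new key starts: reset the counter
                    kept = 0;
                }
                // Because the file is sorted by key ascending and value descending,
                // the first N lines of each key are exactly its N highest values.
                if (kept < n) {
                    out.println(line);
                    kept++;
                }
            }
        }
    }
}
```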
I'm having trouble figuring out what kind of data structure to use. I'm trying to represent the classes that need to be taken to complete a major, complete with prereqs. Originally I was thinking of using an adjacency matrix to represent the data. However, I'm currently using an ArrayList to store each course; then, in each course object, I have an ArrayList containing the parents of the course and another containing its children. I'm planning on having a dummy course with the first courses in its children array, and then doing a BFS, adding classes as long as their parents have been taken and the current course hasn't.
I can't help feeling that this isn't the ideal solution. I tried looking at the differences between lists, sets, and maps, and I can't come to a clear reason why one is superior to the others. All of the courses are unique and sets don't contain duplicates, so that seems to make sense. On the other hand, the order of the classes taken matters, so a list seems like a good idea. Then again, I could have the course number be the key and the course object be the value in a map.
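For reference, a sketch of the map variant I'm considering, with the course number as the key and the BFS on top (all the names are mine and nothing here is settled):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class DegreePlan {
    static class Course {
        final String number;
        final List<String> prereqs = new ArrayList<>();  // parent course numbers
        final List<String> unlocks = new ArrayList<>();  // child course numbers

        Course(String number) {
            this.number = number;
        }
    }

    // Course number -> course object, so lookups while traversing are O(1).
    private final Map<String, Course> courses = new HashMap<>();

    public void addCourse(String number, List<String> prereqNumbers) {
        Course c = courses.computeIfAbsent(number, Course::new);
        for (String p : prereqNumbers) {
            Course parent = courses.computeIfAbsent(p, Course::new);
            c.prereqs.add(p);
            parent.unlocks.add(number);
        }
    }

    // BFS starting from the courses with no prereqs (the dummy course's
    // children); a course is taken only once all of its prereqs are taken.
    public List<String> planOrder() {
        Set<String> taken = new HashSet<>();
        List<String> order = new ArrayList<>();
        Queue<Course> queue = new ArrayDeque<>();
        for (Course c : courses.values()) {
            if (c.prereqs.isEmpty()) {
                queue.add(c);
            }
        }
        while (!queue.isEmpty()) {
            Course c = queue.poll();
            if (taken.contains(c.number) || !taken.containsAll(c.prereqs)) {
                continue;  // it will be re-queued when another prereq finishes
            }
            taken.add(c.number);
            order.add(c.number);
            for (String child : c.unlocks) {
                queue.add(courses.get(child));
            }
        }
        return order;
    }
}
```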
If each of them is guaranteed to have a unique key (generated and enforced by an external keying system), which Map implementation is the correct fit for me? Assume this has to be optimized for concurrent lookup only (the data is initialized once during application startup).
Do the 300 million unique keys have any positive or negative implications for bucketing/collisions?
Any other suggestions?
My map would look something like this:
Map<String, <boolean, boolean, boolean, boolean>>
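For concreteness, a sketch of how I could model it, assuming the four booleans are just flags and a plain HashMap built once at startup (all the names are placeholders):

```java
import java.util.HashMap;
import java.util.Map;

public class FlagLookup {
    // Four flags packed into one immutable value object.
    static final class Flags {
        final boolean a, b, c, d;

        Flags(boolean a, boolean b, boolean c, boolean d) {
            this.a = a;
            this.b = b;
            this.c = c;
            this.d = d;
        }
    }

    private final Map<String, Flags> lookup;

    // Built once during startup; after that it is only read, so a plain
    // HashMap published through a final field is safe for concurrent lookups.
    FlagLookup(Map<String, Flags> source) {
        // Pre-size to roughly size / load factor to avoid rehashing while loading.
        Map<String, Flags> m = new HashMap<>((int) (source.size() / 0.75f) + 1);
        m.putAll(source);
        this.lookup = m;
    }

    Flags get(String key) {
        return lookup.get(key);
    }
}
```

Whether 300 million String keys fit in the heap at all is of course a separate question.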
I would not use a map; it needs too much memory, especially in your case.
Store the values in one data array, and store the keys in a sorted index array.
In the sorted index array you use binary search to find the position of a key, which is also its position in data[].
The tricky part will be building up the arrays without running out of memory.
You don't need to consider concurrency, because you only read from the data.
Further, try to avoid using a String as the key; try to convert the keys to long.
The advantage of this solution: the search time is guaranteed not to exceed O(log n), even in worst cases where the keys cause hashCode problems.
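A bare-bones sketch of that layout, assuming the keys can be converted to long and the four booleans are packed into one byte (the names are placeholders):

```java
import java.util.Arrays;

public class SortedIndexLookup {
    private final long[] keys;   // sorted index array
    private final byte[] data;   // the four boolean flags packed into each byte

    // keys must already be sorted ascending; data[i] belongs to keys[i].
    public SortedIndexLookup(long[] keys, byte[] data) {
        this.keys = keys;
        this.data = data;
    }

    // Returns the packed flags for the key, or -1 if the key is absent.
    // Arrays.binarySearch gives the guaranteed O(log n) worst case.
    public int lookup(long key) {
        int pos = Arrays.binarySearch(keys, key);
        return pos >= 0 ? (data[pos] & 0xFF) : -1;
    }

    // Unpacking one of the four flags (index 0..3) from the packed byte.
    public static boolean flag(int packed, int index) {
        return (packed & (1 << index)) != 0;
    }
}
```

A long[] plus byte[] for 300 million entries is around 2.7 GB, far less than the equivalent map of String keys.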
Other suggestions? You bet.
Use a proper key-value store; Redis is the first option that comes to mind. Sure, it's a separate process and an extra dependency, but you'll win big when it comes to proper system design.
There should be a very good reason why you would want to couple your business logic with several gigs of data in the same process memory, even if it's ephemeral. I've tried this several times and was always proved wrong.
It seems to me that you can simply use a TreeMap, because it will give you O(log n) search due to its sorted structure. Furthermore, it is a suitable method because, as you said, all the data will be loaded at startup.
If you need to keep everything in memory, then you will need to use some library meant for this number of elements, like Huge Collections. On top of that, if the number of writes will be big, you also have to think about more sophisticated solutions, like a non-blocking hash map.