Suitable Java data structure for parsing large data file


I have a rather large text file (~4m lines) I'd like to parse and I'm looking for advice about a suitable data structure in which to store the data. The file contains lines like the following:
Date Time Value
2011-11-30 09:00 10
2011-11-30 09:15 5
2011-12-01 12:42 14
2011-12-01 19:58 19
2011-12-01 02:03 12
I want to group the lines by date so my initial thought was to use a TreeMap<String, List<String>> to map the date to the rest of the line but is a TreeMap of Lists a ridiculous thing to do? I suppose I could replace the String key with a date object (to eliminate so many string comparisons) but it's the List as a value that I'm worried might be unsuitable.
I'm using a TreeMap because I want to iterate the keys in date order.

There's nothing wrong with using a List as the value for a Map. All of those <> look ugly, but it's perfectly fine to nest one generic class inside another.
Instead of using a String as the key, it would probably be better to use java.util.Date because the keys are dates. Strings are sorted lexicographically, not as "real" dates. That happens to match chronological order for the zero-padded yyyy-MM-dd format shown in the question, but it breaks for formats such as dd/MM/yyyy, so a date key is the more robust choice.
Map<Date, List<String>> map = new TreeMap<Date, List<String>>();
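As a rough illustration of the grouping described in the question, here is a minimal sketch. It assumes Java 8 or later and swaps in java.time.LocalDate for java.util.Date; the class and method names (GroupByDate, group) are made up for the example.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class GroupByDate {
    // Groups "date time value" lines by date. The TreeMap keeps the dates in ascending order.
    static Map<LocalDate, List<String>> group(Iterable<String> lines) {
        Map<LocalDate, List<String>> byDate = new TreeMap<>();
        for (String line : lines) {
            // Split off the date; the remainder (e.g. "09:00 10") is stored unchanged.
            String[] parts = line.trim().split("\\s+", 2);
            if (parts.length < 2) {
                continue; // blank or malformed line
            }
            LocalDate date;
            try {
                date = LocalDate.parse(parts[0]); // expects yyyy-MM-dd
            } catch (DateTimeParseException e) {
                continue; // e.g. the "Date Time Value" header line
            }
            byDate.computeIfAbsent(date, d -> new ArrayList<>()).add(parts[1]);
        }
        return byDate;
    }
}
```

Iterating over the returned map's entrySet() then visits the dates in chronological order, regardless of the order of the lines in the file.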

is a TreeMap of Lists a ridiculous thing to do?
Conceptually not, but it is going to be very memory-inefficient (both because of the Map and because of the List). You're looking at an overhead of 200% or more. Which may or may not be acceptable, depending on how much memory you have to waste.
For a more memory-efficient solution, create a class that has fields for every column (including a Date), put all those in a List and sort it (ideally using quicksort) when you're done reading.
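A minimal sketch of that approach might look like the following. The Reading class and its field names are assumptions made for the example, and it uses java.time types (Java 8+). Note that List.sort on objects is a merge-sort variant rather than quicksort, which is usually fine here.

```java
import java.time.LocalDate;
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// One object per line, with a typed field per column.
final class Reading {
    final LocalDate date;
    final LocalTime time;
    final int value;

    Reading(LocalDate date, LocalTime time, int value) {
        this.date = date;
        this.time = time;
        this.value = value;
    }

    // Parses a data line such as "2011-11-30 09:00 10".
    static Reading parse(String line) {
        String[] p = line.trim().split("\\s+");
        return new Reading(LocalDate.parse(p[0]), LocalTime.parse(p[1]), Integer.parseInt(p[2]));
    }

    // Parses all lines, then sorts once by date and time.
    static List<Reading> parseAndSort(List<String> lines) {
        List<Reading> readings = new ArrayList<>(lines.size());
        for (String line : lines) {
            readings.add(parse(line));
        }
        readings.sort(Comparator.comparing((Reading r) -> r.date).thenComparing(r -> r.time));
        return readings;
    }
}
```

After sorting, all readings for one date sit next to each other, so grouping by date is a single pass over the list.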

There is no objection against using Lists. Though in your case maybe a List<Integer> as values of the Map would be appropriate.

Related

Java get values from LinkedHashMap with part of the key

I have the following key-value store (a LinkedHashMap), where the String key looks like "2014/12/06".
LinkedHashMap<String, Value>
I can retrieve an item when I know the full key, but I'm looking for a way to retrieve all values whose key matches partially. For example, how could I retrieve all the values for 2014?
I would like to avoid brute-force solutions such as testing every item in the map.
Thanks.

Apart from the brute-force solution of iterating over all the keys, I can think of two options:
Use a TreeMap, in which the keys are sorted, so you can find the first key that is >= "2014/01/01" (using map.ceilingEntry("2014/01/01")) and go over all the keys from there.
Use a hierarchy of Maps, i.e. Map<String,Map<String,Value>>. The key in the outer Map would be the year. The key in the inner Map would be the full date.

Not possible with LinkedHashMap only. If you can copy the keys to an ordered list you can perform a binary search on that and then do a LinkedHashMap.get(...) with the full key(s).

If you're only ever going to want to retrieve items using the first part of the key, then you want a TreeMap rather than a LinkedHashMap. A LinkedHashMap keeps its entries in insertion order, which is no use for this, but a TreeMap is sorted according to natural ordering, or to a Comparator that you supply. This means that you can find the first entry that starts with 2014 efficiently (in logarithmic time), and then iterate through until you get to the first one that doesn't match.
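A minimal sketch of that prefix scan, assuming the keys are plain Strings in a NavigableMap such as TreeMap (the class and method names are made up for the example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;

final class PrefixLookup {
    // Returns the values whose keys start with the given prefix, in key order.
    static <V> List<V> valuesWithPrefix(NavigableMap<String, V> map, String prefix) {
        List<V> result = new ArrayList<>();
        // tailMap(prefix, true) is a view that starts at the first key >= prefix.
        for (Map.Entry<String, V> e : map.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // sorted keys: once one key fails to match, no later key can match
            }
            result.add(e.getValue());
        }
        return result;
    }
}
```

With keys like "2014/12/06", calling valuesWithPrefix(map, "2014") collects every value for that year, and "2014/12" narrows it to one month.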
If you want to be able to match on any part of the key, then you need a totally different solution, way beyond a simple Map. You'd need to look into full text searching and indexing. You could try something like Lucene.

You could define a hash function for your keys so that keys sharing a year end up with similar hashes. That would be neither efficient (probably a poor distribution of hashes) nor in the spirit of HashMap. Use another map implementation, such as TreeMap, that keeps the keys in an order of your choice.

Storing tables in Java for referencing

My question is about keeping the code manageable. I have a table of retirement ages, listed below:
Year of Birth Full Retirement Age
1937 or earlier.............................65
1938........................................65 years 2 months
1939........................................65 years 4 months
1940........................................65 years 6 months
...
(the list goes on for many more rows)
What I want is to store this table in a list object or something similar, so that I can pass the year of birth to a method and get back the corresponding retirement age. I don't want a lot of if/else statements in my code, because the list is so long that the code would become confusing.
What can be a possible solution for this problem?
Thanks in advance

Try using map instead of list. Use year of birth as key, so that you can directly get the associated value from the map.

You can use a map, but there is a chance of duplicate keys.
Two people can be born in the same year.
Use a Multimap.
A Multimap that can hold duplicate key-value pairs and that maintains the insertion ordering of values for a given key. See the Multimap documentation for information common to all multimaps.

Use a map. A Map stores key/value pairs.
Map<String, Object> map = new HashMap<String, Object>();
map.put("1937", 65);
...
To go through a map you can use this:
for (String key : map.keySet()) {
    System.out.println(map.get(key));
}
You can change values for <String, Object> as you wish (Integer, Date... or whatever). Always follow the same order <KeyType, ValueType>

Store your list/table into a HashMap...then retrieve from your method, something like:
public String getRetirementAge(String yearOfBirth) {
    return yourMap.get(yearOfBirth);
}

If you have data for every year, I would use a Java Map (http://docs.oracle.com/javase/tutorial/collections/interfaces/map.html) where the key is the year and the value is the retirement age.
With a HashMap this gives you O(1) lookups.
If you have sparse data and somehow have to find the nearest year, you could either use a sorted List with binary search, which gives you O(log n), or even use a B-tree.
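For the sparse case, a sorted map can do the nearest-year lookup directly. The sketch below is one possible shape, using TreeMap.floorEntry and only the first rows quoted in the question. The class and method names are made up, and treating a year past the last row as "use the last row" is an assumption of this example.

```java
import java.util.Map;
import java.util.TreeMap;

final class RetirementTable {
    private static final TreeMap<Integer, String> AGE_BY_BIRTH_YEAR = new TreeMap<>();

    static {
        // First rows of the table from the question; the remaining rows would be added the same way.
        AGE_BY_BIRTH_YEAR.put(1937, "65"); // stands for "1937 or earlier"
        AGE_BY_BIRTH_YEAR.put(1938, "65 years 2 months");
        AGE_BY_BIRTH_YEAR.put(1939, "65 years 4 months");
    }

    static String fullRetirementAge(int yearOfBirth) {
        // floorEntry finds the row with the greatest year <= yearOfBirth.
        Map.Entry<Integer, String> row = AGE_BY_BIRTH_YEAR.floorEntry(yearOfBirth);
        if (row == null) {
            row = AGE_BY_BIRTH_YEAR.firstEntry(); // earlier than the first row: "1937 or earlier"
        }
        return row.getValue();
    }
}
```

If the table changes often, the static block could be replaced by loading the rows from a file or database without changing the lookup method.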

I would recommend that you store this information in a database, especially if the list is a very long list (which you say it is). There will be many optimizations that come from using a database. For one thing, you won't have to store that huge list in memory. For another, SQL queries for data are often much faster than data structures in code. Martin Fowler has an (admittedly old) article about this at http://www.martinfowler.com/articles/dblogic.html. The other thing you gain from putting this in a database is that this is the type of list that is likely to change. They are already talking about adjusting retirement age in order to save social security. It is much easier to update data in a database than it is to edit code and recompile / redeploy.
The type of database you use can be NoSQL or relational, embedded or online. That decision I'll leave up to you. It will be a bonus for you if there is already a database available to this application for other reasons.

HashMap<DateTime, ArrayList<Email>>

I work on a graph where I visualize my emails. I want to be able to get the emails from a certain day.
Is this a bad way to store?
HashMap<DateTime, ArrayList<Email>>
Or is it better to convert the date to a string and then use HashMap<String, ArrayList<Email>>
Note, the dates are added without hours, minutes and seconds, so just like 06/07/2010 for example.

DateTime has properly defined equals and hashcode methods, so using those as the key in a HashMap is perfectly OK. There's not much to be gained by converting them to strings first.
I would suggest, however, that if you only want to store the year/month/day components, then you may want to use LocalDate instead of DateTime.
Additionally, you could also consider using TreeMap rather than HashMap, so that your map is automatically sorted by date. Might be handy.
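A minimal sketch of these suggestions, assuming Java 8+ and using java.time.LocalDate and LocalDateTime in place of the Joda-Time classes named above. The Email type here is a hypothetical stand-in with just two fields.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class EmailsByDay {
    // Hypothetical minimal email type, for illustration only.
    static final class Email {
        final LocalDateTime sentAt;
        final String subject;

        Email(LocalDateTime sentAt, String subject) {
            this.sentAt = sentAt;
            this.subject = subject;
        }
    }

    // Keyed by calendar day; the TreeMap keeps the days in chronological order.
    private final Map<LocalDate, List<Email>> byDay = new TreeMap<>();

    void add(Email email) {
        // toLocalDate() drops the time of day, so all emails from one day share a key.
        byDay.computeIfAbsent(email.sentAt.toLocalDate(), d -> new ArrayList<>()).add(email);
    }

    List<Email> on(LocalDate day) {
        return byDay.getOrDefault(day, Collections.<Email>emptyList());
    }
}
```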

Get a value from hashtable by a part of its key

Say I have a Hashtable<String, Object> with such keys and values:
apple => 1
orange => 2
mossberg => 3
I can use the standard get method to get 1 by "apple", but what I want is getting the same value (or a list of values) by a part of the key, for example "ppl". Of course it may yield several results, in this case I want to be able to process each key-value pair. So basically similar to the LIKE '%ppl%' SQL statement, but I don't want to use a (in-memory) database just because I don't want to add unnecessary complexity. What would you recommend?
Update:
Storing data in a Hashtable isn't a requirement. I'm seeking for a kind of a general approach to solve this.

The obvious brute-force approach would be to iterate through the keys in the map and match them against the char sequence. That could be fine for a small map, but of course it does not scale.
This could be improved by using a second map to cache search results. Whenever you collect a list of keys matching a given char sequence, you can store these in the second map so that next time the lookup is fast. Of course, if the original map is changed often, it may get complicated to update the cache. As always with caches, it works best if the map is read much more often than changed.
Alternatively, if you know the possible char sequences in advance, you could pre-generate the lists of matching strings and pre-fill your cache map.
Update: Hashtable is not recommended anyway. It is synchronized, and thus slower than it needs to be. You are better off using HashMap if no concurrency is involved, or ConcurrentHashMap otherwise. The latter outperforms a Hashtable by far.
Apart from that, off the top of my head I can't think of a better collection for this task than maps. Of course, you may experiment with different map implementations to find the one that best suits your specific circumstances and usage patterns. In general, it would thus be
Map<String, Object> fruits;
Map<String, List<String>> matchingKeys;
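The two maps above could be wired together roughly as follows. This is only a sketch: the class name SubstringIndex is made up, and clearing the whole cache on every change is the simplest invalidation strategy, not necessarily the best one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SubstringIndex {
    private final Map<String, Object> fruits = new HashMap<>();
    // Cache: searched character sequence -> keys that contain it.
    private final Map<String, List<String>> matchingKeys = new HashMap<>();

    void put(String key, Object value) {
        fruits.put(key, value);
        matchingKeys.clear(); // cached results may be stale after a change
    }

    Object get(String key) {
        return fruits.get(key);
    }

    // The first lookup for a given part scans all keys; repeated lookups hit the cache.
    List<String> keysContaining(String part) {
        return matchingKeys.computeIfAbsent(part, p -> {
            List<String> hits = new ArrayList<>();
            for (String key : fruits.keySet()) {
                if (key.contains(p)) {
                    hits.add(key);
                }
            }
            return hits;
        });
    }
}
```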

Not without iterating through explicitly. Hashtable is designed to go (exact) key->value in O(1), nothing more, nothing less. If you will be doing query operations with large amounts of data, I recommend you do consider a database. You can use an embedded system like SQLite (see SQLiteJDBC) so no separate process or installation is required. You then have the option of database indexes.
I know of no standard Java collection that can do this type of operation efficiently.

Sounds like you need a trie with references to your data. A trie stores strings and lets you search for strings by prefix. I don't know the Java standard library too well and I have no idea whether it provides an implementation, but one is available here:
http://www.cs.duke.edu/~ola/courses/cps108/fall96/joggle/trie/Trie.java
Unfortunately, a trie only lets you search by prefixes. You can work around this by storing every possible suffix of each of your keys:
For 'apple', you'd store the strings
'apple'
'pple'
'ple'
'le'
'e'
Which would allow you to search for every prefix of every suffix of your keys.
Admittedly, this is the kind of "solution" that would prompt me to continue looking for other options.
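To make the suffix idea concrete, here is a small sketch that uses a sorted map (TreeMap) in place of a trie, since a prefix scan over sorted strings gives the same answers. The class name SuffixIndex is an invention for the example, and the memory cost grows with the total length of all suffixes.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class SuffixIndex {
    // Maps every suffix of every key to the keys that suffix came from.
    private final TreeMap<String, Set<String>> suffixes = new TreeMap<>();

    void add(String key) {
        for (int i = 0; i < key.length(); i++) {
            suffixes.computeIfAbsent(key.substring(i), s -> new TreeSet<>()).add(key);
        }
    }

    // A key contains `part` exactly when one of its suffixes starts with `part`.
    Set<String> keysContaining(String part) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : suffixes.tailMap(part, true).entrySet()) {
            if (!e.getKey().startsWith(part)) {
                break; // past the block of suffixes that share this prefix
            }
            result.addAll(e.getValue());
        }
        return result;
    }
}
```

The returned keys can then be used for ordinary get calls on the original map.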

First of all, use HashMap, not Hashtable.
Then you can filter the map with a predicate, using the utilities in Google Guava:
public Collection<Object> getValues() {
    Map<String, Object> filtered = Maps.filterKeys(map, new Predicate<String>() {
        // predicate methods: implement apply(String key) to return true for the keys to keep
    });
    return filtered.values();
}

Can't be done in a single operation
You may want to try to iterate the keys and use the ones that contain your desired string.

The only solution I can see (I'm not a Java expert) is to iterate over the keys and match each one against a regular expression. If a key matches, you put that key-value pair into the Hashtable that will be returned.

If you can somehow reduce the problem to searching by prefix, you might find a NavigableMap helpful.

It may be worth looking through this question: Fuzzy string search library in Java
Also take a look at Lucene (answer number two).

I need data structure for effective handling with dates

What I need is something like a Hashtable which I will fill with the prices that were in effect on given days.
For example: I will put in two prices: January 1st: 100 USD, March 5th: 89 USD.
If I search my hashtable for a price with hashtable.get(February 14th), I need it to give me back the price that was entered for January 1st, because that is the most recent price in effect. A normal Hashtable implementation won't give me back anything, since nothing was stored for that date.
I need to know whether there is an implementation that can quickly find an object based on a range of dates.

Off the top of my head, there are a couple of ways, but I would use a TreeMap keyed by Date (or Calendar, etc.).
When you need to look up a Date date, try the following:
Attempt to get(date)
If the result is null, then the result is stored under the key headMap(date).lastKey()
One of those will work. Of course, check that headMap(date) is not empty first, because lastKey() throws a NoSuchElementException if it is.
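On Java 6 and later, the two steps above can also be collapsed into a single TreeMap.floorEntry call, which returns the entry with the greatest key less than or equal to the one given. A minimal sketch, assuming Java 8's java.time.LocalDate as the key type instead of Date, with made-up class and method names:

```java
import java.time.LocalDate;
import java.util.Map;
import java.util.TreeMap;

final class PriceHistory {
    private final TreeMap<LocalDate, Double> prices = new TreeMap<>();

    void put(LocalDate effectiveFrom, double price) {
        prices.put(effectiveFrom, price);
    }

    // Returns the price in effect on the given day, or null if no price had been entered by then.
    Double priceOn(LocalDate day) {
        Map.Entry<LocalDate, Double> entry = prices.floorEntry(day);
        return entry == null ? null : entry.getValue();
    }
}
```

With prices entered for January 1st and March 5th, priceOn for February 14th of the same year returns the January price.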

You could use a DatePrice object that contains both, keep those in a list or array sorted by date, and then use binary search (available in the Collections and Arrays classes) to find the nearest date.
This would be significantly more memory-efficient than using a TreeMap, and it doesn't look like you'll want to insert or remove data at random positions (which would lead to bad performance with an array).
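A sketch of that idea, under the assumptions that the list is already sorted by date, holds at most one price per date, and uses java.time.LocalDate for the date field (the DatePrice fields and the inEffectOn method are invented for the example):

```java
import java.time.LocalDate;
import java.util.Collections;
import java.util.List;

final class DatePrice implements Comparable<DatePrice> {
    final LocalDate date;
    final double price;

    DatePrice(LocalDate date, double price) {
        this.date = date;
        this.price = price;
    }

    // Ordering is by date only, which is all the binary search needs.
    @Override
    public int compareTo(DatePrice other) {
        return date.compareTo(other.date);
    }

    // Returns the latest entry dated on or before `day`, or null if there is none.
    static DatePrice inEffectOn(List<DatePrice> sortedByDate, LocalDate day) {
        int i = Collections.binarySearch(sortedByDate, new DatePrice(day, 0.0));
        if (i < 0) {
            // Not found: binarySearch returned -(insertionPoint) - 1,
            // and the entry we want sits just before the insertion point.
            i = -i - 2;
        }
        return i >= 0 ? sortedByDate.get(i) : null;
    }
}
```

Use an ArrayList (or another random-access list) so that each probe of the binary search is O(1).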

Create a TreeMap<Date, String>. When someone asks for a date, convert the requested string to a Date and call map.get(date). If nothing is found, take the closest previous key instead.

You have all your tools already at hand. Consider a TreeMap. Then you can create a headMap, which contains only the portion of the map that is strictly lower than a given value. Implementation example:
TreeMap<Date,Double> values = new TreeMap<Date,Double>();
...fill in stuff...
Date searchDate = ...anydate...
// Needed due to the strictly-less constraint:
Date mapConstraintDate = new Date(searchDate.getTime()+1);
Double searchedValue = values.get(values.headMap(mapConstraintDate).lastKey());
This is efficient, because the headMap is not created by copying the original map; it is only a view of it.
