Java HashMap/List alternative for huge data

In my Java application I have to scan a filesystem and recursively store the paths of the files I find, for searching later.
I tried List/ArrayList and HashMap as the storage structure, but the memory usage is far too high when the filesystem contains 1,000,000+ files.
How can I store and quickly retrieve those strings without using half of my RAM (8 GB)?

You are storing a large number of strings in main memory. They will take memory regardless of the data structure you use. One approach is not to store the whole path every time, but to keep the paths in a hierarchical structure, e.g. store each directory name as a key in a map and store that directory's entries as its value, recursively.
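A minimal sketch of that idea (hypothetical class and method names), where each node stores only its own name segment:
import java.util.*;

class DirNode {
    final Map<String, DirNode> children = new HashMap<>(); // subdirectory name -> child node
    final List<String> files = new ArrayList<>();          // plain file names in this directory

    // Walk or create the chain of nodes for "a/b/c.txt" and record the file name at the end.
    void addPath(String relativePath) {
        String[] parts = relativePath.split("/");
        DirNode node = this;
        for (int i = 0; i < parts.length - 1; i++) {
            node = node.children.computeIfAbsent(parts[i], k -> new DirNode());
        }
        node.files.add(parts[parts.length - 1]);
    }
}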

In the global hashmap, instead of storing the full paths as Strings, you can store references to Dir objects.
For each directory you find, create a Dir object. Each Dir object holds a reference to its parent Dir object plus its own local name.
Example:
/a/long...path/p/ is a Dir you already found.
/a/long...path/p/a and
/a/long...path/p/b are two new Dirs.
The two sub-Dirs only have to store a reference to the parent Dir plus their local names "a" or "b".
Note that you do not have to look up the parent object first: when scanning the file system you should do this recursively, or with an explicit Stack. When you have created a Dir object (e.g. for /p here), push it onto the stack and then visit (go into) that directory. When you create the /a and /b sub-Dirs, you just look at the top of the stack to find their parent. When you are done with the whole contents of /a/long...path/p/, pop the Dir object representing it off the stack.
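A sketch of that idea (hypothetical names; here the global map is keyed by local file name, which is an assumption about how you want to search):
import java.io.File;
import java.util.*;

// Each entry stores only its local name plus a parent reference,
// so long common prefixes are stored once instead of once per path.
class Dir {
    final Dir parent;   // null for the scan root
    final String name;  // local name only, e.g. "a" or "b"

    Dir(Dir parent, String name) {
        this.parent = parent;
        this.name = name;
    }

    // Rebuild the full path only when it is actually needed.
    String fullPath() {
        return parent == null ? name : parent.fullPath() + File.separator + name;
    }

    // Recursive scan: the current parent plays the role of the top of the stack.
    static void scan(File directory, Dir parentDir, Map<String, List<Dir>> byName) {
        File[] entries = directory.listFiles();
        if (entries == null) return;
        for (File entry : entries) {
            Dir node = new Dir(parentDir, entry.getName());
            byName.computeIfAbsent(entry.getName(), k -> new ArrayList<>()).add(node);
            if (entry.isDirectory()) {
                scan(entry, node, byName);
            }
        }
    }
}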

This question can have many answers. People can offer you a wide range of data structures, or may ask you to increase your hardware memory or the heap size of the JVM. But I think the problem is somewhere else.
This problem cannot be solved with just basic data structures; it may require a change at the design level too. Think about your need: you are asking for an amount of memory that even today's operating systems, or an RDBMS with very large amounts of data in store, do not keep resident.
Data structure as a service (DSaaS - it already exists, e.g. redis, but hey, I may have coined this term!).
In your application design, try introducing a component or service like redis, memcached or CouchDB that specializes in storing huge amounts of data and searching them fast, accessed over standard sockets or another high-speed communication protocol like D-Bus.
Do not worry about the internal workings of such protocols; there are enough libraries/APIs to do it for you.
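As a rough sketch of what that can look like with redis (assuming the Jedis client; the key name "paths" and the host settings are made up):
import redis.clients.jedis.Jedis;

try (Jedis jedis = new Jedis("localhost", 6379)) {
    jedis.sadd("paths", "/some/scanned/file.txt");                      // store a path in a redis set
    boolean known = jedis.sismember("paths", "/some/scanned/file.txt"); // fast membership check
}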

I can suggest using a HashSet and storing an MD5 sum for each path:
Set<Md5Sum> paths = new HashSet<>();
// messageDigestObject obtained via MessageDigest.getInstance("MD5")
// for each path
String path = ...
byte[] md5 = messageDigestObject.digest(path.getBytes());
paths.add(new Md5Sum(md5));
You cannot use byte[] directly as a key in a hash set, so you need to create a simple helper class:
class Md5Sum {
    // more memory efficient than byte[]
    long part1, part2;
    // override equals and hashCode methods
    // ...
}
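One possible completion of that helper (a sketch; packing the 16 digest bytes into two longs is just one option):
import java.nio.ByteBuffer;

class Md5Sum {
    // two longs hold the 16-byte MD5 digest with less overhead than a byte[]
    private final long part1, part2;

    Md5Sum(byte[] digest) { // expects exactly 16 bytes
        ByteBuffer buffer = ByteBuffer.wrap(digest);
        this.part1 = buffer.getLong();
        this.part2 = buffer.getLong();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Md5Sum)) return false;
        Md5Sum other = (Md5Sum) o;
        return part1 == other.part1 && part2 == other.part2;
    }

    @Override
    public int hashCode() {
        return Long.hashCode(part1) * 31 + Long.hashCode(part2);
    }
}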
About updates
You need to rescan the filesystem and recreate this hash set, or you can subscribe to file system events (see WatchService).
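A minimal WatchService sketch for keeping the set current (watching a single directory; the directory name is made up and the update of the set itself is left as a comment):
import java.nio.file.*;

WatchService watcher = FileSystems.getDefault().newWatchService();
Path dir = Paths.get("/some/scanned/directory"); // hypothetical directory
dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE, StandardWatchEventKinds.ENTRY_DELETE);

while (true) {
    WatchKey key = watcher.take(); // blocks until something changes
    for (WatchEvent<?> event : key.pollEvents()) {
        Path changed = dir.resolve((Path) event.context());
        // add or remove the corresponding Md5Sum here
    }
    if (!key.reset()) {
        break; // the directory is no longer accessible
    }
}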

Related

How can I update a serialized HashMap contained in a file?

I have a file that contains a serialized HashMap containing an element of type MyObject:
�� sr java.util.HashMap���`� F
loadFactorI thresholdxp?# w  t (a54d88e06612d820bc3be72877c74f257b561b19sr com.myproject.MyObject C�m�I�/ I partitionL hashcodet Ljava/lang/String;L idt Ljava/lang/Long;L offsetq ~ L timestampq ~ L topicq ~ xp q ~ ppppx
Now, I also have some other MyObject objects that I would like to add to that map. However, I don't want to first read and deserialize the map back into memory, then update it, and then write the whole updated map back to the file. How would one update the serialized map in the file in a more efficient way?
How would one update the serialization in the file in a more efficient way?
Basically by reverse engineering the binary protocol that Java uses when serializing objects into their binary representation. That would enable you to understand which elements in that binary blob would need to be updated in which way.
Other people have already done that, see here for example.
Anything else is just work. You sitting down and writing code.
Or you write the few lines of code that read in the existing files, and write out a new file with that map plus the other object you need in there.
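For comparison, that read-modify-rewrite approach really is just a few lines (a sketch; the file name and the extra entry are hypothetical, exception handling omitted):
import java.io.*;
import java.util.HashMap;

File file = new File("map.ser"); // hypothetical file name
HashMap<String, MyObject> map;
try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
    map = (HashMap<String, MyObject>) in.readObject();
}
map.put("someKey", someNewMyObject); // add the other MyObject instances here
try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
    out.writeObject(map);
}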
You see, efficiency depends on the point of view:
do you think the single update of a file with binary serialized objects is so time critical that it needs to be done by manually "patching" that binary file?
do you think it is more efficient to spend hours and hours learning the underlying binary format in order to correctly update its content?
The only valid reason (that I can think of) to do that: to learn exactly such things - binary data formats, and how to patch content. But even then there might be "better" assignments that give you more insight (of real value in the real world) than ... spending your time re-implementing Java binary serialization.

Is there any length limit for value kept in properties file?

I have configAllowedUsers.properties.
It has a single entry like the following:
users = abc, pew, rt, me1, me3, ku3,........
I have some doubts about the length of the value stored in it. I will read it using java.util.Properties. Thousands of usernames would be stored in it, and I cannot store them in a database.
I had the same question in mind today, so I started researching. I did not find any limitations originating from java.util.Properties, so it is probably safe to assume that the rules are the same as for String:
A String of length Integer.MAX_VALUE (which is 2^31 - 1),
Or half your maximum heap size,
Whichever is reached first in your environment.
Of course, not finding any official statements on this topic does not prove that these assumptions are correct, but let's consider the Properties class innocent unless proven guilty.
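For what it's worth, reading and splitting such an entry is straightforward (a sketch using the file name from the question; exception handling omitted):
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.*;

Properties props = new Properties();
try (InputStream in = new FileInputStream("configAllowedUsers.properties")) {
    props.load(in);
}
String[] users = props.getProperty("users", "").split("\\s*,\\s*");
Set<String> allowedUsers = new HashSet<>(Arrays.asList(users));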

Shortening long urls with a hash?

I've got a file cache, the files being downloaded from different urls. I'd like to save each file by the name of their url. These names can be quite long though, and I'm on a device using a FAT32 file system - so the long names are eating up resources well before I run out of actual disk space.
I'm looking for a way to shorten the filenames and have gotten suggestions to hash the strings. But I'm not sure whether hashes are guaranteed to be unique for two different strings. It would be bad if I accidentally fetched the wrong image because two different URLs came up with the same hash value.
You could generate a UUID for each URL and use it as the file name.
UUIDs are unique (or "practically unique") and are 36 characters long, so I guess the file name wouldn't be a problem.
As of version 5, the JDK ships with a class to generate UUIDs (java.util.UUID). You could use randomly generated UUIDs if there's a way to associate them with the URLs, or you could use name-based UUIDs. Name-based UUIDs are always the same for the same input, so the following is always true:
String url = ...
UUID urlUuid = UUID.nameUUIDFromBytes(url.getBytes());
assertTrue(urlUuid.equals(UUID.nameUUIDFromBytes(url.getBytes())));
There's no (shortening) hash which can guarantee different hashes for each input. It's simply not possible.
The way I usually do it is by saving the original name at the beginning (e.g., first line) of the cache file. So to find a file in the cache you do it like this:
Hash the URL
Find the file corresponding to that hash
Check the first line; if it is the same as the full URL, the cached content is everything from line two onward.
You can also consider saving the URL->file mapping in a database.
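A minimal sketch of the first-line check described above (hypothetical method and directory names; the payload is assumed to start after a newline-terminated header line):
import java.io.*;

// Returns true only if the cache file exists and its first line matches the URL,
// i.e. the hit is not a hash collision. The payload follows from line two onward.
static boolean isCacheHit(File cacheDir, String url, String hashedName) throws IOException {
    File cacheFile = new File(cacheDir, hashedName);
    if (!cacheFile.exists()) {
        return false;
    }
    try (BufferedReader reader = new BufferedReader(new FileReader(cacheFile))) {
        return url.equals(reader.readLine());
    }
}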
But I'm not sure if the hashes are guaranteed to be unique for two different strings.
They very much aren't (and cannot be, due to the pigeonhole principle). But if the hash is sufficiently long (at least 64 bit) and well-distributed (ideally a cryptographic hash), then the likelihood of a collision becomes so small that it's not worth worrying about.
As a rough guideline, collisions will become likely once the number of files approaches the square root of the number of possible different hashes (birthday paradox). So for a 64 bit hash (10 character filenames), you have about a 50% chance of one single collision if you have 4 billion files.
You'll have to decide whether that is an acceptable risk. You can reduce the chance of collision by making the hash longer, but of course at some point that will mean the opposite of what you want.
Currently, the SHA-1 algorithm is recommended. There are no known ways to intentionally provoke collisions for this algorithm, so you should be safe. Provoking collisions with two pieces of data that have common structure (such as the http:// prefix) is even harder. If you save this stuff after you get a HTTP 200 response, then the URL obviously fetched something, so getting two distinct, valid URLs with the same SHA-1 hash really should not be a concern.
If it's of any re-assurance Git uses it to identify all objects, commits and folders in the source code repository. I've yet to hear of someone with a collision in the object store.
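For reference, a short sketch of turning a URL into a fixed-length hex file name with SHA-1, using the standard MessageDigest API:
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

static String hashedFileName(String url) throws NoSuchAlgorithmException {
    MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
    byte[] digest = sha1.digest(url.getBytes(StandardCharsets.UTF_8));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
        hex.append(String.format("%02x", b));
    }
    return hex.toString(); // 40 hex characters, independent of the URL length
}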
What you can do is save the files by an index and use an index file to find the location of the actual file.
in the directory you have:
index.txt
file1
file2
...
etc.
and in index.txt you use some data structure to find the filenames efficiently (or replace it with a DB).
Hashes are not guaranteed to be unique, but the chance of a collision is vanishingly small.
If your hash is, say, 128 bits then the chance of a collision for any pair of entries is 1 in 2^128. By the birthday paradox, if you had 10^18 entries in your table then the chance of a collision is only 1%, so you don't really need to worry about it. If you are extra paranoid then increase the size of the hash by using SHA256 or SHA512.
Obviously you need to make sure that the hashed representation actually takes up less space than the original filename. Base-64 encoded strings represent 6 bits per character so you can do the math to find out if it's even worth doing the hash in the first place.
If your file system barfs because the names are too long then you can create prefix subdirectories for the actual storage. For example, if a file maps to the hash ABCDE then you can store it as /path/to/A/B/CDE, or maybe /path/to/ABC/DE depending on what works best for your file system.
Git is a good example of this technique in practice.
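A sketch of that prefix-subdirectory layout (splitting after the first two hash characters is an arbitrary choice, the same idea Git uses for its object store):
import java.io.File;

// e.g. hash "abcdef012345..." -> <cacheRoot>/ab/cdef012345...
static File cachePathFor(File cacheRoot, String hash) {
    File prefixDir = new File(cacheRoot, hash.substring(0, 2));
    prefixDir.mkdirs(); // create the bucket directory if it does not exist yet
    return new File(prefixDir, hash.substring(2));
}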
One possible solution (there are a lot) is to create a local file (SQLite? XML? TXT?) in which you store (file_id, file_name) pairs, so you can save your downloaded files with their unique ID as the filename.
Just an idea, not the best...

Tokenize big files to hashtable in Java

I'm having this problem: I'm reading 900 files and, after processing them, my final output will be a HashMap<String, HashMap<String, Double>>. The first String is the file name, the second String is a word, and the Double is the word's frequency. The processing order is as follows:
read the first file
read the first line of the file
split the important tokens to a string array
copy the string array to my final map, incrementing word frequencies
repeat for all files
I'm reading the files with a BufferedReader. The problem is that, after processing the first few files, the hash becomes so big that performance drops sharply after a while. I would like to hear a solution for this. My idea is to create a size-limited hash, and once the limit is reached, store it to a file; do that until everything is processed, and merge all the hashes at the end.
Why not just read one file at a time, and dump that file's results to disk, then read the next file etc? Clearly each file is independent of the others in terms of the mapping, so why keep the results of the first file while you're writing the second?
You could possibly write the results for each file to another file (e.g. foo.txt => foo.txt.map), or you could create a single file with some sort of delimiter between results, e.g.
==== foo.txt ====
word - 1
the - 3
get - 3
==== bar.txt ====
apple - 2
// etc
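A quick sketch of writing one file's counts in that delimited format (hypothetical names; counts kept as integers, per the remark below):
import java.io.PrintWriter;
import java.util.Map;

// Append one file's word counts to the combined results file.
static void dumpCounts(PrintWriter out, String fileName, Map<String, Integer> counts) {
    out.println("==== " + fileName + " ====");
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
        out.println(entry.getKey() + " - " + entry.getValue());
    }
}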
By the way, why are you using double for the frequency? Surely it should be an integer value...
The time for a hash map operation shouldn't increase significantly as the map grows. It is possible that your map is degenerating because of an unsuitable hash function, or that it is simply filling up too much. Unless you're using more RAM than you can get from the system, you shouldn't have to break things up.
What I have seen with Java when running huge hash maps (or any collection) with lots of objects in memory is that the VM goes crazy trying to run the garbage collector. It gets to the point where 90% of the time is spent on the JVM kicking off the garbage collector, which takes a while and finds that almost every object still has a reference.
I suggest profiling your application, and if it is the garbage collector, then increasing heap space and tuning the garbage collector. Also, it will help if you can approximate the needed size of your hash maps and provide sufficiently large allocations (see initialCapacity and loadFactor options in the constructor).
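For example, presizing avoids repeated rehashing as the map grows (the estimate below is purely illustrative):
import java.util.HashMap;
import java.util.Map;

int expectedEntries = 1_000_000;  // illustrative guess at the number of distinct words
float loadFactor = 0.75f;
// choose a capacity large enough that expectedEntries never exceeds capacity * loadFactor
Map<String, Integer> wordCounts =
        new HashMap<>((int) (expectedEntries / loadFactor) + 1, loadFactor);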
I am trying to rethink your problem:
Since you are trying to construct an inverted index:
Use a Multimap rather than Map<String, Map<String, Integer>>:
Multimap<word, (frequency, fileName, ...something else tomorrow)>
Now, read one file, construct the Multimap and save it to disk (similar to Jon's answer).
After reading x files, merge all the Multimaps together with putAll(multimap), if you really need one common map of all the values.
You could try using this library to improve your performance.
http://high-scale-lib.sourceforge.net/
It is similar to the Java collections API, but built for high performance. It would be ideal if you can batch and merge these results after processing them in small batches.
Here is an article that will give you some more input:
http://www.javaspecialists.eu/archive/Issue193.html
Why not use a custom class,
public class CustomData {
    private String word;
    private double frequency;
    // setters and getters
}
and use your map as
Map<String, List<CustomData>>  // key = file name
This way, at least, you will have only 900 keys in your map.
-Ivar

How to treat file contents as String

I am creating a Scrabble game that uses a dictionary. For efficiency, instead of loading the entire dictionary (a txt file) into a data structure (Set, List, etc.), is there any built-in Java class that can help me treat the contents of the file as a String?
Specifically, what I want to do is check whether a word made in the game is a valid word of the dictionary by doing something simple like fileName.contains(word), instead of having a huge, memory-inefficient list and using list.contains(word).
Do you guys have any idea what I may be able to do? If the dictionary file has to be in something other than a txt file (e.g. an XML file), I am open to trying that as well.
NOTE: I am not looking for http://commons.apache.org/io/api-1.4/org/apache/commons/io/FileUtils.html#readFileToString%28java.io.File%29
This method is not a part of the Java API.
HashSet didn't come to mind; I was stuck on the idea that all contains() methods take O(n) time. Thanks to Bozho for clearing that up for me; looks like I will be using a HashSet.
I think your best option is to load them all into memory, in a HashSet. There, contains(word) is O(1).
If you are fine with having it all in memory, keeping it as one String and calling contains(..) on it is much less efficient than a HashSet.
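A minimal sketch of the HashSet approach, assuming a plain text dictionary with one word per line (the file name is made up; exception handling omitted):
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

Set<String> dictionary = new HashSet<>();
for (String line : Files.readAllLines(Paths.get("dictionary.txt"))) {
    dictionary.add(line.trim().toLowerCase());
}
boolean valid = dictionary.contains(word.toLowerCase()); // O(1) lookup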
And I have to mention another option - there's a data structure to represent dictionaries - it's called Trie. You can't find an implementation in the JDK though.
A very rough calculation says that with all English words (1 million) you will need ~12 megabytes of RAM, which is a few times less than the default memory settings of the JVM. (1 million * 6 letters on average * 2 bytes per letter = 12 million bytes, which is ~12 megabytes.) (Well, perhaps a bit more to store the hashes.)
If you really insist on not reading it into memory and you want to scan the file for a given word, you can use a java.util.Scanner and its scanner.findWithinHorizon(..). But that would be inefficient - I assume O(n), plus the I/O overhead.
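For completeness, the Scanner variant would look roughly like this (a full scan of the file on every lookup; file name is made up):
import java.io.File;
import java.util.Scanner;
import java.util.regex.Pattern;

try (Scanner scanner = new Scanner(new File("dictionary.txt"))) {
    // horizon 0 means "search up to the end of the input"
    boolean valid = scanner.findWithinHorizon("\\b" + Pattern.quote(word) + "\\b", 0) != null;
}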
While a HashSet is likely a perfectly acceptable solution (see Bozho's answer), there are other data-structures that can be used including a Trie or Heap.
The advantage a Trie has is that, depending upon implementation details, the starting prefix letters can be shared (a trie is also called a "prefix tree", after all). Depending upon implementation structure and data, this may or may not actually be an improvement.
Another option, especially if file-based access is desired, is to use a Heap -- Java's PriorityQueue is actually a heap, but it is not file-based, so this would require finding/making an implementation.
All of these data structures (and more) can be implemented to be file-based (using more I/O per lookup -- which could actually be less overall -- but saving memory) or implemented directly (e.g. use SQLite and let it do its B-Tree thing). SQLite excels in that it can be a "common tool" (once used commonly ;-) in a toolbox; data importing, inspection, and modification are easy, and "it just works". SQLite is even used in less powerful systems such as Android.
HashSet comes "for free" with Java, but there is no standard Trie or file-based Heap implementation. I would start with a HashSet - Reasoning:
Dictionary = 5MB.
Loaded into HashSet (assuming lots of overhead) = 20MB.
Memory usage in relation to other things = Minimal (assumes laptop/desktop)
Time to implement with HashSet = 2 Minutes.
I will have only "lost" 2 Minutes if I decide a HashSet wasn't good enough :-)
Happy coding.
Links to random data-structure implementations (may or may not be suitable):
TernarySearchTrie Reads in a flat file (must be specially constructed?)
TrieTree Has support for creating the Trie file from a flat file. Not sure if this Trie works from disk.
FileHash Hash which uses a file backing.
HashStore Another disk-based hash
WB B-Tree Simple B-tree implementation / "database"
SQLite Small embedded RDBMS.
UTF8String Can be used to significantly reduce the memory requirements of HashSet<String> when using a Latin dictionary. (String in Java uses UTF-16 encoding, which is a minimum of two bytes per character.)
You need to compress your data to avoid having to store all those words. The way to do so would be a tree in which nodes are letters and leaves mark the end of a word. This way you're not storing repetitive data such as "the", "there" and "these", where those words all share the same prefix.
There is a way to make this solution even more memory efficient. (Hint: letter order)
Use the readLine() method of java.io.BufferedReader. That returns a String.
String line = new BufferedReader(new FileReader(file)).readLine();
