I'm not used to working with really large datasets and I'm kind of stumped here.
I have the following code:
private static Set<String> extractWords(BufferedReader br) throws IOException {
String strLine;
String tempWord;
Set<String> words = new HashSet<String>();
Utils utils = new Utils();
int articleCounter = 0;
while(((strLine = br.readLine()) != null)){
if(utils.lineIsNotCommentOrLineChange(strLine)){
articleCounter++;
System.out.println("Working article : " + utils.getArticleName(strLine) + " *** Article #" + articleCounter + " of 3.769.926");
strLine = utils.removeURLs(strLine);
strLine = utils.convertUnicode(strLine);
String[] temp = strLine.split("\\W+");
for(int i = 0; i < temp.length; i++){
tempWord = temp[i].trim().toLowerCase();
if(utils.validateWord(tempWord)){
words.add(tempWord);
System.out.println("Added word " + tempWord + " to list");
}
}
}
}
return words;
}
This basically gets a huge text file from the BufferedReader where each line of text is a text from an article. I want to make a list of unique words in this text file, but there are 3.769.926 articles in there, so the word count is quite immense.
From what I understand about Sets, or specifically HashSets, this should be the man for the job, so to speak. Everything runs quite smoothly at first, but after 500.000 articles it starts slowing down a bit. When it reaches 700.000 it's slow enough that it basically stops for a second or two before going on again. There's a bottleneck here somewhere, and I can't see what it is.
Any ideas?
I believe the issue you may be facing is that a hash table (set or map) is backed by a table with a fixed number of entries. Your first declaration may get a table able to hold 16 entries. Putting aside things like load factors, once you try to put 17 entries into the table, it has to grow to accommodate more entries and keep collisions down, so Java will expand it for you.
This expansion involves creating a new table with 2 * previousSize entries, then copying over the old entries. So if you are constantly expanding, you will eventually hit a size such as 524,288, where the set has to create a new table able to hold 1,048,576 entries and copy over the entire previous table.
If you don't mind the extra lookup time, you might think about using a TreeSet instead of a HashSet. Your lookups will now take logarithmic time, but a tree doesn't have a pre-allocated table and can grow dynamically with ease. Either use this, or declare the size of your HashSet up front so it won't have to grow dynamically.
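To get a feel for how often the doubling described above happens, here is a rough sketch. It only mimics the rule (start at 16 buckets, double whenever the entry count exceeds capacity * load factor); it does not inspect a real HashSet, and the class and method names are made up for illustration.

```java
// Sketch only: simulates the doubling rule, assuming a 16-bucket starting table.
class ResizeSketch {
    /** Returns how many times a table starting at 16 buckets must double to hold n entries. */
    static int resizesNeeded(long n, float loadFactor) {
        long capacity = 16;
        int resizes = 0;
        // The table grows once the number of entries exceeds capacity * loadFactor.
        while (n > (long) (capacity * loadFactor)) {
            capacity *= 2;
            resizes++;
        }
        return resizes;
    }
}
```

With a load factor of 0.75, one million entries need 17 doublings under this model. Each doubling copies everything stored so far, so the last few are by far the most expensive.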
Honestly for that sort of scale you are better off going over to a database. You can embed Derby inside your application if you don't want to use a separate one.
Their indexing systems are optimised for this sort of scale, and while HashSet etc. will cope if you massage them right, you are better off using the right tool for the job.
As noted by TheSageMage, the HashSet implementation will constantly resize the underlying HashMap as the data grows. There are a couple of ways of getting around that: initial capacity and load factor. You can set both by using the 2-arg constructor: HashSet(int, float). If you know the approximate number of words you are going to need, you can set the initial capacity to be bigger than that number. This will make smaller maps work a little slower, but will prevent dramatic slow-down for larger maps. The load factor is how full the map must get before it increases the underlying size and rehashes. Since this is a relatively time-consuming operation for large maps, you may want to set it to a large fraction, say 0.9. If your initial capacity was set so that you may exceed it but will never exceed twice that size, a large load factor will guarantee that you rehash only once and as late as possible.
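A minimal sketch of the two-argument constructor in use. The expected word count is something you have to estimate yourself; the helper name and the capacity formula (expected size divided by load factor, plus one) are my own assumptions for illustration.

```java
import java.util.HashSet;
import java.util.Set;

class PresizedSet {
    /** Builds a set that should not need to grow before expectedSize entries are added. */
    static Set<String> create(int expectedSize, float loadFactor) {
        // The table grows when size exceeds capacity * loadFactor, so ask for a bit more.
        int initialCapacity = (int) (expectedSize / loadFactor) + 1;
        return new HashSet<>(initialCapacity, loadFactor);
    }
}
```

For example, `PresizedSet.create(2_000_000, 0.9f)` could replace `new HashSet<String>()` in extractWords if roughly two million distinct words are expected.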
I'm implementing a file transfer tool in Java that will transfer some 'X' no. of files, where 'X' is configurable by user from one SFTP server to another. The transfer bit works but it can potentially pick up duplicate files (logic for which is not yet in place).
Now the SFTP_source server receives several hundred thousand files every day, and I can't figure out how to perform a quick search over this behemoth list of files on the source server to avoid transferring duplicates.
Or please also suggest if there's any better, faster way to achieve this without performing an expensive search operation? If searching through file names is the only way to go then what search paradigm to use?
Thanks.
6M files is not that much memory. Experimentally, adding the string representations of the first 6M natural numbers to a HashSet<String> works with -Xmx1G and fails with -Xmx512M; and it only takes 2.5s on my machine (Java 8, 64-bit). Using a HashSet is therefore definitely feasible.
You can drastically lower the memory footprint, if you are willing to sacrifice speed, by using the disk to store an index. In that case, you may be better off using an actual database: they are very well optimized to index and search large collections that would not fit in memory.
The code that I used for testing:
import java.util.*;
public class C {
public static void main(String ... args) {
HashSet<String> hs = new HashSet<>();
long t = System.currentTimeMillis();
for (int i=0; i< 6 * 1000 * 1000; i++) {
hs.add("" + i); // add returns "false" if key is already present
}
System.out.println("Added " + hs.size() + " keys in "
+ (System.currentTimeMillis()-t));
}
}
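Building on the comment in the test code that add returns false for a key that is already present, a duplicate check for file names might look roughly like this. The class and method names are hypothetical, and how the names are listed from the SFTP server is left out entirely.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DuplicateFilter {
    private final Set<String> seen = new HashSet<>();

    /** Returns only the names that were not seen in this or any earlier call. */
    List<String> newNames(List<String> candidates) {
        List<String> fresh = new ArrayList<>();
        for (String name : candidates) {
            if (seen.add(name)) { // add returns false when the name is already present
                fresh.add(name);
            }
        }
        return fresh;
    }
}
```

This keeps every name in memory, which the measurement above suggests is feasible for a few million entries.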
My program reads a text file line by line in a while loop. It then processes each line and extracts some information to be written in the output. Everything it does inside the while loop is O(1) except two ArrayList indexOf() method calls which I suppose are O(N). The program runs at a reasonable pace (1M lines per 100 seconds) in the beginning but over time it slows down dramatically. I have 70 M lines in the input file so the loop iterates 70 million times. In theory this should take about 2 hours but in practice it takes 13 hours. Where is the problem?
Here is the code snippet:
BufferedReader corpus = new BufferedReader(
new InputStreamReader(
new FileInputStream("MyCorpus.txt"),"UTF8"));
Writer outputFile = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream("output.txt"), "UTF-8"));
List<String> words = new ArrayList<>();
//words is being updated with relevant values here
LinkedHashMap<String,Integer> DIC = new LinkedHashMap<>();
//DIC is being updated with relevant key-value pairs here
String line = "";
while ((line = corpus.readLine()) != null) {
String[] parts = line.split(" ");
if (DIC.containsKey(parts[0]) && DIC.containsKey(parts[1])) {
int firstIndexPlusOne = words.indexOf(parts[0])+ 1;
int secondIndexPlusOne = words.indexOf(parts[1]) +1;
outputFile.write(firstIndexPlusOne +" "+secondIndexPlusOne+" "+parts[2]+"\n");
} else {
notFound++;
outputFile.write("NULL\n");
}
}
outputFile.close();
I am assuming you add words to your words ArrayList as you go.
You correctly state that words.indexOf is O(N) and that is the cause of your issue. As N increases (you add words to the list) these operations take longer and longer.
To avoid this, keep your list sorted and use binarySearch.
To keep it sorted, use binarySearch on each new word to work out where to insert it. This takes the lookup complexity from O(N) to O(log N).
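A sketch of the sorted-list idea using Collections.binarySearch. The wrapper class is hypothetical and not part of the question's code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SortedWords {
    private final List<String> words = new ArrayList<>();

    /** Inserts the word at its sorted position; returns false if it is already present. */
    boolean add(String word) {
        int pos = Collections.binarySearch(words, word);
        if (pos >= 0) {
            return false;
        }
        // When the key is absent, binarySearch returns -(insertionPoint) - 1.
        words.add(-pos - 1, word);
        return true;
    }

    /** Returns the index of the word, or a negative value if it is absent. */
    int indexOf(String word) {
        return Collections.binarySearch(words, word);
    }
}
```

Two caveats: inserting into the middle of an ArrayList still shifts the elements after it, and the index of a word changes whenever an earlier word is inserted, so this only fits if the indices are read after the list is complete.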
I think words is meant to collect unique words, so use a Set.
Set<String> words = new HashSet<>();
Map<String, Integer> DIC = new HashMap<>();
Also, DIC seems to be something like a frequency table, in which case DIC.keySet() would be the same as words. A LinkedHashMap maintains an extra list to keep the entries in insertion order.
Writing the pieces separately, instead of first concatenating them into a new string, is faster.
// Writer.write(int) writes a single character, so convert the numbers to text first
outputFile.write(Integer.toString(firstIndexPlusOne));
outputFile.write(" ");
outputFile.write(Integer.toString(secondIndexPlusOne));
outputFile.write(" ");
outputFile.write(parts[2]);
outputFile.write("\n");
I think one of your problems is this line:
outputFile.write(firstIndexPlusOne +" "+secondIndexPlusOne+" "+parts[2]+"\n");
Since strings are immutable, each concatenation creates new temporary strings and clutters the memory. Also, maybe try flushing the write buffer on every turn of the loop; it may improve things a bit (just my hypothesis).
Try something like:
String line = "";
StringBuilder sb = new StringBuilder();
while ...
...
sb.append(firstIndexPlusOne);
sb.append(" ");
sb.append(secondIndexPlusOne);
sb.append(" ");
sb.append(parts[2]);
sb.append("\n");
outputFile.write(sb.toString());
sb.setLength(0);
outputFile.flush();
Also, maybe a good read: Tuning Java I/O Performance (Oracle)
If the corpus and the word list are both sorted, the linear search performed by the words.indexOf(..) call would become slower in each iteration.
Building a HashMap(..) from your word list before processing the corpus would even things out. It might be a good idea to do so for optimization, even if that is not the problem.
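A sketch of that lookup table, assuming words is fully populated before the corpus loop starts. The class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordIndex {
    /** Maps each word to the index of its first occurrence, matching List.indexOf. */
    static Map<String, Integer> build(List<String> words) {
        Map<String, Integer> index = new HashMap<>(words.size() * 2);
        for (int i = 0; i < words.size(); i++) {
            index.putIfAbsent(words.get(i), i); // keep the first index if a word repeats
        }
        return index;
    }
}
```

Inside the loop, `words.indexOf(parts[0]) + 1` would then become `index.get(parts[0]) + 1`, with a null check for words that are missing from the list.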
Assuming that you update neither words nor DIC in your loop, most of the runtime is obviously consumed when DIC.containsKey(parts[0]) && DIC.containsKey(parts[1]) evaluates to true.
If your question is "why is it slowing down", and not "how can I speed it up", I'd suggest that you take the first 10M lines of your file, copy them into another file and duplicate them so that you get 70M lines consisting of copies of your first 10M lines. Then, execute your code. If it slows down even though the same content is examined again and again, you may check the other answers regarding string builders and such.
If you don't experience the slowdown, then obviously it depends on the actual content of your 70M-line file. Probably, for the remaining 60M lines of your original file, DIC.containsKey(parts[0]) && DIC.containsKey(parts[1]) evaluates to true more often, and therefore the branch with the two indexOf calls is executed more often, taking more time.
In the latter case, I doubt that you can reduce the I/O load enough by applying single writes to obtain a performance gain, but of course I may be very wrong there. You'd have to try. But first, I'd recommend exploring the source of the problem, which I think lies in the structure of the file content. After you understand how your code performs with respect to the input given, you may try to optimize (although I would try to keep the whole string in memory and write its contents in one operation after the loop instead of performing very many small write operations).
I have a very large file (10^8 lines) with counts of events as follows,
A 10
B 11
C 23
A 11
I need to accumulate the counts for each event, so that my map contains
A 21
B 11
C 23
My current approach:
Read the lines, maintain a map, and update the counts in the map as follows
updateCount(Map<String, Long> countMap, String key,
Long c) {
if (countMap.containsKey(key)) {
Long val = countMap.get(key);
countMap.put(key, val + c);
} else {
countMap.put(key, c);
}
}
Currently this is the slowest part of the code (it takes around 25 ms).
Note that the map is based on MapDB, but I doubt that updates are slow due to that (are they?)
These are the MapDB configs for the map:
DBMaker.newFileDB(dbFile).freeSpaceReclaimQ(3)
.mmapFileEnablePartial()
.transactionDisable()
.cacheLRUEnable()
.closeOnJvmShutdown();
Are there ways to speed this up?
EDIT:
The number of unique keys is of the order of the pages in wikipedia. The data is actually page traffic data from here.
You might try
class Counter {
long count;
}
void updateCount(Map<String, Counter> countMap, String key, int c) {
Counter counter = countMap.get(key);
if (counter == null) {
counter = new Counter();
countMap.put(key, counter);
counter.count = c;
} else {
counter.count += c;
}
}
This does not create many Long wrappers; it allocates just one Counter per key.
Note: do not create Longs. Above I made c an int so that a long/Long mix-up is not overlooked.
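A different technique from the Counter class above, shown for comparison: on Java 8 or later, Map.merge does the lookup and the update in one call, replacing the containsKey/get/put sequence from the question. This sketch assumes a plain in-memory HashMap; it still boxes Long values, and whether it helps with a MapDB-backed map is something you would have to measure.

```java
import java.util.HashMap;
import java.util.Map;

class CountAccumulator {
    private final Map<String, Long> counts = new HashMap<>();

    /** Adds c to the running total for key, starting from c if the key is new. */
    void add(String key, long c) {
        counts.merge(key, c, Long::sum);
    }

    /** Returns the accumulated total, or 0 for a key that was never added. */
    long get(String key) {
        return counts.getOrDefault(key, 0L);
    }
}
```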
As a starting point, I'd suggest thinking about:
What is the yardstick by which you're saying that 25ms is actually an unreasonable amount of time for the amount of data involved and for a generic map implementation? If you quantify that, it might help you work out whether there is anything wrong.
How much time is being spent re-hashing the map versus other operations (e.g. calculation of hash codes on each put)?
What do your "events" as you call them consist of? How many unique events-- and hence unique keys-- are there? How are keys to the map being generated, and is there a more efficient way to do so? (In a standard hash map, for example, you create additional objects for each association, and actually store the key objects increasing the memory footprint.)
Depending on the answers to the previous, you could potentially roll a more efficient map structure yourself (see this example that you might be able to adapt). Essentially, you need to look specifically at what is taking the time (e.g. hash code calculation per put / cost of rehashing) and try and optimise that part.
If you are using a TreeMap, there are performance tuning options, such as the number of entries in each node.
You could also use specific key and value serializers, which will speed up serialization and deserialization.
You could use Pump mode to build the tree, which is very, very fast. One caveat is that this is only useful when you are building a new map from scratch. You can find the full example here:
https://github.com/jankotek/MapDB/blob/master/src/test/java/examples/Huge_Insert.java
First of all, let me say that I have read the following question that has been asked before, Java HashMap performance optimization / alternative, and I have a similar question.
What I want to do is take a LOT of dependencies from New York Times text, which will be processed by the Stanford parser to give dependencies, and store the dependencies in a hashmap along with their scores, i.e. if I see a dependency twice I will increment its score in the hashmap by 1.
The task starts off really quickly, about 10 sentences a second but scales off quickly. At 30 000 sentences( which is assuming 10 words in each sentence and about 3-4 dependences for each word which im storing) is about 300 000 entries in my hashmap.
How will i be able to increase the performance of my hashmap? What kind of hashkey can i use?
Thanks a lot
Martinos
EDIT 1:
OK guys, maybe I phrased my question wrongly. The byte arrays are not used in MY project but in the similar question from another person linked above. I don't know what they are using them for, which is why I asked.
Secondly: I will not post code, as I think it would make things very hard to understand, but here is a sample:
With sentence : "i am going to bed" i have dependencies:
(i , am , -1)
(i, going, -2)
(i,to,-3)
(am, going, -1)
.
.
.
(to,bed,-1)
These dependencies of all sentences(1 000 000 sentences) will be stored in a hashmap.
If i see a dependency twice i will get the score of the existing dependency and add 1.
And that is pretty much it. All is well but the rate of adding sentences in hashmap(or retrieving) scales down on this line:
dependancyBank.put(newDependancy, dependancyBank.get(newDependancy) + 1);
Can anyone tell me why?
Regards
Martinos
Trove has optimized hashmaps for the case where key or value are of primitive type.
However, much will still depend on smart choice of structure and hash code for your keys.
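As an illustration of a key with a sensible structure and hash code, here is a sketch of an immutable key class for a (word, word, offset) dependency like the ones in the question. The class name, the field names and the choice to cache the hash are my own assumptions, not something taken from the question's code.

```java
import java.util.Objects;

// Hypothetical key type; equal triples must produce equal hash codes.
final class Dependency {
    private final String first;
    private final String second;
    private final int offset;
    private final int hash; // computed once, since the fields never change

    Dependency(String first, String second, int offset) {
        this.first = first;
        this.second = second;
        this.offset = offset;
        this.hash = Objects.hash(first, second, offset);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Dependency)) return false;
        Dependency d = (Dependency) o;
        return offset == d.offset && first.equals(d.first) && second.equals(d.second);
    }

    @Override
    public int hashCode() {
        return hash;
    }
}
```

A key built by concatenating the three parts into one String would also work, but it allocates a new string for every lookup.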
This part of your question is unclear: "The task starts off really quickly, about 10 sentences a second but scales off quickly. At 30 000 sentences( which is assuming 10 words in each sentence and about 3-4 dependences for each word which im storing) is about 300 000 entries in my hashmap." You don't say what the performance is for the larger data. Your map grows, which is kind of obvious. Hashmaps are O(1) only in theory; in practice you will see some performance changes with size, due to less cache locality and occasional jumps caused by rehashing. So put() and get() times will not be constant, but they should still be close to that. Perhaps you are using the hashmap in a way which doesn't guarantee fast access, e.g. by iterating over it? In that case your time will grow linearly with size, and you can't change that unless you change your algorithm.
Google 'fastutil' and you will find a superior solution for mapping object keys to scores.
Take a look at the Guava multimaps: http://www.coffee-bytes.com/2011/12/22/guava-multimaps They are designed to basically keep a list of things that all map to the same key. That might solve your need.
How will i be able to increase the performance of my hashmap?
If it's taking more than 1 microsecond per get() or put(), you have a bug IMHO. You need to determine why it's taking as long as it is. Even in the worst case where every object has the same hashCode, you won't have performance this bad.
What kind of hashkey can i use?
That depends on the data type of the key. What is it?
and finally what are byte[] a = new byte[2]; byte[] b = new byte[3]; in the question that was posted above?
They are arrays of bytes. They can be used as values to look up, but it's likely that you need a different value type.
HashMap has an overloaded constructor which takes the initial capacity as input. The slowdown you see is caused by rehashing, during which the HashMap is virtually unusable. To prevent frequent rehashing, you need to start with a HashMap of greater initial capacity. You can also set a load factor, which indicates how full the table may get before it is rehashed.
public HashMap(int initialCapacity).
Pass the initial capacity to the HashMap during object construction. It is preferable to set a capacity to almost twice the number of elements you would want to add in the map during the course of execution of your program.
I want to use an ArrayList (or some other collection) like how I would use a standard array.
Specifically, I want it to start with an initial size (say, SIZE), and be able to set elements explicitly right off the bat,
e.g.
array[4] = "stuff";
could be written
array.set(4, "stuff");
However, the following code throws an IndexOutOfBoundsException:
ArrayList<Object> array = new ArrayList<Object>(SIZE);
array.set(4, "stuff"); //wah wahhh
I know there are a couple of ways to do this, but I was wondering if there was one that people like, or perhaps a better collection to use. Currently, I'm using code like the following:
ArrayList<Object> array = new ArrayList<Object>(SIZE);
for(int i = 0; i < SIZE; i++) {
array.add(null);
}
array.set(4, "stuff"); //hooray...
The only reason I even ask is because I am doing this in a loop that could potentially run a bunch of times (tens of thousands). Given that the ArrayList resizing behavior is "not specified," I'd rather it not waste any time resizing itself, or memory on extra, unused spots in the Array that backs it. This may be a moot point, though, since I will be filling the array (almost always every cell in the array) entirely with calls to array.set(), and will never exceed the capacity?
I'd rather just use a normal array, but my specs are requiring me to use a Collection.
The initial capacity is how big the backing array is. It does not mean there are elements in the list. So size != capacity.
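To make the size/capacity distinction concrete, here is a sketch that creates a list whose size (not just capacity) is set up front, using Collections.nCopies in place of the fill loop from the question. The helper name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class PrefilledList {
    /** Returns a resizable list that already contains size null elements. */
    static <T> List<T> ofSize(int size) {
        // Copying from nCopies sets both the capacity and the size in one step.
        return new ArrayList<>(Collections.<T>nCopies(size, null));
    }
}
```

After `List<Object> array = PrefilledList.ofSize(SIZE);`, a call such as `array.set(4, "stuff")` works, because the list really has SIZE elements.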
In fact, you can use an array, and then use Arrays.asList(array) to get a collection.
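A short sketch of the Arrays.asList route. The returned list is a fixed-size view of the array, so set works but add and remove do not. The helper name is made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

class ArrayBackedList {
    /** Returns a fixed-size list backed by a new array of the given length. */
    static List<Object> ofSize(int size) {
        return Arrays.asList(new Object[size]);
    }
}
```

Because nothing is ever resized, there is no spare capacity, which matches the requirement of filling every cell with set() and never growing.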
I recommend a HashMap
HashMap<Integer, String> hash = new HashMap<>();
hash.put(4,"Hi");
Considering that your main concern is memory, you could do manually what the Java ArrayList does, but with control over how much it grows. You can do the following:
1) Create a vector.
2) If the vector is full, create a new vector with the old vector's size plus as much as you want.
3) Copy all items from the old vector to the new vector.
This way, you will not waste memory.
Or you can implement a list (not vector) structure. I think Java already has one.
Yes, a hashmap would be a great idea.
Alternatively, you could just create the array with a capacity big enough for your purpose.