What is the quickest way to search in a large directory - java

I'm implementing a file transfer tool in Java that transfers some number 'X' of files, where 'X' is configurable by the user, from one SFTP server to another. The transfer bit works, but it can potentially pick up duplicate files (the logic to prevent this is not yet in place).
Now, the SFTP_source server receives several hundred thousand files every day, and I'm not able to figure out how to search this behemoth list of files on the source server quickly enough to avoid transferring duplicates.
Alternatively, please suggest a better, faster way to achieve this without performing an expensive search operation. If searching through file names is the only way to go, then what search paradigm should I use?
Thanks.

6M files is not that much memory. Experimentally, adding the string representations of the first 6M natural numbers to a HashSet<String> works with -Xmx1G and fails with -Xmx512M; and it only takes 2.5s on my machine (Java 8, 64-bit). Using a HashSet is therefore definitely feasible.
You can drastically lower the memory footprint if you are willing to sacrifice speed, by using the disk to store an index. In that case, you may be better off using an actual database - they are very well optimized to index and search large collections that would not fit in memory.
The code that I used for testing:
import java.util.*;

public class C {
    public static void main(String... args) {
        HashSet<String> hs = new HashSet<>();
        long t = System.currentTimeMillis();
        for (int i = 0; i < 6 * 1000 * 1000; i++) {
            hs.add("" + i); // add returns "false" if the key is already present
        }
        System.out.println("Added " + hs.size() + " keys in "
                + (System.currentTimeMillis() - t));
    }
}
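Applied to the original transfer problem, a minimal sketch could keep the names of already-transferred files in a HashSet backed by a small local index file (TransferIndex, markIfNew and the one-name-per-line index format are assumptions, not tied to any particular SFTP library):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

public class TransferIndex {
    private final Path indexFile;
    private final Set<String> transferred = new HashSet<>();

    public TransferIndex(Path indexFile) throws IOException {
        this.indexFile = indexFile;
        if (Files.exists(indexFile)) {
            transferred.addAll(Files.readAllLines(indexFile, StandardCharsets.UTF_8));
        }
    }

    // Returns true (and records the name) only if the file has not been transferred before.
    public synchronized boolean markIfNew(String fileName) throws IOException {
        if (!transferred.add(fileName)) {
            return false; // duplicate: skip this file
        }
        Files.write(indexFile,
                (fileName + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return true;
    }
}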

Related

java Program runtime is too fast? Issue with Memory

So I am running some simulations that require some sample datasets. For the sake of simplicity I am using this http://loremipsum.sourceforge.net/ Lorem Ipsum generator. I am setting a test parameter called DATASIZE that controls the number of words or paragraphs the generator creates. I am using this generated data to create an "input" and an "output" hash. The output data will use a slightly different hash. For example,
String input = hash(new LoremIpsum().getWords(DATASIZE))
String output = hash(new LoremIpsum().getWords(DATASIZE-2))
My question is, does Java keep the first data set in memory and then slightly modify it to quickly produce the output? Maybe I was just pessimistic about the runtime, but it seems very small, virtually zero according to System.currentTimeMillis(). Could it be the jar?
I also noticed something odd with my output. I am creating several objects that store this input and output hash. On some of these that I generate, for some reason the runtime is 16. Otherwise it is 0. Something with memory or just shoddy code?
It uses StringBuilder, so the answer to your question is no: there is no reuse/caching in getWords(..). - https://sourceforge.net/p/loremipsum/code/HEAD/tree/trunk/src/main/java/de/svenjacobs/loremipsum/LoremIpsum.java
Having said that, if you pass a really large number, say 1000000, then you may see a difference. I checked using my latest all-powerful MacBook Pro:
public static void main(String[] args) {
    LoremIpsum loremipsum = new LoremIpsum();
    long start;
    int number = 100000;
    for (int i = 0; i < 5; i++) {
        start = System.currentTimeMillis();
        loremipsum.getWords(number);
        System.out.println("getWords():" + (System.currentTimeMillis() - start));
    }
}
Output in ms
getWords():11
getWords():7
getWords():5
getWords():4
getWords():4

Significantly slower processing as Set size grows beyond 500.000

I'm not used to working with really large datasets and I'm kind of stumped here.
I have the following code:
private static Set<String> extractWords(BufferedReader br) throws IOException {
    String strLine;
    String tempWord;
    Set<String> words = new HashSet<String>();
    Utils utils = new Utils();
    int articleCounter = 0;
    while ((strLine = br.readLine()) != null) {
        if (utils.lineIsNotCommentOrLineChange(strLine)) {
            articleCounter++;
            System.out.println("Working article : " + utils.getArticleName(strLine)
                    + " *** Article #" + articleCounter + " of 3.769.926");
            strLine = utils.removeURLs(strLine);
            strLine = utils.convertUnicode(strLine);
            String[] temp = strLine.split("\\W+");
            for (int i = 0; i < temp.length; i++) {
                tempWord = temp[i].trim().toLowerCase();
                if (utils.validateWord(tempWord)) {
                    words.add(tempWord);
                    System.out.println("Added word " + tempWord + " to list");
                }
            }
        }
    }
    return words;
}
This basically gets a huge text file from the BufferedReader where each line of text is a text from an article. I want to make a list of unique words in this text file, but there are 3.769.926 articles in there, so the word count is quite immense.
From what I understand about Sets, or specifically HashSets, this should be the man for the job, so to speak. Everything runs quite smoothly at first, but after 500.000 articles it starts slowing down a bit. When it reaches 700.000 it's beginning to get slow enough that it basically stops for a second or two before going on again. There's a bottleneck here somewhere, and I can't see what it is.
Any ideas?
I believe the issue you may be facing is that a hash table (set or map) is backed by an array with a fixed number of entries it can hold. So your first declaration may have a table able to hold 16 entries. Putting aside things like load factors, once you try to put a 17th entry into the table, it has to grow to accommodate more entries and prevent collisions, so Java will expand it for you.
This expansion involves creating a new table with 2 * previousSize entries and then copying over the old entries. So if you are constantly expanding, you may end up hitting a size like 524,288 where it has to grow again: it creates a new table able to handle 1,048,576 entries, but it first has to copy over the entire previous table.
If you don't mind the extra lookup time, you might think about using a TreeSet instead of a HashSet. Your lookups will now take logarithmic time, but a tree doesn't have a pre-allocated table and can grow dynamically with ease. Either use this, or declare the size of your HashSet up front so it won't have to grow dynamically.
Honestly for that sort of scale you are better off going over to a database. You can embed Derby inside your application if you don't want to use a separate one.
Their indexing systems are optimised for this sort of scale, and while HashSet etc. will cope if you massage them right, you are better off using the right tool for the job.
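A rough sketch of the embedded-Derby idea (the database name, table layout and duplicate check are assumptions, and the CREATE TABLE would fail on a second run because the table already exists):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class WordStore {
    public static void main(String[] args) throws Exception {
        // Creates ./wordsdb on first use; derby.jar must be on the classpath.
        try (Connection con = DriverManager.getConnection("jdbc:derby:wordsdb;create=true")) {
            try (Statement st = con.createStatement()) {
                // Fails on a second run because the table already exists.
                st.executeUpdate("CREATE TABLE words (word VARCHAR(255) PRIMARY KEY)");
            }
            try (PreparedStatement check = con.prepareStatement("SELECT 1 FROM words WHERE word = ?");
                 PreparedStatement insert = con.prepareStatement("INSERT INTO words VALUES (?)")) {
                String word = "example";
                check.setString(1, word);
                boolean seen;
                try (ResultSet rs = check.executeQuery()) {
                    seen = rs.next(); // true if the word is already indexed
                }
                if (!seen) {
                    insert.setString(1, word);
                    insert.executeUpdate();
                }
            }
        }
    }
}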
As noted by TheSageMage, the HashSet implementation will constantly resize the underlying HashMap as the data grows. There are a couple of ways of getting around that: initial capacity and load factor. You can set both using the two-argument constructor, HashSet(int, float). If you know the approximate number of words you are going to need, you can set the initial capacity to be bigger than that number. This will make smaller sets work a little slower, but will prevent dramatic slow-downs for larger ones. The load factor determines how full the map must get before the underlying table is resized and rehashed. Since this is a relatively time-consuming operation for large maps, you may want to set it to a large fraction, say 0.9. If your initial capacity is set so that you may exceed it but will never exceed twice that size, a large load factor guarantees that you rehash only once, and as late as possible.
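A minimal sketch of that pre-sizing idea (the expected word count is an assumption):

import java.util.HashSet;
import java.util.Set;

public class PresizedSet {
    public static void main(String[] args) {
        int expectedWords = 5_000_000;   // assumed upper bound on distinct words
        float loadFactor = 0.9f;
        // Capacity chosen so the set never needs to resize for expectedWords entries.
        int initialCapacity = (int) Math.ceil(expectedWords / loadFactor);
        Set<String> words = new HashSet<>(initialCapacity, loadFactor);
        System.out.println("capacity hint = " + initialCapacity + ", size = " + words.size());
    }
}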

How to generate an incremental identifier in Java

I have a requirement in which I continuously receive messages that need to be written to a file. Every time a new message is received, it needs to be written to a separate file. What I want is to generate a unique identifier to be used as the file name. I also want to preserve the order of the messages. By this I mean the identifier generated as a file name should always be incremental.
I was using UUID.randomUUID() to generate file names, but the problem with this approach is that UUID only assures randomness of the identifier; it is not incremental. As a result I am losing the ordering of the files (the file generated first should appear first in the list).
Approaches I know of:
1. I can use System.currentTimeMillis(), but I may receive multiple messages at the same timestamp.
2. Another approach could be to keep a static long value and increment it whenever a file is to be created, and use that long value as the file name. But I am not sure about this approach, and it doesn't seem like a proper solution to my problem; I think there could be far better solutions than this one.
If someone could suggest a better solution to this problem, it would be highly appreciated.
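For what it's worth, a minimal single-JVM sketch of approach 2 (the class name, zero-padding and time-based seeding are assumptions; it does not order reliably across restarts if ids are produced faster than the clock advances):

import java.util.concurrent.atomic.AtomicLong;

public class FileNameSequence {
    // Seeded with the current time so names still tend to rise after a restart.
    private static final AtomicLong COUNTER = new AtomicLong(System.currentTimeMillis());

    public static String nextFileName() {
        // Zero-padded so lexicographic order matches numeric order.
        return String.format("%020d", COUNTER.getAndIncrement());
    }
}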
If you want your id value to uniformly rise even between server restarts, then you must either base it on the system time or have some elaborately robust logic that persists the last ID used. Note that achieving robustness on its own is not hard, but achieving it in a performant and scalable way is.
If you additionally need the id to be unique across multiple nodes in a redundant server cluster, then you need even more elaborate logic, which definitely involves a persistent store to which all the boxes synchronize access. Making this performant is, of course, even harder.
The best option I can see is to have a quite long ID so there's room for these parts:
System.currentTimeMillis for long-term uniqueness (across restarts);
System.nanoTime() for finer granularity;
a unique id of each server node (determined in a platform-specific way).
The method will still have to remember the last value generated and retry in case of a duplicate. It won't have to retry too many times, though, just until the next nanoTime clock tick; it could even busy-wait for it.
Sketch of code without point 3 (single-node implementation):
private static long lastNanos;

public static synchronized String uniqueId() {
    for (;/*ever*/;) {
        final long n = System.nanoTime();
        if (n == lastNanos) continue;
        lastNanos = n;
        return "" + System.currentTimeMillis() + n;
    }
}
Ok, my hands up. My last answer was fairly flaky and I've deleted it.
Keeping with the spirit of the site, I thought I'd try a different tack.
If you say you are keeping these messages in a single file, then you could try something like creating a unique id out of the size of the file.
Before you write the message to the file, its id could be the current size of the file.
You could use the filename + size as the id if these messages need to be unique across a number of files.
I'll leave the hot potato of synchronization to another day. But you could wrap all of this up in a synchronized object that keeps track of things.
Also, I am assuming that any messages written to the file will not be removed in the future.
ADDITIONAL NOTE:
You could create a message-processing object that opens the file on construction (or via a create method).
This object will get the initial size of the file and this will be used as the unique id.
As each message is added (in a synchronized manner), the id is incremented by the size of the message.
This would address the performance issues. Will not work if more than one JVM/Node accesses the same file.
Skeletal Idea:
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

// Message is assumed to expose setId(long) and getSize(), as in the text above;
// toString() stands in for whatever serialization Message provides.
public class MessageSink {
    private long id = 0;
    private final File file;
    private final Deque<Message> pending = new ArrayDeque<>();

    public MessageSink(String filename) {
        file = new File(filename);
        id = file.length(); // the initial id is the current file size
    }

    public synchronized void addMessage(Message msg) {
        msg.setId(id);
        // write to file + flush here, or queue the message and write it
        // out later in flushMessages()
        pending.add(msg);
        id = id + msg.getSize();
    }

    public synchronized void flushMessages() throws IOException {
        try (FileWriter out = new FileWriter(file, true)) { // open the file for append
            while (!pending.isEmpty()) {
                out.write(pending.poll().toString()); // write each queued message
            }
            out.flush();
        } // the file is closed here
    }
}
I had the same requirement and found a suitable solution. Twitter Snowflake uses a simple algorithm to generate sortable 64-bit (long) ids. Snowflake is written in Scala, but the approach is simple and can easily be used in Java code.
id is composed of:
timestamp - 41 bits (millisecond precision w/ a custom epoch gives us 69 years);
machine id - 10 bits (MAC address could be used as a hardware id);
sequence number - 12 bits - rolls over every 4096 per machine (with protection to avoid rollover in the same ms)
The formula looks like: ((timestamp - customEpoch) << timestampShift) | (machineId << machineIdShift) | sequenceNumber
The shift for each component depends on its bit position in the id (see the sketch after the links below).
A detailed description and source code can be found on GitHub:
Twitter Snowflake
Basic Java implementation of the Snowflake algorithm
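For illustration, a minimal single-node sketch of the bit layout described above (the custom epoch and the machine-id handling are assumptions; the linked implementations are more complete):

public class SnowflakeSketch {
    // Illustrative constants: 41-bit timestamp, 10-bit machine id, 12-bit sequence.
    private static final long CUSTOM_EPOCH = 1288834974657L; // assumed custom epoch
    private static final long MACHINE_ID_BITS = 10L;
    private static final long SEQUENCE_BITS = 12L;
    private static final long MACHINE_ID_SHIFT = SEQUENCE_BITS;
    private static final long TIMESTAMP_SHIFT = SEQUENCE_BITS + MACHINE_ID_BITS;
    private static final long SEQUENCE_MASK = (1L << SEQUENCE_BITS) - 1;

    private final long machineId;   // 0..1023, assumed to be configured per node
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeSketch(long machineId) {
        this.machineId = machineId;
    }

    public synchronized long nextId() {
        long timestamp = System.currentTimeMillis();
        if (timestamp == lastTimestamp) {
            sequence = (sequence + 1) & SEQUENCE_MASK;
            if (sequence == 0) {
                // Sequence rolled over within the same millisecond: wait for the next one.
                while (timestamp <= lastTimestamp) {
                    timestamp = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = timestamp;
        return ((timestamp - CUSTOM_EPOCH) << TIMESTAMP_SHIFT)
                | (machineId << MACHINE_ID_SHIFT)
                | sequence;
    }
}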

Java: processing a file fails with java.lang.OutOfMemoryError: GC overhead limit exceeded

I have the following Java class that reads from a file containing many lines of tab-delimited strings. An example line looks like the following:
GO:0085044 GO:0085044 GO:0085044
The code reads each line, uses the split function to put the three substrings into an array, and then puts them into a two-level hash map.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class LCAReader {
    public static void main(String[] args) {
        Map<String, Map<String, String>> termPairLCA = new HashMap<String, Map<String, String>>();
        File ifile = new File("LCA1.txt");
        try {
            BufferedReader reader = new BufferedReader(new FileReader(ifile));
            String line = null;
            while ((line = reader.readLine()) != null) {
                String[] arr = line.split("\t");
                if (termPairLCA.containsKey(arr[0])) {
                    if (termPairLCA.get(arr[0]).containsKey(arr[1])) {
                        System.out.println("Error: Duplicate term in LCACache");
                    } else {
                        termPairLCA.get(arr[0]).put(new String(arr[1]), new String(arr[2]));
                    }
                } else {
                    Map<String, String> tempMap = new HashMap<String, String>();
                    tempMap.put(new String(arr[1]), new String(arr[2]));
                    termPairLCA.put(new String(arr[0]), tempMap);
                }
            }
            reader.close();
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
When I ran the program, I got the following runtime error after it had been running for some time. I noticed the memory usage kept increasing.
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.regex.Pattern.compile(Pattern.java:1469)
at java.util.regex.Pattern.<init>(Pattern.java:1150)
at java.util.regex.Pattern.compile(Pattern.java:840)
at java.lang.String.split(String.java:2304)
at java.lang.String.split(String.java:2346)
at LCAReader.main(LCAReader.java:17)
The input file is almost 2G and the machine I ran the program on has 8G of memory. I also tried the -Xmx4096m parameter when running the program but that did not help. So I guess there is some memory leak in my code, but I cannot find it.
Can anyone help me on this? Thanks in advance!
There's no memory leak; you're just trying to store too much data. 2GB of text will take 4GB of RAM as Java characters; plus there's about 48 bytes per String object overhead. Assuming the text is in 100 character lines, there's about another GB, for a total of 5GB -- and we haven't even counted the Map.Entry objects yet! You'd need a Java heap of at least, conservatively, 6GB to run this program on your data, and maybe more.
There are a couple of easy things you can do to improve this. First, lose the new String() constructors -- they're useless and just make the garbage collector work harder. Strings are immutable so you never need to copy them. Second, you could use the intern pool to share duplicate strings -- this may or may not help, depending on what the data actually looks like. But you could try, for example,
tempMap.put(arr[1].intern(), arr[2].intern() );
These simple steps might help a lot.
I don't see any leak; you simply need a huge amount of memory to store your map.
There is a very good tool for verifying this: make a heap dump with the option -XX:+HeapDumpOnOutOfMemoryError and import it into Eclipse Memory Analyzer (which comes in a standalone version). It can show you the biggest retained objects and the reference tree that could be preventing the garbage collector from doing its job.
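For example, the program from the question could be run with something like the following (the dump path is just an assumption):

java -Xmx4096m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/lcareader.hprof LCAReader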
In addition, a profiler such as the NetBeans Profiler can give you a lot of interesting real-time information (for instance, to check the number of String and char[] instances).
It is also good practice to split your code into different classes, each with its own responsibility: the "two-key map" class (TreeMap) on one side and a "parser" class on the other; it should make debugging easier.
It is definitely not a good idea to store this huge map in RAM... or else you need to run a benchmark with some smaller files and extrapolate to estimate the RAM your system needs to fit your worst case, and set Xmx to the proper value.
Why don't you use a key-value store such as Berkeley DB: it is simpler than a relational DB and should exactly fit your need for two-level indexing.
Check this post for the choice of the store: key-value store suggestion
Good luck
You probably shouldn't use String.split and store the information as plain Strings, as this generates lots of String objects on the fly.
Try a char-based approach, since your format seems rather fixed and you know the exact indices of the different data points on a line.
If you're a bit more into experimenting, you could try a NIO-backed approach with a memory-mapped DirectByteBuffer or a CharBuffer that is used to traverse the file. There you could just mark the indices of the different data points in marker objects and only load the real String data later in the process, when needed.
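A rough sketch of the memory-mapped variant for the tab-delimited file from the question (the chunking and column handling are illustrative; a real version would map the file in windows and handle lines that span chunk boundaries):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class MappedScan {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("LCA1.txt", "r");
             FileChannel ch = raf.getChannel()) {
            // Map the first chunk only; a full solution would map the file in windows.
            int limit = (int) Math.min(ch.size(), Integer.MAX_VALUE - 8);
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, limit);

            int lineStart = 0;
            for (int i = 0; i < limit; i++) {
                if (buf.get(i) == '\n') {
                    byte[] line = new byte[i - lineStart];
                    buf.position(lineStart);
                    buf.get(line);
                    // Locate tab positions here instead of calling String.split(),
                    // and only turn the pieces you actually need into Strings.
                    String firstColumn = new String(line, 0, indexOf(line, (byte) '\t'),
                            StandardCharsets.US_ASCII);
                    lineStart = i + 1;
                }
            }
        }
    }

    private static int indexOf(byte[] a, byte b) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b) return i;
        }
        return a.length;
    }
}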

Tokenize big files to hashtable in Java

I'm having this problem: I'm reading 900 files and, after processing them, my final output will be a HashMap<String, HashMap<String, Double>>. The first string is the file name, the second string is a word, and the double is the word's frequency. The processing order is as follows:
read the first file
read the first line of the file
split the important tokens to a string array
copy the string array to my final map, incrementing word frequencies
repeat for all files
I'm using a BufferedReader. The problem is that after processing the first few files, the hash becomes so big that performance drops off badly after a while. I would like to hear a solution for this. My idea is to create a size-limited hash and, once the limit is reached, store it to a file; do that until everything is processed, then merge all the hashes at the end.
Why not just read one file at a time, and dump that file's results to disk, then read the next file etc? Clearly each file is independent of the others in terms of the mapping, so why keep the results of the first file while you're writing the second?
You could possibly write the results for each file to another file (e.g. foo.txt => foo.txt.map), or you could create a single file with some sort of delimiter between results, e.g.
==== foo.txt ====
word - 1
the - 3
get - 3
==== bar.txt ====
apple - 2
// etc
By the way, why are you using double for the frequency? Surely it should be an integer value...
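For illustration, dumping one file's results with such a delimiter could look roughly like this (the method name and the Map<String, Integer> shape are assumptions):

import java.io.PrintWriter;
import java.util.Map;

public class DumpPerFile {
    // Appends one file's word counts to the combined results file, section by section.
    static void dumpResults(String sourceFile, Map<String, Integer> counts, PrintWriter out) {
        out.println("==== " + sourceFile + " ====");
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.println(e.getKey() + " - " + e.getValue());
        }
        out.flush(); // nothing from this file needs to stay in memory afterwards
    }
}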
The time a hash map takes per operation shouldn't increase significantly as it grows. It is possible that your map is skewed because of an unsuitable hash function, or that it is filling up too much. Unless you're using more RAM than you can get from the system, you shouldn't have to break things up.
What I have seen with Java when running huge hash maps (or any collection) with a lot of objects in memory is that the VM goes crazy trying to run the garbage collector. It gets to the point where 90% of the time is spent on the JVM kicking off the garbage collector, which takes a while and finds that almost every object still has a reference.
I suggest profiling your application, and if it is the garbage collector, increasing the heap space and tuning the collector. It will also help if you can approximate the needed size of your hash maps and provide sufficiently large allocations (see the initialCapacity and loadFactor options in the constructor).
I am trying to rethink your problem:
Since you are trying to construct an inverted index:
Use a Multimap rather than Map<String, Map<String, Integer>>
Multimap<word, (frequency, fileName, ...something else tomorrow)>
Now, read one file, construct the Multimap and save it on disk. (similar to Jon's answer)
After reading x files, merge all the Multimaps together: putAll(multimap) if you really need one common map of all the values.
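As a rough illustration of that idea with Guava's HashMultimap (the key/value layout is an assumption; the original suggestion leaves it open):

import com.google.common.collect.HashMultimap;
import com.google.common.collect.Multimap;

public class IndexSketch {
    public static void main(String[] args) {
        // One multimap per file: word -> "fileName:count" entries.
        Multimap<String, String> fileIndex = HashMultimap.create();
        fileIndex.put("word", "foo.txt:1");
        fileIndex.put("the", "foo.txt:3");

        // After reading x files, merge the per-file multimaps into one, if needed.
        Multimap<String, String> merged = HashMultimap.create();
        merged.putAll(fileIndex);
        System.out.println(merged);
    }
}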
You could try using this library to improve your performance.
http://high-scale-lib.sourceforge.net/
It is similar to the Java collections API, but built for high performance. It would be ideal if you can batch and merge these results after processing them in small batches.
Here is an article that will give you some more pointers.
http://www.javaspecialists.eu/archive/Issue193.html
Why not use a custom class,
public class CustomData {
    private String word;
    private double frequency;
    // Setters and getters
}
and use your map as
Map<String, List<CustomData>> (keyed by file name)
This way you will at least have only 900 keys in your map.
-Ivar
