Search multiple words in a string (Java)

I have a text file of about 10 pages, in which I have to search for a series of 10,000 words to check how many of them exist in the file content.
I am using the Aho-Corasick algorithm for the search. However, to check whether each word exists, I have to add all 10,000 terms to the list; that is, I have to iterate 10,000 times to learn whether each word exists. (The 10,000 can grow to n.)
The problems with this approach: CPU usage spikes because of the 10,000-iteration loop, and the task takes a long time to complete.
I am looking for an alternative approach where I can give all 10,000 words at once (to avoid the looping) and get the result for each word.
Is there a way to implement this? Or is there any other alternative to Aho-Corasick search for this scenario?

Invert the search: create a Set of the words in the text; then looking up whether a term is in the source material is O(1). Still, as others have suggested, if you need to do more complex matching than simple term existence, I'd also recommend using Lucene.
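A minimal sketch of that inversion (the file names and the \W+ tokenizing are just placeholders for this example):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class TermLookup {
    public static void main(String[] args) throws Exception {
        // Tokenize the text once: O(length of text).
        String text = new String(Files.readAllBytes(Paths.get("document.txt")));
        Set<String> textWords = new HashSet<>();
        for (String w : text.split("\\W+")) {
            textWords.add(w.toLowerCase());
        }

        // Each of the 10,000 term lookups is now O(1) on average.
        for (String term : Files.readAllLines(Paths.get("terms.txt"))) {
            System.out.println(term + " -> " + textWords.contains(term.toLowerCase()));
        }
    }
}

You still loop over the 10,000 terms, but each iteration is a constant-time hash lookup rather than a scan of the document.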

Related

Efficient string searching in Java

I am working with two big lists of data and I need to efficiently check for matches between the two. This is the scenario:
Reading from a file line by line (this file has 1 million lines)
For each line, check within an ArrayList of strings whether it has a match (this ArrayList also has a huge number of elements)
If a match is found, replace the line from the file with a new value
Any ideas what would be a good way to tackle this problem in terms of efficiency? Obviously looping through that number of records is hopelessly inefficient and process heavy.
Thanks for any help!
UPDATE
It's worth noting that I'm not specifically saying I need to use an ArrayList; that is just something I was using for testing. Any suggestions for more efficient Collections would be welcome.
Without more details (such as the nature of the keys) it is difficult to be certain, but you may find a Bloom filter useful for minimising the number of times you have to check the ArrayList of strings for a match.
Obviously this would not help much if the lookup list changes over time.
You would use the Bloom filter as a pre-check before searching the list, because it can very quickly give you a definite "no" when the key does not exist in the list. You will still need to search your list when the Bloom filter says "maybe".
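A rough sketch of that pre-check, assuming Guava's BloomFilter (the false-positive rate and the fallback to List.contains are illustrative):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class BloomPreCheck {
    private final BloomFilter<String> filter;
    private final List<String> lookupList;

    BloomPreCheck(List<String> lookupList) {
        this.lookupList = lookupList;
        // Sized to the list, with ~1% false positives.
        this.filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                lookupList.size(), 0.01);
        for (String s : lookupList) {
            filter.put(s);
        }
    }

    boolean matches(String line) {
        // A "no" from the filter is definite: skip the expensive search.
        if (!filter.mightContain(line)) {
            return false;
        }
        // A "maybe": fall back to the real search.
        return lookupList.contains(line);
    }
}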
You may consider reading the file in parts with different threads.
A similar issue is discussed here.
You may process the text in chunks (say, x bytes or one line each); each chunk can be handled by a different thread, i.e. one thread per chunk.
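One way the chunking could look, with a fixed pool and one task per batch of lines (the batch size of 10,000 and the process stub are arbitrary):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ChunkedSearch {
    static final int CHUNK = 10_000;

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try (BufferedReader in = new BufferedReader(new FileReader("big.txt"))) {
            List<String> chunk = new ArrayList<>(CHUNK);
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == CHUNK) {          // hand a full chunk to a worker
                    final List<String> work = chunk;
                    pool.submit(() -> process(work));
                    chunk = new ArrayList<>(CHUNK);
                }
            }
            final List<String> work = chunk;          // last, partial chunk
            pool.submit(() -> process(work));
        }
        pool.shutdown();
    }

    static void process(List<String> lines) {
        // do the match/replace for each line here
    }
}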
You should use a HashMap, which is approximately O(1); or, if your strings have a lot of collisions, a TreeSet, which is O(log N); or a Bloom filter.

Which data structure should I use to search a string from CSV?

I have a CSV file with nearly 200,000 rows containing two columns: name and job. The user then inputs a name, say user_name, and I have to search the entire CSV for the names that contain the pattern user_name and finally print the output to the screen. I have implemented this using an ArrayList in Java, where I put all the names from the CSV into the ArrayList and then searched it for the pattern. But in that case the overall time complexity of the search is O(n). Is there any other data structure in Java that I can use to perform the search in O(log n), or anything more efficient than an ArrayList? I can't use any database approach, by the way. Also, if there is a good data structure in any other language that I could use to accomplish my goal, kindly suggest it.
Edit: the output should be the names in the CSV that contain the pattern user_name as the last part. E.g., if my input is "son", it should return "jackson", etc. What I have done so far is read the name column of the CSV into a String ArrayList, then check each element using a regular expression (Java's Pattern/Matcher) to see whether it has user_name as the last part; if yes, I print it. If I implement this in a multi-threaded environment, will it increase the scalability and performance of my program?
You can use a TreeMap; it is a sorted red-black tree.
If you are unable to use a commercial database then you are going to have to write code to mimic some of a database's functionality.
To search the entire dataset sequentially in O(n) time you just read it and search each line. If you write a program that loads the data into an in-memory Map, you could search the Map in amortized O(1) time but you'd still be loading it into memory each time, which is an O(n) operation, gaining you nothing.
So the next approach is to build a disk-based index of some kind that you can search efficiently without reading the entire file, and then use the index to tell you where the record you want is located. This would be O(log n), but now you have taken on significant complexity: building, maintaining, and managing the disk-based index. This is what database systems are optimized to do.
If you had 200 MILLION rows, then the only feasible solution would be to use a database. For 200 THOUSAND rows, my recommendation is to just scan the file each time (i.e. use grep or if that's not available write a simple program to do something similar).
BTW, if your allusion to finding a "pattern" means you need to search for a regular expression, then you MUST scan the entire file every time since without knowing the pattern you cannot build an index.
In summary: use grep
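That "simple program to do something similar" can be very small. A sketch, assuming a name,job CSV layout and the suffix-match semantics from the question's edit:

import java.io.BufferedReader;
import java.io.FileReader;

public class CsvGrep {
    public static void main(String[] args) throws Exception {
        String userName = args[0].toLowerCase();
        try (BufferedReader in = new BufferedReader(new FileReader("people.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String name = line.split(",", 2)[0];   // first column is the name
                if (name.toLowerCase().endsWith(userName)) {
                    System.out.println(name);
                }
            }
        }
    }
}

A scan like this over 200,000 short rows is typically fast enough that nothing more elaborate is needed, which is the point of the answer above.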

How to remove duplicate words using Java when there are more than 200 million words?

I have a file (size ≈ 1.9 GB) which contains ~220,000,000 (~220 million) words/strings. There is duplication: roughly one duplicate word in every 100 words.
In my second program, I want to read the file. I have successfully read the file line by line using a BufferedReader.
Now, to remove the duplicates we can use a Set (and its implementations), but a Set runs into problems, as the following three scenarios show:
With the default JVM heap size, the Set can hold 0.7-0.8 million words before the OutOfMemoryError.
With a 512 MB heap, the Set can hold 5-6 million words before the OOM error.
With a 1024 MB heap, the Set can hold 12-13 million words before the OOM error. Moreover, after 10 million records have been added to the Set, operations become extremely slow; for example, adding the next ~4,000 records took 60 seconds.
I have the restriction that I can't increase the JVM heap size any further, and I want to remove the duplicate words from the file.
Please let me know if you have ideas about any other ways/approaches to remove duplicate words from such a gigantic file using Java. Many thanks :)
Additional info: my words are basically alphanumeric IDs which are unique in our system, so they are not plain English words.
Use merge sort and remove the duplicates in a second pass. You could even remove the duplicates while merging (just keep the latest word added to the output in RAM and compare each candidate to it as well).
Divide the huge file into 26 smaller files based on the first letter of the word. If any of the letter files are still too large, divide that letter file by using the second letter.
Process each of the letter files separately using a Set to remove duplicates.
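A sketch of that two-pass partitioning, assuming one word per line and made-up file names:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class PartitionDedup {
    public static void main(String[] args) throws Exception {
        // Pass 1: split by first character so each bucket fits in memory.
        Map<Character, PrintWriter> buckets = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("words.txt"))) {
            String word;
            while ((word = in.readLine()) != null) {
                if (word.isEmpty()) continue;
                char c = Character.toLowerCase(word.charAt(0));
                buckets.computeIfAbsent(c, PartitionDedup::newWriter).println(word);
            }
        }
        buckets.values().forEach(PrintWriter::close);

        // Pass 2: dedupe each bucket with an in-memory Set.
        try (PrintWriter out = new PrintWriter("unique.txt")) {
            for (char c : buckets.keySet()) {
                Set<String> seen = new HashSet<>();
                try (BufferedReader in = new BufferedReader(
                        new FileReader("bucket_" + c + ".txt"))) {
                    String word;
                    while ((word = in.readLine()) != null) {
                        if (seen.add(word)) out.println(word);
                    }
                }
            }
        }
    }

    static PrintWriter newWriter(char c) {
        try {
            return new PrintWriter("bucket_" + c + ".txt");
        } catch (java.io.FileNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}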
You might be able to use a trie data structure to do the job in one pass. It has advantages that recommend it for this type of problem. Lookup and insert are quick. And its representation is relatively space efficient. You might be able to represent all of your words in RAM.
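A bare-bones trie along those lines; this map-based version is the simplest to write, though the space savings the answer mentions really come from packed-array representations:

import java.util.HashMap;
import java.util.Map;

public class DedupTrie {
    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    // Inserts the word; returns true only the first time it is seen.
    public boolean addIfAbsent(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        if (node.isWord) return false;   // duplicate
        node.isWord = true;
        return true;
    }
}

Reading the file once and writing out only the words for which addIfAbsent returns true removes the duplicates in a single pass.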
If you sort the items, duplicates will be easy to detect and remove, as the duplicates will bunch together.
There is code here you could use to mergesort the large file:
http://www.codeodor.com/index.cfm/2007/5/10/Sorting-really-BIG-files/1194
For large files I try not to read the data into memory but instead operate on a memory-mapped file and let the OS page memory in and out as needed. If your set structures held offsets into this memory-mapped file instead of the actual strings, they would consume significantly less memory.
Check out this article:
http://javarevisited.blogspot.com/2012/01/memorymapped-file-and-io-in-java.html
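A sketch of the mapping itself (note that a single MappedByteBuffer caps out at 2 GB, which the 1.9 GB file just fits under; recording (start, end) offsets in place of strings is left as a comment):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedScan {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("words.txt", "r");
             FileChannel channel = file.getChannel()) {
            // The OS pages this in and out; the heap holds only the buffer view.
            MappedByteBuffer buf = channel.map(
                    FileChannel.MapMode.READ_ONLY, 0, channel.size());
            int size = (int) channel.size();
            int wordStart = 0;
            for (int i = 0; i < size; i++) {
                if (buf.get(i) == '\n') {
                    // (wordStart, i) are the offsets of one word; store these
                    // in your set structure instead of the String itself.
                    wordStart = i + 1;
                }
            }
        }
    }
}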
Question: Are these really WORDS, or are they something else -- phrases, part numbers, etc?
For WORDS in a common spoken language, one would expect that after the first couple of thousand you'd have found most of the unique words, so all you really need to do is read a word in and check it against a dictionary; if found, skip it; if not found, add it to the dictionary and write it out.
In this case your dictionary is only a few thousand words large. And you don't need to retain the source file since you write out the unique words as soon as you find them (or you can simply dump the dictionary when you're done).
If you have the possibility of inserting the words into a temporary table of a database (using batch inserts), then it would be a SELECT DISTINCT against that table.
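In JDBC terms that could look roughly like this; the embedded H2 database, the table layout, and the batch size are all placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class DbDedup {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:h2:./dedup");
             BufferedReader in = new BufferedReader(new FileReader("words.txt"))) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE words(w VARCHAR(255))");
            }
            try (PreparedStatement ps =
                     con.prepareStatement("INSERT INTO words VALUES (?)")) {
                String word;
                int n = 0;
                while ((word = in.readLine()) != null) {
                    ps.setString(1, word);
                    ps.addBatch();
                    if (++n % 10_000 == 0) ps.executeBatch();  // flush each batch
                }
                ps.executeBatch();
            }
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT DISTINCT w FROM words")) {
                while (rs.next()) System.out.println(rs.getString(1));
            }
        }
    }
}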
One classic way to solve this kind of problem is a Bloom filter. Basically you hash your word a number of times and for each hash result set some bits in a bit vector. If you're checking a word and all the bits from its hashes are set in the vector you've probably (you can set this probability arbitrarily low by increasing the number of hashes/bits in the vector) seen it before and it's a duplicate.
This was how early spell checkers worked: they knew whether a word was in the dictionary, but they couldn't tell you the correct spelling, because the filter can only tell you whether the current word has been seen.
There are a number of open source implementations out there including java-bloomfilter
I'd tackle this in Java the same way as in every other language: Write a deduplication filter and pipe it as often as necessary.
This is what I mean (in pseudo code):
Input parameters: Offset, Size
Allocate searchable structure of size Size (=Set, but need not be one)
Read Offset elements from stdin (or until EOF) and just copy them to stdout.
Read Size elements from stdin (or until EOF) and store them in the Set. If a duplicate, drop it, else write it to stdout.
Read elements from stdin until EOF; if they are in the Set then drop them, else write them to stdout.
Now pipe as many instances as you need (If storage is no problem, maybe only as many as you have cores) with increasing Offsets and sane Size. This lets you use more cores, as I suspect the process is CPU bound. You can even use netcat and spread processing over more machines, if you are in a hurry.
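A runnable version of one pipeline stage, taking the pseudocode above literally (words arrive one per line on stdin):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

// One stage: dedupes the window [offset, offset + size) of the stream.
public class DedupFilter {
    public static void main(String[] args) throws Exception {
        long offset = Long.parseLong(args[0]);
        int size = Integer.parseInt(args[1]);
        Set<String> seen = new HashSet<>(size * 2);
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String word;
        long read = 0;
        while ((word = in.readLine()) != null) {
            if (read < offset) {
                System.out.println(word);                      // step 1: copy through
            } else if (read < offset + size) {
                if (seen.add(word)) System.out.println(word);  // step 2: fill Set, drop dups
            } else if (!seen.contains(word)) {
                System.out.println(word);                      // step 3: drop known words
            }
            read++;
        }
    }
}

Piping two stages with increasing offsets then looks like:
java DedupFilter 0 5000000 < words.txt | java DedupFilter 5000000 5000000 > out.txt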
Even in English, which has a huge number of words for a natural language, the upper estimates are only about 80,000 words. Based on that, you could just use a HashSet and add all your words to it (probably all lower-cased to avoid case issues):
Set<String> words = new HashSet<>();
try (BufferedReader in = new BufferedReader(new FileReader("words.txt"))) {
    String word;
    while ((word = in.readLine()) != null) {   // assuming one word per line
        words.add(word.toLowerCase());
    }
}
If they are real words, this isn't going to cause memory problems and will be pretty fast too!
To avoid worrying too much about the implementation, you should use a database system, either plain old relational SQL or a NoSQL solution. I'm pretty sure you could use e.g. Berkeley DB Java Edition and then do (pseudocode):
for (word : stream) {
    if (!DB.exists(word)) {
        DB.put(word);
        outstream.add(word);
    }
}
The problem is in essence easy: you need to store things on disk because there is not enough memory, then use either sorting, O(N log N) (unnecessary), or hashing, O(N), to find the unique words.
If you want a solution that will very likely work but is not guaranteed to, use an LRU-type hash table. According to the empirical Zipf's law, you should be OK.
A follow-up question to some smart guy out there: if I have a 64-bit machine and set the heap size to, say, 12 GB, shouldn't virtual memory take care of the problem (although not in an optimal way), or is Java not designed this way?
Quicksort would be a good option over Mergesort in this case because it needs less memory. This thread has a good explanation as to why.
The most performant solutions arise from omitting unnecessary work. You are looking only for duplicates, so do not store the words themselves, store hashes. But wait, you are not interested in the hashes either, only in whether they have already been seen, so do not store them. Treat the hash as a really large number, and use a bitset to record whether you have already seen that number.
So your problem boils down to a really big, sparsely populated bitmap, with the size depending on the hash width. If your hash is up to 32 bits, you can use a riak bitmap.
... gone thinking about a really big bitmap for 128+ bit hashes %) (I'll be back)
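A sketch of the 32-bit case (caveats: hash collisions silently drop distinct words, and 2^32 bits is a 512 MB array, which conflicts with the question's heap limit unless the hash is narrowed):

// Probabilistic dedup: one bit per possible 32-bit hash value.
public class HashBitmap {
    // 2^32 bits packed into 2^26 longs = 512 MB; BitSet can't index past 2^31 - 1.
    private final long[] bits = new long[1 << 26];

    // Returns true the first time this hash value is seen.
    public boolean markIfUnseen(String word) {
        long h = word.hashCode() & 0xFFFFFFFFL;      // treat the hash as unsigned
        int slot = (int) (h >>> 6);
        long mask = 1L << (h & 63);
        if ((bits[slot] & mask) != 0) return false;  // (probably) seen before
        bits[slot] |= mask;
        return true;
    }
}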

Tagging of names using lucene/java

I have the names of all the employees of my company (5,000+). I want to write an engine that can find names on the fly in online articles (blogs/wikis/help documents) and tag them with a "mailto" tag carrying the user's email.
As of now I am planning to remove all the stop words from the article and then search for each remaining word in a Lucene index. But even then I see a lot of queries hitting the index: for example, if an article has 2,000 words and only two references to people's names, there will most probably be 1,000 Lucene queries.
Is there a way to reduce these queries? Or a completely different way of achieving the same thing?
Thanks in advance
If you have only 5000 names, I would just stick them into a hash table in memory instead of bothering with Lucene. You can hash them several ways (e.g., nicknames, first-last or last-first, etc.) and still have a relatively small memory footprint and really efficient performance.
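A sketch of that multi-key hash table; the particular name variants indexed here are just illustrative:

import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class NameIndex {
    private final Map<String, String> nameToEmail = new HashMap<>();

    // Index one employee under several spellings of the name.
    public void add(String first, String last, String email) {
        nameToEmail.put(key(first + " " + last), email);   // "Jane Doe"
        nameToEmail.put(key(last + " " + first), email);   // "Doe Jane"
        nameToEmail.put(key(last + ", " + first), email);  // "Doe, Jane"
    }

    // Returns the email for a candidate name, or null if unknown.
    public String lookup(String candidate) {
        return nameToEmail.get(key(candidate));
    }

    private static String key(String s) {
        return s.toLowerCase(Locale.ROOT).trim();
    }
}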
http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
This algorithm might be of use to you. The way this would work is you first compile the entire list of names into a giant finite state machine (which would probably take a while), but then once this state machine is built, you can run it through as many documents as you want and detect names pretty efficiently.
I think it would look at every character in each document only once, so it should be much more efficient than tokenizing the document and comparing each word to a list of known names.
There are a bunch of implementations available for different languages on the web. Check it out.
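For instance, with Robert Bor's open-source ahocorasick library for Java (assuming its Trie builder API; the two names are stand-ins for the 5,000):

import org.ahocorasick.trie.Emit;
import org.ahocorasick.trie.Trie;

public class NameTagger {
    public static void main(String[] args) {
        // Build the automaton once over all the names...
        Trie trie = Trie.builder()
                .ignoreCase()
                .onlyWholeWords()     // don't match names inside longer words
                .addKeyword("Jane Doe")
                .addKeyword("John Smith")
                .build();

        // ...then run each document through it in a single pass.
        String article = "Yesterday Jane Doe presented the roadmap.";
        for (Emit emit : trie.parseText(article)) {
            System.out.println(emit.getKeyword()
                    + " at [" + emit.getStart() + ", " + emit.getEnd() + "]");
        }
    }
}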

Given a list of words - what would be a good algorithm for word completion in java? Tradeoffs: Speed/efficiency/memory footprint

I'm exploring the hardware/software requirements (the ultimate goal is a mobile Java app) for a potential free/paid application.
The application will start with this simple goal: Given a list of relevant words in a database, to be able to do word completion on a single string input.
In other words I already know the contents of the database - but the memory footprint/speed/search efficiency of the algorithm will determine the amount of data supported.
I have started at the beginning with suffix-based tree searches, but am wondering whether anyone has experience with the speed/memory-size tradeoffs of this simple approach vs. the more complex ones being talked about at the conferences.
Honestly, the initial application probably has fewer than 500 words in context, so it might not matter, but ultimately the application could expand to tens of thousands or hundreds of thousands of records, hence the question about speed vs. memory footprint.
I suppose I could start with something simple and switch over later, but I hope to understand the tradeoff earlier!
Word completion suggests that you want to find all the words that start with a given prefix.
Tries are good for this, and particularly good if you're adding or removing elements - other nodes do not need to be reallocated.
If the dictionary is fairly static and retrieval is important, consider a far simpler data structure: put your words in an ordered vector! You can do a binary search to discover a candidate starting with the correct prefix, and a linear search on either side of it to discover all the other candidates.
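A sketch of the ordered-vector approach (since all words sharing a prefix sit in one contiguous run of the sorted list, scanning forward from the insertion point is enough):

import java.util.Collections;
import java.util.List;

public class PrefixCompleter {
    private final List<String> sorted;   // must be sorted ascending

    PrefixCompleter(List<String> sortedWords) {
        this.sorted = sortedWords;
    }

    // Prints every word starting with the given prefix.
    public void complete(String prefix) {
        int pos = Collections.binarySearch(sorted, prefix);
        if (pos < 0) pos = -pos - 1;     // insertion point: first candidate >= prefix
        for (int i = pos; i < sorted.size() && sorted.get(i).startsWith(prefix); i++) {
            System.out.println(sorted.get(i));
        }
    }
}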
