Most efficient way to find unique entries in a large data set - java

Before anything, I am making it clear that this is an assignment and I do not expect fully coded answers. All I seek is advice and maybe snippets of code that help me.
So, I am reading in about 900,000 words, all stored in an ArrayList. I need to count the unique words using a sorted array (or ArrayList) in Java.
So far, I am simply looping over the given ArrayList and using
Collections.sort(words);
and Collections.binarySearch(words, wordToLook); to achieve it, like the following:
OrderedSet set = new OrderedSet();
for (String a : words) {
    if (!set.contains(a)) {
        set.add(a);
    }
}
and
public boolean contains(String word) {
    Collections.sort(uniqueWords);
    int result = Collections.binarySearch(uniqueWords, word);
    if (result < 0) {
        return false;
    } else {
        return true;
    }
}
This code has a running time of about 60 seconds, but I was wondering if there is any better way to do this, because running a sort every time an element is added seems very inefficient (but of course necessary if I were to use binary search).
Any sort of feedback would be greatly appreciated. Thanks.

So, you are required to use a sorted array. That is ok, since you are (not yet) programming in the real world.
I will suggest two alternatives:
The first uses binary search (which you are using in your current code).
I would create a class that contains two fields: the word (a String) and the count for that word (an int). You will build a sorted array of these classes.
Start with an empty array and add to it as you read each word. For each word, do a binary search for the word in the array you are building. The search will either find the entry containing the word (and you will increment the count), or you will determine that the word is not yet in the array.
When your binary search ends without finding the word, you will create a new object to hold the word+count and add it to the array in the location where your search ended (be careful to make sure that your logic really puts it in the right spot to keep your list sorted). Of course, your count is set to 1 for new words.
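If it helps, here is a minimal sketch of that insertion-point idea (WordCount is a hypothetical holder class, and words is the ArrayList from the question; when binarySearch does not find the key it returns -(insertionPoint) - 1, so the insertion point is -pos - 1):
import java.util.*;

class WordCount {
    final String word;
    int count = 1;
    WordCount(String word) { this.word = word; }
}

static List<WordCount> countWords(List<String> words) {
    List<WordCount> counts = new ArrayList<>();
    Comparator<WordCount> byWord = Comparator.comparing(wc -> wc.word);
    for (String w : words) {
        int pos = Collections.binarySearch(counts, new WordCount(w), byWord);
        if (pos >= 0) {
            counts.get(pos).count++;                 // word already present: bump its count
        } else {
            counts.add(-pos - 1, new WordCount(w));  // insert at -pos - 1 to keep the list sorted
        }
    }
    return counts;
}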
Another alternative:
Read all of your words into a list and sort it. After sorting, all duplicates will be next to each other in the list.
You will walk down this sorted list once and create a list of word+count as you go. If the next word you see is the same as the last word+count, increment the count. If it is a new word, add a new word+count to your result list with count=1.
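A sketch of this second alternative, reusing the same WordCount holder (words again being the ArrayList of 900,000 words):
Collections.sort(words);                          // duplicates are now adjacent
List<WordCount> counts = new ArrayList<>();
for (String w : words) {
    int last = counts.size() - 1;
    if (last >= 0 && counts.get(last).word.equals(w)) {
        counts.get(last).count++;                 // same word as the previous entry
    } else {
        counts.add(new WordCount(w));             // new word, count starts at 1
    }
}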

I would not use a sorted array. I would create a Map<String, Integer> where the key is your word and the value is the count of the number of occurrences of the word. As you read each word, do something like this:
Integer count = map.get(word);
if (count == null) {
    count = 0;
}
map.put(word, count + 1);
Then just iterate over the map's entry set and do whatever you need to do with the counts.
If you know, or can estimate, the number of unique words then you should use this number in the HashMap constructor (so you don't grow the map many times).
If you use a sorted array, your run time cannot be better than proportional to NlogN (where N is the number of words in your list). If you use a HashMap, you can achieve a runtime that grows linearly with N (you save yourself the factor of logN).
Another advantage of using a Map is the memory used is proportional to the number of unique words, rather than the total number of words (assuming that you build the map while reading the words, rather than reading all words into a collection and then adding them to the map).
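Putting those points together, a compact sketch (the 100,000 estimate and Java 8's Map.merge are just illustrative; the get/put version above works just as well):
int expectedUniqueWords = 100_000;                                  // whatever estimate you have
Map<String, Integer> map = new HashMap<>(expectedUniqueWords * 2);  // roomy initial capacity, avoids rehashing
for (String word : words) {
    map.merge(word, 1, Integer::sum);                               // insert 1, or add 1 to the existing count
}
int uniqueWords = map.size();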

public static int countUnique(String[] array) {
    if (array.length == 0) return 0;
    int count = 1;                                    // the first element is always "new"
    for (int i = 1; i < array.length; i++) {
        if (!array[i].equals(array[i - 1])) count++;  // a transition between groups of equal elements
    }
    return count;
}
This is an O(N) algorithm for counting the number of unique entries in a sorted array. The idea behind it is that we count the number of transitions between groups of equal elements. Then, the number of unique entries is the number of transitions plus one (for the first entry).
Hopefully you see how to apply this algorithm to your array after the elements are sorted.
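For example, applied to the question's list (words being the ArrayList of 900,000 words, with java.util.Arrays imported):
String[] sorted = words.toArray(new String[0]);
Arrays.sort(sorted);                              // O(N log N)
int unique = countUnique(sorted);                 // the scan itself is O(N)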

You could always use a Comparator to get the unique values, for example with a TreeSet (an ArrayList does not take a Comparator):
Set<String> newList = new TreeSet<>(new Comparator<String>() {
    @Override
    public int compare(String o1, String o2) {
        return o1.compareToIgnoreCase(o2);   // 0 when equal ignoring case, so duplicates collapse
    }
});
newList.addAll(words);
Now count:
words.size() - newList.size() = number of repeated values.
Hope this helps!

First non-repeating character in a stream

My answer to this question is as follows, but I want to know whether I can use this code and what its complexity will be:
import java.util.LinkedHashMap;
import java.util.Map.Entry;

public class FirstNonRepeatingCharacterinAString {

    private char firstNonRepeatingCharacter(String str) {
        LinkedHashMap<Character, Integer> hash =
                new LinkedHashMap<Character, Integer>();
        for (int i = 0; i < str.length(); i++) {
            if (hash.get(str.charAt(i)) == null)
                hash.put(str.charAt(i), 1);
            else
                hash.put(str.charAt(i), hash.get(str.charAt(i)) + 1);
        }
        System.out.println(hash.toString());
        for (Entry<Character, Integer> c : hash.entrySet()) {
            if (c.getValue() == 1)
                return c.getKey();
        }
        return 0;
    }

    public static void main(String args[]) {
        String str = "geeksforgeeks";
        FirstNonRepeatingCharacterinAString obj =
                new FirstNonRepeatingCharacterinAString();
        char c = obj.firstNonRepeatingCharacter(str);
        System.out.println(c);
    }
}
Your question about whether you "can use this code" is a little ambiguous - if you wrote it, I'd think you can use it :)
As for the complexity, it is O(n) where n is the number of characters in the String. To count the number of occurrences, you must iterate over the entire String, plus iterate over them again to find the first one with a count of 1. In the worst case, you have no non-repeating characters, or the only non-repeating character is the last one. In either case, you have to iterate over the whole String once more. So it's O(n+n) = O(n).
EDIT
One note about your code, by the way. Because you are using an insertion-ordered LinkedHashMap, re-inserting an existing key with put(Character, Integer) does not change its position, so the entry set iterates in the order the characters first appear and your logic is correct. What you can improve is the repeated boxing and double lookups: store a mutable counter, for example a LinkedHashMap<Character, int[]>, and when the key already exists simply increment the value inside the int[] instead of calling get and put again.
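A minimal sketch of that variant, as the body of firstNonRepeatingCharacter (the int[] is just a cheap mutable counter):
LinkedHashMap<Character, int[]> counts = new LinkedHashMap<Character, int[]>();
for (int i = 0; i < str.length(); i++) {
    char ch = str.charAt(i);
    int[] counter = counts.get(ch);
    if (counter == null) {
        counts.put(ch, new int[] { 1 });   // first occurrence
    } else {
        counter[0]++;                      // bump in place, no second put needed
    }
}
for (Entry<Character, int[]> e : counts.entrySet()) {
    if (e.getValue()[0] == 1) {
        return e.getKey();
    }
}
return 0;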

Looping through an ArrayList with another Arraylist in Java

I have a large array list of sentences and another array list of words.
My program loops through the array list and removes an element from that array list if the sentence contains any of the words from the other.
The sentences array list can be very large and I coded a quick and dirty nested for loop. While this works when there are not many sentences, in cases where there are, the time it takes to finish this operation is ridiculously long.
for (int i = 0; i < SENTENCES.size(); i++) {
    for (int k = 0; k < WORDS.size(); k++) {
        if (SENTENCES.get(i).contains(" " + WORDS.get(k) + " ") == true) {
            //Do something
        }
    }
}
Is there a more efficient way of doing this then a nested for loop?
There are a few inefficiencies in your code, but at the end of the day, if you've got to search for sentences containing words then there's no getting away from loops.
That said, there are couple of things to try.
First, make WORDS a HashSet; its contains method will be far quicker than an ArrayList's because it does a hash look-up instead of a linear scan.
Second, switch the logic about a bit like this:
Iterator<String> sentenceIterator = SENTENCES.iterator();
sentenceLoop:
while (sentenceIterator.hasNext())
{
    String sentence = sentenceIterator.next();
    for (String word : sentence.replaceAll("\\p{P}", " ").toLowerCase().split("\\s+"))
    {
        if (WORDS.contains(word))
        {
            sentenceIterator.remove();
            continue sentenceLoop;
        }
    }
}
This code (which assumes you're trying to remove sentences that contain certain words) uses Iterators and avoids the string concatenation and parsing logic you had in your original code (replacing it with a single regex) both of which should be quicker.
But bear in mind, as with all things performance you'll need to test these changes to see they improve the situation.
What you must change is the way you handle the removal of the data. This is noted by this part of the explanation of your problem:
The sentences array list can be very large (...). While this works when there are not many sentences, in cases where there are, the time it takes to finish this operation is ridiculously long.
The cause of this is that removal time in ArrayList takes O(N), and since you're doing this inside a loop, then it will take at least O(N^2).
I recommend using LinkedList rather than ArrayList to store the sentences, and use Iterator rather than your naive List#get since it already offers Iterator#remove in time O(1) for LinkedList.
In case you cannot change the design to LinkedList, I recommend storing the sentences that are valid in a new List, and in the end replace the contents of your original List with this new List, thus saving lot of time.
Apart from this big improvement, you can improve the algorithm even more by using a Set to store the words to lookup rather than using another List since the lookup in a Set is O(1).
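A short sketch combining those last two suggestions, i.e. collecting the surviving sentences into a new list and looking words up in a Set (names are just illustrative):
Set<String> wordSet = new HashSet<>(WORDS);           // O(1) average lookups
List<String> kept = new ArrayList<>(SENTENCES.size());
for (String sentence : SENTENCES) {
    boolean hit = false;
    for (String token : sentence.split("\\s+")) {
        if (wordSet.contains(token)) {
            hit = true;
            break;
        }
    }
    if (!hit) {
        kept.add(sentence);                           // keep only sentences with none of the words
    }
}
SENTENCES.clear();
SENTENCES.addAll(kept);                               // replace the original list's contents in one go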
What you could do is put all your words into a HashSet. This allows you to check if a word is in the set very quickly. See https://docs.oracle.com/javase/8/docs/api/java/util/HashSet.html for documentation.
HashSet<String> wordSet = new HashSet<String>();
for (String word : WORDS) {
    wordSet.add(word);
}
Then it's just a matter of splitting each sentence into the words that make it up, and checking if any of those words are in the set.
for (String sentence : SENTENCES) {
    // You probably want to use a regex here instead of just splitting on a " ",
    // but this is just an example.
    String[] sentenceWords = sentence.split(" ");
    for (String word : sentenceWords) {
        if (wordSet.contains(word)) {
            // The sentence contains one of the special words.
            // DO SOMETHING
            break;
        }
    }
}
I will create a set of words from the second ArrayList:
Set<String> listOfWords = new HashSet<String>();
listOfWords.add("one");
listOfWords.add("two");
I will then iterate over the set and the first ArrayList and use Contains:
for (String word : listOfWords) {
    for (String sentence : Sentences) {
        if (sentence.contains(word)) {
            // do something
        }
    }
}
Also, if you are free to use any open source jar, check this out:
searching string in another string
First, your program has a bug: it would not count words at the beginning and at the end of a sentence.
Your current program has runtime complexity of O(s*w), where s is the length, in characters, of all sentences, and w is the length of all words, also in characters.
If words is relatively small (a few hundred items or so) you could use regex to speed things up considerably: construct a pattern like this, and use it in a loop:
StringBuilder regex = new StringBuilder();
boolean first = true;
// Let's say WORDS={"quick", "brown", "fox"}
regex.append("\\b(?:");
for (String w : WORDS) {
    if (!first) {
        regex.append('|');
    } else {
        first = false;
    }
    regex.append(w);
}
regex.append(")\\b");
// Now regex is "\b(?:quick|brown|fox)\b", i.e. your list of words
// separated by OR signs, enclosed in non-capturing groups
// anchored to word boundaries by '\b's on both sides.
Pattern p = Pattern.compile(regex.toString());
for (int i = 0; i < SENTENCES.size(); i++) {
    if (p.matcher(SENTENCES.get(i)).find()) {
        // Do something
    }
}
Since regex gets pre-compiled into a structure more suitable for fast searches, your program would run in O(s*max(w)), where s is the length, in characters, of all sentences, and w is the length of the longest word. Given that the number of words in your collection is about 200 or 300, this could give you an order of magnitude decrease in running time.
If you have enough memory you can tokenize the SENTENCES into sets of words and check those sets; that would perform better and also be more correct than the current implementation.
Well, looking at your code I would suggest two things that will improve the performance from each iteration:
Remove " == true". The contains operation already returns a boolean, so it is enough for the if, comparing it with true adds one extra operation for each iteration that is not needed.
Do not concatenate Strings inside a loop (" " + WORDS.get(k) + " ") as it is a quite expensive operation because + operator creates new objects. Better use a string buffer / builder and clear it after each iteration with stringBuffer.setLength(0);.
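As a variation on this second point, you could also build the padded search strings once, outside the loops (just a sketch using the names from the question):
List<String> paddedWords = new ArrayList<>(WORDS.size());
for (String w : WORDS) {
    paddedWords.add(" " + w + " ");                   // concatenate once per word, not once per sentence/word pair
}
for (String sentence : SENTENCES) {
    for (String padded : paddedWords) {
        if (sentence.contains(padded)) {
            // Do something
            break;
        }
    }
}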
Besides that, for this case I do not know any other approach; maybe you can use regular expressions, if you can abstract a pattern out of the words you want to remove, and then have only one loop.
Hope it helps!
If you are concerned about efficiency, I think the most effective way to do this is Aho-Corasick's algorithm. You currently have two nested loops plus a contains() method (which at best takes time proportional to the sentence length plus the word length), whereas Aho-Corasick needs only a single pass over each sentence, checking all the words at once in time proportional to the sentence length (plus a relatively small preprocessing step to build the finite state machine).
I'll approach this from a more theoretical view. If you don't have a memory limitation, you can try to mimic the logic of counting sort.
Say M1 = sentences.size, M2 = the number of words per sentence, and N = words.size.
Assume all sentences have the same number of words, just for simplicity.
Your current approach's complexity is O(M1.M2.N).
We can create a mapping from words to their positions in the sentences.
Loop through your ArrayList of sentences and change it into a two-dimensional jagged array of words. Loop through the new array and create a HashMap whose key is a word and whose value is an ArrayList of word positions (say with length X). That's O(2M1.M2.X) = O(M1.M2.X).
Then loop through your words ArrayList, access your word HashMap, and loop through the list of word positions, removing each one. That's O(N.X).
Say you need to give the result as an ArrayList of Strings; we need another loop to concatenate everything. That's O(M1.M2).
Total complexity is O(M1.M2.X) + O(N.X) + O(M1.M2).
Assuming X is way smaller than N, you'll probably get better performance.
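A rough sketch of the mapping idea (hypothetical names; here the recorded positions are sentence indices, which is enough for deciding what to remove):
// Map each word to the indices of the sentences that contain it.
Map<String, List<Integer>> wordToSentences = new HashMap<>();
for (int i = 0; i < SENTENCES.size(); i++) {
    for (String token : SENTENCES.get(i).split("\\s+")) {
        wordToSentences.computeIfAbsent(token, k -> new ArrayList<>()).add(i);
    }
}
// Mark every sentence that contains one of the search words.
boolean[] toRemove = new boolean[SENTENCES.size()];
for (String word : WORDS) {
    for (int idx : wordToSentences.getOrDefault(word, Collections.emptyList())) {
        toRemove[idx] = true;
    }
}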

Java - Search performantly for subset of String in String list

I want to search through a list of Strings and return the values which contain the search string.
The list could look like this (it can have up to 1000 entries), although it is not guaranteed that an entry is always letters followed by a digit; it could be digits only, words only, or both mixed up:
entry 1
entry 2
entry 3
entry 4
test 1
test 2
test 3
tst 4
If the user does search for 1, these should be returned:
entry 1
test 1
The situation is that the user has a search bar and can enter a search string. This search string is used to search through the list.
How can this be done performantly?
Currently, I have got:
for (String s : strings) {
    if (s.contains(searchedText)) result.add(s);
}
It is O(N) and really slow. Especially if the user types many characters at a time.
Maybe I don't understand your question, but as you know, in Java String objects are immutable, yet they can also be viewed as a collection (array) of chars. So one thing you can do is perform the search with better algorithms such as binary search, Aho-Corasick, Rabin–Karp, Boyer–Moore string search, StringSearch, or one of these. You may also consider abstract data types with better performance (hashing, trees, etc.).
This is very simple if you use streams:
final List<String> items = Arrays.asList("entry 1", "entry 2", "entry 3", "test 1", "test 2", "test 3");
final String searchString = "1";
final List<String> results = items.parallelStream()     // work in parallel
        .filter(s -> s.contains(searchString))          // pick out items that match
        .collect(Collectors.toList());                  // and turn those into a result list
results.forEach(System.out::println);
Notice the parallelStream() which will cause the list to be filtered and traversed using all available CPUs.
In your case you can reuse the results when the user expands the search term (while typing) to reduce the number of items that need to be filtered, because every item that matches 'se' also matches 's', so the matches for 'se' are a sub-list of the previous result.
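A sketch of that refinement idea (hypothetical helper, assuming the usual java.util and java.util.stream imports; previousResults is whatever the last keystroke's search returned):
static List<String> refine(List<String> previousResults, String newSearchString) {
    return previousResults.parallelStream()
            .filter(s -> s.contains(newSearchString))   // only re-check the previous matches
            .collect(Collectors.toList());
}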
If you don't use any additional structures, you cannot do better than looking through all of your data, which takes O(N).
If you can do some preparation, like building a text index, you can improve the search performance. General information: http://en.wikipedia.org/wiki/Full_text_search. If you can make some assumptions about your data (for example that the last symbol is a number and you are only going to search by it), it'll be easy to create such an index.
Depending on the upper limit of the number in the string and if you have no concerns about space, use an Array of ArrayLists where the array index is the number of the string:
@SuppressWarnings("unchecked")
ArrayList<String>[] data = new ArrayList[1000];   // generic array creation requires the raw type here
for (int i = 0; i < 1000; i++)
    data[i] = new ArrayList<String>();
//inserting data
int num = Integer.parseInt(datastring.substring(datastring.length() - 1));
data[num].add(datastring);
//getting all data that has a 1
for (String s : data[1])
    result.add(s);
Using a HashMap would overwrite previously mapped values when you try to put new values into it:
i.e. if 1 maps to "entry" and you then try to map 1 to "test", "entry" would get replaced with "test".
As another idea, you could just keep a count of the number of strings with each number, so when you're searching, you know how many to look for, so as soon as you find all of them, you stop searching:
int[] str_count = new int[1000];
for (int i = 0; i < 1000; i++)
    str_count[i] = 0;
//when storing data into the list:
int num = Integer.parseInt(datastring.substring(datastring.length() - 1));
str_count[num]++;
//when searching the list for 1s:
int count = str_count[1];
for (String s : strings) {
    if (s.contains(searchedText))
        result.add(s);
    if (result.size() == count)
        break;
}
While the first idea would be much faster, it would take up more space. The second idea takes up less space, but in the worst case it still searches in O(N).

Fastest way to find String in List in Java

I want to use the fastest possible method to match a String with a String in a List.
I'm iterating through a list to match product names and set the price for each product.
I'm trying to match each of 400,000 items by name against another list, where I can find the price; that list also contains 400,000+ items.
Doing a contains() on a String to match 400,000 items 400,000 times takes a long time to finish.
I also tried startsWith(), since I don't search by substring; I'm using the whole String because there is for sure a full match in the second list.
There just has to be a faster way to find a match in the inner for loop to get the price?
ProductData t = null;
for (int i = 0; i < ParseCSV.products.size(); i++) {                  // List of 400K+ items
    t = ParseCSV.products.get(i);
    for (int j = 0; j < ParseCSVprice.productPrice.size(); j++) {      // another List of 400K+ items
        if (ParseCSVprice.productPrice.get(j).getpairID()
                .contains(t.pairID)) {
            t.price = ParseCSVprice.productPrice.get(j).getPrice();
        }
    }
}
You probably need to use another structure, possibly a HashMap or a HashSet.
There is no significantly faster way using a List; searching a List is O(N).
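For example, a sketch of the HashMap idea, reusing the names from your snippet (ProductPrice and the Double price type are assumptions, and this assumes the pair IDs match exactly rather than as substrings):
// Build the lookup table once: pairID -> price.
Map<String, Double> priceById = new HashMap<>();
for (ProductPrice p : ParseCSVprice.productPrice) {
    priceById.put(p.getpairID(), p.getPrice());
}
// Then a single pass over the products; each lookup is O(1) on average.
for (ProductData t : ParseCSV.products) {
    Double price = priceById.get(t.pairID);
    if (price != null) {
        t.price = price;
    }
}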
If you only expect one or zero matches, you can increase the speed of your code in some cases by stopping your loops using the break keyword after you have found the match.
Also you might consider changing your id fields to contain numeric values which would be faster to compare than strings.
Because you have to call a method on each object in the List in order to make the comparison, there isn't much else you can do to speed this up.

Array of linked lists of arrays for hash table

So I am creating a Hash Table that uses an Array of Linked Lists of Arrays. Let me take a second to explain why this is.
So I have previously implemented Hash Tables by creating an Array, where each element of the array is a Linked List. This way I could quickly look up a LL of 450,000 elements by searching for the hash value first in the array and then searching the elements of that LL. I should add that this is a project for school and I cannot just use the Hash Tables that come with Java.
Now I want to do something similar... but now I have a massive LL of Arrays that I need to search. Here each element of the LL is a line of a text file, represented by a 4-element array, where each of the 4 elements is a different string that was tab-delimited in the input file. I need to be able to quickly access the 2nd, 3rd, and 4th string that was located in each line, and that is now an element of this array.
So what I want is to be able to create an Array of LL of Arrays... first I will find the sum of the ASCII values of the second element of an array. Then I will hash the entire array using this value into my Hash Table. Then when I later need to find this element, I will go to the corresponding element of the array, where I have a list of arrays. I will then search for the 2nd value of each array in the list. If I find the one I want, then I return that array and use the 3rd and 4th elements of it.
As I said, I have this working fine for an Array of LL, but adding the extra dimension of Arrays inside has thrown me off completely. I think it is mostly just a matter of figuring out syntax, since I have successfully initialized an Array of LL of Arrays (public static LinkedList[] RdHashLL), so it appears that Java is okay with this in principle. However, I have no idea how to put elements into the Hash Table and how to read them out.
Below is my code for an ARRAY OF LINKED LISTS that works FINE. I just need help getting it to work for an ARRAY OF LL OF ARRAYS!
import java.util.LinkedList;

public class TableOfHash {

    public static LinkedList<String>[] HashLL;
    public static int wordsfound = 0;   // counter used by isWord (declaration added; it was missing)

    //HASH FUNCTION - Finds sum of ascii values for string
    public static int charSum(String s) {
        int hashVal = 0;
        int size = 1019; //Prime number around the size of 8 chars of 'z'
                         //(8 chars is among the largest consistently in the dictionary)
        for (int i = 0; i < s.length(); i++) {
            hashVal += s.charAt(i);
        }
        return hashVal % size;
    }

    //CREATE EMPTY HASH TABLE - Creates an array of LL
    public static void makeHash() {
        HashLL = new LinkedList[1019];
        for (int i = 0; i < HashLL.length; i++) {
            HashLL[i] = new LinkedList<String>();
        }
    }

    //HASH VALUES INTO TABLE!
    public static void dictionary2Hash(LinkedList<String> Dict) {
        for (String s : Dict) {
            HashLL[charSum(s)].add(s);
            //Finds the sum of the char values of dictionary element i,
            //and then adds the word at i to the HashLL at the point defined
            //by the char sum.
        }
        //Print out part of Hash Table (for testing! for SCIENCE!)
        //System.out.println("HASH TABLE::");
        //printHashTab();
    }

    //SEARCH HashTable for input word, return true if found
    public boolean isWord(String s) {
        if (HashLL[charSum(s)].contains(s)) {
            wordsfound++;
            return true;
        }
        return false;
    }
}
I have made some attempts to change this, but for things like if(HashLL[charSum(s)].contains(s)), which searches the LL at the element returned by charSum(s)... I have no idea how to get it to work when it is a LL of Arrays and not of Strings. I have tried HashLL[charSum(s)].[1].contains(s), and HashLL[charSum(s)][1].contains(s), and various other things.
The fact that a Google search for "Array of Linked Lists of Arrays" (with quotes) turns up empty has not helped.
Last bit: I realize there might be another data structure that would do what I want, but unless you believe that an Array of LL of Arrays is a totally hopeless cause, I'd like to get it to work as is.
If you have
LinkedList<String[]>[] hashLL;
you can read a specific String like this (one of many ways)
String str = hashLL[outerArrayIndex].get(listIndex)[innerArrayIndex];
To write into the fields, this is possible (assuming everything is initialized correctly).
String[] arr = hashLL[outerArrayIndex].get(listIndex);
arr[index] = "value";
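And sticking with your existing structure, here is a rough sketch of the insert and lookup methods (the RdHashLL field name and charSum come from your code; everything else is an assumption): hash on the array's second element (index 1) and return the whole array, so the caller can then use its 3rd and 4th elements.
public static LinkedList<String[]>[] RdHashLL;

public static void makeRdHash() {
    RdHashLL = new LinkedList[1019];
    for (int i = 0; i < RdHashLL.length; i++) {
        RdHashLL[i] = new LinkedList<String[]>();
    }
}

//Store a 4-element line array, hashed on its second field.
public static void addLine(String[] line) {
    RdHashLL[charSum(line[1])].add(line);
}

//Look up by the second field and return the whole array (or null if not present).
public static String[] findLine(String s) {
    for (String[] line : RdHashLL[charSum(s)]) {
        if (line[1].equals(s)) {
            return line;   //caller can then read line[2] and line[3]
        }
    }
    return null;
}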
