How to count different elements in Vector using java? - java

I have a lot of words at hand. What I need to do is save them and count how often each distinct word occurs. The original data may contain duplicate words. At first I wanted to use a Set, since that guarantees I only get the distinct words. But how can I count their occurrences? Does anyone have a "clever" idea?

You can use Multiset from the Guava library.
http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Multiset.html

You can use Map to solve this problem.
String sample = " I have a problem here. I have a lot of words at hand. What I need to do is to save them and count every different word. The original data may contains duplicate words.Firstly, I want to use Set, then I can guarantee that I only get the different wrods. But how can I count their times? Is there someone having any clever idea?";
String[] array = sample.split("[\\s\\.,\\?]");
// Needs java.util.Map and java.util.HashMap.
Map<String, Integer> statistic = new HashMap<String, Integer>();
for (String elem : array) {
    String trimElem = elem.trim();
    // Skip the empty tokens produced by consecutive delimiters.
    if (!"".equals(trimElem)) {
        Integer count = 0;
        if (statistic.containsKey(trimElem)) {
            count = statistic.get(trimElem);
        }
        count++;
        statistic.put(trimElem, count);
    }
}

Maybe you can use a hash table; in Java that's HashMap (a HashSet alone cannot store the counts).
Hash every word, and if that word has already been hashed, increment the value associated with it by one. That is the idea.
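A minimal sketch of that idea, assuming Java 8 or later for Map.merge. The class name WordCounter and the delimiter pattern are illustrative choices, not something taken from the answers above.

```java
import java.util.HashMap;
import java.util.Map;

class WordCounter {
    // Counts how often each distinct word occurs in the given text.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Split on runs of whitespace and basic punctuation (an assumed tokenization).
        for (String word : text.split("[\\s.,?!]+")) {
            if (!word.isEmpty()) {
                // merge() stores 1 for a new word, otherwise adds 1 to the stored count.
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("to be or not to be");
        System.out.println(counts.get("to"));  // 2
        System.out.println(counts.size());     // 4
    }
}
```

The keys of the map are the distinct words, so the map replaces the Set from the question as well as providing the counts.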

Related

Efficient way for checking if a string is present in an array of strings [duplicate]

I'm working on a little project in java, and I want to make my algorithm more efficient.
What I'm trying to do is check if a given string is present in an array of strings.
The thing is, I know a few ways to check if a string is present in an array of strings, but the array I am working with is pretty big (around 90,000 strings) and I am looking for a way to make the search more efficient, and the only ways I know are linear search based, which is not good for an array of this magnitude.
Edit: I tried implementing the advice I was given, but the code I wrote is not working properly. I would love to hear your thoughts.
// Note: binary search only works if strArr is sorted in String.compareTo order.
public static int binaryStringSearch(String[] strArr, String str) {
    int low = 0;
    int high = strArr.length - 1;
    int result = -1;
    while (low <= high) {
        int mid = (low + high) / 2;
        if (strArr[mid].equals(str)) {
            result = mid;
            return result;
        } else if (strArr[mid].compareTo(str) < 0) {
            low = mid + 1;
        } else {
            high = mid - 1;
        }
    }
    return result;
}
Basically what it's supposed to do is return the index at which the string is present in the array, and if it is not in the array then return -1.
So you have a more or less fixed array of strings and then you throw a string at the code and it should tell you if the string you gave it is in the array, do I get that right?
So if your array pretty much never changes, it should be possible to sort it alphabetically and then use binary search. Tom Scott did a good video on that (if you don't want to read a long, messy text written by someone who isn't a native English speaker, just watch that; it's all you need). You look right in the middle and check: does the string you are searching for come before or after the string you just read? If it is exactly the one you want, you can stop. If it isn't, you can eliminate every string after the middle one when your string comes before it, or every string before the middle one when your string comes after it. Of course, you also eliminate the middle string itself, since it's not equal. Then you do it all over again: check the string in the middle of the ones that are left and eliminate based on the result. (By the way, you don't have to actually delete array items; it's enough to keep a variable for the lower and upper boundary, because you never remove elements from the middle.) You repeat that until no strings are left, at which point you can be sure that your input isn't in the array.
This means that by comparing against one string you don't eliminate just 1 item, as you would when checking one after the other; you eliminate at least half of what is left. So with a list of 256 it should only take 8 compares (or 9, not quite sure, but I think it takes one more if the item might not exist), and for 65k (which almost matches your number) it takes 16. Your 90,000 strings need at most 17. That's a lot more optimised.
If the array is not already sorted and you can't sort it, because that would take too long or for some other reason, then I don't think there is a way to make it faster: you have to check the strings one by one.
Hope that helped!
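A short sketch of the sort-once, search-many approach described above, using the standard library's Arrays.sort and Arrays.binarySearch instead of a hand-written loop. The class name SortedLookup is made up for illustration, and it assumes the array does not change after construction.

```java
import java.util.Arrays;

class SortedLookup {
    private final String[] sorted;

    SortedLookup(String[] words) {
        // Copy so the caller's array is untouched, then sort once up front.
        this.sorted = words.clone();
        Arrays.sort(this.sorted);
    }

    boolean contains(String word) {
        // binarySearch returns a non-negative index only when the key is found.
        return Arrays.binarySearch(sorted, word) >= 0;
    }

    public static void main(String[] args) {
        SortedLookup lookup = new SortedLookup(new String[] {"pear", "apple", "fig"});
        System.out.println(lookup.contains("fig"));    // true
        System.out.println(lookup.contains("grape"));  // false
    }
}
```

A hand-written binary search such as the one in the question depends on the same precondition: the array has to be sorted in the order used by compareTo before the first lookup.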
Edit: If you don't want to really sort all the items and just want to make it a bit faster (26 times, if first letters were evenly distributed), make 26 arrays, one per letter (in case you only use normal letters; otherwise make more and the speed boost will increase too), then loop through all strings and put each one into the array matching its first letter. That is much faster than sorting them properly, but it's a trade-off, since it's not as neat as binary search. You still pretty much use linear search (looping through all of them and checking if they match), but you have already roughly ordered the items. You can imagine it like two ways of sorting a bunch of cards on a table so you can find them quicker, the lazy one and the not so lazy one. Let's say the cards are numbered from 1 to 100, but not continuously; some cards are missing. One way would be to sort all the cards by number, but sorting them nicely so you can find any card really quickly takes some time. What you can do instead is make 10 rows of cards, one per tens digit, and put the cards in each row in random order. When someone wants card 38, you go to the row for the thirties and search linearly through it. That is much faster than having the cards lie randomly on the table, because you only have to search through a tenth of them, but you can't take shortcuts once you're in that row.
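The first-letter bucketing could look roughly like this. It is a sketch only: the class name FirstLetterBuckets is hypothetical, and a map keyed by the first character stands in for the 26 arrays so that characters other than plain letters also get a bucket.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FirstLetterBuckets {
    // One unsorted list of words per first character.
    private final Map<Character, List<String>> buckets = new HashMap<>();

    FirstLetterBuckets(String[] words) {
        for (String w : words) {
            if (w.isEmpty()) {
                continue; // an empty string has no first letter
            }
            buckets.computeIfAbsent(w.charAt(0), k -> new ArrayList<>()).add(w);
        }
    }

    boolean contains(String word) {
        if (word.isEmpty()) {
            return false;
        }
        List<String> bucket = buckets.get(word.charAt(0));
        // Still a linear search, but only through the words sharing the first letter.
        return bucket != null && bucket.contains(word);
    }

    public static void main(String[] args) {
        FirstLetterBuckets b = new FirstLetterBuckets(new String[] {"apple", "avocado", "banana"});
        System.out.println(b.contains("avocado")); // true
        System.out.println(b.contains("cherry"));  // false
    }
}
```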
Depending on the requirements, there are many ways to deal with it. It's better to use a collection class for the rich API available out of the box.
Are the strings supposed to be unique i.e. the duplicate strings need to be discarded automatically and the insertion order does not matter: Use Set<String> set = new HashSet<>() and then you can use Set#contains to check the presence of a particular string.
Are the strings supposed to be unique i.e. the duplicate strings need to be discarded automatically and also the insertion order needs to be preserved: Use Set<String> set = new LinkedHashSet<>() and then you can use Set#contains to check the presence of a particular string.
Can the list contain duplicate strings? If yes, you can use a List<String> list = new ArrayList<>() to benefit from its rich API as well as get rid of the limitation of fixing the size beforehand (note: the maximum number of elements is Integer.MAX_VALUE). However, a List is always searched sequentially. Despite this limitation (or feature), you can gain some efficiency by sorting the list (again, subject to your requirements). Check Why is processing a sorted array faster than processing an unsorted array? to learn more about it.
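As a rough illustration of the Set-based options, the sketch below loads the array into a HashSet once and then answers membership queries with Set#contains. The class and method names are invented for the example; swapping in LinkedHashSet would keep insertion order.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SetLookup {
    // Builds the set once; duplicates in the input are dropped automatically.
    static Set<String> index(String[] words) {
        return new HashSet<>(Arrays.asList(words));
    }

    public static void main(String[] args) {
        Set<String> set = index(new String[] {"pear", "apple", "pear"});
        System.out.println(set.size());            // 2
        System.out.println(set.contains("apple")); // true
        System.out.println(set.contains("fig"));   // false
    }
}
```

Each contains call on a HashSet takes expected constant time, so the size of the array (90,000 strings in the question) matters mostly for the one-off cost of building the set.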
You could use a HashMap which stores all the strings if
Contains query is very frequent and lookup strings do not change frequently.
Memory is not a problem (:D) .

Split an array of common English words into separate lists/arrays based on word length in Java

I'm trying to search an array of common English words to see if a specific word is contained in it, based on a text file. Since this array has >700,000 words and around 1000 words need to be checked if in the array multiple times, I thought it would be more efficient to separate the words into separate arrays or lists based on length. Is there an easy way to do this without using a switch or lots of if statements? Like so:
for (int i = 0; i < commonWordArray.length; i++) {
    if (commonWordArray[i].length() == 2) {
        twoLetterList.add(commonWordArray[i]);
    } else if (commonWordArray[i].length() == 3) {
        threeLetterList.add(commonWordArray[i]);
    } else if (commonWordArray[i].length() == 4) {
        fourLetterList.add(commonWordArray[i]);
    }
    ...etc
}
Then doing the same thing when checking the words:
for (int i = 0; i < checkWords.length; i++) {
    if (checkWords[i].length() == 2) {
        if (twoLetterList.contains(checkWords[i])) {
            ...etc
        }
    }
}
Step 1
Create word buckets.
ArrayList<ArrayList<String>> buckets = new ArrayList<>();
// Bucket i holds the words of length i, so indexes 0..maxWordLength are needed.
for (int i = 0; i <= maxWordLength; i++) {
    buckets.add(new ArrayList<String>());
}
Step 2
Add words to your buckets.
buckets.get(word.length()).add(word);
This approach has the downside that some of your buckets may go unused. This is not an issue if you are only filtering common English words, as they do not exceed 30 characters in length. Creating 10-15 extra lists is a trivial overhead for a computer. The longest uncommon but non-technical word is 183 characters. Technical words exceed 180,000 characters, by which point this approach is clearly not practical.
The upside of this approach is that ArrayList.get() and ArrayList.add() both run in constant (O(1)) time.
Use a List<Set<String>> sets. That is, given a String word, first find the proper set (Set<String> set = sets.get(word.length())), creating the set and extending the list if needed. Then just do set.add(word). Done!
Edit/Hint: a (good) programmer should be lazy - if you need to do/write the same thing twice, you're doing something wrong.
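The List<Set<String>> suggestion might be sketched like this, with the list growing on demand so no maximum word length has to be known in advance. LengthIndex and its method names are placeholders, and Java 8 or later is assumed.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LengthIndex {
    // sets.get(n) holds all words of length n.
    private final List<Set<String>> sets = new ArrayList<>();

    void add(String word) {
        // Extend the list until there is a slot for this word length.
        while (sets.size() <= word.length()) {
            sets.add(new HashSet<>());
        }
        sets.get(word.length()).add(word);
    }

    boolean contains(String word) {
        // A length beyond the list means no word of that length was ever added.
        return word.length() < sets.size() && sets.get(word.length()).contains(word);
    }

    public static void main(String[] args) {
        LengthIndex index = new LengthIndex();
        index.add("at");
        index.add("tree");
        System.out.println(index.contains("tree"));    // true
        System.out.println(index.contains("forests")); // false
    }
}
```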
Assuming you've got memory for it (which your current approach relies on), why not just a single Set<String>? Simpler, faster.
If you want to use multiple strings to search, you may want to try something like the Aho Corasick algorithm.
Alternatively, you may want to turn the problem around and check whether a string from the 700k array is in the 1k collection. That way you won't have memory issues (imho), and you can do it with a simple dictionary (balanced tree), so you'd have about 700k × log2(1000) comparisons.
Use a Trie, which is a memory-efficient storage mechanism which excels at storing words and checking for whether they exist or not.
Implementing one on your own is a fun exercise, or look at existing implementations.
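For readers who want a starting point, here is a bare-bones trie. It is only a sketch: the HashMap per node keeps the code short but is not tuned for memory, and the class and method names are arbitrary.

```java
import java.util.HashMap;
import java.util.Map;

class Trie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true if a stored word ends at this node
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            // Follow the edge for c, creating the child node if it is missing.
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    boolean contains(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false; // no stored word has this prefix
            }
        }
        return node.isWord;
    }

    public static void main(String[] args) {
        Trie trie = new Trie();
        trie.insert("car");
        trie.insert("cart");
        System.out.println(trie.contains("car")); // true
        System.out.println(trie.contains("ca"));  // false
    }
}
```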

Efficiently checking for substrings and replacing them - can I improve performance here?

I need to examine millions of strings for abbreviations and replace them with the full version. Due to the data, only abbreviations terminated by a comma should be replaced. Strings can contain multiple abbreviations.
I have a lookup table that contains Abbreviation->Fullversion pairs, it contains about 600 pairs.
My current setup looks something like this. On startup I create a list of ShortForm instances from a csv file using Jackson and hold them in a singleton:
public static class ShortForm {
    public String fullword;
    public String abbreviation;
}
List<ShortForm> shortForms = new ArrayList<ShortForm>();
// csv code omitted
And some code that uses the list
for (ShortForm f : shortForms) {
    if (address.contains(f.abbreviation + ","))
        address = address.replace(f.abbreviation + ",", f.fullword + ",");
}
Now this works, but it's slow. Is there a way I can speed it up? The first step is to load the ShortForm objects with commas in place, but what else could I do?
====== UPDATE
Changed code to work the other way around. Splits strings into words and checks a set to see if the string is an abbreviation.
StringBuilder fullFormed = new StringBuilder();
for (String s : Splitter.on(" ").split(add)) {
    if (shortFormMap.containsKey(s))
        fullFormed.append(shortFormMap.get(s));
    else
        fullFormed.append(s);
    fullFormed.append(" ");
}
return fullFormed.toString().trim();
Testing shows this to be over 13x faster than the original approach. Cheers davecom!
It would already be a bit faster if you skipped the contains() part :)
What could really improve performance would be to use a better data structure than a simple array for storing your ShortForms. All of the shortForms could be stored sorted alphabetically by abbreviation. You could therefore reduce the lookup time from O(N) to something looking more like a binary search.
I haven't used it before, but perhaps the standard library's SortedMap fits the bill instead of using a custom object at all:
http://docs.oracle.com/javase/7/docs/api/java/util/SortedMap.html
Here's what I'm thinking:
Put abbreviation/full word pairs into TreeMap
Tokenize the address into words.
Check each word to see if it is a key in the TreeMap
Replace it if it is
Put the corrected tokens back together as an address
I think I'd do this with a HashMap. The key would be the abbreviation and the value would be the full term. Then just search through a string for a comma and see if the text that precedes the comma is in the dictionary. You could probably map all the replacements in a single string in one pass and then make all the replacements after that.
This makes each lookup O(1) for a total of O(n) lookups where n is the number of abbreviations found and I don't think there's likely a more efficient method.
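One way the HashMap idea could look in code. This is a sketch under assumptions: words are separated by single spaces, an abbreviation is a whole token directly followed by a comma, and the names AbbreviationExpander, register and expand are invented here.

```java
import java.util.HashMap;
import java.util.Map;

class AbbreviationExpander {
    // Maps abbreviation -> full word.
    private final Map<String, String> fullForms = new HashMap<>();

    void register(String abbreviation, String fullWord) {
        fullForms.put(abbreviation, fullWord);
    }

    String expand(String address) {
        StringBuilder out = new StringBuilder();
        for (String token : address.split(" ")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            if (token.endsWith(",")) {
                // Only text directly before a comma is looked up.
                String candidate = token.substring(0, token.length() - 1);
                // getOrDefault leaves the token unchanged when it is not an abbreviation.
                out.append(fullForms.getOrDefault(candidate, candidate)).append(',');
            } else {
                out.append(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        AbbreviationExpander expander = new AbbreviationExpander();
        expander.register("St", "Street");
        System.out.println(expander.expand("12 Main St, Springfield")); // 12 Main Street, Springfield
    }
}
```

Each address is scanned once and each candidate costs one hash lookup, so the work no longer grows with the number of abbreviation pairs in the table.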

Make multiple HashSets using a loop Java

I'm trying to write code that will create multiple HashSets using a for loop. I'm trying to store occurrences of unique words based on their length. For example, a word of length 4 would go in HashSet A, while a word of length 20 would go in HashSet B. Instead of creating 16 HashSets manually, is there a way for me to use a for loop (int i=4; i<21; i++)? Thank you!
Rather than having 16 different HashSets, you can have a Map<Integer, Set<String>>.
So, while adding, you can just test whether a key is already there or not. If a key is there, just add the word to the Set for that key, else add a new entry.
So, here are the steps you need to follow:
Get the length of the word. Say length.
Test if Map contains key length - Map#containsKey(Object)
If length key is there, get the Set for that key - Map#get(Object). And add the word to that Set.
If length key is not there, create a new HashSet, add the current word in it. And add a new entry in your Map with the current length as key - Map#put(K, V)
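The steps above can be sketched compactly, assuming Java 8 or later, where Map#computeIfAbsent combines the containsKey, get and put steps into a single call. The class and method names are illustrative only.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class WordsByLength {
    // Groups the distinct words by their length.
    static Map<Integer, Set<String>> group(Iterable<String> words) {
        Map<Integer, Set<String>> byLength = new HashMap<>();
        for (String word : words) {
            // Creates the set for this length on first use, then adds the word to it.
            byLength.computeIfAbsent(word.length(), k -> new HashSet<>()).add(word);
        }
        return byLength;
    }

    public static void main(String[] args) {
        Map<Integer, Set<String>> groups = group(Arrays.asList("tree", "leaf", "a", "tree"));
        System.out.println(groups.get(4).size());  // 2
        System.out.println(groups.containsKey(2)); // false
    }
}
```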
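// Generic arrays cannot be created directly, so create a raw array and suppress the warning.
@SuppressWarnings("unchecked")
HashSet<String>[] sets = new HashSet[21];
for (int i = 4; i < 21; i++) {
    sets[i] = new HashSet<String>();
}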
Later when you want to add words:
for (String word : words) {
    sets[word.length()].add(word);
}
P.S. I do not use the array indexes 0..3, but the code looks nicer this way and it wastes only very little memory.
You can make them in a loop and put them into a list or an array...
List<HashSet<String>> sets = new ArrayList<HashSet<String>>();
for (int x = 0; x < 16; x++) {
    sets.add(new HashSet<String>());
}

Java: Sorting different types of arrays to one another

I need to sort an array based on the positions held in another array.
What I have works, but it is kinda slow, is there a faster/better way to implement this?
2 Parts:
Part1
int i = mArrayName.size();
int temp = 0;
for(int j=0;j<i;j++){
temp = mArrayPosition.get(j);
mArrayName.set(temp, mArrayNameOriginal.get(j));
}
In this part, mArrayPosition is the position I would like the mArrayName to be in.
Ex.
input:
mArrayName= (one, two, three)
mArrayPosition = (2,0,1)
output:
mArrayName= (three, one, two)
Part 2
int k = 0;
int j = 0;
do {
    // Compare string contents with equals(); != only compares references.
    if (!mArrayName.get(k).equals(mArrayNameOriginal.get(j))) {
        j++;
    } else {
        mArrayIdNewOrder.set(k, mArrayId.get(j));
        k++;
        j = 0;
    }
} while (k < mArrayName.size());
In this part, mArrayName is the reordered name array, mArrayNameOriginal is the original name array.
Ex.
mArrayName = (three, one, two)
mArrayNameOriginal = (one, two, three)
Now I want to compare these two arrays, find which entries are equal and relate that to a new array that has their rowId number in it.
Ex.
input:
mArrayId = (001,002,003)
output:
mArrayIdNewOrder = (003,001,002)
So then I will have mArrayIdNewOrder id's matching up with the correct names in mArrayName.
Like I said these methods work, but is there a faster/better way to do it? I tried looking at Arrays.sort and comparators but they only seem to sort alphabetically or numerically. I saw something like I can create my own rules inside the comparator but it would probably end up being similar to what I already have.
Sorry for the confusing question. I'll try to clear up any ambiguities if needed.
The best performance read I've found is Android's Designing For Performance doc. You are violating a couple of the "Android way" guidelines, and following them will help you.
You are using multiple internal getters inside each loop for what looks like a simple value. Redo this by accessing the fields directly.
For extra credit, post your performance comparison results! I'd love to see em!
You could use some form of tuple, some class to hold both id and name. You'll just need a java.util.Comparator that compares them accordingly; both elements will move together and your code will be cleaner.
This data structure might be convenient for the rest of your program... if not, just take things off it again and you're done.
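A sketch of the tuple idea, assuming each entry also carries the position it should end up at, read as a target index in the way the Part 1 loop in the question uses it. The Row class, its fields and the sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ReorderByPosition {
    // Holds everything that has to move together.
    static final class Row {
        final int position; // assumed: target index of this row
        final String id;
        final String name;

        Row(int position, String id, String name) {
            this.position = position;
            this.id = id;
            this.name = name;
        }
    }

    // Returns a new list ordered by position; the input list is left untouched.
    static List<Row> sortByPosition(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingInt((Row r) -> r.position));
        return sorted;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row(2, "1", "pear"));
        rows.add(new Row(0, "2", "apple"));
        rows.add(new Row(1, "3", "fig"));
        for (Row r : sortByPosition(rows)) {
            System.out.println(r.id + " " + r.name);
        }
    }
}
```

Because id and name live in the same object, the second pass that matches names back to ids (Part 2 in the question) is no longer needed.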
If your order indexes are compact, i.e. from index 0 to size - 1, then just use an array and create the updated list afterwards. Something like:
MyArray[] array = new MyArray[size];
for(int j=0;j< size;j++) {
array[ mArrayPosition.get(j) ] = mArrayName.get(j);
}
// create ArrayList from array
