Algorithm for Permutations without Repetition? - java

In a program I am making that generates anagrams for a given set of letters, my current approach is to:
Get all the combinations of all the letters
Get the permutations of each combination group
Sort the resulting permutations alphabetically
Remove duplicate entries
My question pertains to the mathematics of permutations. I am wondering if it is possible to flat-out calculate the array size needed to store all of the remaining entries after removal of duplicate entries (using, say, the number of repeated letters in conjunction with the permutation formula or something).
I apologize about the vagueness of my question, I am still researching more about combinations and permutations. I will try to elaborate my goal as my understanding of combinations and permutations expands, and once I re-familiarize myself with my program (it was a spare-time project of mine last summer).

If you have n elements, and a[0] duplicates of one element, a[1] duplicates of another element, and so on up to a[k], then the total number of distinct permutations (up to duplicates) is n!/(a[0]! a[1]! ... a[k]!).
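As a quick sketch of using that formula in code (the class and method names here are just illustrative), you could count each letter's multiplicity and divide out the factorials:

    import java.math.BigInteger;
    import java.util.HashMap;
    import java.util.Map;

    public class DistinctPermutations {
        // n! divided by the factorial of each letter's multiplicity.
        static BigInteger countDistinctPermutations(String s) {
            Map<Character, Integer> counts = new HashMap<>();
            for (char c : s.toCharArray()) {
                counts.merge(c, 1, Integer::sum);
            }
            BigInteger result = factorial(s.length());
            for (int count : counts.values()) {
                result = result.divide(factorial(count));
            }
            return result;
        }

        static BigInteger factorial(int n) {
            BigInteger f = BigInteger.ONE;
            for (int i = 2; i <= n; i++) {
                f = f.multiply(BigInteger.valueOf(i));
            }
            return f;
        }

        public static void main(String[] args) {
            // "banana": 6! / (3! * 2! * 1!) = 60 distinct permutations
            System.out.println(countDistinctPermutations("banana")); // 60
        }
    }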
FYI, if you're interested, with Guava you could write
    Collection<List<Character>> uniquePermutations =
        Collections2.orderedPermutations(Lists.charactersOf(string));
and the result would be the unique permutations of the characters, accounting for duplicates and everything. You could even call its .size() method -- or just look at its implementation for hints. (Disclosure: I contribute to Guava.)

Generating all the permutations is a really bad idea. The word "overflow", for instance, has 40320 permutations, so memory consumption gets really high as your word length grows.
I believe that the problem you are trying to solve can be reduced to finding out if one word is an anagram of another.
Then you can solve it by counting how many times each letter occurs (it will be a 26-tuple) and comparing these tuples against each other.
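As an illustrative sketch of that comparison (assuming lowercase a-z input):

    // Two words are anagrams exactly when their 26-entry
    // letter-count tuples are equal.
    static boolean isAnagram(String a, String b) {
        if (a.length() != b.length()) return false;
        int[] counts = new int[26];
        for (char c : a.toCharArray()) counts[c - 'a']++;
        for (char c : b.toCharArray()) counts[c - 'a']--;
        for (int n : counts) if (n != 0) return false;
        return true;
    }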

Partition array into K subsets of same sum value

I'm trying to figure out the following problem:
Given a set S of N positive integers, the task is to divide them into K subsets such that the sum of the element values in each of the K subsets is equal.
I want to do this with a set of not more than 10 integers, with values no bigger than 10, and fewer than 5 subsets.
All integers need to be distributed, and only perfect solutions (meaning all subsets are equal, no approximations) are accepted.
I want to solve it recursively using backtracking. Most resources I found online used other approaches I did not understand (bitmasks and the like), or handled only two subsets rather than K subsets.
My first idea was to
Sort the set in ascending order, check all base cases (e.g. when an even distribution is not possible), and calculate the average value every subset has to reach so that all subsets are equal.
Go through each subset, filling each one (starting with the biggest values first) until that average value is reached (meaning it is full).
If the average value for a subset can't be met (the undistributed values are too big, etc.), go back and try another combination for the previous subset.
Keep going back whenever dead ends are encountered.
Stop once all dead ends have been encountered or a perfect solution has been found.
Unfortunately I am really struggling with this, especially with implementing the backtrack and retrying new combinations.
Any help is appreciated!
The given set S with N elements has 2^N subsets (well explained here: https://www.mathsisfun.com/activity/subsets.html ). A partition is a grouping of the set's elements into non-empty subsets, in such a way that every element is included in one and only one of the subsets. The total number of partitions of an n-element set is the Bell number Bn.
A solution for this problem can be implemented as follows:
1) create all possible partitions of the set S, called P(S).
2) loop over P(S) and filter out every partition in which the subset sums do not all match.
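Enumerating all of P(S) literally gets expensive fast (the Bell numbers grow super-exponentially), so a practical way to realize the same filtering idea is the backtracking the question asks for: assign each value to one of the K buckets and undo the assignment when a bucket would exceed the target sum. A sketch under those assumptions (names are illustrative, not a reference implementation):

    import java.util.Arrays;

    public class KPartition {
        // Tries to split nums into k subsets, each summing to total/k.
        // Returns true if such a partition exists.
        static boolean canPartition(int[] nums, int k) {
            int total = 0;
            for (int n : nums) total += n;
            if (k <= 0 || total % k != 0) return false;
            int target = total / k;
            Arrays.sort(nums);                    // ascending order
            int[] buckets = new int[k];
            return place(nums, nums.length - 1, buckets, target);
        }

        // Places nums[i..0] into buckets, largest values first.
        static boolean place(int[] nums, int i, int[] buckets, int target) {
            if (i < 0) return true;               // every element placed
            for (int b = 0; b < buckets.length; b++) {
                if (buckets[b] + nums[i] <= target) {
                    buckets[b] += nums[i];        // tentative placement
                    if (place(nums, i - 1, buckets, target)) return true;
                    buckets[b] -= nums[i];        // backtrack: undo, try next bucket
                }
                if (buckets[b] == 0) break;       // empty buckets are symmetric; skip duplicates
            }
            return false;                         // dead end: caller must retry
        }

        public static void main(String[] args) {
            // {5}, {4,1}, {3,2}, {3,2} each sum to 5
            System.out.println(canPartition(new int[]{4, 3, 2, 3, 5, 2, 1}, 4)); // true
        }
    }

The "go back and try another combination" step from the question is exactly the line that subtracts nums[i] back out of the bucket after a failed recursive call.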

Suffix array nlogn creation

I have been learning suffix array construction, and I understand that we first sort all suffixes according to the first character, then according to the first 2 characters, then the first 4 characters, and so on, while the number of characters to be considered is smaller than 2n.
But my doubt is: why don't we choose the first 3 characters, then 9, and so on? Why are only powers of 2 taken into account, given that the suffixes are parts of the same string and not different random strings?
I haven't analyzed the suffix array construction algorithm thoroughly, but still would like to share my thoughts.
In my humble opinion, your question is similar to the following ones:
Why do computers use binary encoding of information instead of ternary?
Why does binary search bisect the range instead of trisecting it?
Why are there two sexes rather than three?
The reason is that the number 2 is special - it is the smallest plural number. The difference between 1 and 2 is qualitative, whereas the difference between 2 and 3 (as well as any other positive integer) is quantitative and therefore not as drastic.
As a result, binary formulation of many algorithms and data structures turns out to be the simplest one, though some of them may be generalized, with various degrees of added complexity, for an arbitrary base.
The answer is given in the post you linked, and as #Leon answered, the algorithm works because it uses a dichotomous approach to solve the sorting problem. If you read the answer carefully, the main purpose is to divide each suffix into small 2-character fragments, so that 4 characters can easily be sorted based on the arrangement of the two pairs of characters, 6 characters as 4-2 or 2-4 or 2-2-2, and so on. Thus keeping 3-letter keys in the table makes no sense, since a word of 3 characters can already be seen as a 2-character fragment plus the rank of the last character.
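To make the doubling step concrete, here is a hedged sketch of prefix doubling in Java. Note that this version is O(n log^2 n) because it uses a comparison sort in each round; the classic O(n log n) construction replaces that sort with radix sort, but the rank-pair idea is identical:

    import java.util.Arrays;
    import java.util.Comparator;

    public class SuffixArray {
        // Builds the suffix array of s by prefix doubling: after each round,
        // rank[i] encodes the order of the first k characters of suffix i,
        // and suffixes are sorted by the pair (rank[i], rank[i + k]).
        static Integer[] build(String s) {
            int n = s.length();
            int[] rank = new int[n];
            int[] next = new int[n];
            Integer[] sa = new Integer[n];
            for (int i = 0; i < n; i++) {
                sa[i] = i;
                rank[i] = s.charAt(i);            // round 0: sort by first character
            }
            for (int k = 1; k < n; k <<= 1) {
                final int step = k;
                final int[] r = rank.clone();     // ranks from the previous round
                Comparator<Integer> byPair = (a, b) -> {
                    if (r[a] != r[b]) return Integer.compare(r[a], r[b]);
                    int ra = a + step < n ? r[a + step] : -1;  // suffix ends early
                    int rb = b + step < n ? r[b + step] : -1;
                    return Integer.compare(ra, rb);
                };
                Arrays.sort(sa, byPair);
                // Re-rank: suffixes with equal (first, second) pairs share a rank.
                next[sa[0]] = 0;
                for (int i = 1; i < n; i++) {
                    next[sa[i]] = next[sa[i - 1]] + (byPair.compare(sa[i - 1], sa[i]) < 0 ? 1 : 0);
                }
                System.arraycopy(next, 0, rank, 0, n);
                if (rank[sa[n - 1]] == n - 1) break;  // all ranks distinct: fully sorted
            }
            return sa;
        }

        public static void main(String[] args) {
            System.out.println(Arrays.toString(build("banana"))); // [5, 3, 1, 0, 4, 2]
        }
    }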
I think you are considering only the speed of 2^x versus 3^x where you obviously would prefer the latter.
But you have to consider the effort you need for each step.
Since the 3^x scheme needs about 1.58 times fewer steps than the 2^x scheme, you would need to be able to compute a single step of the 3^x version in less than 1.58 times the cost of a single step of the 2^x version to perform better.
Generally the problems will get much more complex when you have to handle three elements in each step instead of two.
Also, if you could expand it to 3^x, you could do it for an even bigger n^x as well, and with a big n your algorithm would suddenly be not exponential but effectively linear.

Use hashing to find a subarray of strings with minimum total length which contain all the distinct strings in the original array

Hi, this is a Java exercise on hashing. We have an array of N strings (1 <= N <= 100000), and the program must find the minimum total length of a contiguous subarray that contains all the distinct strings present in the original array.
For example, if the original array is {apple, orange, orange, pear, pear, apple, pear},
a qualifying subarray is {orange, pear, pear, apple},
so the answer is 19 (6 + 4 + 4 + 5).
I've written code which visits every element in the array and creates a new hash table to find the length of the subarray that contains all the distinct strings. It becomes very, very slow once N is larger than 1000, so I hope there is a faster algorithm. Thank you!
1) Pass through the array once, using a hash to keep track of whether you've seen a word before or not. Count the distinct words in the array by adding to your count only when you're seeing a word for the first time.
2) Pass through the array a second time, using a hash to keep track of the number of times you've seen each word. Also keep track of the sum of the lengths of all the words you've seen. Keep going until you have seen all words at least once.
3) Now move the start of the range forward as long as you can do so without reducing a word's count to zero. Remember to adjust your hash and letter count accordingly. This gives you the first range which includes every word at least once and can't be reduced without excluding a word.
4) Repeatedly do the following: move the left end of your range forward by one, and then move the right end forward until you find another instance of the word that you just booted from the left end. Each time you do this, you have another minimal range that includes each word once.
5) While doing steps 3 and 4, keep track of the minimum length so far, and the start and end of the associated range. You're done when you would need to move the right end of your range past the end of the array. At this point you have the right minimum length, and the range that achieves it.
This runs in linear time.
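A hedged sketch of this two-pointer scheme in Java; for brevity it tracks the window's total character length directly and shrinks the window greedily rather than stepping through minimal ranges one by one, but the counting idea is the same (names are illustrative):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class MinWindowSubarray {
        // Returns the minimum total string length of a contiguous subarray
        // of words that contains every distinct word at least once.
        static int minWindowLength(String[] words) {
            Set<String> distinct = new HashSet<>();
            for (String w : words) distinct.add(w);
            int need = distinct.size();

            Map<String, Integer> seen = new HashMap<>();
            int best = Integer.MAX_VALUE, windowLen = 0, have = 0, left = 0;
            for (int right = 0; right < words.length; right++) {
                if (seen.merge(words[right], 1, Integer::sum) == 1) have++;
                windowLen += words[right].length();
                // Shrink from the left while the window still covers every word.
                while (have == need) {
                    best = Math.min(best, windowLen);
                    String w = words[left++];
                    windowLen -= w.length();
                    if (seen.merge(w, -1, Integer::sum) == 0) have--;
                }
            }
            return best;
        }

        public static void main(String[] args) {
            String[] words = {"apple", "orange", "orange", "pear", "pear", "apple", "pear"};
            System.out.println(minWindowLength(words)); // 19
        }
    }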

Iterating over Permutations of an Array

I'm working on some Java code for research, and need a way to iterate over all permutations of an ArrayList. I've looked over some previous questions asked here, but most were not quite what I want to do, and the ones that were close had answers dealing with strings, with example code written in Perl, or, in the case of the one implementation that seemed like it would work, did not actually work.
Ideally I'm looking for tips/code snippets to help me write a function permute(list, i) that, as i goes from 0 to list.size()! - 1, gives me every permutation of my ArrayList.
There is a way of counting from 0 to (n! - 1) that will list off all permutations of a list of n elements. The idea is to rewrite the numbers as you go using the factorial number system and interpreting the number as an encoded way of determining which permutation to use. If you're curious about this, I have a C++ implementation of this algorithm. I also once gave a talk about this, in case you'd like some visuals on the topic.
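A Java sketch of that factorial-number-system decoding (illustrative; this is not the linked C++ implementation):

    import java.util.ArrayList;
    import java.util.List;

    public class NthPermutation {
        // Returns the i-th permutation (0-based) of list in lexicographic
        // order, by decoding i in the factorial number system.
        static <T> List<T> permute(List<T> list, long i) {
            List<T> pool = new ArrayList<>(list);   // elements not yet placed
            List<T> result = new ArrayList<>();
            long f = factorial(pool.size());
            for (int k = pool.size(); k > 0; k--) {
                f /= k;                             // now equals (k-1)!
                int index = (int) (i / f);          // which remaining element is next
                i %= f;
                result.add(pool.remove(index));
            }
            return result;
        }

        static long factorial(int n) {
            long f = 1;
            for (int j = 2; j <= n; j++) f *= j;
            return f;
        }

        public static void main(String[] args) {
            List<String> letters = List.of("a", "b", "c");
            for (long i = 0; i < 6; i++) {
                System.out.println(i + ": " + permute(letters, i));
            }
            // 0: [a, b, c]  ...  5: [c, b, a]
        }
    }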
Hope this helps!
If iterating over all permutations is enough for you, see this answer: Stepping through all permutations one swap at a time.
For a given n the iterator produces all permutations of numbers 0 to (n-1).
You can simply wrap it into another iterator that converts the permutation of numbers into a permutation of your array elements. (Note that you cannot just replace int[] within the iterator with an arbitrary array/list. The algorithm needs to work with numbers.)
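Assuming you already have such an Iterator<int[]> over index permutations from the linked answer, the wrapping might look like this sketch (names are illustrative):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    // Adapts an iterator over index permutations (int[] of 0..n-1)
    // into an iterator over permutations of the given list's elements.
    class PermutedListIterator<T> implements Iterator<List<T>> {
        private final List<T> elements;
        private final Iterator<int[]> indexPermutations; // e.g. the swap-based iterator

        PermutedListIterator(List<T> elements, Iterator<int[]> indexPermutations) {
            this.elements = elements;
            this.indexPermutations = indexPermutations;
        }

        @Override public boolean hasNext() { return indexPermutations.hasNext(); }

        @Override public List<T> next() {
            int[] perm = indexPermutations.next();
            List<T> result = new ArrayList<>(perm.length);
            for (int index : perm) result.add(elements.get(index));
            return result;
        }
    }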

Most frequent words

What's the most efficient way in Java to get the 50 most frequent words with their frequency out of a text?
I want to search around ~1,000,000 texts with each have around ~10,000 words and hope that it works in a reasonable time frame.
Most efficient would probably be using a Patricia trie that links to a max-heap. Every time you read a word, find it on the trie, go to the heap and increase-key. If it's not in the trie, add it and set its key in the heap appropriately.
With a Fibonacci heap, increase-key is O(1) amortized.
A not so unreasonable solution is to use a Map<String, Integer>, adding the count every time a word is encountered, and then custom-sorting its entrySet() based on the count to get the top 50.
If the O(N log N) sort is unacceptable, use a selection algorithm to find the top 50 in O(N).
Which technique is better really depends on what you're asking for (i.e. the comment whether this is more of an [algorithm] question than a [java] question is very telling).
The Map<String, Integer> followed by a selection algorithm is most practical, but the Patricia trie solution clearly beats it in space efficiency alone (since common prefixes are not stored redundantly).
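A hedged sketch of the map-plus-selection route: count with a HashMap, then keep a bounded min-heap of size 50, so selecting the top entries costs O(N log 50) rather than a full sort (class and method names here are illustrative):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.PriorityQueue;

    public class TopWords {
        // Returns up to `limit` most frequent words. The min-heap keeps the
        // smallest count on top, so anything evicted is never in the top `limit`.
        static PriorityQueue<Map.Entry<String, Integer>> topFrequent(Iterable<String> words, int limit) {
            Map<String, Integer> counts = new HashMap<>();
            for (String w : words) counts.merge(w, 1, Integer::sum);

            PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                heap.offer(e);
                if (heap.size() > limit) heap.poll(); // evict the current minimum
            }
            return heap; // the `limit` entries with the highest counts
        }
    }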
The following pseudocode should do the trick:
    build a map<word, count>
    build a tokenizer that gives you a word per iteration
    for each word*:
        if word in map, increment its count
        otherwise add it with count = 1
    sort words by count
    for each of the first 50 words:
        output word, frequency = count / total_words
This is essentially O(N), and what jpabluz suggested. However, if you are going to use this on any sort of "in the wild" text, you will notice lots of garbage: uppercase/lowercase, punctuation, URLs, stop-words such as 'the' or 'and' with very high counts, multiple variations of the same word... The right way to do it is to lowercase all words, remove all punctuation (and things such as URLs), and add stop-word removal and stemming at the point marked with the asterisk in the above pseudocode.
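A minimal sketch of that normalization step (stemming and stop-word removal need a word list or library of their own, so they are left out here):

    // Lowercase and strip non-letters so that "The", "the," and "THE"
    // all land on the same map key. ASCII-only for simplicity.
    static String normalize(String token) {
        return token.toLowerCase().replaceAll("[^a-z]", "");
    }

Tokens that come back empty (pure punctuation, numbers, URLs) would simply be skipped before counting.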
Your best chance would be an O(n) algorithm. I would go for a text reader that splits out the words and then adds each one to an ordered tree, ordered by number of appearances, with each node linked to its word. After that, just do a 50-iteration traversal to get the highest values.
O(n):
Count the number of words
Split your text word wise into list of words
Create a map of word=>number_of_occurences
Traverse the map and select the top 50.
Divide them by total number of words to get frequency
Of course some of these steps may be done at the same time, or may be unnecessary, depending on the data structures you use.
