I am sorting 1 million strings (each string 50 chars) in ArrayList with
final Comparator<String> comparator = new Comparator<String>() {
    public int compare(String s1, String s2) {
        if (s1 == null || s2 == null)
            return 0; // note: treating nulls as equal violates the Comparator contract
        return s1.compareTo(s2);
    }
};
Collections.sort(list, comparator);
The average time for this is 1300 ms.
How can I speed it up?
If you're using Java 6 or below you might get a speedup by switching to Java 7. In Java 7 they changed the sort algorithm to TimSort which performs better in some cases (in particular, it works well with partially sorted input). Java 6 and below used MergeSort.
But let's assume you're using Java 6. I tried three versions:
Collections.sort(): Repeated runs with the comparator you provided take about 3.0 seconds on my machine (including reading the input of 1,000,000 randomly generated lowercase ASCII strings).
Radix Sort: Other answers suggested a Radix sort. I tried the following code (which assumes the strings are all the same length, and only lowercase ascii):
String[] A = list.toArray(new String[0]);
for (int i = stringLength - 1; i >= 0; i--) {
    int[] buckets = new int[26];
    int[] starts = new int[26];
    for (int k = 0; k < A.length; k++) {
        buckets[A[k].charAt(i) - 'a']++;
    }
    for (int k = 1; k < buckets.length; k++) {
        starts[k] = buckets[k - 1] + starts[k - 1];
    }
    String[] temp = new String[A.length];
    for (int k = 0; k < A.length; k++) {
        temp[starts[A[k].charAt(i) - 'a']] = A[k];
        starts[A[k].charAt(i) - 'a']++;
    }
    A = temp;
}
It takes about 29.0 seconds to complete on my machine. I don't think this is the best way to implement radix sort for this problem - for example, a most-significant-digit sort could terminate early on unique prefixes. There would also be some benefit in using an in-place sort instead (there's a good quote about this: “The troubles with radix sort are in implementation, not in conception”). I'd like to write a better radix-sort-based solution that does this - if I get time I'll update my answer.
Bucket Sort: I also implemented a slightly modified version of Peter Lawrey's bucket sort solution. Here's the code:
Map<Integer, List<String>> buckets = new TreeMap<Integer, List<String>>();
for (String s : l) {
    int key = s.charAt(0) * 256 + s.charAt(1);
    List<String> list = buckets.get(key);
    if (list == null) buckets.put(key, list = new ArrayList<String>());
    list.add(s);
}
l.clear();
for (List<String> list : buckets.values()) {
    Collections.sort(list);
    l.addAll(list);
}
It takes about 2.5 seconds to complete on my machine. I believe this win comes from the partitioning.
So, if switching to Java 7's TimSort doesn't help you, then I'd recommend partitioning the data (using something like bucket sort). If you need even better performance, then you can also multi-thread the processing of the partitions.
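As a rough sketch of that last suggestion (the class and method names here are mine, and it assumes every string has at least two characters), the buckets can be sorted in parallel with a parallel stream:

```java
import java.util.*;

public class ParallelBucketSort {
    // Partition by the first two characters, then sort each bucket
    // on its own thread via the common fork-join pool.
    static List<String> sort(List<String> input) {
        Map<Integer, List<String>> buckets = new TreeMap<>();
        for (String s : input) {
            int key = s.charAt(0) * 256 + s.charAt(1);
            buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(s);
        }
        // Each bucket is independent, so the sorts can run concurrently.
        buckets.values().parallelStream().forEach(Collections::sort);
        // Concatenate sequentially to preserve the TreeMap's key order.
        List<String> out = new ArrayList<>(input.size());
        for (List<String> b : buckets.values()) out.addAll(b);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sort(Arrays.asList("banana", "apple", "cherry", "apricot")));
    }
}
```

With 1,000,000 strings the buckets are numerous and small, so the fork-join pool can keep all cores busy while the final concatenation stays sequential to keep the output sorted.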
You didn't specify the sort algorithm you use; some are quicker than others (quicksort/mergesort vs. bubble sort).
Also, if you are running on a multi-core/multi-processor machine, you can divide the sort between multiple threads (again, exactly how depends on the sort algorithm, but here's an example).
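If you can use Java 8 or later, the standard library already ships a multi-threaded sort, Arrays.parallelSort, which splits the array across the common fork-join pool and merges the sorted pieces (a minimal illustration, not the example linked above):

```java
import java.util.*;

public class ParallelSortDemo {
    public static void main(String[] args) {
        // Arrays.parallelSort falls back to a sequential sort for small
        // arrays, and parallelizes automatically for large ones.
        String[] a = {"pear", "kiwi", "fig", "plum"};
        Arrays.parallelSort(a);
        System.out.println(Arrays.toString(a)); // [fig, kiwi, pear, plum]
    }
}
```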
You can use a radix sort for the first two characters. If the first two characters are distinctive, you can use something like:
List<String> strings = /* your input list */;
Map<Integer, List<String>> radixSort = new HashMap<Integer, List<String>>();
for(String s: strings) {
int key = (s.charAt(0) << 16) + s.charAt(1);
List<String> list = radixSort.get(key);
if(list == null) radixSort.put(key, list = new ArrayList<String>());
list.add(s);
}
strings.clear();
for(List<String> list: new TreeMap<Integer, List<String>>(radixSort).values()) {
Collections.sort(list);
strings.addAll(list);
}
Related
I would like to find the most efficient algorithm for this problem:
Given a string str and a list of strings lst that consists of only lowercase English characters and is sorted lexicographically, find all the words in lst that are a permutation of str.
for example:
str = "cat", lst = {"aca", "acc", "act", "cta", "tac"}
would return: {"act", "cta", "tac"}
I already have an algorithm that doesn't take advantage of the fact that lst is lexicographically ordered, and I am looking for the most efficient algorithm that takes advantage of this fact.
My algorithm goes like this:
public List<String> getPermutations(String str, List<String> lst) {
    List<String> res = new ArrayList<>();
    for (String word : lst)
        if (checkPermutation(word, str))
            res.add(word);
    return res;
}

public boolean checkPermutation(String word1, String word2) {
    if (word1.length() != word2.length())
        return false;
    int[] count = new int[26];
    for (int i = 0; i < word1.length(); i++) {
        count[word1.charAt(i) - 'a']++;
        count[word2.charAt(i) - 'a']--;
    }
    for (int i = 0; i < 26; i++)
        if (count[i] != 0)
            return false;
    return true;
}
Total runtime is O(NK), where N is the number of strings in lst and K is the length of str.
One simple optimisation (that only becomes meaningful for really large data sets, as it doesn't really improve on the O(NK) bound):
put all the characters of your incoming str into a Set strChars
now: when iterating the words in your list: fetch the first character of each entry
if strChars.contains(charFromListEntry): check whether it is a permutation
else: obviously, that list word can't be a permutation
Note: the sorted ordering doesn't help much here, because you still have to check the first char of the next string from your list.
There might be other checks to avoid the costly checkPermutation() run, for example first comparing the lengths of the words: when the list string is shorter than the input string, it obviously can't be a permutation of it.
But as said, in the end you have to iterate over all entries in your list and determine whether an entry is a permutation. There is no way avoiding the corresponding "looping". The only thing you can affect is the cost that occurs within your loop.
Finally: if your List of strings would be a Set, then you could "simply" compute all permutations of your incoming str, and check for each permutation whether it is contained in that Set. But of course, in order to turn a list into a set, you have to iterate that thing.
Instead of iterating over the list and checking each element for being a permutation of your string, you can iterate over all permutations of the string and check each one for presence in the list using binary search.
E.g.
public List<String> getPermutations(String str, List<String> lst) {
    List<String> res = new ArrayList<>();
    perm(str, (1L << str.length()) - 1, new StringBuilder(), lst, res);
    return res;
}

private void perm(String source, long unused,
                  StringBuilder sb, List<String> lst, List<String> result) {
    if (unused == 0) {
        int i = Collections.binarySearch(lst, sb.toString());
        if (i >= 0) result.add(lst.get(i));
    }
    for (long r = unused, l; (l = Long.highestOneBit(r)) != 0; r -= l) {
        sb.append(source.charAt(Long.numberOfTrailingZeros(l)));
        perm(source, unused & ~l, sb, lst, result);
        sb.setLength(sb.length() - 1);
    }
}
Now, the time complexity is O(K! × log N) which is not necessarily better than the O(NK) of your approach. It heavily depends on the magnitude of K and N. If the string is really short and the list really large, it may have an advantage.
There are a lot of optimizations imaginable. E.g. instead of constructing each permutation followed by a binary search, each recursion step could do a partial search to identify the potential search range for the next step, and skip when it's clear that the permutations can't be contained. While this could raise the performance significantly, it can't change the fundamental time complexity, i.e. the worst case.
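One way to sketch that partial-search pruning (hasPrefix is a hypothetical helper, not part of the code above): before recursing with a given prefix, a single binary search can tell whether any word in the sorted list starts with it.

```java
import java.util.*;

public class PrefixPrune {
    // Returns true if some word in the sorted list starts with prefix.
    // Uses binarySearch's insertion-point contract: a negative result
    // (-i - 1) is the index of the first word >= prefix.
    static boolean hasPrefix(List<String> sorted, String prefix) {
        int i = Collections.binarySearch(sorted, prefix);
        if (i >= 0) return true;              // the prefix itself is a word
        int insertion = -i - 1;               // first word >= prefix
        return insertion < sorted.size()
                && sorted.get(insertion).startsWith(prefix);
    }

    public static void main(String[] args) {
        List<String> lst = Arrays.asList("aca", "acc", "act", "cta", "tac");
        System.out.println(hasPrefix(lst, "ac")); // true
        System.out.println(hasPrefix(lst, "tc")); // false
    }
}
```

A recursion step that builds the permutation prefix "tc" could call this once and skip the whole subtree of permutations starting with those two characters.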
The scenario is the following:
You have 2 strings (s1, s2) and want to check whether one is a permutation of the other, so you generate all permutations of, let's say, s1, store them, and then iterate over them comparing against s2 until it is either found or not.
Now, in this scenario, I am deliberating whether an ArrayList or a HashMap is better to use when considering strictly time complexity, as I believe both have O(N) space complexity.
According to the javadocs, ArrayList has a search complexity of O(N), whereas HashMap is O(1). If this is the case, is there any reason to favor ArrayList over HashMap here, since HashMap would be faster?
The only potential downside I could think of is that your (k, v) pairs might be a bit weird if the key equals the value, i.e. {k = "ABCD", v = "ABCD"}, etc.
As shown here:
import java.io.*;
import java.util.*;

class GFG {
    static int NO_OF_CHARS = 256;

    /* Check whether two strings are permutations of each other */
    static boolean arePermutation(char str1[], char str2[]) {
        // Create 2 count arrays and initialize all values as 0
        int count1[] = new int[NO_OF_CHARS];
        Arrays.fill(count1, 0);
        int count2[] = new int[NO_OF_CHARS];
        Arrays.fill(count2, 0);
        int i;

        // For each character in the input strings, increment the
        // count in the corresponding count array
        for (i = 0; i < str1.length && i < str2.length; i++) {
            count1[str1[i]]++;
            count2[str2[i]]++;
        }

        // If the strings are of different length.
        // Removing this condition would make the program
        // fail for strings like "aaca" and "aca"
        if (str1.length != str2.length)
            return false;

        // Compare count arrays
        for (i = 0; i < NO_OF_CHARS; i++)
            if (count1[i] != count2[i])
                return false;
        return true;
    }

    /* Driver program to test arePermutation */
    public static void main(String args[]) {
        char str1[] = "geeksforgeeks".toCharArray();
        char str2[] = "forgeeksgeeks".toCharArray();
        if (arePermutation(str1, str2))
            System.out.println("Yes");
        else
            System.out.println("No");
    }
}
If you're glued to your implementation, use a HashSet; it still has O(1) lookup time, just without keys.
You can use a HashSet, as you only need one parameter.
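A minimal illustration of the HashSet suggestion (the permutations here are hard-coded for brevity):

```java
import java.util.*;

public class SetLookupDemo {
    public static void main(String[] args) {
        // Store the generated permutations of s1 in a set;
        // membership tests are then O(1) on average.
        Set<String> perms = new HashSet<>(
                Arrays.asList("abc", "acb", "bac", "bca", "cab", "cba"));
        System.out.println(perms.contains("bca")); // true
        System.out.println(perms.contains("abd")); // false
    }
}
```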
I want to find the first repeated character in a string. I usually do this with array_intersect in PHP. Is there something similar in Java?
For example:
String a = "zxcvbnmz";
Desired output : z
array_intersect — Computes the intersection of arrays (source)
So in this case you can use Set::retainAll :
Integer[] a = {1,2,3,4,5};
Integer[] b = {2,4,5,6,7,8,9};
Set<Integer> s1 = new HashSet<>(Arrays.asList(a));
Set<Integer> s2 = new HashSet<>(Arrays.asList(b));
s1.retainAll(s2);
Integer[] result = s1.toArray(new Integer[s1.size()]);
System.out.println(Arrays.toString(result));
Output
[2, 4, 5]
You can read more about this here: Java, find intersection of two arrays.
There's no default implementation for this behavior; however, you can code your own solution! Since you want to find the first repeated character, you can make a HashSet of Characters. As you iterate through the array, you add each character to the HashSet until you come across a character already in the HashSet - this must be the first repeated character. Example code below:
public Character arrayIntersect(String string) {
    HashSet<Character> hashSet = new HashSet<>();
    for (int i = 0; i < string.length(); i++) {
        char c = string.charAt(i);
        if (hashSet.contains(c))
            return c;
        hashSet.add(c);
    }
    return null; // no repeated character (Character, not char, so null is legal)
}
This runs in O(n) time, as HashSet lookups run in O(1) average time.
I'm trying to solve this question using Java. The goal is to sort a string in decreasing order based on the frequency of characters. For example "Aabb" is going to be "bbaA" or "bbAa". I have implemented a working solution but it's in O(n^2). I was wondering if someone out there has a better and more optimal solution.
Here is the code:
public class Solution {
    public String frequencySort(String s) {
        Map<Character, Integer> map = new HashMap<Character, Integer>();
        for (int i = 0; i < s.length(); i++) {
            if (map.containsKey(s.charAt(i)))
                map.put(s.charAt(i), map.get(s.charAt(i)) + 1);
            else
                map.put(s.charAt(i), 1);
        }
        List<Map.Entry<Character, Integer>> sortedlist = new ArrayList<>(map.entrySet());
        Collections.sort(sortedlist, new Comparator<Map.Entry<Character, Integer>>() {
            @Override
            public int compare(Map.Entry<Character, Integer> o1,
                               Map.Entry<Character, Integer> o2) {
                return o2.getValue() - o1.getValue();
            }
        });
        String lastString = "";
        for (Map.Entry<Character, Integer> e : sortedlist) {
            for (int j = 0; j < e.getValue(); j++)
                lastString += e.getKey().toString();
        }
        return lastString;
    }
}
Your algorithm is actually O(n) (thanks @andreas, twice!):
Building the map of counts is O(n), where n is the length of the input
Sorting the list of entries is O(m log m), where m is the number of unique characters in the input
Rebuilding the sorted string is O(n)
Although the sorting may appear to be the asymptotically slowest step, it most probably isn't the dominant operation when the input is very large. "Probably", because m is bounded by the size of the alphabet, which is normally much smaller than the size of a very large input. Hence the overall time complexity of O(n).
Some minor optimizations are possible, but won't change the order of complexity:
You can first get a character array from the input string. It uses more memory, but you will save the boundary checks of .charAt, and the array can be useful at a later step (see below).
If you know the size of the alphabet, then you can use an int[] instead of a hash map.
Instead of rebuilding the sorted string manually and with string concatenation, you could write into the character array and return new String(chars).
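Putting those suggestions together, a sketch assuming 8-bit characters (the class and method names are mine; ties between equally frequent characters fall back to character-code order because the object sort is stable):

```java
import java.util.*;

public class FrequencySortArray {
    static String frequencySort(String s) {
        // int[] of counts instead of a HashMap (assumes chars fit in 0..255)
        int[] count = new int[256];
        for (char c : s.toCharArray()) count[c]++;
        // Sort the character codes by descending count; Integer[] so we
        // can use a comparator, and the sort is stable for ties.
        Integer[] chars = new Integer[256];
        for (int i = 0; i < 256; i++) chars[i] = i;
        Arrays.sort(chars, (x, y) -> count[y] - count[x]);
        // Write into a char[] instead of concatenating strings.
        char[] out = new char[s.length()];
        int pos = 0;
        for (int c : chars)
            for (int j = 0; j < count[c]; j++) out[pos++] = (char) c;
        return new String(out);
    }

    public static void main(String[] args) {
        System.out.println(frequencySort("Aabb")); // bbAa
    }
}
```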
Your code doesn't pass because of string concatenation; use a StringBuilder instead and I bet it will pass.
StringBuilder builder = new StringBuilder();
for (Map.Entry<Character, Integer> e : sortedlist)
    for (int j = 0; j < e.getValue(); j++)
        builder.append(e.getKey());
return builder.toString();
There are a couple of other ideas how to sort elements by frequency.
Use a sorting algorithm to sort the elements: O(n log n)
Scan the sorted array and construct a 2D array of element and count: O(n)
Sort the 2D array according to count: O(n log n)
Input 2 5 2 8 5 6 8 8
After sorting we get
2 2 5 5 6 8 8 8
Now construct the 2D array as
2, 2
5, 2
6, 1
8, 3
Sort by count
8, 3
2, 2
5, 2
6, 1
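The three steps above could be sketched in Java like this (pairsByCount is a hypothetical name):

```java
import java.util.*;

public class SortByFrequency {
    static List<int[]> pairsByCount(int[] input) {
        int[] a = input.clone();
        Arrays.sort(a);                           // step 1: plain sort, O(n log n)
        List<int[]> pairs = new ArrayList<>();    // step 2: run-length (element, count) pairs
        for (int i = 0; i < a.length; ) {
            int j = i;
            while (j < a.length && a[j] == a[i]) j++;
            pairs.add(new int[]{a[i], j - i});
            i = j;
        }
        pairs.sort((p, q) -> q[1] - p[1]);        // step 3: sort by count, descending
        return pairs;
    }

    public static void main(String[] args) {
        for (int[] p : pairsByCount(new int[]{2, 5, 2, 8, 5, 6, 8, 8}))
            System.out.println(p[0] + ", " + p[1]);
    }
}
```

Running it on the sample input prints the same table as the walkthrough: 8, 3 then 2, 2 then 5, 2 then 6, 1 (equal counts keep value order because List.sort is stable).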
If you have to display the characters according to their frequency, you can use a map or dictionary in Python.
I am solving this using Python:
# sort characters by frequency
def fix(s):
    d = {}
    res = ""
    for ch in s:  # count occurrences (the original iterated the global a by mistake)
        d[ch] = d.get(ch, 0) + 1
    # use this lambda whenever you have to sort by values
    for ch, count in sorted(d.items(), reverse=True, key=lambda item: item[1]):
        res += ch * count
    return res

a = "GiniGinaProtijayi"
print(fix(a))
Method 2: using collections.Counter().most_common()
# sort characters by frequency
import collections

a = "GiniGinaProtijayi"

def sortByFrequency(a):
    aa = [ch * count for ch, count in collections.Counter(a).most_common()]
    print(aa)
    print(''.join(aa))

sortByFrequency(a)
I am working on a project where I have to parse a text file and divide the strings into substrings of a length that the user specifies. Then I need to detect the duplicates in the results.
So the original file would look like this:
ORIGIN
1 gatccaccca tctcggtctc ccaaagtgct aggattgcag gcctgagcca ccgcgcccag
61 ctgccttgtg cttttaatcc cagcactttc agaggccaag gcaggcgatc agctgaggtc
121 aggagttcaa gaccagcctg gccaacatgg tgaaacccca tctctaatac aaatacaaaa
181 aaaaaacaaa aaacgttagc caggaatgag gcccggtgct tgtaatccta aggaaggaga
241 ccaccactcc tcctgctgcc cttcccttcc ccacaccgct tccttagttt ataaaacagg
301 gaaaaaggga gaaagcaaaa agcttaaaaa aaaaaaaaaa cagaagtaag ataaatagct
I loop over the file and build a line of the strings, then use line.toCharArray() to slide over the resulting line and divide it according to the user's specification. So if the substrings are of length 4, the result would look like this:
GATC
ATCC
TCCA
CCAC
CACC
ACCC
CCCA
CCAT
CATC
ATCT
TCTC
CTCG
TCGG
CGGT
GGTC
GTCT
TCTC
CTCC
TCCC
CCCA
CCAA
Here is my code for splitting:
try {
    scanner = new Scanner(toSplit);
    while (scanner.hasNextLine()) {
        String line = scanner.nextLine();
        char[] chars = line.toCharArray();
        for (int i = 0; i < chars.length - (k - 1); i++) {
            String s = "";
            for (int j = i; j < i + k; j++) {
                s += chars[j];
            }
            if (!s.contains("N")) {
                System.out.println(s);
            }
        }
    }
} catch (FileNotFoundException e) { // assuming toSplit is a File
    e.printStackTrace();
}
My question is: given that the input file can be huge, how can I detect duplicates in the results?
If you want to check for duplicates, a Set would be a good choice to hold and test the data. Please tell us in which context you want to detect the duplicates: words, lines, or "output chars"?
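A minimal sketch of the Set idea, relying on Set.add returning false when the element is already present (the class and method names are mine):

```java
import java.util.*;

public class DuplicateDetect {
    // One pass: add returns false for an element already in the set,
    // which flags it as a duplicate.
    static List<String> findDuplicates(List<String> substrings) {
        Set<String> seen = new HashSet<>();
        List<String> dups = new ArrayList<>();
        for (String s : substrings)
            if (!seen.add(s)) dups.add(s);
        return dups;
    }

    public static void main(String[] args) {
        List<String> in = Arrays.asList("GATC", "ATCC", "TCTC", "CCCA", "TCTC", "CCCA");
        System.out.println(findDuplicates(in)); // [TCTC, CCCA]
    }
}
```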
You can use a bloom filter or a table of hashes to detect possible duplicates and then make a second pass over the file to check if those "duplicate candidates" are true duplicates or not.
Example with hash tables:
// First we make a list of candidates, counting how many times each hash is seen
int hashSpace = 65536;
int[] substringHashes = new int[hashSpace];
for (String s : tokens) {
    substringHashes[Math.floorMod(s.hashCode(), hashSpace)]++; // floorMod: hashCode() can be negative
}

// Then we look for words whose hash appears repeated and check whether they
// actually are repeated. We use a set, but only of candidates, so we save a lot of memory.
Set<String> set = new HashSet<String>();
for (String s : tokens) {
    if (substringHashes[Math.floorMod(s.hashCode(), hashSpace)] > 1) {
        boolean repeated = !set.add(s);
        if (repeated) {
            // TODO whatever
        }
    }
}
You could do something like this:
Map<String, Integer> substringMap = new HashMap<>();
int index = 0;
Set<String> duplicates = new HashSet<>();
For each substring you pull out of the file, add it to substringMap only if it's not a duplicate (or if it is a duplicate, add it to duplicates):
if (substringMap.putIfAbsent(substring, index) == null) {
++index;
} else {
duplicates.add(substring);
}
You can then pull out all the substrings with ease:
String[] substringArray = new String[substringMap.size()];
for (Map.Entry<String, Integer> substringEntry : substringMap.entrySet()) {
substringArray[substringEntry.getValue()] = substringEntry.getKey();
}
And voila! An array of output in the original order with no duplicates, plus a set of all the substrings that were duplicates, with very nice performance.