How to reverse hashmap compression (index method) (Java) [duplicate]

Background of question
I have been developing some code that first reads a string and creates a file, then splits the string into an array, then gets the indexes of each word in the array and, finally, removes the duplicates and prints the result to a different file.
I have already written the code for this; here is a link: https://pastebin.com/gqWH0x0 (there is a menu system as well), but it is rather long, so I have refrained from including it in this question.
The compression method is done via hashmaps, getting indexes of the array and mapping them to the relevant word. Here is an example:
Original: "sea sea see sea see see"
Output: see[2, 4, 5],sea[0, 1, 3],
Question
The next stage is getting the output back into the original state. I am relatively new to Java, so I am not aware of the techniques required. The code should be able to take the output file (shown above) and turn it back into the original.
My current thinking is that you would just rewrite this hashmap (below). Would I be correct in thinking this? I thought I should check with Stack Overflow first!
Map<String, Set<Integer>> seaMap = new HashMap<>(); //new hashmap
for (int seaInt = 0; seaInt < sealist.length; seaInt++) {
if (seaMap.keySet().contains(sealist[seaInt])) {
Set<Integer> index = seaMap.get(sealist[seaInt]);
index.add(seaInt);
} else {
Set<Integer> index = new HashSet<>();
index.add(seaInt);
seaMap.put(sealist[seaInt], index);
}
}
System.out.print("Compressed: ");
seaMap.forEach((seawords, seavalues) -> System.out.print(seawords + seavalues + ","));
System.out.println("\n");
If anyone has any good ideas / answers then please let me know, I am really desperate for a solution!
Link to current code: https://pastebin.com/gqWH0x0K

First you will have to separate the words with their indexes from your compressed line. Using your example:
"see[2, 4, 5],sea[0, 1, 3],"
you obtain the following Strings:
"see[2, 4, 5]" and "sea[0, 1, 3]"
For each of them you must read the indexes, e.g. for the first:
2, 4 and 5
Now just write the word into an ArrayList (or array) at each given index.
For the first two steps you can use a regular expression to find each word and the index list. Then use String.split and Integer.parseInt to get all indexes.
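import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern pattern = Pattern.compile("(.*?)\\[(.*?)\\],");
String line = "see[2, 4, 5],sea[0, 1, 3],";
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
    String word = matcher.group(1);                  // e.g. "see"
    String[] indexes = matcher.group(2).split(", "); // e.g. "2", "4", "5"
    for (String str : indexes) {
        int index = Integer.parseInt(str);
        // place word at position index in the result list here
    }
}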
Now just check that the result List is big enough and set the word at the found indexes.
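The last step can be sketched as follows. This is a minimal illustration rather than the asker's actual code: the class name Decompress and the method decompress are made up for the example, and it assumes the compressed line has exactly the format shown above and that the original words were separated by single spaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Decompress {
    // Rebuilds the original text from a line such as "see[2, 4, 5],sea[0, 1, 3],".
    // Assumption: words were originally separated by single spaces.
    static String decompress(String line) {
        Pattern pattern = Pattern.compile("(.*?)\\[(.*?)\\],");
        Matcher matcher = pattern.matcher(line);
        List<String> words = new ArrayList<>();
        while (matcher.find()) {
            String word = matcher.group(1);
            for (String str : matcher.group(2).split(", ")) {
                int index = Integer.parseInt(str);
                // Grow the list until the target index exists.
                while (words.size() <= index) {
                    words.add("");
                }
                words.set(index, word);
            }
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // prints: sea sea see sea see see
        System.out.println(decompress("see[2, 4, 5],sea[0, 1, 3],"));
    }
}
```

Padding with empty strings keeps the list usable even when the entries arrive out of index order, which is the normal case with a HashMap.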

Related

Find words sequence in a document

Using Java (on Android), I am trying to find a fast way to solve this problem:
I have a list of words (around 10 to 30) and a document. The length of the document can vary too, from around 2500 to 10000 words. This document is part of a book.
What I want is to find in this document the string (sentence) that contains the highest number of the words in my list. The words in the document have to be in the same order as in my word list. Normally the words should not be far from each other in the document, with at most 2 or 3 words between consecutive words of my list.
To be clearer, let's take an example with small data.
My word list is :
harm piece work day
my document :
just so, with the greatest care. You must see to it that you pull up
regularly all the baobabs, at the very first moment when they can be
distinguished from the rosebushes which they resemble so closely in
their earliest youth. It is very tedious work," the little prince
added, "but very easy." And one day he said to me: "You ought to
make a beautiful drawing, so that the children where you live can see
exactly how all this is. That would be very useful to them if they
were to travel some day. Sometimes," he added, "there is no harm
in putting off a piece of work until another day. But
when it is a matter of baobabs, that always means a catastrophe. I
knew a planet that was inhabited by a lazy man. He neglected three
little bushes..." So, as the little prince described it to me, I
have made a drawing of that planet. I do not much like to take the
tone of a moralist. But the danger of the baobabs is so little
understood, and such considerable risks would be run by anyone who
might get lost on an asteroid, that for once I am breaking through my
reserve. "Children," I say plainly, "watch out for the baobabs!"
The goal is to find the string "there is no harm in putting off a piece of work until another day" in the document.
For now, the only way I can think of is:
1 - find the first occurrence of the first word of my list in the document.
2 - multiply the number of words in my list by 2 or 3 to get the length of the string I have to check in my document (to allow for the maximum number of words between the words of my list).
3 - search for occurrences of the other words of my list in this document string (of the length computed in step 2) by split and loop.
If I consider that not enough of my words occur in this string (maybe around 50%), then continue searching in the document starting from the next occurrence of the first word of my list.
But I'm afraid this could take far too long, especially because I'm working on a mobile device. So I'm here to gather ideas I may not have thought of, or libraries that could help me with this task. I thought about regular expressions too, but I'm not sure they would be a better way.
@gukoff's proposition
Since my word list cannot be in a different order than my text, the algorithm becomes simpler. The beginning of @gukoff's answer is enough; there is no need to implement the LIS algorithm or reverse the list.
//Section = input text
//wordsToFind = words to find in text separated by space
private ArrayList<ArrayList<Integer>> test1(String wordsToFind, Section section) {
//1. Create the index of your words array.
String[] wordsArray = wordsToFind.split(" ");
ArrayList<Integer> indexesSentences = new ArrayList<>();
ArrayList<ArrayList<Integer>> sentenceArrayIndexes = new ArrayList<>();
ArrayList<Integer> wordsToFindIndexes = new ArrayList<>();
for(Sentence sentence:section.getSentences()) {
indexesSentences.clear();
for(String sentenceWord:sentence.getWords()) {
wordsToFindIndexes.clear();
int j = 0;
for(String word:wordsArray) {
if(word.equals(sentenceWord)) {
wordsToFindIndexes.add(j+1);
}
j++;
}
//Collections.reverse(wordsToFindIndexes);
for(int idx:wordsToFindIndexes) {
indexesSentences.add(idx);
}
}
sentenceArrayIndexes.add((ArrayList<Integer>)indexesSentences.clone());
}
return sentenceArrayIndexes;
}
public class Section {
private ArrayList<Sentence> sentences;
public Section (String text) {
sentences = new ArrayList<>();
if(text == null || text.trim().isEmpty()) { // compare content, not references
throw new IllegalArgumentException("Text not valid");
}
String formattedText = text.trim().replaceAll("[^a-zA-Z. ]", "").toLowerCase();
String[] sentencesArray = formattedText.split("\\.");
for(String sentenceStr:sentencesArray) {
if(!sentenceStr.trim().isEmpty()) { // != on Strings compares references
sentences.add(new Sentence(sentenceStr));
}
}
}
public ArrayList<Sentence> getSentences() {
return sentences;
}
public void addSentence(Sentence sentence) {
sentences.add(sentence);
}
}
So, you have the words to be found and a text, which consists of sentences to be examined.
Create the index of your words array.
For example, if words = a dog is not a human:
{
"a": [1, 5],
"dog": [2],
"is": [3],
"not": [4],
"human": [6]
}
In every sentence replace every word by its index value in descending order. That said, "a" gets replaced by [5, 1], "human" gets replaced by [6] and "tree" gets replaced by [].
For example, the sentence "not a cat is a human" should turn into [4, 5,1, 3, 5,1, 6]
Find the Longest increasing subsequence(LIS) in every array. Essentially, LIS would be the longest sub-match of your words array in the sentence.
For example, LIS of [4, 5,1, 3, 5,1, 6] is [1, 3, 5, 6], which maps to the sub-match "a is a human".
But generally, in case the words shouldn't be very far from each other, I suggest to find LIS using dynamic programming with corresponding modifications.
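The three steps above can be sketched as follows. This is only an illustration under stated assumptions: the class and method names (LisMatch, buildIndex, encode, lisLength) are invented for the example, words are compared exactly, and the LIS is computed with the plain O(n^2) dynamic programme without any limit on the distance between matched words.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LisMatch {
    // Step 1: map each word to its 1-based positions in the word list.
    static Map<String, List<Integer>> buildIndex(String[] words) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < words.length; i++) {
            index.computeIfAbsent(words[i], k -> new ArrayList<>()).add(i + 1);
        }
        return index;
    }

    // Step 2: replace each sentence word by its positions in descending order.
    static List<Integer> encode(String[] sentence, Map<String, List<Integer>> index) {
        List<Integer> encoded = new ArrayList<>();
        for (String w : sentence) {
            List<Integer> positions =
                    new ArrayList<>(index.getOrDefault(w, Collections.emptyList()));
            Collections.reverse(positions);
            encoded.addAll(positions);
        }
        return encoded;
    }

    // Step 3: length of the longest strictly increasing subsequence, O(n^2).
    static int lisLength(List<Integer> seq) {
        int[] best = new int[seq.size()];
        int max = 0;
        for (int i = 0; i < seq.size(); i++) {
            best[i] = 1;
            for (int j = 0; j < i; j++) {
                if (seq.get(j) < seq.get(i)) {
                    best[i] = Math.max(best[i], best[j] + 1);
                }
            }
            max = Math.max(max, best[i]);
        }
        return max;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index = buildIndex("a dog is not a human".split(" "));
        List<Integer> encoded = encode("not a cat is a human".split(" "), index);
        System.out.println(encoded);            // [4, 5, 1, 3, 5, 1, 6]
        System.out.println(lisLength(encoded)); // 4, i.e. "a is a human"
    }
}
```

Writing the positions of a repeated word in descending order is what prevents a strictly increasing subsequence from using the same sentence word twice.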
Here is a simple approach which should be good enough given your document size:
1. Make an array (call it words) of size n, where n is the number of words in your document.
Now populate this array such that
words[i] = 0 if no word in your list matches this word
words[i] = k if the kth word in your list matches this word (1-based indexing)
Example: If your document is "there is no harm in putting off a piece of work until another day." and the word list is "work day harm piece" (in that order), then your words array will look like this: [0,0,0,3,0,0,0,0,4,0,1,0,0,2]
2. Now you will have an array of integers of size 2000~3000. You can use a variant of the Longest common subsequence problem or modify your algorithm a little to find the best match.
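Step 1 could look roughly like this. The sketch assumes the document has already been split into lower-case words without punctuation; the names WordPositions and mapDocument are hypothetical, and when a word appears more than once in the list, the first position is used (a choice the answer does not specify).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class WordPositions {
    // words[i] = k if the i-th document word equals the k-th list word (1-based), else 0.
    static int[] mapDocument(String[] document, String[] wordList) {
        Map<String, Integer> rank = new HashMap<>();
        for (int k = 0; k < wordList.length; k++) {
            // Assumption: for a repeated list word the first position wins.
            rank.putIfAbsent(wordList[k], k + 1);
        }
        int[] words = new int[document.length];
        for (int i = 0; i < document.length; i++) {
            words[i] = rank.getOrDefault(document[i], 0);
        }
        return words;
    }

    public static void main(String[] args) {
        String[] document =
                "there is no harm in putting off a piece of work until another day".split(" ");
        String[] wordList = "work day harm piece".split(" ");
        // prints: [0, 0, 0, 3, 0, 0, 0, 0, 4, 0, 1, 0, 0, 2]
        System.out.println(Arrays.toString(mapDocument(document, wordList)));
    }
}
```

Building the lookup map once keeps the pass over the document linear in the number of document words.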

Removing last comma or other delimiter when printing list to string [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
Let's assume that I have a datastructure with n elements (0, 1, ... , n-1).
And assume these elements have some property I want to print (assume an int). And I want to print the datastructure on one line, with a delimiter (comma, space). So, the result would be something like this
13, 14, 15, 16, 17
What is the best way to add the comma and space (in Java)?
These are approaches I've taken:
If the datastructure is Iterable, iterate from 0 to n-2 and append the delimiter to each of those elements, then print the last element.
Loop through all elements and append everything to an empty String (or StringBuilder), then at the very end remove the last delimiter (hardcoded).
I feel like there's something better.
Edit: The question is unique. The 'possible duplicate' involves only printing, not messing with delimiters.
Edit2: Attempts as requested
// 1st approach
public void printStuff(ArrayList<Integer> input) {
for (int i=0; i<input.size()-1; i++) {
System.out.print(input.get(i) + ", ");
}
System.out.println(input.get(input.size()-1));
}
Another attempt
// 2nd approach
public void printStuff(ArrayList<Integer> input) {
StringBuilder sb = new StringBuilder();
for (int i=0; i<input.size(); i++) {
sb.append(input.get(i)+", ");
}
if (sb.length() >= 2) sb.setLength(sb.length() - 2); // drop the trailing ", "
System.out.println(sb.toString());
}
Java 8's streaming APIs finally provide an elegant, native, way of doing this without resorting to ugly if-else structures or using third parties.
Assuming you have a list of MyClass objects, and assuming MyClass has a getProperty() access method that returns an int, you could do something like this:
List<MyClass> list = ...;
String concatenation = list.stream()
    .map(p -> String.valueOf(p.getProperty()))
    .collect(Collectors.joining(", "));
Use some Java 8 Stream magic:
List<Integer> list = Arrays.asList(1, 2, 3); // just so we have a list to work on
String result = list.stream()
    .map(it -> it.toString())
    .collect(Collectors.joining(", "));
Using an iterator I would check whether there are any items left.
If so, add a separator:
if (i.hasNext()) s.append(", ");
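In context, the iterator check could be used as in the following sketch. The class name JoinWithIterator and the method join are invented for the illustration; the variables i and s follow the one-liner above.

```java
import java.util.Arrays;
import java.util.Iterator;

class JoinWithIterator {
    // Appends the separator only when another element follows.
    static String join(Iterable<?> items, String separator) {
        StringBuilder s = new StringBuilder();
        Iterator<?> i = items.iterator();
        while (i.hasNext()) {
            s.append(i.next());
            if (i.hasNext()) {
                s.append(separator);
            }
        }
        return s.toString();
    }

    public static void main(String[] args) {
        // prints: 13, 14, 15, 16, 17
        System.out.println(join(Arrays.asList(13, 14, 15, 16, 17), ", "));
    }
}
```

This works for any Iterable, including an empty one, and never has to remove a trailing delimiter afterwards.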
You could use a framework, like this:
https://google-collections.googlecode.com/svn/trunk/javadoc/com/google/common/base/Joiner.html
Joiner.on(",").join(list)
You could remove the first element, add it to your String, and then append ", " followed by each remaining element:
List<String> lst = new ArrayList<String>();
// add elements
StringBuilder buffer = new StringBuilder();
buffer.append(lst.remove(0));
for(String s : lst) {
buffer.append(", ").append(s);
}

Exporting specific pattern of string using split method in a most efficient way

I want to extract the pattern of a bit stream held in a String variable. Assume our bit stream is something like bitStream="111000001010000100001111". I am looking for Java code that saves this bit stream in an array (say bitArray) such that each run of continuous "0"s or "1"s is stored in one array element. In this example the output would be something like this:
bitArray[0]="111"
bitArray[1]="00000"
bitArray[2]="1"
bitArray[3]="0"
bitArray[4]="1"
bitArray[5]="0000"
bitArray[6]="1"
bitArray[7]="0000"
bitArray[8]="1111"
I want to use bitArray to calculate the number of bits stored in each continuous run. For example, in this case the final output would be "3,5,1,1,1,4,1,4,4". I figured that the "split" method would probably solve this for me, but I don't know which splitting pattern to use: bitStream.split("1+") splits on runs of continuous "1"s, and bitStream.split("0+") splits on runs of continuous "0"s, but how can it split based on both?
Matthew suggested this solution and it works:
var wholeString = "111000001010000100001111";
wholeString = wholeString.replace('10', '1,0');
wholeString = wholeString.replace('01', '0,1');
stringSplit = wholeString.split(',');
My question is "Is this solution the most efficient one?"
Try replacing any occurrence of "01" and "10" with "0,1" and "1,0" respectively. Then once you've injected the commas, split the string using the comma as the delimiting character.
String wholeString = "111000001010000100001111";
wholeString = wholeString.replace("10", "1,0");
wholeString = wholeString.replace("01", "0,1");
String stringSplit[] = wholeString.split(",");
You can do this with a simple regular expression. It matches 1s and 0s and will return each in the order they occur in the stream. How you store or manipulate the results is up to you. Here is some example code.
String testString = "111000001010000100001111";
Pattern pattern = Pattern.compile("1+|0+");
Matcher matcher = pattern.matcher(testString);
while (matcher.find())
{
System.out.print(matcher.group().length());
System.out.print(" ");
}
This will result in the following output:
3 5 1 1 1 4 1 4 4
One option for storing the results is to put them in an ArrayList<Integer>
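A sketch of that option, reusing the same pattern. The class name RunLengths and the method runLengths are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RunLengths {
    // Collects the length of every run of identical bits, in order of appearance.
    static List<Integer> runLengths(String bits) {
        List<Integer> lengths = new ArrayList<>();
        Matcher matcher = Pattern.compile("1+|0+").matcher(bits);
        while (matcher.find()) {
            lengths.add(matcher.group().length());
        }
        return lengths;
    }

    public static void main(String[] args) {
        // prints: [3, 5, 1, 1, 1, 4, 1, 4, 4]
        System.out.println(runLengths("111000001010000100001111"));
    }
}
```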
Since the OP wanted the most efficient solution, I did some tests to see how long each answer takes to iterate over a large stream 10000 times and came up with the following results. The times differed from test to test, but the order from fastest to slowest remained the same. I know tick-based performance testing has its issues, like not accounting for system load, but I just wanted a quick test.
My answer completed in 1145 ms
Alessio's answer completed in 1202 ms
Matthew Lee Keith's answer completed in 2002 ms
Evgeniy Dorofeev's answer completed in 2556 ms
Hope this helps
I won't give you code, but I'll guide you to a possible solution:
Construct an ArrayList<Integer> and iterate over the array of bits. As long as you see 1's, increment a counter; as soon as you see a 0, add the counter to the ArrayList and start counting again (and likewise when a run of 0's ends). After this procedure you'll have an ArrayList that contains numbers, e.g. [1,2,2,3,4], representing the lengths of the runs of 1's and 0's.
Then you construct an array of the size of the ArrayList and fill it accordingly.
The time complexity is O(n) because you need to iterate over the array only once.
This code works for any String and any pattern, not only 1s and 0s. Iterate char by char; if the current char is equal to the previous one, append the current char to the last element of the List, otherwise create a new element in the list.
public List<String> getArray(String input){
List<String> output = new ArrayList<String>();
if(input==null || input.length()==0) return output;
int count = 0;
char [] inputA = input.toCharArray();
output.add(inputA[0]+"");
for(int i = 1; i <inputA.length;i++){
if(inputA[i]==inputA[i-1]){
String current = output.get(count)+inputA[i];
output.remove(count);
output.add(current);
}
else{
output.add(inputA[i]+"");
count++;
}
}
return output;
}
try this
String[] a = s.replaceAll("(.)(?!\\1)", "$1,").split(",");
I tried to implement @Maroun Maroun's solution.
public static void main(String args[]){
long start = System.currentTimeMillis();
String bitStream ="0111000001010000100001111";
int length = bitStream.length();
char base = bitStream.charAt(0);
ArrayList<Integer> counts = new ArrayList<Integer>();
int count = -1;
char currChar = ' ';
for (int i=0;i<length;i++){
currChar = bitStream.charAt(i);
if (currChar == base){
count++;
}else {
base = currChar;
counts.add(count+1);
count = 0;
}
}
counts.add(count+1);
System.out.println("Time taken :" + (System.currentTimeMillis()-start ) +"ms");
System.out.println(counts.toString());
}
I believe this is the more efficient way: as he said, it is O(n), because you iterate only once. Since the goal is only to get the counts, not to store the runs in an array, I would recommend this. Even a regular expression has to iterate over the input internally anyway.
The resulting output is
Time taken :0ms
[1, 3, 5, 1, 1, 1, 4, 1, 4, 4]
Try this one:
String[] parts = input.split("(?<=1)(?=0)|(?<=0)(?=1)");
See in action here: http://rubular.com/r/qyyfHNAo0T

How to find all permutations of a given word in a given text?

This is an interview question (phone screen): write a function (in Java) to find all permutations of a given word that appear in a given text. For example, for word abc and text abcxyaxbcayxycab the function should return abc, bca, cab.
I would answer this question as follows:
Obviously I can loop over all permutations of the given word and use a standard substring function. However it might be difficult (for me right now) to write code to generate all word permutations.
It is easier to loop over all text substrings of the word size, sort each substring and compare it with the "sorted" given word. I can code such a function immediately.
I can probably modify some substring search algorithm but I do not remember these algorithms now.
How would you answer this question?
This is probably not the most efficient solution algorithmically, but it is clean from a class design point of view. This solution takes the approach of comparing "sorted" given words.
We can say that a word is a permutation of another if it contains the same letters in the same number. This means that you can convert the word from a String to a Map<Character,Integer>. Such conversion will have complexity O(n) where n is the length of the String, assuming that insertions in your Map implementation cost O(1).
The Map will contain as keys all the characters found in the word and as values the frequencies of the characters.
Example. abbc is converted to [a->1, b->2, c->1]
bacb is converted to [a->1, b->2, c->1]
So if you need to know whether two words are permutations of each other, you can convert them both into maps and then invoke Map.equals.
Then you have to iterate over the text string and apply the transformation to all the substrings of the same length of the words that you are looking for.
Improvement proposed by Inerdial
This approach can be improved by updating the Map in a "rolling" fashion.
I.e. if you're matching at index i=3 in the example haystack in the OP (the substring xya), the map will be [a->1, x->1, y->1]. When advancing in the haystack, decrement the character count for haystack[i], and increment the count for haystack[i+needle.length()].
(Dropping zeroes to make sure Map.equals() works, or just implementing a custom comparison.)
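A possible Java rendering of the rolling map, as a sketch only. The names RollingMapSearch, counts and findPermutations are invented here; zero counts are removed from the window map so that Map.equals can be used directly, which is one of the two options mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RollingMapSearch {
    // Character frequencies of s.
    static Map<Character, Integer> counts(String s) {
        Map<Character, Integer> m = new HashMap<>();
        for (char c : s.toCharArray()) {
            m.merge(c, 1, Integer::sum);
        }
        return m;
    }

    // Returns every substring of haystack that is a permutation of needle.
    static List<String> findPermutations(String needle, String haystack) {
        List<String> found = new ArrayList<>();
        int n = needle.length();
        if (n == 0 || haystack.length() < n) {
            return found;
        }
        Map<Character, Integer> target = counts(needle);
        Map<Character, Integer> window = counts(haystack.substring(0, n));
        for (int i = 0; ; i++) {
            if (window.equals(target)) {
                found.add(haystack.substring(i, i + n));
            }
            if (i + n >= haystack.length()) {
                break;
            }
            // Slide the window: drop haystack[i] (removing the key at zero)...
            window.computeIfPresent(haystack.charAt(i), (k, v) -> v == 1 ? null : v - 1);
            // ...and take in haystack[i + n].
            window.merge(haystack.charAt(i + n), 1, Integer::sum);
        }
        return found;
    }

    public static void main(String[] args) {
        // prints: [abc, bca, cab]
        System.out.println(findPermutations("abc", "abcxyaxbcayxycab"));
    }
}
```

Each step still compares two maps, so the cost per position depends on the number of distinct characters in the needle rather than on its length.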
Improvement proposed by Max
What if we also introduce matchedCharactersCnt variable? At the beginning of the haystack it will be 0. Every time you change your map towards the desired value - you increment the variable. Every time you change it away from the desired value - you decrement the variable. Each iteration you check if the variable is equal to the length of needle. If it is - you've found a match. It would be faster than comparing the full map every time.
Pseudocode provided by Max:
needle = "abbc"
haystack = "abbcbbabbcaabbca"
needleSize = needle.length()
//Map of needle character counts (reading a missing key yields 0)
targetMap = [a->1, b->2, c->1]
matchedLength = 0
curMap = [a->0, b->0, c->0]
//Initial map initialization
for (int i=0;i<needle.length();i++) {
    //Only count a character while we still need more of it
    if (curMap[haystack[i]] < targetMap[haystack[i]]) {
        matchedLength++
    }
    curMap[haystack[i]]++
}
if (matchedLength == needleSize) {
    System.out.println("Match found at: 0");
}
//Search itself
for (int i=0;i<haystack.length()-needle.length();i++) {
    int targetValue1 = targetMap[haystack[i]]; //Reading from hashmap, O(1)
    int curValue1 = curMap[haystack[i]]; //Another read
    //If we are removing a beneficial character
    if (targetValue1 > 0 && curValue1 > 0 && curValue1 <= targetValue1) {
        matchedLength--;
    }
    curMap[haystack[i]] = curValue1 - 1; //Write to hashmap, O(1): the character leaves the window
    int targetValue2 = targetMap[haystack[i+needle.length()]] //Read
    int curValue2 = curMap[haystack[i+needle.length()]] //Read
    //We are adding a beneficial character
    if (targetValue2 > 0 && curValue2 < targetValue2) { //Only counts if we still need more of this letter
        matchedLength++;
    }
    curMap[haystack[i+needle.length()]] = curValue2 + 1; //Write
    if (matchedLength == needleSize) {
        System.out.println("Match found at: "+(i+1));
    }
}
//Basically with 4 reads and 2 writes which are
//independent of the size of the needle,
//we get to the maximal possible performance: O(n)
To find a permutation of a string you can use number theory.
But you will have to know the 'theory' behind this algorithm in advance before you can answer the question using this algorithm.
There is a method where you can calculate a hash of a string using prime numbers.
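Every permutation of the same string gives the same hash value. Note, however, that a different string can occasionally produce the same value (a weighted sum is not guaranteed to be collision-free), so a matching hash is best confirmed by comparing the characters; multiplying the primes instead of summing gives a unique value per collection of letters as long as the product does not overflow.
The hash value is calculated as c1 * p1 + c2 * p2 + ... + cn * pn
where ci is a unique value for the current char in the string and pi is a unique prime number assigned to that char.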
Here is the implementation.
public class Main {
static int[] primes = new int[] { 2, 3, 5, 7, 11, 13, 17,
19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71,
73, 79, 83, 89, 97, 101, 103 };
public static void main(String[] args) {
final char[] text = "abcxaaabbbccyaxbcayaaaxycab"
.toCharArray();
char[] abc = new char[]{'a','b','c'};
int match = val(abc);
for (int i = 0; i < text.length - 2; i++) {
char[] _123 = new char[]{text[i],text[i+1],text[i+2]};
if(val(_123)==match){
System.out.println(new String(_123) );
}
}
}
static int p(char c) {
return primes[(int)c - (int)'a'];
}
static int val(char[] cs) {
return
p(cs[0])*(int)cs[0] + p(cs[1])*(int)cs[1] + p(cs[2])*(int)cs[2];
}
}
The output of this is:
abc
bca
cab
You should be able to do this in a single pass. Start by building a map that contains all the characters in the word you're searching for. So initially the map contains [a, b, c].
Now, go through the text one character at a time. The loop looks something like this, in pseudo-code.
found_string = "";
for each character in text
if character is in map
remove character from map
append character to found_string
if map is empty
output found_string
found_string = ""
add all characters back to map
end if
else
// not a permutation of the string you're searching for
refresh map with characters from found_string
found_string = ""
end if
end for
If you want unique occurrences, change the output step so that it adds the found strings to a map. That'll eliminate duplicates.
There's the issue of words that contain duplicated letters. If that's a problem, make the key the letter and the value a count. 'Removing' a character means decrementing its count in the map. If the count goes to 0, then the character is in effect removed from the map.
The algorithm as written won't find overlapping occurrences. That is, given the text abcba, it will only find abc. If you want to handle overlapping occurrences, you can modify the algorithm so that when it finds a match, it decrements the index by one minus the length of the found string.
That was a fun puzzle. Thanks.
This is what I would do - set up a flag array with one
element equal to 0 or 1 to indicate whether that character
in STR had been matched
Set the first result string RESULT to empty.
for each character C in TEXT:
Set an array X equal to the length of STR to all zeroes.
for each character S in STR:
If C is the JTH character in STR, and
X[J] == 0, then set X[J] <= 1 and add
C to RESULT.
If the length of RESULT is equal to STR,
add RESULT to a list of permutations
and set the elements of X[] to zeroes again.
If C is not any character J in STR having X[J]==0,
then set the elements of X[] to zeroes again.
The second approach seems very elegant to me and should be perfectly acceptable. I think it scales at O(M * N log N), where N is word length and M is text length.
I can come up with a somewhat more complex O(M) algorithm:
1. Count the occurrence of each character in the word
2. Do the same for the first N (i.e. length(word)) characters of the text
3. Subtract the two frequency vectors, yielding subFreq
4. Count the number of non-zeroes in subFreq, yielding numDiff
5. If numDiff equals zero, there is a match
6. Update subFreq and numDiff in constant time by updating for the first and after-last character in the text
7. Go to 5 until reaching the end of the text
EDIT: See that several similar answers have been posted. Most of this algorithm is equivalent to the rolling frequency counting suggested by others. My humble addition is also updating the number of differences in a rolling fashion, yielding an O(M+N) algorithm rather than an O(M*N) one.
EDIT2: Just saw that Max has basically suggested this in the comments, so brownie points to him.
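The seven steps could be sketched in Java as below. The names FrequencyDiffSearch, findMatches and add are invented for the illustration, and the frequency vector is a plain int array indexed by char value, which is an implementation choice and not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;

class FrequencyDiffSearch {
    // Returns the start index of every window of text that is a permutation of word.
    static List<Integer> findMatches(String word, String text) {
        List<Integer> starts = new ArrayList<>();
        int n = word.length();
        if (n == 0 || text.length() < n) {
            return starts;
        }
        // subFreq[c] = (count of c in the window) - (count of c in the word)
        int[] subFreq = new int[Character.MAX_VALUE + 1];
        int numDiff = 0; // number of non-zero entries in subFreq
        for (int i = 0; i < n; i++) {
            numDiff += add(subFreq, word.charAt(i), -1);
            numDiff += add(subFreq, text.charAt(i), +1);
        }
        for (int start = 0; ; start++) {
            if (numDiff == 0) {
                starts.add(start);
            }
            if (start + n >= text.length()) {
                break;
            }
            // Constant-time update: first character leaves, after-last character enters.
            numDiff += add(subFreq, text.charAt(start), -1);
            numDiff += add(subFreq, text.charAt(start + n), +1);
        }
        return starts;
    }

    // Applies delta (+1 or -1) to subFreq[c]; returns the change in the number of non-zero entries.
    static int add(int[] subFreq, char c, int delta) {
        int before = subFreq[c];
        subFreq[c] = before + delta;
        if (before == 0) {
            return 1;  // entry became non-zero
        }
        if (subFreq[c] == 0) {
            return -1; // entry became zero
        }
        return 0;
    }

    public static void main(String[] args) {
        // prints: [0, 7, 13]
        System.out.println(findMatches("abc", "abcxyaxbcayxycab"));
    }
}
```

Because numDiff is maintained incrementally, each position costs a fixed number of array reads and writes regardless of the word length.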
This code should do the work:
import java.util.ArrayList;
import java.util.List;
public class Permutations {
public static void main(String[] args) {
final String word = "abc";
final String text = "abcxaaabbbccyaxbcayxycab";
List<Character> charsActuallyFound = new ArrayList<Character>();
StringBuilder match = new StringBuilder(3);
for (Character c : text.toCharArray()) {
if (word.contains(c.toString()) && !charsActuallyFound.contains(c)) {
charsActuallyFound.add(c);
match.append(c);
if (match.length()==word.length())
{
System.out.println(match);
match = new StringBuilder(3);
charsActuallyFound.clear();
}
} else {
match = new StringBuilder(3);
charsActuallyFound.clear();
}
}
}
}
The charsActuallyFound List is used to keep track of the characters already found in the loop. It is needed to avoid matching "aaa", "bbb" and "ccc" (added by me to the text you specified).
After further reflection, I think my code only works if the given word has no duplicate characters.
The code above correctly prints
abc
bca
cab
but if you search for the word "aaa", then nothing is printed, because each char cannot be matched more than once. Inspired by Jim Mischel's answer, I edited my code, ending up with this:
import java.util.ArrayList;
import java.util.List;
public class Permutations {
public static void main(String[] args) {
final String text = "abcxaaabbbccyaxbcayaaaxycab";
printMatches("aaa", text);
printMatches("abc", text);
}
private static void printMatches(String word, String text) {
System.out.println("matches for "+word +" in "+text+":");
StringBuilder match = new StringBuilder(3);
StringBuilder notYetFounds=new StringBuilder(word);
for (Character c : text.toCharArray()) {
int idx = notYetFounds.indexOf(c.toString());
if (idx!=-1) {
notYetFounds.replace(idx,idx+1,"");
match.append(c);
if (match.length()==word.length())
{
System.out.println(match);
match = new StringBuilder(3);
notYetFounds=new StringBuilder(word);
}
} else {
match = new StringBuilder(3);
notYetFounds=new StringBuilder(word);
}
}
System.out.println();
}
}
This gives me the following output:
matches for aaa in abcxaaabbbccyaxbcayaaaxycab:
aaa
aaa
matches for abc in abcxaaabbbccyaxbcayaaaxycab:
abc
bca
cab
I ran a quick benchmark: the code above found 30815 matches of "abc" in a random string of 36M characters in just 4.5 seconds. As Jim already said, thanks for this puzzle...
