How should I find repeated word sequences

I need to detect the presence of multiple blocks of columnar data given only their headings. Nothing else is known about the data except the heading words, which are different for every set of data.
Importantly, it is not known beforehand how many words are in each block nor, therefore, how many blocks there are.
Equally important, the word list is always relatively short: fewer than 20 words.
So, given a list or array of heading words such as:
Opt
Object
Type
Opt
Object
Type
Opt
Object
Type
what's the most processing-efficient way to determine that it consists entirely of the repeating sequence:
Opt
Object
Type
It must be an exact match, so my first thought is to search from index 1 onward for matches to element [0], calling their indexes n, m, and so on. Then, if the matches are equidistant, check that [1] == [n+1] == [m+1], and [2] == [n+2] == [m+2], etc.
EDIT: It must work for word sets where some of the words are themselves repeated within a block, so
Opt
Opt
Object
Opt
Opt
Object
is two repetitions of
Opt
Opt
Object

If the list is made of x repeating groups such that each group contains n elements...
We know there is at least 1 group, so we first check whether there are 2 repeating groups by comparing the first half of the list with the second half.
1) If the halves are equal, we know that the number of groups has 2 as a factor.
2) If they are not equal, we move to the next prime number that divides the total number of words...
At each step we check the sub-lists for equality; if they are all equal, we know we have a solution with that factor in it.
We want to return the shortest list of words that repeats, so after the first prime for which the sub-lists are equal we keep splitting.
So we apply the same procedure to one sub-list, knowing all sub-lists are equal. The problem is therefore best solved recursively; that is, we only need to consider the current sub-list in isolation.
The solution will be extremely efficient if loaded with a short table of primes. Beyond the table it would be necessary to compute them, but the list would have to be very long for that to matter, even if the table holds only a few dozen primes.
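The prime-factor idea above can be sketched roughly as follows. This is only an illustration under some assumptions: the names PrimeFactorUnit, hasPeriod and unitLength are made up for the example, the loop is iterative rather than recursive, and the primes are found by trial division instead of a lookup table, which should be fine for lists of fewer than 20 words.

```java
import java.util.Arrays;
import java.util.List;

class PrimeFactorUnit {
    /** True if every word equals the word `period` positions before it. */
    static boolean hasPeriod(List<String> words, int period) {
        for (int i = period; i < words.size(); i++) {
            if (!words.get(i).equals(words.get(i - period))) {
                return false;
            }
        }
        return true;
    }

    /** Length of the shortest block that, repeated whole, yields `words`. */
    static int unitLength(List<String> words) {
        int unit = words.size();       // current candidate block length
        int remaining = words.size();  // part of the size not yet factored
        for (int p = 2; p <= remaining; p++) {
            while (remaining % p == 0) {       // p is a prime factor of the size
                remaining /= p;
                if (hasPeriod(words, unit / p)) {
                    unit /= p;                 // the block splits into p equal parts
                }
            }
        }
        return unit;
    }

    public static void main(String[] args) {
        List<String> headings = Arrays.asList(
                "Opt", "Object", "Type", "Opt", "Object", "Type", "Opt", "Object", "Type");
        System.out.println(headings.subList(0, unitLength(headings))); // prints [Opt, Object, Type]
    }
}
```

A result equal to the full list length means no repetition was found. The same call returns 3 for Opt, Opt, Object, Opt, Opt, Object, so words repeated inside a block are handled.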

Can the unit sequence contain repetitions of its own? Do you know the length of the unit sequence?
e.g.
ABCABCABCDEFABCABCABCDEFABCABCABCDEF
where the unit sequence is ABCABCABCDEF
If the answer is yes, you've got a difficult problem, I think, unless you know the length of the unit sequence (in which case the solution is trivial: you just make a state machine that first stores the unit sequence, then verifies that each element of the rest of the sequence corresponds to the matching element of the unit sequence).
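For the known-length case, a minimal sketch could look like the following. The class and method names are hypothetical, and instead of storing the unit sequence separately it compares each element with the one at the same offset in the first block.

```java
import java.util.List;
import java.util.Objects;

class KnownUnitCheck {
    /**
     * True if `sequence` is made of whole copies of its first `unitLength` elements.
     * An empty sequence counts as zero copies and therefore passes.
     */
    static <T> boolean isRepetitionOf(List<T> sequence, int unitLength) {
        if (unitLength <= 0 || sequence.size() % unitLength != 0) {
            return false; // a partial trailing block is not a complete repetition
        }
        for (int i = unitLength; i < sequence.size(); i++) {
            // i % unitLength is the matching position inside the first block
            if (!Objects.equals(sequence.get(i), sequence.get(i % unitLength))) {
                return false;
            }
        }
        return true;
    }
}
```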
If the answer is no, use this variant of Floyd's cycle-finding algorithm to identify the unit sequence:
Initialize pointers P1 and P2 to the beginning of the sequence.
For each new element, increment pointer P1 every time, and increment pointer P2 every other time (keep a counter around to do this).
If P1 points to an element identical to the one at P2, you've found a unit sequence.
Now repeat through the rest of the sequence to verify that it consists of duplicates.
UPDATE: you've clarified your problem to state that the unit sequence may contain repetitions of its own. In this case, use the cycle-finding algorithm, but it's only guaranteed to find potential cycles. Keep it running throughout the length of the sequence, and use the following state machine, starting in state 1:
State 1: no cycle found that works; keep looking. When the cycle-finding algorithm finds a potential cycle, verify that you've gotten 2 copies of a preliminary unit sequence from P, and go to state 2. If you reach the end of the input, go to state 4.
State 2: preliminary unit sequence found. Run through the input as long as the cycle repeats identically. If you reach the end of the input, go to state 3. If you find an input element that is different from the corresponding element of the unit sequence, go back to state 1.
State 3: The input is a repetition of a unit sequence if the end of the input consists of complete repetitions of the unit sequence. (If it ends midway through a unit sequence, e.g. ABCABCABCABCAB, then a unit sequence was found, but the input does not consist of complete repetitions.)
State 4: No unit sequence found.
In my example (repeating ABCABCABCDEF) the algorithm starts by finding ABCABC, which would put it in state 2, and it would stay there until it hit the first DEF, which would put it back in state 1, then probably jump back and forth between states 1 and 2, until it reached the 2nd ABCABCABCDEF, at which point it would re-enter state 2, and at the end of the input it would be in state 3.

A better answer than my other one: a Java implementation which works, should be straightforward to understand, and is generic:
package com.example.algorithms;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
interface Processor<T> {
    public void process(T element);
}
public class RepeatingListFinder<T> implements Processor<T> {
    private List<T> unit_sequence = new ArrayList<T>();
    private int repeat_count = 0;
    private int partial_matches = 0;
    private Iterator<T> iterator = null;
    /* Class invariant:
     *
     * The sequence of elements passed through process()
     * can be expressed as the concatenation of
     * the unit_sequence repeated "repeat_count" times,
     * plus the first "partial_matches" of the unit_sequence.
     *
     * The iterator points to the remaining elements of the unit_sequence,
     * or null if there have not been any elements processed yet.
     */
    public void process(T element) {
        if (unit_sequence.isEmpty() || !iterator.next().equals(element))
        {
            revise_unit_sequence(element);
            iterator = unit_sequence.iterator();
            repeat_count = 1;
            partial_matches = 0;
        }
        else if (!iterator.hasNext())
        {
            iterator = unit_sequence.iterator();
            ++repeat_count;
            partial_matches = 0;
        }
        else
        {
            ++partial_matches;
        }
    }
    /* Unit sequence has changed.
     * Restructure and add the new non-matching element.
     */
    private void revise_unit_sequence(T element) {
        if (repeat_count > 1 || partial_matches > 0)
        {
            List<T> new_sequence = new ArrayList<T>();
            for (int i = 0; i < repeat_count; ++i)
                new_sequence.addAll(unit_sequence);
            new_sequence.addAll(
                unit_sequence.subList(0, partial_matches));
            unit_sequence = new_sequence;
        }
        unit_sequence.add(element);
    }
    public List<T> getUnitSequence() {
        return Collections.unmodifiableList(unit_sequence);
    }
    public int getRepeatCount() { return repeat_count; }
    public int getPartialMatchCount() { return partial_matches; }
    public String toString()
    {
        return "("+getRepeatCount()
            +(getPartialMatchCount() > 0
                ? (" "+getPartialMatchCount()
                    +"/"+unit_sequence.size())
                : "")
            +") x "+unit_sequence;
    }
    /********** static methods below for testing **********/
    static public List<Character> stringToCharList(String s)
    {
        List<Character> result = new ArrayList<Character>();
        for (char c : s.toCharArray())
            result.add(c);
        return result;
    }
    static public <T> void test(List<T> list)
    {
        RepeatingListFinder<T> listFinder
            = new RepeatingListFinder<T>();
        for (T element : list)
            listFinder.process(element);
        System.out.println(listFinder);
    }
    static public void test(String testCase)
    {
        test(stringToCharList(testCase));
    }
    static public void main(String[] args)
    {
        test("ABCABCABCABC");
        test("ABCDFTBAT");
        test("ABABA");
        test("ABACABADABACABAEABACABADABACABAEABACABADABAC");
        test("ABCABCABCDEFABCABCABCDEFABCABCABCDEF");
        test("ABABCABABCABABDABABDABABC");
    }
}
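This is a stream-oriented approach (with O(N) execution time and O(N) worst-case space requirements). If the List<T> to be processed already exists in memory, it should be possible to rewrite this class to process the List<T> without any additional space requirements: just keep track of the repeat count and partial match count, and use List.subList() to create a unit sequence that is a view of the first K elements of the input list. One caveat: the class always maintains its invariant, but the unit sequence it reports is not always the shortest one. For the input ABAABA, for example, it reports ABAAB plus one partial match rather than two repetitions of ABA, so verify the result when the shortest unit matters.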
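A rough sketch of that in-memory variant might look like this. The class name InPlaceUnitFinder and the method unitView are invented for the example; it mirrors the matching logic of the streaming class above, not a different algorithm.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

class InPlaceUnitFinder {
    /**
     * Returns a view of the first K elements of `input`, where K is the unit
     * length found by the same rule as the streaming class: whenever an element
     * differs from the one expected at its position, the whole prefix so far
     * becomes the new unit sequence.
     */
    static <T> List<T> unitView(List<T> input) {
        int k = 0; // current unit length
        for (int i = 0; i < input.size(); i++) {
            // input.get(i % k) is the element the current unit predicts at position i
            if (k == 0 || !Objects.equals(input.get(i % k), input.get(i))) {
                k = i + 1;
            }
        }
        return input.subList(0, k); // a view, no copy of the elements
    }

    public static void main(String[] args) {
        List<String> in = Arrays.asList("A", "B", "C", "A", "B", "C", "A");
        List<String> unit = unitView(in);
        int repeatCount = in.size() / unit.size();
        int partialMatches = in.size() % unit.size();
        System.out.println("(" + repeatCount + " +" + partialMatches + ") x " + unit);
    }
}
```

The repeat count and partial match count fall out of the list size and K, so no extra state is kept beyond one integer.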

My solution, which works as desired, is perhaps naive. It does have the advantage of being simple.
String[] wta; // word text array
...
// ivl (the detected block length) is assumed to be declared in the code elided above
INTERVAL:
for(int xa=1,max=(wta.length/2); xa<=max; xa++) {
    if((wta.length%xa)!=0) { continue; } // ignore intervals which don't divide evenly into the words
    for(int xb=0; xb<xa; xb++) { // iterate the words within the current interval
        for(int xc=xb+xa; xc<wta.length; xc+=xa) { // iterate the corresponding words in each section
            if(!wta[xb].equalsIgnoreCase(wta[xc])) { continue INTERVAL; } // not a cycle
        }
    }
    ivl=xa; // smallest interval that repeats across the whole array
    break;
}

Related

When to use for, while, or do-while loops/ How to start it

I started learning about loops and the different types today. My question is: in this situation, which type should I use, and what would be the advantage over the others? After looking over my lecture notes it seems that do-while should always be used, but I'm certain that is not the case.
Also, how would I start the first method, the one about returning a sum of the "given array"? What is the given array? Is it just whatever I'm supposed to be plugging into the run arguments line?
public class SumMinMaxArgs {
    // TODO - write your code below this comment.
    // You will need to write three methods:
    //
    // 1.) A method named sumArray, which will return the sum
    //     of the given array. If the given array is empty,
    //     it should return a sum of 0.
    //
    // 2.) A method named minArray, which will return the
    //     smallest element in the given array. You may
    //     assume that the array contains at least one element.
    //     You may use your min method defined in lab 6, or
    //     Java's Math.min method.
    //
    // 3.) A method named maxArray, which will return the
    //     largest element in the given array. You may
    //     assume that the array contains at least one element.
    //     You may use your max method defined in lab 6, or
    //     Java's Math.max method.
    //
    // DO NOT MODIFY parseStrings!
    public static int[] parseStrings(String[] strings) {
        int[] retval = new int[strings.length];
        for (int x = 0; x < strings.length; x++) {
            retval[x] = Integer.parseInt(strings[x]);
        }
        return retval;
    }
    // DO NOT MODIFY main!
    public static void main(String[] args) {
        int[] ints = parseStrings(args);
        System.out.println("Sum: " + sumArray(ints));
        System.out.println("Min: " + minArray(ints));
        System.out.println("Max: " + maxArray(ints));
    }
}

All three forms have exactly the same expressive power. What you use in a certain situation depends on style, convention, and convenience, much as you can express the same meaning with different English sentences.
That said, do-while is mostly used when the loop should run at least once (i.e. the condition is checked only after the first iteration).
for is mostly used when you are iterating over some collection or index range.

The four kinds of loops supported in Java:
C-style for loop: for (int i = 0 ; i < list.size() ; ++i) { ... } when you want to access the index of some kind of list or array directly, or to do an operation multiple times.
foreach loop: for (Customer c : customers) { ... } when you want to iterate over a collection, but don't care about the index.
while loop: while (some_condition) { ... } when some code must be executed as long as the condition is true. If the condition is false to start with, the code inside the block (i.e. inside the braces) will never be executed.
do while loop: do { statement1; } while (condition); will execute statement1 once even if the condition is false to begin with, and will keep repeating it only while the condition is true.
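To make the comparison concrete, here is a small sketch that sums the same int array with each of the four loop forms. The class and method names are invented for the example; the "given array" is simply the int[] parameter, which in the assignment comes from the parsed command-line arguments.

```java
class LoopForms {
    // C-style for: the index is available if needed
    static int sumWithFor(int[] values) {
        int sum = 0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i];
        }
        return sum;
    }

    // foreach: shortest form when the index is not needed
    static int sumWithForEach(int[] values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // while: the condition is checked before every pass, so an empty array is safe
    static int sumWithWhile(int[] values) {
        int sum = 0;
        int i = 0;
        while (i < values.length) {
            sum += values[i];
            i++;
        }
        return sum;
    }

    // do-while: the body always runs once, so the empty array needs its own guard
    static int sumWithDoWhile(int[] values) {
        if (values.length == 0) {
            return 0;
        }
        int sum = 0;
        int i = 0;
        do {
            sum += values[i];
            i++;
        } while (i < values.length);
        return sum;
    }
}
```

All four return the same result; the extra guard in the do-while version is the kind of thing that makes it a poor default for array traversal.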

Determining Base case of a recursive java program

This code generates the powerset of a set of numbers. For example if we have (0,1,2) the power set is {(0,1,2),(0,2),(1,2),(0,1),(2),(1),(0),()}
public static List<List<Integer>> generatePowerSet(List<Integer> inputSet){
    List<List<Integer>> powerSet = new ArrayList<>();
    directedPowerSet(inputSet,0,new ArrayList<Integer>(), powerSet);
    return powerSet;
}
public static void directedPowerSet(List<Integer> inputSet, int toBeSelected, List<Integer> selectedSoFar,List<List<Integer>> powerSet){
    if(toBeSelected == inputSet.size()){
        powerSet.add(new ArrayList<Integer>(selectedSoFar));
        return;
    }
    //Generate all subsets that contain inputSet[toBeSelected].
    selectedSoFar.add(inputSet.get(toBeSelected));
    directedPowerSet(inputSet,toBeSelected+1,selectedSoFar,powerSet);
    //Generate all subsets that do not contain inputSet[toBeSelected].
    selectedSoFar.remove(selectedSoFar.size()-1);
    directedPowerSet(inputSet,toBeSelected+1,selectedSoFar,powerSet);
}
Why is the base case when toBeSelected == inputSet.size()?

The recursive code goes through valid indexes into inputSet list one-by-one, starting at zero. The current invocation uses toBeSelected as an index into inputSet, and passes toBeSelected+1 to the invocation of the next level.
Therefore, the meaning of the base case is that there is nothing else to be selected, which happens when toBeSelected becomes invalid.
The last valid value of toBeSelected is inputSet.size()-1; toBeSelected==inputSet.size() detects the first invalid value of toBeSelected, serving as a base case for the recursion.
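To see the base case in action, here is a self-contained sketch of the same recursion that can be run directly. The class name PowerSetTrace and the method names are made up for the example; the logic follows the code in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PowerSetTrace {
    static List<List<Integer>> subsets(List<Integer> inputSet) {
        List<List<Integer>> out = new ArrayList<>();
        choose(inputSet, 0, new ArrayList<>(), out);
        return out;
    }

    private static void choose(List<Integer> inputSet, int index,
                               List<Integer> chosen, List<List<Integer>> out) {
        if (index == inputSet.size()) {
            // Base case: a decision has been made for every index, so `chosen`
            // is one complete subset. Copy it, because `chosen` keeps changing.
            out.add(new ArrayList<>(chosen));
            return;
        }
        chosen.add(inputSet.get(index));   // branch 1: include inputSet[index]
        choose(inputSet, index + 1, chosen, out);
        chosen.remove(chosen.size() - 1);  // branch 2: exclude inputSet[index]
        choose(inputSet, index + 1, chosen, out);
    }

    public static void main(String[] args) {
        for (List<Integer> subset : subsets(Arrays.asList(0, 1, 2))) {
            System.out.println(subset);
        }
    }
}
```

For (0,1,2) the base case is reached 8 times, once per subset, in the order [0, 1, 2], [0, 1], [0, 2], [0], [1, 2], [1], [2], []. Each line corresponds to one complete chain of include/exclude decisions.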

Because the code is trying to build the power set of an n-element set, starting with the 0-element empty set, then moving to 1-element sets, then 2-element sets, and so on.
When should this end? When you are finally trying to build an n-element set, because there is only one, and that's the input set itself.

java multithread loop with collecting results

Sorry for the limited code; I have almost no idea how to do this, and parts of the code are not real code, just an explanation of what I need. The base is:
List<Double> resultTopTen = new ArrayList<>();
List<Double> conditions = new ArrayList<>(); // this list can be very large (a million+ entries); it gets filled by different code
for (int i = 0; i < conditions.size(); i++) { // multithread this
    double loopResult = conditions.get(i) + 5;
    if (resultTopTen.size() < 10) {
        resultTopTen.add(loopResult);
    }
    else {
        // This part I don't know: if this loopResult belongs to the top 10 loopResults so far (just by size), replace the smallest one with the current one, so that resultTopTen is up to date at this point of the loop.
    }
}
The loopResult = conditions.get(i) + 5; part is just an example. The real calculation is different, and in fact the values are not even doubles, so it is not possible simply to sort conditions and go from there.
The for loop over conditions means I have to iterate through the input condition list, execute the calculation, and get a result for every condition in the list. It doesn't have to be done in order at all.
The multithreading part is the thing I really have no idea how to do. Because the conditions list is really large, I would like to process it in parallel; if I run it as shown, as a simple loop in one thread, my computing resources won't be fully utilized. The trick is how to split the conditions and then collect the results. For simplicity, if I wanted to do it in 2 threads, I would split conditions in half and have the first thread run the loop over the first half and the second thread over the second half. I would get 2 resultTopTen lists, which I can merge afterwards. But it would be much better to split the work into as many threads as system resources allow (for example, until CPU utilization or RAM usage reaches 90%). Is that possible?

Use a Java 8 parallel stream.
static class TopN<T> {
    final TreeSet<T> max;
    final int size;
    TopN(int size, Comparator<T> comparator) {
        this.max = new TreeSet<>(comparator);
        this.size = size;
    }
    void add(T n) {
        max.add(n);
        if (max.size() > size)
            max.remove(max.last());
    }
    void combine(TopN<T> o) {
        for (T e : o.max)
            add(e);
    }
}
public static void main(String[] args) {
    List<Double> conditions = new ArrayList<>();
    // add elements to conditions
    TopN<Double> maxN = conditions.parallelStream()
        .map(d -> d + 5) // some calculation
        .collect(() -> new TopN<Double>(10, (a, b) -> Double.compare(a, b)),
            TopN::add, TopN::combine);
    System.out.println(maxN.max);
}
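Class TopN holds the top n items of type T.
As written, the comparator sorts ascending and add() evicts max.last(), so this code prints the 10 smallest values in conditions (after adding 5 to each element). To keep the 10 largest instead, evict max.first() or pass a reversed comparator. Because TopN is backed by a TreeSet, equal values are stored only once.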
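If the goal is the 10 largest results, as in the question's "replace the smallest one with the current one" step, one possible sketch uses a bounded min-heap per thread instead of a TreeSet. The names TopTenSketch, offer and largest are invented for the example, the d + 5 calculation is the placeholder from the question, and duplicates are kept here (unlike the TreeSet version).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class TopTenSketch {
    /** Keeps at most `limit` values in the heap, evicting the smallest first. */
    static void offer(PriorityQueue<Double> heap, double value, int limit) {
        if (heap.size() < limit) {
            heap.add(value);
        } else if (value > heap.peek()) { // peek() is the smallest value kept so far
            heap.poll();
            heap.add(value);
        }
    }

    /** Returns the `limit` largest results, largest first. */
    static List<Double> largest(List<Double> conditions, int limit) {
        PriorityQueue<Double> heap = conditions.parallelStream()
                .map(d -> d + 5) // placeholder calculation from the question
                .collect(() -> new PriorityQueue<Double>(),
                        (h, v) -> offer(h, v, limit),      // runs inside one worker
                        (a, b) -> {                        // merges two workers' heaps
                            for (Double v : b) {
                                offer(a, v, limit);
                            }
                        });
        List<Double> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }
}
```

The parallel stream decides how to split the list and how many worker threads to use (by default the common fork/join pool, sized from the number of available processors), so the manual split-in-half and merge described in the question is handled by the three collect arguments.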

Let me simplify your question as I understand it; please confirm or add to it:
Requirement: You want to find the top 10 results from a list called conditions.
Procedure: You want multiple threads to run the logic that finds the top 10 results, and then combine their results into an overall top 10.
Please also share the logic you want to use to pick the top 10 elements, or say whether it is just the first 10 elements of the list in descending order.

recursion using a hashmap

I have an array that has the following numbers
int[] arr = {2,4,3,1,5,6,0,7,8,9,10,11,12,13,14,15};
Or any other order for that matter.
I need to make all the possible combinations of the numbers using recursion, subject to the condition that the number that follows the present one can only come from specific numbers given by a HashMap:
For example, when the recursion takes 1, the next number can be from {0,4,5,2,6} (from the HashMap); if I then make 10 (1 followed by 0), the next number can be from {1,4,5}, and so on.
static HashMap<Integer,Integer[]> possibleSeq = new HashMap<Integer,Integer[] >();
private static void initialize(HashMap<Integer,Integer[]> possibleSeq) {
    possibleSeq.put(0,new Integer[]{1,4,5});
    possibleSeq.put(1,new Integer[]{0,4,5,2,6});
    possibleSeq.put(2,new Integer[]{1,3,5,6,7});
    possibleSeq.put(3,new Integer[]{2,6,7});
    possibleSeq.put(4,new Integer[]{0,1,5,8,9});
    possibleSeq.put(5,new Integer[]{0,1,2,4,6,8,9,10});
    possibleSeq.put(6,new Integer[]{1,2,3,5,7,9,10,11});
    possibleSeq.put(7,new Integer[]{2,3,6,10,11});
    possibleSeq.put(8,new Integer[]{9,4,5,12,13});
    possibleSeq.put(9,new Integer[]{10,4,5,8,6,12,13,14});
    possibleSeq.put(10,new Integer[]{7,6,5,9,11,15,13,14});
    possibleSeq.put(11,new Integer[]{6,7,10,14,15});
    possibleSeq.put(12,new Integer[]{8,9,13});
    possibleSeq.put(13,new Integer[]{8,9,10,12,14});
    possibleSeq.put(14,new Integer[]{9,10,11,13,15});
    possibleSeq.put(15,new Integer[]{10,11,14});
}
Note: I am required to make all the possible numbers with lengths from 1 to 10 digits.
Help!

Try with something like this, for starters:
void findPath(Set<List<Integer>> paths, Stack<Integer> path, Integer[] nextSteps, Set<Integer> numbersLeft) {
    if (numbersLeft.isEmpty()) {
        // Done: every number has been used, so record a copy of the path
        paths.add(new ArrayList<>(path));
        return;
    }
    for (int step : nextSteps) {
        if (numbersLeft.contains(step)) {
            // We can move on
            path.push(step);
            numbersLeft.remove(step);
            // possibleSeq is the HashMap from the question
            findPath(paths, path, possibleSeq.get(step), numbersLeft);
            numbersLeft.add(path.pop());
        }
    }
}
Starting values should be an empty Set, an empty Stack, a nextSteps array holding the same numbers as your initial array (boxed as Integer[]), and a Set created from your initial array. When this returns, the paths Set should be filled with the possible paths.
I haven't tested this, and there are probably bugs as well as more elegant solutions.
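Since the question asks for every sequence of length 1 to 10 rather than only paths that use all the numbers, a variation could record each partial path as it grows. The following is a sketch under assumptions: the names SequenceSketch, sequences and extend are invented, each number is allowed at most once per sequence (as in the code above), and the map is passed in rather than read from a static field.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SequenceSketch {
    /** All sequences of length 1..maxLen in which each step is allowed by possibleSeq. */
    static List<List<Integer>> sequences(Map<Integer, Integer[]> possibleSeq, int maxLen) {
        List<List<Integer>> out = new ArrayList<>();
        if (maxLen < 1) {
            return out;
        }
        for (Integer start : possibleSeq.keySet()) {
            Deque<Integer> path = new ArrayDeque<>();
            Set<Integer> used = new HashSet<>();
            path.addLast(start);
            used.add(start);
            extend(possibleSeq, maxLen, path, used, out);
        }
        return out;
    }

    private static void extend(Map<Integer, Integer[]> possibleSeq, int maxLen,
                               Deque<Integer> path, Set<Integer> used,
                               List<List<Integer>> out) {
        out.add(new ArrayList<>(path));       // every prefix is itself a valid sequence
        if (path.size() == maxLen) {
            return;                           // base case: length limit reached
        }
        Integer[] next = possibleSeq.get(path.peekLast());
        if (next == null) {
            return;                           // no successors listed for this number
        }
        for (Integer n : next) {
            if (used.add(n)) {                // add() is false if n is already in the path
                path.addLast(n);
                extend(possibleSeq, maxLen, path, used, out);
                path.removeLast();            // backtrack
                used.remove(n);
            }
        }
    }
}
```

With the 16-entry map from the question and maxLen = 10 the result list can get very large, so counting or streaming the sequences may be preferable to storing them all.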

Dictionary of unknown size - find whether a word is in dictionary

Here is an interesting problem.
You are given an interface to a dictionary of unknown size, distribution, and content. It is sorted in ascending order.
The interface has just one method:
String getWord(long index) throws IndexOutOfBoundsException
Add one method to the API:
boolean isInDictionary(String word)
What would be the best implementation for this problem?

Here is my implementation
boolean isWordInTheDictionary(String word){
    if (word == null){
        return false;
    }
    // estimate the length of the dictionary array
    long len=2;
    String temp= getWord(len);
    while(true){
        len = len * 2;
        try{
            temp = getWord(len);
        }catch(IndexOutOfBoundsException e){
            // found upper bound, break from loop
            break;
        }
    }
    // Do a modified binary search using the estimated length
    long beg = 0 ;
    long end = len;
    String tempWrd;
    while(true){
        System.out.println(String.format("beg: %s, end=%s, (beg+end)/2=%s ", beg,end,(beg+end)/2));
        if(end - beg <= 1){
            return false;
        }
        long idx = (beg+end)/2;
        tempWrd = getWord(idx);
        if(tempWrd == null){
            end=idx;
            continue;
        }
        if ( word.compareTo(tempWrd) > 0){
            beg = idx;
        }
        else if(word.compareTo(tempWrd) < 0){
            end= idx;
        }else{
            // found the word..
            System.out.println(String.format("getword at index: %s, =%s", idx,getWord(idx)));
            return true;
        }
    }
}
Let me know if this is correct

Let's suppose that your hypothetical data structure, with its single method, String getWord(long index), is based on a Dictionary that implements the usual Dictionary operations:
addition of pairs to the collection
removal of pairs from the collection
modification of the values of existing pairs
lookup of the value associated with a particular key
but the methods for all but the last have been hidden from you.
If that is the case, then your code definitely is not correct, because there is no reason to suppose that the dictionary stores values in any particular order, hence your binary search for items, using word.compareTo(), cannot be expected to work.
Also, you don't catch the exception for index numbers between the dictionary size (which need not be a power of two) and len, the power of two that you found to be larger than the dictionary size. So even if you changed to linear search instead of binary search, you'd have an unhandled exception for words not in the dictionary.

No, the words inside the dictionary probably aren't sorted. So you have to iterate through the dictionary and check whether each word is the one you're looking for.
If it is sorted, your solution can be improved: the first loop only has to find an entry that comes after the word you're searching for.
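For the sorted case, that improvement can be sketched as a galloping (doubling) probe followed by a binary search that treats an out-of-range index as "too far right". Everything named here is an assumption for illustration: WordSource is a hypothetical stand-in for the dictionary interface, and SortedDictionarySearch and wordOrNull are invented helper names.

```java
import java.util.List;

class SortedDictionarySearch {
    /** Hypothetical stand-in for the dictionary API described in the question. */
    interface WordSource {
        String getWord(long index) throws IndexOutOfBoundsException;
    }

    /** Returns the word at `index`, or null when the index is past the end. */
    private static String wordOrNull(WordSource dict, long index) {
        try {
            return dict.getWord(index);
        } catch (IndexOutOfBoundsException e) {
            return null;
        }
    }

    static boolean isInDictionary(WordSource dict, String word) {
        if (word == null) {
            return false;
        }
        // Gallop: double hi until index hi-1 is past the end or holds a word >= target.
        // Invariant: every index below lo holds a word smaller than the target.
        long lo = 0;
        long hi = 1;
        while (true) {
            String probe = wordOrNull(dict, hi - 1);
            if (probe == null || probe.compareTo(word) >= 0) {
                break;
            }
            lo = hi;
            hi *= 2;
        }
        // Binary search in the half-open range [lo, hi).
        while (lo < hi) {
            long mid = lo + (hi - lo) / 2;
            String candidate = wordOrNull(dict, mid);
            // A missing entry is past the end, so it counts as larger than the target.
            int cmp = (candidate == null) ? 1 : candidate.compareTo(word);
            if (cmp == 0) {
                return true;
            }
            if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> words = List.of("apple", "banana", "cherry");
        WordSource dict = i -> words.get((int) i); // List.get throws IndexOutOfBoundsException
        System.out.println(isInDictionary(dict, "banana"));
    }
}
```

The number of getWord calls grows with the logarithm of the target's position, not of the dictionary size, and index 0 as well as an empty dictionary are covered by the same loop. List.of in the demo needs Java 9 or later; Arrays.asList works on older versions.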

duedl0r is correct: you can't assume that the dictionary will be ordered.
Without any other information, random search is probably the best algorithm you can choose (after having estimated the size, or during the estimation).
For correctness, in the second part of your algorithm you should check for exceptions and handle them because, as you said in the comment, your estimate is only an upper bound, so getWord may throw during the search.
Edit: just to give a better explanation:
Search in an unsorted list has a worst-case time complexity of O(n).
Randomized search has complexity O(k), where k is the number of iterations in the search, so you can choose k. But it is important to understand that randomized search does not guarantee success.
When n, the size of the dictionary, is very big, you can set k some orders of magnitude lower than n and still have a high probability of finding the word.
