RLE sequence, setting a value - java

Say I have an arbitrary RLE Sequence. (For those who don't know, RLE compresses an array like [4 4 4 4 4 6 6 1 1] into [(5,4) (2,6) (2,1)]. First comes the number of a particular integer in a run, then the number itself.)
How can I design an algorithm to set a value at a given index without decompressing the whole thing? For example, if you do set(0,1) the RLE would become [(1,1) (4,4) (2,6) (2,1)]. (In set, the first argument is the index, the second is the value.)
Also, I've divided this compressed sequence into an ArrayList of Entries. That is, each Entry is one of these: (1,1) where it has an amount and value.
I'm trying to find an efficient way to do this, right now I can just think of methods that have WAY too many if statements to be considered clean. There are so many possible variations: for example, if the given value splits an existing entry, or if it has the same value as an existing entry, etc...
Any help would be much appreciated. I'm working on an algorithm now, here is some of it:
int i = 0, count = 0, indexToStop; // i = current entry, count = elements passed so far
while (i < rleAL.size() && count != index)
{
    indexToStop = 0;
    // advance within the current entry until we reach the target index
    // or exhaust this entry's run
    while (count < index && indexToStop != rleAL.get(i).getAmount())
    {
        count++;
        indexToStop++;
    }
    if (count != index)
    {
        i++;
    }
}
As you can see this is getting increasingly sloppy...
Thanks!

RLE is generally bad at updates, exactly for the reason you state. Changing ArrayList to LinkedList won't help much, as LinkedList is awfully slow at everything but inserts (and even for inserts you must already hold a reference to the specific location, e.g. via a ListIterator).
Talking about the original question, though: there's no need to decompress everything. All you need is to find the right place (summing up the counts as you go), which is linear time (consider a skip list to make it faster). After that you have just four cases:
You're inside a block, and the block's value is the same as the number you're trying to save.
You're inside a block, and the number is different.
You're at the beginning or end of a block; the number differs from the block's value but matches the neighbouring block's.
You're at the beginning or end of a block; the number matches neither the block's value nor the neighbour's.
The corresponding actions, in the same order, are obvious:
Do nothing
Change counter in the block, add two blocks
Change counters in two blocks
Change counter in one block, insert a new one
(Note though if you have skip lists, you must update those as well.)
Update: it gets more interesting if the block to update is of length 1, true. Still, it all stays just as trivial: in any case, the changes are limited to a maximum of three blocks.
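For illustration, here is a minimal sketch of that approach, assuming an Entry class with an (amount, value) constructor plus the getters from the question; it fully handles cases 1 and 2 (match, and split inside a block) and only notes where the neighbour-merging cases 3 and 4 would go:

import java.util.List;

static void set(List<Entry> rle, int index, int value)
{
    int i = 0, start = 0; // start = absolute index of the first element of block i
    while (start + rle.get(i).getAmount() <= index)
    {
        start += rle.get(i).getAmount(); // linear scan, summing the counts
        i++;
    }
    Entry block = rle.get(i);
    if (block.getValue() == value)
        return; // case 1: do nothing
    int before = index - start;                 // run length left of the new value
    int after = block.getAmount() - before - 1; // run length right of it
    // case 2: split the block into up to three pieces
    // (Entry(amount, value) constructor assumed, mirroring the question's getters)
    rle.remove(i);
    if (after > 0)
        rle.add(i, new Entry(after, block.getValue()));
    rle.add(i, new Entry(1, value));
    if (before > 0)
        rle.add(i, new Entry(before, block.getValue()));
    // cases 3 and 4 (the new value sits at a block edge) would instead bump the
    // counter of rle.get(i - 1) / rle.get(i + 1) here, or insert a single new
    // block, so that equal neighbours stay merged.
}

E.g. set(rle, 0, 1) on [(5,4) (2,6) (2,1)] yields [(1,1) (4,4) (2,6) (2,1)], matching the example in the question.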

Related

How to make zeros for next n-1 elements of n similar elements in an array?

I have an integer 667778 and I need to output it as 607008.
I used the array 6,6,7,7,7,8 and XOR adjacent equal elements (XOR of equal values is 0).
I need to do this in constant time.
Suppose:

int arr[] = {6, 6, 7, 7, 7, 8};
int ele = arr[0];
for (int i = 1; i < arr.length; i++)
{
    if (arr[i] == ele)
        arr[i] = 0;
    else
        ele = arr[i];
}

The output array arr then holds [6, 0, 7, 0, 0, 8].
This takes O(n), where n is the size of the array.
How can I do this in constant time?
Unless the number will always be six digits (in which case you can hard-code it, which is technically constant time, but performs the same as the loop), you can't get constant time, because the problem inherently requires visiting every element of the array.
Is there a reason you want it to work in constant time anyway? O(n) is already the fastest any program can even read the data.
Edit:
After reading your comments, I think you need to come up with a different approach so that calculating the XORs doesn't happen inside the loop. I can't provide much more help without the original problem.

A good data structure for storing and searching integers?

Edit: typos fixed and ambiguity addressed.
I have a list of five digit integers in a text file. The expected amount can only be as large as what a 5-digit integer can store. Regardless of how many there are, the FIRST line in this file tells me how many integers are present, so resizing will never be necessary. Example:
3
11111
22222
33333
There are 4 lines. The first says there are three 5-digit integers in the file. The next three lines hold these integers.
I want to read this file and store the integers (not the first line). I then want to be able to search this data structure A LOT, nothing else. All I want to do, is read the data, put it in the structure, and then be able to determine if there is a specific integer in there. Deletions will never occur. The only things done on this structure will be insertions and searching.
What would you suggest as an appropriate data structure? My initial thought was a binary tree of sorts; however, upon thinking, a HashTable may be the best implementation. Thoughts and help please?
It seems like the requirements you have are
store a bunch of integers,
where insertions are fast,
where lookups are fast, and
where absolutely nothing else matters.
If you are dealing with a "sufficiently small" range of integers - say, integers up to around 16,000,000 or so, you could just use a bitvector for this. You'd store one bit per number, all initially zero, and then set the bits to active whenever a number is entered. This has extremely fast lookups and extremely fast setting, but is very memory-intensive and infeasible if the integers can be totally arbitrary. This would probably be modeled with by BitSet.
If you are dealing with arbitrary integers, a hash table is probably the best option here. With a good hash function you'll get a great distribution across the table slots and very, very fast lookups. You'd want a HashSet for this.
If you absolutely must guarantee worst-case performance at all costs and you're dealing with arbitrary integers, use a balanced BST. The indirection costs in BSTs make them a bit slower than other data structures, but balanced BSTs can guarantee worst-case efficiency that hash tables can't. This would be represented by TreeSet.
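For reference, a quick sketch of all three options (the class names are the standard java.util ones); treat it as an illustration rather than a benchmark:

import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

BitSet seen = new BitSet(100000);       // dense small range: one bit per value
seen.set(11111);                        // insert
boolean found = seen.get(11111);        // lookup

Set<Integer> hashed = new HashSet<>();  // arbitrary ints, expected O(1) operations
Set<Integer> ordered = new TreeSet<>(); // guaranteed O(log n) worst case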
Given that
All numbers are <= 99,999
You only want to check for existence of a number
You can simply use some form of bitmap.
e.g. create a byte[12500] (that's 100,000 bits, i.e. 100,000 booleans, enough to record the existence of 0-99,999).
"Inserting" a number N means turning the N-th bit on. Searching for a number N means checking whether the N-th bit is on.
Pseudocode for the insertion logic:
bitmap[number / 8] |= (1 << (number % 8));
and searching:
bitmap[number / 8] & (1 << (number % 8));
If you understand the rationale, then even better news for you: Java already has BitSet, which does exactly what I described above.
So the code looks like this:
BitSet bitset = new BitSet(100000); // 100,000 bits: one per value in 0-99,999
// inserting a number
bitset.set(number);
// searching whether a number exists
bitset.get(number); // true if it exists
If the number of times each number occurs doesn't matter (as you said: only inserts and checking whether a number exists), then you'll only ever have at most 100,000 values. Just create an array of booleans:
boolean[] numbers = new boolean[100000];
This should take only about 100 kilobytes of memory.
Then, instead of adding a number like 11111, 22222, 33333, do:
numbers[11111]=true;
numbers[22222]=true;
numbers[33333]=true;
To see if a number exists, just do:
int whichNumber = 11111;
numberExists = numbers[whichNumber];
There you are. Easy to read, easier to maintain.
A Set is the go-to data structure for lookups, and here's the tiny amount of code you need to make it happen:
Scanner scanner = new Scanner(new FileInputStream("myfile.txt"));
Set<Integer> numbers = Stream.generate(scanner::nextInt)
        .limit(scanner.nextInt()) // the first int read is the count
        .collect(Collectors.toSet());

How to try all possible ways of traversing an array of integers

I'm new to programming and I'd like some help with an assignment. I only need a little clue to get started (no full answer; I only want a direction, then I'll work my way through it).
The rules of the game : on the initial square is a marker that can move to other squares along the row. At each step in the game, you may move the marker the number of squares indicated by the integer in the square it currently occupies. The marker may move either left or right along the row but may not move past either end.
The goal of the game is to move the marker to the cave, the "0" at the far end of the row. For example:
2 3 1 2 4 0
In this configuration you can solve the game by making the following moves: first, move 2 (to the right, since it's impossible to go left). Then move 1 either to the right or to the left. If you moved left, move 3 to the right (you cannot go 3 to the left). If you moved right, it's either 2 to the left or 2 to the right; 2 to the right is the correct answer, since the next value is then 0.
I must write a program that will try all the possibilities (so using a loop or recursion, I guess?) and return TRUE if there's a solution or FALSE if there is none. I had to choose the data structure from a few given by the prof for this assignment, and decided to use an array-based list, since get() is O(1) and it's essentially the only method used once the list is created.
My program is only missing the (static?) method that will evaluate whether it's possible or not; I don't need help with the rest of the assignment.
Assume that the ArrayBasedList is already given.
Thanks for your help!
I think your idea about using a recursive method makes a lot of sense. I appreciate you don't want the full answer, so this should just give you an idea:
Make a method that takes an integer x (the current position) and your list, and returns a boolean.
In the method, check the value of list.get(x).
If it's 0, return true.
If it's -1, return false (more on that later).
Otherwise, store the value in a variable n, replace the element with -1, and return method(x+n) || method(x-n).
What the -1 does is eventually fill the array with a value that terminates the recursion. Otherwise, you could end up with infinite recursion (e.g. if you passed in [1, 2, 4, 2, 3]) with the marker going back and forth.
A couple of things to keep in mind:
You'd need to make a new copy of the array every time, so you don't keep overwriting the original array.
You'd need to incorporate a try-catch (or equivalent bounds checks) to make sure you never step out of bounds.
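Putting those pieces together, a minimal sketch might look like this (using a plain int[] stand-in for the ArrayBasedList; bounds checks replace the try-catch):

static boolean canReachZero(int[] squares, int x)
{
    if (x < 0 || x >= squares.length)
        return false;               // moved past an end of the row
    int n = squares[x];
    if (n == 0)
        return true;                // reached the cave
    if (n == -1)
        return false;               // already visited
    squares[x] = -1;                // mark as visited to break cycles
    return canReachZero(squares, x + n) || canReachZero(squares, x - n);
}

E.g. canReachZero(new int[]{2, 3, 1, 2, 4, 0}, 0) returns true for the example above.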
If you have any further questions, feel free to ask!

Huffman Code Decoder Encoder In Java Source Generation

I want to create a fast Huffman code decoder in Java and therefore thought about lookup tables. Since those tables consume memory, and since we use Java code to navigate and access them, one could just as easily (or not) write a program/method that expresses the same table.
The problem with that approach is that I don't know what the best strategy is. I know a lot of it comes down to what fits in the cache and to branch prediction. Also, how a switch-case is actually implemented - the generated machine code - is beyond me. If I have an in-memory lookup table (or a hierarchy of them) I can simply jump in and out, but I doubt that for my purposes such a table would fit in the cache.
Since I actually walk a tree, one could implement it as if/else statements, requiring a certain number of comparisons, but each comparison also needs additional binary operations.
So the following options exist:
General algorithm using in-memory lookup tables
If/else representation of the decision tree
If/else representation with small switch statements to find the correct group of symbols (same bit-pattern length) (fewer if statements, possibly more code).
Switch statement representation of the code
Writing and benchmarking is quite tricky so any initial thoughts would be great.
One additional problem that comes into play is the order of bits: the most significant bit always comes first, meaning the code is stored in reverse order.
If your tree is A = 0, B = 10, C = 11, then to write BAC it would actually be 01 + 0 + 11 (plus means append).
So the codes actually have to be written in reverse order. Using the if/else or switch approach for groups this would not be a problem, since masking out the bits is simple and reversing a bit pattern is possible, but it would lose the idea of deriving the index within the group from the mask, since in reversed bit order add and remove have different meanings and a simple lookup is not possible.
Reversing the bits is a costly operation (I use 4-bit lookup tables) and does not outweigh the performance penalty of the extra binary operations.
But reversing the bits on the go is better suited here and requires four operations per bit (shifting up, masking out, adding, and shifting the input down). Since I read bits ahead, all those operations are done in registers, so they might take only a few cycles.
This way I can use switch, subtract and if to find the right symbol group and return its symbols.
So finally, I need advice. Since my codes are global (they are for language processing), they can be hardwired, i.e. live in source.
I wonder what parser generators like ANTLR use to express those decisions. Since they also seem to switch or if/else on the input symbol, it might give me a clue.
[Updates]
I found a simplification that avoids the reverse-bit problem but still adds a cost per group: I end up writing the bits in the order of the groups to traverse. So I don't need four modifications per bit, only per group (one per distinct bit length).
For each group we have:
1. The value of the first element and the group's size (and therefore the value of the last element within that group).
The algorithm for each group therefore looks like:
1. Read m bits and combine them with the value read so far.
2. Compare the value with the last value of that group. If it is smaller, it is within this group; if not, it is outside, so read the next group.
3. If it is inside the group, an array of values can be accessed, or a switch statement used.
This is totally generic and can be used without loops, making it efficient. Also, once the group is detected, the bit length of the code is known and the bits can be consumed from the source, since the code looks far ahead (reading from a stream).
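This group-by-group scan is essentially canonical Huffman decoding. A hedged sketch under those assumptions (codesOfLength[len] = number of codes of that length, symbols[] sorted by code value; nextBit is whatever bit source you use - all three names are illustrative, not from the question):

import java.util.function.IntSupplier;

static int decodeSymbol(IntSupplier nextBit, int[] codesOfLength,
                        int[] symbols, int maxBits)
{
    int code = 0;  // the bits read so far, interpreted as a value
    int first = 0; // smallest code of the current length
    int index = 0; // offset of this group within symbols[]
    for (int len = 1; len <= maxBits; len++)
    {
        code |= nextBit.getAsInt();     // append the next input bit (MSB first)
        int count = codesOfLength[len];
        if (code - first < count)       // value falls inside this group
            return symbols[index + (code - first)];
        index += count;                 // skip past this group's symbols
        first = (first + count) << 1;   // smallest code of the next length
        code <<= 1;                     // make room for the next bit
    }
    return -1;                          // ran past maxBits: invalid code
}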
[Update 2]
To access the actual value, one could use a single big array of elements, grouped by group. Since the probability decreases from group to group, it is very likely that a significant part fits in the L2 or L1 cache, speeding up access here.
Or one uses switch statements.
[Update 3]
Depending on the cases of a switch, the compiler generates either a tableswitch or a lookupswitch. A lookupswitch has O(log n) complexity and stores key/jump-offset pairs, which is not preferable. Therefore checking for groups is better done with if/else.
The tableswitch itself uses only a table of jump offsets; it takes just a subtract, a compare, a table access and a jump to reach the destination, after which it executes a return of a constant.
Therefore a table access looks more promising. Also, to avoid an unnecessary jump, each group might contain the logic to access and return its symbol table. Storing everything in one big table is promising, since it might be an int or short per symbol, and my codes usually have only 1,000 to 4,000 symbols at most, so short actually suffices.
I will check whether the 1 - pattern gives me the opportunity to store and access the masks in a better way, allowing binary search for the correct group instead of advancing in O(n), and perhaps even avoiding any shift operations at all during processing.
I couldn't make sense of most of what you wrote in your (long) question, but there is a simple approach.
We'll start with a single table. Let's say your longest Huffman code is 15 bits. (In fact, deflate limits the size of its Huffman codes to 15 bits.) Then construct a table with 32768 entries, where each entry is the number of bits in the next code, and the symbol for that code. For codes shorter than 15 bits, there is more than one entry in the table for the same code. E.g. if the code is 1001011 (7 bits) for the symbol 'C', then all of the indexes of the table xxxxxxxx1001011 have the same thing. Those entries all have {7, 'C'}.
Then you take 15 bits from the stream and look up the entry in the table. You consume the number of bits given by that entry and emit its symbol. Now you pull as many bits from the stream as you need to get back to 15, and repeat. So if you used 7 bits, get 8 more to return to 15 and look up the next code.
The next subtlety is that if your Huffman code changes often, you might end up spending more time filling up that large table for each new Huffman code than you spend actually decoding. To avoid that, you can make a two-level table which has, say, a 9-bit lookup (512 entries) for the first portion of the code. If the code is 9 bits or fewer, you proceed as above. That will be the most common case, since shorter codes are more frequent (that being the whole point of Huffman coding). If the table entry says there are 10 or more bits in the code (and you don't yet know how many more), you consume the first nine bits and go to a second-level table for those initial nine bits, pointed to by the entry in the first table, which has entries for the remaining six bits (64 entries). That resolves the rest of the code and so tells you how many more bits to consume and what the symbol is. This approach can greatly reduce the time spent filling tables, and is very nearly as fast, since short codes are more common. This is the approach used by inflate in zlib.
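A minimal sketch of that single-table scheme, assuming codes sit in the low bits of the index (LSB first, as in deflate); the BitReader interface here is a hypothetical stand-in for whatever bit source is used:

interface BitReader
{
    int peekBits(int n);  // look at the next n bits without consuming them
    void skipBits(int n); // consume n bits
}

final class TableDecoder
{
    static final int MAX_BITS = 15;
    // entry layout: low 4 bits = code length, remaining bits = symbol
    final int[] table = new int[1 << MAX_BITS];

    void add(int code, int length, int symbol)
    {
        // replicate the entry for every value of the "don't care" high bits
        for (int hi = 0; hi < (1 << (MAX_BITS - length)); hi++)
            table[(hi << length) | code] = (symbol << 4) | length;
    }

    int decodeNext(BitReader in)
    {
        int entry = table[in.peekBits(MAX_BITS)];
        in.skipBits(entry & 0xF); // consume only this code's bits
        return entry >>> 4;       // the decoded symbol
    }
}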
In the end it was quite simple. I support almost all the solutions now. One can test every symbol group (same bit length), use a lookup table (10 bit + 10 bit + 10 bit - just tables of 10 bits, with symbol count + 1 as the reference to those tables), and generate Java (and if needed JavaScript, but currently I use GWT to translate it).
I even use long reads and shift operations to reduce the accesses to the binary information. This way the code gets more efficient, since I only support a maximum code size of 20 bits (hence a table of tables), which allows 2^20 symbols - at most about a million.
For the ordering I use a generator for the bit masks, just using shift operations, with no need to reverse bit orders or the like.
The table lookups can also be expressed in Java by storing the tables as arrays of arrays (it's interesting how big the Java files can get before the compiler complains).
Also, I found it interesting that since comparison expresses an ordering (a partial order, I guess), one can sort the symbols and, instead of mapping the symbols, map the comparison index. By comparing two indexes one can simply sort streams of codes without touching much else. By also storing the first one or two comparison indexes (16 or 32 bits), one can efficiently sort, and therefore binary-sort, compressed strings that use the same Huffman code, which makes it ideal for compressing strings in a given language.

How do I count repeated words?

Given a 1 GB (very large) file containing words (some repeated), we need to read the file and output how many times each word is repeated. Please let me know whether my solution is performant.
(For simplicity, let's assume we have already captured the words in an ArrayList<String>.)
I think the complexity is O(n). Am I correct?
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public static void main(String[] args)
{
    List<String> al = new ArrayList<>();
    al.add("math1");
    al.add("raj1");
    al.add("raj2");
    al.add("math");
    al.add("rj2");
    al.add("math");
    al.add("rj3");
    al.add("math2");
    al.add("rj1");
    al.add("is");
    Map<String, Integer> map = new HashMap<String, Integer>();
    for (int i = 0; i < al.size(); i++)
    {
        String s = al.get(i);
        map.put(s, null);
    }
    for (int i = 0; i < al.size(); i++)
    {
        String s = al.get(i);
        if (map.get(s) == null)
            map.put(s, 1);
        else
        {
            int count = map.get(s);
            count = count + 1;
            map.put(s, count);
        }
    }
    System.out.println(map);
}
I think you could do better than using a HashMap.
Food for thought on the hashmap solution
Your answer is acceptable, but consider this: for simplicity's sake, let's assume you read the file one byte at a time into a StringBuffer until you hit a space, at which point you call toString() to convert the StringBuffer into a String. You then check whether the string is in the HashMap, and either it gets stored or its counter gets incremented.
The English dictionary included with Linux has 400k words and is about 5 MB in size. So of the "1 GB" of text you read, we can guess that you'll only be storing about 5 MB of it in your HashMap. The rest of the file will be converted into strings that need to be garbage-collected after you've finished looking them up in your map. I could be wrong, but I believe the bytes will be iterated over again during the construction of the String, since the byte array needs to be copied internally, and again when calculating the hash code. So the solution may waste a fair number of CPU cycles and force GC to occur often.
It's OK to point things like this out in your interview, even if it's the only solution you can think of.
I would consider using a custom radix tree or trie-like structure.
Keep in mind how the insert method of a radix tree/trie works: it takes a stream of chars/bytes (usually a string) and compares each element against the current position in the tree. If the prefix exists, it just advances down the tree and the byte stream in lock step. When it hits a new suffix, it begins adding nodes to the tree. Once the end of the stream is reached, it marks that node as end-of-word (EOW). Now consider that we could do the same thing while reading a much larger stream, by resetting the current position to the root of the tree any time we hit a space.
If we wrote our own radix tree (or maybe a trie) whose nodes had end-of-word counters (instead of markers) and whose insert method read directly from the file, we could insert nodes into the tree one byte/char at a time until we read a space. At that point the insert method would increment the end-of-word counter (if it's an existing word), reset the current position in the tree back to the root, and start inserting bytes/chars again. The way a radix tree works is to collapse the duplicated prefixes of words. For example:
The following file:
math1 raj1 raj2 math rj2 math rj3
would be converted to:
(root)
 |- math (eow=2)
 |    |- 1 (eow=1)
 |- r
      |- aj
      |    |- 1 (eow=1)
      |    |- 2 (eow=1)
      |- j
           |- 2 (eow=1)
           |- 3 (eow=1)
The insertion time into a tree like this is O(k), where k is the length of the longest word. But since we insert/compare as we read each byte, we are no less efficient than just reading the file, which we have to do anyway.
Also, note that we would read bytes into a temporary variable on the stack, so the only time we need to allocate memory from the heap is when we encounter a new word (actually, a new suffix). Therefore, garbage collection wouldn't happen nearly as often, and the total memory used by a radix tree would be a lot smaller than that of a HashMap.
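A hedged sketch of the counting idea (using a plain trie rather than a compressed radix tree, for brevity; the class and method names are made up for illustration):

import java.util.HashMap;
import java.util.Map;

class CountingTrie
{
    private static final class Node
    {
        final Map<Character, Node> children = new HashMap<>();
        int endOfWordCount; // a counter instead of a boolean EOW marker
    }

    private final Node root = new Node();
    private Node current = root; // advances in lock step with the input stream

    // Feed characters one at a time straight from the stream; a space closes
    // the current word (bumping its counter) and resets back to the root.
    void accept(char c)
    {
        if (c == ' ')
        {
            if (current != root)
                current.endOfWordCount++;
            current = root;
        }
        else
        {
            current = current.children.computeIfAbsent(c, k -> new Node());
        }
    }
}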
Theoretically, since HashMap access is generally O(1), I guess your algorithm is O(n), but in reality it has several inefficiencies. Ideally you would iterate over the contents of the file just once, processing (i.e. counting) the words as you read them in. There's no need to store the entire file contents in memory (your ArrayList). You loop over the contents three times: once to read them, and a second and third time in the two loops in your code above. In particular, the first loop in your code above is completely unnecessary. Finally, your use of HashMap will be slower than needed, because the default size at construction is very small and it will have to grow internally a number of times, forcing a rebuild of the hash table each time. Better to start it off at a size appropriate for what you expect it to hold. You also have to factor the load factor into that.
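For illustration, a minimal sketch of that single-pass approach (the file name and the 1,000,000 capacity hint are assumptions; tune the hint to the vocabulary you expect):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

public class WordCount
{
    public static void main(String[] args) throws IOException
    {
        Map<String, Integer> counts = new HashMap<>(1_000_000);
        try (Scanner in = new Scanner(Files.newBufferedReader(Paths.get("words.txt"))))
        {
            while (in.hasNext())
                counts.merge(in.next(), 1, Integer::sum); // count in a single pass
        }
        counts.forEach((word, n) -> System.out.println(word + " " + n));
    }
}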
Have you considered using a MapReduce solution? If the dataset gets bigger, it would really be better to split it into pieces and count the words in parallel.
You should read through the file with words only once.
No need to put the nulls beforehand - you can do it within the main loop.
The complexity is indeed O(n) in both cases, but you want to make the constant as small as possible. (O(n) = 1000 * O(n), right :) )
To answer your question, first you need to understand how HashMap works. It consists of buckets, and every bucket is a linked list. If, due to hashing, another pair needs to occupy the same bucket, it is added to the end of that linked list. So if the map has a high load, searching and inserting are no longer O(1), and the algorithm becomes inefficient. Moreover, if the map's fill ratio exceeds the predefined load factor (0.75 by default), the whole map is rehashed.
This is an excerpt from the JavaDoc (http://download.oracle.com/javase/6/docs/api/java/util/HashMap.html):
The expected number of entries in the map and its load factor should
be taken into account when setting its initial capacity, so as to
minimize the number of rehash operations. If the initial capacity is
greater than the maximum number of entries divided by the load factor,
no rehash operations will ever occur.
So I would recommend predefining the map capacity, on the assumption that every word is unique:
Map<String,Integer> map= new HashMap<String,Integer>(al.size());
Without that, your solution is not efficient enough, though it is still linear: due to the amortized cost of rehashing, inserting the elements costs roughly 3n operations instead of n.
