Adding "\0" to a subset range end - java

Directly from the Java API documentation:
Why does adding a "\0" "open" one range end (i.e. make that endpoint inclusive), as explained in the following quote?
I checked the "\0" escape sequence and it says it represents the null character.
What is the null character in terms of Strings? And why should adding it to the "high" parameter of a subSet cause that parameter itself to be included in the range?
If you need a closed range (which includes both endpoints), and the
element type allows for calculation of the successor of a given value,
merely request the subrange from lowEndpoint to
successor(highEndpoint). For example, suppose that s is a sorted set
of strings. The following idiom obtains a view containing all of the
strings in s from low to high, inclusive:
SortedSet sub = s.subSet(low, high+"\0");
Thanks in advance for your time.

high+"\0" is a way to obtaining the String that would be sorted immediately after high.
So, if you want a subset that includes the high element, you need to specify the limit to the subset as high+"\0"
For example, if you were dealing with a SortedSet<Int> and you wanted the subset between 4 and 8, both inclusive, you would use s.subSet(4, 8+1). high+"\0" is the String equivalent.

When you call subSet with a low and a high limit, the high limit element will not be included (i.e. low <= element < high will be included, which excludes high).
If you want it included, you need to give a limit slightly higher, but not high enough to include another element.
The easiest way to make the next bigger string is to append a \0. Making the string longer makes it sort just after the high limit (so the high limit element is included), but no other string can sort strictly between the two, so there is no risk of inadvertently including an extra element.
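For illustration, here is a small runnable sketch of that idiom (the class name and the example values are mine, not from the question):
import java.util.SortedSet;
import java.util.TreeSet;

public class SubSetDemo {
    public static void main(String[] args) {
        SortedSet<String> s = new TreeSet<>();
        s.add("apple");
        s.add("banana");
        s.add("cherry");

        // Half-open view: "cherry" (the high endpoint) is excluded.
        SortedSet<String> halfOpen = s.subSet("apple", "cherry");
        System.out.println(halfOpen); // [apple, banana]

        // "cherry" + "\0" sorts immediately after "cherry", and no other
        // string can sort strictly between them, so "cherry" is now included.
        SortedSet<String> closed = s.subSet("apple", "cherry" + "\0");
        System.out.println(closed); // [apple, banana, cherry]
    }
}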

Related

Java structure that is able to determine the approximate number of elements less than x in an ordered set which is updated concurrently

Suppose U is an ordered set of elements, S ⊆ U, and x ∈ U. S is being updated concurrently. I want to get an estimate of the number of elements in S that are less than x in O(log(|S|)) time.
S is being maintained by another software component that I cannot change. However, whenever e is inserted into (or deleted from) S, I get a message "e inserted" ("e deleted"). I don't want to maintain my own version of S since memory is limited. I am looking for a structure, ES (perhaps using O(log(|S|)) space), from which I can get a reasonable estimate of the number of elements less than any given x. Assume that the entire set S can periodically be sampled to recreate or update ES.
Update: I think that this problem statement must include more specific values for U. One obvious case is where the elements of U are numbers (int, double, etc.). Another case is where they are strings ordered lexicographically.
In the case of numbers one could use a probability distribution (but how can that be determined?).
I am wondering if the set S can be scanned periodically. Place the entire set into an array and sort it. Then pick the log(n) values at positions n/log(n), 2n/log(n), ..., n, where n = |S|, and draw a histogram based on those values?
More generally, how can one find the appropriate probability distribution from S?
Not sure what the unit of measure would be for strings ordered lexicographically.
By concurrently, I'm assuming you mean thread-safe. In that case, I believe what you're looking for is a ConcurrentSkipListSet, which is essentially a concurrent TreeSet. You can use ConcurrentSkipListSet#headSet(x).size() or ConcurrentSkipListSet#tailSet(x).size() to get the number of elements less/greater than (or equal to) a single element, and you can pass in a custom Comparator.
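A minimal sketch of that suggestion (the names are mine); note the caveat in the comments about the cost of size() on the returned view:
import java.util.concurrent.ConcurrentSkipListSet;

public class HeadSetCountDemo {
    public static void main(String[] args) {
        ConcurrentSkipListSet<Integer> s = new ConcurrentSkipListSet<>();
        s.add(3);
        s.add(7);
        s.add(42);

        int x = 10;
        // View of all elements strictly less than x. Note that size() on the
        // view walks its elements, so the count is linear in the view size,
        // not logarithmic -- fine for an estimate, costly on very large sets.
        int lessThanX = s.headSet(x).size();
        System.out.println(lessThanX); // 2
    }
}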
Is x constant? If so, it seems easy to track the number of elements less than x as they are inserted and deleted.
If x isn't constant you could still take a histogram approach. Divide up the range that values can take. As items are inserted / deleted, keep track of how many items are in each range bucket. When you get a query, sum up all the values from smaller buckets.
I accept your point that bucketing is tricky, especially if you know nothing about the underlying data. You could record the first 100 values of x, and use those to calculate a mean and a standard deviation. Then you could assume the values are normally distributed and calculate the buckets that way.
Obviously if you know more about the underlying data you can use a different distribution model. It would be easy enough to have a modular approach if you want it to be generic.
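A hedged sketch of the bucket idea, assuming the values are non-negative ints with a known maximum; the class name, bucket layout and use of AtomicLongArray are my assumptions, not part of the answer:
import java.util.concurrent.atomic.AtomicLongArray;

class RankEstimator {
    private final int bucketWidth;
    private final AtomicLongArray buckets;

    RankEstimator(int maxValue, int bucketCount) {
        this.bucketWidth = Math.max(1, maxValue / bucketCount);
        this.buckets = new AtomicLongArray(bucketCount + 1);
    }

    private int bucketOf(int value) {
        return Math.min(buckets.length() - 1, value / bucketWidth);
    }

    // Call these from the "e inserted" / "e deleted" notifications.
    void onInsert(int value) { buckets.incrementAndGet(bucketOf(value)); }
    void onDelete(int value) { buckets.decrementAndGet(bucketOf(value)); }

    // Rough estimate of |{e in S : e < x}|: sum of all buckets entirely below x.
    long estimateLessThan(int x) {
        long sum = 0;
        for (int b = 0; b < bucketOf(x); b++) {
            sum += buckets.get(b);
        }
        return sum;
    }
}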

A good data structure for storing and searching integers?

Edit: Typos fixed and ambiguity addressed.
I have a list of five-digit integers in a text file. The number of integers can only be as large as what a 5-digit integer can represent. Regardless of how many there are, the FIRST line in this file tells me how many integers are present, so resizing will never be necessary. Example:
3
11111
22222
33333
There are 4 lines. The first says there are three 5-digit integers in the file. The next three lines hold these integers.
I want to read this file and store the integers (not the first line). I then want to be able to search this data structure A LOT, nothing else. All I want to do, is read the data, put it in the structure, and then be able to determine if there is a specific integer in there. Deletions will never occur. The only things done on this structure will be insertions and searching.
What would you suggest as an appropriate data structure? My initial thought was a binary tree of sorts; however, upon thinking, a HashTable may be the best implementation. Thoughts and help please?
It seems like the requirements you have are
store a bunch of integers,
where insertions are fast,
where lookups are fast, and
where absolutely nothing else matters.
If you are dealing with a "sufficiently small" range of integers - say, integers up to around 16,000,000 or so, you could just use a bitvector for this. You'd store one bit per number, all initially zero, and then set the bits to active whenever a number is entered. This has extremely fast lookups and extremely fast setting, but is very memory-intensive and infeasible if the integers can be totally arbitrary. This would probably be modeled with by BitSet.
If you are dealing with arbitrary integers, a hash table is probably the best option here. With a good hash function you'll get a great distribution across the table slots and very, very fast lookups. You'd want a HashSet for this.
If you absolutely must guarantee worst-case performance at all costs and you're dealing with arbitrary integers, use a balanced BST. The indirection costs in BSTs make them a bit slower than other data structures, but balanced BSTs can guarantee worst-case efficiency that hash tables can't. This would be represented by TreeSet.
Given that
All numbers are <= 99,999
You only want to check for existence of a number
You can simply use some form of bitmap.
e.g. create a byte[12500] (that is 100,000 bits, which means 100,000 booleans to record the existence of 0-99,999).
"Inserting" a number N means turning the N-th bit on. Searching a number N means checking if N-th bit is on.
Pseudo code for the insertion logic is:
bitmap[number / 8] |= (1 << (number % 8));
and searching looks like:
(bitmap[number / 8] & (1 << (number % 8))) != 0
If you understand the rationale, then even better news for you: in Java we already have BitSet, which does what I was describing above.
So code looks like this:
BitSet bitset = new BitSet(100000); // size in bits, covering values 0-99,999
// inserting number
bitset.set(number);
// search if number exists
bitset.get(number); // true if exists
If the number of times each number occurs doesn't matter (as you said, only inserts and checking whether a number exists), then you'll only have a maximum of 100,000. Just create an array of booleans:
boolean[] numbers = new boolean[100000];
This should take only 100 kilobytes of memory.
Then instead of adding a number, like 11111, 22222, 33333, do:
numbers[11111]=true;
numbers[22222]=true;
numbers[33333]=true;
To see if a number exists, just do:
int whichNumber = 11111;
numberExists = numbers[whichNumber];
There you are. Easy to read, easier to maintain.
A Set is the go-to data structure for "find" operations, and here's the tiny amount of code you need to make it happen:
Scanner scanner = new Scanner(new FileInputStream("myfile.txt"));
Set<Integer> numbers = Stream.generate(scanner::nextInt)
.limit(scanner.nextInt())
.collect(Collectors.toSet());

Dictionary data structure + fast complexity methods

I'm trying to build from scratch, a data structure that would be able to hold a vast dictionary (of words/characters).
The "words" can be made out of arbitrarily large number of characters.
The dictionary would need standard methods such as search, insert, delete.
I need the methods to have time complexity better than O(log(n)), i.e. between O(log(n)) and O(1), e.g. O(log(log(n))),
where n = dictionary size (number of elements)
I've looked into various tree structures, for example a B-tree, whose methods are O(log(n)) (not fast enough), as well as a trie, which seemed most appropriate for a dictionary; but because the words can be arbitrarily long, it seemed like its complexity would not be faster than O(log(n)).
Any explanation you could provide would be appreciated.
A trie has significant memory requirements, but its access time is usually faster than O(log n).
If I recall correctly, the access time depends on the length of the word, not on the number of words in the structure.
The efficiency and memory consumption also depend on exactly which implementation of the trie you choose to use. There are some pretty efficient implementations out there.
For more information on Tries see:
http://en.wikipedia.org/wiki/Trie
http://algs4.cs.princeton.edu/52trie/
http://algs4.cs.princeton.edu/52trie/TrieST.java.html
https://www.topcoder.com/community/data-science/data-science-tutorials/using-tries/
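To make the length-dependence concrete, here is a minimal, uncompressed trie sketch restricted to lowercase ASCII (delete is omitted; all names are illustrative and not taken from the linked implementations):
// Minimal trie over 'a'-'z'; contains() costs O(word length),
// independent of how many words are stored.
class Trie {
    private static final int R = 26;
    private final Trie[] next = new Trie[R];
    private boolean isWord;

    void insert(String word) {
        Trie node = this;
        for (int i = 0; i < word.length(); i++) {
            int c = word.charAt(i) - 'a';
            if (node.next[c] == null) {
                node.next[c] = new Trie();
            }
            node = node.next[c];
        }
        node.isWord = true;
    }

    boolean contains(String word) {
        Trie node = this;
        for (int i = 0; i < word.length(); i++) {
            int c = word.charAt(i) - 'a';
            if (node.next[c] == null) {
                return false;
            }
            node = node.next[c];
        }
        return node.isWord;
    }
}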
If your question is how to achieve as few string comparisons as possible, then a hash table is probably a very good answer, as it requires close to O(1) string comparisons. Note that hashing the key value takes time proportional to the string length, as can be the time for string comparison.
But this is nothing new. Can we do better for long strings? To be more precise, we will assume the string length to be bounded by M. We will also assume that the length of every string is known (for long strings, this can make a difference).
First notice that the search time is bounded below by the string length, and is Ω(M) in the worst case: comparing two strings can require comparing all their characters, as the strings may differ only at the last character compared. On the other hand, in the best case, the comparison can conclude immediately, either because the lengths are different or because the strings differ in the first characters compared.
Now you can reason as follows: consider the whole set of strings in the dictionary and find the position of the first character on which they differ. Based on the value of this character, you will decompose the set into a number of subsets. And you can continue this decomposition recursively until you get singletons.
For example,
able
about
above
accept
accident
accompany
is organized as
*bl*
*bou*
*bov*
*c*e**
*c*i****
*c*o*****
where an asterisk stands for a character which is just ignored, and the remaining characters are used for discrimination.
As you can see, in this particular example two or three character comparisons are enough to recognize any word in the dictionary.
This representation can be described as a finite state automaton such that in every state you know which character to check next and what are the possible outcomes, leading to the next states. It has a K-ary tree structure (where K is the size of the alphabet).
For an efficient implementation, every state can be represented by the position of the decision character and an array of links to the next states. Actually, this is a trie structure with path compression. (As said by @peter.petrov, there are many variants of the trie structure.)
How do we use it? There are two situations:
1) the search string is known to be in the dictionary: then a simple traversal of the tree is guaranteed to find it. It will do so after a number of character comparisons equal to the depth of the corresponding leaf in the tree, i.e. O(D) where D is that depth. This can be a very significant saving.
2) the search string may not be in the dictionary: during traversal of the tree you can observe an early rejection; otherwise, in the end you find a single potential match. Then you can't avoid performing an exhaustive comparison, O(1) in the best case, O(M) in the worst. (On average O(M) for random strings, but probably better for real-world distributions.) But you will compare against a single string, never more.
In addition to that device, if your distribution of key lengths is sparse, it may be useful to maintain a hash table of the key lengths, so that immediate rejection of the search string can occur.
As final remarks, notice that this solution has a cost that is not directly a function of N, and that it is likely that time sublinear in M could be achieved by suitable heuristics taking advantage of the particular distribution of the strings.
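As a small illustration of the key-length rejection mentioned above, a hedged sketch (the class and method names are mine):
import java.util.HashSet;
import java.util.Set;

class LengthFilter {
    private final Set<Integer> lengths = new HashSet<>();

    LengthFilter(Iterable<String> dictionaryWords) {
        for (String w : dictionaryWords) {
            lengths.add(w.length());
        }
    }

    // If no dictionary word has this length, the query cannot be present,
    // so the more expensive lookup structure can be skipped entirely.
    boolean mightContain(String query) {
        return lengths.contains(query.length());
    }
}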

Huffman Code Decoder Encoder In Java Source Generation

I want to create a fast Huffman code decoder in Java and therefore thought about lookup tables. Since those tables consume memory and we use Java code to navigate and access the tables, one can easily (or not) write a program/method that expresses the same table.
The problem with that approach is that I don't know what the best strategy is. I know it is a lot about what fits in the cache and branch prediction. Also, the switch-case implementation, meaning the actual assembly, is beyond me. If I have an in-memory lookup table (or a hierarchy of them) I will be able to simply jump in and out, but I doubt that for my purposes such a table would fit in the cache.
Since I actually walk a tree, one could implement it as if/else statements requiring a certain number of comparisons, but each comparison would need additional binary operations.
So the following options exist:
General algorithm using in-memory lookup tables
If/else representation of the decision tree
If/else representation with small switch statements to find the correct group of symbols (same bit pattern length) (fewer if statements, might be more code).
Switch statement representation of the code
Writing and benchmarking is quite tricky so any initial thoughts would be great.
One additional problem that comes into play is the order of bits. The most significant bit always comes first, meaning it is stored in reverse order.
If your tree is A = 0, B = 10, C = 11, then to write BAC it would actually be 01 + 0 + 11 (plus means append).
So the code actually has to be written in reverse order. Using the if/else or switch approach for groups this would not be a problem, since masking out the bits is simple and reversing a bit is straightforward, but it would lose the idea of getting the index within the group out of the mask, since in reverse bit order add and remove have different meanings and a simple lookup is not possible either.
Reversing the bits is a costly operation (I use 4-bit lookup tables), not outweighing the performance penalty of the binary operations.
But reversing the bits on the go is better suited for this and requires four operations per bit (shifting up, masking out, adding, and also shifting the input down). Since I read bits ahead, all those operations will be done in registers, so they might take only a few cycles.
This way I can use switch, subtract and if to find the right symbol group and also to return those symbols.
So finally I need advice. Since my codes are global for language processing, they can be hardwired (i.e. be in source).
I wonder what parser generators like ANTLR use to express those decisions. Since they also seem to switch or if/else based on the input symbol, it might give me a clue.
[Updates]
I found a simplification that avoids the reverse-bit problem but still adds a cost per group. So I end up writing the bits in the order of the groups to traverse. So I will not need four modifications per bit but per group (different bit lengths).
For each group we have:
1. The value for the first element and the size (and therefore the value for the last element within that group).
Therefore for each group the algorithm looks like:
1. Read m bits and combine them with the current read value.
2. Compare the value with the last value of that group: if it is smaller, it is within that group; if not, it is outside -> read the next group.
3. If it is inside the group, an array of values can be accessed, or a switch statement can be used.
This is totally generic and can be used without loops, making it efficient. Also, once the group is detected, the bit length of the code is known and the bits can be consumed from the source, since the code looks far ahead (reading from the stream).
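The steps above amount to canonical Huffman decoding by code-length groups. A hedged sketch, assuming codes are assigned canonically within each length and bits arrive one at a time, most significant bit first; the array layout and names are my assumptions:
import java.util.PrimitiveIterator;

class GroupDecoder {
    private final int[] countPerLength; // countPerLength[len] = number of codes of that bit length
    private final int[] symbols;        // symbols ordered by (code length, code value)

    GroupDecoder(int[] countPerLength, int[] symbols) {
        this.countPerLength = countPerLength;
        this.symbols = symbols;
    }

    // bits.nextInt() must return the next code bit (0 or 1).
    int decode(PrimitiveIterator.OfInt bits) {
        int code = 0;   // bits read so far, interpreted as a value
        int first = 0;  // first canonical code of the current length
        int index = 0;  // index of the first symbol of the current length
        for (int len = 1; len < countPerLength.length; len++) {
            code |= bits.nextInt();
            int count = countPerLength[len];
            if (code - first < count) {         // value falls inside this group
                return symbols[index + (code - first)];
            }
            index += count;                     // skip this group's symbols
            first = (first + count) << 1;       // first code of the next length
            code <<= 1;                         // make room for the next bit
        }
        throw new IllegalStateException("invalid Huffman code");
    }
}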
[Update 2]
To access the actual value one could use a single big array of elements, grouped by group. Since the probability decreases from group to group, it is very likely that a significant part fits into the L2 or L1 cache, speeding up access here.
Or one uses switch statements.
[Update 3]
Depending on the cases of a switch, the compiler generates either a tableswitch or a lookupswitch. The lookupswitch has a complexity of O(log n) and stores key/jump-offset pairs, which is not preferable. Therefore checking for groups is better suited to if/else.
The tableswitch itself uses only a table of jump offsets, and it only takes a subtract, compare, access and jump to reach the destination; then it must execute a return of a constant value.
Therefore a table access looks more promising. Also, to avoid an unnecessary jump, each group might contain the logic to access and return the group's symbol table. Storing everything in one big table is promising since it might be an int or short per symbol, and my codes often have only 1000 to 4000 symbols at most, making short sufficient.
I will check whether a 1-pattern will give me the opportunity to store and access the masks in a better way, allowing for binary-searching the correct group instead of advancing in O(n), and might even avoid any shift operations at all during the processing.
I couldn't make sense of most of what you wrote in your (long) question, but there is a simple approach.
We'll start with a single table. Let's say your longest Huffman code is 15 bits. (In fact, deflate limits the size of its Huffman codes to 15 bits.) Then construct a table with 32768 entries, where each entry is the number of bits in the next code, and the symbol for that code. For codes less than 15 bits, there is more than one entry in the table for the same code. E.g. if the code is 1001011 (7 bits) for the symbol 'C', then all of the indexes of the table xxxxxxxx1001011 have the same thing. Those entries all have {7, 'C'}.
Then you get 15 bits from the stream, and look up the next code in the table. You remove the number of bits from that table entry, and use the resulting symbol. Now you get as many bits from the stream as you need to have 15, and repeat. So if you used 7 bits, then get 8 more to get back to 15 and look up the next code.
The next subtlety is that if your Huffman code changes often, you might end up spending more time filling up that large table for each new Huffman code than you spend actually decoding. To avoid that, you can make a two-level table which has, say, a 9-bit lookup (512 entries) for the first portion of the code. If the code is 9-bits or less, then you proceed as above. That will be the most common case, since shorter codes are more frequent (that being the whole point of Huffman coding). If the table entry says that there are 10 or more bits in the code (and you don't know yet how much more), then you consume the first nine bits and go to a second-level table for those initial nine bits pointed to by the entry in the first table, that has entries for the remaining six bits (64 entries). That resolves the remainder of the code and so tells you how many more bits to consume and what the symbol is. This approach can greatly reduce the time spent filling tables, and is very nearly as fast since short codes are more common. This is the approach used by inflate in zlib.
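A hedged sketch of the single-level table described above; the names, and the convention that the code occupies the low bits of the index (matching the xxxxxxxx1001011 example), are my assumptions:
class HuffmanTable {
    static final int MAX_BITS = 15;

    final byte[] lengths = new byte[1 << MAX_BITS]; // bits consumed by the code at this index
    final int[] symbols = new int[1 << MAX_BITS];   // symbol decoded at this index

    // Register a code of 'len' bits: every index whose low 'len' bits equal
    // 'code' maps to the same {len, sym} pair (the high bits are don't-cares).
    void add(int code, int len, int sym) {
        for (int high = 0; high < (1 << (MAX_BITS - len)); high++) {
            int index = (high << len) | code;
            lengths[index] = (byte) len;
            symbols[index] = sym;
        }
    }

    // 'peek' holds the next MAX_BITS bits of the stream packed into an int.
    // The caller reads symbols[peek], then consumes lengths[peek] bits and
    // refills the peek buffer back up to MAX_BITS bits before the next call.
    int decode(int peek) {
        return symbols[peek];
    }

    int bitsUsed(int peek) {
        return lengths[peek];
    }
}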
In the end it was quite simple. I support almost all solutions now. One can test every symbol group (same bit length), use a lookup table (10 bit + 10 bit + 10 bit (just tables of 10 bits; symbol count + 1 is the reference to those tables)) and generate Java (and, if needed, JavaScript; currently I use GWT to translate it).
I even use long reads and shift operations to reduce the accesses to the binary information. This way the code gets more efficient, since I only support a maximum bit size (20 bits (so a table of tables), which makes 2^20 symbols and therefore at most a million).
For the ordering I use a generator for the bit masks just using shift operations, with no requirement to reverse bit orders or such.
The table lookups can also be expressed in Java by storing the tables as arrays of arrays (it's interesting how big the Java files can be without the compiler complaining).
Also I found it interesting that, since comparing expresses an ordering (a half order, I guess), one can sort the symbols and, instead of mapping the symbols, map the comparison index. By comparing two indexes one can simply sort streams of codes without touching much. By also storing the first one or two comparison indexes (16 or 32 bits) one can efficiently sort, and therefore binary-sort, compressed strings using the same Huffman code, which makes it ideal for compressing strings in a certain language.

Collection to store primitive ints that allows for faster contains() & ordered iteration

I need a space-efficient collection to store a large list of primitive ints (around 800,000 of them) that allows for fast contains() operations and allows for iteration in a defined order.
Fast contains() operations, to check whether an int is in the list or not, are the main priority, as that is done very frequently.
I'm open to using widely used & popular 3rd party libraries like Trove, Guava & such others.
I have looked at TIntSet from Trove, but I believe that would not let me define the order of iteration in any way.
Edit:
The size of the collection would be around 800,000 ints.
The range of values in the collection will be from 0 to Integer.MAX_VALUE. The order of iteration should actually be based on the order in which I add the values to the collection, or maybe I just provide an ordered int[] and it should iterate in the same order.
As data structure I would choose an array of longs (which I logically treat as two ints). The high-int part (bits 63-32) represents the int value you add to the collection. The low-int part (bits 31-0) represents the index of the successor when iterating. In the case of your 800,000 unique integers you need to create a long array of size 800,000.
Now you organize the array as a binary balanced tree ordered by your values. To the left the smaller values and to the right the higher values. You need two more tracking values: one int to point to the first index to start iterating at and one int to point to the index of the value inserted last.
Whenever you add a new value, reorganize your binary balanced tree and update the pointer from the last value added pointing to the currently added value (as indexes).
Wrap these values (the array and both int values) as the collection of your choice.
With this data structure you get search performance of O(log(n)) and memory usage of twice the size of the values.
This reeks of a database, but since you require a more direct approach, use a memory-mapped file via java.nio. In particular, a self-defined ordering of 800,000 ints will not do otherwise. The contains() could be realized with a BitSet in memory, though, in parallel to the ordering in the file.
You can use two sets: one hash-based set (e.g. TIntSet) for fast contains() operations, and another tree-based set like TreeSet to iterate in a specific order.
And when you need to add an int, you update both sets at the same time.
It sounds like LinkedHashSet might be what you're looking for. Internally, it maintains a hash table plus a doubly-linked list running through its entries, allowing for both fast contains() (from the former) and a defined, insertion-order iteration (from the latter).
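A tiny sketch of that (note the boxing overhead of Integer compared with the primitive collections mentioned earlier):
import java.util.LinkedHashSet;
import java.util.Set;

public class LinkedHashSetDemo {
    public static void main(String[] args) {
        Set<Integer> ints = new LinkedHashSet<>();
        ints.add(42);
        ints.add(7);
        ints.add(100);

        System.out.println(ints.contains(7)); // true, hash-based lookup

        // Iteration follows insertion order: 42, 7, 100
        for (int value : ints) {
            System.out.println(value);
        }
    }
}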
Just use an ArrayList<Integer>.
