Multi-key hashmap - strong/weak key-based lookup - Java

I'm building a simple multi-key hash map in a Java-based application. It should return a lookup value for different combinations of keys, where all keys and values are plain strings. Let's say below is a sample data set:
Key1 | Key2 | Key3 | Key4 | Result
T1   | T2   | T3   | T4   | A1
*    | *    | T3   | T4   | A4
T1   | T2   | T3   | *    | A2
*    | T1   | *    | T4   | A2
where * indicates ANY value.
The hash map will be keyed on keys 1-4, with the result as its lookup value. The lookup keys will always have specific values (such as T1, T2); it's only the data set that has * (ANY) values. I'm trying to figure out the best possible way to look up the correct value based on the most specific key match.
For example, the key combination T1,T2,T3,T4 (from above) should return A1, whereas the key combination B1,B2,T3,T4 should return A4.
Any ideas would be really appreciated. The preference is to do it in plain Java without any additional libraries/frameworks, but I'm happy to look at them if need be.
Thanks a lot
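One possible direction (only a sketch, not a complete solution): keep each data-set row as a rule with a specificity score equal to its number of non-* keys, and at lookup time return the result of the most specific matching rule. The Rule and RuleTable names below are purely illustrative:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class Rule {
    final String[] keys;   // "*" means ANY
    final String result;

    Rule(String result, String... keys) {
        this.keys = keys;
        this.result = result;
    }

    // A rule matches if every non-* key equals the corresponding lookup key.
    boolean matches(String... lookup) {
        for (int i = 0; i < keys.length; i++) {
            if (!keys[i].equals("*") && !keys[i].equals(lookup[i])) return false;
        }
        return true;
    }

    // Specificity = number of concrete (non-*) keys; higher wins.
    int specificity() {
        int s = 0;
        for (String k : keys) if (!k.equals("*")) s++;
        return s;
    }
}

class RuleTable {
    private final List<Rule> rules = new ArrayList<>();

    void add(Rule rule) { rules.add(rule); }

    // Return the result of the most specific matching rule, if any.
    Optional<String> lookup(String... keys) {
        return rules.stream()
                .filter(r -> r.matches(keys))
                .max(Comparator.comparingInt(Rule::specificity))
                .map(r -> r.result);
    }
}

With the sample data above, lookup("T1","T2","T3","T4") matches the all-concrete rule and returns A1, while lookup("B1","B2","T3","T4") only matches the *,*,T3,T4 rule and returns A4. If the rule set grows large, an exact HashMap lookup on the full key combination can be tried first, with the wildcard scan used only as a fallback.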

Related

What data structures to use to build a formula evaluator

My team is building an application which has to evaluate many user-defined formulas. It is a replacement for a huge spreadsheet that our customers use. Each formula uses simple arithmetic (mostly) and a few math functions. We are using an expression evaluation library called Parsii to do the actual formula evaluation, but the formulas have to be evaluated in the order of their dependencies. For example:
F1 = a + b
F2 = F1 * 10%
F3 = b / 2
F4 = F2 + F3
In the example above, a and b are values input by the users. The system should compute F1 & F3 first, since they depend directly on user input. Then F2 should be computed, and finally F4.
My question is: what data structure is recommended to model these formula-evaluation dependencies?
We have currently modeled it as a DIRECTED GRAPH: in the example above, F1 & F3 are the root nodes, F2 is connected to F1, and F4 is connected to both F2 and F3, making F4 the leaf node. We've used the Tinkerpop3 graph implementation to model this.
Any data structure used to model this should have following characteristics.
- Easy to change the input data of a few top-level root nodes (based on user input)
- Re-calculate only those formulas that are dependent on the root nodes that got changed (since we have 100s of formulas in a specific calculation context and have to respond back to the GUI layer within 1-2 secs)
- Minimize the amount of code to create the data structure via some existing libraries.
- Be able to query/look up the root nodes by various keys (name of the formula object, id of the object, year, etc.) and be able to edit the properties of those nodes.
Do you store this in a flat file currently?
If you wish to have better queryability, and easier modification, then you could store it as a DAG on database tables.
Maybe something like this (I expect the real solution to be somewhat different):
+-----------------------------------------------------------+
|                          FORMULA                          |
+------------+--------------+----------------+--------------+
| ID (PK)    | FORMULA_NAME | FORMULA_STRING | FORMULA_YEAR |
+============+==============+================+==============+
| 1          | F1           | a + b          |              |
+------------+--------------+----------------+--------------+
| 2          | F2           | F1 * 10%       |              |
+------------+--------------+----------------+--------------+
| 3          | F3           | b / 2          |              |
+------------+--------------+----------------+--------------+
| 4          | F4           | F2 + F3        |              |
+------------+--------------+----------------+--------------+
+--------------------------------------+
|         FORMULA_DEPENDENCIES         |
+-----------------+--------------------+
| FORMULA_ID (FK) | DEPENDS_ON_ID (FK) |
+=================+====================+
| 2               | 1                  |
+-----------------+--------------------+
| 4               | 2                  |
+-----------------+--------------------+
| 4               | 3                  |
+-----------------+--------------------+
With this you also have the security of knowing immediately if a formula depends on a non-existent formula, because that would violate the DEPENDS_ON_ID foreign key. You can also detect whether any of the formulas form a cycle of dependencies, e.g. where F1 depends on F2, which depends on F3, which depends on F1, by walking the dependency rows (for instance with a recursive query).
Additionally you can easily add whatever metadata you wish to the tables and index on whatever you might query on.
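Once the dependency rows are loaded back into memory, a plain topological sort gives a valid evaluation order and also reveals cycles. Below is a minimal sketch using the example formulas, with the dependencies held in a plain map (the class name and variables are illustrative, not part of any existing design):

import java.util.*;

// A hypothetical sketch (not the Tinkerpop3 model): dependencies as an adjacency map,
// with Kahn's algorithm producing a valid evaluation order.
public class FormulaOrder {
    public static void main(String[] args) {
        Map<String, Set<String>> dependsOn = new LinkedHashMap<>();
        dependsOn.put("F1", Set.of());            // F1 = a + b   (only user input)
        dependsOn.put("F2", Set.of("F1"));        // F2 = F1 * 10%
        dependsOn.put("F3", Set.of());            // F3 = b / 2   (only user input)
        dependsOn.put("F4", Set.of("F2", "F3"));  // F4 = F2 + F3

        // Count unmet dependencies per formula.
        Map<String, Integer> remaining = new HashMap<>();
        dependsOn.forEach((f, deps) -> remaining.put(f, deps.size()));

        // Formulas with no unmet dependencies are ready to evaluate.
        Deque<String> ready = new ArrayDeque<>();
        remaining.forEach((f, n) -> { if (n == 0) ready.add(f); });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String f = ready.poll();
            order.add(f);
            // Any formula that depends on f now has one fewer unmet dependency.
            dependsOn.forEach((g, deps) -> {
                if (deps.contains(f) && remaining.merge(g, -1, Integer::sum) == 0) ready.add(g);
            });
        }
        // order is e.g. [F1, F3, F2, F4]; if order.size() < dependsOn.size(),
        // the remaining formulas form a dependency cycle.
        System.out.println(order);
    }
}

For incremental recalculation, the same adjacency information can be walked in reverse from the changed root nodes, so that only the formulas reachable from them are re-evaluated.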

Implement minhash LSH using Spark (Java)

This is quite long, and I am sorry about that.
I have been trying to implement the MinHash LSH algorithm discussed in chapter 3 using Spark (Java). I am using a toy problem like this:
+--------+------+------+------+------+
|element | doc0 | doc1 | doc2 | doc3 |
+--------+------+------+------+------+
| d | 1 | 0 | 1 | 1 |
| c | 0 | 1 | 0 | 1 |
| a | 1 | 0 | 0 | 1 |
| b | 0 | 0 | 1 | 0 |
| e | 0 | 0 | 1 | 0 |
+--------+------+------+------+------+
The goal is to identify, among these four documents (doc0, doc1, doc2 and doc3), which ones are similar to each other. Obviously, the only possible candidate pair is doc0 and doc3.
Using Spark's support, generating the following "characteristic matrix" is as far as I have gotten at this point:
+----+---------+-------------------------+
|key |value |vector |
+----+---------+-------------------------+
|key0|[a, d] |(5,[0,2],[1.0,1.0]) |
|key1|[c] |(5,[1],[1.0]) |
|key2|[b, d, e]|(5,[0,3,4],[1.0,1.0,1.0])|
|key3|[a, c, d]|(5,[0,1,2],[1.0,1.0,1.0])|
+----+---------+-------------------------+
and here are the code snippets:
import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.MinHashLSH;
import org.apache.spark.ml.feature.MinHashLSHModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// df holds the key/value columns shown above
CountVectorizer vectorizer = new CountVectorizer().setInputCol("value").setOutputCol("vector").setBinary(false);
Dataset<Row> matrixDoc = vectorizer.fit(df).transform(df);
MinHashLSH mh = new MinHashLSH()
        .setNumHashTables(5)
        .setInputCol("vector")
        .setOutputCol("hashes");
MinHashLSHModel model = mh.fit(matrixDoc);
Now, there seem to be two main calls on the MinHashLSHModel that one can use: model.approxSimilarityJoin(...) and model.approxNearestNeighbors(...). Examples of using these two calls are here: https://spark.apache.org/docs/latest/ml-features.html#lsh-algorithms
First, model.approxSimilarityJoin(...) requires joining two datasets, and I have only one dataset of 4 documents in which I want to find the ones similar to each other, so I don't have a second dataset to join... Just to try it out, I joined my only dataset with itself. Based on the result, it seems that model.approxSimilarityJoin(...) just did a pair-wise Jaccard calculation, and I don't see any impact from changing the number of hash functions etc., which left me wondering where exactly the MinHash signatures were calculated and where the band/row partitioning happened...
The other call, model.approxNearestNeighbors(...), asks for a comparison point, and the model then identifies the nearest neighbor(s) to that point... Obviously, this is not what I want either, since I have four toy documents and no extra reference point.
I am running out of ideas, so I went ahead and implemented my own version of the algorithm using the Spark APIs, but with little support from MinHashLSHModel, which makes me feel I must have missed something... ??
I would love to hear any thoughts, really wish to solve the mystery.
Thank you guys in advance!
The MinHash signature calculation happens inside model.approxSimilarityJoin(...) itself, where model.transform(...) is called on each of the input datasets and the hash signatures are computed before joining them and doing the pair-wise Jaccard distance calculation. So the impact of changing the number of hash functions can be seen there.
In model.approxNearestNeighbors(...), the impact of the same can be seen while creating the model with minHash.fit(...), in which transform(...) is called on the input dataset.
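For the self-join use case described in the question, a common pattern is to pass the same dataset on both sides of approxSimilarityJoin and keep each pair once. A sketch only, reusing model and matrixDoc from the snippet above; the 0.6 threshold and the output column name "jaccardDist" are illustrative choices:

import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Join the dataset with itself; pairs whose Jaccard distance exceeds 0.6 are dropped.
Dataset<Row> pairs = model.approxSimilarityJoin(matrixDoc, matrixDoc, 0.6, "jaccardDist");

// Every candidate pair shows up twice (A-B and B-A) plus the trivial self-pairs,
// so keep only one ordering. "key" is the id column from the toy data above.
Dataset<Row> candidates = pairs
        .filter(col("datasetA.key").lt(col("datasetB.key")))
        .select(col("datasetA.key").alias("docA"),
                col("datasetB.key").alias("docB"),
                col("jaccardDist"));

candidates.show(false);

On a toy set this small, most pairs become join candidates regardless of the number of hash tables, which is likely why changing setNumHashTables(...) appeared to have no effect.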

Can 2 AtomicReferences false share?

When I have
Object a;
Object b;
I can get false sharing.
This way I don't:
@Contended
Object a;
Object b;
But if I have
final AtomicReference<Object> a;
final AtomicReference<Object> b;
do I still get false sharing?
My guess is that I don't need @Contended, because although a and b may be in the same cache line, what they refer to is not...
Instances of AtomicReference usually take 16 bytes (on HotSpot JVM): 12 bytes object header + 4 bytes value field. If two AtomicReferences lie next to each other in Java heap, they may still share the same cache line, which is typically 64 bytes.
Note: even if you allocate some object between the two AtomicReferences, garbage collection may compact the heap so that the AtomicReferences are again located next to each other.
There are several ways to avoid false sharing:
1. Extend the AtomicReference class and add at least 6 long fields - this will make your references occupy 64 bytes or more:
-------------------------------------------------------------------------------
| header | value | long1 | ... | long6 | header | value | long1 | ... | long6 |
-------------------------------------------------------------------------------
^                                      ^
|-------------- 64 bytes --------------|
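A minimal sketch of this first option (the class and field names are illustrative):

import java.util.concurrent.atomic.AtomicReference;

public class PaddingExample {
    // Padding subclass: the 6 long fields add ~48 bytes after the inherited value
    // field, so two instances allocated next to each other no longer fit into a
    // single 64-byte cache line.
    static class PaddedAtomicReference<T> extends AtomicReference<T> {
        long p1, p2, p3, p4, p5, p6;
        PaddedAtomicReference(T initialValue) { super(initialValue); }
    }

    // The two hot references now end up on different cache lines.
    final PaddedAtomicReference<Object> a = new PaddedAtomicReference<>(new Object());
    final PaddedAtomicReference<Object> b = new PaddedAtomicReference<>(new Object());
}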
2. Use AtomicReferenceArray and place your references at a distance of at least 16 cells, e.g. one reference at index 16 and the other at index 32.
-------------------------------------------------------------
| header | len | 0 | 1 | ... | 15 | 16 | 17 | ... | 31 | 32 |
-------------------------------------------------------------
               ^                  ^
               |---- 64 bytes ----|
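And a sketch of the second option, assuming 4-byte array slots (compressed oops) so that 16 slots span a full 64-byte cache line; the index and array-size choices are illustrative:

import java.util.concurrent.atomic.AtomicReferenceArray;

public class SpacedReferences {
    // With 4-byte slots, indexes 16 and 32 are exactly 64 bytes apart,
    // so the two slots can never land in the same cache line.
    private static final int A = 16, B = 32;
    private final AtomicReferenceArray<Object> slots = new AtomicReferenceArray<>(48);

    void setA(Object value) { slots.set(A, value); }
    void setB(Object value) { slots.set(B, value); }
    Object getA()           { return slots.get(A); }
    Object getB()           { return slots.get(B); }
}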

Why does Set insert the items in order?

Actually, Set is not an ordered collection. I just create a set and insert the numbers 5, 2, 10.
When it is printed to the console, it prints as 2, 5, 10.
Why, since a set is not ordered?
This is because the internal organization of the set speeds up queries for whether a certain element is part of the set.
The catch is that this behavior is not guaranteed. It may be beneficial to keep small sets ordered for fast lookup, but switch to a hash-based implementation once a certain number of elements has been reached, at which point the elements would suddenly be ordered by hash value.
Set is an interface with several implementations. HashSet does not guarantee your insertion order (not ordered), LinkedHashSet preserves insertion order, and TreeSet gives you a sorted set.
When you insert 5, 2, 10 into a HashSet you can't rely on getting the same order back.
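A small sketch contrasting the three implementations mentioned above (the HashSet output shown in the comments happens to hold for small Integers and the default capacity, but it is not guaranteed):

import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeSet;

public class SetOrderDemo {
    public static void main(String[] args) {
        Set<Integer> hash   = new HashSet<>();        // no defined order
        Set<Integer> linked = new LinkedHashSet<>();  // insertion order
        Set<Integer> tree   = new TreeSet<>();        // sorted order

        for (int n : new int[]{5, 2, 10}) {
            hash.add(n);
            linked.add(n);
            tree.add(n);
        }

        System.out.println(hash);   // [2, 5, 10] here, by coincidence of bucket indexes
        System.out.println(linked); // [5, 2, 10]
        System.out.println(tree);   // [2, 5, 10]
    }
}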
Set is just an interface. Assuming you are talking about HashSet (because that's where this happens), it doesn't deliberately keep elements sorted. For example:
HashSet<Integer> set = new HashSet<Integer>();
set.add(1);
set.add(16);
System.out.println(set);
Output
[16, 1]
This is because a HashSet uses the hashCode() method to compute the index where the item will be stored in an array-like structure. Since the hash code of an object never changes, the set can later find the element again by recomputing the hash code and checking the cell at that index.
The hashCode() method maps an object to an integer:
System.out.println(new Integer(1).hashCode());    // 1
System.out.println(new Integer(1000).hashCode()); // 1000
System.out.println("Hello".hashCode());           // 69609650
Each class can define its own way to compute the hash code; Integer simply returns its own value.
As you can see, the numbers get big quickly, and we don't want an array with 1000 cells just to store two integers.
To avoid that, we create an array with n elements and use the remainder of the hash code divided by n as the index.
For example, to find the index for 1000 in an array of 16 elements:
System.out.println(new Integer(1000).hashCode() % 16); // 8
So the set will know that the integer 1000 is at index 8. That's essentially how HashSet is implemented.
So, why is [16, 1] not ordered? That's because a HashSet is created with a capacity of 16 (when not specified otherwise) and grows as needed.
Let's compute the indexes where 1 and 16 will be stored in a set with n = 16:
System.out.println(new Integer(1).hashCode() % 16);  // 1
System.out.println(new Integer(16).hashCode() % 16); // 0
This means that the backing array will be:
| index | value |
|-------|-------|
| 0     | 16    |
| 1     | 1     |
| 2     |       |
| 3     |       |
| 4     |       |
| 5     |       |
| 6     |       |
| 7     |       |
| 8     |       |
| 9     |       |
| 10    |       |
| 11    |       |
| 12    |       |
| 13    |       |
| 14    |       |
| 15    |       |
Iterating over it, the order will be the one presented in this representation, so 16 will be before 1.
Set is an interface; it only guarantees that the collection contains no duplicate elements.
HashSet internally uses a HashMap, and a HashMap places entries by hash code, so it won't return elements in any particular order. If you want insertion order, use LinkedHashSet.
Set is just an interface. Ordering will depend on implementation. For example TreeSet is an ordered implementation of Set.

Performance testing: meaningful graph of a 3-variable statistical result

I'm performing performance testing of a computer application (Java). The test measures the response time (t) obtained when exercising the application with a certain number of concurrent threads (th) and a certain amount of data (d).
Suppose I have the following results:
+------+-------+-----+
| th | d | t |
+------+-------+-----+
| 2 | 500 | A |
+------+-------+-----+
| 4 | 500 | B |
+------+-------+-----+
| 2 | 1000 | C |
+------+-------+-----+
| 4 | 1000 | D |
+------+-------+-----+
How can I get the most out of these results, such as finding the limits of my app, and how can I create meaningful graphs to represent them?
I'm not a statistics person, so pardon my ignorance. Any suggestions would be really helpful (even related statistics keywords I can Google).
Thanks in advance.
EDIT
The tricky part for me is showing the application's performance evolution while taking both the number of threads and the amount of data into account in one plot.
Yes, there is a way; check the following example I made in Paint (the numbers I picked are just random):
