This is quite long, and I am sorry about that.
I have been trying to implement the MinHash LSH algorithm discussed in chapter 3 using Spark (Java). I am using a toy problem like this:
+--------+------+------+------+------+
|element | doc0 | doc1 | doc2 | doc3 |
+--------+------+------+------+------+
| d | 1 | 0 | 1 | 1 |
| c | 0 | 1 | 0 | 1 |
| a | 1 | 0 | 0 | 1 |
| b | 0 | 0 | 1 | 0 |
| e | 0 | 0 | 1 | 0 |
+--------+------+------+------+------+
The goal is to identify, among these four documents (doc0, doc1, doc2 and doc3), which ones are similar to each other. Obviously, the only reasonable candidate pair is doc0 and doc3.
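(For reference: from the matrix, doc0 = {a, d} and doc3 = {a, c, d}, so their Jaccard similarity is |{a, d}| / |{a, c, d}| = 2/3, higher than that of any other pair.)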
Using Spark's support, generating the following "characteristic matrix" is as far as I can reach at this point:
+----+---------+-------------------------+
|key |value |vector |
+----+---------+-------------------------+
|key0|[a, d] |(5,[0,2],[1.0,1.0]) |
|key1|[c] |(5,[1],[1.0]) |
|key2|[b, d, e]|(5,[0,3,4],[1.0,1.0,1.0])|
|key3|[a, c, d]|(5,[0,1,2],[1.0,1.0,1.0])|
+----+---------+-------------------------+
Here are the code snippets:
CountVectorizer vectorizer = new CountVectorizer()
.setInputCol("value")
.setOutputCol("vector")
.setBinary(false);
Dataset<Row> matrixDoc = vectorizer.fit(df).transform(df);
MinHashLSH mh = new MinHashLSH()
.setNumHashTables(5)
.setInputCol("vector")
.setOutputCol("hashes");
MinHashLSHModel model = mh.fit(matrixDoc);
Now, there seem to be two main methods on the MinHashLSHModel that one can use: model.approxSimilarityJoin(...) and model.approxNearestNeighbors(...). Examples of using these two calls are here: https://spark.apache.org/docs/latest/ml-features.html#lsh-algorithms
On the other hand, model.approxSimilarityJoin(...) requires two datasets to join, and I have only one dataset of 4 documents and want to figure out which of these four are similar to each other, so I don't have a second dataset to join... Just to try it out, I actually joined my only dataset with itself. Based on the result, it seems that model.approxSimilarityJoin(...) just did a pair-wise Jaccard calculation, and I don't see any impact from changing the number of hash functions etc., which left me wondering where exactly the MinHash signatures were calculated and where the band/row partitioning happened...
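For reference, this is roughly what my self-join attempt looks like (a sketch; the 0.6 Jaccard-distance threshold is just something I picked for the toy data, since distance = 1 - similarity):
import static org.apache.spark.sql.functions.col;
// Self-join the single dataset with itself; the 0.6 threshold keeps pairs with similarity >= 0.4.
Dataset<Row> candidatePairs = model
.approxSimilarityJoin(matrixDoc, matrixDoc, 0.6, "JaccardDistance")
.filter("datasetA.key < datasetB.key")   // drop self-pairs and mirrored duplicates
.select(col("datasetA.key").alias("keyA"),
        col("datasetB.key").alias("keyB"),
        col("JaccardDistance"));
candidatePairs.show(false);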
The other call, model.approxNearestNeighbors(...), asks for a comparison point, and then the model will identify the nearest neighbor(s) to that point... Obviously, this is not what I want either, since I have four toy documents and no extra reference point.
Running out of ideas, I went ahead and implemented my own version of the algorithm using the Spark APIs, but with little support from the MinHashLSHModel, which made me feel I must have missed something... ??
I would love to hear any thoughts, really wish to solve the mystery.
Thank you guys in advance!
The MinHash signature calculation happens in model.approxSimilarityJoin(...) itself, where model.transform(...) is called on each of the input datasets and the hash signatures are calculated before joining them and doing the pair-wise Jaccard distance calculation. So the impact of changing the number of hash functions can be seen there.
In model.approxNearestNeighbors(...), the impact of the same can be seen while creating the model using minHash.fit(...), in which transform(...) is called on the input dataset.
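One quick way to see the signatures for yourself (a small sketch, reusing the matrixDoc and model from the question):
// Compute the signatures explicitly and look at them. Each row's "hashes" column is an
// array of numHashTables one-element vectors, so changing setNumHashTables(...) changes
// the length of that array.
Dataset<Row> signatures = model.transform(matrixDoc);
signatures.select("key", "hashes").show(false);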
My team is building an application which has to evaluate many user-defined formulas. It is a replacement for a huge spreadsheet that our customers use. Each formula uses simple arithmetic (mostly) and a few math functions. We are using an expression evaluation library called Parsii to do the actual formula evaluation. But the formulas have to be evaluated in the order of their dependencies. For example:
F1 = a + b
F2 = F1 * 10%
F3 = b / 2
F4 = F2 + F3
In the example above, a and b are values input by users. The system should compute F1 and F3 first, since they depend directly on user input. Then F2 should be computed, and finally F4.
My question is: what data structure is recommended to model these formula evaluation dependencies?
We have currently modeled it as a DIRECTED GRAPH. In the example above, F1 and F3 are the root nodes, F2 is connected to F1, and F4 is connected to both F2 and F3, making F4 the leaf node. We've used the Tinkerpop3 graph implementation to model this.
Any data structure used to model this should have the following characteristics.
- Easy to change some input data of few top level root nodes (based on user input)
- Re-calculate only those formulas that depend on the root nodes that changed (we have hundreds of formulas in a given calculation context and have to respond to the GUI layer within 1-2 seconds); see the sketch after this list
- Minimize the amount of code to create the data structure via some existing libraries.
- Be able to query/look up the root nodes by various keys (name of the formula object, id of the object, year, etc.) and be able to edit the properties of those nodes.
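To make the second requirement concrete, this is roughly the kind of incremental recalculation we are after (a rough sketch with a plain adjacency map instead of Tinkerpop; all names are made up):
import java.util.*;
// Formulas keyed by name, plus a map from each node to the nodes that depend on it,
// so a changed input only re-evaluates its dependents.
class FormulaGraph {
    private final Map<String, String> formulas = new HashMap<>();          // name -> expression
    private final Map<String, Set<String>> dependents = new HashMap<>();   // name -> who uses it

    void addFormula(String name, String expression, String... dependsOn) {
        formulas.put(name, expression);
        for (String dep : dependsOn) {
            dependents.computeIfAbsent(dep, k -> new HashSet<>()).add(name);
        }
    }

    // Everything reachable from the changed inputs, i.e. the only formulas to recompute.
    Set<String> affectedBy(Collection<String> changedInputs) {
        Set<String> affected = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(changedInputs);
        while (!queue.isEmpty()) {
            for (String next : dependents.getOrDefault(queue.poll(), Set.of())) {
                if (affected.add(next)) queue.add(next);
            }
        }
        return affected;   // still needs topological ordering before evaluation
    }
}
// Usage with the example above: when 'a' changes, F1, F2 and F4 are affected, F3 is not.
// FormulaGraph g = new FormulaGraph();
// g.addFormula("F1", "a + b", "a", "b");
// g.addFormula("F2", "F1 * 10%", "F1");
// g.addFormula("F3", "b / 2", "b");
// g.addFormula("F4", "F2 + F3", "F2", "F3");
// g.affectedBy(List.of("a"));   // -> [F1, F2, F4]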
Do you store this in a flat file currently?
If you wish to have better queryability and easier modification, you could store it as a DAG in database tables.
Maybe something like this (I expect the real solution to be somewhat different):
+-----------------------------------------------------------+
| FORMULA |
+------------+--------------+----------------+--------------+
| ID (PK) | FORMULA_NAME | FORMULA_STRING | FORMULA_YEAR |
+============+==============+================+==============+
| 1 | F1 | a + b | |
+------------+--------------+----------------+--------------+
| 2 | F2 | F1 * 10% | |
+------------+--------------+----------------+--------------+
| 3 | F3 | b / 2 | |
+------------+--------------+----------------+--------------+
| 4 | F4 | F2 + F3 | |
+------------+--------------+----------------+--------------+
+--------------------------------------+
| FORMULA_DEPENDENCIES |
+-----------------+--------------------+
| FORMULA_ID (FK) | DEPENDS_ON_ID (FK) |
+=================+====================+
| 2 | 1 |
+-----------------+--------------------+
| 4 | 2 |
+-----------------+--------------------+
| 4 | 3 |
+-----------------+--------------------+
With this you also get the safety of knowing immediately if a formula depends on a non-existent formula, because that would violate the DEPENDS_ON_ID foreign key. You can also detect whether any of the formulas form a cycle of dependencies, e.g. where F1 depends on F2, which depends on F3, which depends on F1.
Additionally you can easily add whatever metadata you wish to the tables and index on whatever you might query on.
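As a side note, once you have loaded the dependency rows (e.g. via JDBC), a topological sort gives you the evaluation order and flags cycles for free. A sketch using Kahn's algorithm, assuming rows from FORMULA_DEPENDENCIES as (FORMULA_ID, DEPENDS_ON_ID) pairs; everything else is made up:
import java.util.*;
// Returns formula IDs in a safe evaluation order, or throws if the dependencies form a cycle.
class DependencyOrder {
    static List<Integer> evaluationOrder(Set<Integer> allFormulaIds, List<int[]> dependencyRows) {
        Map<Integer, List<Integer>> dependents = new HashMap<>();  // dependsOnId -> formulas using it
        Map<Integer, Integer> unresolved = new HashMap<>();        // formulaId -> #unmet dependencies
        for (int id : allFormulaIds) unresolved.put(id, 0);
        for (int[] row : dependencyRows) {                         // row = {FORMULA_ID, DEPENDS_ON_ID}
            dependents.computeIfAbsent(row[1], k -> new ArrayList<>()).add(row[0]);
            unresolved.merge(row[0], 1, Integer::sum);
        }
        Deque<Integer> ready = new ArrayDeque<>();
        unresolved.forEach((id, n) -> { if (n == 0) ready.add(id); });
        List<Integer> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            int id = ready.poll();
            order.add(id);
            for (int dep : dependents.getOrDefault(id, List.of())) {
                if (unresolved.merge(dep, -1, Integer::sum) == 0) ready.add(dep);
            }
        }
        if (order.size() != allFormulaIds.size())
            throw new IllegalStateException("Cycle in formula dependencies");
        return order;   // e.g. one valid order for the sample data is [1, 3, 2, 4]
    }
}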
So my project has a "friends list" and in the MySQL database I have created a table:
nameA
nameB
Primary Key (nameA, nameB)
This will lead to a lot of entries, but I'm not sure how else to achieve this while keeping my database normalised.
My project also uses Redis.. I could store them there.
When a person joins the server, I would then have to search all of the entries to see whether their name is nameA or nameB, and then put those two names together as friends; this may also be inefficient.
Cheers.
The task is quite common: you want to store pairs where A|B has the same meaning as B|A. Since a table has columns, one of the two will be stored in the first column and the other in the second, but which goes first and which second, and why?
One solution is to always store the lesser ID first and the greater ID second:
userid1 | userid2
--------+--------
1 | 2
2 | 5
2 | 6
4 | 5
This has the advantage that you store each pair only once, as feels natural, but the disadvantage that you must look a person up in both columns and find their friend sometimes in the first and sometimes in the second column. That can make queries somewhat clumsy.
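In application code, the ordering rule and the two-column lookup look roughly like this (a sketch; the table name friendship and the JDBC plumbing are just for illustration):
import java.sql.*;
import java.util.*;
// Always store the lesser id in userid1; querying then has to look at both columns.
static void addFriendship(Connection con, int a, int b) throws SQLException {
    try (PreparedStatement ps = con.prepareStatement(
            "INSERT INTO friendship (userid1, userid2) VALUES (?, ?)")) {
        ps.setInt(1, Math.min(a, b));
        ps.setInt(2, Math.max(a, b));
        ps.executeUpdate();
    }
}
static List<Integer> friendsOf(Connection con, int userId) throws SQLException {
    List<Integer> friends = new ArrayList<>();
    try (PreparedStatement ps = con.prepareStatement(
            "SELECT CASE WHEN userid1 = ? THEN userid2 ELSE userid1 END AS friend " +
            "FROM friendship WHERE userid1 = ? OR userid2 = ?")) {
        ps.setInt(1, userId);
        ps.setInt(2, userId);
        ps.setInt(3, userId);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) friends.add(rs.getInt("friend"));
        }
    }
    return friends;
}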
Another method is to store the pairs redundantly (by using a trigger typically):
userid1 | userid2
--------+--------
1 | 2
2 | 1
2 | 5
2 | 6
4 | 5
5 | 2
5 | 4
6 | 2
Here querying is easier: look the person up in one column and find their friends in the other. However, it looks somewhat odd to have every pair duplicated, and you rely on a trigger, which some people don't like.
A third method is to store numbered friendships:
friendship | user_id
-----------+--------
1 | 1
1 | 2
2 | 2
2 | 5
3 | 2
3 | 6
4 | 4
4 | 5
This gives both users in the pair equal standing, but in order to find friends you need two passes: find the friendships for a user, then find the friends in those friendships. However, the design is very clear and even extensible, i.e. you could have friendships of three, four, or more users.
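The two passes can also be collapsed into one self-join on the friendship number. A sketch along the same lines as before (the table name friendship_member is made up):
import java.sql.*;
import java.util.*;
// One query: join the table to itself on the friendship number and exclude the user themself.
static List<Integer> friendsOf(Connection con, int userId) throws SQLException {
    List<Integer> friends = new ArrayList<>();
    try (PreparedStatement ps = con.prepareStatement(
            "SELECT b.user_id FROM friendship_member a " +
            "JOIN friendship_member b ON b.friendship = a.friendship AND b.user_id <> a.user_id " +
            "WHERE a.user_id = ?")) {
        ps.setInt(1, userId);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) friends.add(rs.getInt("user_id"));
        }
    }
    return friends;
}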
No method is really much better than the others.
I'm doing performance testing of a computer application (Java). The test concerns the response time (t) obtained while testing the application with a certain number of concurrent threads (th) and a certain amount of data (d).
Suppose I have the following results:
+------+-------+-----+
| th | d | t |
+------+-------+-----+
| 2 | 500 | A |
+------+-------+-----+
| 4 | 500 | B |
+------+-------+-----+
| 2 | 1000 | C |
+------+-------+-----+
| 4 | 1000 | D |
+------+-------+-----+
How can I get the most out of these results, e.g. to find the limits of my app, and how can I create meaningful graphs to represent them?
I'm not a statistics person so pardon my ignorance. Any suggestions would be really helpful (even related statistics technical keywords I can Google).
Thanks in advance.
EDIT
The tricky part for me was to determine the application's performance evolution taking both the number of threads and the amount of data into consideration in one plot.
Yes, there is a way; check the following example I made with Paint (the numbers I picked are just random):
I am using a slightly modified Dijkstra algorithm in my app, but it's quite slow and I know there has to be a much better approach. My input data are bus stops with specified travel times between each other (~400 nodes and ~800 paths, max. result depth = 4, i.e. at most 4 bus changes).
Input data (bus routes) :
bus_id | location-from | location-to | travel-time | calendar_switch_for_today
XX | A | B | 12 | 1
XX | B | C | 25 | 1
YY | C | D | 5 | 1
ZZ | A | D | 15 | 0
dijkstraResolve(A,D, '2012-10-10') -> (XX,A,B,12),(XX,B,C,25),(YY,C,D,5)
=> one bus change, 3 bus stops to final destination
* A->D can't be used as its calendar switch is OFF
As you can imagine, in more complicated graphs, e.g. where a main city (node) has 170 connections to other cities, Dijkstra is slower (more than ~5 seconds), because it computes all neighbours one by one first rather than "trying" to reach the target destination by some more direct route...
Could you recommend any other algorithm which would fit well?
I have been looking at:
http://xlinux.nist.gov/dads//HTML/bellmanford.html (is it faster?)
http://jboost.sourceforge.net/examples.html (I do not see a straightforward example here...)
It would be great to have (just optional things):
- an option to prefer a minimal number of bus changes or minimal travel time
- an option to look at alternative routes (if travel times are similar)
Thank you for any tips.
Sounds like you're looking for A*. It's a variant of Dijkstra's algorithm which uses a heuristic to speed up the search. Under certain reasonable assumptions, A* is the fastest optimal algorithm. Just make sure to always break ties towards the endpoint.
There are also variants of A* which can provide near-optimal paths in much shorter time. See for example here and here.
Bellman-Ford (as suggested in your question) tends to be slower than either Dijkstra's or A*; it is primarily used when there are negative edge weights, which there are not here.
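To make it concrete, here is a bare-bones A* sketch in plain Java (no library, names invented for the example). The heuristic is a parameter: any lower bound on the remaining travel time is admissible, and (a, b) -> 0 simply degrades to Dijkstra, so you lose nothing if no good estimate is available.
import java.util.*;
import java.util.function.ToIntBiFunction;
// Minimal A* over an adjacency map of travel times. The heuristic must never
// overestimate the remaining travel time (0 is always safe).
class AStarSketch {
    record Edge(String to, int travelTime) {}
    private record QueueEntry(String node, int f) {}

    static List<String> route(Map<String, List<Edge>> graph, String start, String goal,
                              ToIntBiFunction<String, String> heuristic) {
        Map<String, Integer> best = new HashMap<>();     // cheapest known time from start
        Map<String, String> cameFrom = new HashMap<>();
        PriorityQueue<QueueEntry> open =
                new PriorityQueue<>(Comparator.comparingInt(QueueEntry::f));
        best.put(start, 0);
        open.add(new QueueEntry(start, heuristic.applyAsInt(start, goal)));
        while (!open.isEmpty()) {
            QueueEntry entry = open.poll();
            String current = entry.node();
            if (current.equals(goal)) {                  // goal reached with optimal cost
                LinkedList<String> path = new LinkedList<>();
                for (String n = goal; n != null; n = cameFrom.get(n)) path.addFirst(n);
                return path;
            }
            if (entry.f() > best.get(current) + heuristic.applyAsInt(current, goal)) continue; // stale entry
            for (Edge e : graph.getOrDefault(current, List.of())) {
                int candidate = best.get(current) + e.travelTime();
                if (candidate < best.getOrDefault(e.to(), Integer.MAX_VALUE)) {
                    best.put(e.to(), candidate);
                    cameFrom.put(e.to(), current);
                    open.add(new QueueEntry(e.to(), candidate + heuristic.applyAsInt(e.to(), goal)));
                }
            }
        }
        return List.of();                                // no route found
    }
}
With the sample data above (and the A->D edge dropped because its calendar switch is off), route(graph, "A", "D", (a, b) -> 0) returns [A, B, C, D].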
Maybe A* algorithm? See: http://en.wikipedia.org/wiki/A-star_algorithm
Maybe contraction hierarchies? See: http://en.wikipedia.org/wiki/Contraction_hierarchies.
Contraction hierarchies are implemented by the very nice, very fast Open Source Routing Machine (OSRM):
http://project-osrm.org/
and by OpenTripPlanner:
http://opentripplanner.com/
A* is implemented by a number of routing systems. Just do a search with Google.
OpenTripPlanner is a multi-modal routing system and, as far as I can see, should be very similar to your project.
The A* algorithm would be great for this; it achieves better performance by using heuristics.
Here is a simple tutorial to get you started: Link
I have a set of requirements and I'm looking for the best Java-based strategy / algorithm / software to use. Basically, I want to take a set of recipe ingredients entered by real people in natural English and parse out the metadata into a structured format (see the requirements below for what I'm trying to do).
I've looked around here and in other places, but have found nothing that gives high-level advice on what direction to follow. So, I'll put it to the smart people :-):
What's the best / simplest way to solve this problem? Should I use a natural language parser, a DSL, Lucene/Solr, or some other tool/technology? NLP seems like it might work, but it looks really complex. I'd rather not spend a lot of time on a deep dive just to find out it can't do what I'm looking for, or that there is a simpler solution.
Requirements
Given these recipe ingredient descriptions....
"8 cups of mixed greens (about 5 ounces)"
"Eight skinless chicken thighs (about 1ΒΌ lbs)"
"6.5 tablespoons extra-virgin olive oil"
"approximately 6 oz. thinly sliced smoked salmon, cut into strips"
"2 whole chickens (3 .5 pounds each)"
"20 oz each frozen chopped spinach, thawed"
".5 cup parmesan cheese, grated"
"about .5 cup pecans, toasted and finely ground"
".5 cup Dixie Diner Bread Crumb Mix, plain"
"8 garlic cloves, minced (4 tsp)"
"8 green onions, cut into 2 pieces"
I want to turn it into this....
|-----|---------|-------------|-------------------------|--------|-----------|--------------------------------|-------------|
|     | Measure |             |                         | weight | weight    |                                |             |
| #   | value   | Measure     | ingredient              | value  | measure   | preparation                    | Brand Name  |
|-----|---------|-------------|-------------------------|--------|-----------|--------------------------------|-------------|
| 1.  | 8       | cups        | mixed greens            | 5      | ounces    | -                              | -           |
| 2.  | 8       | -           | skinless chicken thigh  | 1.25   | pounds    | -                              | -           |
| 3.  | 6.5     | tablespoons | extra-virgin olive oil  | -      | -         | -                              | -           |
| 4.  | 6       | ounces      | smoked salmon           | -      | -         | thinly sliced, cut into strips | -           |
| 5.  | 2       | -           | whole chicken           | 3.5    | pounds    | -                              | -           |
| 6.  | 20      | ounces      | frozen chopped spinach  | -      | -         | thawed                         | -           |
| 7.  | .5      | cup         | parmesan cheese         | -      | -         | grated                         | -           |
| 8.  | .5      | cup         | pecans                  | -      | -         | toasted, finely ground         | -           |
| 9.  | .5      | cup         | Bread Crumb Mix, plain  | -      | -         | -                              | Dixie Diner |
| 10. | 8       | -           | garlic clove            | 4      | teaspoons | minced                         | -           |
| 11. | 8       | -           | green onions            | -      | -         | cut into 2 pieces              | -           |
|-----|---------|-------------|-------------------------|--------|-----------|--------------------------------|-------------|
Note the diversity of the descriptions: some things are abbreviated, some are not; some numbers are digits, some are spelled out.
I would love something that does a perfect parse/translation, but I would settle for something that does reasonably well to start.
Bonus question: after suggesting a strategy / tool, how would you go about it?
Thanks!
Joe
Short answer. Use GATE.
Long answer. You need a tool for pattern recognition in text, something that can catch patterns like:
{Number}{Space}{Ingredient}
{Number}{Space}{Measure}{Space}{"of"}{Space}{Ingredient}
{Number}{Space}{Measure}{Space}{"of"}{Space}{Ingredient}{"("}{Value}{")"}
...
Where {Number} is a number, {Ingredient} is taken from a dictionary of ingredients, {Measure} from a dictionary of measures, and so on.
The patterns I described are very similar to GATE's JAPE rules. With them you catch text that matches a pattern and assign a label to each part of the pattern (number, ingredient, measure, etc.). Then you extract the labeled text and put it into a single table.
Dictionaries I mentioned can be represented by Gazetteers in GATE.
So GATE covers all your needs. It's not the easiest way to start, since you will have to learn at least GATE's basics, JAPE rules and Gazetteers, but with such an approach you will be able to get really good results.
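If you want to get a feel for the pattern idea before committing to GATE, here is a tiny plain-Java sketch of the {Number}{Space}{Measure}{Space}{"of"}{Space}{Ingredient}{"("}{Value}{")"} shape using a regex with named groups. This is not JAPE, and the hand-made measure list is obviously incomplete; it only illustrates the shape of the rule.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// Toy illustration of one ingredient-line pattern. In GATE the measures alternation
// would come from a Gazetteer instead of a hard-coded list.
public class PatternSketch {
    private static final Pattern LINE = Pattern.compile(
            "(?<number>\\d+(?:\\.\\d+)?)\\s+" +
            "(?:(?<measure>cups?|tablespoons?|teaspoons?|oz|ounces?|lbs?|pounds?)\\s+)?" +
            "(?:of\\s+)?" +
            "(?<ingredient>[^,(]+)" +
            "(?:\\((?<note>[^)]*)\\))?",
            Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        Matcher m = LINE.matcher("8 cups of mixed greens (about 5 ounces)");
        if (m.find()) {
            System.out.println(m.group("number"));             // 8
            System.out.println(m.group("measure"));            // cups
            System.out.println(m.group("ingredient").trim());  // mixed greens
            System.out.println(m.group("note"));               // about 5 ounces
        }
    }
}
Spelled-out numbers like "Eight" and the free-form preparation phrases are exactly where a naive regex breaks down, which is what the Gazetteers and multiple JAPE rules are for.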
It is basically natural language parsing (you already did some stemming: chicken[s]), so essentially it is a translation process. Fortunately the context is very restricted. You need a supportive translation workflow where you can add dictionary entries, adapt the grammar rules and retry. An easy process/workflow matters much more here than the algorithms.
I am interested in both aspects.
If you need a programming hand for an initial prototype, feel free to contact me. I can see you are already working in quite a structured way.
Unfortunately I do not know of any fitting frameworks. You are doing something that Mathematica wants to do with its Alpha (natural-language commands yielding results).
Data mining? Simple natural language parsing with a manual adaptation process should give fast and easy results.
You can also try Gexp.
Then you write rules as Java classes, such as:
seq(Number, opt(Measure), Ingredient, opt(seq(token("("), Number, Measure, token(")"))))
Then you add some groups to capture (group(String name, Matcher m)), extract the parts of the pattern, and store this information in a table.
For Number and Measure you should use similar Gexp patterns, or I would recommend some shallow parsing for noun-phrase detection, with words from the Ingredients dictionary.
If you don't want to be exposed to the nitty-gritty of NLP and machine learning, there are a few hosted services that do this for you:
Zestful (disclaimer: I'm the author)
Spoonacular
Edamam
If you are interested in the nitty-gritty, the New York Times wrote about how they parsed their ingredient archive. They open-sourced their code, but abandoned it soon after. I maintain the most up-to-date version of it and I wrote a bit about how I modernized it.
Do you have access to a tagged corpus for training a statistical model? That is probably the most fruitful avenue here. You could build one up using epicurious.com; scrape a lot of their recipe ingredients lists, which are in the kind of prose form you need to parse, and then use their helpful "print a shopping list" feature, which provides the same ingredients in a tabular format. You can use this data to train a statistical language model, since you will have both the raw untagged data, and the expected parse results for a large number of examples.
This might be a bigger project than you have in mind, but I think in the end it will produce better results than a structured top-down parsing approach will.