Hashing to form couples - java

I'm working on my final project for my master's degree and I'm facing a problem. I have some clusters containing information like: user_id, latitude and longitude, an array of the hours in which the user is in that cluster, an array of the days of the week in which the user is in that cluster, etc. I need to find people who leave from nearby clusters in the same time span and arrive at nearby clusters in the same time span. So, for example, I have a cluster representing my house that is near another cluster representing another user's house. We both leave home at 7 a.m. and we both go to almost the same place between 7 a.m. and 8 a.m. I'd like to couple these two users.
This needs to be fast from a computational point of view. My professor told me that we could use hashing to do this. How could hashing be used here?
Maybe by calculating a hash value from some of the clusters' attributes and grouping those that are similar?
Thank you for the help.
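One common way to realize this is grid-based bucketing, a simple form of locality-sensitive hashing: discretize latitude, longitude, and the departure hour into a bucket key, so that users whose clusters are close and whose schedules match land in the same bucket. Below is a minimal sketch of that idea; the cell size, key layout, and all names are illustrative assumptions, not something from the question itself.

```java
import java.util.*;

/** A sketch (assumed design, not from the question) of grid-based hashing:
 *  discretize latitude/longitude and departure hour into a bucket key, so
 *  users whose clusters are close and who leave at similar times collide. */
class ClusterBuckets {
    static final double CELL = 0.01; // ~1 km per cell; tune to the desired "nearness"

    static String bucketKey(double lat, double lon, int departureHour) {
        long latCell = (long) Math.floor(lat / CELL);
        long lonCell = (long) Math.floor(lon / CELL);
        return latCell + ":" + lonCell + ":" + departureHour;
    }

    // Group user ids by bucket; candidate pairs are users sharing a bucket
    // (optionally also probe the 8 neighbouring cells for boundary cases).
    static Map<String, List<Long>> group(Map<Long, double[]> userHome,
                                         Map<Long, Integer> userDeparture) {
        Map<String, List<Long>> buckets = new HashMap<>();
        for (Map.Entry<Long, double[]> e : userHome.entrySet()) {
            double[] p = e.getValue(); // {latitude, longitude}
            String key = bucketKey(p[0], p[1], userDeparture.get(e.getKey()));
            buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(e.getKey());
        }
        return buckets;
    }
}
```

Only users who share a bucket (or a neighbouring one) then need to be compared in detail, which avoids the quadratic all-pairs comparison.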

Related

Java clustering algorithm to handle both similarity and dissimilarity

I'm working on a Java project where I need to match user queries against several engines.
Each engine has a method similarity(Object a, Object b) which returns: +1 if the objects surely match; -1 if the objects surely DON'T match; any float in-between when there's uncertainty.
Example: user searches "Dragon Ball".
Engine 1 returns "Dragon Ball", "Dragon Ball GT", "Dragon Ball Z", and it claims they are DIFFERENT results (similarity = -1), no matter how similar their names look. This engine is accurate, so it has a high "weight" value.
Engine 2 returns 100 different results. Some of them relate to DBZ, others to DBGT, etc. The engine claims they're all "quite similar" (similarity between 0.5 and 1).
The system queries several other engines (10+).
I'm looking for a way to build clusters out of this system. I need to ensure that values with similarity near -1 will likely end up in different clusters, even if many other values are very similar to all of them.
Is there a well-known clustering algorithm to solve this problem? Is there a Java implementation available? Can I build it on my own, perhaps with the help of a support library? I'm good at Java (15+ years experience) but I'm completely new at clustering.
Thank you!
The obvious approach would be to use "1 - similarity" as a distance function, which will thus go from 0 to 2. Then add them up.
Or you could use 1 + similarity and take the product of these values, ... or, or, or, ...
But since you apparently trust the first score more, you may also want to increase its influence. There is no mathematical solution for this; you have to choose the weights depending on your data and preferences. If you have training data, you can optimize the weights for your approach, and you may even want to discard some rankers if they don't work well or are correlated.
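As a concrete sketch of that weighted combination (the Engine interface and all names here are assumptions, not from the question):

```java
/** A minimal sketch of the weighted "1 - similarity" combination suggested
 *  above; each per-engine distance term lies in [0, 2]. */
class CombinedDistance {
    interface Engine { double similarity(Object a, Object b); } // returns -1..+1

    private final Engine[] engines;
    private final double[] weights; // higher weight = more trusted engine

    CombinedDistance(Engine[] engines, double[] weights) {
        this.engines = engines;
        this.weights = weights;
    }

    /** Weighted average of per-engine distances, normalized back to [0, 2]. */
    double distance(Object a, Object b) {
        double sum = 0, totalWeight = 0;
        for (int i = 0; i < engines.length; i++) {
            sum += weights[i] * (1.0 - engines[i].similarity(a, b));
            totalWeight += weights[i];
        }
        return sum / totalWeight;
    }
}
```

With training data you would tune the weights against a labelled sample (e.g., by grid search) rather than fixing them by hand.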

Traveling Salesman Variation: Building a tour of baseball stadiums

I'm trying to write a Java program to find the best itinerary to do a driving tour to see a game at each of the 30 Major League Baseball stadiums. I define the "best" itinerary using the metric Miles Driven + (200 * Number of Days on the Road); this eliminates tours that are 20,000 miles over 30 days, or 11,000 miles over 90 days, neither of which would be a trip that I'd want to take. Each team plays 81 home games over the course of a 183-day season, so when a team is at home needs to be taken into consideration.
Also, I'm not just looking for one best tour for the entire baseball season. I'm looking to find the best tour that starts/ends at any given MLB city, on any given date (Detroit on June 15, Atlanta on August 3, etc.).
I've got the program producing results that I'm pretty happy with, but it would take a few months to run to completion on my laptop, and I'm wondering if anyone has any ideas for how to make it more efficient.
The program runs iteratively. It starts with a single game; say, Chicago on April 5. It figures out which games you could get to next within the next day or two after the Chicago game; let's say there are two such games, in Cincinnati and Detroit. It creates a data structure containing all the stops on each prospective tour (one for Chicago-Cincinnati, one for Chicago-Detroit). Then, it does the same thing to find prospective 3rd stops for both of the 2-stop tours, and so on, until it gets to the 30th and last stop, at which point it ascertains the best tour.
It uses a couple of methods to prune inefficient tours as it goes. The main one uses a HashMap. The key is a character sequence that encodes (1) which ballparks have already been visited, and (2) which one was visited last. So it would detect a duplicate on, say, A-B-C-D-E and A-D-B-C-E; it then keeps the shorter route and eliminates the longer one. For most starting points, this keeps the maximum number of tours on the list at any given time at around 20 million, but for some starting points it gets up to around 90 million.
So ... any ideas?
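As an aside before the answers: the pruning key the question describes can be packed into a single long instead of a character sequence, since 30 stadiums fit in 30 bits. A sketch, with an assumed encoding:

```java
/** A sketch (assumed encoding, not the OP's actual code) of a compact
 *  pruning key: a 30-bit visited mask plus the index of the last stop. */
class TourKey {
    static long key(int visitedMask, int lastStadium) {
        // low 30 bits: which stadiums are visited; higher bits: last stop index
        return ((long) lastStadium << 30) | (visitedMask & 0x3FFFFFFFL);
    }
}
```

Long keys hash and compare faster than strings and shrink the footprint of a map holding tens of millions of entries.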
Your algorithm is not actually greedy -- it enumerates all possible tours (pruning the obviously bad ones as you go). A greedy algorithm looks only one step ahead, makes the best decision for that step, then moves on to the next step, and so on. Greedy heuristics are not exact, but they are very fast. I would suggest adapting a standard greedy TSP heuristic to your problem. There are many common ones -- nearest neighbor, cheapest insertion, etc. You'll find various online sources describing them if you aren't already familiar with them.
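For illustration, here is a minimal nearest-neighbour sketch on a plain distance matrix; the real problem would additionally need the home-game-date feasibility checks from the question grafted onto the candidate loop:

```java
/** A minimal nearest-neighbour TSP sketch: from each stop, greedily move to
 *  the closest unvisited stadium. Fast but not exact. */
class NearestNeighbourTour {
    static int[] tour(double[][] dist, int start) {
        int n = dist.length;
        boolean[] visited = new boolean[n];
        int[] order = new int[n];
        order[0] = start;
        visited[start] = true;
        for (int step = 1; step < n; step++) {
            int prev = order[step - 1], best = -1;
            for (int c = 0; c < n; c++) {
                if (!visited[c] && (best == -1 || dist[prev][c] < dist[prev][best])) {
                    best = c; // greedily pick the closest unvisited stadium
                }
            }
            order[step] = best;
            visited[best] = true;
        }
        return order;
    }
}
```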
You could also create duplicate stadium nodes, one for each home game, and then model this as a generalized TSP (GTSP) in which each "cluster" consists of the nodes for a given stadium. The distance from node (i_1,j_1) to (i_2,j_2) (where i = stadium and j = date) is defined by your metric.
Technically this is a TSP with time windows, but it's more complicated than the usual definition of the TSPTW because usually each node has a single contiguous time window (e.g., arrive between 8 am and 2 pm) whereas here you have a disjoint set of windows and you must choose one of them.
Hope this helps.

Finding as many pairs as possible

I'm trying to solve a problem but unfortunately my solution is not really the best for this task.
Task:
At a party there are N guests (0 < N < 30000). Each guest reports when they arrive at the party and when they leave (for example, [10;12]). The task is to take photos of as many people as possible at the party. A photo can contain only 2 people (a pair), and each person can appear on exactly one photo. Of course, a photo can only be taken when the two persons are at the party at the same time, i.e., when their attendance intervals overlap.
My idea: I wrote a program which creates a graph of connections from the intervals. In the graph, I search for the person who has the fewest connections. Among that person's connections, I then select the one who, in turn, has the fewest connections. These two are chosen as a pair for a photo, and both are removed from the graph. The algorithm runs until no connections are left.
This approach works, but there is a 10-second limit for the program. With 1000 entries it runs in 2 seconds, but even 4000 takes a long time. Furthermore, with 25,000 entries the program stops with an out-of-memory error, so I cannot even store the connections properly.
I think a new approach is needed here, but I couldn't find another way to make this work.
Can anyone help me figure out the proper algorithm for this task?
Thank you very much!
Sample Data:
10
1 100
2 92
3 83
4 74
5 65
6 55
7 44
8 33
9 22
10 11
The first line is the number of guests; the following lines are the guests' attendance intervals.
There is no need to create a graph here; this problem can be solved directly on the interval structure. Sort people in ascending order of their leaving time (the ending point of the interval). Then iterate over them in that sorted order: if the current person does not intersect with anyone, remove them. If they intersect with more than one person, pair them with the intersecting person who has the earliest leaving time. During the iteration you only need to compare each person with the following ones.
Proving this approach correct is not so difficult, so I hope you can prove it yourself.
Regarding running time, the simple solution is O(N^2), though I think it can be reduced to O(N log N). In any case, O(N^2) will fit in 10 seconds on a normal PC.
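A minimal sketch of that greedy pairing (the O(N^2) version; the input format and names are assumptions):

```java
import java.util.*;

/** Greedy pairing as described above. guests[i] = {arrivalTime, leaveTime}.
 *  Sort by leaving time; pair each remaining guest with the overlapping
 *  guest who leaves earliest. O(N^2) overall. */
class PartyPhotos {
    static int countPhotos(int[][] guests) {
        Arrays.sort(guests, Comparator.comparingInt((int[] g) -> g[1]));
        boolean[] paired = new boolean[guests.length];
        int photos = 0;
        for (int i = 0; i < guests.length; i++) {
            if (paired[i]) continue;
            // Later guests leave no earlier, so the first unpaired guest that
            // overlaps is exactly the partner with the earliest leaving time.
            for (int j = i + 1; j < guests.length; j++) {
                if (!paired[j] && guests[j][0] <= guests[i][1]) {
                    paired[i] = paired[j] = true;
                    photos++;
                    break;
                }
            }
        }
        return photos;
    }
}
```

Because the array is sorted by leaving time, the inner scan's first overlapping match is automatically the one with the earliest leaving time, so no extra search is needed.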
This looks like a classical maximum matching problem to me.
You build a graph where people who could possibly be pictured together (their time intervals intersect) are connected with an edge, and then find a maximum matching, for example with Edmonds' blossom algorithm.
I wouldn't say it's easy to implement, however. You can get quite a good approximation with Kuhn's algorithm for maximum matching in bipartite graphs. That one is really easy to implement, but it won't give you the exact solution here.
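For reference, a compact sketch of Kuhn's augmenting-path algorithm (class and variable names are illustrative). Note that the party graph is not itself bipartite, so applying this requires splitting the guests into two sides somehow, which is why it only approximates the answer here:

```java
import java.util.*;

/** Kuhn's augmenting-path algorithm for maximum matching in a bipartite
 *  graph: left vertices 0..n-1, right vertices 0..m-1. */
class KuhnMatching {
    private final List<List<Integer>> adj; // adj.get(u) = right neighbours of left u
    private final int[] matchRight;        // matchRight[v] = left partner of v, or -1
    private boolean[] used;

    KuhnMatching(List<List<Integer>> adj, int m) {
        this.adj = adj;
        this.matchRight = new int[m];
        Arrays.fill(matchRight, -1);
    }

    private boolean tryAugment(int u) {
        for (int v : adj.get(u)) {
            if (used[v]) continue;
            used[v] = true;
            // v is free, or its current partner can be re-matched elsewhere
            if (matchRight[v] == -1 || tryAugment(matchRight[v])) {
                matchRight[v] = u;
                return true;
            }
        }
        return false;
    }

    int maxMatching() {
        int n = adj.size(), result = 0;
        for (int u = 0; u < n; u++) {
            used = new boolean[matchRight.length];
            if (tryAugment(u)) result++;
        }
        return result;
    }
}
```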
I have a really simple idea:
Assume the party lasts X hours; make X sets, one for each hour, and add the appropriate people to them. Of course, people who stay longer than an hour will appear in several sets. Now, if two adjacent sets each contain an even number of people, you can simply take n/2 photos within each set. If two adjacent sets each contain an odd number of people, look for someone who appears in both sets and move them to one of the two (so you get two even-sized sets of people who are at the party at the same time).
Remember to remove all used people (consider a class, say Person, with a list of his/her hours).
This idea would probably need to be extended into a more advanced "moving people" algorithm that spans more than one neighboring set.
I think the following can do:
First, read all the guests' data and sort them into an array by leaving time, ascending. Then take the first element of the array and iterate through the following elements until the very first time-match is found (the next guest's entry time is earlier than this guest's leaving time). If a match is found, remove both from the array as a pair and report the pair elsewhere. If not, remove the guest, as they can't be paired at all. Repeat until the array is empty.
The worst case of this is also O(N^2), as a party can look like [1,2], [3,4], ... where no guests can be paired with each other, and the algorithm will search through all 30000 guests in vain every time. So I don't think this is the optimal algorithm, but it should give an exact answer.
You say you already have a graph representation. I assume your vertices represent the guests with the intervals of their stay at the party, and the edges represent overlaps between the respective intervals. What you then have to solve is the graph-theoretical maximum matching problem, which has been solved before.
However, as indicated in my comments above, I think you can exploit the properties of the problem, especially the transitivity-like "if A leaves before B leaves and B leaves before C arrives, then A and C will not meet, either" like this:
Wait until the next yet unphotographed guest is about to leave, then take a photo of this one with the one who leaves next among those present.
You might succeed by thinking about the earliest time a photo can be taken: it is the time when the second person arrives at the party.
So, as a photographer, go to the party as the first person and wait. Whenever a person arrives, take a photo with him/her and all other persons at the party. As a person appears only once, you will not have any duplicates.
While taking a photo (i.e., iterating over the list of guests), remove those guests who have actually left the party.

A scheduling algorithm

I am looking for help in designing a scheduling algorithm for a medical review board:
Every day, hundreds of patients are scheduled with specialized doctors, starting 14 days out.
Each patient may need to visit more than one doctor; in extreme cases, up to 5 visits.
There is a fixed number of rooms, some of them with specialized equipment. Some of the appointments can only use specific rooms.
Each doctor has an individual schedule, but usually between 14:00 and 19:00.
The main requirement is to try to have each patient come in only once.
There are many constraints, including second visits with the same doctor and avoiding conflicts of interest (patient and doctor know each other), among others. The hospitals/residents problem is not suitable, mainly because of the constraints. We are trying a solution using a prioritizing scheme and then rescheduling the exceptions.
At this moment we are trying to define the algorithm; this is part of a whole system to manage the medical review board.
The system is based on Java, with Dojo for the front end and EJB for the back end.
This is a question that may be closed because it is too localized. It won't be much help to someone else. But it's a fun problem so I thought I'd throw out some ideas.
You are going to need to find matches for the most complex cases first.
Look for "best-fit" solutions. Don't take time on an empty day if you can fill up another day.
You're going to have to figure out a way to iterate through the matching so you try a wide range of possibilities. Some way to pull back, make a different choice, and then continue without getting into an infinite loop.
You might do the fitting up to (let's say) 80% and then swap people around. Swap a 3-hour appointment with a 2 and a 1, or something. The goal is to leave the schedule with the most "flexibility".
You're going to need to determine your swapping rules. What makes a schedule better?
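As a very rough skeleton of the "hardest cases first, best fit" idea from above (every name here is hypothetical, not from the question):

```java
import java.util.*;

/** Sketch: place the most constrained patients first; assign each to the
 *  feasible day that leaves the least slack, keeping emptier days open. */
class ReviewScheduler {
    interface Patient { int visitsNeeded(); }
    interface DaySchedule {
        boolean canHostAllVisits(Patient p); // doctors + suitable rooms free
        int freeSlots();
        void book(Patient p);
    }

    static List<Patient> schedule(List<Patient> patients, List<DaySchedule> days) {
        List<Patient> unplaced = new ArrayList<>();
        // Most complex patients first: they have the fewest feasible days.
        patients.sort(Comparator.comparingInt(Patient::visitsNeeded).reversed());
        for (Patient p : patients) {
            DaySchedule best = null;
            for (DaySchedule day : days) {
                // Best fit: prefer the feasible day with the fewest free slots.
                if (day.canHostAllVisits(p)
                        && (best == null || day.freeSlots() < best.freeSlots())) {
                    best = day;
                }
            }
            if (best != null) best.book(p);
            else unplaced.add(p); // exceptions to reschedule separately
        }
        return unplaced;
    }
}
```

The returned list is the exception set the question mentions; the swapping/backtracking rules discussed above would operate on top of this first pass.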
Here's a bunch of SO questions for you to read:
Worker Scheduling Algorithm
Best Fit Scheduling Algorithm
Is there a scheduling algorithm that optimizes for "maker's schedules"?
Hope some of this helps.

Best fit curve for trend line

Problem Constraints
Size of the data set, but not the data itself, is known.
Data set grows by one data point at a time.
Trend line is graphed one data point at a time (using a spline/Bezier curve).
Graphs
The collage below shows data sets with reasonably accurate trend lines:
The graphs are:
Upper-left. By hour, with ~24 data points.
Upper-right. By day for one year, with ~365 data points.
Lower-left. By week for one year, with ~52 data points.
Lower-right. By month for one year, with ~12 data points.
User Inputs
The user can select:
the type of time series (hourly, daily, monthly, quarterly, annual); and
the start and end dates for the time series.
For example, the user could select a daily report for 30 days in June.
Trend Weight
To calculate the window size (i.e., the number of data points to average when calculating the trend line), the following expression is used:
data points / trend weight
Where data points is derived from user inputs and trend weight is 6.4. Even though a trend weight of 6.4 produces good fits, it is rather arbitrary, and might not be appropriate for different user inputs.
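For concreteness, a tiny sketch of that window-size computation (the rounding choice is an assumption):

```java
/** Moving-average window from the user's data-point count and the fixed
 *  trend weight, as described above. */
class TrendWindow {
    static final double TREND_WEIGHT = 6.4;

    static int windowSize(int dataPoints) {
        return Math.max(1, (int) Math.round(dataPoints / TREND_WEIGHT));
    }
    // e.g. 365 points -> ~57, 52 -> ~8, 24 -> ~4, 12 -> ~2
}
```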
Question
How should trend weight be calculated given the constraints of this problem?
Based on the looks of the graphs, I would say you have too many points for your 12-point graph (it is just a spline of the given points, which is visually pleasing but actually does more harm than good when trying to understand the trend) and too few points for your 365-point graph. Perhaps try something a little exponential like:
(Data points)^1.2/14.1
I do realize this is even more arbitrary than what you already have, but arbitrary isn't the worst thing in the world.
(I got 14.1 by trying to keep the 52-point graph fixed, since that one looks nice, by taking (52^1.2 / 52) * 6.4 = 14.1.) Using this technique, you could try other powers besides 1.2 to see what you get visually.
Dan
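A sketch of that exponential variant, with the worked constant from above (the rounding choice is again an assumption):

```java
/** Exponential window sizing: window = n^1.2 / 14.1, where 14.1 pins the
 *  52-point graph to its previous window (52^1.2 / 52 * 6.4 ≈ 14.1). */
class ExpTrendWindow {
    static int windowSize(int dataPoints) {
        return Math.max(1, (int) Math.round(Math.pow(dataPoints, 1.2) / 14.1));
    }
    // e.g. 12 -> ~1, 52 -> ~8 (fixed by construction), 365 -> ~84
}
```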
I voted this up for the quality of your results and the clarity of your write-up. I wish I could offer an answer that could improve on your already excellent work.
I fear that it might be a matter of trial and error with the trend weight until you see an improved fit.
It could be that you could make this an input from users as well: allow them to fiddle with the value, given realistic constraints, until they get satisfactory values.
I also wondered if the weight would be different for each graph, since the number of points in each is different. Are you trying to get a single weighting that works for all graphs?
Excellent work; a nice question. Well done. I wish I was more helpful. Perhaps someone else will have more wisdom to impart than I do.
It might look like the trend lines are accurate in those 4 graphs, but they are really quite off. (This is best seen at the beginning of the lower-left one and the beginning of the upper-right one.) I would think that you would want to use no fewer than half of your points when finding the trend line (though really you should use much more than half). I would suggest a trend weight of 2 at a maximum, though really you ought to stick closer to the 1-1.5 range. Since the value is arbitrary, I would suggest you give your user an "accuracy of trend line" slider, where the most accurate setting uses a trend weight of 1 and the least accurate uses a weight of (number of data points) + 1. The latter would use 0 points (assuming you always round down) and, I would assume (though your statistics software might differ), would generate a straight horizontal line.
