I'm trying to write a Java program to find the best itinerary to do a driving tour to see a game at each of the 30 Major League Baseball stadiums. I define the "best" itinerary using the metric Miles Driven + (200 * Number of Days on the Road); this eliminates tours that are 20,000 miles over 30 days, or 11,000 miles over 90 days, neither of which would be a trip that I'd want to take. Each team plays 81 home games over the course of a 183-day season, so when a team is at home needs to be taken into consideration.
Also, I'm not just looking for one best tour for the entire baseball season. I'm looking to find the best tour that starts/ends at any given MLB city, on any given date (Detroit on June 15, Atlanta on August 3, etc.).
I've got the program producing results that I'm pretty happy with, but it would take a few months to run to completion on my laptop, and I'm wondering if anyone has any ideas for how to make it more efficient.
The program runs iteratively. It starts with a single game; say, Chicago on April 5. It figures out which games you could get to next within the next day or two after the Chicago game; let's say there are two such games, in Cincinnati and Detroit. It creates a data structure containing all the stops on each prospective tour (one for Chicago-Cincinnati, one for Chicago-Detroit). Then, it does the same thing to find prospective 3rd stops for both of the 2-stop tours, and so on, until it gets to the 30th and last stop, at which point it ascertains the best tour.
It uses a couple of methods to prune inefficient tours as it goes. The main one uses a HashMap whose key is a character sequence denoting (1) which ballparks have already been visited and (2) which one was visited last. So it would detect a duplicate on, say, A-B-C-D-E and A-D-B-C-E; it then keeps the shorter route and eliminates the longer one. For most starting points, this keeps the maximum number of tours on the list at any given time at around 20 million, but for some starting points it gets up to around 90 million.
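To illustrate, the key and the pruning step look roughly like this (a simplified sketch; Tour is a stand-in for my actual tour data structure, with a visited array, the last park, and the running cost):

import java.util.HashMap;
import java.util.Map;

// Key: which of the 30 parks have been visited, plus the park the tour currently ends at.
static String pruneKey(boolean[] visited, int lastPark) {
    StringBuilder key = new StringBuilder();
    for (boolean v : visited) {
        key.append(v ? '1' : '0');
    }
    key.append('|').append(lastPark);
    return key.toString();
}

// Keep only the cheaper of two partial tours that share a key.
// cost = miles driven so far + 200 * days on the road so far
Map<String, Tour> bestByKey = new HashMap<>();

void offer(Tour t) {
    String k = pruneKey(t.visited, t.lastPark);
    Tour existing = bestByKey.get(k);
    if (existing == null || t.cost < existing.cost) {
        bestByKey.put(k, t);
    }
}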
So ... any ideas?
Your algorithm is not actually greedy -- it enumerates all possible tours (pruning the obviously bad ones as you go). A greedy algorithm looks only one step ahead, makes the best decision for that step, then moves on to the next step, and so on. Greedy heuristics are not exact, but they are very fast. I would suggest adapting a standard greedy-type TSP heuristic to your problem. There are many common ones -- nearest neighbor, cheapest insertion, etc. You'll find plenty of online sources describing them if you aren't already familiar with them.
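As a rough illustration, nearest neighbor adapted to your setting could look something like this (a sketch only; Game, Tour, miles() and daysBetween() are hypothetical stand-ins for your own schedule and distance data, and a real version would also check that the next game is actually reachable by car in the available days):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Nearest-neighbor sketch: from the current game, always move to the cheapest
// reachable game at a stadium not yet visited, until all 30 stadiums are covered.
Tour nearestNeighborTour(Game start, List<Game> schedule) {
    List<Game> stops = new ArrayList<>();
    Set<String> visitedParks = new HashSet<>();
    Game current = start;
    stops.add(current);
    visitedParks.add(current.park());

    while (visitedParks.size() < 30) {
        Game next = null;
        double bestCost = Double.MAX_VALUE;
        for (Game g : schedule) {
            if (visitedParks.contains(g.park())) continue;   // stadium already seen
            if (!g.date().isAfter(current.date())) continue;  // must be a later date
            // The asker's metric: miles driven plus 200 per day on the road.
            double cost = miles(current.park(), g.park())
                        + 200.0 * daysBetween(current.date(), g.date());
            if (cost < bestCost) {
                bestCost = cost;
                next = g;
            }
        }
        if (next == null) return null; // dead end: no reachable unvisited stadium
        stops.add(next);
        visitedParks.add(next.park());
        current = next;
    }
    return new Tour(stops);
}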
You could also create duplicate stadium nodes, one for each home game, and then model this as a generalized TSP (GTSP) in which each "cluster" consists of the nodes for a given stadium. The distance from node (i_1,j_1) to (i_2,j_2) (where i = stadium and j = date) is defined by your metric.
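In code, the arc cost between two such (stadium, date) nodes is just your metric (a sketch; milesBetween() is a hypothetical lookup into your mileage table):

import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Cost of traveling from one (stadium, date) node to another, per the stated metric.
double arcCost(int park1, LocalDate date1, int park2, LocalDate date2) {
    long days = ChronoUnit.DAYS.between(date1, date2);
    if (days <= 0) return Double.POSITIVE_INFINITY; // can only move forward in time
    return milesBetween(park1, park2) + 200.0 * days;
}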
Technically this is a TSP with time windows, but it's more complicated than the usual definition of the TSPTW because usually each node has a single contiguous time window (e.g., arrive between 8 am and 2 pm) whereas here you have a disjoint set of windows and you must choose one of them.
Hope this helps.
I'm working on my final project for my master's degree and I'm facing a problem. I have some clusters containing information like: user_id, latitude and longitude, an array of the hours during which the user is in that cluster, an array of the days of the week on which the user is in that cluster, etc. I should be able to find people who leave from nearby clusters in the same time span and arrive at nearby clusters in the same time span. So, for example, I have a cluster representing my house that is near another cluster representing another user's house. We both leave home at 7 a.m. and we both go to almost the same place between 7 a.m. and 8 a.m. I'd like to couple these two users.
I need to do this in a way that is fast from a computational point of view. My professor told me that we could use hashing to do this. How could hashing be used here?
Maybe by calculating a hash value from some of the cluster attributes and grouping the ones that are similar?
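For example, is something like this what is meant? (Just a sketch of my own guess; the field names and the rounding granularity are made up.)

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Build a coarse bucket key from rounded coordinates and the hour of day, so users
// whose clusters are close in space and time end up in the same bucket.
String bucketKey(double lat, double lon, int hourOfDay) {
    long latBucket = Math.round(lat * 100); // ~0.01 degrees, roughly 1 km; just a guess
    long lonBucket = Math.round(lon * 100);
    return latBucket + ":" + lonBucket + ":" + hourOfDay;
}

Map<String, List<Integer>> usersByBucket = new HashMap<>();

// Group user ids by the bucket of the cluster they leave from at a given hour.
void addDeparture(int userId, double lat, double lon, int hourOfDay) {
    usersByBucket.computeIfAbsent(bucketKey(lat, lon, hourOfDay), k -> new ArrayList<>())
                 .add(userId);
}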
Thank you for the help.
I am trying to build a poker bot in java. I have written the hand evaluation class and I am about to start feeding a neural network but I face a problem. I need the winning odds of every hand for every step: preflop, flop, turn, river.
My problem is that, there are 52 cards and the combinations of 5 cards are 2,598,960. So I need to store 2,598,960 odds for each possible hand. The number is huge and these are only the odds I need for the river.
So I have two options:
Precompute the odds for every possible hand and every possible deal, load them every time I start my application, and kill my memory.
Calculate the odds on the fly and run out of processing power.
Is there a 3rd better option to deal with this problem?
A third option is to use the disk... but my first choice would be to calculate the odds as you need them.
Why do you need to calculate all combinations of 5 cards? A lot of these hands are worth the same: since there are 4 suits, there is plenty of repetition between hands.
Personally, I would rank your hand based on how many hands beat it and how many hands it beats. From this you can compute your probability of winning the table by multiplying by the number of active hands.
What about ignoring the suits? From 52 possible cards you drop to 13 ranks, which leaves only 6,175 possible rank combinations. Of course, suits are important for a flush, but there it is pretty much binary: are all the suits the same or not? So we are at 12,350 (including some impossible combinations; in fact it is 7,462, because in the other combinations a rank is contained more than once, so the suits must differ).
If the order of the streets is important (preflop, flop, turn, river), it will be a lot more, but it is still far less than your two million. Try simplifying your problem and you'll realize it can be solved.
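A minimal sketch of that idea (assuming cards are encoded as rank 0-12 and suit 0-3; the key format is arbitrary):

import java.util.Arrays;

// Collapse a 5-card hand to a key that ignores suits entirely,
// except for one flag saying whether all five cards share a suit (a possible flush).
String canonicalKey(int[] ranks, int[] suits) {
    int[] sorted = ranks.clone();
    Arrays.sort(sorted);              // order of the cards in the hand doesn't matter
    boolean sameSuit = true;
    for (int s : suits) {
        if (s != suits[0]) { sameSuit = false; break; }
    }
    return Arrays.toString(sorted) + (sameSuit ? "|suited" : "");
}

The precomputed odds could then be stored in a map keyed by this string, so equivalent hands share a single entry.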
I need to test whether a late goal in a Soccer game has changed the result.
The project is centered around one Team, and it is regarding their results against various opponents. It is an investigation regarding the importance of late goals to a winning team.
In my Goals database I have a list of the times that goals were scored and whether or not they were scored by the Team or the Opposition.
There is a Game Object that stores things like teamScore, oppScore, a result (String that contains either W or L or D) etc.
The Season Object is used to collect and return the results.
When a goal is scored by the Team it increments the minutes int[] in the Game at the correct minute:
minute[GoalTime]++
When a goal is scored by the Opposition it decrements the minutes array at the correct minute:
minute[GoalTime]--
Therefore, to find the score at any minute, we add up all the entries up to that minute:
int score85 = 0;
for (int g = 0; g <= 85; g++) {
    score85 += minutes[g];
}
If I have score90 and score85 how do I compare them so that it only returns results where a late goal has changed the result? I wish to avoid logging games where a winning team has scored again post 85, as this makes no difference to the result. I’m only interested in goals that have had a direct impact on the outcome.
Here is what I have:
int difference = score90 - score85;

// Level or behind at 85' and a late goal arrived: a win or draw was earned late.
if (difference > 0 && score85 <= 0) {
    if (result.equals("W") || result.equals("D")) {
        season.gamesWDByLateGoal++;
    }
}

// Level or ahead at 85' and a late goal was conceded: a loss or draw happened late.
if (difference < 0 && score85 >= 0) {
    if (result.equals("D") || result.equals("L")) {
        season.gamesLDByLateGoal++;
        println(gameNumber);
    }
}
How can I be sure that I am getting the right result? I am testing 1500+ games and I have been getting different answers.
The goal count is not a good solution:
minute[GoalTime]++
That means keeping an array of 90 integers (the length of a regular soccer game is 90 minutes). You are doing statistics on data, so you should be familiar with typical values; I hope you agree that 9 goals in a game is already a lot. That means that 90% of those integers are useless most of the time.
Not only is it a waste of memory, but also a waste of time, as you have to step through all those integers (in the for loop) to find the game result.
In addition to that, not every game lasts 90 minutes.
Since soccer players are paid like Hollywood actresses, they act as such, and allowances of a few minutes are added to every period in order to compensate for their stage time.
There can be overtime of two additional 15-minute periods.
There were rules that could shorten the overtime; see the golden and silver goal.
Considering that, the idea of using the time as an index doesn't look that good.
On top of that, there can be a penalty shoot-out, which produces goals that do not even have a time associated with them.
Further down in the article, there is a demonstration of how different rating systems handle penalty shoot-outs:
In the calculation of UEFA coefficients, shoot-outs are ignored for club coefficients,[53] but not national team coefficients, where the shoot-out winner gets 20,000 points: more than the shoot-out loser, who gets 10,000 (the same as for a draw) but less than the 30,000 points for winning a match outright.[56] In the FIFA World Rankings, the base value of a win is three points; a win on penalties is two; a draw and a loss on penalties are one; a loss is zero.[54] The more complicated ranking system FIFA used from 1999 to 2006 gave a shoot-out winner the same points as for a normal win and a shoot-out loser the same points as for a draw; goals in the match proper, but not the shoot-out, were factored into the calculation.[57]
I highly recommend looking at the different systems to see which one makes most sense to you. After all, you cannot compare to another system if you have a different definition of winning and losing a game.
To complicate things further, goals can have different "values". Goals that are scored away are more valuable than those at home.
If two matches between A and B go as follows:
A's home: A 1 - B 1
B's home: A 2 - B 2
then A goes through on away goals (the aggregate is 3-3), even though both games are draws.
Did I mention indoor soccer yet? =)
Conclusion
Use an official system as a reference. If your system comes to the conclusion that the game was a draw, while the official result says it's a win, then your system is not comparable to any official results and thus carries very little meaning. Make sure your system matches the official results.
Store the goals by their time in a dynamic list associated with the team.
Finally, in a separate list, store results with a time. Whenever a goal is scored that changes the result, add another result object with the new result to this list. You now know all the times when the result changed, without discretising the entire timeline. To get a result at an arbitrary point in time, get the result from that list with the next smaller time stamp.
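A rough sketch of that structure (class and field names here are just for illustration, and it assumes goals are added in chronological order):

import java.util.ArrayList;
import java.util.List;

class GoalEvent {
    final int minute;
    final boolean scoredByTeam;
    GoalEvent(int minute, boolean scoredByTeam) {
        this.minute = minute;
        this.scoredByTeam = scoredByTeam;
    }
}

class ResultChange {
    final int minute;
    final String result; // "W", "D" or "L" from the Team's point of view at this moment
    ResultChange(int minute, String result) {
        this.minute = minute;
        this.result = result;
    }
}

List<GoalEvent> goals = new ArrayList<>();
List<ResultChange> resultChanges = new ArrayList<>();

// Record a goal and, if it flips the result, append a new entry to the change list.
void addGoal(int minute, boolean scoredByTeam) {
    goals.add(new GoalEvent(minute, scoredByTeam));
    int diff = 0;
    for (GoalEvent g : goals) {
        diff += g.scoredByTeam ? 1 : -1;
    }
    String result = diff > 0 ? "W" : diff < 0 ? "L" : "D";
    if (resultChanges.isEmpty()
            || !resultChanges.get(resultChanges.size() - 1).result.equals(result)) {
        resultChanges.add(new ResultChange(minute, result));
    }
}

// Result at an arbitrary minute: the last recorded change at or before that minute.
String resultAt(int minute) {
    String r = "D"; // 0-0 before any goal is scored
    for (ResultChange c : resultChanges) {
        if (c.minute <= minute) {
            r = c.result;
        }
    }
    return r;
}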
I'm trying to solve a problem but unfortunately my solution is not really the best for this task.
Task:
At a party there are N guests ( 0 < N < 30000 ). All guests tell when they get to the party and when they leave (for example [10;12]). The task is to take photos of as many people as possible at the party. On a photo there can only be 2 people (a pair) and each person can only be on exactly one photo. Of course, a photo can only be taken when the two persons are at the party at the same time. This is the case when their attendance intervals overlap.
My idea: I wrote a program which creates a graph of connections from the intervals. From the graph I select the person who has the fewest connections. Among the persons connected to them, I again select the one who has the fewest connections. These two are then chosen as a pair for a photo, and both are removed from the graph. The algorithm runs until no connections are left.
This approach works, but there is a 10-second limit for the program. With 1,000 entries it runs in 2 seconds, but even with 4,000 it takes a long time. Furthermore, when I tried it with 25,000 entries, the program stopped with an out-of-memory error, so I cannot even store the connections properly.
I think a new approach is needed here, but I couldn't find another way to make this work.
Can anyone help me to figure out the proper algorithm for this task?
Thank you very much!
Sample Data:
10
1 100
2 92
3 83
4 74
5 65
6 55
7 44
8 33
9 22
10 11
The first line is the number of guests; the lines that follow are the attendance intervals of the guests.
No need to create a graph here; this problem can be solved well on the interval structure itself. Sort people in ascending order of their leaving time (the end point of the interval). Then iterate over them in that sorted order: if the current person does not overlap with anyone, remove them. If they overlap with more than one person, pair them with the one who has the earliest leaving time. During the iteration you only need to compare each person with the ones that come after them.
Proving this approach is not so difficult, so I hope you can prove it yourself.
Regarding running time, a simple solution will be O(N^2), though I think it can be reduced to O(N log N). Either way, O(N^2) will fit in 10 seconds on a normal PC.
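A rough sketch of that greedy approach, in its simple O(N^2) form (the Guest class is made up for illustration):

import java.util.Comparator;
import java.util.List;

class Guest {
    final int id, arrive, leave;
    Guest(int id, int arrive, int leave) {
        this.id = id;
        this.arrive = arrive;
        this.leave = leave;
    }
}

// Sort by leaving time, then pair each remaining guest with the overlapping
// partner who leaves earliest (i.e., the first unused overlapping guest after them).
int maxPhotos(List<Guest> guests) {
    guests.sort(Comparator.comparingInt((Guest g) -> g.leave));
    boolean[] used = new boolean[guests.size()];
    int photos = 0;
    for (int i = 0; i < guests.size(); i++) {
        if (used[i]) continue;
        Guest gi = guests.get(i);
        for (int j = i + 1; j < guests.size(); j++) {
            // Later guests leave no earlier than gi, so the intervals overlap
            // exactly when the later guest arrives before gi leaves.
            if (!used[j] && guests.get(j).arrive <= gi.leave) {
                used[i] = true;
                used[j] = true;
                photos++;
                break;
            }
        }
    }
    return photos;
}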
Seems like a classical maximum matching problem to me.
You build a graph where people who could possibly be pictured together (their time intervals intersect) are connected by an edge, and then find a maximum matching, for example with Edmonds' blossom algorithm.
I wouldn't say it's easy to implement, though. However, you can get quite a good approximation with Kuhn's algorithm for maximum matching in bipartite graphs. That one is really easy to implement, but it won't give you the exact solution.
I have a really simple idea:
Assume the party lasts X hours. Make X sets, one for each hour, and add the appropriate people to them. Of course, people who stay longer than an hour will appear in several sets. Now, if two neighboring sets both contain an even number of people, you can simply take n/2 photos within each set. If two neighboring sets both contain an odd number of people, look for someone who appears in both sets and move them to one of the two (so you get two even-sized sets of people who are at the party at the same time).
Remember to remove all people who have already been used (consider some class, e.g. a Person holding a list of all his/her hours).
This idea should probably be expanded into a more advanced "moving people" step that works across more than one neighboring set.
I think the following will do:
First, read all the guests' data and sort them into an array by leaving time, ascending. Then take the first element of the array and iterate through the following elements until the first time match is found (the next guest's entry time is before this guest's leaving time); if one is found, remove both from the array as a pair and record the pair elsewhere. If not, remove the guest, as they cannot be paired at all. Repeat until the array is empty.
The worst case of this is also N^2, as a party can look like [1,2], [3,4], ... where no guests can be paired with each other, and the algorithm will search through all 30,000 guests in vain every time. So I don't think this is the optimal algorithm, but it should give an exact answer.
You say you already have a graph structure representation. I assume your vertices represent the guest and the interval of their staying at the party and the edges represent overlap of the respective intervals. What you then have to solve is the graph theoretical maximum matching problem, which has been solved before.
However, as indicated in my comments above, I think you can exploit the properties of the problem, especially the transitivity-like "if A leaves before B leaves and B leaves before C arrives, then A and C will not meet, either" like this:
Wait until the next yet unphotographed guest is about to leave, then take a photo of this one with the one who leaves next among those present.
You might succeed in thinking about the earliest time a photo can be taken: It is the time when the second person arrives at the party.
So, as the photographer, go to the party as the first person and wait. Whenever a person arrives, take a photo of him/her with all the other persons at the party. As a person appears only once, you will not have any duplicates.
While taking a photo (i.e. iterating over the list of guests), remove those guests who actually left the party.
Problem Constraints
Size of the data set, but not the data itself, is known.
Data set grows by one data point at a time.
Trend line is graphed one data point at a time (using a spline/Bezier curve).
Graphs
The collage below shows data sets with reasonably accurate trend lines:
The graphs are:
Upper-left. By hour, with ~24 data points.
Upper-right. By day for one year, with ~365 data points.
Lower-left. By week for one year, with ~52 data points.
Lower-right. By month for one year, with ~12 data points.
User Inputs
The user can select:
the type of time series (hourly, daily, monthly, quarterly, annual); and
the start and end dates for the time series.
For example, the user could select a daily report for 30 days in June.
Trend Weight
To calculate the window size (i.e., the number of data points to average when calculating the trend line), the following expression is used:
data points / trend weight
Where data points is derived from user inputs and trend weight is 6.4. Even though a trend weight of 6.4 produces good fits, it is rather arbitrary, and might not be appropriate for different user inputs.
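In code, the calculation is essentially this (simplified; how the result is rounded and clamped is a detail not spelled out above, so treat that part as one reasonable choice):

// Window size for the moving average used to draw the trend line.
// trendWeight is currently the fixed, hand-tuned value 6.4.
int windowSize(int dataPoints, double trendWeight) {
    return Math.max(1, (int) Math.round(dataPoints / trendWeight));
}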
Question
How should trend weight be calculated given the constraints of this problem?
Based on the looks of the graphs I would say you have too many points for your 12 point graph (it is just a spline of the points given... which is visually pleasing, but actually does more harm than good when trying to understand the trend) and too few points for your 365 point graph. Perhaps try doing something a little exponential like:
(Data points)^1.2/14.1
I do realize this is even more arbitrary than what you already have, but arbitrary isn't the worst thing in the world.
(I got 14.1 by trying to keep the 52-point graph fixed, since that one looks nice: (52^1.2 / 52) * 6.4 = 14.1. Using this technique, you could try other powers besides 1.2 to see what you get visually.)
Dan
I voted this up for the quality of your results and the clarity of your write-up. I wish I could offer an answer that could improve on your already excellent work.
I fear that it might be a matter of trial and error with the trend weight until you see an improved fit.
It could be that you could make this an input from users as well: allow them to fiddle with the value, given realistic constraints, until they get satisfactory values.
I also wondered if the weight would be different for each graph, since the number of points in each is different. Are you trying to get a single weighting that works for all graphs?
Excellent work; a nice question. Well done. I wish I was more helpful. Perhaps someone else will have more wisdom to impart than I do.
It might look like the trend lines are accurate in those 4 graphs, but they are really quite off (this is best seen at the beginning of the lower-left one and the beginning of the upper-right one). I would think that you want to use no less than half of your points when finding the trend line (though really you should use much more than half). I would suggest a trend weight of 2 at a maximum, though really you ought to stick closer to the 1-1.5 range. Since it is arbitrary, I would suggest you give your users an "accuracy of trend line" slider, where the most accurate setting uses a trend weight of 1 and the least accurate uses a weight of the number of data points + 1. The latter would use 0 points (assuming you always round down) and, I would assume, though your statistics software might differ, would generate a straight horizontal line.