I am wondering if anyone knows of an Algorithm I could use to help me solve the following problem:
Allocate people (n) to certain events (m). Each m can have only one person attached to it, and the allocation must be randomized each time (the same person is allowed if there is only one available option n). Each n has properties such as time available and day available. For an n to be matched to an m, the time available and day available must match for both n and m. There can be multiple n that match the times of an m, but the best fit has to be chosen so that the rest of the m's can still be allocated. The diagram below will more than likely explain it better (sorry). An n can be allocated to more than one m, but this should be done fairly so that one n doesn't take all of the available m's.
As you can see, Person A could be attached to Event A, but due to the need to have them all matching (the best attempt to match), it is attached to Event B so that Person C can be allocated to Event A and Person B to Event C.
I am simply wondering if anyone knows the name of this type of problem and how I could go about solving it. I am coding the program in Java.
This is a variant of the Max Flow Problem. There are many algorithms tailor-made to solve max-flow problems, including the Ford-Fulkerson Algorithm or its refinement, the Edmonds-Karp Algorithm. Once you are able to change your problem into a max-flow problem, solving it is fairly simple. But what is the max-flow problem?
The problem takes in a weighted, directed graph and asks the question "What is the maximum amount of flow that can be directed from the source (a node) to the sink (another node)?".
There are a few constraints that make logical sense when thinking of the graph as a network of water flows.
The amount of flow through each edge must be less than or equal to the "capacity" (weight) of that edge for every edge in the graph. They also must be non-negative numbers.
The amount of flow into each node must equal the amount of flow leaving that node, for every node except the source and sink. There is no limit to the amount of flow that goes through a node.
Consider the following graph, with s as the source and t as the sink.
The solution to the max flow problem would be a total flow of 25, with the following flow amounts:
It is simple to transform your problem into a max flow problem. Assuming your inputs are:
N people, plus associated information on when person p_i is available (time and date).
M events with a time and place.
Create a graph with the following properties:
A super source s
N person nodes p_1 ... p_n, with an edge of capacity infinity connecting s to p_i for all i in 1 ... n.
A super sink t
M event nodes e_1 ... e_m, with an edge of capacity 1 connecting e_i to t for all i in 1 ... m
For every combination of a person and event (p_i, e_j), an edge with capacity infinity connecting p_i to e_j iff p_i can validly attend event e_j (otherwise no edge connecting p_i and e_j).
Constructing a graph to these specifications has O(1) + O(N) + O(N) + O(M) + O(M) + O(1) + O(NM) = O(NM) runtime.
For your example the constructed graph would look like the following (with unlabeled edges having capacity infinity):
You correctly noticed that there is a Max Flow with value 4 for this example, and any max flow would return the same value. Once you can perform this transformation, you can run any max flow algorithm and solve your problem.
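To make the reduction concrete, here is a minimal sketch in Java: it builds the capacity matrix exactly as described above (super source, person nodes, event nodes, super sink) and runs a BFS-based augmenting-path search (Edmonds-Karp). The canAttend matrix and all names are made up for illustration, not part of the original question.

import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;

public class EventAssignment {

    // BFS-based augmenting paths (Edmonds-Karp). In this particular graph each
    // augmenting path carries exactly one unit, because every event->sink edge
    // has capacity 1.
    static int maxFlow(int[][] cap, int s, int t) {
        int n = cap.length, flow = 0;
        while (true) {
            int[] parent = new int[n];
            Arrays.fill(parent, -1);
            parent[s] = s;
            Queue<Integer> queue = new ArrayDeque<>();
            queue.add(s);
            while (!queue.isEmpty() && parent[t] == -1) {
                int u = queue.poll();
                for (int v = 0; v < n; v++) {
                    if (parent[v] == -1 && cap[u][v] > 0) {
                        parent[v] = u;
                        queue.add(v);
                    }
                }
            }
            if (parent[t] == -1) {
                return flow;                      // no augmenting path left
            }
            for (int v = t; v != s; v = parent[v]) {
                cap[parent[v]][v] -= 1;           // push one unit along the path
                cap[v][parent[v]] += 1;           // and record the residual edge
            }
            flow++;
        }
    }

    public static void main(String[] args) {
        // canAttend[i][j] == true iff person i can validly attend event j (made-up data).
        boolean[][] canAttend = {
                {true, true, false},
                {false, false, true},
                {true, false, false}};
        int people = canAttend.length, events = canAttend[0].length;
        int n = people + events + 2, s = 0, t = n - 1;
        int inf = Integer.MAX_VALUE / 2;
        int[][] cap = new int[n][n];
        for (int i = 0; i < people; i++) cap[s][1 + i] = inf;          // super source -> person
        for (int j = 0; j < events; j++) cap[1 + people + j][t] = 1;   // event -> super sink
        for (int i = 0; i < people; i++)
            for (int j = 0; j < events; j++)
                if (canAttend[i][j]) cap[1 + i][1 + people + j] = inf; // person -> event
        System.out.println("Events that can be staffed: " + maxFlow(cap, s, t));
    }
}

The final person-to-event assignments can be read back from the person->event edges whose capacity decreased during the run.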
Create a class called AllocatePerson that has a Person and a list of Events as attributes; the list is called lsInnerEvents (you have to define the Person and Event classes first, both with a list of times and days).
In the constructor of AllocatePerson you feed a Person and a list of Events; the constructor will cycle through the events and add to the internal list only the ones that match the Person's availability.
The main code will create an AllocatePerson for each Person (one at a time), implementing the following logic:
If the newly created object "objAllocatePerson" has an lsInnerEvents list of size 1, you remove the element contained in lsInnerEvents from the list of Events to allocate and fire a procedure called MaintainEvents(Events removedEvents), passing the allocated event (the one inside lsInnerEvents).
The function MaintainEvents will cycle through the current array of AllocatePersons and remove "removedEvents" from their lsInnerEvents; if after that the size of lsInnerEvents is 1, it will recursively invoke MaintainEvents() with the newly removed event and remove the new lsInnerEvents element from the main list of Events to allocate.
At the end of the execution you will have all the associations, simply by cycling through the array of AllocatePersons where lsInnerEvents has size 1.
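A rough sketch of what this could look like in Java. The field and method names follow the wording above, but the exact types (a single day/time "slot" string for availability, the owner-skipping in the maintenance routine) are my own assumptions:

import java.util.ArrayList;
import java.util.List;

class Event {
    final String slot;                               // day + time the event takes place
    Event(String slot) { this.slot = slot; }
}

class Person {
    final String name;
    final List<String> slots;                        // days + times the person is available
    Person(String name, List<String> slots) { this.name = name; this.slots = slots; }
}

class AllocatePerson {
    final Person person;
    final List<Event> lsInnerEvents = new ArrayList<>();

    AllocatePerson(Person person, List<Event> events) {
        this.person = person;
        for (Event e : events) {                     // keep only the events this person can attend
            if (person.slots.contains(e.slot)) {
                lsInnerEvents.add(e);
            }
        }
    }
}

class Allocator {
    // When "owner" is down to a single candidate event, that event is considered
    // allocated: remove it from the pool and from everyone else's candidate list;
    // anyone else left with a single candidate gets fixed in turn.
    static void maintainEvents(Event removed, AllocatePerson owner,
                               List<AllocatePerson> allocations, List<Event> toAllocate) {
        toAllocate.remove(removed);
        for (AllocatePerson ap : allocations) {
            if (ap == owner) continue;               // keep the allocation in the owner's list
            if (ap.lsInnerEvents.remove(removed) && ap.lsInnerEvents.size() == 1) {
                maintainEvents(ap.lsInnerEvents.get(0), ap, allocations, toAllocate);
            }
        }
    }
}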
An approach that you can consider is as follows:
Create Java Objects for Persons and Events.
Place all Events in a pool (Java Collection)
Have each Person select an Event from the pool. As each Person can only select Events on specific days, create a subset of the pool containing only the Events that the Person may select from.
Add the necessary attributes to the Events to ensure that each one can only be selected once by a Person (a sketch of this pool-based selection follows below).
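A minimal sketch of that pool idea, reusing the Event and Person shapes from the earlier sketch (all names are illustrative): each person draws at random from the subset of still-unassigned events that match their availability, so no event is selected twice.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class PoolAllocator {
    static Map<Event, Person> allocate(List<Person> people, List<Event> allEvents) {
        Map<Event, Person> assignment = new HashMap<>();
        List<Event> pool = new ArrayList<>(allEvents);   // events still available for selection
        List<Person> order = new ArrayList<>(people);
        Collections.shuffle(order);                      // randomize the allocation each run
        Random random = new Random();
        for (Person p : order) {
            List<Event> candidates = new ArrayList<>();
            for (Event e : pool) {
                if (p.slots.contains(e.slot)) {          // subset of the pool this person may pick from
                    candidates.add(e);
                }
            }
            if (!candidates.isEmpty()) {
                Event chosen = candidates.get(random.nextInt(candidates.size()));
                assignment.put(chosen, p);
                pool.remove(chosen);                     // an event can only be selected once
            }
        }
        return assignment;
    }
}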
As described in another question, I am attempting to add several "identity" vertices into a "group" vertex. Based on the recipe recommendation, I'm trying to write the traversal steps in such a way that the traversers iterate the identity vertices instead of appending extra steps in a loop. Here's what I have:
gts.V(group)
.addE('includes')
.to(V(identityIds as Object[]))
.count().next()
This always returns a value of 1, no matter how many IDs I pass in identityIds, and only a single edge is created. The profile indicates that only a single traverser is created for the __.V even though I'm passing multiple values:
Traversal Metrics
Step                                                               Count  Traversers  Time (ms)  % Dur
=============================================================================================================
TinkerGraphStep(vertex,[849e1878-86ad-491e-b9f9...                     1           1      0.633  40.89
AddEdgeStep({label=[Includes], ~to=[[TinkerGrap...                     1           1      0.915  59.11
TinkerGraphStep(vertex,[50d9bb4f-ed0d-493d-bf...                       1           1      0.135
>TOTAL                                                                 -           -      1.548      -
Why is only a single edge added to the first vertex?
The to() syntax you are using is not quite right. A modulator like to() expects the traversal you provide it to produce a single Vertex, not a list. So, given V(identityIds), only the first vertex returned from that list of ids will be used to construct the edge. Step modulators like to(), by(), etc. tend to work like that.
You would want to reverse your approach to:
gts.V(identityIds)
.addE('includes')
.from(V(group))
.count().next()
But perhaps that leads back to your other question.
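If you prefer to keep the group vertex as the starting point, another pattern that is sometimes used (this is my own sketch, not part of the answer above) is to label the group vertex and refer back to it with a step label, letting the mid-traversal V() spawn one traverser per identity id:

gts.V(group).as('g')
   .V(identityIds as Object[])   // one traverser per identity vertex
   .addE('includes')
   .from('g')                    // every edge starts at the labelled group vertex
   .count().next()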
I am new to Spark and its related concepts, so please bear with me and help me clear up my doubts. I'll give you an example to help you understand my question.
I have one JavaPairRDD "rdd" which contains tuples like
Tuple2<Integer, String[]>
Let's assume that String[].length == 3, meaning it contains 3 elements besides the key. What I want to do is update each element of the vector using 3 RDDs and 3 operations: "R1" and "operation1" are used to modify the first element, "R2" and "operation2" the second element, and "R3" and "operation3" the third element.
R1, R2 and R3 are the RDDs that provide the new values of the elements.
I know that Spark divides the data (in this example "rdd") into many partitions, but what I am asking is: is it possible to do different operations in the same partition at the same time?
According to my example, because I have 3 operations, that would mean I can take 3 tuples at the same time instead of working on only one.
The treatment that I want is (t refers to the time):
at t=0:
*tuple1 = use operation1 to modify element 1
*tuple2 = use operation2 to modify element 2
*tuple3 = use operation3 to modify element 3
at t=1:
*tuple1 = use operation2 to modify element 2
*tuple2 = use operation3 to modify element 3
*tuple3 = use operation1 to modify element 1
at t=2:
*tuple1 = use operation3 to modify element 3
*tuple2 = use operation1 to modify element 1
*tuple3 = use operation2 to modify element 2
After finishing the update of the first 3 tuples, I take another 3 tuples from the same partition to process, and so on.
Please be kind, it's just a thought that crossed my mind, and I want to know whether it is possible or not. Thank you for your help.
Spark doesn't guarantee the order of execution.
You decide how individual elements of RDD should be transformed and Spark is responsible for applying the transformation to all elements in a way that it decides is the most efficient.
Depending on how many executors (i.e. threads or servers, or both) are available in your environment, Spark will process as many tuples as possible at the same time.
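To make that concrete, here is a small sketch (the data and the three per-element functions are stand-ins invented for illustration, not taken from the question): you declare the per-tuple update once with mapValues, and Spark decides how many tuples are processed concurrently based on partitions and available cores.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class UpdateVectors {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("update-vectors").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Stand-in for the question's "rdd": key -> vector of 3 elements.
            JavaPairRDD<Integer, String[]> rdd = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>(1, new String[]{"a", "b", "c"}),
                    new Tuple2<>(2, new String[]{"d", "e", "f"})));

            // The per-tuple transformation is declared once; Spark applies it to
            // however many tuples it can process in parallel.
            JavaPairRDD<Integer, String[]> updated = rdd.mapValues(v -> new String[]{
                    v[0].toUpperCase(),   // stand-in for operation1 / values from R1
                    v[1] + "!",           // stand-in for operation2 / values from R2
                    v[2] + v[2]           // stand-in for operation3 / values from R3
            });

            updated.collect().forEach(t ->
                    System.out.println(t._1() + " -> " + Arrays.toString(t._2())));
        }
    }
}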
First of all, welcome to the Spark community.
To add to @Tomasz Błachut's answer: Spark's execution context does not treat nodes (e.g. one computing PC) as individual processing units, but rather their cores. Therefore, one job may be assigned to two cores on a 22-core Xeon instead of the whole node.
The Spark execution context does consider nodes as computing units when it comes to their efficiency and performance, though, as this is relevant for dividing bigger jobs among nodes of varying performance or blacklisting them if they are slow or fail often.
This question was asked in an investment banking company's interview.
I have to design myCache, which keeps a cache of StudentRecords objects; there can be only one myCache object for the studentRecords collection. When the user wants to insert a record into studentRecords, it will only be inserted if there are fewer than 20 records in the collection; otherwise the least used record is removed from studentRecords and the new record is inserted. Records are inserted in sorted order based on the ranking of the studentRecords. When the user wants to read a record, it first checks whether the record exists in myCache; if it does not exist, the record is read from the studentRecords collection.
I created a doubly linked list and insert records based on ranking. I can also make a myCache class which is a singleton and reads records from the cache. But how do I delete the records which are least used?
I could create an array list which deletes records from the top of the array (the least used record), but then it cannot keep elements ordered by rank, and reading a record by rank becomes expensive again.
Is there any other solution which would have impressed the interviewer?
The myCache class has functions like:
public void removeRecordFromStudentRecords(String rank);
public void addRecordToStudentRecords(StudentRecords st);
public Student readRecordFromStudentRecords(String rank);
Table of StudentRecords:
SrNo  rank  name   maths  science  total  percentage
1     1     rohan  90     90       180    90
2     2     sohan  80     90       160    80
3     3     abhi   70     70       140    70
If we're talking about a cache, we should optimize time complexity first and memory later.
So, in this case, I can propose the following solution:
Use a Map (i.e. HashMap) for storing records (key: recordId, value: Record).
Use a Stack for the last used items (value: recordId).
Use a Tree (i.e. a BST) for holding the rank (key: rankValue, value: recordId).
The combination of these three data structures provides the fastest solution (I guess).
Read-by-id operation: O(1), just a simple get from the map.
Add-record operation: O(log N), because we need to insert the key into the tree (we do not include balancing in the complexity).
Remove-by-rank operation: O(log N), simply finding the recordId by rank in the tree (don't forget to remove the record from the Map and the recordId from the Stack).
This is just a brief overview of the problem. I guess it's enough to understand the main idea.
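A rough sketch of those structures in Java (the class and field names are mine, and I've used a deque rather than a literal Stack for the usage order; with only 20 records the linear remove in the deque is negligible):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class StudentCache {
    static final int CAPACITY = 20;

    static class Record {
        final String id;
        final int rank;
        Record(String id, int rank) { this.id = id; this.rank = rank; }
    }

    private final Map<String, Record> records = new HashMap<>();     // id -> record, O(1) reads
    private final TreeMap<Integer, String> byRank = new TreeMap<>(); // rank -> id, kept sorted
    private final Deque<String> usage = new ArrayDeque<>();          // front = most recently read

    Record read(String id) {
        Record r = records.get(id);
        if (r != null) {                 // cache hit: move the id to the front of the usage order
            usage.remove(id);
            usage.addFirst(id);
        }
        return r;                        // on a miss the caller loads from the backing collection
    }

    void add(Record r) {
        if (records.size() >= CAPACITY) {
            String leastUsed = usage.pollLast();          // evict the least recently used record
            Record evicted = records.remove(leastUsed);
            byRank.remove(evicted.rank);
        }
        records.put(r.id, r);
        byRank.put(r.rank, r.id);
        usage.addFirst(r.id);
    }

    Record readByRank(int rank) {
        String id = byRank.get(rank);                     // O(log N) lookup by rank
        return id == null ? null : read(id);
    }
}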
In order to keep track of the least used record, you need to store the number of hits each record has (if you do not know what "hits" are, I suggest you look up "hits and misses in caching"). So each studentRecord can be an object of a class as follows:
class StudentRecord {
    int unique_id;
    int ranking;
    int hits;
}
StudentRecord studentRecord = new StudentRecord();
Sort your cache based on studentRecord.ranking, and when you need to decide which studentRecord to delete, simply traverse the cache and delete the element with the minimum hits.
To maintain hits, whenever you get a query for a studentRecord based on its unique_id, you increment its hits by 1. Thus, hits give you a metric of which studentRecord is most used/least used.
EDIT: Your question is now much clearer. For sorting, you can use simple insertion sort. The reasons for this are 1) you have max 20 elements in your cache and 2) when you try and insert a new element, insertion sort will help you perfectly to find the index where you can place the element. In fact, technically speaking, you need only sort once. Then you just need to figure out where to insert future elements.
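A small sketch of that combination (my own names; the StudentRecord class is the one above): keep the list sorted by ranking on insert, bump hits on every read, and evict the entry with the fewest hits when the cache is full.

import java.util.ArrayList;
import java.util.List;

class HitCountCache {
    private static final int CAPACITY = 20;
    private final List<StudentRecord> cache = new ArrayList<>();

    void add(StudentRecord r) {
        if (cache.size() >= CAPACITY) {
            StudentRecord least = cache.get(0);
            for (StudentRecord s : cache) {
                if (s.hits < least.hits) least = s;   // least used record
            }
            cache.remove(least);
        }
        int i = 0;                                    // insertion-sort style placement by ranking
        while (i < cache.size() && cache.get(i).ranking < r.ranking) i++;
        cache.add(i, r);
    }

    StudentRecord readById(int id) {
        for (StudentRecord s : cache) {
            if (s.unique_id == id) {
                s.hits++;                             // count the hit
                return s;
            }
        }
        return null;                                  // miss: load from the backing collection
    }
}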
I would say a simple list with arbitrary access (like java.util.ArrayList) will suffice. It gives you random access as in arrays and can also accommodate fewer than 20 elements. I see no reason to use a doubly linked list, since there is no need to access an element's left and right neighbours here...
The Least Recently Used (LRU) scheduling technique can be applied here: you can keep a byte field for each entry in your list.
Every time an entry is used, you push a 1 into the byte (shift it with b >> 1 and set the most significant bit).
So entries which are used more frequently will have a lot of 1s in the binary representation of their byte.
Data not being used at all will have all 0s.
And every time you are required to delete an entry from your cache, just delete the one with 0, or the one with the smallest value of this byte field.
Also, to remember references over a longer span than just the last eight uses, you can use bigger data types.
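One common way to implement that shifting byte (a sketch based on my own reading of the answer, with invented names): set a referenced flag on every access, and on each periodic aging tick shift the counter right and push a 1 into the top bit for entries that were used since the last tick.

class AgedEntry {
    byte age;               // larger unsigned value = used more recently/often
    boolean referenced;     // set on access, cleared on each aging tick

    void touch() {          // call whenever the cached entry is used
        referenced = true;
    }

    void tick() {           // call periodically for every entry in the cache
        int value = (age & 0xFF) >>> 1;        // shift the history right by one
        if (referenced) {
            value |= 0x80;                     // push a 1 for entries used since the last tick
            referenced = false;
        }
        age = (byte) value;
    }

    int weight() {          // evict the entry with the smallest weight
        return age & 0xFF;
    }
}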
Suppose I'm into Big Data (as in bioinformatics), and I've chosen to analyze it in Java using the wonderful Collections Map-Reduce framework on HPC. How can I work with datasets of more than 2^31 - 1 items? For example,
final List<Gene> genome = getHugeData();
profit.log(genome.parallelStream().collect(magic));
Wrap your data so it consists of many chunks -- once you exceed 2^31 - 1 items you move on to the next one. A sketch:
class Wrapper {
    private List<List<Gene>> chunks;

    Gene get(long id) {
        int chunkId = (int) (id / Integer.MAX_VALUE);  // which chunk the id falls into
        int itemId = (int) (id % Integer.MAX_VALUE);   // offset within that chunk
        List<Gene> chunk = chunks.get(chunkId);
        return chunk.get(itemId);
    }
}
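A matching add method is not part of the sketch above; one possible version (my own assumption, expecting java.util.ArrayList to be imported) appends to the last chunk and starts a new chunk once it is full:

// This would live inside Wrapper. For get()'s arithmetic to work, each full chunk
// must hold exactly Integer.MAX_VALUE elements; real JVM array limits are slightly
// lower, so in practice you would pick a smaller CHUNK_SIZE constant and use it in
// both get() and add().
void add(Gene gene) {
    if (chunks.isEmpty() || chunks.get(chunks.size() - 1).size() == Integer.MAX_VALUE) {
        chunks.add(new ArrayList<>());
    }
    chunks.get(chunks.size() - 1).add(gene);
}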
In this case you have multiple problems. How big is your data?
The simplest solution is to use another structure, such as a LinkedList (only if you are interested in serial access) or a HashMap (which may have a high insertion cost). A LinkedList does not allow any random access whatsoever: if you want to access the 5th element, you first have to access all 4 previous elements as well.
Here is another thought:
Let us assume that each gene has an id number (long). You can use an index structure such as a B+-tree and index your data using the tree. The index does not have to be stored on disk; it can remain in memory. It does not have much overhead either. You can find many implementations of it online.
Another solution would be to create a container class which would contain either other container classes or Genes. In order to achieve that both should implement an interface called e.g. Containable. In that way both classes Gene and Container are Containable(s). Once a container reaches its max. size it can be inserted in another container and so on. You can create multiple levels that way.
I would suggest you look online (e.g. Wikipedia) for the B+-tree if you are not familiar with it.
An array with 2^31 object references would consume about 17 GB of memory...
You should store the data in a database.
I'm looking into using a consistent hash algorithm in some java code I'm writing. The guava Hashing library has a consistentHash(HashCode, int) method, but the documentation is rather lacking. My initial hope was that I could just use consistentHash() for simple session affinity to efficiently distribute load across a set of backend servers.
Does anyone have a real-world example of how to use this method? In particular I'm concerned with managing the removal of a bucket from the target range.
For example:
@Test
public void testConsistentHash() {
List<String> servers = Lists.newArrayList("server1", "server2", "server3", "server4", "server5");
int bucket = Hashing.consistentHash(Hashing.md5().hashString("someId"), servers.size());
System.out.println("First time routed to: " + servers.get(bucket));
// one of the back end servers is removed from the (middle of the) pool
servers.remove(1);
bucket = Hashing.consistentHash(Hashing.md5().hashString("someId"), servers.size());
System.out.println("Second time routed to: " + servers.get(bucket));
}
Leads to the output:
First time routed to: server4
Second time routed to: server5
What I want is for that identifier ("someId") to map to the same server after removal of a server earlier in the list. So in the sample above, after removal I guess I'd want bucket 0 to map to "server1", bucket 1 to map to "server3", bucket 2 to map to "server4" and bucket 3 to map to "server5".
Am I supposed to maintain a separate (more complicated than a list) data structure to manage bucket removal and addition? I guess I had envisioned perhaps a more complicated Hashing API that would manage the remapping after adding and removal of particular buckets for me.
Note: I know the sample code is using a small input and bucket set. I tried this with 1000s of input across 100 buckets and the result is the same. Inputs that map to buckets 0-98 stay the same when I change the buckets to 99 and bucket 99 gets distributed across the remaining 99 buckets.
I'm afraid that no data structure can do it really right with the current consistentHash. As the method accepts only the list size, nothing but appending and removal from the end can be supported. Currently, the best solution probably consists of replacing
servers.remove(n)
by
servers.set(n, servers.get(servers.size() - 1));
servers.remove(servers.size() - 1);
This way you sort of swap the failed and the very last server. This looks bad, as it makes the assignments to the two swapped servers wrong. This problem is only half as bad, since one of them has failed anyway. But it makes sense: after the subsequent removal of the last list element, everything is fine, except for the assignments to the failed server and to the previously last server.
So twice as many assignments as necessary change. Not optimal, but hopefully usable?
I don't think there's a good way to do this at the moment. consistentHash in its current form is useful only in simple cases -- basically, where you have a knob to increase or decrease the number of servers... but always by adding and removing at the end.
There's some work underway to add a class like this:
public final class WeightedConsistentHash<B, I> {
/** Initially, all buckets have weight zero. */
public static <B, I> WeightedConsistentHash<B, I> create(
Funnel<B> bucketFunnel, Funnel<I> inputFunnel);
/**
* Sets the weight of bucket "bucketId" to "weight".
* Requires "weight" >= 0.0.
*/
public void setBucketWeight(B bucketId, double weight);
/**
* Returns the bucket id that "input" maps to.
* Requires that at least one bucket has a non-zero weight.
*/
public B hash(I input);
}
Then you would write:
WeightedConsistentHash<String, String> serverChooser =
WeightedConsistentHash.create(stringFunnel(), stringFunnel());
serverChooser.setBucketWeight("server1", 1);
serverChooser.setBucketWeight("server2", 1);
// etc.
System.out.println("First time routed to: " + serverChooser.hash("someId"));
// one of the back end servers is removed from the (middle of the) pool
serverChooser.setBucketWeight("server2", 0);
System.out.println("Second time routed to: " + serverChooser.hash("someId"));
And you should get the same server each time. Does that API look suitable?
The guava API does not have any knowledge of your server list. It can only guarantee this:
int bucket1 = Hashing.consistentHash(Hashing.md5().hashString("server1"), N);
int bucket2 = Hashing.consistentHash(Hashing.md5().hashString("server1"), N - 1);
assertThat(bucket1, is(equalTo(bucket2))); // guaranteed whenever bucket1 != N - 1
You need to manage the mapping from buckets to your server list yourself.
The answer proposed in the question is the correct one:
Am I supposed to maintain a separate (more complicated than a list) data structure to manage bucket removal and addition?
Guava is hashing into a ring with ordinal numbers. The mapping from those ordinal numbers to the server ids has to be maintained externally:
Given N servers initially, one can choose some arbitrary mapping from each ordinal number 0..N-1 to server ids A..K (0->A, 1->B, .., N-1->K). A reverse mapping from the server id to its ordinal number is also required (A->0, B->1, ..).
On the deletion of a server - the ordinal number space shrinks by one. All the ordinal numbers starting with the one for the deleted server need to be remapped to the next server (shift by one).
So for example, after the initial mapping, say server C (corresponding to ordinal number 2) got deleted. Now the new mappings become: (0->A, 1->B, 2->D, 3->E, .., N-2->K)
On the addition of a server L (say going from N to N+1 servers) - a new mapping can be added from N->L.
What we are doing here is mimicking how nodes would move in a ring as they are added and deleted. While the ordering of the nodes remains the same, their ordinal numbers (on which Guava operates) can change as nodes come and go.
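A bare-bones sketch of maintaining that mapping around Guava (the class and method names are made up; it only illustrates the ordinal -> server-id list described above, using the hashString overload that takes a Charset in recent Guava versions):

import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ConsistentRouter {
    // The position in this list is the ordinal bucket number Guava hashes into;
    // the value is the server id that ordinal currently maps to.
    private final List<String> servers = new ArrayList<>();
    private final HashFunction hashFunction = Hashing.md5();

    void addServer(String serverId) {
        servers.add(serverId);                  // new ordinal N -> new server
    }

    void removeServer(String serverId) {
        servers.remove(serverId);               // ordinals after it shift down by one
    }

    String route(String key) {
        int bucket = Hashing.consistentHash(
                hashFunction.hashString(key, StandardCharsets.UTF_8), servers.size());
        return servers.get(bucket);
    }
}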