I've got a sorting problem. I have a function; let's call it a blackbox to simplify.
As input it takes two jobs (tasks) and as output it returns the one to be processed first. For example:
input(1,2) --> output: Job 2 is first.
Problem is, this blackbox sometimes makes bad decisions.
Example: Suppose we have 3 jobs: 0, 1, and 2. We test each job against the others to identify a processing order.
input(0,1) --> output: Job 0 is first
input(1,2) --> output: Job 1 is first
input(0,2) --> output: Job 2 is first (bad decision)
So here's the problem: going by the first two inputs, job 0 should be processed before job 2, but the blackbox says otherwise.
I want to sort a set of jobs using this blackbox, taking this problem into consideration.
So, how can I sort the set of jobs?
It is easy to identify that the problem exists: build a directed graph of the decisions. If it contains a cycle, then you have a bad decision somewhere.
But it is impossible to find out which decision is bad. Any decision in the cycle could be bad (or even several of them).
EDIT
You can remove some edges of the graph to break the cycles (you can choose them any way you like). After that, your tasks will be partially ordered (or maybe totally ordered; I need to think about it).
EDIT 2
Here is a Wikipedia article which may help you: Feedback arc set
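A sketch of that approach in Java (assuming the jobs are numbered 0..n-1 and each blackbox decision "i before j" has already been recorded as an edge i -> j): back edges found during DFS are exactly the edges that close a cycle, and simply skipping them amounts to removing them, after which the DFS finish order is a topological sort of what remains.

```java
import java.util.*;

// Sketch only: treat each blackbox decision "i before j" as an edge i -> j,
// drop back edges (the ones that close a cycle), topologically sort the rest.
public class DecisionGraph {
    // Returns a processing order of n jobs after ignoring cycle-closing edges.
    public static List<Integer> sort(int n, List<int[]> edges) {
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) adj.get(e[0]).add(e[1]);

        int[] color = new int[n]; // 0 = unvisited, 1 = on stack, 2 = done
        Deque<Integer> order = new ArrayDeque<>();
        for (int s = 0; s < n; s++) dfs(s, adj, color, order);
        return new ArrayList<>(order);
    }

    private static void dfs(int u, List<List<Integer>> adj, int[] color, Deque<Integer> order) {
        if (color[u] != 0) return;
        color[u] = 1;
        for (int v : adj.get(u)) {
            if (color[v] == 1) continue; // back edge: closes a cycle, skip it (i.e. "remove" it)
            dfs(v, adj, color, order);
        }
        color[u] = 2;
        order.addFirst(u); // prepending finished nodes yields a topological order
    }

    public static void main(String[] args) {
        // The asker's example: 0->1, 1->2, 2->0 is a cycle; one edge gets dropped.
        List<Integer> order = sort(3, Arrays.asList(new int[]{0, 1}, new int[]{1, 2}, new int[]{2, 0}));
        System.out.println(order); // prints [0, 1, 2]
    }
}
```

Which edge gets dropped depends on DFS order, which matches the answer's point that you can choose the edges to remove any way you like.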
I was solving a Codeforces problem yesterday. The problem's URL is this
I will just explain the question in short below.
Given a binary string, divide it into a minimum number of subsequences
in such a way that each character of the string belongs to exactly one
subsequence and each subsequence looks like "010101 ..." or "101010
..." (i.e. the subsequence should not contain two adjacent zeros or
ones).
Now, for this problem, I had submitted a solution yesterday during the contest. This is the solution. It was accepted provisionally, but on the final test cases it got a Time Limit Exceeded status.
So today, I again submitted another solution and this passed all the cases.
In the first solution, I used HashSet, and in the second one I used LinkedHashSet. I want to know: why didn't HashSet clear all the cases? Does this mean I should use LinkedHashSet whenever I need a Set implementation? I saw this article and found that HashSet performs better than LinkedHashSet. But why doesn't my code work here?
This question would probably get more replies on Codeforces, but I'll answer it here anyways.
After a contest ends, Codeforces allows other users to "hack" solutions by writing custom inputs to run on other users' programs. If the defending user's program runs slowly on the custom input, the status of their code submission will change from "Accepted" to "Time Limit Exceeded".
The reason why your code, specifically, changed from "Accepted" to "Time Limit Exceeded" is that somebody created an "anti-hash test" (a test on which your hash function results in many collisions) on which your program ran slower than usual. If you're interested in how such tests are generated, you can find several posts on Codeforces, like this one: https://codeforces.com/blog/entry/60442.
As linked by @Photon, there's a post on Codeforces explaining why you should avoid using Java's HashSet and HashMap: https://codeforces.com/blog/entry/4876, which is essentially due to anti-hash tests. In some instances, adding the extra log(n) factor from a balanced BST may not be so bad (by using TreeSet or TreeMap). In many cases, an extra log(n) factor won't make your code time out, and it gives you protection from anti-hash tests.
How do you determine whether your algorithm is fast enough to add the log(n) factor? I guess this comes with some experience, but most people suggest performing some sort of calculation. Most online judges (including Codeforces) show the time that your program is allowed to run on a particular problem (usually somewhere between one and four seconds), and you can use 10^9 constant-time operations per second as a rule of thumb when performing calculations.
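Besides switching to TreeSet/TreeMap, another common defence is to salt every key with a random constant chosen at runtime, so a precomputed anti-hash test no longer maps onto your actual hash buckets. A minimal sketch (the class name and wrapper design are mine, not from any library):

```java
import java.util.*;

// Sketch: XOR every key with a per-run random salt before it touches the
// HashSet, so an adversarial input built against the default hash no longer
// lands all its keys in the same buckets.
public class SaltedSet {
    private static final long SALT = new Random().nextLong();
    private final Set<Long> set = new HashSet<>();

    public void add(long x)         { set.add(x ^ SALT); }
    public boolean contains(long x) { return set.contains(x ^ SALT); }

    public static void main(String[] args) {
        SaltedSet s = new SaltedSet();
        s.add(42);
        System.out.println(s.contains(42)); // true
        System.out.println(s.contains(7));  // false
    }
}
```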
I am working on a project where I was provided a Java matrix-multiplication program that can run in a distributed system. It is run like so:
usage: java Coordinator maxtrix-dim number-nodes coordinator-port-num
For example:
java blockMatrixMultiplication.Coordinator 25 25 54545
Here's a snapshot of what the output looks like:
I want to extend this code with some kind of failsafe ability, and am curious about how I would create checkpoints within a running matrix multiplication calculation. The general idea is to recover to where it was in a computation (but it doesn't need to be so fine-grained; just recovering to the beginning, i.e. row 0, column 0, would do).
My first idea is to use log files (like Apache log4j), where I would log the relevant matrix status. Then, if we forcibly shut down the app in the middle of a calculation, we could recover to a reasonable checkpoint.
Should I use MySQL for such a task (or maybe a more lightweight database)? Or would a basic log file (using some helpful Apache libraries) be good enough? Any tips appreciated, thanks.
source-code :
MatrixMultiple
Coordinator
Connection
DataIO
Worker
If I understand the problem correctly, all you need to do is recover your place in a single matrix calculation in the event of a crash, or if the application is quit halfway through.
Minimum Viable Solution
The simplest approach would be to recover just the two matrices you were actively multiplying, but none of your progress, and multiply them from the beginning the next time you load the application.
The Process:
At the beginning of public static int[][] multiplyMatrix(int[][] a, int[][] b) in your MatrixMultiple class, create a file, let's call it recovery_data.txt, with the state of the two arrays being multiplied (parameters a and b). Alternatively, you could use a simple database for this.
At the end of public static int[][] multiplyMatrix(int[][] a, int[][] b) in your MatrixMultiple class, right before you return, clear the contents of the file, or wipe your database.
When the program is initially run, most likely near the beginning of main(String[] args), you should check whether the contents of the text file are non-empty, in which case you should multiply the contents of the file and display the output; otherwise proceed as usual.
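A minimal sketch of the save/clear/load cycle described above (the class name, file name, and format are illustrative, not taken from the asker's code): each matrix is stored one row per line, values space-separated, with a blank line between the two matrices.

```java
import java.io.*;
import java.util.*;

// Sketch of the recovery file: save the two operand matrices before
// multiplying, wipe the file on success, reload both if a run was interrupted.
public class Recovery {
    public static void save(File f, int[][] a, int[][] b) throws IOException {
        try (PrintWriter out = new PrintWriter(new FileWriter(f))) {
            writeMatrix(out, a);
            out.println(); // blank line separates the two matrices
            writeMatrix(out, b);
        }
    }

    private static void writeMatrix(PrintWriter out, int[][] m) {
        for (int[] row : m) {
            StringBuilder sb = new StringBuilder();
            for (int v : row) sb.append(v).append(' ');
            out.println(sb.toString().trim());
        }
    }

    // Returns {a, b}, or null if there is no interrupted run to resume.
    public static int[][][] load(File f) throws IOException {
        if (!f.exists() || f.length() == 0) return null;
        List<int[][]> matrices = new ArrayList<>();
        List<int[]> current = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(f))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) { // blank line: first matrix is complete
                    matrices.add(current.toArray(new int[0][]));
                    current = new ArrayList<>();
                } else {
                    String[] parts = line.split(" ");
                    int[] row = new int[parts.length];
                    for (int i = 0; i < parts.length; i++) row[i] = Integer.parseInt(parts[i]);
                    current.add(row);
                }
            }
        }
        matrices.add(current.toArray(new int[0][]));
        return new int[][][]{matrices.get(0), matrices.get(1)};
    }
}
```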
Notes on implementation:
Using a simple text file or a full-fledged relational database is a decision you are going to have to make, mostly based on real-world data that only you could really know, but in my mind a text file wins out in most situations, and here are my reasons why. You are going to want to read the data sequentially to rebuild your matrix, so being relational is not that useful. Databases are harder to work with (not too hard, but compared to a text file there is no question), and since you would not make much use of querying, that isn't balanced out by the ways they usually make a programmer's life easier.
Consider how you are going to store your arrays. In a text file you have several options; my recommendation would be to store each row on a line of text, separated by spaces or commas or some other character, and then put an extra blank line before the second matrix. I think a similar approach is used in crAlexander's answer here, but I have not tested his code. Alternatively, you could use something more complicated like JSON, but I think that would be too heavy-handed to justify. If you are using a database, then the relational structure should make several logical arrangements for your data apparent as well.
Strategic Checkpoints
You expressed interest in saving some calculations by taking advantage of the possibility that some of the calculations will have already been handled the last time the program ran. Let's first look at the pros and cons of adding checkpoints after every row has been processed, as best I can see them.
Pros:
Save computation time next time the program is run, if the system had been closed.
Cons:
Making the extra writes will either use more nodes if distributed (more on that later) or increase the general latency of calculations, because you now have to throw in a database write operation for every checkpoint.
More complicated to implement (but probably not by too much)
If my comments on the implementation of the Minimum Viable Solution convinced you that you could get away with a text file and would not have to add an RDBMS, then with checkpoints I take back the parts about not leveraging queries and everything being accessed sequentially, so a database is now perhaps the smarter choice.
I'm not saying that checkpoints are definitely not the better solution, just that I don't know if they are worth it, but here is what I would consider:
Do you expect people to be quitting half way through a calculation frequently relative to the total amount of calculations they will be running? If you think this feature will be used a lot, then the pro of adding checkpoints becomes much more significant relative to the con of it slowing down calculations as a whole.
Does it take a long time to complete a typical calculation that people are providing the program? If so, the added latency I mentioned in the cons is (percentage wise) smaller, and so perhaps more tolerable, but users are already less happy with performance, and so that cancels out some of the effect there. It also makes the argument for checkpointing more significant because it has the potential to save more time.
And so I would only recommend checkpointing like this if you expect a relatively large amount of instances where this is happening, and if it takes a relatively large amount of time to complete a calculation.
If you decide to go with checkpoints, then modify the approach to:
after every row of the output array has been processed, write the content of that row to your database, or, if you use the text file, append it at the end after another empty line to separate it from the last matrix.
on startup if you need to finish a calculation that has already been begun, solve out and distribute only the rows that have yet to be considered, and retrieve the content of the other rows from your database.
A quick point on implementing frequent checkpoints: you could greatly reduce the extra latency from frequent checkpoints by pushing this task out to an additional thread. Doing this uses more resources, and there is always some latency in actually spawning the thread, but you do not have to wait for the entire write operation to complete before proceeding.
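A sketch of that background-writer idea: a single-threaded executor serializes the checkpoint writes while the computation continues; a StringBuilder stands in for the file or database here, and all names are illustrative.

```java
import java.util.concurrent.*;

// Sketch: checkpoint writes are handed to a single worker thread, so the
// computation never blocks on I/O. A single-threaded executor preserves
// submission order, keeping the checkpoint log consistent.
public class AsyncCheckpointer {
    private final ExecutorService writer = Executors.newSingleThreadExecutor();
    private final StringBuilder sink = new StringBuilder(); // stands in for the file/DB

    public void checkpointRow(int rowIndex, int[] row) {
        int[] copy = row.clone(); // copy so the computation can keep mutating its buffer
        writer.submit(() -> {
            synchronized (sink) {
                sink.append(rowIndex).append(':');
                for (int v : copy) sink.append(' ').append(v);
                sink.append('\n');
            }
        });
    }

    // Flush and stop the writer; returns everything checkpointed so far.
    public String drain() throws InterruptedException {
        writer.shutdown();
        writer.awaitTermination(5, TimeUnit.SECONDS);
        synchronized (sink) { return sink.toString(); }
    }
}
```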
A quick warning on the implementation of any such failsafe method
If there is an unchecked edge case where some sort of invalid matrix would crash the program, this failsafe now bricks the program entirely by trying it again on every start. To combat this, I see some obvious solutions, but perhaps a bit of thought would let you modify my approaches into something you prefer:
Use a lot of try and catch statements; if you get any sort of error that seems to be caused by malformed data, wipe your recovery file, or modify it to add a note that tells your program to treat it as a special case. A good treatment of this special case may be to display the two matrices at startup with an explanation that your program failed to multiply them, likely due to malformed content.
Add data to your file/database on how many times the program has quit while solving the current problem; if this is not the first resume, treat it like the special case in the above option.
I hope that this provided enough information for you to implement your failsafe in the way that makes the most sense given what you suspect the realistic use to be, and note that there are perhaps other ways you could approach this problem as well, and these could equally have their own lists of pros and cons to take into consideration.
I have made a small program that computes logic circuits' truth tables. In the representation I have chosen (out of ignorance, I have no schooling on the subject), I use a Circuit class, and a Connector class to represent "circuits" (including basic gates such as NOT, OR...) and wiring.
A Factory class is used to "solder the pins and wires" with statements looking like this
factory.addCircuit("OR0", CircuitFactory.OR);
factory.addConnector("N0OUT", "NOT0", 0, "AND1", 1);
When the circuit is complete
factory.createTruthTable();
computes the circuit's truth table.
Inputting the truth tables for OR, NOT, and AND, the code has chained the creation of XOR, 1/2 ADDER, ADDER, and 4-bit ADDER, reusing the previous step's truth table at each step.
It's all very fine and dandy for an afternoon's work, but it will obviously break on loops (as an example, flip-flops). Does anyone know of a convenient way to represent a logic circuit with loops? Ideal would be if it could be represented with a table, maybe a table with previous states, new states and delays.
Pointing me to literature describing such a representation would also be fine. One hour of internet searching brought up only a PhD thesis, a little beyond my understanding.
Thanks a lot!
Any loop must contain at least one node with "state", of which the flip-flop (or register) is the fundamental building block. An effective approach is to split all stateful nodes into two nodes; one acting as a data source, the other as a data sink. So you now have no loops.*
To simulate, on every clock cycle,** you propagate your data values from sources to sinks in a feedforward fashion. Then you update your stateful sources (from their corresponding sinks), ready for the next cycle.
* If you still have loops at this point, then you have an invalid circuit graph.
** I'm assuming you want to simulate synchronous logic, i.e. you have a clock, and state only updates on clock edges. If you want to simulate asynchronous logic, then things get trickier, as you need to start modelling propagation delays and so on.
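A minimal sketch of the two-phase cycle for the simplest looped circuit, a toggle (a D flip-flop whose D input is the NOT of its own Q output): the flip-flop is split into a source (current Q) and a sink (next Q), the feedforward pass computes the sink, and the clock edge copies the sink back into the source. All names here are illustrative.

```java
// Sketch of the source/sink split: the NOT-gate loop through the flip-flop
// becomes feedforward once the flip-flop is split into "current Q" (source)
// and "next Q" (sink). Each iteration of the loop is one clock cycle.
public class ToggleSim {
    public static boolean[] run(int cycles) {
        boolean q = false;               // stateful "source" node, initial state 0
        boolean[] trace = new boolean[cycles];
        for (int t = 0; t < cycles; t++) {
            trace[t] = q;                // observe Q this cycle
            boolean d = !q;              // feedforward pass: combinational logic fills the "sink"
            q = d;                       // clock edge: copy sink back into source
        }
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(run(4))); // prints [false, true, false, true]
    }
}
```

The same shape scales to a real netlist: one feedforward evaluation of all combinational nodes, then one bulk copy from every sink to its paired source.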
I am currently reading a paper, and I have come to a point where the writers say that they keep some arrays in memory for every map task, and when the map task ends, they output that array.
This is the paper that i am referring to : http://research.google.com/pubs/pub36296.html
This looks like a somewhat non-MapReduce thing to do, but I am trying to implement this project, and I have come to a point where this is the only solution. I have tried many ways to follow the common MapReduce philosophy, which is to process each line and output a key-value pair, but that way I have many thousands of context writes for every line of input, and it takes a long time to write them. So my map task is a bottleneck. These context writes cost a lot.
If I do it their way, I will have managed to reduce the number of key-value pairs dramatically. So I need to find a way to have in-memory structures for every map task.
I can define these structures as static in the setup function, but I can't find a way to tell when a map task ends, so that I can output that structure. I know it sounds a bit weird, but it is the only way to work efficiently.
This is what they say in that paper
On startup, each mapper loads the set of split points to be considered for each ordered attribute. For each node n ∈ N and attribute X, the mapper maintains a table Tn,X of key-value pairs.
After processing all input data, the mappers output keys of the form (n, X) and values (v, Tn,X[v]).
Here are some edits after Sean's answer:
I am using a combiner in my job. The thing is that the context.write(Text,Text) commands in my map function are really time-consuming. My input is CSV or ARFF files. Every line is one example. My examples might have up to thousands of attributes. For every attribute, I output key-value pairs of the form <(n,X,u),Y>, where n is the name of the node (I am building a decision tree), X is the name of the attribute, u is the value of the attribute, and Y is some statistics in Text format. As you can tell, if I have 100,000 attributes, I will need 100,000 context.write(Text,Text) calls for every example. Running my map task without these commands, it runs like the wind. If I add the context.write command, it takes forever, even for a 2,000-attribute training set. It really seems like I am writing to files and not to memory. So I really need to reduce those writes. Aggregating them in memory (in the map function, not in the combiner) is necessary.
Adding a different answer, since I think I see the point of the question now.
To know when the map task ends, well, you can override close(). I don't know if this is what you want. If you have 50 mappers, which 1/50th of the input each one sees is not known or guaranteed. Is that OK for your use case, where you just need each worker to aggregate stats in memory for what it has seen and output them?
Then your procedure is fine, but I probably would not make your in-memory data structure static: nobody said two Mappers won't run in one JVM classloader.
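A plain-Java sketch of that pattern (this is not the real Hadoop Mapper API; the method names only mirror its lifecycle): aggregate into an instance field during map(), emit everything once when the task ends, and keep the table non-static for the classloader reason above.

```java
import java.util.*;

// Sketch of the aggregate-then-emit lifecycle: map() only updates an
// in-memory table; the single flush happens in the end-of-task hook.
// The output list stands in for context.write calls.
public class AggregatingMapper {
    private final Map<String, Long> table = new HashMap<>(); // instance field, not static
    private final List<String> output = new ArrayList<>();

    public void map(String key) {
        table.merge(key, 1L, Long::sum); // aggregate in memory, no write yet
    }

    public void cleanup() { // called once when the map task ends (close() in the old API)
        for (Map.Entry<String, Long> e : table.entrySet())
            output.add(e.getKey() + "," + e.getValue());
        table.clear();
    }

    public List<String> getOutput() { return output; }
}
```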
A more common version of this pattern plays out in the Reducer where you need to collect info over some known subset of the keys coming in before you can produce one record. You can use a partitioner, and the fact the keys are sorted, to know you are seeing all of that subset on one worker, and, can know when it's done because a new different subset appears. Then it's pretty easy to collect data in memory while processing a subset, output the result and clear it when a new subset comes in.
I am not sure that works here since the bottleneck happens before the Reducer.
Without knowing a bit more about the details of what you are outputting, I can't be certain this will help, but this sounds like exactly what a combiner is designed to help with. It is like a miniature reducer (in fact, a combiner implementation is just another implementation of Reducer) attached to the output of a Mapper. Its purpose is to collect map output records in memory and try to aggregate them before they are written to disk and then collected by the Reducer.
The classic example is counting values. You can output "key,1" from your map and then add up the 1s in a reducer, but this involves outputting "key,1" 1000 times from a mapper if the key appears 1000 times, when "key,1000" would suffice. A combiner does that. Of course, it only applies when the operation in question is associative/commutative and can be run repeatedly with no side effects; addition is a good example.
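In plain Java, the combining step for counting is nothing more than a fold over the per-key ones; this is only the arithmetic core of the idea, not a real Hadoop Combiner class.

```java
import java.util.*;

// Sketch of the counting combiner: 1000 ("key", 1) records from one mapper
// collapse into a single ("key", 1000) record. This relies on addition being
// associative and commutative, so partial sums can be merged in any order.
public class CountCombiner {
    public static Map<String, Long> combine(List<String> keys) {
        Map<String, Long> counts = new HashMap<>();
        for (String k : keys) counts.merge(k, 1L, Long::sum);
        return counts;
    }
}
```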
Another answer: in Mahout we implement a lot of stuff that is weird, a bit complex, or very slow if done the simple way. Pulling tricks like collecting data in memory in a Mapper is a minor and sometimes necessary sin, so there's nothing really wrong with it. It does mean you really need to know the semantics that Hadoop guarantees, test well, and think about running out of memory if you're not careful.
I've encountered JBehave recently and I think we should use it. So I called in our team's tester, and he also thinks it should be used.
With that as starting point I have asked the tester to write stories for a test application (the Bowling Game Kata of Uncle Bob). At the end of the day we would try to map his tests against the bowling game.
I was expecting a test like this:
Given a bowling game
When player rolls 5
And player rolls 4
Then total pins knocked down is 9
Instead, the tester came up with 'logical tests'; in other words, he was not being that specific. But in his terms this was a valid test.
Given a bowling game
When player does a regular throw
Then score should be calculated appropriately
My problem with this is ambiguity, what is a 'regular throw'? What is 'appropriately'? What will it mean when one of those steps fail?
However, the tester says that a human does understand it, and that what I was looking for were 'physical tests', which were more cumbersome to write.
I could probably map 'regular' to rolling a 4 twice (still no spare, nor strike), but it feels like I am again doing a translation I don't want to make.
So I wonder, how do you approach this? How do you write your JBehave tests? And do you have any experience when it is not you who writes these tests, and you have to map them to your code?
His test is valid, but requires a certain knowledge of the domain, which no framework will have. Automated tests should be explicit; think of them as examples. Writing them costs more than writing 'logical tests', but this pays off in the long run since they can be replayed at will, very quickly, and give immediate feedback.
You should have paired with him when writing the first tests, to set things off in the right direction. Perhaps you could give him your test and ask him to increase the coverage by adding new tests.
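The explicit scenario from the question maps naturally onto step code. A plain-Java sketch (in real JBehave these methods would carry @Given/@When/@Then annotations, and the scoring here is deliberately the bare minimum for that one scenario, not full bowling rules with spares and strikes):

```java
// Sketch of the step mapping for the explicit scenario:
//   Given a bowling game / When player rolls 5 / And player rolls 4
//   / Then total pins knocked down is 9
public class BowlingSteps {
    private int totalPins;

    public void givenABowlingGame()       { totalPins = 0; }       // Given a bowling game
    public void whenPlayerRolls(int pins) { totalPins += pins; }   // When player rolls <pins>
    public int thenTotalPinsKnockedDown() { return totalPins; }    // Then total pins knocked down is ...
}
```

The vague scenario ("a regular throw", "calculated appropriately") has no such mechanical mapping, which is exactly the problem the question raises.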
The amount of explicitness needed in acceptance criteria depends on level of trust between the development team and the business stakeholders.
In your example, the business is assuming that the developers/testers understand enough about bowling to determine the correct outcome.
But imagine a more complex domain, like finance. For that, it would probably be better to have more explicit examples to ensure a good understanding of the requirement.
Alternatively, let's say you have a scenario:
Given I try to sign up with an invalid email address
Then I should not be registered
For this, a developer/tester probably has better knowledge of what constitutes a valid or invalid email address than the business stakeholder does. You would still want to test against a variety of addresses, but that can be specified within the step definitions, rather than exposing it at the scenario level.
I hate vague words such as "appropriately" in expected values. "Appropriately" is just one example of a "toxic word" for testing, and if it is not eliminated, this "approach" can spread, effectively killing testing in general. It might be "enough" for a human tester, but such "test cases" are acceptable only in first attempts at exploratory smoke testing.
To be reproducible, systematic, and automatable, every test case must be specific. (Not just "should": is the softness of "would" to be allowed? Instead, I use the present tense "shall be", or even better the strict "is", as a claim to confirm or refute.) And this rule is absolute once it comes to automation.
What your tester made was more of a "test area" or a "scenario template" than a real test case, because so many possible test results can be produced...
You, in your scenario, were specific: that was a very specific, real test case. It is possible to automate your test case; nice: you can delegate it to a machine and evaluate it as often as you need, automatically (with the bonus of automated reporting from a Continuous Integration server).
But the "empty test scenario template"? It has some value too: it is a scenario template, an empty skeleton prepared to be filled with data. So I love to call these situations "DDT": Data-Driven Testing.
Imagine a web form to be tested, with validations on its 10 inputs, with cross-validations... and the submit button. There can be 10 test cases for every single input:
empty;
with a char, but still too short anyway;
too long for the server, but allowed within the form for copy-paste and further edits;
with invalid chars...
The approach I recommend is to prepare a set of to-pass data, or even to generate it (from a DB, or even randomly): whatever you can predict shall pass the test, the "happy scenario". Keep the data aside as a data template, use it to initialize and fill up the form, and then break down a single value, creating test cases "to fail". Do this, say, 10 times for each of the 10 inputs (100 test cases even before cross-field rules are attempted). Then, after the server has refused the form 100 times, fill up the form with the to-pass data, without distorting it, so the form is finally accepted. (An accepted submit changes the state of the server app, so it needs to go last, to test all 101 cases on the same app state.)
To do your test this way, you need two things:
the empty scenario template,
and a table of 100 rows of data:
10 columns of input data, with only one value manipulated per row as you pass down the table (ever heard of Gray code?),
possibly keeping the derivation history in a row description: which row this one was derived from, and how, via which manipulated value.
Also an 11th column, the "expected result" column(s): expected pass/fail status, expected error/validation message, and a reference to the requirements, for test-coverage tracking. (Ever seen FitNesse?)
And possibly also a column for the actual detected result when the test is performed, to track the history of each single row-test-case. (So, the CI server mentioned already.)
To combine the "empty scenario skeleton" on one side with the "data table driving the test" on the other, some mechanism is needed, indeed, and your data need to be imported. You could prepare the rows in Excel, which could theoretically be imported too, but for an easier life I recommend CSV, properties, XML, or any other machine- and human-readable textual format.
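A minimal sketch of such a mechanism, with a deliberately crude email validator standing in for the real form validation (every name here is illustrative): each CSV row is one test case, an input column plus an expected pass/fail column, replayed against the validator.

```java
import java.util.*;

// Sketch of a data-driven runner: the "scenario skeleton" is the validate
// call, and the CSV rows are the data driving it. Real form validation would
// have 10 input columns; this keeps one input plus the expected-result column.
public class DataDrivenRunner {
    // Crude stand-in validator: '@' present, not first, not last.
    static boolean validate(String email) {
        int at = email.indexOf('@');
        return at > 0 && at < email.length() - 1;
    }

    // Returns how many rows behaved as their expected-result column predicted.
    public static int run(List<String> csvRows) {
        int matched = 0;
        for (String row : csvRows) {
            String[] cols = row.split(",");              // input,expected
            boolean expected = Boolean.parseBoolean(cols[1]);
            if (validate(cols[0]) == expected) matched++;
        }
        return matched;
    }
}
```

The same runner shape works whether the rows come from a CSV file, a properties file, or a generated table.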
His 'logical test' has the same information content as the phrase 'test regular bowling score' in a test plan or TODO list. But it is considerably longer, and therefore worse.
Using JBehave at all only makes sense if the test team is responsible for generating tests with more information in them than that. Otherwise, it would be more efficient to take the TODO list and code it up in JUnit.
And I love words like "appropriately" in expected values. You should use Cucumber or other such wrappers as generic documentation. If you're using it to cover and specify every possible scenario, you're probably wasting a lot of your time scrolling through hundreds of feature files.