I was solving a Codeforces problem yesterday. The problem's URL is this
I'll just briefly summarize the problem below.
Given a binary string, divide it into a minimum number of subsequences
in such a way that each character of the string belongs to exactly one
subsequence and each subsequence looks like "010101 ..." or "101010
..." (i.e. the subsequence should not contain two adjacent zeros or
ones).
Now, for this problem, I submitted a solution yesterday during the contest. This is the solution. It was accepted provisionally, but on the final test cases it got a Time Limit Exceeded status.
So today I submitted another solution, and this one passed all the cases.
In the first solution I used HashSet, and in the second one I used LinkedHashSet. I want to know: why didn't HashSet clear all the cases? Does this mean I should use LinkedHashSet whenever I need a Set implementation? I saw this article and found that HashSet generally performs better than LinkedHashSet. So why doesn't my code work here?
This question would probably get more replies on Codeforces, but I'll answer it here anyway.
After a contest ends, Codeforces allows other users to "hack" solutions by writing custom inputs to run on other users' programs. If the defending user's program runs slowly on the custom input, the status of their code submission will change from "Accepted" to "Time Limit Exceeded".
The reason why your code, specifically, changed from "Accepted" to "Time Limit Exceeded" is that somebody created an "anti-hash test" (a test on which your hash function results in many collisions) on which your program ran slower than usual. If you're interested in how such tests are generated, you can find several posts on Codeforces, like this one: https://codeforces.com/blog/entry/60442.
As linked by #Photon, there's a post on Codeforces explaining why you should avoid using Java's HashSet and HashMap: https://codeforces.com/blog/entry/4876, essentially because of anti-hash tests. In some instances, adding the extra log(n) factor from a balanced BST (by using TreeSet or TreeMap) may not be so bad. In many cases an extra log(n) factor won't make your code time out, and it gives you protection from anti-hash tests.
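As a minimal illustration (this is a generic membership-set sketch, not your contest solution), swapping the hash-based set for a comparison-based one is usually a one-line change:

import java.util.Set;
import java.util.HashSet;
import java.util.TreeSet;

public class SetChoice {
    public static void main(String[] args) {
        // Vulnerable to anti-hash tests: adversarial keys can force
        // near-O(n) behaviour per operation through hash collisions.
        Set<Long> hashed = new HashSet<>();

        // Comparison-based, guaranteed O(log n) per operation regardless
        // of the input values, at the cost of the extra log factor.
        Set<Long> balanced = new TreeSet<>();

        for (long x = 1; x <= 100_000; x++) {
            hashed.add(x * 1_000_000_007L);
            balanced.add(x * 1_000_000_007L);
        }
        System.out.println(hashed.contains(1_000_000_007L) + " " + balanced.contains(1_000_000_007L));
    }
}

The TreeSet pays the extra log(n) per operation, but its running time does not depend on how the keys hash, so an adversarial test cannot blow it up.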
How do you determine whether your algorithm is fast enough to add the log(n) factor? I guess this comes with some experience, but most people suggest performing some sort of calculation. Most online judges (including Codeforces) show the time that your program is allowed to run on a particular problem (usually somewhere between one and four seconds), and you can use 10^9 constant-time operations per second as a rule of thumb when performing calculations.
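As a rough, purely illustrative calculation (the numbers are hypothetical, not taken from this specific problem): with n up to 2*10^5, an O(n log n) solution does roughly 2*10^5 * 18 ≈ 3.6*10^6 operations, which fits comfortably into a one-second limit at ~10^9 operations per second, even after adding the log(n) factor from a TreeSet. An O(n^2) approach at ~4*10^10 operations clearly does not.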
I was working on an assignment today that basically asked us to write a Java program that checks whether the HTML syntax in a text file is valid. It was a pretty simple assignment and I did it very quickly, but in doing it so quickly I made it very convoluted (lots of loops and if statements). I know I can make it a lot simpler, and I will before turning it in, but amid my procrastination I started downloading plugins and seeing what information they could give me.
I downloaded two in particular that I'm curious about: CodeMetrics and MetricsReloaded. I was wondering what exactly the numbers they generate correspond to. I saw one semi-similar post and read it along with the linked articles, but I'm still having trouble understanding a couple of things. Namely, what the first two columns (CogC and ev(G)) mean, and I'd like some more clarification on the other two (iv(G) and v(G)).
MetricsReloaded Method Metrics:
MetricsReloaded Class Metrics:
These numbers are from MetricsReloaded, but the other plugin, CodeMetrics, which also calculates cyclomatic complexity, gives slightly different numbers. I was wondering how these numbers relate, and whether someone could give a brief general explanation of all this.
CodeMetrics Analysis Results:
My final question is about time complexity. My understanding of cyclomatic complexity is that it is the number of possible paths of execution, determined by the number of conditionals and how they are nested. It doesn't seem like it would, but does this correlate in any way with time complexity? And if so, is there an easy conversion between them? If not, is there a way in either of these plugins (or any other IntelliJ plugin) to automate time complexity calculations?
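For what it's worth, here is a small, hypothetical pair of methods illustrating why the two measures are independent: the first has several branches (higher cyclomatic complexity) but runs in constant time, while the second has almost no branching but linear running time. The exact numbers a given plugin reports may differ.

public class ComplexityDemo {

    // Several branches: cyclomatic complexity around 4, but O(1) time.
    static String classify(int x) {
        if (x < 0) return "negative";
        else if (x == 0) return "zero";
        else if (x < 100) return "small";
        else return "large";
    }

    // A single straight-line loop: cyclomatic complexity around 2,
    // but O(n) time in the length of the array.
    static long sum(int[] values) {
        long total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(classify(5) + " " + sum(new int[]{1, 2, 3, 4}));
    }
}

In short, cyclomatic complexity counts decision points in the source, while time complexity describes how running time grows with the input size; neither can be mechanically converted into the other.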
I am working with two big lists of data and I need to efficiently check for matches between the two. This is the scenario:
Reading from a file line by line (this file has 1 million lines)
For each line, check within an ArrayList of strings whether it has a match (this ArrayList also has a huge number of elements)
If a match is found, replace the line from the file with a new value
Any ideas on what would be a good way to tackle this problem in terms of efficiency? Obviously, naively looping through the whole list for every one of that many records is hopelessly inefficient and processor-heavy.
Thanks for any help!
UPDATE
It's worth noting that I'm not specifically saying I need to use an ArrayList; that is just something I was using for testing. Any suggestions of more efficient collections would be welcome.
Without more details (such as the nature of the keys) it is difficult to be certain, but you may find a Bloom filter useful to minimise the number of times you have to check the ArrayList of strings for a match.
Obviously this would not help much if the lookup list changes over time.
You would use the Bloom filter as a pre-check before searching the list, because it can very quickly give you a definite "no" if the key does not exist in the list. You will still need to search your list if the Bloom filter says "maybe".
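As a rough sketch of that pre-check (it assumes Guava's BloomFilter is on the classpath and that the lookup values are plain strings; the class and field names are mine):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;
import java.util.List;

public class BloomPrefilter {
    private final BloomFilter<String> filter;
    private final List<String> lookupList;

    public BloomPrefilter(List<String> lookupList) {
        this.lookupList = lookupList;
        // ~1% false-positive rate; tune this to your memory budget.
        this.filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                lookupList.size(), 0.01);
        for (String s : lookupList) {
            filter.put(s);
        }
    }

    public boolean matches(String line) {
        // "No" answers are definitive and cheap; only "maybe" answers
        // fall through to the expensive list search.
        return filter.mightContain(line) && lookupList.contains(line);
    }
}

Usage would simply be new BloomPrefilter(bigList).matches(line) for each line from the file; the false-positive rate and expected insertion count are the two knobs that trade memory for how often you fall through to the slow list search.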
You might consider reading different parts of the file with different threads.
A similar issue is discussed here.
You could process the text in chunks (say, x bytes or one line per chunk), with each chunk handled by a different thread, i.e. one thread per chunk; a rough sketch follows below.
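A rough sketch of that idea (the file name, chunk size, and lookup set are placeholders): one thread reads the file and hands fixed-size batches of lines to a small worker pool.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ChunkedMatcher {
    public static void main(String[] args) throws IOException, InterruptedException {
        Set<String> lookup = new HashSet<>(Arrays.asList("foo", "bar"));   // placeholder lookup set
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        try (BufferedReader reader = Files.newBufferedReader(Paths.get("input.txt"))) {
            List<String> chunk = new ArrayList<>(10_000);
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == 10_000) {       // hand off a full chunk
                    submitChunk(pool, chunk, lookup);
                    chunk = new ArrayList<>(10_000);
                }
            }
            submitChunk(pool, chunk, lookup);        // last partial chunk
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    static void submitChunk(ExecutorService pool, List<String> chunk, Set<String> lookup) {
        pool.submit(() -> {
            for (String line : chunk) {
                if (lookup.contains(line)) {
                    // record or replace the matched line here
                }
            }
        });
    }
}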
You should use a HashMap (or a HashSet for pure membership checks), which gives approximately O(1) lookups. If your strings have a lot of hash collisions, then you need to use a TreeSet, which is O(log N), or a Bloom filter.
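A minimal sketch of the hash-based approach (the file names and the replacement rule are placeholders): load the lookup values into a HashSet once, then stream the file and test each line in roughly constant time.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LineMatcher {
    public static void main(String[] args) throws IOException {
        // Build the lookup structure once: O(m) to build, ~O(1) per query.
        Set<String> lookup = new HashSet<>(Files.readAllLines(Paths.get("lookup.txt")));

        try (Stream<String> lines = Files.lines(Paths.get("input.txt"))) {
            List<String> rewritten = lines
                    .map(line -> lookup.contains(line) ? "REPLACED" : line) // placeholder replacement
                    .collect(Collectors.toList());
            Files.write(Paths.get("output.txt"), rewritten);
        }
    }
}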
Closed. This question needs to be more focused. It is not currently accepting answers. Closed 7 years ago.
I am working on a project where I was provided a Java matrix-multiplication program that can run in a distributed system. It is run like so:
usage: java Coordinator matrix-dim number-nodes coordinator-port-num
For example:
java blockMatrixMultiplication.Coordinator 25 25 54545
Here's a snapshot of what the output looks like:
I want to extend this code with some kind of failsafe ability, and I am curious about how I would create checkpoints within a running matrix multiplication calculation. The general idea is to recover to where it was in a computation (but it doesn't need to be that fine-grained; just recovering to the beginning, i.e. row 0, column 0, would be enough).
My first idea is to use log files (like Apache log4j ), where I would be logging the relevant matrix status. Then, if we forcibly shut down the app in the middle of a calculation, we could recover to a reasonable checkpoint.
Should I use MySQL for such a task (or maybe a more lightweight database)? Or would a basic log file (plus some useful Apache libraries) be good enough? Any tips appreciated, thanks.
source-code :
MatrixMultiple
Coordinator
Connection
DataIO
Worker
If I understand the problem correctly, all you need to do is recover your place in a single matrix calculation in the event of a crash or if the application is quit half way through.
Minimum Viable Solution
The simplest approach would be to recover just the two matrixes you were actively multiplying, but none of your progress, and multiply them from the beginning next time you load the application.
The Process:
At the beginning of public static int[][] multiplyMatrix(int[][] a, int[][] b) in your MatrixMultiple class, create a file, let's call it recovery_data.txt, with the state of the two arrays being multiplied (parameters a and b). Alternatively, you could use a simple database for this.
At the end of public static int[][] multiplyMatrix(int[][] a, int[][] b) in your MatrixMultiple class, right before you return, clear the contents of the file, or wipe your database.
When the program is initially run, most likely near the beginning of main(String[] args), you should check whether the contents of the text file are non-empty, in which case you should multiply the matrices stored in the file and display the output; otherwise proceed as usual. (A sketch of this flow follows below.)
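Here is a minimal sketch of that flow (the class and method names are mine, not from the posted source; only recovery_data.txt comes from the steps above). multiplyMatrix would call saveInputs at its start and clearRecovery right before returning, and main would call hasPendingWork before doing anything else.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Recovery {
    private static final Path RECOVERY_FILE = Paths.get("recovery_data.txt");

    // Call at the start of multiplyMatrix: persist both inputs.
    static void saveInputs(int[][] a, int[][] b) throws IOException {
        StringBuilder sb = new StringBuilder();
        appendMatrix(sb, a);
        sb.append(System.lineSeparator());      // blank line between the two matrices
        appendMatrix(sb, b);
        Files.write(RECOVERY_FILE, sb.toString().getBytes(StandardCharsets.UTF_8));
    }

    // Call right before multiplyMatrix returns: wipe the checkpoint.
    static void clearRecovery() throws IOException {
        Files.deleteIfExists(RECOVERY_FILE);
    }

    // Call near the start of main(): is there an interrupted calculation?
    static boolean hasPendingWork() throws IOException {
        return Files.exists(RECOVERY_FILE) && Files.size(RECOVERY_FILE) > 0;
    }

    // One row per line, values separated by spaces.
    private static void appendMatrix(StringBuilder sb, int[][] m) {
        for (int[] row : m) {
            StringBuilder line = new StringBuilder();
            for (int v : row) {
                if (line.length() > 0) line.append(' ');
                line.append(v);
            }
            sb.append(line).append(System.lineSeparator());
        }
    }
}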
Notes on implementation:
Using a simple text file or a full-fledged relational database is a decision you are going to have to make, mostly based on real-world data that only you know, but in my mind a text file wins out in most situations, and here are my reasons why. You are going to want to read the data sequentially to rebuild your matrix, so being relational is not that useful. Databases are harder to work with (not too hard, but compared to a text file there is no question), and since you would not be making much use of querying, that isn't balanced out by the ways they usually make a programmer's life easier.
Consider how you are going to store your arrays. In a text file you have several options; my recommendation would be to store each row on a line of text, separated by spaces, commas, or some other character, and then put an extra blank line before the second matrix. I think a similar approach is used in crAlexander's answer here, but I have not tested his code. Alternatively, you could use something more complicated like JSON, but I think that would be too heavy-handed to justify. If you are using a database, then the relational structure should suggest several logical arrangements for your data as well.
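To make the row-per-line format concrete, here is a hedged sketch of reading the two matrices back; it assumes exactly the format described above (space-separated rows, one blank line between the matrices), and the class name is mine.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class RecoveryReader {
    // Returns the two saved matrices, in the order they were written.
    static List<int[][]> loadMatrices() throws IOException {
        List<int[][]> matrices = new ArrayList<>();
        List<int[]> current = new ArrayList<>();

        for (String line : Files.readAllLines(Paths.get("recovery_data.txt"))) {
            if (line.trim().isEmpty()) {          // blank line separates the matrices
                if (!current.isEmpty()) {
                    matrices.add(current.toArray(new int[0][]));
                    current = new ArrayList<>();
                }
                continue;
            }
            String[] parts = line.trim().split("\\s+");
            int[] row = new int[parts.length];
            for (int i = 0; i < parts.length; i++) {
                row[i] = Integer.parseInt(parts[i]);
            }
            current.add(row);
        }
        if (!current.isEmpty()) {
            matrices.add(current.toArray(new int[0][]));
        }
        return matrices;
    }
}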
Strategic Checkpoints
You expressed interest in saving some calculations by taking advantage of the possibility that some of them will already have been handled the last time the program ran. Let's first look at the pros and cons of adding checkpoints after every row has been processed, as best I can see them.
Pros:
Save computation time the next time the program is run, if it had previously been shut down mid-calculation.
Cons:
Making the extra writes will either use more nodes if distributed (more on that later) or increase the general latency of the calculations, because you now have to throw in a database write operation for every checkpoint.
More complicated to implement (but probably not by too much)
If my comments on the implementation of the Minimum Viable Solution convinced you that you could get away with a text file and would not have to add an RDBMS, I take back the parts about not leveraging queries and everything being accessed sequentially; with checkpointing, a database is now perhaps the smarter choice.
I'm not saying that checkpoints are definitely not the better solution, just that I don't know if they are worth it, but here is what I would consider:
Do you expect people to be quitting half way through a calculation frequently relative to the total amount of calculations they will be running? If you think this feature will be used a lot, then the pro of adding checkpoints becomes much more significant relative to the con of it slowing down calculations as a whole.
Does it take a long time to complete a typical calculation that people are providing the program? If so, the added latency I mentioned in the cons is (percentage wise) smaller, and so perhaps more tolerable, but users are already less happy with performance, and so that cancels out some of the effect there. It also makes the argument for checkpointing more significant because it has the potential to save more time.
And so I would only recommend checkpointing like this if you expect a relatively large amount of instances where this is happening, and if it takes a relatively large amount of time to complete a calculation.
If you decide to go with checkpoints, then modify the approach to:
After every row of the result matrix has been processed, write the content of that row to your database, or, if you are using the text file, append it at the end of the file after another empty line to separate it from the last matrix.
On startup, if you need to finish a calculation that has already begun, solve and distribute only the rows that have yet to be considered, and retrieve the content of the other rows from your database.
A quick point on implementing frequent checkpoints: you could greatly reduce the extra latency from frequent checkpoints by pushing this task out to an additional thread. Doing this would use more resources, and there is always some latency in actually spawning the thread, but you do not have to wait for the entire write operation to complete before proceeding. A rough sketch of this idea is shown below.
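As a hedged sketch of that idea (the class name and the row format are mine), a single-threaded executor keeps the checkpoint writes ordered while the calculation keeps going.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncCheckpointer {
    // One background thread keeps row writes ordered without blocking the workers.
    private final ExecutorService writer = Executors.newSingleThreadExecutor();

    public void checkpointRow(int rowIndex, int[] row) {
        int[] copy = Arrays.copyOf(row, row.length);   // snapshot before handing off
        writer.submit(() -> {
            try {
                String line = rowIndex + ":" + Arrays.toString(copy) + System.lineSeparator();
                Files.write(Paths.get("recovery_data.txt"), line.getBytes(),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            } catch (IOException e) {
                e.printStackTrace();   // a real implementation would surface this properly
            }
        });
    }

    public void shutdown() {
        writer.shutdown();
    }
}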
A quick warning on the implementation of any such failsafe method
If there is an unchecked edge case that means some sort of invalid matrix would crash the program, this failsafe now bricks the program entirely by retrying the same input on every start. To combat this, I see some obvious solutions, but perhaps a bit of thought will let you adapt my approaches into something you prefer:
Use plenty of try/catch statements; if you get any sort of error that seems to be caused by malformed data, wipe your recovery file, or modify it to add a note that tells your program to treat it as a special case. A good treatment of this special case may be to display the two matrixes at startup with an explanation that your program failed to multiply them, likely due to malformed content.
Add data to your file/database recording how many times the program has quit while solving the current problem; if this is not the first resume, treat it like the special case in the above option.
I hope that this provided enough information for you to implement your failsafe in the way that makes the most sense given what you suspect the realistic use to be, and note that there are perhaps other ways you could approach this problem as well, and these could equally have their own lists of pros and cons to take into consideration.
I'm working on an empirical analysis of merge sort (sorting strings) for school, and I've run into a strange phenomenon that I can't explain or find an explanation of. When I run my code, I capture the running time using the built-in System.nanoTime() method, and for some reason, at a certain input size, it actually takes less time to execute the sort routine than with a smaller input size.
My algorithm is just a basic merge sort, and my test code is simple too:
//Get current system time
long start = System.nanoTime();
//Perform mergesort procedure
a = q.sort(a);
//Calculate total elapsed sort time
long time = System.nanoTime()-start;
The output I got for elapsed time when sorting 900 strings was: 3928492ns
For 1300 strings it was: 3541923ns
Both of those are the average of about 20 trials, so the results are pretty consistent. After 1300 strings, the execution time continues to grow as expected. I'm thinking there might be some peak input size where this phenomenon is most noticeable.
So my Question: What might be causing this sudden increase in speed of the program? I was thinking there might be some sort of optimization going on with arrays holding larger amounts of data, although 1300 items in an array is hardly large.
Some info:
Compiler: Java version 1.7.0_07
Algorithm: Basic recursive merge sort (using arrays)
Input type: Strings 6-10 characters long, shuffled (random order)
Am I missing anything?
Am I missing anything?
You're trying to write a microbenchmark, but the code you've posted so far does not resemble a well-constructed one. To fix that, please follow the rules stated here: How do I write a correct micro-benchmark in Java?
The explanation for your code being faster on the larger input is that after some iterations of your method, the JIT compiler kicks in and optimizes your code, so it gets faster even when processing larger data. A rough warm-up sketch is shown below.
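As a minimal illustration of the warm-up effect (Arrays.sort stands in for your own a = q.sort(a) call, and the string generator is a placeholder; a real benchmark should use a harness like JMH, per the link above):

import java.util.Arrays;
import java.util.Random;

public class WarmupDemo {
    public static void main(String[] args) {
        String[] data = randomStrings(1300);

        // Warm-up: run the sort enough times for the JIT to compile the hot path.
        for (int i = 0; i < 10_000; i++) {
            Arrays.sort(data.clone());
        }

        // Measure only after warm-up, and average over many runs.
        long start = System.nanoTime();
        int runs = 100;
        for (int i = 0; i < runs; i++) {
            Arrays.sort(data.clone());
        }
        System.out.println("avg ns per sort: " + (System.nanoTime() - start) / runs);
    }

    static String[] randomStrings(int n) {
        Random rnd = new Random(42);
        String[] a = new String[n];
        for (int i = 0; i < n; i++) {
            a[i] = Long.toString(rnd.nextLong());
        }
        return a;
    }
}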
Some recommendations:
Use several array/list inputs of different sizes. Good values for this kind of analysis are 100, 1000 (1k), 10000 (10k), 100000 (100k), 1000000 (1m), and random sizes in between. You will get more accurate results from evaluations that take longer.
Use arrays/lists of different objects. Create a POJO that implements the Comparable interface, then execute your sort method on it. As explained above, use arrays of different sizes.
Not directly related to your question, but the execution results depend on the JDK used. Eclipse is just an IDE and can work with different JDK versions; e.g., at my workplace I use JDK 6u30 for company projects, but for personal projects (like proofs of concept) I use JDK 7u40.
Closed. This question is off-topic. It is not currently accepting answers. Closed 10 years ago.
I was asked this in WRITTEN form through a recruiter, with the previous question being string-related and the one before that being about Inversion of Control:
How would you find the second largest element in an array?
Being a project manager who is self-teaching/learning Java, my response was:
How is "large" defined (integers? parsing of strings/objects?)? How large is the array? Bubble sort, then return the second-to-last index. Or temporarily store the largest and second-largest variables, scan through the array, replacing them as appropriate, then return the second-largest variable. There are many ways; however, developing an appropriate function that isn't expensive would require more information. If the range is very wide, write several lean methods, measure the array length, and apply the appropriate method to the array.
Is this a valid response to the question, and if not, what do you think needs improvement? I'm asking because of the extreme gap in the common recruitment atmosphere; I'm having difficulty understanding the actual intended purpose of similarly structured questions, and I need your feedback to understand how to approach and reply to them.
UPDATE: I received notification that they typically expect to see code, but they provided no parameters or guidelines, so I replied with this:
I cannot provide a codified answer without knowing at least the first two pieces of information. The first will allow for parsing, as numbers are sorted universally, while strings, objects, and others are parsed subjectively. The second part, dealing with the array length and whether or not it is static, is definitely relevant as a developer, as expensive code (whether in computation time facing the user, or hardware costs to the client) can be costly. The question is worded poorly for an exact technical response, especially given that it is in written format, where there is no feedback. I am merely typing my considerations as a reply, the same as if it were asked in person.
The context I am given from the previous questions is that they are looking for someone who understands IoC practices (bullet 3) and would be parsing Strings (given bullet 2), potentially in a transaction (try/catch) situation (bullet 4), and then find (current issue). If the questionnaire's purpose is to see how I approach a problem, then my response is valid. If they would like further clarification, I would be happy to accommodate, but if they require a codified answer AND are unwilling to provide the necessary context, then I am unwilling to work with them, as it would highlight their misunderstanding of how to interact with external customers, who DO need guidance when one "simple" question is asked but other pieces of information are needed to best accommodate it, which is the central reason behind any business seeking outside specialized help.
I hope this response is not read in a harsh tone, as my cadence when replying out loud is quite the opposite, and further information is required, including whether development solely in Java is within the scope of this job, as I am skilled in JS and Python as well, and this was not discussed yesterday. Hopefully you and your client can understand that with the recruiter method, a layer of abstraction can be beneficial in many areas, but in situations like this it can hinder and cloud perceptions without direct communication and feedback. Please feel free to provide this in addition to my answer.
and received acclaim for this reply. Thank you for your help; I am really trying to get a programming job but have no prior formal experience, and this guidance really helps.
In this case, I think the interviewer really wants to know whether, in order to find the 2nd largest element, one has to sort the entire array (and pick the second element), or whether there is a better approach.
The answer is that you don't have to sort the entire array in order to find the top k elements. Sorting takes O(n log n) time, whereas finding the top k items takes only O(n log k).
You can explain the answer using a simple example. Suppose you have to find the 2nd largest card among 100 unsorted cards with numbers ranging from some low value to some high value. All you have to do is hold the top 2 cards you have seen so far in your hand. As you pick up each new card, see if it is larger than the smaller of the two in your hand; if so, replace that smaller card with the new one. At the end of this process, you will be holding the top 2 cards. A sketch of this in code is below.
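A hedged sketch of that single pass in Java (it assumes an int array with at least two elements and does not worry about how duplicates should be treated):

public class SecondLargest {
    // Returns the second largest value in a single O(n) pass.
    // Assumes values has at least two elements.
    static int secondLargest(int[] values) {
        int largest = Integer.MIN_VALUE;
        int second = Integer.MIN_VALUE;
        for (int v : values) {
            if (v > largest) {        // new champion: old champion drops to second
                second = largest;
                largest = v;
            } else if (v > second) {  // beats only the runner-up
                second = v;
            }
        }
        return second;
    }

    public static void main(String[] args) {
        System.out.println(secondLargest(new int[]{3, 41, 7, 29, 2, 18})); // prints 29
    }
}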
EDIT: Like others said, bubble sort has a worst-case runtime of O(n^2). For fun, check out President Obama's answer to a sorting interview question: http://www.youtube.com/watch?v=k4RRi_ntQc8
The most efficient way to do the task is straightforward: iterate once over the array looking for the "largest" element, while also storing the previous "largest" element. At the end of the array, that previous "largest" element is the one your method needs to return.
To formulate your answer, I see 2 choices:
Begin your answer with "Assuming the array contains int elements..." or something similar.
Use an undefined isLargest() or isLarger() method and explain that its purpose is to check whether the current element is the largest examined so far, whatever that means for the element type.
This is pretty subjective, but frankly I agree with your question-about-the-question -- they should have defined what it is your array contains.
What you could have done is written something like "Assuming the elements in the array are all integers, here's how I'd do it" and then give your answer with that assumption.
I'd follow the same steps with your other requests for additional info -- make an assumption, declare your assumption, then proceed with that assumption in mind.
Seems valid - I just fear they wanted CODE. I think maybe you should have just assumed integers and written something IN ADDITION to what you said.
My initial gut reaction would be a modified merge sort implemented using a ForkJoinPool. It's still O(n log n), but the actual runtime would be faster than a serial implementation.
You may ask why not quicksort: the worst case for quicksort is O(n^2), which would be bad for an extremely large dataset. A modified merge sort has more predictable performance and a better worst-case bound of O(n log n). A rough sketch is below.
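Here is a hedged sketch of that idea (not a tested production implementation): a RecursiveAction-based merge sort running on the common ForkJoinPool. Note that Java 8's Arrays.parallelSort already provides a similar fork/join sort out of the box.

import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class ParallelMergeSort {

    static <T extends Comparable<T>> void sort(T[] a) {
        ForkJoinPool.commonPool().invoke(new SortTask<>(a, 0, a.length));
    }

    private static class SortTask<T extends Comparable<T>> extends RecursiveAction {
        private static final int THRESHOLD = 1 << 13;   // fall back to serial sort on small ranges
        private final T[] a;
        private final int lo, hi;                        // sorts a[lo, hi)

        SortTask(T[] a, int lo, int hi) {
            this.a = a; this.lo = lo; this.hi = hi;
        }

        @Override
        protected void compute() {
            if (hi - lo <= THRESHOLD) {
                Arrays.sort(a, lo, hi);                  // serial base case
                return;
            }
            int mid = (lo + hi) >>> 1;
            invokeAll(new SortTask<>(a, lo, mid), new SortTask<>(a, mid, hi));
            merge(mid);
        }

        // Merge a[lo, mid) and a[mid, hi), using a copy of the left half as scratch space.
        private void merge(int mid) {
            T[] left = Arrays.copyOfRange(a, lo, mid);
            int i = 0, j = mid, k = lo;
            while (i < left.length && j < hi) {
                a[k++] = left[i].compareTo(a[j]) <= 0 ? left[i++] : a[j++];
            }
            while (i < left.length) {
                a[k++] = left[i++];
            }
        }
    }

    public static void main(String[] args) {
        Integer[] data = {5, 3, 9, 1, 7, 2, 8, 4, 6};
        sort(data);
        System.out.println(Arrays.toString(data));
    }
}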