Time Complexity of the Java BitSet.flip(fromIndex, toIndex)

I am interested in knowing the time complexity of BitSet.flip(fromIndex, toIndex).
I looked through various documentation. The closest thing I found is this Oracle documentation, which doesn't say anything about the time complexity of the internal operation.
If you are tempted to answer that it is O(n), where n is the distance between fromIndex and toIndex, please have a look at this solution first. That solution is evidence that BitSet.flip() has some internal optimisation; it may be setting the bits in bigger chunks or blocks. If that is the case, I want to know what the block size is.
There is literally no information about the time complexity or the internal implementation of BitSet.flip().
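For reference, the OpenJDK implementation stores the bits in a long[] and flip(fromIndex, toIndex) XORs whole 64-bit words, masking only the two partially covered boundary words. In other words, the "block size" is effectively one 64-bit word, and the work is roughly (toIndex - fromIndex) / 64 word operations plus any array growth. Below is a simplified, standalone sketch of that word-wise idea; it is not the actual java.util.BitSet source and omits bounds checks, resizing, and the wordsInUse bookkeeping.

// Simplified sketch of a word-wise flip over a long[] bit store, illustrating why
// the effective "block size" is one 64-bit word.
public class WordWiseFlip {
    private static final long WORD_MASK = 0xffffffffffffffffL;

    static void flip(long[] words, int fromIndex, int toIndex) {
        if (fromIndex >= toIndex) return;
        int startWord = fromIndex >> 6;                 // fromIndex / 64
        int endWord = (toIndex - 1) >> 6;
        long firstWordMask = WORD_MASK << fromIndex;
        long lastWordMask = WORD_MASK >>> -toIndex;     // Java uses only the low 6 bits of the shift count
        if (startWord == endWord) {
            words[startWord] ^= (firstWordMask & lastWordMask);
        } else {
            words[startWord] ^= firstWordMask;
            for (int i = startWord + 1; i < endWord; i++) {
                words[i] ^= WORD_MASK;                  // one XOR flips 64 bits at once
            }
            words[endWord] ^= lastWordMask;
        }
    }

    public static void main(String[] args) {
        long[] words = new long[4];                     // room for 256 bits, all zero
        flip(words, 3, 200);                            // flip bits 3..199
        int set = 0;
        for (long w : words) set += Long.bitCount(w);
        System.out.println(set);                        // 197
    }
}

So the cost still scales linearly with the range length, but with a 1/64 constant factor, which is presumably why the linked solution performs far better than a naive per-bit estimate would suggest.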

Related

Ambiguity in a CodeForces Problem - usage of HashSet Vs LinkedHashSet

I was solving a Codeforces problem yesterday. The problem's URL is this
I will just explain the question in short below.
Given a binary string, divide it into a minimum number of subsequences in such a way that each character of the string belongs to exactly one subsequence and each subsequence looks like "010101..." or "101010..." (i.e. the subsequence should not contain two adjacent zeros or ones).
Now, for this problem, I submitted a solution during the contest. This is the solution. It was accepted provisionally, but on the final test cases it got a Time Limit Exceeded status.
So today I submitted another solution, and this one passed all the test cases.
In the first solution I used HashSet, and in the second one I used LinkedHashSet. I want to know why HashSet didn't clear all the cases. Does this mean I should use LinkedHashSet whenever I need a Set implementation? I saw this article and found that HashSet performs better than LinkedHashSet. So why doesn't my code work here?
This question would probably get more replies on Codeforces, but I'll answer it here anyways.
After a contest ends, Codeforces allows other users to "hack" solutions by writing custom inputs to run on other users' programs. If the defending user's program runs slowly on the custom input, the status of their code submission will change from "Accepted" to "Time Limit Exceeded".
The reason why your code, specifically, changed from "Accepted" to "Time Limit Exceeded" is that somebody created an "anti-hash test" (a test on which your hash function results in many collisions) on which your program ran slower than usual. If you're interested in how such tests are generated, you can find several posts on Codeforces, like this one: https://codeforces.com/blog/entry/60442.
As linked by #Photon, there's a post on Codeforces explaining why you should avoid Java's HashSet and HashMap: https://codeforces.com/blog/entry/4876, and the reason is essentially these anti-hash tests. In some instances, paying the extra log(n) factor of a balanced BST (by using TreeSet or TreeMap) is not so bad: in many cases that extra log(n) factor won't make your code time out, and it gives you protection from anti-hash tests.
How do you determine whether your algorithm is fast enough to add the log(n) factor? I guess this comes with some experience, but most people suggest performing some sort of calculation. Most online judges (including Codeforces) show the time that your program is allowed to run on a particular problem (usually somewhere between one and four seconds), and you can use 10^9 constant-time operations per second as a rule of thumb when performing calculations.
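Another common mitigation, offered here only as an illustrative sketch rather than anything from the answer above, is to salt the hash with a per-run random value so that a precomputed anti-hash input no longer lines up with your buckets. The wrapper class and the mixing function below are hypothetical:

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical wrapper: salts the hash with a per-run random value so a
// precomputed anti-hash test no longer maps all keys into the same bucket.
final class SaltedInt {
    private static final int SALT = ThreadLocalRandom.current().nextInt();
    final int value;

    SaltedInt(int value) { this.value = value; }

    @Override public boolean equals(Object o) {
        return o instanceof SaltedInt && ((SaltedInt) o).value == value;
    }

    @Override public int hashCode() {
        // Any reasonable scrambling of (value ^ SALT) works here.
        int h = value ^ SALT;
        h ^= (h >>> 16);
        return h * 0x45d9f3b;
    }
}

public class AntiHashDemo {
    public static void main(String[] args) {
        Set<SaltedInt> set = new HashSet<>();
        for (int i = 0; i < 1_000_000; i++) {
            set.add(new SaltedInt(i * 64));   // values an adversary might pick to collide unsalted
        }
        System.out.println(set.size());       // 1000000
    }
}

TreeSet/TreeMap, as mentioned above, is the simpler fix when the extra log(n) factor fits in the time limit; the salted wrapper keeps the expected O(1) behaviour at the cost of a little boilerplate.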

Indexing Performance in Apache Lucene

I am trying to index 5 billion rows, or possibly even more, using Lucene.
Does the indexing time increase exponentially as the record set grows?
My initial indexing of 10 million records happened very quickly, but when I tried to index more than 100 million records, it took more time than I expected relative to the 10 million record indexing time.
Is it because each document is being indexed against a larger existing index, and hence the time increases exponentially? What else could be the reason behind this behavior, and is there any way to optimize it? (Please note: currently all fields in all the documents are of type StringField; will changing them to IntField help in this direction?)
My second question is: how will search performance be after indexing 5 billion records? Any ideas on that?
Let me know if you need more information from my end on this.
Our current use case seems somewhat similar to yours: 1.6 billion rows, most fields are exact matches, periodic addition of files/rows, regular searching. Our initial indexing is not distributed or parallelized in any way, currently, and takes around 9 hours. I only offer that number to give you a very vague sense of what your indexing experience may be.
To try and answer your questions:
Our indexing time does not grow exponentially with the number of rows already indexed, though it does slow down very gradually. For us, perhaps 20% slower by the end, though it could also be specific to our data.
If you are experiencing significant slow-down, I support femtoRgon's suggestion that you profile to see what's eating the time. Lucene has never been the slowest/weakest component in our system.
Yes, you can write to your index in parallel, and you may see improved throughput. Whether it helps or not depends on where your bottlenecks are, of course. Consider using Solr - it may ease your efforts here.
We use a mixture of StringField, LongField, and TextField. It seems unlikely that the type of field is causing your slowdown on its own.
These answers are all anecdotal, but perhaps they'll be of some use to you.
This page is now quite dated, but if you exhaust all your other options, it may provide hints about which levers you can pull to tweak performance: How to make indexing faster
Have you profiled to see what is actually causing your performance issues? You may find that something unexpected is eating up all that time. When I profiled a similar performance issue that I thought was caused by Lucene, it turned out the problem was mostly string concatenation.
As to whether you should use StringField or IntField (or TextField, or whatever), you should decide based on what is in the field and how you are going to search it. If you might want to search the field as a range of numeric values, it should be an IntField, not a StringField. By the way, StringField indexes the entire value as a single term and skips analysis, so it is also the wrong field for full text, for which you should use a TextField. Basically, using StringField for everything looks very much like a bad code smell to me: it could cause performance issues at index time, but I would expect the much larger problems to appear when you start trying to search.
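To make the field-type advice concrete, here is a minimal indexing sketch. It assumes a recent Lucene release (6.x or later), where numeric fields intended for range queries are IntPoint/LongPoint; in the older 4.x/5.x line the question refers to, IntField/LongField play that role. The index path and field names are made up for illustration:

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class FieldChoiceSketch {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/idx"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {

            Document doc = new Document();
            // Exact-match identifier: indexed as a single, unanalyzed term.
            doc.add(new StringField("recordId", "REC-0001", Field.Store.YES));
            // Numeric value you may want to range-query: use a point field, not a string.
            doc.add(new LongPoint("timestamp", 1_650_000_000L));
            doc.add(new StoredField("timestamp", 1_650_000_000L)); // keep the value retrievable
            // Free text: analyzed into terms for full-text search.
            doc.add(new TextField("body", "some free text describing the record", Field.Store.NO));

            writer.addDocument(doc);
        }
    }
}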
As far as "how will the search performance be with 5 billion values", that's far too vague a question to even attempt to answer. No idea. Try it and see.

A hashtable with a 1.0 maxLoad factors, time complexity

Just a quick question to confirm my thoughts,
The complexity of a hash table using a load factor of 1.0 would be quadratic time, i.e. O(n^2).
This would be because of having to continuously resize and re-insert over and over. Please correct me if I'm wrong.
Thanks
The worst case scenarios for searching, inserting, and deleting for a hash table will almost always be O(n).
It seems you mean open addressing. With separate chaining, the expected complexity of all operations will still be O(1).
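As a rough illustration of why resizing alone does not push insertion to O(n^2), here is an informal timing sketch using java.util.HashMap (which uses separate chaining). The numbers are machine-dependent; the point is only that total insertion time stays roughly linear in n even with a load factor of 1.0, because doubling the capacity on resize amortizes the rehashing cost:

import java.util.HashMap;
import java.util.Map;

public class LoadFactorSketch {
    static long fill(float loadFactor, int n) {
        long start = System.nanoTime();
        Map<Integer, Integer> map = new HashMap<>(16, loadFactor);
        for (int i = 0; i < n; i++) {
            map.put(i, i);   // a resize happens whenever size exceeds capacity * loadFactor
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 5_000_000;
        fill(0.75f, n);   // warm-up so JIT compilation doesn't dominate the comparison
        System.out.printf("loadFactor=0.75: %d ms%n", fill(0.75f, n) / 1_000_000);
        System.out.printf("loadFactor=1.00: %d ms%n", fill(1.00f, n) / 1_000_000);
        // Both totals grow roughly linearly in n: each resize doubles the capacity,
        // so the rehashing cost amortizes to O(1) per insertion.
    }
}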

java frameworks complexity statistics

It is extremely difficult to illustrate the complexity of frameworks (hibernate, spring, apache-commons, ...)
The only thing I could think of was to compare the file sizes of the jar libraries or even better, the number of classes contained in the jar files.
Of course this is not a mathematically sound proof of complexity. But at least it should make clear that some frameworks are lightweight compared to others.
Of course it would take quite some time to calculate these statistics. In an attempt to save time, I was wondering if perhaps somebody has done so already?
EDIT:
Yes, there are a lot of tools to calculate the complexity of individual methods and classes. But this question is about third party jar files.
Also, please note that much of my original question stresses that everybody is well aware that complexity is hard to measure, and that file size and number of classes may indeed not be sufficient. So it is not necessary to elaborate on this any further.
There are tools out there that can measure the complexity of code. However, this is more of a psychological question, as you cannot mathematically define the term 'complex code'. Obviously, giving the same piece of code to two random people will get you very different answers.
In general, the issue with complexity arises from the fact that a human brain cannot process more than a certain number of lines of code simultaneously (strictly speaking, functional pieces, but ordinary lines of code should approximate that). The exact number of lines one can hold and understand in memory at the same time of course varies based on many factors (including time of day, day of the week and the status of your coffee machine) and therefore completely depends on the audience. However, the fewer lines of code you have to keep in your 'internal memory register' for one task, the better, so this should be the general factor when trying to determine the complexity of an API.
There is, however, a pitfall in this way of calculating complexity: many APIs offer you a fast way of solving a problem (an easy entry level), but that solution later turns out to force several very complex coding decisions that, overall, make your code very difficult to understand. In contrast, other APIs require a very complex setup that is hard to understand at first, but the rest of your code will be extremely easy because of that initial setup.
Therefore, a good way of measuring API complexity is to define a representative and sufficiently large task to solve with that API, and then measure the average number of simultaneous lines of code one has to keep in mind to implement that task. And once you're done, please publish the result in a scientific paper of your choice. ;)
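Coming back to the mechanical part of the original question, counting the classes in each jar is easy to script. Here is a small sketch using java.util.jar; the jar paths are passed as (hypothetical) command-line arguments:

import java.io.File;
import java.io.IOException;
import java.util.Enumeration;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class JarClassCount {
    public static void main(String[] args) throws IOException {
        for (String path : args) {                       // e.g. hibernate-core.jar spring-core.jar
            File file = new File(path);
            int classes = 0;
            try (JarFile jar = new JarFile(file)) {
                Enumeration<JarEntry> entries = jar.entries();
                while (entries.hasMoreElements()) {
                    JarEntry entry = entries.nextElement();
                    if (entry.getName().endsWith(".class")) {
                        classes++;                       // counts compiled classes, including inner classes
                    }
                }
            }
            System.out.printf("%s: %d classes, %d bytes%n", file.getName(), classes, file.length());
        }
    }
}

Whether the resulting numbers say anything about conceptual complexity is, as discussed above, another matter entirely.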

Subgraph matching (JUNG)

I have a set of subgraphs and I need to match them against the graph they were extracted from. I also need to count how many times each subgraph shows up in that graph (I need to store all possible matches). There must be a perfect match with respect to the edge labels of both subgraph and graph; the vertex labels, however, don't need to match each other. I built my system using the JUNG API, so I would like a solution (API, algorithm, etc.) that can deal with the Graph structure provided by JUNG. Any thoughts?
JUNG is very full-featured, so if there isn't a graph analysis algorithm in JUNG for what you need, there's usually a strong, theoretical reason for it. To me, your problem sounds like an instance of the Subgraph Isomorphism problem, which is NP-Complete:
http://en.wikipedia.org/wiki/Subgraph_isomorphism_problem
NP-Completeness may or may not be familiar to you (it took me 7 years of college and Master's Degree in Computer Science to understand it!), so I'll give a high-level description here. Certain problems, like sorting, can be solved in Polynomial time with respect to their input size. For example, if I have a list of N elements, I can sort it in O(N log(N)) time. More specifically, if I can solve a problem in Polynomial time, this means I can solve the problem without exhausting every possible solution. In the sorting case, I could traverse every possible permutation of the list and, if I found a permutation of the list that was sorted, return it. This is obviously not the fastest way to solve the problem though! Some very clever mathematicians were able to get it down to its theoretical minimum of O(N log(N)), thus we can sort really big lists of things quite quickly using computers today.
On the flip-side, NP-Complete problems are thought to have no Polynomial time solution (I say thought because no one has ever proven it, although evidence strongly suggests this is the case). Anyway, what this means is that you cannot definitively solve an NP-Complete problem without exhausting every possible solution. The best known algorithms for NP-Complete problems are always O(c^N) or worse, where c is some constant greater than 1. This means that the time required to solve the problem grows exponentially with every incremental increase in problem size.
So what does this have to do with my problem???
What I'm getting at here is that, if the Subgraph Isomorphism problem is NP-Complete, then the only way you can determine if one graph is a subgraph of another graph is by exhausting every possible solution. So you can solve this, but probably only up to graphs of a few nodes or so (since the problem's time complexity grows exponentially with every incremental increase in graph size). This means that it is computationally infeasible to compute a solution for your problem because as soon as you reach a certain graph size, it will quite literally take forever to find a solution.
More practically, if your boss asks you to do something that is provably NP-Complete, you can simply say it's impossible and he will have to listen to you. If your professor asks you to do something that is provably NP-Complete, show him that it's NP-Complete and you'll probably get an A for the course. If YOU are trying to do something NP-Complete of your own accord, it's better to just move on to the next project... ;)
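To make the "exhaust every possible mapping" idea concrete, here is a minimal backtracking sketch. It deliberately avoids the JUNG API (graphs are plain adjacency maps that you could populate from a JUNG Graph), enforces exact edge-label matches, ignores vertex labels, and counts every complete mapping. It is a naive illustration of the exponential search, not a tuned VF2 implementation:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical brute-force matcher: a graph is a map vertex -> (neighbor -> edge label).
public class NaiveSubgraphMatcher {

    static int countMatches(Map<Integer, Map<Integer, String>> pattern,
                            Map<Integer, Map<Integer, String>> host) {
        List<Integer> order = new ArrayList<>(pattern.keySet());
        return extend(order, 0, new HashMap<>(), pattern, host);
    }

    // Try every injective assignment of the next pattern vertex; prune as soon as
    // a required edge (with its label) is missing in the host graph.
    static int extend(List<Integer> order, int idx, Map<Integer, Integer> mapping,
                      Map<Integer, Map<Integer, String>> pattern,
                      Map<Integer, Map<Integer, String>> host) {
        if (idx == order.size()) {
            return 1;                              // complete, consistent mapping found
        }
        int p = order.get(idx);
        int total = 0;
        for (int candidate : host.keySet()) {
            if (mapping.containsValue(candidate)) {
                continue;                          // keep the mapping injective
            }
            if (consistent(p, candidate, mapping, pattern, host)) {
                mapping.put(p, candidate);
                total += extend(order, idx + 1, mapping, pattern, host);
                mapping.remove(p);
            }
        }
        return total;
    }

    static boolean consistent(int p, int candidate, Map<Integer, Integer> mapping,
                              Map<Integer, Map<Integer, String>> pattern,
                              Map<Integer, Map<Integer, String>> host) {
        for (Map.Entry<Integer, String> e : pattern.get(p).entrySet()) {
            Integer mappedNeighbor = mapping.get(e.getKey());
            if (mappedNeighbor == null) {
                continue;                          // neighbor not assigned yet
            }
            String hostLabel = host.get(candidate).get(mappedNeighbor);
            if (!e.getValue().equals(hostLabel)) {
                return false;                      // edge missing or label mismatch
            }
        }
        return true;
    }

    static void addEdge(Map<Integer, Map<Integer, String>> g, int u, int v, String label) {
        g.computeIfAbsent(u, k -> new HashMap<>()).put(v, label);
        g.computeIfAbsent(v, k -> new HashMap<>()).put(u, label);
    }

    public static void main(String[] args) {
        // Host: triangle 1-2-3 with edge labels a, b, c; pattern: a single edge labeled "a".
        Map<Integer, Map<Integer, String>> host = new HashMap<>();
        addEdge(host, 1, 2, "a"); addEdge(host, 2, 3, "b"); addEdge(host, 1, 3, "c");
        Map<Integer, Map<Integer, String>> pattern = new HashMap<>();
        addEdge(pattern, 10, 20, "a");
        System.out.println(countMatches(pattern, host));   // 2 (the "a" edge, in both orientations)
    }
}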
Well, I had to solve the problem by implementing it from scratch. I followed the strategy suggested in the topic Any working example of VF2 algorithm?. So, if someone is in doubt about this problem too, I suggest taking a look at Rich Apodaca's answer in the aforementioned topic.
