How to implement equivalence class in Java?

What would be a simple way to implement equivalence classes in Java? Is there a library for that purpose?
The troublesome part is writing an efficient, non-naive "equal" operator.
Let S = {x,y,z,w,h}. If we use the mapping x->1, y->1, z->1, w->2, h->2 to describe an equivalence class on S, then the mapping x->10, y->10, z->10, w->20, h->20 has to be considered the same equivalence class.
A naive "equal" operator can quickly become time-consuming when the cardinality of the set S becomes large.
What would be a simple way to do this? Any ideas?
[EDITED] To clarify, the specific problem can be formalized as follows:
Let S be a non-empty set. We denote by M a set of partial mappings from V to integers. It is relatively easy to show that the binary relation \sim defined below derives an equivalence relation on M.
For m1 and m2 two partial mappings of M, m1 \sim m2 if and only if,
for any a of V, m1(a) is defined if and only if m2(a) is defined; and
for any a, b of V, m1(a) and m1(b) are both defined to be the same integer value 'z1' if and only if m2(a) and m2(b) are both defined to be the same integer value 'z2' (which may or may not differ from 'z1').
Example.
a->9,b->9,w->1 \sim a->10,b->10,w->0
But it is not correct to say
a->5 \sim b->9
Thanks.

From what I understand from your question you can find the greatest common divisor (Euclid's algorithm recursively) for a set once and map the quotients with that instead - if they're exactly equal with another set it's equal, else not. This will only work if the sets are equal in size and mappings.

If I understand you correctly, you could apply vector normalization. A 3D vector, for example, is normalized to a length of 1 by dividing each of its components by the vector's length. If the components of two normalized vectors are equal, their original (non-normalized) vectors point in the same direction (which is what I think you define as 'equal').
x,y,z,w,h would in your case be a 5-dimensional vector. Two vectors belong to the same class when they point in the same direction, but they may have arbitrary lengths.

Aside: I assume that the set S is actually the set V in your definition.
I think Uli is on the right track, although I would not assume that Set(Set(E)).equals() is efficient for your purposes. (Sorry, I couldn't get the lt or gt symbols to come through)
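The cost of Set(E).equals() depends on the set type: a hash-based set compares sizes and then checks containment element by element, which is expected O(n), while a tree-based set pays O(log n) per lookup, giving O(n log n). If you end up sorting in order to compare, a comparison sort cannot do better than O(n log n). I suggest you look at radix sort, which runs in O(n*k) for keys of k digits.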
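One way to make the "equal" operator cheap is to bring every mapping into a canonical form first, so that two mappings are equivalent exactly when their canonical forms are equal. This is a different technique from the ones suggested in the answers above, sketched here under the assumption that keys are Comparable; the class and method names are made up for illustration. Keys are visited in sorted order and each distinct value is renamed to 0, 1, 2, ... in order of first appearance.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not from any library.
final class CanonicalMapping {

    // Relabels the values of m as 0, 1, 2, ... in order of first appearance
    // when the keys are visited in sorted order.
    static <K extends Comparable<K>> Map<K, Integer> canonicalize(Map<K, Integer> m) {
        Map<K, Integer> sorted = new TreeMap<>(m);
        Map<Integer, Integer> relabel = new HashMap<>();
        Map<K, Integer> out = new TreeMap<>();
        for (Map.Entry<K, Integer> e : sorted.entrySet()) {
            Integer label = relabel.get(e.getValue());
            if (label == null) {
                label = relabel.size();          // next unused label
                relabel.put(e.getValue(), label);
            }
            out.put(e.getKey(), label);
        }
        return out;
    }

    // Two partial mappings are equivalent iff their canonical forms are equal.
    static <K extends Comparable<K>> boolean equivalent(Map<K, Integer> a, Map<K, Integer> b) {
        return canonicalize(a).equals(canonicalize(b));
    }
}
```

If the canonical form is computed once and stored, it can also serve as a key in a HashMap or HashSet, since equal canonical maps have equal hash codes.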

Related

Collections sort method vs iteration

I was working on a playing cards shuffle problem and found two solutions for it.
The goal is to shuffle all 52 playing cards, stored in an array as Card objects. The Card class has an id and a name associated with it.
One way is to iterate with a for loop and, with the help of a temporary Card holder and a random number generator, swap two objects. This continues until we reach half of the cards.
Another way is to implement Comparable and override compareTo using a random number generator, so that we get a random result each time the method is called.
Which way do you think is better?

You should not do it by sorting with a comparator that returns random results, because then those random results can be inconsistent with one another (e.g., saying that a<b<c<a), and this can actually result in the distribution of orderings you get being far from uniform. See e.g. this demonstration by Mike Bostock. Also, it takes longer, not that that should matter for shuffling 52 objects.
The standard way to do it does involve a loop, but your description sounds peculiar and I suspect what you have in mind may also not produce the ideal results. (If the question is updated to make it clearer what the "iterate using for loop" approach is meant to be, I will update this.)
(There is a way to get good shuffling by sorting: pair each element up with a random number -- e.g., a random floating-point number in the range 0..1 -- and then sort using that number as key. But this is slower than Fisher-Yates and requires extra memory. In lower-level languages it generally also takes more code; in higher-level languages it can be terser; I'd guess that for Java it ends up being about equal.)
[EDITED to add:] As Louis Wasserman very wisely says in comments, when your language's standard library has a ready-made function to do a thing, you should generally use it. Unless you're doing this for, e.g., a homework assignment that requires you to find and implement an algorithm to solve the problem.

First of all, the comparator you've described won't work. More on this here. TLDR: comparisons must be reproducible, so if your comparator says that a is less than b, then the next time it compares b to a it should return "greater", not a random value. The same goes for Comparable.
If I were you, I'd rather use the Collections#shuffle method, which "randomly permutes the specified list using a default source of randomness. All permutations occur with approximately equal likelihood". It's always better to rely on someone else's code than to write your own, especially if it is in a standard library.
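As a rough sketch of the two approaches recommended in these answers: a hand-written Fisher-Yates loop, and the library call Collections.shuffle. The class name ShuffleSketch and the use of Integer in place of the question's Card class are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical demo class; Integer stands in for the question's Card type.
final class ShuffleSketch {

    // Fisher-Yates: walk from the last index down, swapping each slot with a
    // uniformly chosen slot at or before it. Note the loop covers the whole
    // array, not just half of it.
    static <T> void fisherYates(T[] items, Random rng) {
        for (int i = items.length - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1); // 0 <= j <= i
            T tmp = items[i];
            items[i] = items[j];
            items[j] = tmp;
        }
    }

    // The standard-library route: build a list and let Collections.shuffle do the work.
    static List<Integer> libraryShuffle(int n, Random rng) {
        List<Integer> deck = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            deck.add(i);
        }
        Collections.shuffle(deck, rng);
        return deck;
    }
}
```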

A4options.symmetry and signature instances permutations in Alloy

I have modeled a diagram transformation chain in Alloy. I am interested in any chain that results of the solving, but some of the chains are exactly the same.
They are the same up to a permutation of signature instances: the relations between instances form exactly the same graphs from one solution to another.
Is there a way to avoid these redundant solutions?
I saw a symmetry option in the A4Option class but I didn't really understand how to configure it.
/** This option specifies the amount of symmetry breaking to do (when symmetry breaking isn't explicitly disabled).
*
* <p> If a formula is unsatisfiable, then in general, the higher this value,
* the faster you finish the solving. But if this value is too high, it will instead slow down the solving.
*
* <p> If a formula is satisfiable, then in general, the lower this value, the faster you finish the solving.
* Setting this value to 0 usually gives the fastest solve.
*
* <p> Default value is 20.
*/
Does it mean if I put 0 it is disabled? if I put a higher value does it avoid symmetry?
If you consider a set of atoms and relations between these atoms as a graph.
And an adjacency matrix as the characterization of the relations between atoms in a matrix.
Does symmetry mean 2 instances that have equivalent adjacency matrices?
In order to reduce the solving complexity, is there a way to specify to the solver that we are not interested in some specific signature instances permutation or relation permutation but in their architecture configuration?
Thanks in advance.

Does it mean if I put 0 [symmetry breaking] is disabled?
Yes
if I put a higher value does it avoid symmetry?
Yes, the best it can.
Does symmetry mean 2 instances that have equivalent adjacency matrices?
I don't know what you mean by "adjacency matrix", but in any case, the answer is likely to be "not necessarily". Symmetry breaking is just a heuristic; it is implemented at a level lower than the Alloy AST, meaning that some symmetries that make sense at a high level of your domain model are not necessarily automatically detected and broken by the Alloy Analyzer.
In order to reduce the solving complexity, is there a way to specify
to the solver that we are not interested in some specific signature
instances permutation or relation permutation but in their
architecture configuration?
I don't think that can be readily done using Alloy.

A good hash function to use in interviews for integer numbers, strings?

I have come across situations in interviews where I needed to use a hash function for integers or for strings. Which ones should we choose in such situations? I've gone wrong here because I end up choosing functions that generate a lot of collisions, but good hash functions tend to be so mathematical that you cannot recall them in an interview. Are there any general recommendations, so that at least the interviewer is satisfied with your approach for integer or string inputs? Which functions would be adequate for both kinds of input in an "interview situation"?

Here is a simple recipe from Effective Java, page 33:
1. Store some constant nonzero value, say, 17, in an int variable called result.
2. For each significant field f in your object (each field taken into account by the equals method, that is), do the following:
   a. Compute an int hash code c for the field:
      i. If the field is a boolean, compute (f ? 1 : 0).
      ii. If the field is a byte, char, short, or int, compute (int) f.
      iii. If the field is a long, compute (int) (f ^ (f >>> 32)).
      iv. If the field is a float, compute Float.floatToIntBits(f).
      v. If the field is a double, compute Double.doubleToLongBits(f), and then hash the resulting long as in step 2.a.iii.
      vi. If the field is an object reference and this class's equals method compares the field by recursively invoking equals, recursively invoke hashCode on the field. If a more complex comparison is required, compute a "canonical representation" for this field and invoke hashCode on the canonical representation. If the value of the field is null, return 0 (or some other constant, but 0 is traditional).
      vii. If the field is an array, treat it as if each element were a separate field. That is, compute a hash code for each significant element by applying these rules recursively, and combine these values per step 2.b. If every element in an array field is significant, you can use one of the Arrays.hashCode methods added in release 1.5.
   b. Combine the hash code c computed in step 2.a into result as follows:
      result = 31 * result + c;
3. Return result.
When you are finished writing the hashCode method, ask yourself whether
equal instances have equal hash codes. Write unit tests to verify your intuition!
If equal instances have unequal hash codes, figure out why and fix the problem.
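To make the recipe concrete, here is a minimal sketch applying it to a made-up class. The class Reading and its fields are assumptions chosen only to exercise the boolean, int, long, double and object-reference cases; they do not come from the book.

```java
// Hypothetical value class used only to illustrate the recipe.
final class Reading {
    private final boolean valid;
    private final int sensorId;
    private final long timestamp;
    private final double value;
    private final String label; // may be null

    Reading(boolean valid, int sensorId, long timestamp, double value, String label) {
        this.valid = valid;
        this.sensorId = sensorId;
        this.timestamp = timestamp;
        this.value = value;
        this.label = label;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Reading)) return false;
        Reading r = (Reading) o;
        return valid == r.valid
                && sensorId == r.sensorId
                && timestamp == r.timestamp
                && Double.compare(value, r.value) == 0
                && (label == null ? r.label == null : label.equals(r.label));
    }

    @Override
    public int hashCode() {
        int result = 17;                                            // nonzero seed
        result = 31 * result + (valid ? 1 : 0);                     // boolean
        result = 31 * result + sensorId;                            // int
        result = 31 * result + (int) (timestamp ^ (timestamp >>> 32)); // long
        long bits = Double.doubleToLongBits(value);                 // double -> long
        result = 31 * result + (int) (bits ^ (bits >>> 32));
        result = 31 * result + (label == null ? 0 : label.hashCode()); // reference
        return result;
    }
}
```

Every field compared in equals also feeds hashCode, which is what guarantees that equal instances get equal hash codes.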

You should ask the interviewer what the hash function is for - the answer to this question will determine what kind of hash function is appropriate.
If it's for use in hashed data structures like hashmaps, you want it to be as simple as possible (fast to execute) and to avoid collisions (the most common values should map to different hash values). A good example is an integer hashing to the same integer: this is the standard hashCode() implementation in java.lang.Integer.
If it's for security purposes, you will want to use a cryptographic hash function. These are primarily designed so that it is hard to reverse the hash function or find collisions.
If you want fast pseudo-random-ish hash values (e.g. for a simulation) then you can usually modify a pseudo-random number generator to create these. My personal favourite is:
public static final int hash(int a) {
    a ^= (a << 13);
    a ^= (a >>> 17);
    a ^= (a << 5);
    return a;
}
If you are computing a hash for some form of composite structure (e.g. a string with multiple characters, or an array, or an object with multiple fields), then there are various techniques you can use to create a combined hash function. I'd suggest something that XORs the rotated hash values of the constituent parts, e.g.:
public static <T> int hashCode(T[] data) {
    int result = 0;
    for (int i = 0; i < data.length; i++) {
        result ^= data[i].hashCode();
        result = Integer.rotateRight(result, 1);
    }
    return result;
}
Note the above is not cryptographically secure, but will do for most other purposes. You will obviously get collisions, but that's unavoidable when hashing a large structure to an integer :-)

For integers, I usually go with k % p where p = size of the hash table and is a prime number and for strings I choose hashcode from String class. Is this sufficient enough for an interview with a major tech company? – phoenix 2 days ago
Maybe not. It's not uncommon to need to provide a hash function to a hash table whose implementation is unknown to you. Further, if you hash in a way that depends on the implementation using a prime number of buckets, then your performance may degrade if the implementation changes due to a new library, compiler, OS port etc..
Personally, I think the important thing at interview is a clear understanding of the ideal characteristics of a general-purpose hash algorithm, which is basically that for any two input keys with values varying by as little as one bit, each and every bit in the output has about a 50/50 chance of flipping. I found that quite counter-intuitive because a lot of the hashing functions I first saw used bit-shifts and XOR, and a flipped input bit usually flipped one output bit (usually in another bit position), so 1-input-bit-affects-many-output-bits was a little revelation moment when I read it in one of Knuth's books. With this knowledge you're at least capable of testing and assessing specific implementations regardless of how they're implemented.
One approach I'll mention because it achieves this ideal and is easy to remember, though the memory usage may make it slower than mathematical approaches (could be faster too depending on hardware), is to simply use each byte in the input to look up a table of random ints. For example, given a 24-bit RGB value and int table[3][256], table[0][r] ^ table[1][g] ^ table[2][b] is a great sizeof int hash value - indeed "perfect" if inputs are randomly scattered through the int values (rather than say incrementing - see below). This approach isn't ideal for long or arbitrary-length keys, though you can start revisiting tables and bit-shift the values etc..
All that said, you can sometimes do better than this randomising approach for specific cases where you are aware of the patterns in the input keys and/or the number of buckets involved (for example, you may know the input keys are contiguous from 1 to 100 and there are 128 buckets, so you can pass the keys through without any collisions). If, however, the input ceases to meet your expectations, you can get horrible collision problems, while a "randomising" approach should never get much worse than load (size() / buckets) implies. Another interesting insight is that when you want a quick-and-mediocre hash, you don't necessarily have to incorporate all the input data when generating the hash: e.g. last time I looked at Visual C++'s string hashing code it picked ten letters evenly spaced along the text to use as inputs....
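The table-lookup approach described in this answer might look roughly like the following in Java. The class name TableHash and the seeded java.util.Random used to fill the tables are assumptions for illustration; the answer only specifies an int table[3][256] of random ints XORed together.

```java
import java.util.Random;

// Hypothetical sketch of table-lookup hashing for a 24-bit RGB value.
final class TableHash {
    private final int[][] table = new int[3][256];

    // Fill each of the three 256-entry tables with random ints.
    TableHash(long seed) {
        Random rng = new Random(seed);
        for (int[] row : table) {
            for (int i = 0; i < row.length; i++) {
                row[i] = rng.nextInt();
            }
        }
    }

    // Split the value into its r, g, b bytes and XOR one table entry per byte.
    int hash(int rgb) {
        int r = (rgb >>> 16) & 0xFF;
        int g = (rgb >>> 8) & 0xFF;
        int b = rgb & 0xFF;
        return table[0][r] ^ table[1][g] ^ table[2][b];
    }
}
```

Changing any single input byte swaps one of the XORed table entries for an unrelated random int, which is where the "every output bit flips with roughly even odds" behaviour comes from.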

Hashcode comparison problem

I have a list of objects, which we call rules. Each rule is itself a list of fields, and I have to do a hashcode comparison on them because we can't have duplicate rules in the system.
For example, say I have two rules R1 and R2 with fields A and B.
In R1 the values of A and B are 7 and 2 respectively,
and in R2 they are 3 and 4 respectively. In that case the process I use to check for duplicate rules in the system, which is hashcode comparison, fails.
The method I have used is:
for (Rule rule : rules) {
    changeableAttrCode = 0;
    fieldCounter = 1;
    // Weight each attribute's hash by its position and add it up.
    attributes = rule.getAttributes();
    for (RuleField ruleField : attributes) {
        changeableAttrCode = changeableAttrCode + (fieldCounter * ruleField.getValue().hashCode());
        fieldCounter++;
    }
    // Continue the same weighted sum over the parameters.
    parameters = rule.getParameters();
    for (RuleField ruleField : parameters) {
        changeableAttrCode = changeableAttrCode + (fieldCounter * ruleField.getValue().hashCode());
        fieldCounter++;
    }
    changeableAttrCodes.add(changeableAttrCode);
}
Here changeableAttrCodes is where we store the hashcodes of all the rules.
Can you please suggest a better method, so that this kind of problem does not arise in future and duplicate rules in the system can be detected?
Thanks in advance

hashCode() is not meant to be used to check for equality. return 42; is a perfectly valid implementation of hashCode(). Why don't you override equals() (and hashCode() for that matter) in the rule objects and use that to check whether two rules are equal? You could still use the hash code to narrow down which objects you need to investigate, since two equal() objects should always have the same hash code, but that is a performance improvement that you may or may not need, depending on your system.

Implement hashCode and equals in class Rule.
Implementation of equals has to compare its values.
Then use a HashSet<Rule> and ask if(mySet.contains(newRule))
HashSet + equals implementation solves the problem of the non-uniqueness of the hash. It uses hash for classifying and speed but it uses equals at the end to ensure that two Rules with same hash are the same Rule or not.
More on hash: if you want to do it by hand, use the prime number suggestion, and review the JDK code for String hash codes. If you want a clean implementation, retrieve the hash codes of the elements, put them in an array of ints, and use Arrays.hashCode(int[]) to get a hash code for the combination.
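A minimal sketch of the HashSet plus equals/hashCode approach follows. The question does not show the real Rule or RuleField classes, so this simplified Rule (holding two lists of Integer values) and the RuleRegistry wrapper are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified stand-in for the question's Rule: field values are plain Integers.
final class Rule {
    private final List<Integer> attributes;
    private final List<Integer> parameters;

    Rule(List<Integer> attributes, List<Integer> parameters) {
        this.attributes = new ArrayList<>(attributes); // defensive copies
        this.parameters = new ArrayList<>(parameters);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Rule)) return false;
        Rule other = (Rule) o;
        // Value comparison decides equality; the hash is only a shortcut.
        return attributes.equals(other.attributes) && parameters.equals(other.parameters);
    }

    @Override
    public int hashCode() {
        return 31 * attributes.hashCode() + parameters.hashCode();
    }
}

// Hypothetical registry that refuses duplicate rules.
final class RuleRegistry {
    private final Set<Rule> rules = new HashSet<>();

    // Returns true if the rule was new, false if an equal rule already exists.
    boolean addIfAbsent(Rule rule) {
        return rules.add(rule);
    }
}
```

Even if two different rules happen to share a hash code, the set falls back on equals, so they are never mistaken for duplicates.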

Updated Your hashing algorithm is not producing a good spread of hash values - it gives the same value for (7, 2) and (3, 4):
1 * 7 + 2 * 2 = 11
1 * 3 + 2 * 4 = 11
It would also give the same value for (11, 0), (-1, 6), ... and one can trivially make up an endless number of similar equivalence classes based on your current algorithm.
Of course you can not avoid collisions - if you have enough instances, hash collision is inevitable. However, you should aim to minimize the chance for collisions. Good hashing algorithms strive to spread hash values equally over a wide range of values. A typical way to achieve this is to generate the hash value for an object containing n independent fields as an n-digit number with a base big enough to hold the different hash values for the individual fields.
In your case, instead of multiplying by fieldCounter you should multiply the running result by a prime constant, e.g. 31 (that would be the base of your number), and start from another prime constant, e.g. 17. This gives you a better spread of hash values. (Of course the concrete base depends on what values your fields can take, and I have no information about that.)
Also if you implement hashCode, you are strongly advised to implement equals as well - and in fact, you should use the latter to test for equality.
Here is an article about implementing hashCode.
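The difference between the two schemes can be checked with a few lines of code. This is a sketch with made-up names (FieldHash, weightedSum, combine); the inputs stand for the already computed hash codes of the individual fields.

```java
// Hypothetical helper comparing the question's scheme with the 17/31 scheme.
final class FieldHash {

    // The scheme from the question: sum of position * fieldHash.
    static int weightedSum(int... fieldHashes) {
        int code = 0;
        int counter = 1;
        for (int h : fieldHashes) {
            code += counter * h;
            counter++;
        }
        return code;
    }

    // The suggested scheme: start at 17, multiply the running result by 31.
    static int combine(int... fieldHashes) {
        int result = 17;
        for (int h : fieldHashes) {
            result = 31 * result + h;
        }
        return result;
    }
}
```

With these definitions weightedSum(7, 2) and weightedSum(3, 4) are both 11, while combine(7, 2) is 16556 and combine(3, 4) is 16434.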

I don't understand what you are trying to do here. With most hash function scenarios, collisions are inevitable, because there are far more objects to hash than there are possible hash values (this is the pigeonhole principle).
It is generally the case that two different objects may have the same hash value. You cannot rely on hash functions alone to eliminate duplicates.
Some hash functions are better than others in minimizing collisions, but it's still an inevitability.
That said, there are some simple guidelines that usually give a good enough hash function. Joshua Bloch gives the following in his book Effective Java 2nd Edition:
1. Store some constant nonzero value, say 17, in an int variable called result.
2. Compute an int hashcode c for each field:
   If the field is a boolean, compute (f ? 1 : 0)
   If the field is a byte, char, short, int, compute (int) f
   If the field is a long, compute (int) (f ^ (f >>> 32))
   If the field is a float, compute Float.floatToIntBits(f)
   If the field is a double, compute Double.doubleToLongBits(f), then hash the resulting long as above.
   If the field is an object reference and this class's equals method compares the field by recursively invoking equals, recursively invoke hashCode on the field. If the value of the field is null, return 0.
   If the field is an array, treat it as if each element is a separate field. If every element in an array field is significant, you can use one of the Arrays.hashCode methods added in release 1.5.
3. Combine the hashcode c into result as follows: result = 31 * result + c;

I started to write that the only way you can achieve what you want is with Perfect Hashing.
But then I thought about the fact that you said you can't duplicate objects in your system.
Edit based on thought-provoking comment from helios:
Your solution depends on what you meant when you wrote that you "can't duplicate rules".
If you meant that literally you cannot, that there is guaranteed to be only one instance of a rule with a particular set of values, then your problem is trivial: you can do identity comparison using ==.
If, on the other hand, you meant that you shouldn't for some reason (performance), then your problem is also trivial: just do value comparisons.
Given the way you've defined your problem, under no circumstances should you be considering the use of hashcodes as a substitute for equality. As others have noted, hashcodes by their nature yield collisions (false equality), unless you go to a Perfect Hashing solution, but why would you in this case?

What data-structure should I use to create my own "BigInteger" class?

As an optional assignment, I'm thinking about writing my own implementation of the BigInteger class, where I will provide my own methods for addition, subtraction, multiplication, etc.
This will be for arbitrarily long integer numbers, even hundreds of digits long.
While doing the math on these numbers, digit by digit isn't hard, what do you think the best datastructure would be to represent my "BigInteger"?
At first I was considering using an Array but then I was thinking I could still potentially overflow (run out of array slots) after a large add or multiplication. Would this be a good case to use a linked list, since I can tack on digits with O(1) time complexity?
Is there some other data-structure that would be even better suited than a linked list? Should the type that my data-structure holds be the smallest possible integer type I have available to me?
Also, should I be careful about how I store my "carry" variable? Should it, itself, be of my "BigInteger" type?

Check out the book C Interfaces and Implementations by David R. Hanson. It has 2 chapters on the subject, covering the vector structure, word size and many other issues you are likely to encounter.
It's written for C, but most of it is applicable to C++ and/or Java. And if you use C++ it will be a bit simpler because you can use something like std::vector to manage the array allocation for you.

Always use the smallest int type that will do the job you need (bytes). A linked list should work well, since you won't have to worry about overflowing.

If you use binary trees (whose leaves are ints), you get all the advantages of the linked list (unbounded number of digits, etc.) with simpler divide-and-conquer algorithms. In this case you do not have a single base but many, depending on the level at which you're working.
If you do this, you need to use a BigInteger for the carry. You may consider it an advantage of the "linked list of ints" approach that the carry can always be represented as an int. This is true for any base, not just base 10, which most answers seem to assume you should use: in any base, the carry is always a single digit.
I might as well say it: it would be a terrible waste to use base 10 when you can use 2^30 or 2^31.

Accessing elements of linked lists is slow. I think arrays are the way to go, with lots of bound checking and run time array resizing as needed.
Clarification: Traversing a linked list and traversing an array are both O(n) operations. But traversing a linked list requires dereferencing a pointer at each step. Just because two algorithms have the same complexity doesn't mean that they take the same time to run. The overhead of allocating and deallocating n nodes in a linked list will also be much heavier than the memory management of a single array of size n, even if the array has to be resized a few times.

Wow, there are some… interesting answers here. I'd recommend reading a book rather than try to sort through all this contradictory advice.
That said, C/C++ is also ill-suited to this task. Big-integer is a kind of extended-precision math. Most CPUs provide instructions to handle extended-precision math at comparable or the same speed (bits per instruction) as normal math. When you add 2^31 + 2^31 in 32-bit arithmetic, the answer is 0… but there is also a special carry output from the processor's ALU which a program can read and use.
C++ cannot access that flag, and there's no way in C either. You have to use assembler.
Just to satisfy curiosity, you can use the standard Boolean arithmetic to recover carry bits etc. But you will be much better off downloading an existing library.
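For the curious, recovering the carry without access to the CPU flag can be sketched in Java as follows. The class and method names are made up for illustration; both methods treat their int arguments as unsigned 32-bit words.

```java
// Hypothetical helper showing two ways to detect the carry out of a 32-bit add.
final class CarryDemo {

    // The wrapped sum is smaller (as an unsigned number) than an operand
    // exactly when the addition overflowed 32 bits.
    static boolean carriesByCompare(int a, int b) {
        int sum = a + b; // wraps modulo 2^32
        return Integer.compareUnsigned(sum, a) < 0;
    }

    // Pure bitwise version: the top bit of (a & b) | ((a | b) & ~sum)
    // is the carry out of the most significant bit.
    static boolean carriesByBits(int a, int b) {
        int sum = a + b;
        return (((a & b) | ((a | b) & ~sum)) >>> 31) != 0;
    }
}
```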

I would say an array of ints.

An Array is indeed a natural fit. I think it is acceptable to throw OverflowException, when you run out of place in your memory. The teacher will see attention to detail.
Multiplying two numbers of similar length roughly doubles the number of digits, and addition increases it by at most 1. It is easy to create a sufficiently big array to store the result of your operation.
The carry is at most a one-digit number in multiplication (9*9 = 1, carry 8). A single int will do.

std::vector<bool> or std::vector<unsigned int> is probably what you want. You will have to push_back() or resize() on them as you need more space for multiplies, etc. Also, remember to push_back the correct sign bits if you're using two's complement.

I would say a std::vector of char (since it has to hold only 0-9), if you plan to work in BCD.
If not BCD, then use a vector of int (you didn't make it clear).
Much less space overhead than a linked list.
And all advice says 'use vector unless you have a good reason not to'.

As a rule of thumb, use std::vector instead of std::list, unless you need to insert elements in the middle of the sequence very often. Vectors tend to be faster, since they are stored contiguously and thus benefit from better spatial locality (a major performance factor on modern platforms).
Make sure you use elements that are natural for the platform. If you want to be platform independent, use long. Remember that unless you have some special compiler intrinsics available, you'll need a type at least twice as large to perform multiplication.
I don't understand why you'd want carry to be a big integer. Carry is a single bit for addition and element-sized for multiplication.
Make sure you read Knuth's Art of Computer Programming, algorithms pertaining to arbitrary precision arithmetic are described there to a great extent.
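Putting the array-based advice from these answers together, addition could be sketched in Java like this. The representation is an assumption chosen for illustration: non-negative numbers only, stored as an int[] of unsigned 32-bit "digits" with the least significant digit first, and a long used as the type twice as large as one digit. The class name BigNat is made up.

```java
import java.util.Arrays;

// Hypothetical sketch: non-negative big integers as little-endian int[] limbs (base 2^32).
final class BigNat {

    // Adds two numbers digit by digit, propagating a carry that is always 0 or 1.
    static int[] add(int[] a, int[] b) {
        int n = Math.max(a.length, b.length);
        int[] out = new int[n + 1];            // the sum needs at most one extra digit
        long carry = 0;
        for (int i = 0; i < n; i++) {
            long x = i < a.length ? (a[i] & 0xFFFFFFFFL) : 0; // read limb as unsigned
            long y = i < b.length ? (b[i] & 0xFFFFFFFFL) : 0;
            long s = x + y + carry;
            out[i] = (int) s;                  // low 32 bits become the digit
            carry = s >>> 32;                  // high bits become the carry
        }
        out[n] = (int) carry;
        // Drop the extra digit when it was not needed.
        return carry == 0 ? Arrays.copyOf(out, n) : out;
    }
}
```

Because the result array is sized up front from the operand lengths, no resizing or linked structure is needed for addition; multiplication would similarly allocate a.length + b.length digits.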
