Implementing Guava's MurMurHash Properly

Implementing Guava's MurMurHash Properly - java

I'm a Junior Java Developer and I'm trying to start a small personal project in order to learn the proper ways to do things (in general). I started searching about hash() and while reading an article about the benefits of Guava, I stumbled upon MurMurHash and the example is very clear on it's website, but there is something missing that I didn't understand: Funnel.
The code goes like this:
HashFunction hf = Hashing.md5();
HashCode hc = hf.newHasher()
.putLong(id)
.putString(name, Charsets.UTF_8)
.putObject(person, personFunnel)
.hash();
but then I have to define a Funnel to decompose an object type into primitive field values, for which I have to
Funnel<Person> personFunnel = new Funnel<Person>() {
#Override
public void funnel(Person person, PrimitiveSink into) {
into
.putInt(person.id)
.putString(person.firstName, Charsets.UTF_8)
.putString(person.lastName, Charsets.UTF_8)
.putInt(birthYear);
}
};
Although I searched for more info about how to use this or info in general, there is no clear explanation about how Funnel works and/or how should I use it. Also I don't understand what PrimitiveSink is, so I don't know what kind of data should I send as a second parameter.
I would appreciate an explanation o guidance about this.

You don't actually have to use a Funnel for anything, but a Funnel is just an object that expresses how to convert a particular type to a sequence of primitives. There is no special magic.
Funnel<Person> personFunnel = new Funnel<Person>() {
#Override
public void funnel(Person person, PrimitiveSink into) {
into
.putInt(person.id)
.putString(person.firstName, Charsets.UTF_8)
.putString(person.lastName, Charsets.UTF_8)
.putInt(birthYear);
}
};
This is just an object that explains how to convert a Person to a sequence of primitives by putting them into a thing that knows how to receive primitives; the interface for things that know how to receive primitives is PrimitiveSink. Hasher is an example of a class that implements PrimitiveSink, and when you call hasher.putObject(object, funnelForObjectType), the API internally just does funnelForObjectType.funnel(object, hasher), and keeps going.
It's just a way of writing a converter from a particular object type to primitives, nothing more. You are never likely to call Funnel.funnel yourself; it's only there to be passed into putObject calls; you should never need to pass in your own PrimitiveSink.

Related

Is it bad practice in Java to modify input object in void method?

In Java, assume you have a data object object with an attribute bar that you need to set with a value that is returned from a complex operation done in an external source. Assume you have a method sendRequestToExternalSource that send a request based on 'object' to the external source and gets an object back holding (among other things) the needed value.
Which one of these ways to set the value is the better practice?
void main(MyObject object) {
bar = sendRequestToExternalSource(object);
object.setBar(bar);
}
String sendRequestToExternalSource(MyObject object) {
// Send request to external source
Object response = postToExternalSource(object);
//Do some validation and logic based on response
...
//Return only the attribute we are interested in
return response.getBar();
}
or
void main(MyObject object) {
sendRequestToExternalSourceAndUpdateObject(object);
}
void sendRequestToExternalSourceAndUpdateObject(MyObject object) {
// Send request to external source
Object response = postToExternalSource(object);
//Do some validation and logic based on response
...
//Set the attribute on the input object
object.setBar(response.getBar());
}
I know they both work, but what is the best practice?

It depends on a specific scenario. Side-effects are not bad practice but there are also scenarios where a user simply won't expect them.
In any case your documentation of such a method should clearly state if you manipulate arguments. The user must be informed about that since it's his object that he passes to your method.
Note that there are various examples where side-effects intuitively are to be expected and that's also totally fine. For example Collections#sort (documentation):
List<Integer> list = ...
Collections.sort(list);
However if you write a method like intersection(Set, Set) then you would expect the result being a new Set, not for example the first one. But you can rephrase the name to intersect and use a structure like Set#intersect(Set). Then the user would expect a method with void as return type where the resulting Set is the Set the method was invoked on.
Another example would be Set#add. You would expect that the method inserts your element and not a copy of it. And that is also what it does. It would be confusing for people if it instead creates copies. They would need to call it differently then, like CloneSet or something like that.
In general I would tend to giving the advice to avoid manipulating arguments. Except if side-effects are to be expected by the user, as seen in the example. Otherwise the risk is too high that you confuse the user and thus create nasty bugs.

I would choose the first one if I have only these two choices. And the reason of that is "S" in SOLID principles, single responsibility. I think the job of doComplicatedStuff method is not setting new or enriched value of bar to MyObject instance.
Of course I don't know use case that you are trying to implement, but I suggest looking at decorator pattern to modify MyObject instance

I personally prefer the variant barService.doComplicatedStuff(object); because I avoid making copies

Regarding two lines of java code

I am trying to learn a java-based program, but I am pretty new to java. I am quite confusing on the following two lines of java code. I think my confusion comes from the concepts including “class” and “cast”, but just do not know how to analyze.
For this one
XValidatingObjectCorpus<Classified<CharSequence>> corpus
= new XValidatingObjectCorpus<Classified<CharSequence>>(numFolds);
What is <Classified<CharSequence>> used for in terms of Java programming? How to understand its relationships with XValidatingObjectCorpusand corpus
For the second one
LogisticRegressionClassifier<CharSequence> classifier
= LogisticRegressionClassifier.<CharSequence>train(para1, para2, para3)
How to understand the right side of LogisticRegressionClassifier.<CharSequence>train? What is the difference between LogisticRegressionClassifier.<CharSequence>train and LogisticRegressionClassifier<CharSequence> classifier
?

These are called generics. They tell Java to make an instance of the outer class - either XValidatingObjectCorpus or LogisticRegressionClassifier - using the type of the inner object.
Normally, these are used for lists and arrays, such as ArrayList or HashMap.
What is the relationship between XValidatingObjectCorpus and corpus?
corpus is just a name given to the new XValidatingObjectCorpus object that you make with that statement (hence the = new... part).
What does LogisticRegressionClassifier.<CharSequence>train mean?
I have no idea, really. I suggest looking at the API for that (I think this is the right class).
What is the difference between LogisticRegressionClassifier.<CharSequence>train and LogisticRegressionClassifier<CharSequence> classifier?
You can't really compare these two. The one on the left of the = is the object identifier, and the one on the right is the allocator (probably the wrong word, but it is what it does, kind of).
Together, the two define an instance of LogisticRegressionClassifier, saying to create that type of object, call it classifier, and then give it the value returned by the train() method. Again, look at the API to understand it more.
By the way, these look like wretched examples to be learning Java with. Start with something simple, or at least an easier part of the code. It looks like someone had way too much fun with long names (the API has even longer names). Seriously though, I only just got to fully understanding this, and Java was my main language for quite a while (It gets really confusing when you try and do simple things). Anyways, good luck!

public class Sample<T> { // T implies Generic implementation, T can be substituted with any object.
static <T> Sample<T> train(int par1, int par2, int par3){
return new Sample<T>(); // you are calling the Generic method to return Sample object which works with a particular type of generic object, may it be an Integer or a CharSequence. --> see the main method.
}
public static void main(String ... a)
{
int par1 = 0, par2 = 0, par3 = 1;
// Here you are returning Sample object which works with a sequence of characters.
Sample<CharSequence> sample = Sample.<CharSequence>train(par1, par2, par3);
// Here you are returning Sample object which works with Integer values.
Sample<CharSequence> sample1 = Sample.<Integer>train(par1, par2, par3);
}
}

<Classified<CharSequence>> is a generic parameter.
LogisticRegressionClassifier<CharSequence> is a generic type.
LogisticRegresstionClassifier.<CharSequence>train is a generic method.
Java Generics Tutorial

What is a callback in java [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
What is a callback function?
I have read the wikipedia definition of a callback but I still didn't get it. Can anyone explain me what a callback is, especially the following line
In computer programming, a callback is a reference to executable code, or a piece of executable code, that is passed as an argument to other code. This allows a lower-level software layer to call a subroutine (or function) defined in a higher-level layer.

Callbacks are most easily described in terms of the telephone system. A function call is analogous to calling someone on a telephone, asking her a question, getting an answer, and hanging up; adding a callback changes the analogy so that after asking her a question, you also give her your name and number so she can call you back with the answer.
Paul Jakubik, Callback Implementations in C++.

Maybe an example would help.
Your app wants to download a file from some remote computer and then write to to a local disk. The remote computer is the other side of a dial-up modem and a satellite link. The latency and transfer time will be huge and you have other things to do. So, you have a function/method that will write a buffer to disk. You pass a pointer to this method to your network API, together with the remote URI and other stuff. This network call returns 'immediately' and you can do your other stuff. 30 seconds later, the first buffer from the remote computer arrives at the network layer. The network layer then calls the function that you passed during the setup and so the buffer gets written to disk - the network layer has 'called back'. Note that, in this example, the callback would happen on a network layer thread than the originating thread, but that does not matter - the buffer still gets written to the disk.

A callback is some code that you pass to a given method, so that it can be called at a later time.
In Java one obvious example is java.util.Comparator. You do not usually use a Comparator directly; rather, you pass it to some code that calls the Comparator at a later time:
Example:
class CodedString implements Comparable<CodedString> {
private int code;
private String text;
...
#Override
public boolean equals() {
// member-wise equality
}
#Override
public int hashCode() {
// member-wise equality
}
#Override
public boolean compareTo(CodedString cs) {
// Compare using "code" first, then
// "text" if both codes are equal.
}
}
...
public void sortCodedStringsByText(List<CodedString> codedStrings) {
Comparator<CodedString> comparatorByText = new Comparator<CodedString>() {
#Override
public int compare(CodedString cs1, CodedString cs2) {
// Compare cs1 and cs2 using just the "text" field
}
}
// Here we pass the comparatorByText callback to Collections.sort(...)
// Collections.sort(...) will then call this callback whenever it
// needs to compare two items from the list being sorted.
// As a result, we will get the list sorted by just the "text" field.
// If we do not pass a callback, Collections.sort will use the default
// comparison for the class (first by "code", then by "text").
Collections.sort(codedStrings, comparatorByText);
}

Strictly speaking, the concept of a callback function does not exist in Java, because in Java there are no functions, only methods, and you cannot pass a method around, you can only pass objects and interfaces. So, whoever has a reference to that object or interface may invoke any of its methods, not just one method that you might wish them to.
However, this is all fine and well, and we often speak of callback objects and callback interfaces, and when there is only one method in that object or interface, we may even speak of a callback method or even a callback function; we humans tend to thrive in inaccurate communication.
(Actually, perhaps the best approach is to just speak of "a callback" without adding any qualifications: this way, you cannot possibly go wrong.
See next sentence.)
One of the most famous examples of using a callback in Java is when you call an ArrayList object to sort itself, and you supply a comparator which knows how to compare the objects contained within the list.
Your code is the high-level layer, which calls the lower-level layer (the standard java runtime list object) supplying it with an interface to an object which is in your (high level) layer. The list will then be "calling back" your object to do the part of the job that it does not know how to do, namely to compare elements of the list. So, in this scenario the comparator can be thought of as a callback object.

A callback is commonly used in asynchronous programming, so you could create a method which handles the response from a web service. When you call the web service, you could pass the method to it so that when the web service responds, it call's the method you told it ... it "calls back".
In Java this can commonly be done through implementing an interface and passing an object (or an anonymous inner class) that implements it. You find this often with transactions and threading - such as the Futures API.
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/concurrent/Future.html

In Java, callback methods are mainly used to address the "Observer Pattern", which is closely related to "Asynchronous Programming".
Although callbacks are also used to simulate passing methods as a parameter, like what is done in functional programming languages.

can I add to a private list directly through the getter?

I realize I'm going to get flamed for not simply writing a test myself... but I'm curious about people's opinions, not just the functionality, so... here goes...
I have a class that has a private list. I want to add to that private list through the public getMyList() method.
so... will this work?
public class ObA{
private List<String> foo;
public List<String> getFoo(){return foo;}
}
public class ObB{
public void dealWithObAFoo(ObA obA){
obA.getFoo().add("hello");
}
}

Yes, that will absolutely work - which is usually a bad thing. (This is because you're really returning a reference to the collection object, not a copy of the collection itself.)
Very often you want to provide genuinely read-only access to a collection, which usually means returning a read-only wrapper around the collection. Making the return type a read-only interface implemented by the collection and returning the actual collection reference doesn't provide much protection: the caller can easily cast to the "real" collection type and then add without any problems.

Indeed, not a good idea. Do not publish your mutable members outside, make a copy if you cannot provide a read-only version on the fly...
public class ObA{
private List<String> foo;
public List<String> getFoo(){return Collections.unmodifiableList(foo);}
public void addString(String value) { foo.add(value); }
}

If you want an opinion about doing this, I'd remove the getFoo() call and add an add(String msg) and remove(String msg) methods (or whatever other functionality you want to expose) to ObA

Giving access to collection always seems to be a bad thing in my experience--mostly because they are virtually impossible to control once they get out. I've taken to the habit of NEVER allowing direct access to collections outside the class that contains them.
The main reasoning behind this is that there is almost always some sort of business logic attached to the collection of data--for instance, validation on addition or perhaps some day you'll need to add a second closely-related collection.
If you allow access like you are talking about, it will be very difficult in the future to make a modification like this.
Oh, also, I often find that I eventually have to store a little more data with the object I'm storing--so I create a new object (only known inside the "Container" that houses the collection) and I put the object inside that before putting it in the collection.
If you've kept your collection locked down, this is a trivial refactor. Try to imagine how difficult it would be in some case you've worked on where you didn't keep the collection locked down...

If you wanted to support add and remove functions to Foo, I would suggest the methods addFoo() and removeFoo(). I ideally you could eliminate the getFoo at together by creating a method for each piece of functionality you need. This make it clear as to the functions a caller will preform on the list.

Should Java method arguments be used to return multiple values?

Since arguments sent to a method in Java point to the original data structures in the caller method, did its designers intend for them to used for returning multiple values, as is the norm in other languages like C ?
Or is this a hazardous misuse of Java's general property that variables are pointers ?

A long time ago I had a conversation with Ken Arnold (one time member of the Java team), this would have been at the first Java One conference probably, so 1996. He said that they were thinking of adding multiple return values so you could write something like:
x, y = foo();
The recommended way of doing it back then, and now, is to make a class that has multiple data members and return that instead.
Based on that, and other comments made by people who worked on Java, I would say the intent is/was that you return an instance of a class rather than modify the arguments that were passed in.
This is common practice (as is the desire by C programmers to modify the arguments... eventually they see the Java way of doing it usually. Just think of it as returning a struct. :-)
(Edit based on the following comment)
I am reading a file and generating two
arrays, of type String and int from
it, picking one element for both from
each line. I want to return both of
them to any function which calls it
which a file to split this way.
I think, if I am understanding you correctly, tht I would probably do soemthing like this:
// could go with the Pair idea from another post, but I personally don't like that way
class Line
{
// would use appropriate names
private final int intVal;
private final String stringVal;
public Line(final int iVal, final String sVal)
{
intVal = iVal;
stringVal = sVal;
}
public int getIntVal()
{
return (intVal);
}
public String getStringVal()
{
return (stringVal);
}
// equals/hashCode/etc... as appropriate
}
and then have your method like this:
public void foo(final File file, final List<Line> lines)
{
// add to the List.
}
and then call it like this:
{
final List<Line> lines;
lines = new ArrayList<Line>();
foo(file, lines);
}

In my opinion, if we're talking about a public method, you should create a separate class representing a return value. When you have a separate class:
it serves as an abstraction (i.e. a Point class instead of array of two longs)
each field has a name
can be made immutable
makes evolution of API much easier (i.e. what about returning 3 instead of 2 values, changing type of some field etc.)
I would always opt for returning a new instance, instead of actually modifying a value passed in. It seems much clearer to me and favors immutability.
On the other hand, if it is an internal method, I guess any of the following might be used:
an array (new Object[] { "str", longValue })
a list (Arrays.asList(...) returns immutable list)
pair/tuple class, such as this
static inner class, with public fields
Still, I would prefer the last option, equipped with a suitable constructor. That is especially true if you find yourself returning the same tuple from more than one place.

I do wish there was a Pair<E,F> class in JDK, mostly for this reason. There is Map<K,V>.Entry, but creating an instance was always a big pain.
Now I use com.google.common.collect.Maps.immutableEntry when I need a Pair

See this RFE launched back in 1999:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4222792
I don't think the intention was to ever allow it in the Java language, if you need to return multiple values you need to encapsulate them in an object.
Using languages like Scala however you can return tuples, see:
http://www.artima.com/scalazine/articles/steps.html
You can also use Generics in Java to return a pair of objects, but that's about it AFAIK.
EDIT: Tuples
Just to add some more on this. I've previously implemented a Pair in projects because of the lack within the JDK. Link to my implementation is here:
http://pbin.oogly.co.uk/listings/viewlistingdetail/5003504425055b47d857490ff73ab9
Note, there isn't a hashcode or equals on this, which should probably be added.
I also came across this whilst doing some research into this questions which provides tuple functionality:
http://javatuple.com/
It allows you to create Pair including other types of tuples.

You cannot truly return multiple values, but you can pass objects into a method and have the method mutate those values. That is perfectly legal. Note that you cannot pass an object in and have the object itself become a different object. That is:
private void myFunc(Object a) {
a = new Object();
}
will result in temporarily and locally changing the value of a, but this will not change the value of the caller, for example, from:
Object test = new Object();
myFunc(test);
After myFunc returns, you will have the old Object and not the new one.
Legal (and often discouraged) is something like this:
private void changeDate(final Date date) {
date.setTime(1234567890L);
}
I picked Date for a reason. This is a class that people widely agree should never have been mutable. The the method above will change the internal value of any Date object that you pass to it. This kind of code is legal when it is very clear that the method will mutate or configure or modify what is being passed in.
NOTE: Generally, it's said that a method should do one these things:
Return void and mutate its incoming objects (like Collections.sort()), or
Return some computation and don't mutate incoming objects at all (like Collections.min()), or
Return a "view" of the incoming object but do not modify the incoming object (like Collections.checkedList() or Collections.singleton())
Mutate one incoming object and return it (Collections doesn't have an example, but StringBuilder.append() is a good example).
Methods that mutate incoming objects and return a separate return value are often doing too many things.

There are certainly methods that modify an object passed in as a parameter (see java.io.Reader.read(byte[] buffer) as an example, but I have not seen parameters used as an alternative for a return value, especially with multiple parameters. It may technically work, but it is nonstandard.

It's not generally considered terribly good practice, but there are very occasional cases in the JDK where this is done. Look at the 'biasRet' parameter of View.getNextVisualPositionFrom() and related methods, for example: it's actually a one-dimensional array that gets filled with an "extra return value".
So why do this? Well, just to save you having to create an extra class definition for the "occasional extra return value". It's messy, inelegant, bad design, non-object-oriented, blah blah. And we've all done it from time to time...

Generally what Eddie said, but I'd add one more:
Mutate one of the incoming objects, and return a status code. This should generally only be used for arguments that are explicitly buffers, like Reader.read(char[] cbuf).

I had a Result object that cascades through a series of validating void methods as a method parameter. Each of these validating void methods would mutate the result parameter object to add the result of the validation.
But this is impossible to test because now I cannot stub the void method to return a stub value for the validation in the Result object.
So, from a testing perspective it appears that one should favor returning a object instead of mutating a method parameter.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.