Replacing all .equals(string) with == (int) - java

My java project uses objects that have an ArrayList of Strings. These strings are, however, predetermined. There are only 47 or so possible Strings. Now I want to check if an ArrayList contains a certain String.
I currently do this through .equals() , but I believe that's quite slow, so I want to switch to == (int). It is, however, quite annoying to manually turn all .equals() into == (int). Especially when things have to be changed again.
So is there a tool that turns all these .equals() to == (int) for a set of Strings?

Beware of premature optimisation (it's the root of all evil). There are plenty of resources describing this in detail. In short, it's important to measure the performance impact of any optimisation you make and weigh it against code readability. In this case, I would imagine you are reducing readability for negligible affect on performance.
It's also worth noting that accurate measuring performance is an entire subject on its own. It's difficult to get meaningful results if the differences are small because there are about a million factors other than the actual code which can affect performance - especially Java, with it's hotspot compiler.
Without seeing code, it's hard to appreciate your exact application. However, you may be able to improve performance without decreasing (or potentially improving?) readability by using enums in an EnumSet. Again, if improving performance is your primary goal, it's important to measure to see if your changes actually improve anything.
You may be also be able to simplify your existing String#equals code using [List#contains](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/List.html#contains(java.lang.Object)

If they are all predetermined can you just use an enum? Enums are basically a list of named numbers with some extra helper functions.
public enum myStrings
{
firstString,
secondString,
lastString
}

Related

Returning empty collections in accordance with Effective Java

Effective Java, Item 43 states:
return unmodifiable empty collections instead of null.
So far so good. Are there any guidelines, exactly what to return? Does this question even make sense? What I am thinking about is:
Does it make a difference whether you return an emtpy LinkedList<> or ArrayList<>(0)?
Does it make a difference whether you return an empty HashMap<> or TreeMap<>?
etc.
Performance difference? Hardly.
Memory footprint? Maybe.
CPU footprint? Maybe.
Should these static returns be declared globally (i.e., cached)?
They're already cached for you by the Collections class, which contains some utility methods.
You can use Collections.emptySet(), Collections.emptyMap() and Collections.emptyList() that return immutable empty collections. Just as long as you're using the Set, Map and List interfaces in your code, as you should.
There are also methods for returning (again immutable) collections containing a single instance, such as Collections.singletonList(mySingleElement).
They don't really affect performance, but they do make your code clearer:
return Collections.unmodifiableList(new ArrayList<>());
vs.
return Collections.emptyList();
You can also find Collections.EMPTY_LIST etc. but when using the methods you avoid getting warnings due to (lack of) generics.
The only thing you care about here: making sure that the returned collection is immutable.
And you get that when usingthe various emptyXyz() methods from the Collection class. In that sense: just make sure that you always use these methods; and don't waste time thinking about alternatives.
As in: don't start worrying about performance unless you do something in the order of millions of calls in a brief period of time. It is much more important that you write simple, clean, readable code. Because when you do that, you (most often) also have code that isn't dramatically "slow".
Meaning: of course, one should avoid outright stupid performance mistakes. But when that part is covered - don't assume you should worry about performance. Performance becomes an issue when customers complain. And then you profile what your code is doing, to find the true trouble makers. And if you then find emptyList() to be one thing that is more expensive than anything else and costing you too much - then come back and write up a question about that. (which will probably never happen)

Should Guava Splitters/Joiners be created each time they are used?

Guava contains utilities for splitting and joining Strings, but it requires the instantiation of a Splitter/Joiner object to do so. These are small objects that typically only contain the character(s) on which to split/join. Is it a good idea to maintain references to these objects in order to reuse them, or is it preferable to just create them whenever you need them and let them be garbage collected?
For example, I could implement this method in the following two ways:
String joinLines(List<String> lines) {
return Joiner.on("\n").join(lines);
}
OR
static final Joiner LINE_JOINER = Joiner.on("\n");
String joinLines(List<String> lines) {
return LINE_JOINER.join(lines);
}
I find the first way more readable, but it seems wasteful to create a new Joiner each time the method is called.
To be honest, this sounds like premature optimization to me. I agree with #Andy Turner, write whatever is easiest to understand and maintain.
If you plan to use Joiner.on("\n") in a few places, make it a well named constant; go with option two.
If you only plan to use it in your joinLines method, a constant seems overly verbose; go with option one.
It depends greatly on how often you expect the code to be called and what tradeoffs you want to make between CPU time, memory consumption and readability. Since Joiner is such a small thing, it's not going to make a huge difference either way: if you make it a constant, you save the (fairly minimal) costs of allocating it and GCing it for each call, while adding the (fairly minimal) memory consumption overhead to the program.
It also depends in part on what platform you're running the code on: if you're running on the server, typically you'll have plenty of memory so keeping a constant won't be an issue. On the other hand, if you're running on Android you're more memory constrained, but you also want to avoid unnecessary allocations since garbage collection is going to be worse and more impactful to your performance.
Personally, I tend to allocate a constant unless I know it's only going to be used some fixed number of times as opposed to repeatedly throughout the program.

Are variable definitions that are used once optimized?

Consider the following method:
private static long maskAndNegate(long l) {
int numberOfLeadingZeros = Long.numberOfLeadingZeros(l)
long mask = CustomBitSet.masks[numberOfLeadingZeros];
long result = (~l) & mask;
return result;
}
The method can be abbreviated to:
private static long maskAndNegate(long l) {
return (~l) & CustomBitSet.masks[Long.numberOfLeadingZeros(l)];
}
Are these two representations equal in actual run time? In other words, does the Java compiler optimize away the unnecessary definition of extra variables, that I've placed for readability and debugging?
The Java compiler itself hardly does any optimization. It's the JIT that does almost everything.
Local variables themselves are somewhat irrelevant to optimization though - a multi-operator expression still needs the various operands logically to go on the stack, just in unnamed "slots". You may well find that the generated bytecode for your two implementations if very similar, just without the names in the second case.
More importantly, any performance benefit that might occur very occasionally from reducing the number of local variables you use is almost certainly going to be insignificant. The readability benefit of the first method is much more likely to be significant. As always, avoid micro-optimizing without first having evidence that the place you're trying to optimize is a bottleneck, and then only allow optimizations which have proved their worth.
(By the time you've proved you need to optimize a particular method, you'll already have the tools to test any potential optimization, so you won't need to guess.)
The code is not significantly large enough to be optimized for starters. And in the second way you are just saving the memory used for storing references to numberOfLeadingZeros and all.
But when you will use this code significantly enough on runtime such as 10000 times at least, then JIT will identify it as HOT code and then optimize it with neat tricks such as Method Inlining and similar sorts.
But in your case preferable option is first one as it is more readable.
You should not compromise Readability for small optimizations.

Is there any difference between these statements?

Is there any difference between:
String x = getString();
doSomething(x);
vs.
doSomething(getString());
Resources and performance wise, Especially is it's done within a loop for tens, hundreds or thousands of times?
It has the same overhead. Local variables are just there to make your life easier. At the VM level they don't necessarily exist and certainly not anymore when machine code is run.
So what you need to worry about here is getString(), whether it is potentially expensive. x has very likely no effect at all.
Let me first begin by saying that your overriding goal should almost always be to maintain code readability. Your compiler is almost always better at trivial optimizations than you are. Trust it!
In response to your specific example: the bytecode generated for each example IS different. It didn't appear to make much difference though, because there wasn't a statistically significant or even consistent difference between the two approaches in a loop over Integer.MAX_VALUE iterations.
I believe both would be the same at compile time, the first may become more code readable in some cases though.
Both the statements are same. only difference is that in first approach you have used a local variable X which can be avoided using second syntax.
That largely depends on the use-case. Are you going to make repeated calls to doSomething using that exact String? Then using the local variable is a bit more efficient. However, if it's a single call or multiple calls with different Strings, it makes no difference.

array of structures, or structure of arrays?

Hmmm. I have a table which is an array of structures I need to store in Java. The naive don't-worry-about-memory approach says do this:
public class Record {
final private int field1;
final private int field2;
final private long field3;
/* constructor & accessors here */
}
List<Record> records = new ArrayList<Record>();
If I end up using a large number (> 106 ) of records, where individual records are accessed occasionally, one at a time, how would I figure out how the preceding approach (an ArrayList) would compare with an optimized approach for storage costs:
public class OptimizedRecordStore {
final private int[] field1;
final private int[] field2;
final private long[] field3;
Record getRecord(int i) { return new Record(field1[i],field2[i],field3[i]); }
/* constructor and other accessors & methods */
}
edit:
assume the # of records is something that is changed infrequently or never
I'm probably not going to use the OptimizedRecordStore approach, but I want to understand the storage cost issue so I can make that decision with confidence.
obviously if I add/change the # of records in the OptimizedRecordStore approach above, I either have to replace the whole object with a new one, or remove the "final" keyword.
kd304 brings up a good point that was in the back of my mind. In other situations similar to this, I need column access on the records, e.g. if field1 and field2 are "time" and "position", and it's important for me to get those values as an array for use with MATLAB, so I can graph/analyze them efficiently.
The answers that give the general "optimise when you have to" is unhelpful in this case because , IMHO, programmers should always be aware of the performance in different in design choices when that choice leads to an order of magnitude performance penalty, particularly API writers.
The original question is quite valid and I would tend to agree that the second approach is better, given his particular situation. I've written image processing code where each pixel requires a data structure, a situation not too dissimilar to this, except I needed frequent random access to each pixel. The overhead of creating one object for each pixel was enormous.
The second version is much, much worse. Instead of resizing one array, you're resizing three arrays when you do an insert or delete. What's more, the second version will lead to the creation of many more temporary objects and it will do so on accesses. That could lead to a lot of garbage (from a GC point of view). Not good.
Generally speaking, you should worry about how you use the objects long before you think about performance. So you have a record with three fields or three arrays. Which one more accurately depicts what you're modeling? By this I mean, when you insert or delete an item, are you doing one of the three arrays or all three as a block?
I suspect it's the latter in which case the former makes far more sense.
If you're really concerned about insertion/deletion performance then perhaps a different data structure is appropriate, perhaps a SortedSet or a Map or SortedMap.
If you have millions of records, the second approach has several advantages:
Memory usage: the first approach uses more memory because a) every Java object in heap has a header (containing class id, lock state etc.); b) objects are aligned in memory; c) each reference to an object costs 4 bytes (on 64-bit JVMs with Compressed OOPs or 32-bit JVMs) or 8 bytes (64-bit JVMs without Compressed OOPs). See e. g. CompressedOops for more details. So the first approach takes about two times more memory (more precisely: according to my benchmark, an object with 16 bytes of payload + a reference to it took 28 bytes on 32-bit Java 7, 36 bytes on 64-bit Java 7 with compressed OOPs, and 40 bytes on 64-bit Java 7 w/o compressed OOPs).
Garbage collection: although the second approach seems to create many objects (one on each call of getRecord), it might not be so, as modern server JVMs (e. g. Oracle's Java 7) can apply escape analysis and stack allocation to avoid heap allocation of temporary objects in some cases; anyway, GCing short-lived objects is cheap. On the other hand, it is probably easier for the garbage collector if there are not millions of long-lived objects (as there are in the first approach) whose reachability to check (or at least, such objects may make your application need more careful tuning of GC generation sizes). Thus the second approach may be better for GC performance. However, to see whether it makes a difference in the real situation, one should make a benchmark oneself.
Serialization speed: the speed of (de)serializing a large array of primitives on disk is only limited by HDD speed; serializing many small objects is inevitably slower (especially if you use Java's default serialization).
Therefore I have used the second approach quite often for very large collections. But of course, if you have enough memory and don't care about serialization, the first approach is simpler.
How are you going to access the data? If the accesses over the fields are always coupled, then use the first option, if you are going to process the fields by its own, then the second option is better.
See this article in wikipedia: Parallel Array
A good example about when it's more convenient to have separate arrays could be simulations where the numerical data is packed together in the same array, and other attributes like name, colour, etc. that are accessed just for presentation of the data in other array.
I was curious so I actually ran a benchmark. If you don't re-create the object like you are[1], then SoA beats AoS by 5-100% depending on workload[2]. See my code here:
https://gist.github.com/twolfe18/8168262c5420c7a62d39
[1] I didn't add that because if you are concerned enough about speed to consider this refactor, it would be silly to do that.
[2] This also doesn't account for re-allocation, but again, this is often something you can either amortize away or know statically. This is a reasonable assumption for a pure-speed benchmark.
Notice that the second approach might have negative impact on caching behaviour. If you want to access a single record at a time, you'd better have that record not scattered all across the place.
Also, the only memory you win in the second approach, is (possibly) due to member alignment. (and having to allocate a separate object).
Otherwise, they have exactly the same memory use, asymptotically. The first option is much better due to locality, IMO
Whenever I have tried doing number crunching in Java, I have always had to revert to C-style coding (i.e. close to your option 2). It minimised the number of objects floating around in your system, as instead of 1,000,000 objects, you only have 3. I was able to do a bit of FFT analysis of real-time sound data using the C-style, and it was far too slow using objects.
I'd choose the first method (array of structures) unless you access the store relatively infrequently and are running into serious memory pressure issues.
First version basically stores the objects in their "natural" form (+1 BTW for using immutable records). This uses a little more memory because of the per-object overhead (probably around 8-16 bytes depending on your JVM) but is very good for accessing and returning objects in a convenient and human-understandable form in one simple step.
Second version uses less memory overall, but the allocation of a new object on every "get" is a pretty ugly solution that will not perform well if accesses are frequent.
Some other possibilities to consider:
An interesting "extreme" variant would be to take the second version but write your algorithms / access methods to interact with the underlying arrays directly. This is clearly going to result in complex inter-dependencies and some ugly code, but would probably give you the absolute best performance if you really needed it. It's quite common to use this approach for intensive graphics applications such as manipulating a large array of 3D coordinates.
A "hybrid" option would be to store the underlying data in a structure of arrays as in the second version, but cache the accessed objects in a HashMap so that you only generate the object the first time a particular index is accessed. Might make sense if only a small fraction of objects are ever likely to accessed, but all data is needed "just in case".
(Not a direct answer, but one that I think should be given)
From your comment,
"cletus -- I greatly respect your thoughts and opinions, but you gave me the high-level programming & software design viewpoint which is not what I'm looking for. I cannot learn to ignore optimization until I can get an intuitive sense for the cost of different implementation styles, and/or the ability to estimate those costs. – Jason S Jul 14 '09 at 14:27"
You should always ignore optimization until it presents itself as a problem. Most important is to have the system be usable by a developer (so they can make it usable by a user). There are very few times that you should concern yourself with optimization, in fact in ~20 years of professional coding I have cared about optimization a total of two times:
Writing a program that had its primary purpose to be faster than another product
Writing a smartphone app with the intention of reducing the amount of data sent between the client and server
In the first case I wrote some code, then ran it through a profiler, when I wanted to do something and I was not sure which approach was best (for speed/memory) I would code one way and see the result in the profiler, then code the other way and see the result. Then I would chose the faster of the two. This works and you learn a lot about low level decisions. I did not, however, allow it to impact the higher level classes.
In the second case, there was no programming involved, but I did the same basic thing of looking at the data being sent and figuring out how to reduce the number of messages being sent as well as the number of bytes being sent.
If your code is clear then it will be easier to speed up once you find out it is slow. As Cletus said in his answer, you are resizing one time -vs- three times... one time will be faster than three. From a higher point of view the one time is simpler to understand than the three times, thus it is more likely to be correct.
Personally I'd rather get the right answer slowly then the wrong answer quickly. Once I know how to get the right answer then I can find out where the system is slow and replace those parts of it with faster implementations.
Because you are making the int[] fields final, you are stuck with just the one initialization of the array and that is it. Thus, if you wanted 10^6 field1's, Java would need to separate that much memory for each of those int[], because you cannot reassign the size of those arrays. With an ArrayList, if you do not know the number of records beforehand and will be removing records potentially, you save a lot of space upfront and then later on as well when you go to remove records.
I would go for the ArrayList version too, so I don't need to worry about growing it. Do you need to have a column like access to values? What is your scenario behind your question?
Edit You could also use a common long[][] matrix.
I don't know how you pass the columns to Matlab, but I guess you don't gain much speed with a column based storage, more likely you loose speed in the java computation.

Categories