SQL: Long comparison or String, which one is faster - java

I'm running this from a Java program that connects to my SQL server on the same machine.
Basically, I'm trying to look up a certain 'String', which can be identified either by the string itself or by its already stored 'long' (int64), a unique value derived from the string.
So my question is: would a long comparison in a SQL lookup be faster than a String comparison, or wouldn't it matter that much?
SELECT * FROM playerAccount WHERE playerName = {string in Java}
or
SELECT * FROM playerAccount WHERE nameHash = {long in Java}
Thanks in advance ;)

The comparison operation itself is rather negligible. However, in general in computer code, the comparison of the long is going to use fewer cycles than the comparison for a string.
The reason is that comparing the bits in a numeric value is unambiguous and the code doesn't need to worry about the length of the value. When comparing strings, the underlying code has to "parse" the strings, character by character, to make the comparison, figure out where they end, and handle collations and character pages.
But this is rather unimportant. For speed, you want an index. And although an index using the numeric value might be an iota faster than an index using a string, this is the wrong criterion for choosing which to use. Your code should be designed to function correctly and to be maintainable. It is doubtful that an optimization of this sort would ever be necessary to achieve a real-world goal.

Generally comparing long values is faster than comparing string values.
If the string and the long are both stored in a database, the issue is not the comparison but the presence or absence of an index.
So the better solution is to use the long value with an index in the database.
Please note that if "nameHash" is the hashCode of the field "playerName", the two searches don't necessarily return the same record: two players with different names can have the same hashCode. So consider exactly what your needs are and update the code if necessary.
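To illustrate the collision point, here is a minimal sketch. It assumes nameHash is the 32-bit String.hashCode() widened to a long, which the question does not actually state; the class name, the array-based "table", and the sample names are hypothetical.

```java
final class NameHashLookup {
    // Assumption: the stored hash is String.hashCode() widened to a long.
    static long nameHash(String playerName) {
        return playerName.hashCode();
    }

    // Compares the hash first (the cheap check), then confirms the name,
    // because different names can share the same hash.
    // Returns the index of the matching row, or -1 if there is none.
    static int find(String[] playerNames, long[] nameHashes, String wanted) {
        long hash = nameHash(wanted);
        for (int i = 0; i < playerNames.length; i++) {
            if (nameHashes[i] == hash && playerNames[i].equals(wanted)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] names = {"Aa", "BB"};
        long[] hashes = {nameHash("Aa"), nameHash("BB")};
        System.out.println(hashes[0] == hashes[1]);     // true: both hash to 2112
        System.out.println(find(names, hashes, "BB"));  // 1
    }
}
```

In SQL terms, this corresponds to filtering on nameHash and playerName together rather than on nameHash alone.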

Related

HashMap - Fastest access using a key with several fields

I have an object that is identified by 3 fields. One of them is a String that represents 6 hex bytes; the other two are integers of no more than 1 byte each. Summed up, this is 8 bytes of data, which fits in a 64-bit integer.
I need to map these objects for fast access, and I can think of two approaches:
Use the 3 fields to generate a 64-bit key used to map the objects. This, however, would mean parsing the String to hex for every access (and there will be a lot of accesses, which need to be fast).
Use 3 HashMap levels, each nested inside the next, to represent the 3 identifying fields.
My question is which of these approaches should be the fastest.
Why not use a MultiKeyMap?
This might be not related to your question.
I have a suggestion for you.
Create an object with the 3 attributes that will form the key. Use the object as the key because it will be unique.
Map<ObjectKey,Object> map = new HashMap<>();
Does this make sense for your use case? If you can add a bit more explanation, maybe I can suggest further possible solutions.
EDIT: You can override the equals and do something using this kind of logic:
// A HashMap key must also override hashCode() consistently with equals().
@Override
public boolean equals(Object obj) {
    if (!(obj instanceof ObjectKey))
        return false;
    ObjectKey objectKey = (ObjectKey) obj;
    return this.key1.equals(objectKey.key1) && this.key2.equals(objectKey.key2) &&
    ...
    this.keyN.equals(objectKey.keyN);
}
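A complete key class along these lines might look like the following sketch. The field names (hexId, fieldA, fieldB) are hypothetical stand-ins for the three identifying fields in the question; the important part is that equals() and hashCode() are overridden together.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class ObjectKey {
    private final String hexId; // hypothetical name for the 6-hex-byte field
    private final int fieldA;   // hypothetical name for the first small integer
    private final int fieldB;   // hypothetical name for the second small integer

    ObjectKey(String hexId, int fieldA, int fieldB) {
        this.hexId = Objects.requireNonNull(hexId);
        this.fieldA = fieldA;
        this.fieldB = fieldB;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof ObjectKey)) return false;
        ObjectKey other = (ObjectKey) obj;
        // Compare the cheap int fields first, then the string.
        return fieldA == other.fieldA && fieldB == other.fieldB && hexId.equals(other.hexId);
    }

    @Override
    public int hashCode() {
        // Must use the same fields as equals().
        return Objects.hash(hexId, fieldA, fieldB);
    }

    public static void main(String[] args) {
        Map<ObjectKey, String> map = new HashMap<>();
        map.put(new ObjectKey("0A1B2C3D4E5F", 1, 2), "value");
        System.out.println(map.get(new ObjectKey("0A1B2C3D4E5F", 1, 2))); // value
        System.out.println(map.get(new ObjectKey("0A1B2C3D4E5F", 1, 3))); // null
    }
}
```

Keeping the fields final makes the key immutable, so its hash code cannot change while it is stored in the map.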
I would take the following steps:
Write it in the most readable way first, and profile it.
Refactor it to an implementation you think might be faster, then profile it again.
Compare.
Repeat.
Your key fits into a 64-bit value. Assuming you will build the HashMap in one go and then read from it multiple times (using it as a lookup table), my hunch is that using a Long type as the key of your HashMap will be about as fast as you can get.
You are concerned about having to parse the string as a hex number every time you look up a key in the map. What's the alternative? If you use a key containing the three separate fields, you will still have to parse the string to calculate its hash code (or, rather, the Java API implementation will calculate its hash code by parsing the string contents). The HashMap will not only call String.hashCode() but also String.equals(), so your string will be iterated twice. By contrast, calculating a Long and comparing it to the precalculated keys in the HashMap will consist of iterating the string only once.
If you use three levels of HashMap, as per your second suggestion, you will still have to calculate the hash code of your string, as well as having to look up the values of all three fields anyway, so the multi-level map doesn't give you any performance advantage.
You should also experiment with the HashMap constructor arguments to get the most efficiency. These will determine how efficiently your data will get spread into separate buckets.
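As a rough illustration of the Long-key approach, here is one possible way to pack the three fields. It assumes the string holds exactly 12 hex digits (6 bytes) and that both integers are in the range 0..255; the class name, method name, and bit layout are hypothetical choices, and input validation is kept minimal.

```java
import java.util.HashMap;
import java.util.Map;

final class PackedKey {
    // Layout (an arbitrary choice): bits 63..16 = hex field, bits 15..8 = a, bits 7..0 = b.
    static long pack(String hex12, int a, int b) {
        if (hex12.length() != 12) {
            throw new IllegalArgumentException("expected 12 hex digits");
        }
        if (a < 0 || a > 255 || b < 0 || b > 255) {
            throw new IllegalArgumentException("a and b must fit in one byte");
        }
        // 12 hex digits are at most 48 bits, so the value fits in a signed long.
        long high = Long.parseLong(hex12, 16);
        return (high << 16) | ((long) a << 8) | b;
    }

    public static void main(String[] args) {
        Map<Long, String> map = new HashMap<>();
        map.put(pack("0A1B2C3D4E5F", 1, 2), "value");
        System.out.println(map.get(pack("0A1B2C3D4E5F", 1, 2)));         // value
        System.out.println(Long.toHexString(pack("000000000001", 2, 3))); // 10203
    }
}
```

Whether this actually beats a composite key object is something to confirm by profiling, as suggested above.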

In hibernate search Integer value when indexed appears to be stored as characters

I have a base class with
@Field
protected Integer group;
on hibernate 5.6.0.Final
I set my objects' value to 0 or 1. But when I inspect the index using Luke, it always shows 4 stored rows as h, p, x ,
My tests work fine actually, when I add a MustJunction with range query on one of the group I get properly filtered results back. Maybe I am interpreting luke wrong...?
Hibernate Search stores numeric values as numeric fields in Lucene, by default. Which means that even if the value was stored in the index as is, you wouldn't have the "0" string or the "1" string in your index, but some binary value.
But even with that in mind, you're probably surprised to see different binary encoding for identical source values. It's an optimization: remember you're looking at the content of an inverted index, whose purpose isn't to look up values for a given document but to find documents matching a particular value.
If you're interested in how numeric indexing works in Lucene, you can have a look at the IntField javadoc. But since your queries work, it would really only be out of curiosity :)

Is hashcode() a good way to compute an unique token from a list of informations?

I have an
HashMap<String,AnObject>
and I'd like to give the string key a value from some infos the AnObject value contains.
Suppose AnObject is made this way:
public class AnObject {
public String name;
public String surname;
}
Is it correct to assign the key to:
String.valueOf(o.name.hashcode()+o.surname.hashcode());
? Or Is there a better way to compute a String hash code from a value list?
No, absolutely not. hashCode() is not guaranteed to be unique.
The rules of a hash code are simple:
Two equal values must have the same hash code
Two non-equal values will ideally have different hash codes, but can have the same hash code. In particular, there are only 2^32 possible values to return from hashCode(), but more than 2^32 possible strings, making uniqueness impossible.
The hash code of an object should not change unless some equality-sensitive aspect of it changes. Indeed, it's generally a good idea to make types implementing value equality immutable, at least in equality-sensitive aspects. Otherwise you can easily find that you can't look up an entry using the exact same object reference that you previously used for the key!
Hash codes are an optimization technique to make it quick to find a "probably small" set of candidate values equal to some target, which you then iterate through with a rigorous equality check to find whether any of them is actually equal to the target. That's what lets you quickly look something up by key in a hash-based collection. The key isn't the hash itself.
If you need to create a key from two strings, you're going to basically have to make it from those two strings (with some sort of delimiter so you can tell the difference between {"a", "bc"} and {"ab", "c"} - understanding that the delimiter itself might appear in the values if you're not careful).
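One way to sidestep the delimiter problem is to prefix the first string with its length. The following is only a sketch of that idea, not the only option; the class and method names are hypothetical.

```java
final class CompositeKey {
    // The length prefix makes the encoding unambiguous even if either part
    // contains the ':' separator: the digits before the first ':' say exactly
    // how many characters belong to the first part.
    static String of(String first, String second) {
        return first.length() + ":" + first + second;
    }

    public static void main(String[] args) {
        System.out.println(of("a", "bc")); // 1:abc
        System.out.println(of("ab", "c")); // 2:abc
    }
}
```

Unlike summing two hash codes, two different pairs of strings can never produce the same key here.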
See Eric Lippert's blog post on the topic for more information; that's based on .NET rather than Java, but the same points apply. It's also worth understanding that the semantics of hashCode aren't necessarily the same as those of a cryptographic hash. In particular, it's fine for the result of hashCode() to change if you start a new JVM but create an object with the same fields - no-one should be persisting the results of hashCode. That's not the case with something like SHA-256, which should be permanently stable for a particular set of data.
The hash code for String is lossy; many String values will result in the same hash code. An integer has 32 bit positions and each position has two values. There's no way to map even just the 32-character strings (for instance) (each character having lots of possibilities) into 32 bits without collisions. They just won't fit.
If you want to use arbitrary precision arithmetic (say, BigInteger), then you can just take each character as an integer and concatenate them all together.
No, hashCode() (BTW, pay attention to the case of the letter C) does not guarantee uniqueness. You can have a lot of objects that produce the same hash code.
If you need unique identifier use class java.util.UUID.

Query String with DatastoreService

Using the DatastoreService how can I do queries for String containing some string similar to Java String:
contains
startsWith
endsWith
When querying against a String property, exact matches are the easiest, since that behavior works "out of the box".
"startsWith" queries can be done fairly easily by turning property startsWith: abc into property >= 'abc' and property < 'abd', where you calculate the end of the range.
"endsWith" can be done by storing a reversed copy of the String, and creating a query as above, but with the target reversed. I.e., property endsWith: 'abc' becomes propertyReversed >= 'cba' and propertyReversed < 'cbb'.
"contains" is a large challenge. There are several approaches, and the right one depends on your situation. If the string is relatively short (e.g., a name or an address), you could store a list of its trailing substrings and match against them with a range query as above.
As Dave mentions in his answer, contains is not available as a Datastore primitive. If you're looking for containment queries, the Search API is a good place to look (note: it's still experimental).
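The range bounds and helper values described in the first answer can be computed in plain Java before building the query. This is a minimal sketch with hypothetical names; it deliberately leaves out the Datastore calls themselves, and it assumes the store orders strings the same way as incrementing the last char implies (surrogate pairs and other edge cases are ignored).

```java
import java.util.ArrayList;
import java.util.List;

final class StringQueryBounds {
    // Exclusive upper bound for a prefix range, e.g. "abc" -> "abd".
    static String upperBound(String prefix) {
        if (prefix.isEmpty()) {
            throw new IllegalArgumentException("prefix must not be empty");
        }
        int last = prefix.length() - 1;
        char c = prefix.charAt(last);
        if (c == Character.MAX_VALUE) {
            throw new IllegalArgumentException("last character cannot be incremented");
        }
        return prefix.substring(0, last) + (char) (c + 1);
    }

    // Value to store in (or query against) the reversed-copy property.
    static String reversed(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // All trailing substrings; "contains x" then becomes "some entry startsWith x".
    static List<String> trailingSubstrings(String value) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < value.length(); i++) {
            result.add(value.substring(i));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(upperBound("abc"));            // abd
        System.out.println(upperBound(reversed("abc")));  // cbb
        System.out.println(trailingSubstrings("hello"));  // [hello, ello, llo, lo, o]
    }
}
```

Note that the trailing-substring list grows with the length of the string, which is why that approach only suits short values.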

How to best represent Constants (Enums) in the Database (INT vs VARCHAR)?

What is the best solution, in terms of performance and "readability/good coding style", to represent a (Java) enumeration (a fixed set of constants) on the DB layer: an integer (or any number datatype in general) or a string representation?
Caveat: There are some database systems that support "Enums" directly, but this would require keeping the database enum definition in sync with the business-layer implementation. Furthermore, this kind of datatype might not be available on all database systems and might also differ in syntax, so I am looking for a solution that is easy to manage and available on all database systems. (So my question only addresses the number vs. string representation.)
The number representation of a constant seems to me very efficient to store (for example, it consumes only two bytes as an integer) and is most likely very fast in terms of indexing, but hard to read ("0" vs. "1" etc.).
The string representation is more readable (storing "enabled" and "disabled" compared to "0" and "1"), but consumes much more storage space and is most likely also slower with regard to indexing.
My question is: did I miss some important aspects? What would you suggest using for an enum representation on the database layer?
Thank you very much!
In most cases, I prefer to use a short alphanumeric code, and then have a lookup table with the expanded text. When necessary I build the enum table in the program dynamically from the database table.
For example, suppose we have a field that is supposed to contain, say, transaction type, and the possible values are Sale, Return, Service, and Layaway. I'd create a transaction type table with code and description, make the codes maybe "SA", "RE", "SV", and "LY", and use the code field as the primary key. Then in each transaction record I'd post that code. This takes less space than an integer key in the record itself and in the index. Exactly how it is processed depends on the database engine but it shouldn't be dramatically less efficient than an integer key. And because it's mnemonic it's very easy to use. You can dump a record and easily see what the values are and likely remember which is which. You can display the codes without translation in user output and the users can make sense of them. Indeed, this can give you a performance gain over integer keys: In many cases the abbreviation is good for the users -- they often want abbreviations to keep displays compact and avoid scrolling -- so you don't need to join on the transaction table to get a translation.
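On the Java side, the short-code scheme from this example might be mirrored roughly as follows. This is a hard-coded sketch for illustration only (the answer builds the table from the database instead); the enum and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

enum TransactionType {
    SALE("SA"), RETURN("RE"), SERVICE("SV"), LAYAWAY("LY");

    // Lookup from the stored two-letter code back to the constant.
    private static final Map<String, TransactionType> BY_CODE = new HashMap<>();

    static {
        for (TransactionType type : values()) {
            BY_CODE.put(type.code, type);
        }
    }

    private final String code;

    TransactionType(String code) {
        this.code = code;
    }

    // The value that would be written to the transaction record.
    String code() {
        return code;
    }

    static TransactionType fromCode(String code) {
        TransactionType type = BY_CODE.get(code);
        if (type == null) {
            throw new IllegalArgumentException("unknown transaction type code: " + code);
        }
        return type;
    }

    public static void main(String[] args) {
        System.out.println(LAYAWAY.code()); // LY
        System.out.println(fromCode("RE")); // RETURN
    }
}
```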
I would definitely NOT store a long text value in every record. Like in this example, I would not want to dispense with the transaction table and store "Layaway". Not only is this inefficient, but it is quite possible that someday the users will say that they want it changed to "Layaway sale", or even some subtle difference like "Lay-away". Then you not only have to update every record in the database, but you have to search through the program for every place this text occurs and change it. Also, the longer the text, the more likely that somewhere along the line a programmer will mis-spell it and create obscure bugs.
Also, having a transaction type table provides a convenient place to store additional information about the transaction type. Never ever ever write code that says "if whatevercode='A' or whatevercode='C' or whatevercode='X' then ..." Whatever it is that makes those three codes somehow different from all other codes, put a field for it in the transaction table and test that field. If you say, "Well, those are all the tax-related codes" or whatever, then fine, create a field called "tax_related" and set it to true or false for each code value as appropriate. Otherwise when someone creates a new transaction type, they have to look through all those if/or lists and figure out which ones this type should be added to and which it shouldn't. I've read plenty of baffling programs where I had to figure out why some logic applied to these three code values but not others, and when you think a fourth value ought to be included in the list, it's very hard to tell whether it is missing because it is really different in some way, or if the programmer made a mistake.
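The "test a field, not a list of codes" advice could be sketched like this. The map stands in for rows loaded from the transaction type table, and which types are marked tax-related is made up purely for illustration; all names are hypothetical.

```java
import java.util.Map;

final class TransactionTypes {
    // One row of the (hypothetical) transaction type table.
    static final class Info {
        final String description;
        final boolean taxRelated;

        Info(String description, boolean taxRelated) {
            this.description = description;
            this.taxRelated = taxRelated;
        }
    }

    // Stand-in for data loaded from the database; the flag values are invented.
    static final Map<String, Info> TABLE = Map.of(
            "SA", new Info("Sale", true),
            "RE", new Info("Return", true),
            "SV", new Info("Service", false),
            "LY", new Info("Layaway", false));

    // Callers ask about the attribute instead of hard-coding
    // "code is SA or code is RE" in many places.
    static boolean isTaxRelated(String code) {
        Info info = TABLE.get(code);
        if (info == null) {
            throw new IllegalArgumentException("unknown transaction type code: " + code);
        }
        return info.taxRelated;
    }

    public static void main(String[] args) {
        System.out.println(isTaxRelated("SA")); // true
        System.out.println(isTaxRelated("SV")); // false
    }
}
```

Adding a new transaction type then only requires a new row with the flag set appropriately, not a search through every conditional.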
The only time I don't create the translation table is when the list is very short, there is no additional data to keep, and it is clear from the nature of the universe that it is unlikely to ever change so the values can be safely hard-coded. Like true/false or positive/negative/zero or male/female. (And hey, even that last one, obvious as it seems, there are people insisting we now include "transgendered" and the like.)
Some people dogmatically insist that every table have an auto-generated sequential integer key. Such keys are an excellent choice in many cases, but for code lists, I prefer the short alpha key for the reasons stated above.
I would store the string representation, as this is easy to correlate back to the enum and much more stable. Using ordinal() would be bad because it can change if you add a new enum to the middle of the series, so you would have to implement your own numbering system.
In terms of performance, it all depends on what the enums would be used for, but it is most likely a premature optimization to develop a whole separate representation with conversion rather than just use the natural String representation.
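A small sketch of the string approach from this answer, using the standard Enum.name() and Enum.valueOf() methods. The Status enum and the helper names are hypothetical.

```java
final class StatusColumn {
    enum Status { ENABLED, DISABLED }

    // name() stays the same when constants are added or reordered,
    // whereas ordinal() shifts if a constant is inserted in the middle.
    static String toDb(Status status) {
        return status.name();
    }

    // valueOf() throws IllegalArgumentException for an unknown name.
    static Status fromDb(String value) {
        return Status.valueOf(value);
    }

    public static void main(String[] args) {
        System.out.println(toDb(Status.DISABLED)); // DISABLED
        System.out.println(fromDb("ENABLED"));     // ENABLED
    }
}
```

The trade-off is that renaming a constant then requires migrating the stored values.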
