How to best represent Constants (Enums) in the Database (INT vs VARCHAR)?

What is the best solution, in terms of performance and readability/good coding style, for representing a (Java) enumeration (a fixed set of constants) on the DB layer: an integer (or any numeric datatype in general) or a string representation?
Caveat: Some database systems support "Enums" directly, but this would require keeping the database enum definition in sync with the business-layer implementation. Furthermore, this kind of datatype might not be available on all database systems and the syntax might differ, so I am looking for a simple solution that is easy to manage and available on all database systems. (So my question only addresses the number vs. string representation.)
The number representation of a constant seems to me very efficient to store (for example, it consumes only two bytes as an integer) and is most likely very fast in terms of indexing, but it is hard to read ("0" vs. "1" etc.).
The string representation is more readable (storing "enabled" and "disabled" compared to "0" and "1"), but consumes much more storage space and is most likely also slower in regard to indexing.
My question is: did I miss any important aspects? What would you suggest using for an enum representation on the database layer?
Thank you very much!

In most cases, I prefer to use a short alphanumeric code, and then have a lookup table with the expanded text. When necessary I build the enum table in the program dynamically from the database table.
For example, suppose we have a field that is supposed to contain, say, transaction type, and the possible values are Sale, Return, Service, and Layaway. I'd create a transaction type table with code and description, make the codes maybe "SA", "RE", "SV", and "LY", and use the code field as the primary key. Then in each transaction record I'd post that code. This takes less space than an integer key in the record itself and in the index. Exactly how it is processed depends on the database engine but it shouldn't be dramatically less efficient than an integer key. And because it's mnemonic it's very easy to use. You can dump a record and easily see what the values are and likely remember which is which. You can display the codes without translation in user output and the users can make sense of them. Indeed, this can give you a performance gain over integer keys: In many cases the abbreviation is good for the users -- they often want abbreviations to keep displays compact and avoid scrolling -- so you don't need to join on the transaction table to get a translation.
I would definitely NOT store a long text value in every record. Like in this example, I would not want to dispense with the transaction table and store "Layaway". Not only is this inefficient, but it is quite possible that someday the users will say that they want it changed to "Layaway sale", or even some subtle difference like "Lay-away". Then you not only have to update every record in the database, but you have to search through the program for every place this text occurs and change it. Also, the longer the text, the more likely that somewhere along the line a programmer will mis-spell it and create obscure bugs.
Also, having a transaction type table provides a convenient place to store additional information about the transaction type. Never ever ever write code that says "if whatevercode='A' or whatevercode='C' or whatevercode='X' then ..." Whatever it is that makes those three codes somehow different from all other codes, put a field for it in the transaction table and test that field. If you say, "Well, those are all the tax-related codes" or whatever, then fine, create a field called "tax_related" and set it to true or false for each code value as appropriate. Otherwise when someone creates a new transaction type, they have to look through all those if/or lists and figure out which ones this type should be added to and which it shouldn't. I've read plenty of baffling programs where I had to figure out why some logic applied to these three code values but not others, and when you think a fourth value ought to be included in the list, it's very hard to tell whether it is missing because it is really different in some way, or if the programmer made a mistake.
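The idea of testing a stored attribute instead of a hard-coded list of codes could look roughly like this on the Java side. This is only a sketch: the enum name, the flag values and the lookup method are assumptions for illustration, and in a real system the flag would come from the tax_related column of the transaction type table rather than from source code.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical mirror of the transaction type table (code, description, tax_related).
enum TransactionType {
    // The tax_related values below are invented purely for illustration.
    SALE("SA", "Sale", true),
    RETURN("RE", "Return", true),
    SERVICE("SV", "Service", false),
    LAYAWAY("LY", "Layaway", false);

    private final String code;
    private final String description;
    private final boolean taxRelated;

    // Lookup by the short mnemonic code that is stored in each transaction record.
    private static final Map<String, TransactionType> BY_CODE =
            Arrays.stream(values())
                  .collect(Collectors.toMap(TransactionType::getCode, Function.identity()));

    TransactionType(String code, String description, boolean taxRelated) {
        this.code = code;
        this.description = description;
        this.taxRelated = taxRelated;
    }

    String getCode() { return code; }
    String getDescription() { return description; }
    boolean isTaxRelated() { return taxRelated; }

    static TransactionType fromCode(String code) {
        TransactionType type = BY_CODE.get(code);
        if (type == null) {
            throw new IllegalArgumentException("Unknown transaction type code: " + code);
        }
        return type;
    }
}
```

Calling code then asks TransactionType.fromCode(code).isTaxRelated() instead of comparing against a list of codes, so a new type only needs its flag set once.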
The only time I don't create the translation table is when the list is very short, there is no additional data to keep, and it is clear from the nature of the universe that it is unlikely to ever change so the values can be safely hard-coded. Like true/false or positive/negative/zero or male/female. (And hey, even that last one, obvious as it seems, there are people insisting we now include "transgendered" and the like.)
Some people dogmatically insist that every table have an auto-generated sequential integer key. Such keys are an excellent choice in many cases, but for code lists, I prefer the short alpha key for the reasons stated above.

I would store the string representation, as this is easy to correlate back to the enum and much more stable. Using ordinal() would be bad because it can change if you add a new enum to the middle of the series, so you would have to implement your own numbering system.
In terms of performance, it all depends on what the enums would be used for, but it is most likely a premature optimization to develop a whole separate representation with conversion rather than just use the natural String representation.
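To make the ordinal() problem concrete, here is a small sketch. The two enums are hypothetical and stand for the same enum before and after a constant is inserted in the middle; the class and method names are assumptions for illustration.

```java
// Sketch: why a stored ordinal() is fragile while a stored name() is stable.
final class EnumStorageDemo {
    // Version 1 of the enum, as it existed when the rows were written.
    enum StatusV1 { ENABLED, DISABLED }

    // Version 2: a new constant was inserted in the middle of the series.
    enum StatusV2 { ENABLED, SUSPENDED, DISABLED }

    // Reads a value that was stored as a number (the ordinal).
    static StatusV2 readByOrdinal(int stored) {
        return StatusV2.values()[stored];
    }

    // Reads a value that was stored as a string (the constant's name).
    static StatusV2 readByName(String stored) {
        return StatusV2.valueOf(stored);
    }

    public static void main(String[] args) {
        int storedOrdinal = StatusV1.DISABLED.ordinal();
        String storedName = StatusV1.DISABLED.name();

        // The number written for DISABLED now maps to a different constant.
        System.out.println(readByOrdinal(storedOrdinal));
        // The name still maps to DISABLED.
        System.out.println(readByName(storedName));
    }
}
```

Storing the name has its own cost: renaming a constant then requires a data migration, which is one reason some people prefer an explicit, separately chosen code.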

Related

Database performance for filter by integer or boolean?

I will be having a database table with a few million entries, eg products of an online shop.
If one is out of stock, I want to mark it somehow, and I want to exclude it from any findAll() sql fetches.
Therefore I thought of one of the following options:
Each product already has an integer count of availability, which I have to set to 0 anyway: select * from products where availcount > 0
Or I could introduce a boolean available = 'true' field that I set to false if out of stock; the query would then be ...where available = 'true'
Question: will this make any difference? Are there reasons one of these options should be preferred?

I would stick with the stock levels (int availcount). Bit fields are typically poor candidates for an index, unless there is a massive skew in the data such that on the order of 1% or less of the products are out of stock (and since you will likely be searching for in-stock products only, any index on the flag will be unused).
Since it seems you already store the actual stock level anyway, not storing a separate available indicator will save you the headache of keeping the two columns in sync.
Finally, many RDBMSs allow you to add COMPUTED columns (or, failing that, you can add the available indicator to a VIEW), which gives you the logical derivation of the available indicator from the actual availcount without any storage overhead.
Edit
As per the comments below, note that an index on availcount (for queries WHERE availcount = 0 and availcount > 0) will be equally un-SARGable as an index on a bit field, although an index may not be needed if the products are generally searched by other criteria.
In addition to deriving the available indicator in the database, this determination can also be made in code, e.g. with an additional bool isAvailable() { return availcount > 0; } method on your entity class.
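A minimal sketch of such an entity class in Java follows. The class name, fields and the sell method are assumptions for illustration; only the availcount > 0 rule comes from the discussion above.

```java
// Sketch: availability is derived from the stock count instead of being stored twice.
final class Product {
    private final String name;
    private int availCount;

    Product(String name, int availCount) {
        if (availCount < 0) {
            throw new IllegalArgumentException("availCount must not be negative");
        }
        this.name = name;
        this.availCount = availCount;
    }

    String getName() { return name; }
    int getAvailCount() { return availCount; }

    // Derived flag: there is no second column or field to keep in sync.
    boolean isAvailable() {
        return availCount > 0;
    }

    // Hypothetical helper that reduces the stock when items are sold.
    void sell(int quantity) {
        if (quantity <= 0 || quantity > availCount) {
            throw new IllegalArgumentException("Invalid quantity: " + quantity);
        }
        availCount -= quantity;
    }
}
```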

If you already have the availcount column anyway, there is no reason to add a new one; your availcount > 0 will do.
If you do not need the count for other reasons, and are just trying to decide between having a count or a boolean, consider how hard it is going to be to update that column rather than filtering.
If you only have a boolean, you'll only need to touch it when the product goes out of stock (or comes back in). Having a count is more complex: you'd need to update it every time a sale is made or the item is restocked. This is more complicated, has possible performance implications, and comes with a bunch of corner cases to care about. So, unless you need the count for other purposes, it's probably a better idea to stick with the boolean.

I think the two options would be equally efficient on SELECT as long as there's an index on the column in question.
Indexing availcount will have a small penalty on any update of this column (and I guess this column will be updated often). On the other hand, having an available column will add redundancy to your database (i.e. it will not be normalized) which you may want to avoid.

Size of Strings and Calendar Objects Java

I am doing some basic (edit: reading and writing to a txt file), which requires me to store a bunch of expenses, and their attributes (i.e. name, price, date of purchase, etc.) I would like to be able to compare dates of purchases if possible. It occurred to me that I had a few options when it came to what type of object the date of purchase should be:
I could make the date a Calendar object and store it in the .txt file; this would mean storing lots of Calendar objects at once, and then easily comparing the dates
I could make the date a String, store it, transmute it to a Calendar object, and then compare them
I could leave the dates as strings and when I am ready to compare them, create some kind of code to go through individual characters and pick out a certain phrase or set of characters.
Which of these would probably be best for keeping the load on the computer down? Also, how would you go about loading objects as they build up over time? Once a person has a lot of spending, it would get pretty hefty to load every single item.

I would strongly suggest using Joda Time wherever possible, rather than Calendar and Date - it's a much cleaner date/time API.
Beyond that, definitely make your object model match your domain as closely as possible. You're dealing with dates, not strings - so make your object model reflect that. You should be converting between strings and dates as rarely as possible. It's not clear what you mean by "store it on the .txt" (given that elsewhere you're talking about a database) but using JDBC you'd use parameters anyway, without string conversions.
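As an illustration of keeping dates as dates and converting only at the file boundary, here is a sketch. It uses java.time.LocalDate from the standard library (Java 8 and later) rather than Joda Time, which is a substitution on my part. The semicolon-separated line format and all class names are assumptions for illustration.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch: the date is a LocalDate in memory and a string only in the text file.
final class Expense {
    final String name;
    final BigDecimal price;
    final LocalDate purchaseDate;

    Expense(String name, BigDecimal price, LocalDate purchaseDate) {
        this.name = name;
        this.price = price;
        this.purchaseDate = purchaseDate;
    }

    // Parses one line of the hypothetical format "name;price;yyyy-MM-dd".
    static Expense parseLine(String line) {
        String[] parts = line.split(";");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Malformed line: " + line);
        }
        return new Expense(parts[0], new BigDecimal(parts[1]), LocalDate.parse(parts[2]));
    }

    // Writes the expense back in the same format.
    String toLine() {
        return name + ";" + price + ";" + purchaseDate;
    }
}

final class ExpenseDemo {
    // Converts the lines once, then compares real dates rather than characters.
    static List<Expense> sortedByDate(List<String> lines) {
        List<Expense> expenses = new ArrayList<>();
        for (String line : lines) {
            expenses.add(Expense.parseLine(line));
        }
        expenses.sort(Comparator.comparing((Expense e) -> e.purchaseDate));
        return expenses;
    }
}
```

Two dates can also be compared directly with purchaseDate.isBefore(other) or isAfter(other).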
As for load - work out your performance requirements beforehand, try the simplest approach that works, and test whether that meets your requirements. Usually when people talk about having to have an efficient solution they haven't actually considered what they need. You talk about it getting "pretty hefty" to load every single item - how many items? Can you load them in a batch? Where will the database be? You'd be amazed how much data can be processed these days - but you need to understand the parameters of your problem before you make too many decisions that are hard to change later.

Solr: The default OR operator returns irrelevant results, when the fields are queried with multiple words

I need to make my Solr-based search return results if all of the search keywords appear anywhere in any of the search fields.
The current situation:
an example search query: keywords:"berlin house john" name:"berlin house john" author:"berlin house john"
Let's suppose that there is only one result, where keywords="house", name="berlin", and author="john" and there is no other possible permutation of these three words.
If the defaultOperator is OR, Solr returns a simple OR-ing of every keyword in every field, which is an enormous list where, of course, the best matching result is in first position, but the following results have very little relevance (perhaps only one field matching) and simply confuse the user.
On the other hand, if I switch the default operator to AND, I get absolutely no results. I guess it is trying to find a perfect match for all three words in all three fields, which, of course, does not exist.
The search terms come to the application from a search input, in which, the user writes free text - there are no specific language conventions (hashtags or something).
I know that what I am asking about is possible because I have done it before with pure Lucene, and it worked. What am I doing wrong?

If you just need to make sure that all words appear somewhere across the fields, I would suggest copying all relevant fields into one field at index time and querying that one instead. To do so, you need to introduce a new field and then use copyField for all source fields you want to copy over. To copy all fields, use:
<copyField source="*" dest="text"/>
See http://wiki.apache.org/solr/SchemaXml#Copy_Fields for details.
A similar approach would be to use boolean algebra at query time. This is a bit different from the above solution.
Your query should look like
(keywords:"berlin" OR keywords:"house" OR keywords:"john") AND
(name:"berlin" OR name:"house" OR name:"john") AND
(author:"berlin" OR author:"house" OR author:"john")
which basically states: one or more terms must match in keywords and one or more terms must match in name and one or more terms must match in author.
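Since the search terms come from free text, a query like the one above would normally be assembled in code. The following is a sketch using plain string building only (no Solr client library); the class and method names are hypothetical, and it assumes the terms contain no quotes or other special characters that would need escaping. The second method is a variant that is not part of the answer above: it requires every term to match in at least one field, which is closer to what the question asks for.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch: builds boolean query strings from a list of fields and a list of terms.
final class SolrQuerySketch {

    // One group per field: at least one term must match in every field.
    static String anyTermInEachField(List<String> fields, List<String> terms) {
        return fields.stream()
                .map(field -> terms.stream()
                        .map(term -> clause(field, term))
                        .collect(Collectors.joining(" OR ", "(", ")")))
                .collect(Collectors.joining(" AND "));
    }

    // Variant, one group per term: every term must match in at least one field.
    static String eachTermInAnyField(List<String> fields, List<String> terms) {
        return terms.stream()
                .map(term -> fields.stream()
                        .map(field -> clause(field, term))
                        .collect(Collectors.joining(" OR ", "(", ")")))
                .collect(Collectors.joining(" AND "));
    }

    // Assumption: no escaping is done here; real input would need it.
    private static String clause(String field, String term) {
        return field + ":\"" + term + "\"";
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("keywords", "name", "author");
        List<String> terms = Arrays.asList("berlin", "house", "john");
        System.out.println(anyTermInEachField(fields, terms));
        System.out.println(eachTermInAnyField(fields, terms));
    }
}
```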

As of Solr 4, defaultOperator is deprecated. Please don't use it.
Also, in my experience defaultOperator works the same as an operator specified in the query. I can't say why that is; it's just my experience.
Please try the query with the param {!q.op=AND}
I guess you are using the default query parser; correct me if I am wrong.

How can I insert common data into a temp table from disparate schemas?

I am not sure how to solve this problem:
We import order information from a variety of online vendors (Amazon, Newegg, etc.). Each vendor has its own specific terminology and structure for its orders, which we have mirrored into a database. Our data imports into the database with no issues; however, the problem I am faced with is to write a method that will extract the required fields from the database, regardless of the schema.
For instance assume we have the following structures:
Newegg structure:
"OrderNumber" integer NOT NULL, -- The Order Number
"InvoiceNumber" integer, -- The invoice number
"OrderDate" timestamp without time zone, -- Create date.
Amazon structure:
"amazonOrderId" character varying(25) NOT NULL, -- Amazon's unique, displayable identifier for an order.
"merchant-order-id" integer DEFAULT 0, -- A unique identifier optionally supplied for the order by the Merchant.
"purchase-date" timestamp with time zone, -- The date the order was placed.
How can I select these items and place them into a temporary table for me to query against?
The temporary table could look like:
"OrderNumber" character varying(25) NOT NULL,
"TransactionId" integer,
"PurchaseDate" timestamp with time zone
I understand that some of the databases represent an order number with an integer and others a character varying; to handle that I plan on casting the datatypes to String values.
Does anyone have a suggestion for me to read about that will help me figure this out?
I don't need an exact answer, just a nudge in the right direction.
The data will be consumed by Java, so if any particular Java classes will help, feel free to suggest them.

First, you can create a VIEW to provide this functionality:
CREATE VIEW orders AS
SELECT '1'::int AS source -- or any other tag to identify source
      ,"OrderNumber"::text AS order_nr
      ,"InvoiceNumber" AS transaction_id -- no cast .. is int already
      ,"OrderDate" AT TIME ZONE 'UTC' AS purchase_date -- !! see explanation
FROM tbl_newegg
UNION ALL -- not UNION!
SELECT 2
      ,"amazonOrderId"
      ,"merchant-order-id"
      ,"purchase-date"
FROM tbl_amazon;
You can query this view like any other table:
SELECT * FROM orders WHERE order_nr = '123' AND source = 2; -- order_nr is text in the view
The source is necessary if the order_nr is not unique. How else would you guarantee unique order-numbers over different sources?
A timestamp without time zone is ambiguous in a global context. It's only good in connection with its time zone. If you mix timestamp and timestamptz, you need to place the timestamp at a certain time zone with the AT TIME ZONE construct to make this work. For more explanation read this related answer.
I use UTC as time zone, you might want to provide a different one. A simple cast "OrderDate"::timestamptz would assume your current time zone. AT TIME ZONE applied to a timestamp results in timestamptz. That's why I did not add another cast.
While you can, I advise against ever using camel-case identifiers in PostgreSQL. That avoids many kinds of possible confusion. Note the lower-case identifiers (without the now unnecessary double quotes) I supplied.
Don't use varchar(25) as type for the order_nr. Just use text without arbitrary length modifier if it has to be a string. If all order numbers consist of digits exclusively, integer or bigint would be faster.
Performance
One way to make this fast would be to materialize the view. I.e., write the result into a (temporary) table:
CREATE TEMP TABLE tmp_orders AS
SELECT * FROM orders;
ANALYZE tmp_orders; -- temp tables are not auto-analyzed!
ALTER TABLE tmp_orders
ADD constraint orders_pk PRIMARY KEY (order_nr, source);
You need an index. In my example, the primary key constraint provides the index automatically.
If your tables are big, make sure you have enough temporary buffers to handle this in RAM before you create the temp table. Else it will actually slow you down.
SET temp_buffers = '1000MB';
This has to happen before the first use of temp objects in your session. Don't set it high globally, just for your session. A temp table is dropped automatically at the end of your session anyway.
To get an estimate how much RAM you need, create the table once and measure:
SELECT pg_size_pretty(pg_total_relation_size('tmp_orders'));
More on object sizes under this related question on dba.SE.
All the overhead only pays off if you have to process a number of queries within one session. For other use cases there are other solutions. If you know the source table at the time of the query, it would be much faster to direct your query to the source table instead. If you don't, I would question the uniqueness of your order_nr once more. If it is, in fact, guaranteed to be unique, you can drop the column source I introduced.
For only one or a few queries, it might be faster to use the view instead of the materialized view.
I would also consider a plpgsql function that queries one table after the other until the record is found. Might be cheaper for a couple of queries, considering the overhead. Indexes for every table needed of course.
Also, if you stick to text or varchar for your order_nr, consider COLLATE "C" for it.

Sounds like you need to create an abstract class that will define the basics of interacting with the data, then derive a class per database schema you need to access. This will allow the core code to operate on a single object type, and each implementation can then specify the queries in a form specific to that database schema.
Something like:
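import java.math.BigDecimal;
import java.util.ArrayList;

// Common representation of an order, independent of the vendor schema.
public class Order
{
    private String orderNumber;
    private BigDecimal orderTotal;
    // ... etc ...
}

// Base class; one subclass per database schema.
public abstract class AbstractOrderInformation
{
    public abstract ArrayList<Order> getOrders();
    // ...
}
with a Newegg class:
public class NeweggOrderInformation extends AbstractOrderInformation
{
    @Override
    public ArrayList<Order> getOrders() {
        // ... do the work of getting the Newegg orders ...
        return new ArrayList<>(); // placeholder until the query is written
    }
    // ...
}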
Then you can have an arbitrarily large number of formats and when you need information, you can just iterate over all the implementations and get the Orders from each.
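Iterating over all the implementations might be sketched as follows. Everything here is hypothetical: the class names are chosen so they do not clash with the outline above, and the hard-coded orders stand in for real queries against each vendor's tables.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Common order shape; order numbers are strings because vendors differ in type.
final class VendorOrder {
    final String source;
    final String orderNumber;

    VendorOrder(String source, String orderNumber) {
        this.source = source;
        this.orderNumber = orderNumber;
    }
}

// One subclass per vendor schema.
abstract class OrderInformation {
    abstract List<VendorOrder> getOrders();
}

final class NeweggOrders extends OrderInformation {
    @Override
    List<VendorOrder> getOrders() {
        // Stand-in for a query; the integer order number is converted to a string.
        return Arrays.asList(new VendorOrder("newegg", String.valueOf(1001)));
    }
}

final class AmazonOrders extends OrderInformation {
    @Override
    List<VendorOrder> getOrders() {
        // Stand-in for a query; the order id here is made up.
        return Arrays.asList(new VendorOrder("amazon", "102-0000000-0000000"));
    }
}

final class OrderCollector {
    // Gathers orders from every implementation into one list.
    static List<VendorOrder> collect(List<OrderInformation> sources) {
        List<VendorOrder> all = new ArrayList<>();
        for (OrderInformation source : sources) {
            all.addAll(source.getOrders());
        }
        return all;
    }
}
```

Supporting another vendor then means adding one subclass and one entry in the list passed to collect.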

Storing UUID in HSQLDB database

I wish to store UUIDs created using java.util.UUID in a HSQLDB database.
The obvious option is to simply store them as strings (in the code they will probably just be treated as such), i.e. varchar(36).
What other options should I consider for this, considering issues such as database size and query speed (neither of which are a huge concern due to the volume of data involved, but I would like to consider them at least)

HSQLDB has a built-in UUID type. Use that:
CREATE TABLE t (
id UUID PRIMARY KEY
);

You have a few options:
Store it as a VARCHAR(36), as you already have suggested. This will take 36 bytes (288 bits) of storage per UUID, not counting overhead.
Store each UUID in two BIGINT columns, one for the least-significant bits and one for the most-significant bits; use UUID#getLeastSignificantBits() and UUID#getMostSignificantBits() to grab each part and store it appropriately. This will take 128 bits of storage per UUID, not counting any overhead.
Store each UUID as an OBJECT; this stores it as the binary serialized version of the UUID class. I have no idea how much space this takes up; I'd have to run a test to see what the default serialized form of a Java UUID is.
The upsides and downsides of each approach depend on how you're passing the UUIDs around your app -- if you're passing them around as their string-equivalents, then the downside of requiring double the storage capacity for the VARCHAR(36) approach is probably outweighed by not having to convert them each time you do a DB query or update. If you're passing them around as native UUIDs, then the BIGINT method probably is pretty low-overhead.
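The two-BIGINT option can be sketched like this. The class and method names are assumptions for illustration; the java.util.UUID calls are the ones named above plus the UUID(long, long) constructor.

```java
import java.util.UUID;

// Sketch: splitting a UUID into two longs for two BIGINT columns, and back.
final class UuidAsLongs {

    // Returns {mostSignificantBits, leastSignificantBits}.
    static long[] split(UUID id) {
        return new long[] { id.getMostSignificantBits(), id.getLeastSignificantBits() };
    }

    // Rebuilds the UUID from the two column values.
    static UUID join(long mostSignificantBits, long leastSignificantBits) {
        return new UUID(mostSignificantBits, leastSignificantBits);
    }

    public static void main(String[] args) {
        UUID original = UUID.randomUUID();
        long[] parts = split(original);
        UUID restored = join(parts[0], parts[1]);
        System.out.println(original.equals(restored));
    }
}
```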
Oh, and it's nice that you're looking to consider speed and storage space issues, but as many better than me have said, it's also good that you recognize that these might not be critically important given the amount of data your app will be storing and maintaining. As always, micro-optimization for the sake of performance is only important if not doing so leads to unacceptable cost or performance. Otherwise, these two issues -- the storage space of the UUIDs, and the time it takes to maintain and query them in the DB -- are reasonably low-importance given the cheap cost of storage and the ability of DB indices to make your life much easier. :)

I would recommend char(36) instead of varchar(36). Not sure about hsqldb, but in many DBMS char is a little faster.
For lookups, if the DBMS is smart, then you can use an integer value to "get closer" to your UUID.
For example, add an int column to your table as well as the char(36). When you insert into your table, insert the uuid.hashCode() into the int column. Then your searches can be like this
WHERE intCol = ? and uuid = ?
As I said, if hsqldb is smart like mysql or sql server, it will narrow the search by the intCol and then only compare at most a few values by the uuid. We use this trick to search through million+ record tables by string, and it is essentially as fast as an integer lookup.

Using BINARY(16) is another possibility. Less storage space than character types. Use CREATE TYPE UUID .. or CREATE DOMAIN UUID .. as suggested above.
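For a BINARY(16) column the UUID has to be turned into 16 bytes and back. A minimal sketch, assuming big-endian order with the most-significant bits first (the class and method names are hypothetical):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Sketch: converting a UUID to and from a 16-byte array.
final class UuidBytes {

    // Most-significant 8 bytes first, then the least-significant 8 bytes.
    static byte[] toBytes(UUID id) {
        return ByteBuffer.allocate(16)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    static UUID fromBytes(byte[] bytes) {
        if (bytes.length != 16) {
            throw new IllegalArgumentException("Expected 16 bytes, got " + bytes.length);
        }
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        long mostSignificantBits = buffer.getLong();
        long leastSignificantBits = buffer.getLong();
        return new UUID(mostSignificantBits, leastSignificantBits);
    }
}
```

With JDBC the array would typically be passed via PreparedStatement.setBytes and read back with ResultSet.getBytes.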

I think the easiest thing to do would be to create your own domain thus creating your own UUID "type" (not really a type, but almost).
You also should consider the answer to this question (especially if you plan to use it instead of a "normal" primary key)
INT, BIGINT or UUID/GUID in HSQLDB? (deleted by community ...)
HSQLDB: Domain Creation and Manipulation
