I am building a distributed application on top of Java and Cassandra. To generate unique, roughly sequential 32-bit and 64-bit IDs, is an approach like Flickr's ticket servers for generating primary IDs a good one? I am particularly excited about this because it can keep the IDs down to 32 or 64 bits as required, whereas UUIDs would take up to 128 bits. I do not need these IDs to be perfectly sequential, but they should at least be increasing.
Using a single database server would, however, reintroduce a single point of failure that Cassandra had eliminated. That may be acceptable for the initial stage of our application; later we could introduce two servers to alleviate the problem.
Does this sound like a good strategy? In short, we would be mixing MySQL and Cassandra in one application, and I know that if MySQL is down for some reason we cannot carry on with Cassandra alone.
We have looked at other solutions such as Snowflake, but it did not perfectly match our requirements.
EDIT: I am seeking advice on whether using MySQL to generate unique primary IDs for the data/entities stored in Cassandra is a good approach. What are the downsides, if any, of an approach like Flickr's ticket servers?
I'm not a big fan of trying to attach meaning to surrogate keys (which is what you're doing if you want them to increase over time). As you're seeing, it makes your key-generation problem more complicated. Assuming that you want the keys to increase over time simply so that you can sort data, why not include a timestamp of when the object was created and store that in your data store? This simplifies key generation significantly and lets you do pretty much everything you could do with keys that increase over time, with the added bonus that it will be crystal clear to whoever maintains your code how objects should be sorted.
In general, you can't have both "always increasing" and "no SPOF and no complex synchronization".
If you want several ID generators that do not have to ask each other for every new ID, each of them really needs a separate ID pool.
A really simple example is mentioned in the article you linked: one server creates the odd IDs while the other creates the even ones. (You can trivially expand this to more servers.) Of course, you then can't be sure that one server doesn't run ahead of the other, which can lead to a non-increasing sequence like 111, 120, 113, 122, 115, 124, ...
If you only want "roughly increasing", you can implement a scheme where each server, at some interval (say every minute or every 10,000 IDs), tells the other one(s) its current ID, and the others then jump their own IDs forward (never backward) if they have fallen too far behind. This should be done in a way that does not interrupt ID generation, for robustness in case the other server is down.
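A minimal Java sketch of the two ideas above (disjoint ID pools plus a periodic catch-up). The class and parameter names are made up, and the actual peer exchange is left out:

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: server i hands out the IDs i, i + serverCount, i + 2*serverCount, ...
// and jumps forward when a peer reports that it has run far ahead.
public final class PartitionedIdGenerator {
    private final AtomicLong next;
    private final long serverCount;   // number of cooperating ID generators
    private final long maxLag;        // how far behind a peer we tolerate being

    public PartitionedIdGenerator(long serverIndex, long serverCount, long maxLag) {
        this.next = new AtomicLong(serverIndex);
        this.serverCount = serverCount;
        this.maxLag = maxLag;
    }

    /** Returns the next ID from this server's own pool; no coordination needed. */
    public long nextId() {
        return next.getAndAdd(serverCount);
    }

    /** Called whenever a peer reports its current ID, e.g. once a minute. */
    public void onPeerIdReported(long peerId) {
        long current = next.get();
        if (peerId - current > maxLag) {
            // Round the peer's ID up into our own pool so the pools stay disjoint,
            // and only ever move forward.
            long jumpTo = peerId + Math.floorMod(current - peerId, serverCount);
            next.accumulateAndGet(jumpTo, Math::max);
        }
    }
}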
Ah, for the "free bits at the end": simply multiply your ID by some number (the same one each time, and a power of two if you really want "free bits" and not just "space for data"), then add your data (which must be less than that number). But of course you will then run out of ID space quite a bit earlier (by that factor).
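As a concrete illustration (a hedged sketch; the 8-bit width and the class name are made up), shifting by a power of two keeps the packing and unpacking trivial:

// Hypothetical sketch: reserve the low 8 bits of a 64-bit ID for extra data.
// Multiplying by 256 is the same as shifting left by 8 bits.
public final class PackedId {
    private static final int DATA_BITS = 8;
    private static final long DATA_MASK = (1L << DATA_BITS) - 1;

    /** Packs a sequence value and a small data value (0..255) into one long. */
    public static long pack(long sequence, long data) {
        if (data < 0 || data > DATA_MASK) {
            throw new IllegalArgumentException("data must fit in " + DATA_BITS + " bits");
        }
        return (sequence << DATA_BITS) | data;
    }

    public static long sequenceOf(long packedId) {
        return packedId >>> DATA_BITS;
    }

    public static long dataOf(long packedId) {
        return packedId & DATA_MASK;
    }
}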
Straight to the point: I've tried searching on Google and on SO but can't find what I'm looking for. It could be because I'm not wording my search correctly.
My question is,
I have a couple of tables which will be holding anywhere between 1,000 and 100,000 rows per year. I'm trying to figure out whether and how I should archive this data. I'm not very experienced with databases, but below are a few methods I've come up with, and I'm unsure which is the better practice, taking into account performance and ease of coding. I'm using Java 1.8, Sql2o and Postgres.
Method 1
Archive the data into a separate database every year.
I don't really like this method, because when we want to search for old data, our application will need to look in a different database, and it will be a hassle for me to add more code for that.
Method 2
Archive data older than 2-3 years into a separate database, and use a status on the rows to improve performance (see Method 3). This is something I'm leaning towards as an 'optimal' solution: the code is not too complex, and it also keeps my DB relatively clean.
Method 3
Just have a status for each row (e.g. A = active, R = archived) to possibly improve query performance, using something like "select * from table where status = 'A'" to reduce the number of rows to scan.
100,000 rows per year is not that much. [1]
There's no need to move that to a separate place. If you already have good indexes in place, you almost certainly won't notice any degraded performance over the years.
However, if you want to be absolutely sure, you could add a year column and create an index for that (or add that to your existing indexes). But really, do that only for the tables where you know you need it. For example, if your table already has a date column which is part of your index(es), you don't need a separate year column.
[1] Unless you have thousands of columns and/or columns that contain large binary blobs - which doesn't seem to be the case here.
As Vog mentions, 100,000 rows is not very many. Nor is 1,000,000 or 5,000,000 -- sizes that your tables may grow to.
In many databases, you could use a clustered index where the first key is the "active" column. However, Postgres does not really support clustered indexes.
Instead, I would suggest that you look into table partitioning. This is a method where the underlying storage is split among different "files". You can easily specify that a query reads one or more partitions by using the partitioning key in a where clause.
For your particular use case, I would further suggest having a view over only the active data. It would read just one partition, so performance should be pretty much the same as querying a table that contains only the most recent data.
That said, I'm not sure if it is better to partition by an active flag or by year. That depends on how you are accessing the data, particularly the older data.
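A hedged sketch of that idea, assuming PostgreSQL 10+ declarative partitioning and partitioning by the status flag purely for illustration (the table, column and connection details are all made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hypothetical sketch of the partition-plus-view idea described above.
public final class PartitionSetup {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/app", "app", "secret");
             Statement st = con.createStatement()) {
            st.execute("CREATE TABLE orders (id bigint NOT NULL, status char(1) NOT NULL, created date NOT NULL)"
                    + " PARTITION BY LIST (status)");
            st.execute("CREATE TABLE orders_active   PARTITION OF orders FOR VALUES IN ('A')");
            st.execute("CREATE TABLE orders_archived PARTITION OF orders FOR VALUES IN ('R')");
            // Queries that filter on status = 'A' only touch the small active partition.
            st.execute("CREATE VIEW active_orders AS SELECT * FROM orders WHERE status = 'A'");
        }
    }
}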
During localhost development, the IDs generated by GAE start with 1.
However, in a real GAE deployment in the cloud, the IDs generated even for the first entities are quite long, like 5639412304721232. Is there a workaround to make the first entities start with 1, 2, 3 and so on?
One might suggest using sharded counters, and yes, I've used them; however, some suggest that sharded counters should not be used for this, as the app might get the same count twice because they are only eventually consistent.
In this case what could be the best solution?
The official post explaining the switch from sequential to 'scattered' ids is here.
The instructions for reverting to sequential behaviour are here, but note the warning that this option will eventually be removed.
The 'best' solution depends on what you need and why. You'll get better datastore performance with scattered IDs, but honestly you might not notice much difference if your app gets a small number of requests and makes light use of the datastore. If that's the case, you can roll your own sequential IDs based on a simple entity with a property that holds the current high-watermark ID, and rely on the low transaction rate to keep you from running into the limits on transactions per entity group.
Reliably handing out sequential IDs without gaps in a distributed system is challenging.
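A hedged sketch of that high-watermark approach using the low-level Java Datastore API (the kind and property names are made up; every allocation contends on a single entity group, so this only suits low write rates):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Transaction;

// Hypothetical sketch: one "Sequence" entity holds the highest ID handed out so far.
public final class SequenceAllocator {
    private static final Key COUNTER_KEY = KeyFactory.createKey("Sequence", "global");

    public static long nextId() {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
        Transaction txn = datastore.beginTransaction();
        try {
            long next;
            Entity counter;
            try {
                counter = datastore.get(txn, COUNTER_KEY);
                next = (Long) counter.getProperty("highWatermark") + 1;
            } catch (EntityNotFoundException e) {
                counter = new Entity(COUNTER_KEY);   // first ever allocation
                next = 1;
            }
            counter.setProperty("highWatermark", next);
            datastore.put(txn, counter);
            txn.commit();
            return next;
        } finally {
            if (txn.isActive()) {
                txn.rollback();
            }
        }
    }
}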
Be aware that you may run into problems if you create a lot of entities very quickly, with sequential Long IDs. This post gives you an explanation why.
In theory there's a choice of auto ID generation policies, with scattered IDs being the default since 1.8.1, but the old monotonically increasing legacy policy is to be deprecated for the reasons discussed in the linked post.
If you're using a sharded counter, you will avoid this but, as you say, you may encounter other issues.
You might try using allocate_ids. We use this to get smaller integer values for system-generated IDs. In Python, using a db kind:
model_key = db.Key.from_path('your_kind_name', 1)
key_batch = db.allocate_ids(model_key, 1)
id_new = key_batch[0]
idkey = db.Key.from_path('your_kind_name', id_new)
I would assign the key's identifier as the strings "1", "2", "3"... and so on, generating them from a sequencer. You can check to see if the entity already exists with a get_or_insert() function.
Similarly, you can use the auto-increment solution by storing the sequence number in an entity.
I have a SELECT query with a lot of IF conditions, which I can do either in the query itself (using the DB machine's CPU) or in my Java code (using the app server machine's CPU).
Is there a preferred approach here (putting the conditions in the DB vs. in the mid-tier)?
UPDATE: My query is a join across more than two tables, and I am using a left join to combine them; some rows have a corresponding row in the second table and some do not. I need a default value for those columns when there is no corresponding row in the second table.
SELECT CASE WHEN t2.col1 IS NULL
            THEN 'default'
            ELSE t2.col1
       END
FROM table1 t1
LEFT JOIN table2 t2 ON t1.id = t2.id
If it's really something that the DB cannot do any faster than the app server, and which actually reduces the load on the DB server if moved to the app server, then I'd move it to the app server.
The reason: if you reach the limits of your hardware, it's much easier to have multiple app servers than to have a clustered database.
However, the second condition above should be tested thoroughly: many things will not reduce the DB load (and may even increase it) when moved away from the DB.
Update: For the kind of thing you need, I doubt whether the first condition is satisfied - have you tested it? A simple CASE is completely insignificant, unless the condition or the branches contain some very expensive calculations.
Yes, though I would suggest another approach, one that adds no load to the app server and minimal load to the DBMS. It's a little hard to answer the question since you haven't provided a concrete example but I'll give it a shot.
My preferred solution is to get rid of the if conditions totally if you can. At a bare minimum, you can re-jig your database schema to move the cost of calculation away from the select (which happens a lot) and into the insert/update (which happens less often).
That's the normal case; I have seen databases that write more frequently than they read, but they're the exception rather than the rule.
By way of example, let's say you store person information and you want to get a list of people whose first name is more than 5 characters long. Don't ask why, I'm the customer, you have to give me what I want :-)
Rather than a monstrous select statement to (possibly) split apart the name and count the characters in it, do that as an insert/update trigger when the data enters the table - that's the only time when the value can change after all.
Put that calculation in another column (indexed) and use that in your select. The cost of the calculation is amortised over all the selects, which will be blindingly fast.
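A hedged sketch of that idea, assuming PostgreSQL and a made-up person table; the trigger keeps an indexed first_name_length column up to date so the SELECT never has to compute anything:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hypothetical sketch: maintain a derived, indexed column via an insert/update trigger.
public final class FirstNameLengthTrigger {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/app", "app", "secret");
             Statement st = con.createStatement()) {
            st.execute("ALTER TABLE person ADD COLUMN first_name_length int");
            st.execute("CREATE INDEX idx_person_fn_len ON person (first_name_length)");
            st.execute(
                "CREATE OR REPLACE FUNCTION set_first_name_length() RETURNS trigger AS $$ " +
                "BEGIN NEW.first_name_length := char_length(NEW.first_name); RETURN NEW; END; " +
                "$$ LANGUAGE plpgsql");
            st.execute(
                "CREATE TRIGGER person_fn_len BEFORE INSERT OR UPDATE ON person " +
                "FOR EACH ROW EXECUTE PROCEDURE set_first_name_length()");
            // The cheap query is now: SELECT * FROM person WHERE first_name_length > 5
        }
    }
}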
It will take up more storage space but, if you compare the number of database "how can I make this faster?" questions against the number of "how can I use less space?" questions, you'll find the former greatly outweigh the latter.
And, yes, it does mean you store redundant data but the triggers mitigate the possibility of losing ACID properties. It's okay to bend rules if you know the possible consequences and how best to avoid them.
Based on your update, you should put the workload on to the machine where it causes the least impact. That may be the DBMS, it may be the app server, it may even be on the client side (of the app server) itself since that would distribute the cost across a lot of machines rather than concentrating it at a single point.
You should measure, not guess! Set up realistic performance test systems along with realistic production-quality data, then try the different approaches. That's the only real way to be certain.
I came across a question recently: "Generating primary keys in a clustered environment of 5 app servers - [OAS Version 10] - without using the database".
Usually we generate PKs from a DB sequence, or by storing values in a database table and using a stored procedure to generate the new PK value. However, the current requirement is to generate primary keys for my application without referencing the database, using JDK 1.4.
I need expert help to arrive at a better way to handle this.
Thanks,
Use a UUID as your primary key and generate it client-side.
Edit:
Since your comment I felt I should expand on why this is a good way to do things.
Although sequential primary keys are the most common in databases, using a randomly generated primary key is frequently the best choice for distributed databases or (particularly) databases that support a "disconnected" user interface, i.e. a UI where the user is not continuously connected to the database at all times.
UUIDs are the best form of randomly generated key since they are effectively guaranteed to be unique; the likelihood of the same UUID being generated twice is so extremely low as to be almost impossible. UUIDs are also ubiquitous; nearly every platform has built-in support for generating them, and for those that don't, there's almost always a third-party library to take up the slack.
The biggest benefit to using a randomly generated primary key is that you can build many complex data relationships (with primary and foreign keys) on the client side and (when you're ready to save, for example) simply dump everything to the database in a single bulk insert without having to rely on post-insert steps to obtain the key for later relationship inserts.
On the con side, UUIDs are 16 bytes rather than a standard 4-byte int -- four times the space. Is that really an issue these days? I'd say not, but I know some who would argue otherwise. The only real performance concern with UUIDs is indexing, specifically clustered indexing. I'm going to wander into the SQL Server world, since I don't develop against Oracle all that often and SQL Server is my current comfort zone. By default, SQL Server creates a clustered index on the primary key of a table. This works fairly well in the auto-increment int world and gives good performance for key-based lookups. Any DBA worth his salt, however, will cluster differently; folks who don't pay attention to that clustering and who also use UUIDs (GUIDs in the Microsoft world) tend to see nasty slowdowns on insert-heavy databases, because the clustered index has to be maintained on every insert, and a new UUID can land in the middle of the clustered sequence, so a lot of data may need to be rearranged to keep the clustered index in order. This may or may not be an issue in the Oracle world -- I just don't know whether Oracle PKs are clustered by default the way they are in SQL Server.
If that was too hard to follow, just remember this: if you use a UUID as your primary key, do not cluster on that key!
You may find it helpful to look up UUID generation.
In the simple case -- one program running one thread on each machine -- you can do something such as:
MAC address + time in nanoseconds since 1970.
If you cannot use the database at all, a GUID/UUID is the only reliable way to go. However, if you can use the database occasionally, try the HiLo algorithm.
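A minimal HiLo sketch (hedged: fetchNextHiFromDatabase() is a placeholder you would back with a sequence or a single-row table updated in a short transaction):

// Hypothetical HiLo sketch: the database is contacted only once per block of IDs.
public abstract class HiLoIdGenerator {
    private final int blockSize;
    private long hi;     // current block number, fetched from the database
    private int lo;      // position inside the current block

    protected HiLoIdGenerator(int blockSize) {
        this.blockSize = blockSize;
        this.lo = blockSize;  // forces a database round trip on first use
    }

    public synchronized long nextId() {
        if (lo >= blockSize) {
            hi = fetchNextHiFromDatabase();  // e.g. SELECT nextval('hi_seq')
            lo = 0;
        }
        return hi * blockSize + lo++;
    }

    /** One database round trip per blockSize IDs; implement against your database. */
    protected abstract long fetchNextHiFromDatabase();
}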
You should consider using IDs in the form of a UUID. Java 5 has a class for representing them (java.util.UUID), which also provides a factory method to generate them. With this class you can backport the code to your antiquated Java 1.4 in order to get the identifiers you require.
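A tiny sketch, assuming Java 5+ (or a backported UUID class) is available:

import java.util.UUID;

// Generates a random (version 4) UUID to use as a client-side primary key.
public final class UuidKeyExample {
    public static void main(String[] args) {
        UUID key = UUID.randomUUID();
        System.out.println(key);  // 36-character string form, e.g. for storing in a CHAR(36) column
    }
}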
Take a look at these strategies used by Hibernate (section 5.1.5 in the link). You will surely find it useful.
It explains several methods with their pros and cons, also stating whether they are safe in a clustered environment.
Best of all, there is available code that already implements it for you :)
If it fits your application, you can use a larger string key coupled with a UUID() function or an SHA1 of random data.
For sequential ints, I'll leave that to another poster.
You can generate a key based on the combination of the three things below (see the sketch after the list):
The IP address or MAC address of machine
Current time
An incremental counter on each instance (to ensure the same key is not generated twice on one machine, since the time may appear identical for two back-to-back key creations because of the underlying time precision)
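A hedged sketch combining those ingredients. The bit layout (16 bits of machine identity, 32 bits of time, 16 bits of counter) is made up, and NetworkInterface.getHardwareAddress() needs Java 6+, so on JDK 1.4 you would hash the IP address instead:

import java.net.InetAddress;
import java.net.NetworkInterface;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: machine identity + time + per-JVM counter packed into one long.
public final class CompositeIdGenerator {
    private static final AtomicInteger COUNTER = new AtomicInteger();
    private static final long NODE_BITS = nodeBits();

    public static long nextId() {
        long seconds = System.currentTimeMillis() / 1000L;     // current time
        long counter = COUNTER.getAndIncrement() & 0xFFFFL;    // per-instance counter
        return (NODE_BITS << 48) | ((seconds & 0xFFFFFFFFL) << 16) | counter;
    }

    private static long nodeBits() {
        try {
            byte[] mac = NetworkInterface
                    .getByInetAddress(InetAddress.getLocalHost()).getHardwareAddress();
            long hash = 0;
            for (byte b : mac) {
                hash = hash * 31 + (b & 0xFF);
            }
            return hash & 0xFFFFL;                             // machine identity
        } catch (Exception e) {
            return new java.util.Random().nextInt(0xFFFF);     // fallback if no MAC is available
        }
    }
}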
By using a Statement object you can call statement.getGeneratedKeys() to retrieve the auto-generated key(s) produced by the execution of that Statement object.
Javadoc
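A minimal sketch (Java 7+ syntax; the connection URL, table and column names are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Reads a database-generated key back through JDBC after an insert.
public final class GeneratedKeyExample {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/app", "app", "secret");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO person (name) VALUES (?)", Statement.RETURN_GENERATED_KEYS)) {
            ps.setString(1, "Alice");
            ps.executeUpdate();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                if (keys.next()) {
                    System.out.println("generated id = " + keys.getLong(1));
                }
            }
        }
    }
}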
Here is how it's done in MongoDB: http://www.mongodb.org/display/DOCS/Object+IDs
They include a timestamp.
But you could also install Oracle Express and use sequences; you can even select values in bulk:
SQL> select mysequence.nextval from dual connect by level <= 20;
NEXTVAL
1
2
3
4
5
..
20
Why are you not allowed to use the database? Money (Oracle express is free) or single point of failure? Or do you want to support other databases than Oracle in the future?
It's shipped out of the box in many Spring-based platforms such as Hybris:
The typeCode is the name of your type (table), e.g. User, Address, etc.
private PK generatePkForCode(final String typeCode)
{
    final TypeInfoMap persistenceInfo = Registry.getCurrentTenant().getPersistenceManager().getPersistenceInfo(typeCode);
    return PK.createCounterPK(persistenceInfo.getItemTypeCode());
}
I work on an application that is deployed on the web. Part of the app is search functions where the result is presented in a sorted list. The application targets users in several countries using different locales (= sorting rules). I need to find a solution for sorting correctly for all users.
I currently sort with ORDER BY in my SQL query, so the sorting is done according to the locale (LC_COLLATE) set for the database. These rules are incorrect for users whose locale differs from the one set for the database.
Also, to further complicate the issue, I use pagination in the application, so when I query the database I ask for rows 1 - 15, 16 - 30, etc. depending on the page I need. However, since the sorting is wrong, each page contains entries that are incorrectly sorted. In a worst case scenario, the entire result set for a given page could be out of order, depending on the locale/sorting rules of the current user.
If I were to sort in (server side) code, I need to retrieve all rows from the database and then sort. This results in a tremendous performance hit given the amount of data. Thus I would like to avoid this.
Does anyone have a strategy (or even technical solution) for attacking this problem that will result in correctly sorted lists without having to take the performance hit of loading all data?
Tech details: The database is PostgreSQL 8.3, the application an EJB3 app using EJB QL for data query, running on JBoss 4.5.
Are you willing to develop a small Postgres custom function module in C? (Probably only a few days for an experienced C coder.)
strxfrm() is the function that transforms the language-dependent text string based on the current LC_COLLATE setting (more or less the current language) into a transformed string that results in proper collation order in that language if sorted as a binary byte sequence (e.g. strcmp()).
If you implement this for Postgres, say it takes a string and a collation order, then you will be able to order by strxfrm(textfield, collation_order). I think you can then even create multiple functional indexes on your text column (say one per language) using that function to store the results of the strxfrm() so that the optimizer will use the index.
Alternatively, you could join the Postgres developers in implementing this in mainstream Postgres. Here are the wiki pages about this issue: Collation, ICU (which, as far as I know, is also used by Java).
Alternatively, as a less sophisticated solution if data input is only through Java, you could compute these strxfrm() values in Java (Java will probably have a different name for this concept) when you add the data to the database, and then let Postgres index and order by these precomputed values.
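The Java counterpart of strxfrm() is java.text.Collator with its CollationKey. A hedged sketch of precomputing such sort keys (storing and indexing the column is left out):

import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

// Hypothetical sketch: the byte array returned here sorts in the correct order
// for the given locale when compared bytewise, so it can be stored in an indexed
// bytea column and used in ORDER BY instead of the raw text.
public final class SortKeys {
    public static byte[] sortKeyFor(String text, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        CollationKey key = collator.getCollationKey(text);
        return key.toByteArray();
    }

    public static void main(String[] args) {
        byte[] swedishKey = sortKeyFor("Öberg", new Locale("sv", "SE"));
        System.out.println(swedishKey.length + " bytes");  // store this next to the row
    }
}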
How tied are you to PostgreSQL? The documentation isn't promising:
The nature of some locale categories is that their value has to be fixed for the lifetime of a database cluster. That is, once initdb has run, you cannot change them anymore. LC_COLLATE and LC_CTYPE are those categories. They affect the sort order of indexes, so they must be kept fixed, or indexes on text columns will become corrupt. PostgreSQL enforces this by recording the values of LC_COLLATE and LC_CTYPE that are seen by initdb. The server automatically adopts those two values when it is started.
(Collation rules define how text is sorted.)
Google throws up a patch under discussion:
PostgreSQL currently only supports one collation at a time, as fixed by the LC_COLLATE variable at the time the database cluster is initialised.
I'm not sure I'd want to manage this outside the database, though I'd be interested in reading about how it can be done. (Anyone wanting a good technical overview of the issues should check out Sorting Your Linguistic Data inside the Oracle Database on the Oracle globalization site.)
I don't know of any way to switch the database's collation on a per-query basis. Therefore, other solutions have to be considered.
If the number of results is really big (hundreds of thousands?), I have no solution except showing only the number of results and asking the user to make a more precise request. Otherwise, sorting on the server side can work, depending on the precise conditions...
In particular, using a cache could improve things tremendously. The first, unlimited request to the database would not be that much slower than a query limited in the number of results, and the subsequent requests would be much faster. Paging and re-sorting often lead to several requests, so the cache would work well (even with a duration of a few minutes).
I use EhCache as a technical solution.
Sorting and paging go together, sorting then paging.
The raw results could be kept in the cache.
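A hedged sketch of that flow, using a plain list as a stand-in for the cached raw result (EhCache or similar in a real application) and java.text.Collator for the locale-aware comparison:

import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: cache the raw result once, then sort and page per user.
public final class CachedPager {
    public static List<String> page(List<String> cachedRows, Locale userLocale,
                                    int pageNumber, int pageSize) {
        List<String> sorted = new ArrayList<String>(cachedRows);
        // Locale-aware sort on the application server, independent of LC_COLLATE.
        Collections.sort(sorted, Collator.getInstance(userLocale));
        int from = Math.min(pageNumber * pageSize, sorted.size());
        int to = Math.min(from + pageSize, sorted.size());
        return sorted.subList(from, to);
    }
}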
To reduce the performance hit, some hints:
you can run the query once just to get the result set size, and warn the user if there are too many results (either asking them to confirm a slow query, or to add some selection fields)
only request the columns you need and skip all the others (often some data is not shown immediately for all results but only displayed on mouse-over, for example; that data can be requested lazily, only as needed, reducing the columns requested for all results)
if you have computed values, cache whichever is smaller: the database columns or the computed values
if you have values repeated across multiple results, you can request that data/those columns separately (so you retrieve them from the database once and cache them only once), and retrieve only a key (typically an id) in the main request.
You might want to check out this package: http://www.fi.muni.cz/~adelton/l10n/postgresql-nls-string/. It hasn't been updated in a long time and may not work anymore, but it seems like a reasonable starting point if you want to build a function that can do this for you.
This module is broken for Postgres 8.4.3. I fixed it - you can download the fixed version from http://www.itreport.eu/__cw_files/.01/.17/.ee7844ba6716aa36b19abbd582a31701/nls_string.c and you'll have to compile and install it by hand (as described in the related README and INSTALL from the original module) - but even so, sorting still works incorrectly for me. I tried it on FreeBSD 8.0, with LC_COLLATE set to cs_CZ.UTF-8.