Randomly generated UUIDs have duplicates - Java

I'm using the call below to generate a UUID:
UUID.randomUUID().toString()
In production we have 50+ application servers (each one its own JVM), and for every request that lands on one of these servers, the first step is to generate a UUID that uniquely identifies the transaction.
What we are observing is that Server 6 and Server 11 generate matching UUIDs for at least 10 to 15 messages per day, which is strange: given the load of about 1 million transactions a day, duplicate UUIDs within the same day are very odd.
This is what we have done so far:
Verified the application logs - we didn't find anything fishy there; all logs look normal.
Tried replicating the issue in the test environment with production-like load and 50+ servers - it didn't happen there.
Checked the application logic - this doesn't seem to be the issue, because the other 48 servers run a copy of the same code base and generate unique UUIDs per transaction without any problem.
So far we haven't been able to trace the issue. My question is basically: is there something at the JVM level we are missing, or a UUID-related parameter we need to set, for this one-off kind of issue?

Given time, I'm sure you'll find the culprit. In the meantime, there was a comment that I think deserves to be promoted to an answer:
You are generating pseudo-random UUIDs at multiple locations. If you don't find other bugs, consider either generating all the pseudo-random UUIDs at one location, or generating truly random UUIDs.
So create a UUID server. It is just a process that churns out blocks of UUIDs. Each block consists of maybe 10,000 (or whatever is appropriate) UUIDs. The process writes each block to disk after verifying that the block contains no duplicates.
Create another process to distribute the blocks of UUIDs. Maybe it is just a web service that returns an unused block when it gets a request. The transaction server requests a block and then consumes those UUIDs as it creates transactions. When the server has used most of its assigned UUIDs, it requests another block.
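A minimal sketch of the consuming side of that scheme, assuming such a block-distributing service exists (the class name and fetchBlockFromUuidService() are placeholders, not an existing API):

import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Sketch of the consuming side: hold a block of pre-verified UUIDs and ask the
// (assumed) block-distributing service for a new one when running low.
public class UuidBlockClient {
    private static final int LOW_WATER_MARK = 500;   // example figure
    private final Queue<String> block = new ArrayDeque<>();

    public synchronized String nextUuid() {
        if (block.size() <= LOW_WATER_MARK) {
            block.addAll(fetchBlockFromUuidService());   // e.g. an HTTP call returning ~10,000 UUIDs
        }
        return block.poll();
    }

    // Placeholder for the request to the central UUID-distributing process.
    private List<String> fetchBlockFromUuidService() {
        throw new UnsupportedOperationException("call the UUID block service here");
    }
}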

I wouldn't waste time wondering how UUID.randomUUID() could be generating a few duplicate UUIDs per day. The odds of that happening by chance are infinitesimal. (Generating a whole series of duplicates would be possible if the underlying RNG state were duplicated, but that doesn't seem to be the case here.)
Instead, look for places where a UUID stored by one server could be clobbering one stored by another. Why does this only happen between 2 servers out of 50? That has something to do with the details of your environment and system that haven't been shared.

As stated above, the chances of a legitimate collision are impossibly small. A more likely possibility is that the values are at some point transferred between objects in an improper way.
Because Java passes object references around (so several variables can end up pointing at the same object), consider the following scenario:
saveObject1.setUUID(initObj.getUUID());   // copies the reference held by initObj
initObj.setUUID(UUID.randomUUID());       // if this re-initialization is skipped or delayed...
saveObject2.setUUID(initObj.getUUID());   // ...both saved objects end up holding the same UUID
If that re-initialization in the middle doesn't happen (or hasn't happened yet), saveObject1 and saveObject2 will have the same value, because both point at the same object reference (initObj's UUID reference).
An issue like this seems more likely than an actual UUID collision, especially if you can reproduce it. Naturally, if it doesn't happen all the time it's probably something more complex, like a rare race condition where initObj doesn't get re-initialized in time, causing saveObject1 and saveObject2 to share the same object reference.

Choosing a database type for a decentralized calendar project

I am developing a decentralised calendar system. It should store the data on each device and synchronise when both devices have an internet connection. My first idea was to just use a relational database and try to synchronise the data after a connection is established. But the theory says something else: Brewer's CAP theorem describes the trade-offs involved, though I am not sure whether this theorem is perhaps outdated. If I apply it, I end up with an "AP" (Availability/Partition tolerance) system: "A" because I need the calendar data available at any given time, and "P" because it can happen that there is no connection between the devices and the data can't be synchronised. The example databases are CouchDB, Riak or Cassandra. I have only worked with relational databases and don't know how to proceed. Is it that bad to use a relational database for my project?
This is for my bachelor thesis. I just wanted to start using Postgres, but then I found this theorem...
The whole project is based on Java.
I think the CAP theorem isn't really helpful to your scenario. Distributed systems that deal with partitions need to decide what to do when one part wants to make a modification to the data but can't reach the other part. One solution is to make the write wait - and this is giving up "availability" because of the "partition", one of the options presented by the CAP theorem. But there are more useful options. The most useful (highly available) option is to allow both parts to be written independently and to reconcile the conflicts when they can connect again. The question is how to do that, and different distributed systems choose different approaches.
Some systems, like Cassandra or Amazon's DynamoDB, use "last writer wins": when we see two conflicting writes, the last one (according to some synchronized clock) wins. For this approach to make sense you need to be very careful about how you model your data (e.g., watch out for cases where the conflict resolution results in an invalid mixture of two states).
In other systems (and also in Cassandra and DynamoDB, in their "collection" types) writes can still happen independently on different nodes, but there is more sophisticated conflict resolution. A good example is Cassandra's "list": one client can send an update saying "add item X to the list" and another can send "add item Y to the list". If these updates happen on different partitions, the conflict is later resolved by adding both X and Y to the list. A data structure like this list, which allows the content to be modified independently in certain ways on two nodes and then automatically reconciled in a sensible way, is known as a Conflict-free Replicated Data Type (CRDT).
Finally, another approach was used in Amazon's Dynamo paper (not to be confused with their current DynamoDB service!), known as "vector clocks": when you want to write to an object - e.g., a shopping cart - you first read its current state and get with it a "vector clock", which you can think of as the "version" of the data you got. You then make the modification (e.g., add an item to the shopping cart) and write back the new version, saying which old version you started from. If two of these modifications happen in parallel on different partitions, the two updates later need to be reconciled. The vector clocks allow the system to determine whether one modification is "newer" than the other (in which case there is no conflict) or whether they really do conflict. When they do, application-specific logic is used to reconcile the conflict. In the shopping-cart example, if the conflict is that item A was added in one partition and item B was added in the other, the straightforward resolution is to just add both items A and B to the shopping cart.
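To make the vector-clock comparison a bit more concrete, here is a minimal Java sketch (a toy illustration of the idea described above, not any particular library's API):

import java.util.HashMap;
import java.util.Map;

// Toy vector clock: one logical counter per device/node id.
public class VectorClock {
    private final Map<String, Long> counters = new HashMap<>();

    // Record a local update made on the given node.
    public void increment(String nodeId) {
        counters.merge(nodeId, 1L, Long::sum);
    }

    // True if this version is at least as new as 'other' for every node,
    // i.e. this version descends from (or equals) the other one.
    public boolean descendsFrom(VectorClock other) {
        for (Map.Entry<String, Long> e : other.counters.entrySet()) {
            if (counters.getOrDefault(e.getKey(), 0L) < e.getValue()) {
                return false;
            }
        }
        return true;
    }

    // Neither version descends from the other: the updates were concurrent,
    // so application-specific reconciliation logic has to decide the outcome.
    public boolean conflictsWith(VectorClock other) {
        return !this.descendsFrom(other) && !other.descendsFrom(this);
    }
}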
You should probably pick one of these approaches. Just saying "the CAP theorem doesn't let me do this" is usually not an option ;-) In fact, in some ways the problem you're facing is different from that of some of the systems I mentioned. In those systems, the common case is that every node is always connected (no partition) with very low latency, and they want this common case to be fast. In your case you can probably assume the opposite: the two parts are usually not connected, or if they are connected there is high latency, so conflict resolution becomes the norm rather than the exception. So you need to decide how to do this conflict resolution: what happens if a meeting is added on one device and a different meeting on the other device (most likely, just keep both as two meetings), and how do you know whether one device modified a pre-existing meeting rather than added a second one (vector clocks? unique meeting IDs? etc.), so the conflict resolution ends up fixing the existing meeting instead of adding a second one? And so on. Once you do that, where you store the data on the two sides (probably completely different database implementations on the client and the server) and which protocol you use to send the updates become implementation details.
There's another issue you'll need to consider: when do we do these reconciliations? In many of the systems listed above, reconciliation happens on read: if a client wants to read data and we suddenly see two conflicting versions on two reachable nodes, we reconcile. In your calendar application you need a slightly different approach: it is possible that the client will only ever try to read (use) the calendar while not connected, so you need to use the rare moments when it is connected to reconcile all the differences. Moreover, you may need to "push" changes - e.g., if the data on the server changed, the client may need to be told "hey, I have some changed data, come and reconcile", so the end user immediately sees, for example, a new meeting that was added remotely (perhaps by a different user sharing the same calendar). You'll need to figure out how you want to do this. Again, there is no magic solution like "use Cassandra".

Android: Random Generation in a Networked Situation (Multiplayer)

I wrote a class that, given a seed and a difficulty, returns a playing field for my game. The generation is consistent (no matter what, the same seed and difficulty level will always result in the same play field). As far as I know all Android devices use Java 1.6, so here are my questions:
Is it safe to send only the seed and difficulty to other devices in a multiplayer environment?
Do I need to worry about when Google updates the Java version level from 1.6? Or will they likely update all Android devices to that version level (I am assuming the Random class will have been changed)? And if not, what would be a good way to detect whether the Random class is different?
Rephrased: what precautionary measures should be in place to ensure that the class java.util.Random, which my field-generation class uses heavily, will result in the same play field on every device? Or, alternatively, would it be wiser to consider sending all play field data to the non-hosting device(s)?
I could probably accomplish the latter with a reliable message of size:
byte[ROWS * COLUMNS]
In advance, I appreciate any guidance/suggestions in this matter. This is a difficult issue to search for so some links for future views may be appropriate.
There are a few options here, but I guess I was hoping for some magic JVM property defining the java.util.Random class revision.
The first option is to check the Java version and compare it against the other device's version. If they are the same, it is safe (as far as I know) to assume that the Random class is the same and thus the seed and difficulty can be sent. If, however, they are different, you either send all the data, or you check the documentation/version release notes yourself to see when the Random class was changed and then decide, based on the previously acquired Java version identifier, whether all the data should be sent.
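A rough sketch of that version check (the exchange of the version string between devices is left out; SeedSafetyCheck is just an illustrative name):

// Sketch of option one: exchange runtime version strings first (the network
// exchange itself is omitted) and only send the bare seed + difficulty when
// both devices report the same version; otherwise fall back to the full field.
public class SeedSafetyCheck {

    public static boolean sameRuntime(String remoteJavaVersion) {
        String local = System.getProperty("java.version");
        return local != null && local.equals(remoteJavaVersion);
    }

    // Usage (pseudo):
    //   if (SeedSafetyCheck.sameRuntime(peerVersion)) { send seed + difficulty }
    //   else { send the full byte[ROWS * COLUMNS] play field }
}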
The second option is to simply always send all the data. Which is what I will personally be doing.
If you're not as lucky as I am and your data exceeds the value of Multiplayer.MAX_RELIABLE_MESSAGE_LEN (in bytes), you may have to break the data into multiple messages, which could get ugly but is entirely doable.

Why is my appengine IdGeneratorStrategy generating huge numbers?

I just moved my code from one machine to another and released it, and suddenly it created an entry with a key of "576728208506880". So I re-released the exact same code from the original machine and created another field, and this time the key created was "21134006".
Can anyone shed any light on why this might be?!
Thanks,
J
It's perfectly normal. App Engine generates numeric IDs between 0 and 2^53 and scatters them throughout the entire range:
http://googlecloudplatform.blogspot.ca/2013/05/update-on-datastore-auto-ids.html
You can hack around it a bit by using the legacy auto id policy in your settings.
Appengine datastore IDs are not generated sequentially.
(Imagine that you had a burst of 1,000 new entities created in the same second - the short answer is that AppEngine needs a strategy to generate IDs that won't collide).
See this answer for more details and a potential solution.
See "Assigning Identifiers" of the AppEngine docs for more information.

Duplication detection for 3K incoming requests per second, recommended data structure/algorithm?

I'm designing a system where a service endpoint (probably a simple servlet) will have to handle 3K requests per second (the data will be HTTP POSTed).
These requests will then be stored in MySQL.
The key issue I need guidance on is that there will be a high percentage of duplicate data posted to this endpoint.
I only need to store unique data in MySQL, so what would you suggest I use to handle the duplication?
The posted data will look like:
<root>
<prop1></prop1>
<prop2></prop2>
<prop3></prop3>
<body>
maybe 10-30K of text in here
</body>
</root>
I will write a method that will hash prop1, prop2 and prop3 to create a unique hash code (body can be different and still be considered unique).
I was thinking of creating some sort of concurrent dictionary that will be shared across requests.
There are more chances of duplication of posted data within a period of 24 hours, so I can purge data from this dictionary every x hours.
Any suggestions on the data structure to store the duplicates? And what about purging, and how many records should I keep, considering 3K requests per second - it will get large very fast.
Note: there are 10K different sources that will be posting, and duplication only occurs within a given source. That means I could have more than one dictionary, maybe one per group of sources, to spread things out. If source1 posts data and then source2 posts data, the chances of duplication are very, very low. But if source1 posts 100 times in a day, the chances of duplication are very high.
Note: please ignore for now the task of saving the posted data to MySQL, as that is another issue on its own; duplicate detection is the first hurdle I need help with.
Interesting question.
I would probably look at some kind of HashMap-of-HashMaps structure here, where the first level of HashMaps uses the sources as keys and the second level contains the actual data (the minimum needed for detecting duplicates), using your hash-code function for the keys. For the actual implementation, Java's ConcurrentHashMap would probably be the choice.
This way you have also set up the structure to partition your incoming load by source if you need to distribute the load over several machines.
With regard to purging, I think you have to measure the exact behaviour with production-like data. You need to learn how quickly the data grows when you successfully eliminate duplicates and how it ends up distributed across the HashMaps. With a good distribution and not-too-quick growth, I can imagine it is good enough to do a cleanup occasionally; otherwise an LRU policy might be good.
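A rough sketch of that nested-map idea (the class and method names are illustrative only; the purge interval and key format are up to you):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch: outer map keyed by source, inner map keyed by the hash of
// (prop1, prop2, prop3); the value stores the insertion time so old
// entries can be purged later.
public class DuplicateTracker {
    private final ConcurrentMap<String, ConcurrentMap<String, Long>> bySource =
            new ConcurrentHashMap<>();

    // Returns true if this (source, key) combination has not been seen before.
    public boolean markIfNew(String source, String propsHash) {
        ConcurrentMap<String, Long> seen =
                bySource.computeIfAbsent(source, s -> new ConcurrentHashMap<>());
        return seen.putIfAbsent(propsHash, System.currentTimeMillis()) == null;
    }

    // Occasional cleanup: drop entries older than maxAgeMillis.
    public void purgeOlderThan(long maxAgeMillis) {
        long cutoff = System.currentTimeMillis() - maxAgeMillis;
        for (ConcurrentMap<String, Long> seen : bySource.values()) {
            seen.values().removeIf(ts -> ts < cutoff);
        }
    }
}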
It sounds like you need a hashing structure that can add and check the existence of a key in constant time. In that case, try a Bloom filter. Be careful: this is a probabilistic structure, i.e. it may tell you that a key exists when it does not, but you can make the probability of a false positive extremely low if you tweak the parameters carefully.
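For illustration, a minimal version of such a check using Guava's BloomFilter might look like this (the expected-insertions and false-positive figures are only examples; keep in mind the false-positive caveat above and the edit below):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class ProbabilisticDedup {
    // Example figures only: size for the expected number of keys per purge
    // window and pick a target false-positive rate.
    private final BloomFilter<CharSequence> seen = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.0001);

    // Note: mightContain() can return true for a key that was never added,
    // which is exactly the false-positive behaviour discussed above.
    public synchronized boolean isProbablyDuplicate(String propsHash) {
        boolean maybeSeen = seen.mightContain(propsHash);
        seen.put(propsHash);
        return maybeSeen;
    }
}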
Edit: OK, so Bloom filters are not acceptable. To still get constant-time lookups (albeit not constant-time insertion), look into cuckoo hashing.
1) Set up your database like this:
ALTER TABLE Root ADD UNIQUE INDEX(Prop1, Prop2, Prop3);
INSERT INTO Root (Prop1, Prop2, Prop3, Body) VALUES (#prop1, #prop2, #prop3, #body)
ON DUPLICATE KEY UPDATE Body=#body
2) You don't need any algorithms or fancy hashing ADTs
shell> mysqlimport [options] db_name textfile1 [textfile2 ...]
http://dev.mysql.com/doc/refman/5.1/en/mysqlimport.html
Make use of the --replace or --ignore flags, as well as --compress.
3) All your Java will do is...
a) generate CSV files: use the StringBuffer class, then every X seconds or so swap in a fresh StringBuffer and pass the .toString() of the old one to a thread that flushes it to a file such as /temp/SOURCE/TIME_STAMP.csv (a rough sketch follows after this list)
b) occasionally kick off a Runtime.getRuntime().exec of the mysqlimport command
c) delete the old CSV files if space is an issue, or archive them to network storage/backup device
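A very rough sketch of steps (a) and (b); the file path, database name, flush interval and CSV layout are placeholders:

import java.io.FileWriter;
import java.io.IOException;

// Rough sketch of (a) and (b): rows are appended to an in-memory buffer, which
// is swapped out and written to a CSV file every so often, then mysqlimport is
// invoked on that file. Real code would also need CSV escaping for the body.
public class CsvFlusher {
    private StringBuffer buffer = new StringBuffer();

    public synchronized void append(String prop1, String prop2, String prop3, String body) {
        buffer.append(prop1).append(',').append(prop2).append(',')
              .append(prop3).append(',').append(body).append('\n');
    }

    // Call this from a timer/thread every X seconds.
    public synchronized void flush() throws IOException, InterruptedException {
        StringBuffer old = buffer;
        buffer = new StringBuffer();                         // swap in a fresh buffer
        String path = "/temp/source1/" + System.currentTimeMillis() + ".csv";
        try (FileWriter out = new FileWriter(path)) {
            out.write(old.toString());
        }
        // (b) shell out to mysqlimport with --replace/--ignore as described above
        Runtime.getRuntime()
               .exec(new String[] {"mysqlimport", "--replace", "--compress", "db_name", path})
               .waitFor();
    }
}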
Well, you're basically looking for some kind of extremely large HashMap and something like
if (map.put(key, val) != null) // send data
There are lots of different HashMap implementations available, but you could look at NBHM (Cliff Click's non-blocking hash map). Non-blocking puts, and designed with large, scalable problems in mind, it could work just fine. The map also has iterators that do NOT throw a ConcurrentModificationException while you use them to traverse the map, which is basically a requirement for removing old data as I see it. Also, putIfAbsent is all you actually need - but I have no idea whether that's more efficient than a simple put; you'd have to ask Cliff or check the source.
The trick then is to avoid resizing of the map by making it large enough to begin with - otherwise the throughput will suffer while resizing (which could be a problem). And think about how to implement the removal of old data - probably with some idle thread that traverses an iterator and removes old entries.
Use a java.util.concurrent.ConcurrentHashMap for building a map of your hashes, but make sure you have the correct initialCapacity and concurrencyLevel assigned to the map at creation time.
The api docs for ConcurrentHashMap have all the relevant information:
initialCapacity - the initial capacity. The implementation performs internal sizing to accommodate this many elements.
concurrencyLevel - the estimated number of concurrently updating threads. The implementation performs internal sizing to try to accommodate this many threads.
You should be able to use putIfAbsent for handling the 3K requests per second as long as you have initialized the ConcurrentHashMap the right way - make sure this is tuned as part of your load testing.
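A minimal sketch of that tuning plus the putIfAbsent check (the sizing figures are illustrative and should come out of your own load tests):

import java.util.concurrent.ConcurrentHashMap;

public class HashIndex {
    // Illustrative sizing only: initial capacity roughly the number of hashes
    // kept per purge window, concurrencyLevel roughly the number of threads
    // updating the map at once.
    private final ConcurrentHashMap<String, Boolean> seen =
            new ConcurrentHashMap<>(10_000_000, 0.75f, 64);

    // putIfAbsent returns null only for the first insertion of a key,
    // so a non-null result means the hash was already recorded.
    public boolean isDuplicate(String requestHash) {
        return seen.putIfAbsent(requestHash, Boolean.TRUE) != null;
    }
}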
At some point, though, trying to handle all the requests in one server may prove to be too much, and you will have to load-balance across servers. At that point you may consider using memcached for storing the index of hashes, instead of the CHP.
The interesting problems that you will still have to solve, though, are:
loading all of the hashes into memory at startup
determining when to knock off hashes from the in-memory map
If you use a strong hash function, such as MD5 or SHA-1, you will not need to store any of the original data at all. The probability of an accidental duplicate is virtually nil, so if you see the same hash result twice, the second occurrence is a duplicate.
Given that MD5 is 16 bytes and SHA-1 is 20 bytes, this should decrease memory requirements, therefore keeping more elements in the CPU cache and dramatically improving speed.
Storing these keys requires little more than a small hash table followed by trees to handle collisions.
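For example, deriving such a key from the three properties with the JDK's MessageDigest might look like this (SHA-1 is used here; the separator byte is an added detail to keep field boundaries unambiguous):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

public class RequestKeys {
    // Build a fixed-size key from the three properties; the separator byte
    // avoids ("ab","c") and ("a","bc") producing the same digest input.
    public static String keyFor(String prop1, String prop2, String prop3)
            throws NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");   // or "MD5"
        for (String prop : new String[] {prop1, prop2, prop3}) {
            sha1.update(prop.getBytes(StandardCharsets.UTF_8));
            sha1.update((byte) 0);
        }
        return Base64.getEncoder().encodeToString(sha1.digest());  // 20 bytes before encoding
    }
}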

How to reduce the number of file writes when there are multiple threads?

Here's the situation.
In a Java web app I was assigned to maintain, I've been asked to improve the general response time for the stress tests during QA. This web app doesn't use a database, since it was supposed to be light and simple. (And I can't change that decision.)
To persist configuration, I've found that every time you make a change to it, a general object containing lists of config objects is serialized to a file.
Using JMeter I've found that in the given test case there are 2 requests taking up most of the time. Both of these requests add or change some configuration objects. Since access to the file must be synchronized, when many users are changing config the file must be fully written several times within a few seconds, and requests are left waiting for the file write to happen.
I think all these serializations are unnecessary: we are rewriting most of the objects again and again, and the changes in every request affect a single object, yet the file is written as a whole every time.
So, is there a way to reduce the number of real file writes but still guarantee that all changes are eventually serialized?
Any suggestions appreciated
One option is to make the changes in memory and keep one background thread that runs at given intervals and flushes the changes to disk. Keep in mind that in the case of a crash you'll lose any data that wasn't flushed.
The background thread could be scheduled with a ScheduledExecutorService.
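A minimal sketch of that approach (ConfigStore and the 5-second interval are placeholders for the existing serialization code and whatever window of data loss you can tolerate on a crash):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: requests only mark the in-memory configuration as dirty; a single
// background task serializes it to disk at a fixed interval.
public class ConfigFlusher {
    private final AtomicBoolean dirty = new AtomicBoolean(false);
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start(ConfigStore store) {
        scheduler.scheduleAtFixedRate(() -> {
            if (dirty.getAndSet(false)) {
                store.serializeToFile();   // the existing "write the whole file" step
            }
        }, 5, 5, TimeUnit.SECONDS);        // interval is an example figure
    }

    // Called by request threads after they change configuration in memory.
    public void markDirty() {
        dirty.set(true);
    }

    interface ConfigStore {
        void serializeToFile();
    }
}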
IMO, it would be a better idea to use a DB. Can't you use an embedded DB like Java DB, H2 or HSQLDB? These databases support concurrent access and can also guarantee the consistency of the data in case of a crash.
If you absolutely cannot use a database, the obvious solution is to break your single file into multiple files, one file per config object. It would speed up serialization and the output process, and it reduces lock contention (requests that change different config objects can write their files simultaneously, though the whole thing may become IO-bound).
One way is to do what Lucene does and not actually overwrite the old file at all, but write a new file that only contains the "updates". This relies on your updates being associative, but that is usually the case anyway.
The idea is that if your old file contains "8" and you get 3 updates, you write "3" to the new file and the new state is "11"; next you write "-2" and you now have "9". Periodically you can aggregate the old file and the updates. Any physical file you write is never updated, but it may be deleted once it is no longer used.
To make this idea a bit more relevant, consider that the numbers above are records of some kind: "3" could translate to "add three new records" and "-2" to "delete these two records".
Lucene is an example of a project that uses this style of additive update strategy very successfully.
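A small sketch of that additive-update idea (the directory path and the serializable "delta" objects are placeholders; the periodic merge/compaction step is omitted):

import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch: each change is appended as its own small file; a periodic job would
// merge the base snapshot with the update files and delete the old ones.
public class UpdateLog {
    private final Path dir = Paths.get("/path/to/config-updates");   // placeholder path

    public synchronized void append(Serializable update) throws IOException {
        Path file = dir.resolve(System.nanoTime() + ".update");
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(update);   // only the delta, not the whole config object
        }
    }
}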
