Hashing algorithm to use for generating unique ids? - java

I have objects with the following properties:
class MyObject
{
    int sourceId();
    String id();
}
If I use id alone as the identifier, there could be collisions, since records may share the same id but have different sourceIds.
Therefore I'm looking into generating a hash of sourceId and id and using that as the unique id for each record. I was thinking of just MD5-ing String.valueOf(sourceId + id), but it seems that MD5 collisions are not as uncommon as I'd like.
Which other algorithm would be recommended for this: something that produces a fast hash and where a collision would also be very improbable?

If the id() String has a fixed length, you can simply concatenate the sourceId and the id:
public String getUniqueID()
{
    return sourceId() + id();
}
If id() doesn't have a fixed length, you can pad it with zeroes (for example) to obtain a fixed length and then concatenate it to sourceId() as before, as in the sketch below.
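A minimal sketch of that padding variant (the fixed width of 10 is an arbitrary assumption, and it assumes id() contains no spaces):
public String getUniqueID()
{
    // Left-pad id() to a fixed width so that e.g. sourceId 1 with id "23"
    // can never collide with sourceId 12 and id "3".
    String paddedId = String.format("%10s", id()).replace(' ', '0');
    return sourceId() + paddedId;
}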

Assuming this value can be a String, I'd just concatenate both values with a hyphen:
class MyObject
{
    int sourceId;
    String id;

    String getUniqueKey() {
        return sourceId + "-" + id;
    }
}
Then you can recover the original values using value.split("-") (provided id itself never contains a hyphen).

Related

Hazelcast not working correctly with SqlPredicate and Index on optional field

We are storing complex objects in Hazelcast maps and need the possibility to search for objects not only based on the key but also on the content of these complex objects. In order to not take too large a performance hit, we are using indices on those search terms.
We are also using spring-data-hazelcast, which provides repositories that allow us to use findByAbcXyz()-style semantic queries. For some of the more complex queries we are using the @Query annotation (which spring-data-hazelcast internally translates to SqlPredicates).
We have now encountered an issue where, under certain circumstances, these @Query based search methods did not return any values, even though we could verify that the searched objects did in fact exist in the map.
I have managed to reproduce this issue with core hazelcast (i.e. without the use of spring-data-hazelcast).
Here is our object structure:
BetriebspunktKey.java
public class BetriebspunktKey implements Serializable {
    private Integer uicLand;
    private Integer nummer;

    public BetriebspunktKey(final Integer uicLand, final Integer nummer) {
        this.uicLand = uicLand;
        this.nummer = nummer;
    }

    public Integer getUicLand() {
        return uicLand;
    }

    public Integer getNummer() {
        return nummer;
    }
}
Betriebspunkt.java
public class Betriebspunkt implements Serializable {
    private BetriebspunktKey key;
    private List<BetriebspunktVersion> versionen;

    public Betriebspunkt(final BetriebspunktKey key, final List<BetriebspunktVersion> versionen) {
        this.key = key;
        this.versionen = versionen;
    }

    public BetriebspunktKey getKey() {
        return key;
    }

    public List<BetriebspunktVersion> getVersionen() {
        return versionen;
    }
}
BetriebspunktVersion.java
public class BetriebspunktVersion implements Serializable {
    private List<BetriebspunktKey> zusatzbetriebspunkte;

    public BetriebspunktVersion(final List<BetriebspunktKey> zusatzbetriebspunkte) {
        this.zusatzbetriebspunkte = zusatzbetriebspunkte;
    }

    public List<BetriebspunktKey> getZusatzbetriebspunkte() {
        return zusatzbetriebspunkte;
    }
}
In my main file, I am now setting up hazelcast:
Config config = new Config();
final MapConfig mapConfig = config.getMapConfig("points");
mapConfig.addMapIndexConfig(new MapIndexConfig("versionen[any].zusatzbetriebspunkte[any].nummer", false));
HazelcastInstance instance = Hazelcast.newHazelcastInstance(config);
IMap<BetriebspunktKey, Betriebspunkt> map = instance.getMap("points");
I am also preparing my search criteria for later on:
Predicate equalPredicate = Predicates.equal("versionen[any].zusatzbetriebspunkte[any].nummer", 53090);
Predicate sqlPredicate = new SqlPredicate("versionen[any].zusatzbetriebspunkte[any].nummer=53090");
Next, I am creating two objects, one with the "full depth" of information, the other does not contain any "zusatzbetriebspunkte":
final Betriebspunkt abc = new Betriebspunkt(
        new BetriebspunktKey(80, 166),
        Collections.singletonList(new BetriebspunktVersion(
                Collections.singletonList(new BetriebspunktKey(80, 53090))
        ))
);
final Betriebspunkt def = new Betriebspunkt(
        new BetriebspunktKey(83, 141),
        Collections.singletonList(new BetriebspunktVersion(
                Collections.emptyList()
        ))
);
Here is where things become interesting. If I first insert the "full" object into the map, the search works using both the EqualPredicate and the SqlPredicate:
map.put(abc.getKey(), abc);
map.put(def.getKey(), def);
Collection<Betriebspunkt> equalResults = map.values(equalPredicate);
Collection<Betriebspunkt> sqlResults = map.values(sqlPredicate);
assertEquals(1, equalResults.size()); // contains "abc"
assertEquals(1, sqlResults.size()); // contains "abc"
However, if I insert the objects into my map in reverse order (i.e. first the "partial" object and then the "full" one), only the EqualPredicate works correctly, the SqlPredicate returns an empty list, no matter what the content of the map or the search criteria.
map.put(def.getKey(), def);
map.put(abc.getKey(), abc);
Collection<Betriebspunkt> equalResults = map.values(equalPredicate);
Collection<Betriebspunkt> sqlResults = map.values(sqlPredicate);
assertEquals(1, equalResults.size()); // contains "abc"
assertEquals(1, sqlResults.size()); // --> this fails, it returns an empty list
What is the reason for this behaviour? It looks like a bug in the hazelcast code.
The reason for failing
After a lot of debugging, I have found the reason for this issue. It can indeed be found in the Hazelcast code.
When putting a value into a Hazelcast map, DefaultRecordStore.putInternal is called. At the end of this method, DefaultRecordStore.saveIndex is called, which finds the corresponding indexes and then calls Indexes.saveEntryIndex. This method iterates over each index and calls InternalIndex.saveEntryIndex (or rather its implementation, IndexImpl.saveEntryIndex). The interesting lines of that method are:
if (this.converter == null || this.converter == TypeConverters.NULL_CONVERTER) {
    this.converter = entry.getConverter(this.attributeName);
}
Apparently each index stores a converter when the first element is put into the map. Looking at QueryableEntry.getConverter explains what happens:
TypeConverter getConverter(String attributeName) {
    Object attribute = this.getAttributeValue(attributeName);
    if (attribute == null) {
        return TypeConverters.NULL_CONVERTER;
    } else {
        AttributeType attributeType = this.extractAttributeType(attributeName, attribute);
        return attributeType == null ? TypeConverters.IDENTITY_CONVERTER : attributeType.getConverter();
    }
}
When first inserting the "full" object, extractAttributeType() will follow the "path" of our index definition "versionen[any].zusatzbetriebspunkte[any].nummer" and find out that nummer is an integer type; accordingly, a TypeConverters.IntegerConverter will be returned and stored.
When first inserting the "partial" object, "zusatzbetriebspunkte[any]" is empty, so there is no way for extractAttributeType to find out what type nummer has; it therefore returns null, which means that TypeConverters.IdentityConverter is used.
Also, whenever a "full" element is inserted, an entry is written into the index map using nummer as its key, i.e. the index-map is keyed by an Integer.
So much for writing to the map. Let's now look at how data is read from it. When calling map.values(predicate), we eventually get to QueryRunner.runUsingGlobalIndexSafely, which contains the line:
Collection<QueryableEntry> entries = indexes.query(predicate);
which in turn, after some boilerplate code, calls
Set<QueryableEntry> result = indexAwarePredicate.filter(queryContext);
For both of our predicates we will eventually get to IndexImpl.getRecords() which looks as follows:
public Set<QueryableEntry> getRecords(Comparable attributeValue) {
    long timestamp = this.stats.makeTimestamp();
    if (this.converter == null) {
        this.stats.onIndexHit(timestamp, 0L);
        return new SingleResultSet((Map)null);
    } else {
        Set<QueryableEntry> result = this.indexStore.getRecords(this.convert(attributeValue));
        this.stats.onIndexHit(timestamp, (long)result.size());
        return result;
    }
}
The crucial call is this.convert(attributeValue) where attributeValue is the value of the predicate.
If we compare our two predicates, we can see that the EqualPredicate has two members:
attributeName = "versionen[any].zusatzbetriebspunkte[any].nummer"
value = {Integer} 53090
The SqlPredicate contains the initial string (which we passed to its constructor), but at construction time it was also parsed and mapped to an internal EqualPredicate (which is eventually used when evaluating the predicate and passed to getRecords() above):
sql = "versionen[any].zusatzbetriebspunkte[any].nummer=53090"
predicate = {EqualPredicate}
    attributeName = "versionen[any].zusatzbetriebspunkte[any].nummer"
    value = {String} "53090"
And this explains why the manually created EqualPredicate works in both cases: Its value is an integer. When passed to the converter, it does not matter whether it is the IntegerConverter or the IdentityConverter, as both will return the integer which can then be used as key in the index-map (which uses an integer as key).
With the SqlPredicate however, the value is a String. If this is passed to the IntegerConverter, it is converted to its corresponding integer value and accessing the index-map works. If it is passed to the IdentityConverter, the string is returned by the conversion and trying to access the index-map with a string will never find any results.
A possible solution
How can we solve this issue? I see several possibilities:
insert a "fully built" dummy value into our map during startup to ensure the converter is correctly initialised. While this works, it is ugly and not maintenance friendly
avoid using SqlPredicate and use the integer based EqualPredicate. This is not an option when working with spring-data-hazelcast as it always converts #Query based searches to SqlPredicates. We could of course use hazelcast directly and circumvent the spring-data wrapper but while that would work it means having two ways of accessing hazelcast which is also not very maintainable
use hazelcast's ValueExtractor class. This is the elegant solution that works both natively and using spring-data-hazelcast. I will outline what that looks like:
First we need to implement a value extractor which returns all zusatzbetriebspunkte of our Betriebspunkt in a form suitable for us
public class BetriebspunktExtractor extends ValueExtractor<Betriebspunkt, String> implements Serializable {
    @Override
    public void extract(final Betriebspunkt betriebspunkt, final String argument, final ValueCollector valueCollector) {
        betriebspunkt.getVersionen().stream()
                .map(BetriebspunktVersion::getZusatzbetriebspunkte)
                .flatMap(List::stream)
                .map(zbp -> zbp.getUicLand() + "_" + zbp.getNummer())
                .forEach(valueCollector::addObject);
    }
}
You'll notice that I am not only returning the nummer field but also including the uicLand field. This is something we really wanted but couldn't get working using the "...[any]..." notation. We could of course return only the nummer if we wanted the exact same behavior as outlined above.
Now we need to modify our hazelcast configuration slightly:
Config config = new Config();
final MapConfig mapConfig = config.getMapConfig("points");
//mapConfig.addMapIndexConfig(new MapIndexConfig("versionen[any].zusatzbetriebspunkte[any].nummer", false));
mapConfig.addMapIndexConfig(new MapIndexConfig("zusatzbetriebspunkt", false));
mapConfig.addMapAttributeConfig(new MapAttributeConfig("zusatzbetriebspunkt", BetriebspunktExtractor.class.getName()));
You'll notice that the "long" index definition using the "...[any]..." notation is no longer needed.
Now we can use this "pseudo attribute" to query our values and it doesn't matter in which order the objects have been added to the map:
Predicate keyPredicate = Predicates.equal("zusatzbetriebspunkt", "80_53090");
Collection<Betriebspunkt> keyResults = map.values(keyPredicate);
assertEquals(1, keyResults.size()); // always contains "abc"
And in our spring-data-hazelcast repository we can now do this:
@Query("zusatzbetriebspunkt=%d_%d")
List<StammdatenBetriebspunkt> findByZusatzbetriebspunkt(Integer uicLand, Integer nummer);
If you do not need to use spring-data-hazelcast, instead of returning a string to the ValueCollector, you could return the BetriebspunktKey directly and then use it in the predicate as well. That would be the cleanest solution:
public class BetriebspunktExtractor extends ValueExtractor<Betriebspunkt, String> implements Serializable {
    @Override
    public void extract(final Betriebspunkt betriebspunkt, final String argument, final ValueCollector valueCollector) {
        betriebspunkt.getVersionen().stream()
                .map(BetriebspunktVersion::getZusatzbetriebspunkte)
                .flatMap(List::stream)
                //.map(zbp -> zbp.getUicLand() + "_" + zbp.getNummer())
                .forEach(valueCollector::addObject);
    }
}
and then
Predicate keyPredicate = Predicates.equal("zusatzbetriebspunkt", new BetriebspunktKey(80, 53090));
However, for this to work, BetriebspunktKey needs to implement Comparable and must also provide its own equals and hashCode methods.
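A minimal sketch of what those methods could look like (ordering by uicLand first and then nummer is my own assumption):
public class BetriebspunktKey implements Serializable, Comparable<BetriebspunktKey> {
    // ... fields, constructor and getters as shown earlier ...

    @Override
    public int compareTo(final BetriebspunktKey other) {
        final int cmp = uicLand.compareTo(other.uicLand);
        return cmp != 0 ? cmp : nummer.compareTo(other.nummer);
    }

    @Override
    public boolean equals(final Object o) {
        if (this == o) return true;
        if (!(o instanceof BetriebspunktKey)) return false;
        final BetriebspunktKey that = (BetriebspunktKey) o;
        return uicLand.equals(that.uicLand) && nummer.equals(that.nummer);
    }

    @Override
    public int hashCode() {
        return 31 * uicLand.hashCode() + nummer.hashCode();
    }
}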

Generating ID randomly and persisting it in Java efficiently

I need to generate a number of length 12, say variable finalId.
Out of those 12 digits, 5 digits are to be taken from another value, say partialId1.
Now finalId = partialId1 (5 digits) + partialId2 (7 digits).
I need to generate partialId2 randomly, for which I can use Java's Random class.
Finally I have to insert this finalId into the database as a primary key.
So, to make sure the newly generated finalId does not already exist in the Oracle database, I need to query the database as well.
Is there a more efficient way than the one mentioned above to generate the id in Java and check the database before persisting it?
In general, making one id from another has issues because you may be clumping together two things that would be easier to keep separate. Specifically, you may be trying to squeeze a foreign key into a primary key when you could just use two keys.
In any case, if you really want to build a semi-random primary key from a stub, then I would suggest doing it bitwise, because that way it'll be easy to extract the original id in both SQL and Java.
As has been mentioned, if you generate a UUID then you don't really need to worry about checking if it's already used, otherwise you probably will want to.
That said the code for making your ids could look like this:
import java.security.SecureRandom;
import java.util.UUID;

public class IdGenerator {
    private SecureRandom random = new SecureRandom();
    private long preservedMask;
    private long randomMask;

    public void init(int preservedBits) {
        this.preservedMask = (1L << preservedBits) - 1;
        this.randomMask = ~this.preservedMask;
    }

    public long makeIdFrom(long preserved) {
        return (this.random.nextLong() & this.randomMask) | (preserved & this.preservedMask);
    }

    public UUID makeUuidFrom(long preserved) {
        UUID uuid = UUID.randomUUID();
        return new UUID(uuid.getMostSignificantBits(),
                (uuid.getLeastSignificantBits() & this.randomMask) | (preserved & this.preservedMask));
    }

    public boolean idsMatch(long id1, long id2) {
        return (id1 & this.preservedMask) == (id2 & this.preservedMask);
    }
}
Essentially this preserves a number of least-significant bits of your original id; you specify how many when you call init.
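Hypothetical usage, preserving the low 20 bits of the original id:
IdGenerator generator = new IdGenerator();
generator.init(20);

long id = generator.makeIdFrom(12345L);        // random high bits, low 20 bits kept from 12345
boolean same = generator.idsMatch(id, 12345L); // true: both ids preserve the same stub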
I would prefer the Java UUID.
You can get a random id from UUID using the code below:
String id = UUID.randomUUID().toString().substring(0,7);
System.out.println("id "+ id);
You can then append it to your other partial id and have a unique key or primary constraint in the DB, depending on the column where you want to store it, as suggested by @Erwin.
Note: we have done this in the past for many primary keys and never had a case where an id collided.

Hadoop Custom Partitioner not behaving according to the logic

Based on this example here, this works. I have tried the same on my dataset.
Sample Dataset:
OBSERVATION;2474472;137176;
OBSERVATION;2474473;137176;
OBSERVATION;2474474;137176;
OBSERVATION;2474475;137177;
Considering each line as a string, my Mapper output is:
key -> string[2], value -> string.
My Partitioner code:
@Override
public int getPartition(Text key, Text value, int reducersDefined) {
    String keyStr = key.toString();
    if (keyStr == "137176") {
        return 0;
    } else {
        return 1 % reducersDefined;
    }
}
In my data set most ids are 137176. Reducers declared: 2. I expect two output files, one for 137176 and a second for the remaining ids. I'm getting two output files, but the ids are evenly distributed across both. What's going wrong in my program?
Explicitly set in the Driver method that you want to use your custom Partitioner, by using: job.setPartitionerClass(YourPartitioner.class);. If you don't do that, the default HashPartitioner is used.
Change the String comparison from == to .equals(), i.e., change if (keyStr == "137176") { to if (keyStr.equals("137176")) {.
To save some time, it will perhaps be faster to declare a new Text variable at the beginning of the partitioner, like this: Text KEY = new Text("137176"); and then, without converting your input key to a String every time, just compare it with the KEY variable (again using the equals() method). But perhaps those are equivalent. So, what I suggest is:
Text KEY = new Text("137176");

@Override
public int getPartition(Text key, Text value, int reducersDefined) {
    return key.equals(KEY) ? 0 : 1 % reducersDefined;
}
Another suggestion: if the network load is heavy, parse the map output key as a VIntWritable and change the Partitioner accordingly.
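A hypothetical sketch of that variant (the class name is my own):
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.VIntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class IdPartitioner extends Partitioner<VIntWritable, Text> {
    private static final VIntWritable KEY = new VIntWritable(137176);

    @Override
    public int getPartition(VIntWritable key, Text value, int reducersDefined) {
        // Same logic as above: partition 0 for id 137176, partition 1 for everything else.
        return key.equals(KEY) ? 0 : 1 % reducersDefined;
    }
}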

Meaning of initializing the Big Decimal to -99 in the below code

I know Hashtable doesn't allow null keys... but how is the below code working?
And what does initializing the BigDecimal to -99 in the code below do?
private static final BigDecimal NO_REGION = new BigDecimal(-99);

public List getAllParameters(BigDecimal region, String key) {
    List values = null;
    if (region == null) {
        region = NO_REGION;
    }
    Hashtable paramCache = (Hashtable) CacheManager.getInstance().get(ParameterCodeConstants.PARAMETER_CACHE);
    if (paramCache.containsKey(region)) {
        values = (List) ((Hashtable) paramCache.get(region)).get(key);
    }
    return values;
}
I have been struggling with this for a long time and don't understand it.
This is an implementation of the null object pattern: a special object, BigDecimal(-99), is designated to play the role of null in a situation where "real" nulls are not allowed.
The only requirement is that the null object must be different from all "regular" objects. This way, the next time the program needs to find entries with no region, all it needs to do is a lookup by the NO_REGION key.
Regions are identified by a BigDecimal in the hashtable (key) - when no region is provided (null) a default value of -99 is used.
It just looks like poor code to me - if something that short makes you "struggle for a long time", that is usually the best indicator.
Just cleaning it up a little and it probably will make a lot more sense:
private static Hashtable paramCache = (Hashtable) CacheManager.getInstance().get(ParameterCodeConstants.PARAMETER_CACHE);

public List getAllParameters(BigDecimal region, String key) {
    List values = null;
    if (region != null && paramCache.containsKey(region)) {
        Hashtable regionMap = (Hashtable) paramCache.get(region);
        values = (List) regionMap.get(key);
    }
    return values;
}
It seems the writer into the hashtable used NO_REGION as the key for values without a region, so the reader is doing the same thing.
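A hedged sketch of what that writer side might look like (the population code is not shown in the question, so this is an assumption):
private static void cacheParameters(BigDecimal region, String key, List values) {
    Hashtable paramCache = (Hashtable) CacheManager.getInstance().get(ParameterCodeConstants.PARAMETER_CACHE);
    // Mirror the reader: null regions are stored under the NO_REGION sentinel key.
    BigDecimal regionKey = (region == null) ? NO_REGION : region;
    Hashtable regionMap = (Hashtable) paramCache.get(regionKey);
    if (regionMap == null) {
        regionMap = new Hashtable();
        paramCache.put(regionKey, regionMap);
    }
    regionMap.put(key, values);
}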

Creating a unique yet ordered order number for a customer's order

In my Java app I need to generate a unique order number for a customer's order. I thought the time of creation of the order is a good enough unique value; two orders cannot be created at the same second.
To prevent others from using the order number by guessing some creation-time value, I appended part of a hash of the creation-time string to it and made that the final order number string.
Is there any unseen pitfall in this approach? Creating the order number based on the time of creation was intended to give a sort order for the created orders in the system.
The code is given here
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;

public static String createOrderNumber(Date orderDate) throws NoSuchAlgorithmException {
    DateFormat df = new SimpleDateFormat("yyyyMMddHHmmss");
    String datestring = df.format(orderDate);
    System.out.println("datestring=" + datestring);
    System.out.println("datestring size=" + datestring.length());
    String hash = makeHashString(datestring); // SHA-1 digest rendered as a hex (radix-16) string
    System.out.println("hash=" + hash);
    System.out.println("hash size=" + hash.length());
    int datestringlen = datestring.length();
    String ordernum = datestring + hash.substring(datestringlen, datestringlen + 5);
    System.out.println("ordernum size=" + ordernum.length());
    return ordernum;
}

private static String makeHashString(String plain) throws NoSuchAlgorithmException {
    final int HEX_RADIX = 16; // radix for BigInteger.toString, not a length
    final String HASH_ALGORITHM = "SHA1";
    MessageDigest md = MessageDigest.getInstance(HASH_ALGORITHM);
    md.update(plain.getBytes());
    BigInteger hashint = new BigInteger(1, md.digest());
    return hashint.toString(HEX_RADIX);
}
A sample output is
datestring=20110924103251
datestring size=14
hash=a9bcd51fc69d9225c5d96061d9c8628137df14e0
hash size=40
ordernum size=19
ordernum=2011092410325125c5d
One potential issue can arise if your application runs on a cluster of servers.
In that case, if this code happens to execute simultaneously in two JVMs, the same order numbers will be generated.
If that is not the case, unique order number generation based on dates sounds OK to me.
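If the application does run on a single JVM, a minimal sketch that also removes the same-second collision risk would be to append an atomic counter to the timestamp (the class name and digit widths are my own choices):
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.concurrent.atomic.AtomicLong;

public class OrderNumberGenerator {
    private final AtomicLong counter = new AtomicLong();

    public String next() {
        // The 14-digit timestamp keeps order numbers roughly time-ordered;
        // the 5-digit counter disambiguates orders created in the same second.
        String datestring = new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
        return datestring + String.format("%05d", counter.getAndIncrement() % 100000);
    }
}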
I don't really understand the point of the hash here.
From a cryptographic point of view it doesn't really add security to your code. For a "malicious" client to guess an order number, it's enough to know that a SHA-1 hash is applied; the algorithm itself is known and can be applied to determine the order number.
Hope this helps
If a unique id is needed, it should always come from a third-party system common to all nodes, and the receiving/calculating method should be synchronized so that generation happens sequentially; alternatively, it can be generated by the database system, which is almost always unique.
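A minimal JDBC sketch of the database variant (the sequence name ORDER_SEQ is hypothetical):
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import javax.sql.DataSource;

public final class SequenceIds {
    // An Oracle sequence hands out unique values atomically, so no
    // check-before-insert query against the table is needed.
    public static long nextOrderId(DataSource dataSource) throws SQLException {
        try (Connection con = dataSource.getConnection();
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT ORDER_SEQ.NEXTVAL FROM DUAL")) {
            rs.next();
            return rs.getLong(1);
        }
    }
}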
