I have a dense symmetric matrix of size about 30000 X 30000 that contains distances between strings. Since the distance is symmetric, the upper triangle of the matrix is stored in a tab-separated 3-column file of the form
stringA<tab>stringB<tab>distance
I am using HashMap and org.javatuples.Pair to build a map for quickly looking up the distance for a given pair of strings, as follows:
import org.javatuples.Pair;
HashMap<Pair<String,String>,Double> pairScores = new HashMap<Pair<String,String>,Double>();
BufferedReader bufferedReader = new BufferedReader(new FileReader("data.txt"));
String line = null;
while((line = bufferedReader.readLine()) != null) {
String [] parts = line.split("\t");
String d1 = parts[0];
String d2 = parts[1];
Double score = Double.parseDouble(parts[2]);
Pair<String,String> p12 = new Pair<String,String>(d1,d2);
Pair<String,String> p21 = new Pair<String,String>(d2,d1);
pairScores.put(p12, score);
pairScores.put(p21, score);
}
data.txt is very big (~400M lines) and the process eventually slows down to a crawl with most time being spent in java.util.HashMap.put.
I don't think there should be (m)any hash code collisions on pairs, but I might be wrong. How can I verify this? Is it enough to simply look at how unique p12.hashCode() and p21.hashCode() are?
If there are no collisions, what else could be causing the slowdown?
Is there a better way to construct this matrix for quick lookup?
I am now using Guava's Table<Integer, Integer, Double>, after realizing that my strings are unique enough that I could use their hashes, instead of the strings themselves, as keys to reduce memory requirements. The table is created in reasonable time. However, there are issues with serializing and deserializing the resulting objects: I ran into out-of-memory errors even after the move from String to Integer. It seems to work now that I no longer store both the a-b and b-a pairs, but I might be balancing on the edge of what my machine can handle.
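One way to make the move from string keys to integer keys concrete is to give each distinct string a dense index (0, 1, 2, ...) instead of using its hash, and to keep a single triangle of the matrix in a flat primitive array. The sketch below is only an illustration under stated assumptions, not the Guava Table approach described above: the class name TriangularDistances and its methods are hypothetical, and it assumes the number of distinct strings is known up front. For 30000 strings the array holds about 450 million doubles (roughly 3.6 GB), so the heap has to be sized accordingly.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: symmetric distance lookup backed by one packed triangle.
final class TriangularDistances {
    private final Map<String, Integer> ids = new HashMap<>();
    private final double[] values;
    private final int n;

    TriangularDistances(int n) {
        this.n = n;
        // One triangle including the diagonal: n * (n + 1) / 2 entries.
        this.values = new double[(int) ((long) n * (n + 1) / 2)];
    }

    // Assigns the next free dense index to a string seen for the first time.
    private int idOf(String s) {
        Integer id = ids.get(s);
        if (id == null) {
            if (ids.size() >= n) {
                throw new IllegalStateException("more than " + n + " distinct strings");
            }
            id = ids.size();
            ids.put(s, id);
        }
        return id;
    }

    // Position of the pair (i, j) in the packed array; the order of i and j does not matter.
    private static int index(int i, int j) {
        int hi = Math.max(i, j);
        int lo = Math.min(i, j);
        return (int) ((long) hi * (hi + 1) / 2 + lo);
    }

    void put(String a, String b, double distance) {
        values[index(idOf(a), idOf(b))] = distance;
    }

    double get(String a, String b) {
        Integer i = ids.get(a);
        Integer j = ids.get(b);
        if (i == null || j == null) {
            throw new IllegalArgumentException("unknown string");
        }
        return values[index(i, j)];
    }
}
```

Each line of the input file would then need a single put(d1, d2, score) call, and get(d2, d1) returns the same value without a second entry being stored.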
I am implementing hashing with random access files in Java and need to handle collisions. I need a method that generates keys from a name in a way that minimizes collisions. With the method I have now, inserting 100 records produces 95 collisions.
Note that the hash method I use is the division (modulo) method, and the input string has length 6.
Are there possible improvements to this method, or alternatives?
public int hashCode(String nombre ) {
int hash = 1;
hash = hash*31 + nombre.hashCode();
System.out.println("hsh " +hash);
return Math.abs(hash);
}
Your code boils down to:
hash = 31 + nombre.hashCode();
Equal strings therefore always produce the same value, so any repeated name is a collision.
You should change this to something more meaningful.
public int hashCode(String nombre ) {
int hash = new Random().nextInt(); // PLEASE NOTE YOU SHOULD NOT CREATE NEW RANDOM EVERY TIME. CREATE IT ONCE AND JUST USE nextInt()
hash = hash*31 + nombre.hashCode();
System.out.println("hsh " +hash);
return Math.abs(hash);
}
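Note that a hash with a random component returns a different value on every call, so a record stored under a name could not be found again by hashing the same name. As a minimal deterministic sketch of the division method the question mentions: the class name NameHash, the method bucketFor and the tableSize parameter are hypothetical, and choosing a table size that is prime and comfortably larger than the number of records is a common suggestion rather than something taken from the question.

```java
// Hypothetical helper for the division (modulo) method.
final class NameHash {
    /** Maps a name to a slot in [0, tableSize). Equal names always give equal slots. */
    static int bucketFor(String nombre, int tableSize) {
        if (tableSize <= 0) {
            throw new IllegalArgumentException("tableSize must be positive");
        }
        // Math.floorMod never returns a negative value for a positive modulus,
        // whereas Math.abs returns a negative number for Integer.MIN_VALUE.
        return Math.floorMod(nombre.hashCode(), tableSize);
    }
}
```

Collisions between different names are still possible, so the file layout needs a strategy such as probing or overflow buckets in any case.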
I'm making an Android app and have written some code that generates a code. The generated code is checked against the database to see whether it is already in use. If it is, another code is generated, and so on until an unused one is found. This is done with a do-while loop. While there are few or no codes in the database, the user will not notice any delay. However, if there are loads of codes, there will be a noticeable delay, won't there? The code is below:
public static String generateCode(DBAdapter myDB, String mgrName){
String[] name = mgrName.split(" +");
String fName = name[0];
String lName = name[1];
String fLetter = fName.substring(0, 1).toUpperCase();
String lLetter = lName.substring(0, 3).toUpperCase();
int randomPIN = (int) (Math.random() * 9000) + 1000;
String pin = String.valueOf(randomPIN);
String letters = new StringBuilder().append(fLetter).append(lLetter).append(pin).toString();
Boolean result = checkCode(myDB, letters);
if(result == true){
return(letters);
}
else{
String code = "";
Boolean resultfail = false;
do{
int randomPINFail = (int) (Math.random() * 9000) + 1000;
String generatedCode = new StringBuilder().append(fLetter).append(lLetter).append(randomPINFail).toString();
Boolean check = checkCode(myDB, generatedCode);
if(check){
resultfail = true;
code = generatedCode;
}
}while(!resultfail);
return code;
}
}
public static Boolean checkCode(DBAdapter myDB, String code){
Cursor cursor = myDB.getRowComplaint(code);
if(cursor.getCount() == 0){
return true;
}
else{
return false;
}
}
My question is: what is the chance that the generator picks a code that is already in use so many times in a row that the user notices a delay? Bear in mind that the generator will use different manager names as well as different numbers. Is this code safe to use? If not, what can be done to make it safe?
EDIT: I can't use UUID as the user has requested the code to contain four letters and four digits. The code is used to retrieve data from the db and that's why it needs to be unique.
As with any performance related question, there's no way for us to answer - you should profile it yourself by creating a large number of existing rows and then seeing how slow it is.
In terms of safety, that's a very broad term and I don't know anything about what you're using these codes for, so I couldn't comfortably tell you that the code is safe. But there don't seem to be any horrible problems with the way you're accessing the database.
Just use the UUID class or another built-in pseudo-random number generator; don't reinvent the wheel. In theory, the collision rate is so small that in the vast majority of cases you'll generate a unique ID on the first try. But once again, it depends on your use case. I assume that you're doing something sane and not generating and storing millions upon millions of those codes on a mobile device.
Be sure not to invoke this routine from the main thread, otherwise the user might notice the delay even if your DB is empty.
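As a rough estimate for the scheme in the question (a sketch under stated assumptions, not a measurement): for a given four-letter prefix there are 9000 possible PINs, so if k of them are taken, each draw collides with probability k/9000 and the expected number of draws is 9000/(9000 - k). The loop only becomes slow when a prefix is nearly exhausted, and it never terminates once all 9000 PINs for that prefix are in use. The class below is hypothetical; it uses a Set<String> as a stand-in for the database lookup and caps the number of attempts.

```java
import java.util.Random;
import java.util.Set;

// Hypothetical stand-in for the question's generator, with a Set instead of the database.
final class CodeGenerator {
    static final int PIN_COUNT = 9000; // PINs 1000..9999, as in the question

    /** Expected number of random draws until a free PIN is hit when `used` PINs are taken. */
    static double expectedAttempts(int used) {
        if (used >= PIN_COUNT) {
            return Double.POSITIVE_INFINITY; // no free PIN left: the original loop would spin forever
        }
        return (double) PIN_COUNT / (PIN_COUNT - used);
    }

    /** Tries up to maxAttempts random PINs for the prefix; returns null if none was free. */
    static String generate(String prefix, Set<String> inUse, Random rnd, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            String code = prefix + (1000 + rnd.nextInt(PIN_COUNT));
            if (!inUse.contains(code)) {
                return code;
            }
        }
        return null;
    }
}
```

With half of the PINs for a prefix taken, the expected number of draws is only 2, so a delay is unlikely to be noticeable until a prefix is almost full.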
I have gone to this site many times and found answers to my questions, but it's finally time for me to post one of my own! The objective of a particular class in my software is to generate random passwords of fixed length, made up of 'low' ASCII characters. The main catch is that I never want to generate the same password twice; uniqueness must always be guaranteed. Initially I used a HashMap to hash each password generated so far and checked against it each time I created a new one before returning. However, Java HashMap objects are limited in size, and eventually the map would become too saturated to maintain acceptable retrieval time. The following is my latest crack at the problem:
package gen;
import java.util.Set;
import java.util.Random;
import java.util.HashSet;
public class Generator {
Random r;
int length;
Set<String> seen;
public Generator(int l){
seen = new HashSet<String>();
length = l;
r = new Random();
r.setSeed(System.currentTimeMillis());
}
public String generate(){
String retval = "";
int i = 0;
while(i<length){
int rand = r.nextInt(93)+33;
if(rand!=96){
retval+= (char)rand;
i++;
}
}
return retval;
}
public String generateNoRepeat(){
String retval;
int i;
do{
retval ="";
i = 0;
while(i<length){
int rand = r.nextInt(93)+33;
if(rand!=96){
retval+= (char)rand;
i++;
}
}
}while(!seen.add(retval));
return retval;
}
}
Edit: Thanks so much for the Set suggestion. It makes my code so much cleaner now too!
I may decide to just use the dumb generator method to fill up a BlockingQueue and just multithread it to death...
Further clarification: This is not meant to generate secure passwords. It must simply guarantee that it will eventually generate all possible passwords and only once for a given length and character set.
Note:
I have taken everyone's insight and have come to the conclusion that sequentially generating the possible passwords and storing them to the disk is probably my best option. Either that or simply allow duplicate passwords and supplement the inefficiency with multiple Generator threads.
Why not just encrypt sequential numbers?
Let n be the first number in your sequence (don't start with zero). Let e be some encryption algorithm (e.g. RSA).
Then your passwords are e(n), e(n+1), e(n+2), ...
But I strongly agree with Greg Hewgill and Ted Hopp: avoiding duplicates is more trouble than it is worth.
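The idea behind encrypting a counter is that a one-to-one mapping of 0, 1, 2, ... onto the password space can never repeat. A minimal sketch of that idea follows, with a simple multiplicative permutation swapped in for real encryption (so the order is scrambled but easy to predict, which matches the question's note that security is not required). The class name SequentialPasswords and the multiplier 1000003 are hypothetical choices; the alphabet is the question's 92 characters (ASCII 33 to 125 without 96).

```java
import java.math.BigInteger;

// Hypothetical sketch: every password of a fixed length exactly once, in scrambled order.
final class SequentialPasswords {
    // The question's alphabet: ASCII 33..125 without 96 (the backtick), 92 characters.
    private static final char[] ALPHABET = buildAlphabet();
    // 92 = 2 * 2 * 23, so a multiplier divisible by neither 2 nor 23 is coprime to
    // 92^length, which makes n -> n * MULTIPLIER mod 92^length a bijection.
    private static final BigInteger MULTIPLIER = BigInteger.valueOf(1_000_003L);

    private final int length;
    private final BigInteger total; // number of possible passwords: 92^length
    private BigInteger counter = BigInteger.ZERO;

    SequentialPasswords(int length) {
        if (length < 1) {
            throw new IllegalArgumentException("length must be at least 1");
        }
        this.length = length;
        this.total = BigInteger.valueOf(ALPHABET.length).pow(length);
    }

    private static char[] buildAlphabet() {
        StringBuilder sb = new StringBuilder();
        for (char c = 33; c <= 125; c++) {
            if (c != 96) {
                sb.append(c);
            }
        }
        return sb.toString().toCharArray();
    }

    /** Returns the next password, or null once every possible password has been returned. */
    String next() {
        if (counter.compareTo(total) >= 0) {
            return null;
        }
        BigInteger scrambled = counter.multiply(MULTIPLIER).mod(total);
        counter = counter.add(BigInteger.ONE);
        // Write the scrambled number as exactly `length` base-92 digits.
        BigInteger base = BigInteger.valueOf(ALPHABET.length);
        char[] out = new char[length];
        for (int i = length - 1; i >= 0; i--) {
            BigInteger[] qr = scrambled.divideAndRemainder(base);
            out[i] = ALPHABET[qr[1].intValue()];
            scrambled = qr[0];
        }
        return new String(out);
    }
}
```

Only the counter has to be persisted between runs; no set of previously seen passwords is needed.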
I usually use the UUID class to generate unique IDs. This works fine if the IDs are used by technical systems only, since those don't care how long they are:
System.out.println(UUID.randomUUID().toString());
> 67849f28-c0af-46c7-8421-94f0642e5d4d
Is there a nice way to create user-friendly unique IDs (like those from tinyurl) that are a bit shorter than UUIDs? Use case: you want to send out IDs via mail to your customers, who in turn visit your site and enter that ID into a form, like a voucher ID.
I assume that UUIDs get generated equally through the whole range of the 128 Bit range of the UUID. So would it be safe to use just the lower 64 bits, for instance?
System.out.println(UUID.randomUUID().getLeastSignificantBits());
Any feedback is welcome.
I assume that UUIDs get generated equally through the whole range of the 128 Bit range of the UUID.
First off, your assumption may be incorrect, depending on the UUID type (1, 2, 3, or 4). From the Java UUID docs:
There exist different variants of these global identifiers. The methods of this class are for manipulating the Leach-Salz variant, although the constructors allow the creation of any variant of UUID (described below).
The layout of a variant 2 (Leach-Salz) UUID is as follows: The most significant long consists of the following unsigned fields:
0xFFFFFFFF00000000 time_low
0x00000000FFFF0000 time_mid
0x000000000000F000 version
0x0000000000000FFF time_hi
The least significant long consists of the following unsigned fields:
0xC000000000000000 variant
0x3FFF000000000000 clock_seq
0x0000FFFFFFFFFFFF node
The variant field contains a value which identifies the layout of the UUID. The bit layout described above is valid only for a UUID with a variant value of 2, which indicates the Leach-Salz variant.
The version field holds a value that describes the type of this UUID. There are four different basic types of UUIDs: time-based, DCE security, name-based, and randomly generated UUIDs. These types have a version value of 1, 2, 3 and 4, respectively.
The best way to do what you're doing is to generate a random string with code that looks something like this (source):
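import java.nio.charset.StandardCharsets;
import java.util.Random;

public class RandomString {
    // One shared generator; creating a new Random on every call is wasteful.
    private static final Random RN = new Random();

    // Returns a random string of lowercase letters with a length in [lo, hi].
    public static String randomstring(int lo, int hi) {
        int n = rand(lo, hi);
        byte[] b = new byte[n];
        for (int i = 0; i < n; i++)
            b[i] = (byte) rand('a', 'z');
        // The String(byte[], int) constructor is deprecated; decode as ASCII instead.
        return new String(b, StandardCharsets.US_ASCII);
    }

    // Returns a random int in [lo, hi].
    private static int rand(int lo, int hi) {
        // nextInt(n) is never negative, so no sign correction is needed.
        return lo + RN.nextInt(hi - lo + 1);
    }

    public static String randomstring() {
        return randomstring(5, 25);
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        System.out.println(randomstring());
    }
}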
If you're incredibly worried about collisions or something, I suggest you base64 encode your UUID which should cut down on its size.
Moral of the story: don't rely on individual parts of UUIDs as they are holistically designed. If you do need to rely on individual parts of a UUID, make sure you familiarize yourself with the particular UUID type and implementation.
Here is another approach for generating user friendly IDs:
http://thedailywtf.com/Articles/The-Automated-Curse-Generator.aspx
(But you should go for the bad-word-filter)
Any UUID/Guid is just 16 bytes of data. These 16 bytes can easily be encoded using BASE64 (or BASE64url) and then stripped of all the "=" characters at the end of the string.
This gives a nice, short string which still holds the same data as the UUID/Guid. In other words, it is possible to recreate the UUID/Guid from that data if such becomes necessary.
Here's a way to generate a URL-friendly 22-character UUID:
public static String generateShortUuid() {
UUID uuid = UUID.randomUUID();
long lsb = uuid.getLeastSignificantBits();
long msb = uuid.getMostSignificantBits();
byte[] uuidBytes = ByteBuffer.allocate(16).putLong(msb).putLong(lsb).array();
// Strip down the '==' at the end and make it url friendly
return Base64.encode(uuidBytes)
.substring(0, 22)
.replace("/", "_")
.replace("+", "-");
}
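The snippet above relies on a Base64.encode method whose library is not shown. As a sketch of the same idea using only the JDK, assuming Java 8 or later so that java.util.Base64 is available: the URL-safe encoder without padding already yields 22 characters with "-" and "_", and decoding restores the original UUID. The class name ShortUuid is hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.UUID;

// Hypothetical helper: 22-character URL-safe form of a UUID, and back again.
final class ShortUuid {
    static String encode(UUID uuid) {
        byte[] bytes = ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
        // URL-safe alphabet, no trailing '=' padding: 16 bytes become 22 characters.
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    static UUID decode(String shortForm) {
        ByteBuffer buf = ByteBuffer.wrap(Base64.getUrlDecoder().decode(shortForm));
        long msb = buf.getLong();
        long lsb = buf.getLong();
        return new UUID(msb, lsb);
    }
}
```

Because the encoding is reversible, the short form carries exactly the same information as the full UUID.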
For your use case, it would be better to track a running count of registered users and, for each value, generate a string token like this:
public static String longToReverseBase62(long value /* must be positive! */) {
final char[] LETTERS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ".toCharArray();
StringBuilder result = new StringBuilder(9);
do {
result.append(LETTERS[(int)(value % 62)]);
value /= 62L;
}
while (value != 0);
return result.toString();
}
For security reasons, it would be better to make the values non-sequential, so each time a user registers you can increment the value by, let's say, 1024. Since the value must be positive, this still leaves room for 2^63 / 2^10 = 2^53 users, which is quite certainly more than you'd ever need :)
At the time of this writing, this question's title is:
How to create user friendly unique IDs, UUIDs or other unique identifiers in Java
The question of generating a user-friendly ID is a subjective one. If you have a unique value, there are many ways to format it into a "user-friendly" one, and they all come down to mapping unique values one-to-one with "user-friendly" IDs — if the input value was unique, the "user-friendly" ID will likewise be unique.
In addition, it's not possible in general to create a random value that's also unique, at least if each random value is generated independently of any other. There are also several things you should ask yourself if you want to generate unique identifiers (these come from my section on unique random identifiers):
Can the application easily check identifiers for uniqueness within the desired scope and range (e.g., check whether a file or database record with that identifier already exists)?
Can the application tolerate the risk of generating the same identifier for different resources?
Do identifiers have to be hard to guess, be simply "random-looking", or be neither?
Do identifiers have to be typed in or otherwise relayed by end users?
Is the resource an identifier identifies available to anyone who knows that identifier (even without being logged in or authorized in some way)?
Do identifiers have to be memorable?
In your case, you have several conflicting goals: You want identifiers that are unique, random, and easy to type by end users. But other things you should think about are:
Are other users allowed to access the resource identified by the ID, whenever they know the ID? If not, then additional access control or a longer key length will be necessary.
Can your application tolerate the risk of duplicate keys? If so, then the keys can be completely randomly generated (such as by a cryptographic RNG such as java.security.SecureRandom in Java). If not, then your goal will be harder to achieve, especially for keys intended for security purposes.
Also, if you want IDs that have to be typed in by end users, you should consider choosing a character set carefully or allowing typing mistakes to be detected.
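To illustrate the last point, here is a minimal sketch, assuming a hypothetical 32-character alphabet that leaves out 0/O and 1/I, with one check character appended. The check character is a plain sum modulo 32, which catches any single mistyped character but not two swapped characters. FriendlyId and its methods are made-up names, and uniqueness still has to be checked against your own store.

```java
import java.security.SecureRandom;

// Hypothetical sketch: typeable random IDs with one trailing check character.
final class FriendlyId {
    // Digits and uppercase letters without 0, O, 1 and I, which are easy to confuse.
    private static final String ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZ";
    private static final SecureRandom RNG = new SecureRandom();

    /** Returns len random characters followed by one check character. */
    static String generate(int len) {
        StringBuilder sb = new StringBuilder(len + 1);
        int sum = 0;
        for (int i = 0; i < len; i++) {
            int v = RNG.nextInt(ALPHABET.length());
            sum += v;
            sb.append(ALPHABET.charAt(v));
        }
        sb.append(ALPHABET.charAt(sum % ALPHABET.length()));
        return sb.toString();
    }

    /** Returns true if all characters are in the alphabet and the check character matches. */
    static boolean isValid(String id) {
        if (id == null || id.length() < 2) {
            return false;
        }
        int sum = 0;
        for (int i = 0; i < id.length() - 1; i++) {
            int v = ALPHABET.indexOf(id.charAt(i));
            if (v < 0) {
                return false;
            }
            sum += v;
        }
        return ALPHABET.indexOf(id.charAt(id.length() - 1)) == sum % ALPHABET.length();
    }
}
```

A form handler could call isValid before hitting the database, so that most typos are rejected without a lookup.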
Only for you :) :
private final static char[] idchars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();
// Shared generator: seeding a new Random with the current time on every call
// would return the same ID for calls made within the same millisecond.
private static final Random r = new Random();
private static String createId(int len) {
    char[] id = new char[len];
    for (int i = 0; i < len; i++) {
        id[i] = idchars[r.nextInt(idchars.length)];
    }
    return new String(id);
}
How about this one? This code returns at most 13 characters (digits and lowercase letters).
import java.nio.ByteBuffer;
import java.util.UUID;
/**
* Generate short UUID (13 characters)
*
* @return short UUID
*/
public static String shortUUID() {
UUID uuid = UUID.randomUUID();
long l = ByteBuffer.wrap(uuid.toString().getBytes()).getLong();
return Long.toString(l, Character.MAX_RADIX);
}
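One caveat with the snippet above: it reads the first 8 bytes of the UUID's text form, that is, its first 8 hexadecimal digits, so the result carries at most 32 bits of the UUID. A possible variant, offered as a sketch rather than a drop-in fix, folds the UUID's two 64-bit halves together and prints the result as an unsigned base-36 number, which also stays within 13 characters. The class name ShortId is hypothetical, and 64 bits are still far fewer than a full UUID, so collisions are more likely than with the complete value.

```java
import java.util.UUID;

// Hypothetical variant: up to 13 base-36 characters derived from all 128 bits of the UUID.
final class ShortId {
    static String shortId(UUID uuid) {
        // Fold the two 64-bit halves into a single long.
        long mixed = uuid.getMostSignificantBits() ^ uuid.getLeastSignificantBits();
        // Unsigned conversion avoids a leading minus sign; radix 36 uses 0-9 and a-z.
        return Long.toUnsignedString(mixed, Character.MAX_RADIX);
    }
}
```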