I am dealing with fixed-length records, with fixed-length fields. Some of these fields are sensitive ... think account number. Let's say the account number in my record is defined as a maximum of 19 bytes. I would like to find (or create) a hash of the account number, the result of which is itself no more than 19 bytes. This way I can still correlate records by this field, the original value is not recoverable, and importantly my fixed-length record and field sizes are not changed. Basically, for any field a, f(a) = a' where sizeof(a) == sizeof(a'). Is this possible, even if not cryptographically secure?
If you want to limit the size of the hash to 19 bytes, you could simply truncate a standard hash. Obviously, this increases the chance of hash collisions (two account numbers hashing to the same value).
See also this question which discusses truncation.
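As a minimal sketch of the truncation approach in Java (the choice of SHA-256 and the class name are my assumptions, not from the answers above):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class AccountTokenizer {
    // Hash the account number and keep only the first fieldLength bytes of the digest.
    public static byte[] truncatedHash(String accountNumber, int fieldLength) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(accountNumber.getBytes(StandardCharsets.US_ASCII));
        return Arrays.copyOf(digest, fieldLength);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] token = truncatedHash("1234567890123456789", 19);
        System.out.println(token.length); // 19, same size as the original field
    }
}

Note that the truncated digest is raw binary; if the field must hold printable characters, you would need to re-encode it (hex, Base64, or digits only), which reduces how many digest bytes fit into 19 characters.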
However, the original values may be recoverable by brute force. The number of account numbers will probably not be huge, so someone can enumerate them all, run them through the same hashing algorithm, and determine the original account number of a given record. This is a real vulnerability which has been exploited in practice to de-anonymise data.
I can't answer your question directly, but what you are looking for is referred to as "tokenization," I believe. One reason to use tokenization instead of a simple digest or hashing scheme is to avoid problems related to collisions. Some providers even perform this exact type of tokenization for things like replacing a credit card number with a valid (as far as the format is concerned) token that can be processed like a regular credit card number without exposing any sensitive information.
I am adding correlation IDs to our public and private APIs, to be able to trace a request's progress through the logs.
UUIDs, being long strings, take up a lot of space. I need a compact alternative to a UUID for the correlation ID.
It will be OK if a correlationId repeats after a fixed period (say 2 months), since API requests older than that won't need to be traced.
I have considered using java.util.Random's nextLong(), but it does not guarantee that values won't repeat.
Also, I understand that SecureRandom can pose some performance issues, and I don't need the correlation IDs to be secure.
It would be good to have other options considered.
If you can accept IDs up to 8 characters long, the number of possible IDs depends on the character set of those IDs.
For hexadecimal characters (16-character set), the number of IDs is 4294967296 or 2^32.
For A-Z, 0-9 (36-character set), the number of IDs is 2821109907456 or about 2^41.
For base64 or base64url (64-character set), the number of IDs is 281474976710656 or 2^48.
You should decide which character set to use, and ask yourself whether your application can check IDs for uniqueness, or whether you can tolerate the risk of duplicate IDs (so that generating them randomly with SecureRandom won't be a problem). Also note the following:
The risk of duplicates depends on the number of possible IDs. Roughly speaking, once your application has randomly generated about the square root of the number of possible IDs, the risk of duplicates becomes non-negligible (which, in the case of hexadecimal IDs, happens after just 65536 IDs). See "Birthday problem" for a more precise statement and formulas.
If your application is distributed across multiple computers, you might choose to assign each computer a unique value to include in the ID.
If you don't care about whether your correlation IDs are secure, you might choose to create unique IDs by doing a reversible operation on sequential IDs, such as a reversible mixing function or a linear congruential generator (a sketch follows these notes).
You say that SecureRandom "can pose some performance issues". You won't know if it will unless you try it and measure your application's performance. Generating IDs with SecureRandom may or may not be too slow for your purposes.
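By way of illustration (this sketch is mine, not part of the answer above): a sequential counter run through the SplitMix64 finalizer, a reversible mixing function on 64-bit values, so IDs look unrelated but cannot collide until the counter wraps. Rendered in base 36 the result is at most 13 characters, longer than 8 but far shorter than a UUID.

import java.util.concurrent.atomic.AtomicLong;

public class MixedSequenceId {
    private final AtomicLong counter = new AtomicLong();

    public String nextId() {
        return Long.toUnsignedString(mix64(counter.incrementAndGet()), 36);
    }

    // SplitMix64 finalizer: every step is invertible, so distinct inputs give distinct outputs.
    private static long mix64(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }
}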
For further considerations and advice, see what I write on "Unique Random Identifiers".
I have been asked to come up with a solution for the following: you have a file where each line represents a 10-digit phone number, and we need to tell whether a given 10-digit phone number is present in the file or not.
I came up with a trie data structure where each node's children are stored in a Map with an Integer key and a Trie value.
class Trie {
    boolean isEnd;
    Map<Integer, Trie> map = new HashMap<>();
}
I could also use an int[] array to store the children.
Since we only have digits ranging from 0-9, each one could be stored in just 4 bits. Why use 'int' or Integer as the data type? How can I reduce memory here?
How can we store these digits in a Map or an array without using int, given that we would otherwise end up wasting a lot of memory?
Moreover, is there a better solution than a trie?
If you're going for memory efficiency, I would actually advise against using a trie and recommend a different data structure. As I understand it, you are only interested in answering queries of the form "have I seen this phone number before?" While you could do this by treating the phone numbers as strings and throwing all of them into a trie, you wouldn't be taking advantage of the operations that tries are designed to support (fast prefix searching, retrieving elements in sorted order, etc.), so you'd be paying for things you wouldn't be using.
Moreover, let's think about the space usage of the trie. Even if every phone number had a long common prefix, each node in the trie requires space to store its child pointers. If you store even one (64-bit) pointer per node, you're using the same amount of space that you'd need to store a 10-digit phone number (which fits comfortably into a 64-bit integer). If the phone numbers don't have long shared prefixes, you're potentially storing ten pointers per number, a huge space blowup compared with just storing the numbers themselves.
Instead of throwing things into a trie, I'd consider just using a simple, vanilla hash table. After all, hash tables are specifically optimized to support membership queries and membership queries alone. Hashing phone numbers shouldn't be too bad, as they can be packed into 64-bit integers and hashed using a variety of simple hashing techniques. This lets you control what kind of time/space tradeoff you want to make (larger table sizes increase memory and decrease time, smaller tables increase time and decrease memory).
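As a rough sketch of that suggestion (the class and the file handling are my own, not from the answer), packing each number into a long and keeping a HashSet:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

public class PhoneNumberLookup {
    private final Set<Long> numbers = new HashSet<>();

    public PhoneNumberLookup(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            lines.map(String::trim)
                 .filter(line -> !line.isEmpty())
                 .forEach(line -> numbers.add(Long.parseLong(line))); // a 10-digit number fits easily in a long
        }
    }

    public boolean contains(String phoneNumber) {
        return numbers.contains(Long.parseLong(phoneNumber));
    }
}

HashSet boxes each value; if memory is still a concern, a primitive long set (for example from a library such as fastutil) keeps the 64-bit packing without the per-object overhead.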
I have a huge set of long integer identifiers that need to be distributed into (n) buckets as uniformly as possible. The long integer identifiers might have pockets of missing identifiers.
With that as the criteria, is there a difference between using the long integer as-is and taking it modulo n, versus generating a hashCode for the string version of the long integer (to improve the distribution) and then taking that modulo n? Is the extra string conversion necessary to get a uniform spread via the hash code?
Since I got feedback that my question does not have enough background information, I am adding some more.
The identifiers are basically auto-incrementing numeric row identifiers, autogenerated in a database, representing an item id. The pockets of missing identifiers are due to deletes.
The identifiers themselves are long integers.
The identifiers (items) are on the order of tens to hundreds of millions in some cases, and on the order of thousands in others.
Only in the case where the identifiers number in the millions do I really want to spread them out into buckets (identifier count >> bucket count) for storage in a NoSQL system (partitions).
I was wondering whether, because items get deleted, I should resort to (Long).toString().hashCode() to get a uniform spread instead of using the long numeric value directly. I had a feeling that doing toString().hashCode() would not gain me much, and I also did not like the fact that Java's hashCode does not guarantee the same value across Java revisions (though for String the hashCode implementation is documented and has been stable across releases for years).
There's no need to involve String.
new Integer(i).hashCode()
... gives you a hash - designed for the very purpose of evenly distributing into buckets.
new Integer(i).hashCode() % n
... will give you a number in the range you want.
However Integer.hashCode() is just:
return value;
So new Integer(i).hashCode() % n is equivalent to i % n.
Your question as it stands cannot be answered. @slim's answer is the best you will get, because crucial information is missing from your question.
To distribute a set of items, you have to know something about their initial distribution.
If they are uniformly distributed and the range of the inputs is significantly larger than the number of buckets, then slim's answer is the way to go. If either of those conditions doesn't hold, it won't work.
If the range of inputs is not significantly larger than the number of buckets, you need to make sure the range of inputs is an exact multiple of the number of buckets, otherwise the last buckets won't get as many items. For instance, with range [0-999] and 400 buckets, the first 200 buckets get items [0-199], [400-599] and [800-999] while the other 200 buckets get items [200-399] and [600-799].
That is, half of your buckets end up with 50% more items than the other half.
If they are not uniformly distributed, then since the modulo operator doesn't change the distribution except by wrapping it, the output distribution is not uniform either.
This is when you need a hash function.
But to build a hash function, you must know how to characterize the input distribution. The point of the hash function being to break the recurring, predictable aspects of your input.
To be fair, there are some hash functions that work fairly well on most datasets, for instance Knuth's multiplicative method (assuming not too large inputs). You might, say, compute
hash(input) = input * 2654435761 % 2^32
It is good at breaking clusters of values. However, it fails at divisibility. That is, if most of your inputs are divisible by 2, the outputs will be too. [credit to this answer]
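As a small Java illustration of that formula (class and method names are mine), multiplying modulo 2^32 and then reducing to the bucket count:

public final class Buckets {
    private static final long KNUTH = 2654435761L; // the multiplier quoted above

    static int bucketFor(long id, int bucketCount) {
        long hash = (id * KNUTH) & 0xFFFFFFFFL; // multiply, then keep the low 32 bits (i.e. mod 2^32)
        return (int) (hash % bucketCount);
    }

    public static void main(String[] args) {
        for (long id = 1000; id < 1010; id++) {
            System.out.println(id + " -> bucket " + bucketFor(id, 400));
        }
    }
}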
This gist has an interesting compilation of diverse hash functions and their characteristics; you might pick the one that best matches the characteristics of your dataset.
I have a Java application which works with a MySQL database.
I want to be able to store long texts and check whether the table already contains them. For this I want to use an index, and search by a reduced "hash" of full_text.
MY_TABLE [
full_text: TEXT
text_hash: varchar(255) - indexed
]
Thing is, I cannot use String.hashCode() as:
Implementation may vary across JVM versions.
Value is too short, which means many collisions.
I want to find a fast hashing function that will read the long text value and produce a long hash value for it, say 64 symbols long.
Such reliable hash methods are not fast. They're probably fast enough, though. You're looking for a cryptographic message digest method (like the ones used to identify files in P2P networks or commits in Git). Look for the MessageDigest class, and pick your algorithm (SHA1, MD5, SHA256, etc.).
Such a hash function will take bytes as argument, and produce bytes as a result, so make sure to convert your strings using a constant encoding (UTF8, for example), and to transform the produced byte array (typically of 16 or 20 bytes) to a readable String using hexadecimal or Base64 encoding.
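A minimal sketch of that approach (the class name is mine; SHA-256 is chosen because its hex form is exactly 64 characters, matching the length mentioned in the question):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class TextHasher {
    // SHA-256 digest of the text, hex-encoded: 32 bytes -> exactly 64 characters
    public static String sha256Hex(String text) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(text.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}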
I'd suggest that you revisit String.hashCode().
First, it does not vary across implementations. The exact hash is specified; see the String.hashCode javadoc specification.
Second, while the String hash algorithm isn't the best there possibly is (and certainly it will have more collisions than a cryptographic hash) it does do a reasonably good job of spreading the hashes over the 32-bit result space. For example, I did a quick check of a text file on my machine (/usr/share/dict/web2a) which has 235,880 words, and there were six collisions.
Third and fourth: String.hashCode() should be considerably faster, and the storage required for the hash values should be considerably smaller, than a cryptographic hash.
If you're storing strings in a database table, and their hash values are indexed, having a few collisions shouldn't matter. Looking up a string should get you the right database rows really quickly, and having to (maybe) check a couple actual strings should be very fast compared to the database I/O.
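As an illustrative sketch of that lookup (the table and column names come from the question; everything else is my assumption): fetch rows by the indexed hash, then compare the stored full_text to rule out collisions.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class TextLookup {
    static boolean contains(Connection connection, String text) throws SQLException {
        String sql = "SELECT full_text FROM MY_TABLE WHERE text_hash = ?";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setString(1, Integer.toString(text.hashCode()));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    if (text.equals(rs.getString("full_text"))) {
                        return true; // exact match found among the (few) rows sharing this hash
                    }
                }
            }
        }
        return false;
    }
}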
As previously discussed, confirmation emails should have a unique, (practically) un-guessable code, essentially a one-time password, in the confirmation link.
The UUID.randomUUID() docs say:
The UUID is generated using a cryptographically strong pseudo random number generator.
Does this imply that the UUID random generator in a properly implemented JVM is suitable for use as the unique, (practically) un-guessable OTP?
If you read the RFC that defines UUIDs, which is linked from the API docs, you'll see that not all bits of the UUID are actually random (the "variant" and the "version" are not random). So a type 4 UUID (the kind you intend to use), if implemented correctly, should have 122 bits of (secure, for this implementation) random information, out of a total size of 128 bits.
So yes, it will work as well as a 122-bit random number from a "secure" generator. But a shorter value may contain a sufficient amount of randomness and might be easier for a user (maybe I am the only old-fashioned person who still reads email in a terminal, but confirmation URLs that wrap across lines are annoying...).
No. According to the UUID spec:
Do not assume that UUIDs are hard to guess; they should not be used as security capabilities (identifiers whose mere possession grants access), for example. A predictable random number source will exacerbate the situation.
Also, UUIDs only use 16 possible characters (0 through f). You can generate a much more compact and explicitly secure random password using SecureRandom (thanks to @erickson).
import java.security.SecureRandom;
import java.math.BigInteger;

public final class PasswordGenerator {
    private SecureRandom random = new SecureRandom();

    public String nextPassword() {
        // 130 random bits rendered in base 32 gives roughly 26 characters
        return new BigInteger(130, random).toString(32);
    }
}
P.S.
I want to give a clear example of how using UUID as a security token may lead to issues:
In uuid-random we discovered an enormous speed-boost by internally re-using random bytes in a clever way, leading to predictable UUIDs. Though we did not release the change, the RFC allows it and such optimizations could sneak into your UUID library unnoticed.
Yes, using a java.util.UUID is fine; the randomUUID method generates its values from a cryptographically secure source. There's not much more that needs to be said.
Here's my suggestion:
Send the user a link with a huge password in it as the URL argument.
When the user clicks the link, write your backend so that it determines whether or not the argument is correct and, if so, logs the user in.
Invalidate the UUID 24 hours after it has been issued.
This will take some work, but it's necessary if you really care about writing a robust, secure system.
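Here is a rough sketch of that flow (the in-memory map, the names, and the 24-hour handling are my own illustration; a real system would persist the tokens):

import java.security.SecureRandom;
import java.time.Duration;
import java.time.Instant;
import java.util.Base64;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConfirmationTokens {
    private static final SecureRandom RANDOM = new SecureRandom();
    private final Map<String, Instant> issued = new ConcurrentHashMap<>();

    // Issue a long random token to embed as the URL argument of the confirmation link.
    public String issueToken() {
        byte[] bytes = new byte[32]; // 256 bits of randomness
        RANDOM.nextBytes(bytes);
        String token = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
        issued.put(token, Instant.now());
        return token;
    }

    // Accept the token once, and only within 24 hours of being issued.
    public boolean confirm(String token) {
        Instant when = issued.remove(token);
        return when != null && Duration.between(when, Instant.now()).compareTo(Duration.ofHours(24)) <= 0;
    }
}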
Password strength can be quantified by the required entropy (the higher, the better).
For a binary computer,
entropy = password length * log2(symbol space)
where the symbol space is the total number of unique symbols (characters) available for selection.
For a typical English-speaking user with a QWERTY keyboard, symbols are selected from 52 letters (26 * 2 for both cases) + 10 digits + maybe 15 other characters like *, +, -, ..., so the general symbol space is around 75.
If we expect a minimum password length of 8:
entropy = 8 * log2(75) ≈ 8 * 6.2 ≈ 50
To achieve an entropy of 50 for autogenerated one-time passwords using only hexadecimal characters (symbol space of 16: 0-9, a-f):
password length = 50 / log2(16) = 50 / 4 ≈ 12
If the application can be relaxed to allow the full set of case-sensitive English letters plus digits, the symbol space becomes 62 (26 * 2 + 10):
password length = 50 / log2(62) ≈ 50 / 6 ≈ 8
This has reduced the number of characters to be typed by the user to 8 (from 12 with hexadecimal).
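A quick sketch to check the arithmetic above (the class and method names are my own):

public class EntropyMath {
    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    public static void main(String[] args) {
        System.out.println(8 * log2(75));  // ~49.8 bits for 8 characters over a 75-symbol space
        System.out.println(50 / log2(16)); // ~12.5 hexadecimal characters needed for ~50 bits
        System.out.println(50 / log2(62)); // ~8.4 alphanumeric characters needed for ~50 bits
    }
}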
With UUID.randomUUID(), the two main concerns are:
the user has to enter 32 characters (not user friendly)
implementations have to ensure the uniqueness criteria (close coupling with library versions and language dependencies)
I understand this is not a direct answer, and it's really up to the application owner to choose the best strategy considering the security and usability constraints.
Personally, I would not use UUID.randomUUID() as a one-time password.
The point of the random code for a confirmation link is that the attacker should not be able to guess or predict the value. A version 4 UUID carries 122 random bits, which yields 2^122 different possible codes, namely 5,316,911,983,139,663,491,615,228,241,121,378,304 possibilities to try. I think your confirmation link is not for launching a nuclear weapon, right? This is difficult enough for an attacker to guess. It's secure.
-- update --
If you don't trust the cryptographically strong random number generator provided, you can put some more unpredictable parameters with the UUID code and hash them. For example,
code = SHA1(UUID, Process PID, Thread ID, Local connection port number, CPU temperature)
This makes it even harder to predict.
I think this should be suitable, as it is generated randomly rather than from any specific input (i.e. you're not feeding it a username or something like that), so multiple calls to this code will give different results. It states that it's a 128-bit key, so it's long enough to be impractical to break.
Are you then going to use this key to encrypt a value, or are you expecting to use this as the actual password? Regardless, you'll need to re-interpret the key into a format that can be entered by a keyboard. For example, do a Base64 or Hex conversion, or somehow map the values to alpha-numerics, otherwise the user will be trying to enter byte values that don't exist on the keyboard.
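As a sketch of that conversion (my own example), the UUID's 128 bits can be written out as raw bytes and Base64url-encoded into 22 typeable characters:

import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.UUID;

public class UuidEncoder {
    // Serialize the UUID's two 64-bit halves and encode them as URL-safe Base64 (22 characters).
    public static String toBase64(UUID uuid) {
        ByteBuffer buffer = ByteBuffer.allocate(16);
        buffer.putLong(uuid.getMostSignificantBits());
        buffer.putLong(uuid.getLeastSignificantBits());
        return Base64.getUrlEncoder().withoutPadding().encodeToString(buffer.array());
    }

    public static void main(String[] args) {
        System.out.println(toBase64(UUID.randomUUID()));
    }
}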
It works well as a one-time password; I have implemented the same thing in the application I am working on. Moreover, the link which you've shared says it all.
I think java.util.UUID should be fine. You can find more information from this article: