Comparing if two lists of strings are equal using hashcode? - java

I am writing a Java/JEE client-server application. I have a requirement where the files present on the server should match the files present on the client. I am only trying to validate that there is an exact match in the file names and the number of files in a specific directory.
Example of what is required:
Server
  DirectoryA
    FileA
    FileB
    FileC
Client
  DirectoryA
    FileA
    FileB
    FileC
What would be the most efficient way for the server to make sure that all clients have the same files, assuming I can have over 100 clients and that I do not want my client/server communication to be too chatty?
Here is my current approach, using a REST API and a REST client:
Server:
Find list of files in the target directory
Create a checksum for the directory from the hash codes of the file names, combined using the multiplier 31.
Clients:
Upon receiving a request to verify the integrity of the target directory, the client takes the checksum provided by the server and runs the same algorithm to generate a checksum for its local directory.
If the checksums match, the client responds to the server with success.
Is this approach correct?

Is this approach correct?
The approach is correct, but the proposed implementation is not (IMO).
I assume that "summing with 31" means something like this:
int hash = 0;
for (String name : names) {
    hash = hash * 31 + name.hashCode();
}
Java hashcode values are 32 bit quantities. If we assume that the filenames are distributed uniformly, that means that there is a chance of 1 in 2^32 that two different sets of filenames will have the same hash (as calculated above). In other words, a "hash collision".
An algorithm that gets it wrong one time in 4 billion times is probably not acceptable. Worse still, if the algorithm is known, then someone can trivially manufacture a situation (i.e. a set of filenames) where the algorithm gives the wrong answer.
If you want to avoid these problems, you need longer checksums. If you want to protect against people manufacturing collisions, then you need a cryptographically strong hash / checksum. MD5 is a popular choice, but manufacturing MD5 collisions is now practical, so something like SHA-256 is the safer option.
But if it were me, I would also consider just sending a complete list of filenames ... or using the (cheap) hashcode-based checksum as just a hint that the directory contents could be the same. (Whether the latter makes sense depends on what you need to do next.)
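For illustration, here is a minimal sketch of the "longer checksum" variant, assuming SHA-256 over the sorted file names (sorting matters so that client and server hash the names in the same order); the class and method names are made up:

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

public class DirChecksum {

    // One SHA-256 digest over all (sorted) file names in the directory.
    static String checksum(File dir) throws Exception {
        String[] names = dir.list();
        if (names == null) {
            throw new IllegalArgumentException(dir + " is not a readable directory");
        }
        Arrays.sort(names); // both sides must hash the names in the same order
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        for (String name : names) {
            md.update(name.getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0); // separator so ["ab", "c"] differs from ["a", "bc"]
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(checksum(new File(args[0])));
    }
}

The server would send this hex string, and each client would compute the same thing locally and compare.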

Related

How to get reproducible Pbkdf2PasswordEncoder output in spring boot?

When running the encode method of a Spring Security Pbkdf2PasswordEncoder instance multiple times, the method returns different results for the same inputs. The snippet
String salt = "salt";
int iterations = 100000;
int hashWidth = 128;
String clearTextPassword = "secret_password";
Pbkdf2PasswordEncoder pbkdf2PasswordEncoder = new Pbkdf2PasswordEncoder(salt, iterations, hashWidth);
String derivedKey = pbkdf2PasswordEncoder.encode(clearTextPassword);
System.out.println("derivedKey: " + derivedKey);
String derivedKey2 = pbkdf2PasswordEncoder.encode(clearTextPassword);
System.out.println("derivedKey2: " + derivedKey2);
results in an output like
derivedKey: b6eb7098ee52cbc4c99c4316be0343873575ed4fa4445144
derivedKey2: 2bef620cc0392f9a5064c0d07d182ca826b6c2b83ac648dc
The expected output would be the same value for both derivations. In addition, when running the application another time, the outputs are different again. The different output behavior also appears for two different Pbkdf2PasswordEncoder instances with the same inputs. The encode method behaves more like a random number generator. The Spring Boot version used is 2.6.1, the spring-security-core version is 5.6.0.
Is there any obvious setting that I am missing? The documentation does not give additional hints. Is there a conceptual error in the spring boot project set up?
Is there any obvious setting that I am missing?
Yes. The documentation you linked to is fairly clear; I guess you missed it. That string you pass to the Pbkdf2PasswordEncoder constructor is not a salt!
The encoder generates a salt for you, and it generates a new salt every time you ask it to encode something, which is how you're supposed to do this stuff [1]. (The returned string contains both this randomly generated salt and the result of applying the encoding.) Because a new salt is made every time you call .encode, the .encode call returns a different value every time you call it, even if you call it with the same inputs.
The string you pass in is merely 'another secret' - which can sometimes be useful (for example, if you can store this secret in a secure enclave, or it is sent by another system / entered upon boot and never stored on disk, then if somebody runs off with your server they can't check passwords. PBKDF means that if they did have the secret the checking will be very slow, but if they don't, they can't even start).
This seems like a solid plan - otherwise people start doing silly things. Such as using the string "salt" as the salt for all encodes :)
The real problem is:
The expected output would be the same values for both derivations
No. Your expectation is broken. Whatever code you are writing that made this assumption needs to be tossed. For example, this is how you are intended to use the encoder:
When a user creates a new password, you use .encode and store what this method returns in a database.
When a user logs in, you take what they typed, and you take the string from your database (the one .encode sent you) and call .matches.
It sounds like you want to again run .encode and see if it matches. Not how you're supposed to use this code.
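For illustration, a minimal sketch of that intended flow (the class is made up; the constructor arguments mirror the question's, with the first one understood as an additional secret rather than a salt; in a real application the encoded string would be stored in and read back from the database):

import org.springframework.security.crypto.password.Pbkdf2PasswordEncoder;

public class PasswordFlow {
    public static void main(String[] args) {
        // The first argument is an additional secret ("pepper"), not a salt.
        Pbkdf2PasswordEncoder encoder = new Pbkdf2PasswordEncoder("secret_pepper", 100000, 128);

        // At registration time: encode once and persist the result
        // (it already embeds a randomly generated salt).
        String stored = encoder.encode("secret_password");

        // At login time: do NOT call encode again and compare strings;
        // let matches() extract the salt and do the comparison.
        System.out.println(encoder.matches("secret_password", stored)); // true
        System.out.println(encoder.matches("wrong_password", stored));  // false
    }
}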
Footnote [1]: The why
You also need to review your security policies. The idea you have in your head of how this stuff works is thoroughly broken. Imagine it worked like you wanted, and there is a single salt used for all password encodes. Then if you hand me a dump of your database, I can trivially crack about 5% of all accounts within about 10 minutes!!
How? Well, I sort all hashed strings and then count occurrences. There will be a bunch of duplicate strings inside. I can then take all users whose passhash is in this top 10 of most common hashes and then log in as them. Because their password is iloveyou, welcome123, princess, dragon, 12345678, alexsawesomeservice!, etcetera - the usual crowd of extremely oft-used passwords. How do I know that's their password? Because their password is the same as that of many other users on your system.
Furthermore, if none of the common passwords work, I can tell that likely these are really different accounts from the same user.
These are all things that I definitely should not be able to derive from the raw data. The solution is, naturally, to have a unique salt for everything, and then store the salt in the DB along with the hash value so that one can 'reconstruct' when a user tries to log in. These tools try to make your life easy by doing the work for you. This is a good idea, because errors in security implementations (such as forgetting to salt, or using the same salt for all users) are not (easily) unit testable, so a well meaning developer writes code, it seems to work, a casual glance at the password hashes seem to indicate "it is working" (the hashes seem random enough to the naked eye), and then it gets deployed, security issue and all.

Java HashMap/List alternative for huge data

In my Java application I have to scan a filesystem and recursively store the paths of the files found, so that they can be searched later.
I tried List/ArrayList and HashMap as the storage structure, but the memory usage is far too high when the filesystem contains 1,000,000+ files.
How can I store and quickly retrieve those 'strings' without using half of my RAM (8 GB)?
You are storing a large number of strings in main memory. It will take memory irrespective of the data structure you use. One way might be not to store the whole path every time, but to store the paths in a hierarchical structure, e.g. storing the name of a directory in a map as the key and storing all entries of that directory in a list as the value, recursively.
In the global hashmap, instead of storing the full paths as Strings, you can store references to Dir objects.
For each directory you find, create a Dir object. Each Dir object has a pointer to its parent Dir object and its local name.
Example:
/a/long...path/p/ is a Dir you already found.
/a/long...path/p/a and /a/long...path/p/b are two new Dirs.
The two sub Dirs only have to store a reference to the parent Dir plus their local names "a" or "b".
Note that you do not have to find the parent Object first: When scanning the file system you should do this recursively or using a Stack explicitly. When you created a Dir-object (e.g. /p here) you then push that object onto a stack and then you visit (go into) that directory. When you are creating the /a and /b sub-Dirs you just look at the top of the stack to find their parent. When you are done with the whole contents of /a/long...path/p/ then you pop the Dir-object representing it off the stack.
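A rough sketch of that idea (class and method names are made up; a recursive scan is used here, an explicit stack works the same way):

import java.io.File;
import java.util.ArrayList;
import java.util.List;

class Dir {
    final Dir parent;   // null for the scan root
    final String name;  // local name only, not the full path

    Dir(Dir parent, String name) {
        this.parent = parent;
        this.name = name;
    }

    // Rebuild the full path only when it is actually needed.
    String fullPath() {
        return parent == null ? name : parent.fullPath() + File.separator + name;
    }
}

class FsScanner {
    // Each entry shares its parent Dir instead of repeating the whole path prefix.
    static void scan(File dir, Dir node, List<Dir> out) {
        out.add(node);
        File[] children = dir.listFiles();
        if (children == null) {
            return; // not a directory or not readable
        }
        for (File child : children) {
            Dir childNode = new Dir(node, child.getName());
            if (child.isDirectory()) {
                scan(child, childNode, out);
            } else {
                out.add(childNode);
            }
        }
    }

    static List<Dir> scanAll(File root) {
        List<Dir> out = new ArrayList<>();
        scan(root, new Dir(null, root.getPath()), out);
        return out;
    }
}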
This question can have many answers. People can offer you a wide range of data structures to use, or may ask you to increase your hardware memory or the heap size of the JVM. But I think the problem is somewhere else.
This problem cannot be solved just by picking a basic data structure; it may require a change at the design level too. Think about your need: you are asking to keep in memory an amount of data that even today's operating systems, or an RDBMS with a very large data store, would not hold there.
Data structure as a Service (DSAS - it already exists, e.g. Redis, but hey, I may have coined this term!).
In your application design, try introducing a component or service like Redis, Memcached or CouchDB, which are specialized for things like storing huge amounts of data and fast search, accessed over standard sockets or another high-speed communication protocol like D-Bus.
Do not worry about the internal workings of such protocols. There are enough libraries/APIs to do it for you.
I can suggest using a HashSet and storing an MD5 sum for each path:
Set<Md5Sum> paths = new HashSet<>();
//for each path
String path = ...
byte[] md5 = messageDigestObject.digest(path.getBytes());
paths.add(new Md5Sum(md5));
You cannot use byte[] directly as a key in a hash set, so you need to create a simple helper class:
class Md5Sum {
    // two longs are more memory-efficient than a byte[16]
    long part1, part2;
    // override equals and hashCode methods
    // ..........
}
About updates
You need to rescan the filesystem and recreate this hash set, or you can subscribe to file system events (see WatchService).
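If you go the WatchService route, a minimal sketch could look like this (the directory path is made up; note that a WatchService watches one directory, so every directory of interest has to be registered):

import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class DirWatcher {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("/data/files"); // hypothetical directory
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_DELETE);

        while (true) {
            WatchKey key = watcher.take(); // blocks until something changes
            for (WatchEvent<?> event : key.pollEvents()) {
                if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                    continue; // events were lost; a rescan may be needed
                }
                Path changed = (Path) event.context();
                // update the in-memory set here instead of rescanning everything
                System.out.println(event.kind() + ": " + dir.resolve(changed));
            }
            if (!key.reset()) {
                break; // directory no longer accessible
            }
        }
    }
}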

How to generate a DomainKeys (not DKIM) signature?

I am using DKIM for JavaMail to sign outgoing mails with DKIM.
Now, I would like to add a DomainKey-Signature. From reading through the docs, specs and other related posts I know that the signing process is almost identical (using the same algorithm, DNS entries, etc.).
The only difference is that DKIM offers more options, e.g. in choosing which fields to sign. That makes it easy to select the signing fields (e.g. From, Subject) and generate the right hash values.
For DomainKeys I could not figure out which mail parts to hash, though. I read the docs, but it is not clearly stated whether you should hash only the body or the entire source.
On a different website it says
DomainKeys uses the ‘From’, and ‘Sender’ headers, as well as the message body, in combination with the Private Key to generate a DomainKeys signature
That makes sense - but what does it mean for my other header fields (e.g. Date, Message-ID) and what is meant by message body?
So my overall question is:
What input (mail parts) do I use to generate the DomainKey hash?
To find which header fields are signed by "DKIM for JavaMail", have a look into the source "DKIMSigner.java"; they are specified in the array String[] defaultHeadersToSign.
Body means the message itself (in the stripped-down, simplified structure of an email: header fields + one empty line + body).
There is no need to use the deprecated DomainKeys anymore if you are already using DKIM.
You may want to have a look at this implementation: http://www.badpenguin.co.uk/dkim/

Optimized way of doing String.endsWith()'s work

I need to look at all web requests received by the application server to check whether the URL has an extension like .css, .gif, etc.
I looked at how Tomcat listens for every request and picks the right configured Servlet to serve it:
CharChunk, MessageBytes, Mapper
Here is my idea for the implementation:
Load all the extensions we want to compare and get their byte representation.
Get a unique value for each extension by summing up the bytes in the byte array // e.g. "css".getBytes()
Add the resulting value to a sorted list.
Whenever we receive a request, get the byte representation of the URL // e.g. "flipkart.com/eshopping/images/theme.css".getBytes()
Start summing the bytes from the byte array's last index and break when we encounter the "." dot byte value.
Search for the summed value in the sorted list // use binary search here
Kindly give your feedback about the implementation and any issues.
-With thanks, Krishna
This sounds way more complicated than it needs to be.
Use String.lastIndexOf to find the last dot in the URL
Use String.substring to get the extension based on that
Have a HashSet<String> for a set of supported extensions, or a HashMap<String, Whatever> if you want to map the extension to something else
I would be absolutely shocked to discover that this simple approach turned out to be a performance bottleneck - and indeed I suspect it would be more efficient than the approach you suggested, given that it doesn't require the entire URL to be converted into a byte array... (It's not clear why your approach uses byte arrays anyway instead of forming the hash from char values.)
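Something along those lines, as a sketch (the extension set and class name are just examples):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StaticResourceCheck {
    private static final Set<String> STATIC_EXTENSIONS =
            new HashSet<>(Arrays.asList("css", "gif", "js", "png", "jpg")); // example set

    static boolean isStaticResource(String url) {
        int dot = url.lastIndexOf('.');
        if (dot < 0 || dot == url.length() - 1) {
            return false; // no extension at all
        }
        String extension = url.substring(dot + 1).toLowerCase();
        return STATIC_EXTENSIONS.contains(extension);
    }

    public static void main(String[] args) {
        System.out.println(isStaticResource("flipkart.com/eshopping/images/theme.css")); // true
        System.out.println(isStaticResource("flipkart.com/eshopping/cart"));             // false
    }
}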
Fundamentally, my preferred approach to performance is:
Do up-front design and testing around things which are hard to change later, architecturally
For everything else:
Determine the performance criteria first so you know when you can stop
Write the simplest code that works
Test it with realistic data
If it doesn't perform well enough, use profilers (etc) to work out where the bottleneck is, and optimize that making sure that you can prove the benefits using your existing tests

Shortening long urls with a hash?

I've got a file cache, with the files being downloaded from different URLs. I'd like to save each file under the name of its URL. These names can be quite long though, and I'm on a device using a FAT32 file system - so the long names are eating up resources well before I run out of actual disk space.
I'm looking for a way to shorten the filenames and have gotten suggestions to hash the strings. But I'm not sure if the hashes are guaranteed to be unique for two different strings. It would be bad if I accidentally fetched the wrong image because two different URLs came up with the same hash value.
You could generate a UUID for each URL and use it as the file name.
UUIDs are unique (or "practically unique") and are 36 characters long, so I guess the file name wouldn't be a problem.
As of version 5, the JDK ships with a class to generate UUIDs (java.util.UUID). You could use randomly generated UUIDs if there's a way to associate them with the URLs, or you could use name-based UUIDs. Name-based UUIDs are always the same, so the following is always true:
String url = ...
UUID urlUuid = UUID.nameUUIDFromBytes(url.getBytes());
assertTrue(urlUuid.equals(UUID.nameUUIDFromBytes(url.getBytes())));
There's no (shortening) hash which can guarantee different hashes for each input. It's simply not possible.
The way I usually do it is by saving the original name at the beginning (e.g., first line) of the cache file. So to find a file in the cache you do it like this:
Hash the URL
Find the file corresponding to that hash
Check the first line. If it's the same as the full URL:
The rest of the file is from line two and forward
You can also consider saving the URL->file mapping in a database.
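As a sketch of the lookup steps above (the class name, cache directory and hash helper are made up; the placeholder hash should of course be replaced by a real one, e.g. a SHA-1 hex string as the next answer suggests):

import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class UrlCache {
    // Returns the cache file if it exists and its first line matches the URL,
    // otherwise null. The cached payload itself starts at line two of the file.
    static Path lookup(String url, Path cacheDir) throws Exception {
        Path candidate = cacheDir.resolve(hashOf(url));
        if (!Files.exists(candidate)) {
            return null; // not cached yet
        }
        try (BufferedReader in = Files.newBufferedReader(candidate, StandardCharsets.UTF_8)) {
            String storedUrl = in.readLine(); // line one holds the original URL
            // A mismatch here would mean two URLs collided on the same hash.
            return url.equals(storedUrl) ? candidate : null;
        }
    }

    static String hashOf(String url) {
        // Placeholder only - use a proper hash (e.g. a SHA-1 hex string) in practice.
        return Integer.toHexString(url.hashCode());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(lookup("http://example.com/image.png", Paths.get("/tmp/cache")));
    }
}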
But I'm not sure if the hashes are guaranteed to be unique for two different strings.
They very much aren't (and cannot be, due to the pigeonhole principle). But if the hash is sufficiently long (at least 64 bit) and well-distributed (ideally a cryptographic hash), then the likelihood of a collision becomes so small that it's not worth worrying about.
As a rough guideline, collisions will become likely once the number of files approaches the square root of the number of possible different hashes (birthday paradox). So for a 64 bit hash (10 character filenames), you have about a 50% chance of one single collision if you have 4 billion files.
You'll have to decide whether that is an acceptable risk. You can reduce the chance of collision by making the hash longer, but of course at some point that will mean the opposite of what you want.
Currently, SHA-1 is a common recommendation for this. Deliberate SHA-1 collisions have been demonstrated by now, but provoking a collision between two pieces of data that share a common structure (such as the http:// prefix) is harder still, and for a cache key what matters is accidental collisions, which remain astronomically unlikely. If you save this stuff after you get an HTTP 200 response, then the URL obviously fetched something, so getting two distinct, valid URLs with the same SHA-1 hash really should not be a concern.
If it's of any reassurance, Git uses it to identify all objects, commits and folders in the source code repository. I've yet to hear of someone with a collision in the object store.
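For example, a URL can be mapped to a fixed-length, filesystem-safe cache name like this (a sketch; class and method names are made up):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class CacheFileName {
    // Map an arbitrary URL to a fixed-length, filesystem-safe name (40 hex chars).
    static String cacheName(String url) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(url.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(cacheName("http://example.com/some/very/long/image/url.png"));
    }
}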
What you can do is save the files by an index and use an index file to find the location of the actual file.
In the directory you have:
index.txt
file1
file2
...
etc.
and in index.txt you use some data structure to find the filenames efficiently (or replace it with a DB)
Hashes are not guaranteed to be unique, but the chance of a collision is vanishingly small.
If your hash is, say, 128 bits then the chance of a collision for any pair of entries is 1 in 2^128. By the birthday paradox, if you had 10^18 entries in your table then the chance of a collision is only 1%, so you don't really need to worry about it. If you are extra paranoid then increase the size of the hash by using SHA256 or SHA512.
Obviously you need to make sure that the hashed representation actually takes up less space than the original filename. Base-64 encoded strings represent 6 bits per character so you can do the math to find out if it's even worth doing the hash in the first place.
If your file system barfs because the names are too long, then you can create prefix subdirectories for the actual storage. For example, if a file maps to the hash ABCDE then you can store it as /path/to/A/B/CDE, or maybe /path/to/ABC/DE, depending on what works best for your file system.
Git is a good example of this technique in practice.
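A small sketch of that layout, following the /path/to/A/B/CDE style (paths and values are made up):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HashedStore {
    // Turn a hash like "ABCDE..." into /path/to/A/B/CDE... as described above.
    static Path locationFor(String hash, Path root) {
        return root.resolve(hash.substring(0, 1))
                   .resolve(hash.substring(1, 2))
                   .resolve(hash.substring(2));
    }

    public static void main(String[] args) throws Exception {
        Path target = locationFor("abcdef1234567890", Paths.get("/tmp/cache")); // made-up values
        Files.createDirectories(target.getParent()); // create the prefix directories on demand
        System.out.println(target);
    }
}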
Look at my comment.
One possible solution (there are many) is creating a local file (SQLite? XML? TXT?) in which you store the pairs (file_id - file_name), so you can save your downloaded files with their unique ID as the filename.
Just an idea, not the best...
