How is the "empty string" sequence represented under the hood in Java?

How is the "empty string" sequence represented under the hood in Java? - java

Throughout my career I've often seen calls like this:
if( "".equals(foo) ) { //do stuff };
How is the empty string understood in terms of data in the lower-levels of Java?
Specifically, by "Lower-levels of Java" I'm referring to the actual contents of memory or some C/C++ construct being used to represent the "" sequence, rather than high-level implementations in Java.
I had previously checked the Java Language Specification which lead me to this, and noting that the "empty string" wasn't really given much more definition than that, this is then what led to the head-scratching.
I then ran javap on some various classes trying to tease out an answer through bytecode, but the behavior in regards to "How is the machine dealing with the sequence "" wasn't really any more clear. Having then excluded byte code and Java code I then posted the question here, hoping that someone would shed some light on the issue from a lower-level perspective.

There's no such thing as "the empty string character". A character is always a UTF-16 code unit, and there's no "empty" code unit. There's "an empty string" which is represented exactly the same way as any other string:
A char[] reference
An index into that char[]
A length
In this case, the length would be 0. The char[] reference could potentially be a reference to an empty char array, which could potentially be shared between all instance of String which have a length of 0.
(Code such as substring could be implemented by detecting 0-length requests and always returning the same reference to an empty string, but I'm not aware of implementations doing that.)

Related

Why using default trash value for string is wrong?

tl;dr;
Why using
string myVariable = "someInitialValueDifferentThanUserValue99999999999";
as default value is wrong?
explanation of situation:
I had a discussion with a colleague at my workplace.
He proposed to use some trash value as default in order to differentiate it from user value.
An easy example it would be like this:
string myVariable = "someInitialValueDifferentThanUserValue99999999999";
...
if(myVariable == "someInitialValueDifferentThanUserValue99999999999")
{
...
}
This is quite obvious and intuitive for me that this is wrong.
But I could not give a nice argument for this, beyond that:
this is not professional.
there is a slight chance that someone would input the same value.
Once I read that if you have such a situation your architecture or programming habits are wrong.
edit:
Thank you for the answers. I found a solution that satisfied me, so I share with the others:
It is good to make a bool guard value that indicates if the initialization of a specific object has been accomplished.
And based on this private bool variable I can deduce if I play with a string that is default empty value "" from my mechanism (that is during initialization) or empty value from the user.
For me, this is a more elegant way.

Optional
Optional can be used.
Returns an empty Optional instance. No value is present for this Optional.
API Note:
Though it may be tempting to do so, avoid testing if an object is empty by comparing with == against instances returned by Option.empty(). There is no guarantee that it is a singleton. Instead, use isPresent().
Ref: Optional
Custom escape sequence shared by server and client
Define default value
When the user enter's the default value, escape the user value
Use a marker character
Always define the first character as the marker character
Take decision based on this character and strip this character for any actual comparison
Define clear boundaries for the check as propagating this character across multiple abstractions can lead to code maintenance issues.

Small elaboration on "It's not professional":
It's often a bad idea, because
it wastes memory when not a constant (at least in Java - of course, unless you're working with very limited space that's negligible).
Even as constant it may introduce ambiguity once you have more classes, packages or projects ("Was it NO_INPUT, INPUT_NOT_PROVIDED, INPUT_NONE?")
usually it's a sign that there will be no standardized scope-bound Defined_Marker_Character in the Project Documentation like suggested in the other answers
it introduces ambiguity for how to deal with deciding if an input has been provided or not
In the end you will either have a lot of varying NO_INPUT constants in different classes or end up with a self-made SthUtility class that defines one constant SthUtility.NO_INPUT and a static method boolean SthUtility.isInputEmpty(...) that compares a given input against that constant, which basically is reinventing Optional. And you will be copy-pasting that one class into every of your projects.

There is really no need as you can do the following as of Java 11 which was four releases ago.
String value = "";
// true only if length == 0
if (value.isEmpty()) {
System.out.println("Value is empty");
}
String value = " ";
// true if empty or contains only white space
if (value.isBlank()) {
System.out.println("Value is blank");
}
And I prefer to limit uses of such strings that can be searched in the class file that might possibly lead to exploitation of the code.

Comparing Strings with equivalent but different Unicode code points in Java/Kotlin

I ran into an issue while comparing two strings with different coders. My code is actually in Kotlin but it's running on the JVM and is effectively using Java's String implementation. Also, my question is of a more general nature and my actual code will not be of concern.
The problem is that I have two strings, lets say a and b, where
a = "something something äöü something"
b = "äöü"
you'd expect that a.contains(b) returns true, and that is the case if you retrieve your strings like shown above. But in my case, the strings come from different sources and happen to have different coders. String a has the coder 1, which is UTF16, and String b has the coder 0, which is LATIN1. In this case, a.contains(b) returns false. Now you might have noticed that I included special characters (ä, ö and ü), because that is where, according to my debugging, the comparison fails.
While I am at the stackframe where the a.contains(b) call happens, both strings appear correctly displayed in my debugger (IntelliJ IDEA Ultimate 2020.2). However if I subsequently step into the comparing functions, I notice that in java.lang.StringLatin1.regionMatchesCI_UTF16(), where the byte arrays are converted back char by char, the special characters of b are now not correct (ä -> a, ö -> o, ü -> u). And of course the comparison fails then.
Now as I said, both strings are displayed correctly in the debugger originally, so the information has to be somewhere. My question is: what do I have to do to let the a.contains(b) call return true, as expected?
EDIT:
I was certain that the problem would originate from the strings having two different coders. However, even though the different coders hint at the fact that different encodings were at work, they are not the source of the problem. Generally speaking, different coders do not affect the result of .equals(), .contains() or similar calls. #OrangeDog pointed this out, while also suggesting that I actually ended up with two different representations of the same character, which really was the case. And still, my question remains the same: How do I compare these two strings that are "semantically" the same, but differ in the representation of certain characters?
Java 11 (11.0.2, openJDK 11)
Kotlin/JVM 1.4.0
IntelliJ IDEA Ultimate 2020.2

Ignore the internal details of String. As far as you are concerned it does not have an encoding, it just stores sequences of characters (or "code point units" as the Kotlin docs describe them).
I'm guessing one of your strings (that was Latin-1) uses the character U+00E4 (ä) and the other uses the sequence U+0061 U+0308 (ä). You can verify using toCharArray().
To be able to compare such strings sensibly, there is the class java.text.Normalizer:
Normalizer.normalize(a, Form.NFKD).contains(Normalizer.normalize(b, Form.NFKD))
Or, ensure that any Strings you are receiving are already in the recommended NFC form.

Creating unnecessary string objects

This is a bit of Java String 101. I came across this recently in some existing code. My initial reaction is that this is redundant
car.setDetails(new String(someStringBufferObj.toString));
In my opinion even this would be redundant...
car.setDetails(new String(someOtherStringObj));
because String is immutable so there is never a risk that the car details would be changed accidentally (by changing someOtherStringObj) in a later line of code
Are am I wrong?

The first snippet above looks unnecessary. However the second may be necessary. Consider the following.
The constructor String(String) is useful since it'll take a copy of the underlying character array of the original string.
Why is this useful ? You have to understand that a string object has a character array underlying it, and getting a substring() of an existing string actually uses that original character array. This is a flyweight pattern. Consider the following
String s = longstring.substring(2,4);
The string s points to the character array underlying longstring (somewhat unintuitively). If you want to bin longstring (using garbage collection) the underlying character array won't be binned since s still references it, and you end up consuming a potentially huge amount of memory for a 2 character string.
The String(String) constructor resolves this by creating a new character array from that referenced by the string being used to construct from. When the original string is removed via garbage collection its character array won't be referenced by the substring() result and hence that will be removed too.
Note that this behaviour has changed very recently in Java (release 7u4, I think) and strings don't support the above mode of operation anymore.

You're absolutely right Rob, there's no need to new up a String in this instance. Just providing a call to someStringBufferObj.toString() should be sufficient!

Is StringBuffer the same as Strings in Ruby and Symbols the same as regular Java strings?

I just started reading this book Eloquent Ruby and I have reached the chapter about Symbols in Ruby.
Strings in Ruby are mutable, which means each string allocate memory since the content can change, and even though the content is equal. If I need a mutable String in Java I would use StringBuffer. However since regular Java Strings are immutable one String object can be shared by multiple references. So if I had two regular Strings with the content of "Hello World", both references would point to the same object.
So is the purpose of Symbols in Ruby actually the same as "normal" String objects in Java? Is it a feature given to the programmer to optimize memory?
Is something of what I written here true? Or have I misunderstood the concept of Symbols?

Symbols are close to strings in Ruby, but they are not the equivalent to regular Java strings, although they, too, do share some commonalities such as immutability. But there is a slight difference - there is more than one way to obtain a reference to a Symbol (more on that later on).
In Ruby, it is entirely possible to convert the two back and forth. There is String#to_sym to convert a String into a Symbol and there is Symbol#to_s to convert a Symbol into a String. So what is the difference?
To quote the RDoc for Symbol:
The same Symbol object will be created for a given name or string for the duration of a program‘s execution, regardless of the context or meaning of that name.
Symbols are unique identifiers. If the Ruby interpreter stumbles over let's say :mysymbol for the first time, here is what happens: Internally, the symbol gets stored in a table if it doesn't exist yet (much like the "symbol table" used by parsers; this happens using the C function rb_intern in CRuby/MRI), otherwise Ruby will look up the existing value in the table and use that. After the symbol gets created and stored in the table, from then on wherever you refer to the Symbol :mysymbol, you will get the same object, the one that was stored in that table.
Consider this piece of code:
sym1 = :mysymbol
sym2 = "mysymbol".to_sym
puts sym1.equal?(sym2) # => true, one and the same object
str1 = "Test"
str2 = "Test"
puts str1.equal?(str2) # => false, not the same object
to notice the difference. It illustrates the major difference between Java Strings and Ruby Symbols. If you want object equality for Strings in Java you will only achieve it if you compare exactly the same reference of that String, whereas in Ruby it's possible to get the reference to a Symbol in multiple ways as you saw in the example above.
The uniqueness of Symbols makes them perfect keys in hashes: the lookup performance is improved compared to regular Strings since you don't have to hash your key explicitly as it would be required by a String, you can simply use the Symbol's unique identifier for the lookup directly. By writing :somesymbol you tell Ruby to "give me that one thing that you stored under the identifier 'somesymbol'". So symbols are your first choice when you need to uniquely identify things as in:
hash keys
naming or referring to variable, method and constant names (e.g. obj.send :method_name )
But, as Jim Weirich points out in the article below, Symbols are not Strings, not even in the duck-typing sense. You can't concatenate them or retrieve their size or get substrings from them (unless you convert them to Strings first, that is). So the question when to use Strings is easy - as Jim puts it:
Use Strings whenever you need … umm … string-like behavior.
Some articles on the topic:
Ruby Symbols.
Symbols are not immutable strings
13 Ways of looking at a Ruby Symbol

The difference is that Java Strings need not point to the same object if they contain the same text. When declaring constant strings in your code, this normally is the case since the compiler will put it in the constant pool.
However, if you create a String dynamically at runtime in Java, two Strings can perfectly point to different objects and still contain the same literal text. You can however force this by internalizing the String objects (calling String.intern(), see Java API
A nice example can be found here.

How does a person go about learning Java? (convert byte array to hex string)

I know this sounds like a broad question but I can narrow it down with an example. I am VERY new at Java. For one of my "learning" projects, I wanted to create an in-house MD5 file hasher for us to use. I started off very simple by attempting to hash a string and then moving on to a file later. I created a file called MD5Hasher.java and wrote the following:
import java.security.*;
import java.io.*;
public class MD5Hasher{
public static void main(String[] args){
String myString = "Hello, World!";
byte[] myBA = myString.getBytes();
MessageDigest myMD;
try{
myMD = MessageDigest.getInstance("MD5");
myMD.update(myBA);
byte[] newBA = myMD.digest();
String output = newBA.toString();
System.out.println("The Answer Is: " + output);
} catch(NoSuchAlgorithmException nsae){
// print error here
}
}
}
I visited java.sun.com to view the javadocs for java.security to find out how to use MessageDigest class. After reading I knew that I had to use a "getInstance" method to get a usable MessageDigest object I could use. The Javadoc went on to say "The data is processed through it using the update methods." So I looked at the update methods and determined that I needed to use the one where I fed it a byte array of my string, so I added that part. The Javadoc went on to say "Once all the data to be updated has been updated, one of the digest methods should be called to complete the hash computation." I, again, looked at the methods and saw that digest returned a byte array, so I added that part. Then I used the "toString" method on the new byte array to get a string I could print. However, when I compiled and ran the code all that printed out was this:
The Answer Is: [B#4cb162d5
I have done some looking around here on StackOverflow and found some information here:
How can I generate an MD5 hash?
that gave the following example:
String plaintext = 'your text here';
MessageDigest m = MessageDigest.getInstance("MD5");
m.reset();
m.update(plaintext.getBytes());
byte[] digest = m.digest();
BigInteger bigInt = new BigInteger(1,digest);
String hashtext = bigInt.toString(16);
// Now we need to zero pad it if you actually want the full 32 chars.
while(hashtext.length() < 32 ){
hashtext = "0"+hashtext;
}
It seems the only part I MAY be missing is the "BigInteger" part, but I'm not sure.
So, after all of this, I guess what I am asking is, how do you know to use the "BigInteger" part? I wrongly assumed that the "toString" method on my newBA object would convert it to a readable output, but I was, apparently, wrong. How is a person supposed to know which way to go in Java? I have a background in C so this Java thing seems pretty weird. Any advice on how I can get better without having to "cheat" by Googling how to do something all the time?
Thank you all for taking the time to read. :-)

The key in this particular case is that you need to realize that bytes are not "human readable", but characters are. So you need to convert bytes to characters in a certain format. For arbitrary bytes like hashes, usually hexadecimal is been used as "human readable" format. Every byte is then to be converted to a 2-character hexadecimal string which you in turn concatenate together.
This is unrelated to the language you use. You just have to understand/realize how it works "under the hoods" in a language agnostic way. You have to understand what you have (a byte array) and what you want (a hexstring). The programming language is just a tool to achieve the desired result. You just google the "functional requirement" along with the programming language you'd like to use to achieve the requirement. E.g. "convert byte array to hex string in java".
That said, the code example you found is wrong. You should actually determine each byte inside a loop and test if it is less than 0x10 and then pad it with zero instead of only padding the zero depending on the length of the resulting string (which may not necessarily be caused by the first byte being less than 0x10!).
StringBuilder hex = new StringBuilder(bytes.length * 2);
for (byte b : bytes) {
if ((b & 0xff) < 0x10) hex.append("0");
hex.append(Integer.toHexString(b & 0xff));
}
String hexString = hex.toString();
Update as per the comments on the answer of #extraneon, using new BigInteger(byte[]) is also the wrong solution. This doesn't unsign the bytes. Bytes (as all primitive numbers) in Java are signed. They have a negative range. The byte in Java ranges from -128 to 127 while you want to have a range of 0 to 255 to get a proper hexstring. You basically just need to remove the sign to make them unsigned. The & 0xff in the above example does exactly that.
The hexstring as obtained from new BigInteger(bytes).toString(16) is NOT compatible with the result of all other hexstring producing MD5 generators the world is aware of. They will differ whenever you've a negative byte in the MD5 digest.

You have actually successfully digested the message. You just don't know how to present the found digest value properly. What you have is a byte array. That's a bit difficult to read, and a toString of a byte array yields [B#somewhere which is not useful at all.
The BigInteger comes into it as a tool to format the byte array to a single number.
What you do is:
construct a BigInteger with the proper value (in this case that value happens to be encoded in the form of a byte array - your digest
Instruct the BigInteger object to return a String representation (e.g. plain, readable text) of that number, base 16 (e.g. hex)
And the while loop prefixes that value with 0-characters to get a width of 32. I'd probably use String.format for that, but whatever floats your boat :)

MessageDigests compute a byte array of something, the string that you usually see (such as 1f3870be274f6c49b3e31a0c6728957f) is actually just a conversion of the byte array to a hexadecimal string.
When you call MessageDigest.toString(), it calls MessageDigest.digest().toString(), and in Java, the toString method for a byte[] (returned by MessageDigest.digest()) returns a sort of reference to the bytes, not the actual bytes.
In the code you posted, the byte array is changed to an integer (in this case a BigInteger because it would be extremely large), and then converted to hexadecimal to be printed to a String.
The byte array computed by the digest represents a number (a 128-bit number according to http://en.wikipedia.org/wiki/MD5), and that number can be converted to any other base, so the result of the MD5 could be represented as a base-10 number, a base-2 number (as in a byte array), or, most commonly, a base-16 number.

It is OK to google for answers as long as you (eventually) understand what you copy-pasted into your app :-)
In general, I recommend starting with a good Java introductory book, or web tutorial. See these threads for more tips:
https://stackoverflow.com/questions/77839/what-are-the-best-resources-for-learning-java-books-websites-etc
Learning Java
https://stackoverflow.com/questions/78293/good-book-to-learn-to-program-well-in-java-engineering-or-architecture-wise-not

Though I'm afraid that I have no experience whatsoever using Java to play with MD5 hashes, I can recommend Sun's Java Tutorials as a fantastic resource for learning Java. They go through most of the language, and helped me out a ton when I was learing Java.
Also look around for other posts asking the same thing and see what suggestions popped up there.

The reason BigInteger is used is because the byte array is very long, too big too fit into an int or long. However, if you do want to see everything in the byte array, there's an alternate approach. You could just replace the line:
String output = newBA.toString();
with:
String output = Arrays.toString(newBA);
This will print out the contents of the array, not the reference address.

Use an IDE that shows you where the "toString()" method is coming from. In most cases it's just from the Object class and won't be very useful. It's generally recommended to overwrite the toString-method to provide some clean output, but many classes don't do this.

I'm also a newbie to development. For the current problem, I suggest the Book "Introduction To Cryptography With Java Applets" by David Bishop. It demonstrates what you need and so forth...

Any advice on how I can get better
without having to "cheat" by Googling
how to do something all the time?
By by not starting out with an MD5 hasher! Seriously, work your way up little by little on programs that you can complete without worrying about domain-specific stuff like MD5.
If you're dumping everything into main, you're not programming Java.
In a program of this scale, your main() should do one thing: create an MD5Hasher object and then call some methods on it. You should have a constructor that takes an initial string, a method to "do the work" (update, digest), and a method to print the result.
Get some tutorials and spend time on simple, traditional exercises (a Fibonacci generator, a program to solve some logic puzzle), so you understand the language basics before bothering with the libraries, which is what you are struggling with now. Then you can start doing useful stuff.

I wrongly assumed that the "toString" method on my newBA object would convert it to a readable output, but I was, apparently, wrong. How is a person supposed to know which way to go in Java?
You could replace here Java with the language of your choice that you don't know/haven't mastered yet. Even if you worked 10 years in a specific language, you will still get those "Aha! This is the way it's working!"-effects, though not that often as in the beginning.
The point you need to learn here is that toString() is not returning the representation you want/expect, but any the implementer has chosen. The default implementation of toString() is like this (javadoc):
Returns a string representation of the object. In general, the toString method returns a string that "textually represents" this object. The result should be a concise but informative representation that is easy for a person to read. It is recommended that all subclasses override this method.
The toString method for class Object returns a string consisting of the name of the class of which the object is an instance, the at-sign character `#', and the unsigned hexadecimal representation of the hash code of the object. In other words, this method returns a string equal to the value of:
getClass().getName() + '#' + Integer.toHexString(hashCode())

How is a person supposed to know which
way to go in Java? I have a background
in C so this Java thing seems pretty
weird. Any advice on how I can get
better without having to "cheat" by
Googling how to do something all the
time?
Obvious answers are 1- google when you have questions (and it's not considered cheating imo) and 2- read books on the subject matter.
Apart from these two, I would recommend trying to find a mentor for yourself. If you do not have experienced Java developers at work, then try to join a local Java developer user group. You can find more experienced developers there and perhaps pick their brains to get answers to your questions.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.