Comparing Strings with equivalent but different Unicode code points in Java/Kotlin


I ran into an issue while comparing two strings with different coders. My code is actually in Kotlin but it's running on the JVM and is effectively using Java's String implementation. Also, my question is of a more general nature and my actual code will not be of concern.
The problem is that I have two strings, let's say a and b, where
a = "something something äöü something"
b = "äöü"
You'd expect a.contains(b) to return true, and that is the case if you create your strings as shown above. But in my case, the strings come from different sources and happen to have different coders. String a has coder 1, which is UTF16, and String b has coder 0, which is LATIN1. In this case, a.contains(b) returns false. You might have noticed that I included special characters (ä, ö and ü), because that is where, according to my debugging, the comparison fails.
While I am at the stack frame where the a.contains(b) call happens, both strings are displayed correctly in my debugger (IntelliJ IDEA Ultimate 2020.2). However, if I then step into the comparison functions, I notice that in java.lang.StringLatin1.regionMatchesCI_UTF16(), where the byte arrays are converted back char by char, the special characters of b are no longer correct (ä -> a, ö -> o, ü -> u). And of course the comparison then fails.
Now as I said, both strings are displayed correctly in the debugger originally, so the information has to be somewhere. My question is: what do I have to do to let the a.contains(b) call return true, as expected?
EDIT:
I was certain that the problem originated from the strings having two different coders. However, even though the different coders hint that different encodings were at work, they are not the source of the problem. Generally speaking, different coders do not affect the result of .equals(), .contains() or similar calls. @OrangeDog pointed this out, while also suggesting that I had actually ended up with two different representations of the same character, which really was the case. Still, my question remains the same: how do I compare two strings that are "semantically" the same, but differ in the representation of certain characters?
Java 11 (11.0.2, openJDK 11)
Kotlin/JVM 1.4.0
IntelliJ IDEA Ultimate 2020.2

Ignore the internal details of String. As far as you are concerned it does not have an encoding, it just stores sequences of characters (or "code point units" as the Kotlin docs describe them).
I'm guessing one of your strings (that was Latin-1) uses the character U+00E4 (ä) and the other uses the sequence U+0061 U+0308 (ä). You can verify using toCharArray().
To be able to compare such strings sensibly, there is the class java.text.Normalizer:
Normalizer.normalize(a, Form.NFKD).contains(Normalizer.normalize(b, Form.NFKD))
Or, ensure that any Strings you are receiving are already in the recommended NFC form.
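A minimal Java sketch of the normalization approach, assuming NFC as the common form (the one-liner above uses NFKD, which behaves the same way for this case). The class name NormalizedContains and the helper containsNormalized are hypothetical names chosen for illustration.

```java
import java.text.Normalizer;

public class NormalizedContains {

    /** Returns true if haystack contains needle once both are normalized to NFC. */
    static boolean containsNormalized(String haystack, String needle) {
        String h = Normalizer.normalize(haystack, Normalizer.Form.NFC);
        String n = Normalizer.normalize(needle, Normalizer.Form.NFC);
        return h.contains(n);
    }

    public static void main(String[] args) {
        // a uses decomposed sequences: base letter followed by U+0308 (combining diaeresis)
        String a = "something something a\u0308o\u0308u\u0308 something";
        // b uses the precomposed characters U+00E4, U+00F6, U+00FC
        String b = "\u00E4\u00F6\u00FC";

        System.out.println(a.contains(b));            // false: the code units differ
        System.out.println(containsNormalized(a, b)); // true: both sides are in NFC
    }
}
```

Normalizing both sides on every comparison is simple but repeats work; if the strings are compared often, normalizing them once when they enter the program is probably the cheaper design.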

Related

Is there a way to set the value of an Aspose Cell to a string representation while keeping its type/style?

We have a program that does pattern replacement in a variety of files (including Excel) and are using Aspose. So, for example, we might have a cell that is PATTERN_TO_REPLACE and then we replace that with "6" or "6.0" or "1,000.00" (in the US) or "1.000,00" (in Germany, for example).
(PATTERN_TO_REPLACE could also be in a formula).
The problem is that we have to use a Java matcher to find (and replace) the string, which means we are calling cell.setValue(String). This led Aspose to change the CellValueType to string, even when it was previously numeric.
Our initial solution was to look and see if we thought the string was a number, then cast it to a BigDecimal and pass it in as such, but that leads to a lot of work (particularly for localization of types of currencies...)
What I'd like to do is call cell.getStyle(), cell.getType(), save them, set the value, then return them, but there is no cell.setType() method. Is there an easy way to do what I want?

When you use cell.setValue(String), it will of course set the string type. You could try one of the overloads instead, such as cell.putValue(String, true). That way, if you insert a number as a string, it is automatically converted and the cell gets a numeric type.
PS. I am working as Support developer/ Evangelist at Aspose.

While debugging Java code, what does @ mean in statements like {Instance@789} or "SomeThread"@321: RUNNING? [duplicate]

This question already has an answer here:
Deciphering variable information while debugging Java
(1 answer)
Closed 6 months ago.
The "@" seems to be everywhere when I debug. It is always preceded by some instance or variable name and followed by a number (usually three digits). What does it mean? The screenshot I am asking about is taken from https://medium.com/@andrey_cheptsov/intellij-idea-pro-tips-6da48acafdb7 .

@730 means the 730th object created since the application started.
It is not the hash code. The number can have more or fewer than three digits.
It depends entirely on which IDE you are using: Eclipse may show something else instead of @730, possibly in a different format. This is simply IntelliJ's way of keeping track of objects while debugging.

This is the IntelliJ debugger's way of displaying a "unique identifier" for an object. It consists of the short class name and a unique number. The unique number seems to be generated using a simple counter, so the "meaning" of 729 in Owner@729 is (presumably) "this is the 729th object that the debugger has allocated an identifier for". However, you probably shouldn't rely on that.
There is no overt relationship between these numbers and Java identity hash code values, though I expect IntelliJ maintains a mapping behind the scenes.
The Owner@5f9d02cb in the screenshot is reminiscent of the result of Object::toString when it hasn't been overridden. If that is what it is, then 5f9d02cb is the object's identity hash code.
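To make the last point concrete, here is a small Java sketch that rebuilds the default Object::toString text from the class name and the identity hash code. The class DefaultToStringDemo, the nested Owner class and the helper defaultToString are hypothetical names for illustration; the debugger's own counter-based number cannot be reproduced this way.

```java
public class DefaultToStringDemo {

    // Hypothetical class that overrides neither toString() nor hashCode()
    static class Owner {}

    /** Rebuilds the text that Object.toString() produces when it is not overridden. */
    static String defaultToString(Object o) {
        return o.getClass().getName() + "@"
                + Integer.toHexString(System.identityHashCode(o));
    }

    public static void main(String[] args) {
        Owner owner = new Owner();
        // Both lines print the same text; the hex part varies from run to run
        System.out.println(owner);
        System.out.println(defaultToString(owner));
    }
}
```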

How is the "empty string" sequence represented under the hood in Java?

Throughout my career I've often seen calls like this:
if( "".equals(foo) ) { //do stuff };
How is the empty string understood in terms of data in the lower-levels of Java?
Specifically, by "Lower-levels of Java" I'm referring to the actual contents of memory or some C/C++ construct being used to represent the "" sequence, rather than high-level implementations in Java.
I had previously checked the Java Language Specification, which led me to this, and since the "empty string" wasn't given much more definition than that, it left me scratching my head.
I then ran javap on various classes, trying to tease out an answer from the bytecode, but how the machine deals with the sequence "" wasn't any clearer. Having ruled out bytecode and Java code, I posted the question here, hoping that someone could shed some light on the issue from a lower-level perspective.

There's no such thing as "the empty string character". A character is always a UTF-16 code unit, and there's no "empty" code unit. There's "an empty string" which is represented exactly the same way as any other string:
A char[] reference
An index into that char[]
A length
In this case, the length would be 0. The char[] reference could potentially be a reference to an empty char array, which could potentially be shared between all instances of String which have a length of 0.
(Code such as substring could be implemented by detecting 0-length requests and always returning the same reference to an empty string, but I'm not aware of implementations doing that.)
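Whatever field layout a given JDK uses internally, the observable behaviour of the empty string can be checked from plain Java. The sketch below is only an illustration; the class EmptyStringDemo and the helper isEmpty are hypothetical names, and isEmpty simply wraps the "".equals(foo) idiom from the question.

```java
public class EmptyStringDemo {

    /** Null-safe emptiness check, using the "".equals(foo) idiom from the question. */
    static boolean isEmpty(String foo) {
        return "".equals(foo);
    }

    public static void main(String[] args) {
        System.out.println("".length());             // 0
        System.out.println("".toCharArray().length); // 0: there are no code units at all
        System.out.println(isEmpty(""));             // true
        System.out.println(isEmpty(new String(""))); // true: a different object, same content
        System.out.println(isEmpty(null));           // false, and no NullPointerException
    }
}
```

Putting the literal on the left-hand side is what makes the check null-safe, which is probably why the idiom shows up so often.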

Strings transcoding in Java

I've found a piece of code recently, which does the following:
String s = ... // whatever
...
s = new String(s.getBytes(myEncoding), myEncoding);
To me it appears to be absolute nonsense.
Is it possible that under certain circumstances (some specific combination of locale settings, used technologies, etc.), this code will do something useful?
Thanks in advance

Yes, that code is generally nonsense. Yes, it's possible that the code is doing "something" to the string (probably nothing good). Generally speaking, if you have already incorrectly converted bytes to chars, trying to re-convert is rarely going to give you legitimate results. (There may be isolated cases where the right combination of character encodings happens to work.)
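One way to see what the "something" can be: the round trip is lossless when the encoding can represent every character, and lossy otherwise. The Java sketch below assumes a java.nio.charset.Charset is passed instead of the encoding name used in the question; the class RoundTrip and the method roundTrip are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTrip {

    /** Encodes s with the given charset and decodes the bytes again with the same charset. */
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String s = "a\u00E4\u20AC"; // 'a', 'ä', and the euro sign

        // UTF-8 can encode all three characters, so the string comes back unchanged
        System.out.println(s.equals(roundTrip(s, StandardCharsets.UTF_8))); // true

        // ISO-8859-1 has no euro sign; getBytes substitutes '?' for it
        System.out.println(roundTrip(s, StandardCharsets.ISO_8859_1).equals("a\u00E4?")); // true
    }
}
```

So the line in question is either a no-op or a silent replacement of unencodable characters with '?', which is rarely what anyone intends.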

Is StringBuffer the same as Strings in Ruby and Symbols the same as regular Java strings?

I just started reading this book Eloquent Ruby and I have reached the chapter about Symbols in Ruby.
Strings in Ruby are mutable, which means each string allocates its own memory, since the content can change, even when the content is equal. If I need a mutable String in Java I would use StringBuffer. However, since regular Java Strings are immutable, one String object can be shared by multiple references. So if I had two regular Strings with the content "Hello World", both references would point to the same object.
So is the purpose of Symbols in Ruby actually the same as "normal" String objects in Java? Is it a feature given to the programmer to optimize memory?
Is any of what I have written here true? Or have I misunderstood the concept of Symbols?

Symbols are close to strings in Ruby, but they are not the equivalent to regular Java strings, although they, too, do share some commonalities such as immutability. But there is a slight difference - there is more than one way to obtain a reference to a Symbol (more on that later on).
In Ruby, it is entirely possible to convert the two back and forth. There is String#to_sym to convert a String into a Symbol and there is Symbol#to_s to convert a Symbol into a String. So what is the difference?
To quote the RDoc for Symbol:
The same Symbol object will be created for a given name or string for the duration of a program's execution, regardless of the context or meaning of that name.
Symbols are unique identifiers. If the Ruby interpreter stumbles over let's say :mysymbol for the first time, here is what happens: Internally, the symbol gets stored in a table if it doesn't exist yet (much like the "symbol table" used by parsers; this happens using the C function rb_intern in CRuby/MRI), otherwise Ruby will look up the existing value in the table and use that. After the symbol gets created and stored in the table, from then on wherever you refer to the Symbol :mysymbol, you will get the same object, the one that was stored in that table.
Consider this piece of code:
sym1 = :mysymbol
sym2 = "mysymbol".to_sym
puts sym1.equal?(sym2) # => true, one and the same object
str1 = "Test"
str2 = "Test"
puts str1.equal?(str2) # => false, not the same object
This illustrates the major difference between Java Strings and Ruby Symbols. If you want object identity for Strings in Java, you will only get it if you compare exactly the same reference to that String, whereas in Ruby it's possible to get the reference to a Symbol in multiple ways, as you saw in the example above.
The uniqueness of Symbols makes them perfect keys in hashes: the lookup performance is improved compared to regular Strings since you don't have to hash your key explicitly as it would be required by a String, you can simply use the Symbol's unique identifier for the lookup directly. By writing :somesymbol you tell Ruby to "give me that one thing that you stored under the identifier 'somesymbol'". So symbols are your first choice when you need to uniquely identify things as in:
hash keys
naming or referring to variable, method and constant names (e.g. obj.send :method_name )
But, as Jim Weirich points out in the article below, Symbols are not Strings, not even in the duck-typing sense. You can't concatenate them or retrieve their size or get substrings from them (unless you convert them to Strings first, that is). So the question when to use Strings is easy - as Jim puts it:
Use Strings whenever you need … umm … string-like behavior.
Some articles on the topic:
Ruby Symbols.
Symbols are not immutable strings
13 Ways of looking at a Ruby Symbol

The difference is that Java Strings need not point to the same object if they contain the same text. For constant strings declared in your code they normally do point to the same object, since the compiler puts them in the constant pool.
However, if you create a String dynamically at runtime in Java, two Strings can perfectly well point to different objects and still contain the same text. You can, however, force them to share one object by interning the String objects (calling String.intern(), see the Java API).
A nice example can be found here.
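As a rough illustration of the points above, the following Java sketch compares a literal, a String built at runtime, and the result of intern(). The class InternDemo and the helper sameObject are hypothetical names.

```java
public class InternDemo {

    /** True only if both references point to the very same object. */
    static boolean sameObject(Object x, Object y) {
        return x == y;
    }

    public static void main(String[] args) {
        String literal1 = "Hello World";
        String literal2 = "Hello World";
        // Built at runtime, so it is a separate object with the same content
        String dynamic = new StringBuilder("Hello ").append("World").toString();

        System.out.println(sameObject(literal1, literal2));         // true: both come from the constant pool
        System.out.println(dynamic.equals(literal1));               // true: same text
        System.out.println(sameObject(dynamic, literal1));          // false: different objects
        System.out.println(sameObject(dynamic.intern(), literal1)); // true: intern() returns the pooled object
    }
}
```

In this respect an interned Java String is the closest counterpart to a Ruby Symbol: one shared object per distinct text.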
