Do Java String length() and substring(int, int) not consider certain characters?

Trying to solve an issue with trimming a string.
Are there any ascii chars that are not counted in either length() or substring(int, int)?
For example, if the string comes from a serialized object outside your program and contains characters such as "start of text" (ascii 0x02) or "bell" (ascii 0x07), will those characters be counted by either length() or substring(int, int)?

See the documentation for String#length:
Returns the length of this string. The length is equal to the number of Unicode code units in the string.
This means that all characters are included in the length. Specifically, this will return the number of chars required to represent the string in Java.
However, of note is that certain Unicode characters will actually take up two chars in the string due to the way Java handles Unicode characters using UTF-16. See the relevant documentation for more information.

Are there any ascii chars that are not counted in either length() or substring(int, int)?
No, there aren't any. Both of these methods are "dumb": they operate directly on the chars stored in the String's underlying character array (and in fact, length() is specified by CharSequence).
Whether they are ASCII control characters or values such as U+0000 and the "non-character" U+FFFF, all of them are counted.
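To illustrate (a minimal sketch, not from the original answer): a string containing the STX and BEL control characters from the question still reports every char.

public class ControlCharLength {
    public static void main(String[] args) {
        // STX (0x02) and BEL (0x07) are ASCII control characters.
        String s = "\u0002Hello\u0007";
        System.out.println(s.length());        // 7 -- both control characters are counted
        System.out.println(s.substring(1, 6)); // "Hello" -- the indices count them too
    }
}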

Related

Escaping non-latin characters in Java

I have a Java program that takes in a string and escapes it so that it can be safely passed to a program in bash. The strategy is basically to escape any of the special characters mentioned here and wrap the result in double quotes.
The algorithm is pretty simple -- just loop over the input string and use input.charAt(i) to check whether the current character needs to be escaped.
This strategy works quite well for characters that aren't represented by surrogate pairs, but I have some concerns if non-latin characters or something like an emoji is embedded in the string. In that case, if we assumed that an emoji was the first character in my input string, input.charAt(0) would give me the first code unit while input.charAt(1) would return the second code unit. My concern is that some of these code units might be interpreted as one of the special characters that need to be escaped. If that happened, I'd try to escape one of the code units which would irrevocably garble the input.
Is such a thing possible? Or is it safe to use input.charAt(i) for something like this?
From the Java docs:
The Java 2 platform uses the UTF-16 representation in char arrays and
in the String and StringBuffer classes. In this representation,
supplementary characters are represented as a pair of char values, the
first from the high-surrogates range, (\uD800-\uDBFF), the second from
the low-surrogates range (\uDC00-\uDFFF).
From the UTF-16 Wikipedia page:
U+D800 to U+DFFF: The Unicode standard permanently reserves these code point values for
UTF-16 encoding of the high and low surrogates, and they will never be
assigned a character, so there should be no reason to encode them. The
official Unicode standard says that no UTF forms, including UTF-16,
can encode these code points.
From the charAt javadoc:
Returns the char value at the specified index. An index ranges from 0
to length() - 1. The first char value of the sequence is at index 0,
the next at index 1, and so on, as for array indexing.
If the char value specified by the index is a surrogate, the surrogate
value is returned.
There is no overlap between the surrogate code unit range and the range where my special characters ($, `, \, etc.) live, since they all use the ASCII character mappings (i.e. they are all mapped below 128).
Therefore, if I scan through a string that contains, say, an emoji (which is definitely outside the Basic Multilingual Plane and is therefore encoded as a surrogate pair), I won't mistake either half of the surrogate pair for a special character. Here's a simple test program:
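A minimal sketch of such a test, assuming Java 9+ for Set.of; the escape set below is illustrative, not the full set from the linked bash documentation:

import java.util.Set;

public class SurrogateEscapeTest {
    // Illustrative set of bash-special characters to escape inside double quotes.
    private static final Set<Character> SPECIAL = Set.of('$', '`', '"', '\\', '!');

    public static void main(String[] args) {
        String input = "\uD83D\uDE00 $HOME";   // starts with an emoji (a surrogate pair)
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.contains(c)) {
                System.out.println("escape needed at index " + i + ": " + c);
            }
        }
        // Only the '$' is flagged; neither surrogate half collides with the special set.
    }
}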

In Java, how are Unicode chars and Java UTF-16 codepoints handled?

I'm struggling with Unicode characters in Java 10.
I'm using the java.text.BreakIterator package.
For this output:
myString="a𝓞b" hex=0061d835dcde0062
myString.length()=4
myString.codePointCount(0,s.length())=3
BreakIterator output:
a hex=0061
𝓞 hex=d835dcde
b hex=0062
Seems correct.
Using the same Java code, then with this output:
myString="G̲íl" hex=0047033200ed006c
myString.length()=4
myString.codePointCount(0,s.length())=4
BreakIterator output:
G̲ hex=00470332
í hex=00ed
l hex=006c
Seems correct too, EXCEPT for the codePointCount=4.
Why isn't it 3, and is there a means of getting
a 3 value without using BreakIterator?
My goal is to determine whether all (output) chars of a string are single 16-bit units, or whether surrogate or combining chars are present.
"G̲íl" is four code points: U+0047, U+0332, U+00ED, U+006C.
U+0332 is a combining character, but it is a separate code point. That's not the same as your first example, which requires using a surrogate pair (2 UTF-16 code units) to represent U+1D4DE - but the latter is still a single code point.
BreakIterator finds boundaries in text - the two code points here that are combined don't have a boundary between them in that sense. From the documentation:
Character boundary analysis allows users to interact with characters as they expect to, for example, when moving the cursor through a text string. Character boundary analysis provides correct navigation through character strings, regardless of how the character is stored.
So I think everything is working correctly here.
A codepoint corresponds to one Unicode character.
Java represents Unicode in UTF-16, i.e., in 16-bit units. Characters with codepoint values larger than U+FFFF are represented by a pair of 'surrogate characters', as in your first example. Thus the first result of 3.
In the second case, the G̲ is not a single Unicode character. It is one character, LATIN CAPITAL LETTER G, followed by another character, COMBINING LOW LINE. That is two codepoints per the definition. Thus the second result of 4.
In general, Unicode has tables of character attributes (I'm not sure if I have the right word here) and it is possible to find out that one of your codepoints is a combining character.
Take a look at the Character class. getType(character) will tell you if a codepoint is a combining character or a surrogate.
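A minimal sketch of such a check, using only the Character class and the two sample strings from the question:

public class CharKinds {
    static void inspect(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int type = Character.getType(c);
            boolean combining = type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK;
            System.out.printf("%04x surrogate=%b combining=%b%n",
                    (int) c, Character.isSurrogate(c), combining);
        }
    }

    public static void main(String[] args) {
        inspect("a\uD835\uDCDEb");   // a𝓞b -- contains a surrogate pair
        inspect("G\u0332\u00EDl");   // G̲íl -- contains a combining low line
    }
}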

How to substring in java based on length?

I want to trim a string in Java (from the starting chars) based on its length.
For example, if string1 is greater than 4000 bytes, I want to reduce it to a string that is less than or equal to 4000 bytes (the starting chars need to be trimmed, not the last chars).
Try this:
trimmed = str.substring(Math.max(0, str.length() - 4000));
(Bonus points if you can figure out what it is doing :-) )
However, note that this trims str to at most 4000 characters. Trimming a Java string to a given number of bytes makes no sense unless you specify the character encoding. And even if you do, it is a bit gnarly ... for variable length encodings such as UTF-8.
And it is worth noting that this can fail if your string contains Unicode codepoints outside of plane 0.
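If the 4000 really is a byte limit (say, for a database column), a rough sketch that keeps the trailing characters and never splits a code point might look like this, assuming UTF-8 as the encoding:

import java.nio.charset.StandardCharsets;

class ByteTrim {
    // Keep the tail of s whose UTF-8 encoding fits in maxBytes.
    static String trimToBytes(String s, int maxBytes) {
        int start = 0;
        while (start < s.length()
                && s.substring(start).getBytes(StandardCharsets.UTF_8).length > maxBytes) {
            // Skip a whole surrogate pair so we never land in the middle of one.
            start += Character.isHighSurrogate(s.charAt(start)) ? 2 : 1;
        }
        return s.substring(start);
    }
}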
It is literally this:
String s = sourceString.substring(/*Position to substring from*/ 0);

How to take a certain set of characters from a string and use those as a unicode value?

[Java] So I have this hexadecimal: 0x6c6c6548. I need to take two characters out at a time, use those two characters to get a unicode value and then concatenate them all into a string.
My idea was to take the last two digits using the charAt() method and then append them to a string starting with "\u00", but that doesn't work because the compiler treats the backslash as the start of an escape sequence, and you can't add another backslash in front of the first because then it just prints a literal backslash and doesn't convert it to the character.
So, for example, I need to take the 48 out and somehow convert it to its Unicode value, which is 'H', and then do that for all the pairs and put them into one string.
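There is no need to build a "\u00.." literal at run time; each hex pair can be parsed with Integer.parseInt and cast to a char. A sketch that walks the pairs from the end, as described, using the value from the question:

public class HexPairsToString {
    public static void main(String[] args) {
        String hex = "6c6c6548";
        StringBuilder out = new StringBuilder();
        // Take two hex digits at a time, starting from the end: 48, 65, 6c, 6c
        for (int i = hex.length() - 2; i >= 0; i -= 2) {
            out.append((char) Integer.parseInt(hex.substring(i, i + 2), 16));
        }
        System.out.println(out);   // Hell
    }
}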

Print unicode literal string as Unicode character

I need to print a unicode literal string as an equivalent unicode character.
System.out.println("\u00A5"); // prints ¥
System.out.println("\\u"+"00A5"); // prints \u00A5, but I need it to print ¥
How can I evaluate this string as a Unicode character?
As an alternative to the other options here, you could use:
int codepoint = 0x00A5; // Generate this however you want, maybe with Integer.parseInt
String s = String.valueOf(Character.toChars(codepoint));
This would have the advantage over other proposed techniques in that it would also work with Unicode codepoints outside of the basic multilingual plane.
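For example (the supplementary code point below is just an illustration, not from the question):

int yen = Integer.parseInt("00A5", 16);
System.out.println(String.valueOf(Character.toChars(yen)));   // ¥
int grin = Integer.parseInt("1F600", 16);
System.out.println(String.valueOf(Character.toChars(grin)));  // a supplementary code point, printed from two chars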
If you have a string:
System.out.println((char)(Integer.parseInt("00A5",16)));
probably works (haven't tested it)
Convert it to a character.
System.out.println((char) 0x00A5);
This will of course not work for very high code points; those may require two "characters" (a surrogate pair).
