Escaping non-latin characters in Java

Escaping non-latin characters in Java - java

I have a Java program that takes in a string and escapes it so that it can be safely passed to a program in bash. The strategy is basically to escape any of the special characters mentioned here and wrap the result in double quotes.
The algorithm is pretty simple -- just loop over the input string and use input.charAt(i) to check whether the current character needs to be escaped.
This strategy works quite well for characters that aren't represented by surrogate pairs, but I have some concerns if non-latin characters or something like an emoji is embedded in the string. In that case, if we assumed that an emoji was the first character in my input string, input.charAt(0) would give me the first code unit while input.charAt(1) would return the second code unit. My concern is that some of these code units might be interpreted as one of the special characters that need to be escaped. If that happened, I'd try to escape one of the code units which would irrevocably garble the input.
Is such a thing possible? Or is it safe to use input.charAt(i) for something like this?

From the Java docs:
The Java 2 platform uses the UTF-16 representation in char arrays and
in the String and StringBuffer classes. In this representation,
supplementary characters are represented as a pair of char values, the
first from the high-surrogates range, (\uD800-\uDBFF), the second from
the low-surrogates range (\uDC00-\uDFFF).
From the UTF-16 Wikipedia page:
U+D800 to U+DFFF: The Unicode standard permanently reserves these code point values for
UTF-16 encoding of the high and low surrogates, and they will never be
assigned a character, so there should be no reason to encode them. The
official Unicode standard says that no UTF forms, including UTF-16,
can encode these code points.
From the charAt javadoc:
Returns the char value at the specified index. An index ranges from 0
to length() - 1. The first char value of the sequence is at index 0,
the next at index 1, and so on, as for array indexing.
If the char value specified by the index is a surrogate, the surrogate
value is returned.
There is no overlap between the surrogate pair code point range and the range where my special characters ($,`,\ etc) exist as they're all using the ASCII character mappings (i.e. they're all mapped between 0 and 255).
Therefore, if I scan through a string that contains, say, an emoji (which definitely is outside of the supplementary character range) I won't mistake either of the items in the surrogate pair for a special character. Here's a simple test program:

Related

In Java, how are Unicode chars and Java UTF-16 codepoints handled?

I'm struggling with Unicode characters in Java 10.
I'm using the java.text.BreakIterator package.
For this output:
myString="a𝓞b" hex=0061d835dcde0062
myString.length()=4
myString.codePointCount(0,s.length())=3
BreakIterator output:
a hex=0061
𝓞 hex=d835dcde
b hex=0062
Seems correct.
Using the same Java code, then with this output:
myString="G̲íl" hex=0047033200ed006c
myString.length()=4
myString.codePointCount(0,s.length())=4
BreakIterator output:
G̲ hex=00470332
í hex=00ed
l hex=006c
Seems correct too, EXCEPT for the codePointCount=4.
Why isn't it 3, and is there a means of getting
a 3 value without using BreakIterator?
My goal is to determine if all (output) chars of a string are
16-bit, or are surrogate or combining chars present?

"G̲íl" is four code points: U+0047, U+0332, U+00ED, U+006C.
U+0332 is a combining character, but it is a separate code point. That's not the same as your first example, which requires using a surrogate pair (2 UTF-16 code units) to represent U+1D4DE - but the latter is still a single code point.
BreakIterator finds boundaries in text - the two code points here that are combined don't have a boundary between them in that sense. From the documentation:
Character boundary analysis allows users to interact with characters as they expect to, for example, when moving the cursor through a text string. Character boundary analysis provides correct navigation through character strings, regardless of how the character is stored.
So I think everything is working correctly here.

A codepoint corresponds to one Unicode character.
Java represents Unicode in UTF-16, i.e., in 16-bit units. Characters with codepoint values larger than U+FFFF are represented by a pair of 'surrogate characters', as in your first example. Thus the first result of 3.
In the second case, you have an example that is not a single Unicode character. It is one character, LETTER G, followed by another character COMBINING CHARACTER LOW LINE. That is two codepoints per the definition. Thus the second result of 4.
In general, Unicode has tables of character attributes (I'm not sure if I have the right word here) and it is possible to find out that one of your codepoints is a combining character.
Take a look at the Character class. getType(character) will tell you if a codepoint is a combining character or a surrogate.

What is the equivalent in .Net of the :print: character class from PHP or Java? [duplicate]

Is there a special regex statement like \w that denotes all printable characters? I'd like to validate that a string only contains a character that can be printed--i.e. does not contain ASCII control characters like \b (bell), or null, etc. Anything on the keyboard is fine, and so are UTF chars.
If there isn't a special statement, how can I specify this in a regex?

Very late to the party, but this regexp works: /[ -~]/.
How? It matches all characters in the range from space (ASCII DEC 32) to tilde (ASCII DEC 126), which is the range of all printable characters.
If you want to strip non-ASCII characters, you could use something like:
$someString.replace(/[^ -~]/g, '');
NOTE: this is not valid .net code, but an example of regexp usage for those who stumble upon this via search engines later.

If your regex flavor supports Unicode properties, this is probably the best the best way:
\P{Cc}
That matches any character that's not a control character, whether it be ASCII -- [\x00-\x1F\x7F] -- or Latin1 -- [\x80-\x9F] (also known as the C1 control characters).
The problem with POSIX classes like [:print:] or \p{Print} is that they can match different things depending on the regex flavor and, possibly, the locale settings of the underlying platform. In Java, they're strictly ASCII-oriented. That means \p{Print} matches only the ASCII printing characters -- [\x20-\x7E] -- while \P{Cntrl} (note the capital 'P') matches everything that's not an ASCII control character -- [^\x00-\x1F\x7F]. That is, it matches any ASCII character that isn't a control character, or any non-ASCII character--including C1 control characters.

TLDR Answer
Use this Regex...
\P{Cc}\P{Cn}\P{Cs}
Working Demo
In this demo, I use this regex to search the string "Hello, World!_". I'm going to add a weird character at the end, (char)4 — this is the character for END TRANSMISSION.
using System;
using System.Text.RegularExpressions;
public class Test {
public static void Main() {
// your code goes here
var regex = new Regex(#"![\P{Cc}\P{Cn}\P{Cs}]");
var matches = regex.Matches("Hello, World!" + (char)4);
Console.WriteLine("Results: " + matches.Count);
foreach (Match match in matches) {
Console.WriteLine("Result: " + match);
}
}
}
Full Working Demo at IDEOne.com
TLDR Explanation
\P{Cc} : Do not match control characters.
\P{Cn} : Do not match unassigned characters.
\P{Cs} : Do not match UTF-8-invalid characters.
Alternatives
\P{C} : Match only visible characters. Do not match any invisible characters.
\P{Cc} : Match only non-control characters. Do not match any control characters.
\P{Cc}\P{Cn} : Match only non-control characters that have been assigned. Do not match any control or unassigned characters.
\P{Cc}\P{Cn}\P{Cs} : Match only non-control characters that have been assigned and are UTF-8 valid. Do not match any control, unassigned, or UTF-8-invalid characters.
\P{Cc}\P{Cn}\P{Cs}\P{Cf} : Match only non-control, non-formatting characters that have been assigned and are UTF-8 valid. Do not match any control, unassigned, formatting, or UTF-8-invalid characters.
Source and Explanation
Take a look at the Unicode Character Properties available that can be used to test within a regex. You should be able to use these regexes in Microsoft .NET, JavaScript, Python, Java, PHP, Ruby, Perl, Golang, and even Adobe. Knowing Unicode character classes is very transferable knowledge, so I recommend using it!
All Matchable Unicode Character Sets
If you want to know any other character sets available, check out regular-expressions.info...
\p{L} or \p{Letter}: any kind of letter from any language.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another
character without taking up extra space (e.g. accents, umlauts, etc.).
\p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
\p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
\p{Zl} or \p{Line_Separator}: line separator character U+2028.
\p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
\p{Sm} or \p{Math_Symbol}: any mathematical symbol.
\p{Sc} or \p{Currency_Symbol}: any currency sign.
\p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
\p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N} or \p{Number}: any kind of numeric character in any script.
\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
\p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
\p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P} or \p{Punctuation}: any kind of punctuation character.
\p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
\p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
\p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
\p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
\p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
\p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
\p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} or \p{Other}: invisible control characters and unused code points.
\p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
\p{Cf} or \p{Format}: invisible formatting indicator.
\p{Co} or \p{Private_Use}: any code point reserved for private use.
\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
\p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.

There is a POSIX character class designation [:print:] that should match printable characters, and [:cntrl:] for control characters. Note that these match codes throughout the ASCII table, so they might not be suitable for matching other encodings.
Failing that, the expression [\x00-\x1f] will match through the ASCII control characters, although again, these could be printable in other encodings.

In Java, the \p{Print} option specifies the printable character class.

It depends wildly on what regex package you are using. This is one of these situations about which some wag said that the great thing about standards is there are so many to choose from.
If you happen to be using C, the isprint(3) function/macro is your friend.

Adding on to #Alan-Moore, \P{Cc} is actually as example of Negative Unicode Category or Unicode Block (ref: Character Classes in Regular Expressions). \P{name} matches any character that does not belong to a Unicode general category or named block. See the referred link for more examples of named blocks supported in .Net

Java String length() and substring(int, int) not consider certain characters?

Trying to solve an issue with trimming a string.
Are there any ascii chars that are not counted in either length() or substring(int, int)?
Ex. if the string is coming from a serialized object outside your program and contains characters such as "start of text" (ascii hx2) or "bell" (ascii hx7) will those characters be considered in either length() or substring(int, int)?

See the documentation for String#length:
Returns the length of this string. The length is equal to the number of Unicode code units in the string.
This means that all characters are included in the length. Specifically, this will return the number of chars required to represent the string in Java.
However, of note is that certain Unicode character will actually take up two chars in the string due to the way Java handles Unicode characters using UTF-16. See the relevant documentation for more information.

Are there any ascii chars that are not counted in either length() or substring(int, int)?
No, there aren't any. Both of these methods are "dumb" and will return the number of chars stored in the underlying character array of the String object (and in fact, .length() is inherited from CharSequence).
Whether they be ASCII control characters, "non characters" such as U+0000 and U+FFFF, all will be counted.

How to take a certain set of characters from a string and use those as a unicode value?

[Java] So I have this hexadecimal: 0x6c6c6548. I need to take two characters out at a time, use those two characters to get a unicode value and then concatenate them all into a string.
My idea was to take the last two digits using the charAt() method, and then adding them to a string starting with "\u00", but that doesn't work because the compiler thinks of the slash as an escape and you can't add another slash in front of the first because then it just prints a slash and doesn't convert it to unicode.
So like I need to take the 48 out and somehow convert it to it's unicode value, which is 'H' and then do that for all the pairs and put them into one string.

How to encode a string to replace all special characters

I have a string which contains special character. But I have to convert the string into a string without having any special character so I used Base64 But in Base64 we are using equals to symbol (=) which is a special character. But I want to convert the string into a string which will have only alphanumerical letters. Also I can't remove special character only i have to replace all the special characters to maintain unique between two different strings. How to achieve this, Which encoding will help me to achieve this?

The simplest option would be to encode the text to binary using UTF-8, and then convert the binary back to text as hex (two characters per byte). It won't be terribly efficient, but it will just be alphanumeric.
You could use base32 instead to be a bit more efficient, but that's likely to be significantly more work, unless you can find a library which supports it out of the box. (Libraries to perform hex encoding are very common.)

There are a number of variations of base64, some of which don't use padding. (You still have a couple of non-alphanumeric characters for characters 62 and 63.)
The Wikipedia page on base64 goes into the details, including the "standard" variations used for a number of common use-cases. (Does yours match one of those?)
If your strings have to be strictly alphanumeric, then you'll need to use hex encoding (one byte becomes 2 hex digits), or roll your own encoding scheme. Your stated requirements are rather unusual ...

Commons codec has a url safe version of base64, which emits - and _ instead of + and / characters
http://commons.apache.org/codec/apidocs/org/apache/commons/codec/binary/Base64.html#encodeBase64URLSafe(byte[])

The easiest way would be to use a regular expression to match all nonalphanumeric characters and replace them with an empty string.
// This will remove all special characters except space.
var cleaned = stringToReplace.replace(/[^\w\s]/gm, '')
Adding any special characters to the above regex will skip that character.
// This will remove all special characters except space and period.
var cleaned = stringToReplace.replace(/[^\w\s.]/gm, '')
A working example.
const regex = /[^\w\s]/gm;
const str = `This is a text with many special characters.
Hello, user, your password is 543#!\$32=!`;
const subst = ``;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
Regex explained.
[^\w\s]/gm
Match a single character not present in the list below [^\w\s]
\w matches any word character (equivalent to [a-zA-Z0-9_])
\s matches any whitespace character (equivalent to [\r\n\t\f\v \u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff])
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)

If you truly can only use alphanumerical characters you will have to come up with an escaping scheme that uses one of those chars for example, use 0 as the escape, and then encode the special char as a 2 char hex encoding of the ascii. Use 000 to mean 0.
e.g.
This is my special sentence with a 0.
encodes to:
This020is020my020special020sentence020with020a02000002e

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Escaping non-latin characters in Java - java

Related

In Java, how are Unicode chars and Java UTF-16 codepoints handled?

What is the equivalent in .Net of the :print: character class from PHP or Java? [duplicate]

Java String length() and substring(int, int) not consider certain characters?

How to take a certain set of characters from a string and use those as a unicode value?

How to encode a string to replace all special characters

Categories

Resources