How to substring in Java based on length?

I want to take a substring (dropping the starting characters) of a string in Java based on its length.
For example, if string1 is larger than 4000 bytes, I want to reduce it to a string of at most 4000 bytes. (The starting characters need to be trimmed, not the last ones.)

Try this:
trimmed = str.substring(Math.max(0, str.length() - 4000));
(Bonus points if you can figure out what it is doing :-) )
However, note that this trims str to at most 4000 characters, not bytes. Trimming a Java string to a given number of bytes makes no sense unless you specify the character encoding, and even if you do, it is a bit gnarly ... for variable-length encodings such as UTF-8.
It is also worth noting that this can split a surrogate pair (and so corrupt the text) if your string contains Unicode code points outside of plane 0.
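If you really do need to cap the UTF-8 byte length (rather than the character count) while keeping the end of the string, here is a rough sketch. It walks backwards one code point at a time so it never splits a surrogate pair; the method name and the 4000-byte budget are just illustrative:
import java.nio.charset.StandardCharsets;

public class Utf8Trim {

    // Keep as much of the *end* of the string as fits in maxBytes of UTF-8.
    static String trimLeadingToUtf8Bytes(String s, int maxBytes) {
        int bytes = 0;
        int start = s.length();                 // index of the first kept char
        int i = s.length();
        while (i > 0) {
            int cp = s.codePointBefore(i);      // never splits a surrogate pair
            int cpChars = Character.charCount(cp);
            int cpBytes = new String(Character.toChars(cp))
                    .getBytes(StandardCharsets.UTF_8).length;
            if (bytes + cpBytes > maxBytes) {
                break;                          // adding this code point would overflow
            }
            bytes += cpBytes;
            i -= cpChars;
            start = i;
        }
        return s.substring(start);
    }

    public static void main(String[] args) {
        String str = "...";                     // your (possibly long) input
        String trimmed = trimLeadingToUtf8Bytes(str, 4000);
        System.out.println(trimmed.getBytes(StandardCharsets.UTF_8).length <= 4000); // true
    }
}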

It is literally this:
String s = sourceString.substring(/* position to substring from */ 0);
Note that passing 0 simply returns the whole string; pass the index of the first character you want to keep (here, str.length() - 4000, clamped to 0).

Related

Escaping non-Latin characters in Java

I have a Java program that takes in a string and escapes it so that it can be safely passed to a program in bash. The strategy is basically to escape any of the special characters mentioned here and wrap the result in double quotes.
The algorithm is pretty simple -- just loop over the input string and use input.charAt(i) to check whether the current character needs to be escaped.
This strategy works quite well for characters that aren't represented by surrogate pairs, but I have some concerns when non-Latin characters or something like an emoji is embedded in the string. In that case, if we assume that an emoji is the first character in my input string, input.charAt(0) would give me the first code unit while input.charAt(1) would return the second code unit. My concern is that some of these code units might be interpreted as one of the special characters that need to be escaped. If that happened, I'd try to escape one of the code units, which would irrevocably garble the input.
Is such a thing possible? Or is it safe to use input.charAt(i) for something like this?
From the Java docs:
The Java 2 platform uses the UTF-16 representation in char arrays and
in the String and StringBuffer classes. In this representation,
supplementary characters are represented as a pair of char values, the
first from the high-surrogates range, (\uD800-\uDBFF), the second from
the low-surrogates range (\uDC00-\uDFFF).
From the UTF-16 Wikipedia page:
U+D800 to U+DFFF: The Unicode standard permanently reserves these code point values for
UTF-16 encoding of the high and low surrogates, and they will never be
assigned a character, so there should be no reason to encode them. The
official Unicode standard says that no UTF forms, including UTF-16,
can encode these code points.
From the charAt javadoc:
Returns the char value at the specified index. An index ranges from 0
to length() - 1. The first char value of the sequence is at index 0,
the next at index 1, and so on, as for array indexing.
If the char value specified by the index is a surrogate, the surrogate
value is returned.
There is no overlap between the surrogate code unit range and the range where my special characters ($, `, \, etc.) live, since they all use the ASCII character mappings (i.e. they are all mapped between 0 and 127).
Therefore, if I scan through a string that contains, say, an emoji (which definitely lies outside the Basic Multilingual Plane and is therefore encoded as a surrogate pair), I won't mistake either half of the pair for a special character. Here's a simple test program:
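A minimal sketch of such a test (the emoji and the set of shell specials below are just examples):
public class SurrogateCheck {
    public static void main(String[] args) {
        // Starts with an emoji (U+1F600), which UTF-16 encodes as the
        // surrogate pair \uD83D\uDE00.
        String input = "\uD83D\uDE00 echo $HOME `date`";
        String specials = "$`\\\"!";            // assumed set of bash specials

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isSurrogate(c)) {
                // Surrogate code units live in \uD800-\uDFFF, far above ASCII,
                // so they can never be mistaken for one of the specials.
                System.out.printf("index %d: surrogate unit U+%04X%n", i, (int) c);
            } else if (specials.indexOf(c) >= 0) {
                System.out.printf("index %d: special char %c needs escaping%n", i, c);
            }
        }
    }
}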

How to check whether a string of characters is ASCII in Java

In order to check whether a string of characters is ASCII or not, which one of the two options below is the better choice?
java.nio.charset.Charset.forName("US-ASCII").newEncoder().canEncode("Desired character string to be checked")
or convert the String to a character array and use the
org.apache.commons.lang.CharUtils.isAscii() method to check whether each character is ASCII.
What are their differences, and which one is better performance-wise? I know that the second option has the additional step of converting the string to a character array first and then checking each character.
You can use regex as a quick shortcut.
String asciiText = "Hello";
System.out.println(asciiText.matches("\\A\\p{ASCII}*\\z"));
This matches only if every character in the string is ASCII.
Regards.
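For comparison, a small JDK-only sketch of both checks (the CharsetEncoder from the question and the regex above), plus a per-character loop in the spirit of CharUtils.isAscii:
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class AsciiCheck {
    public static void main(String[] args) {
        String text = "Hello ¥";

        // Option 1 from the question: CharsetEncoder.canEncode
        CharsetEncoder asciiEncoder = StandardCharsets.US_ASCII.newEncoder();
        System.out.println(asciiEncoder.canEncode(text));        // false

        // The regex shortcut from this answer
        System.out.println(text.matches("\\A\\p{ASCII}*\\z"));   // false

        // Per-character check, roughly what CharUtils.isAscii does for each char
        System.out.println(text.chars().allMatch(c -> c < 128)); // false
    }
}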

How do I use high-order unicode characters in java?

How do I use Unicode characters in Java, like the Negative Squared Latin Capital Letter E? Using "\u1F174" doesn't work, as the \u escape only accepts 4 hex digits.
You need to specify it as a surrogate pair - two UTF-16 code units.
For example, if you copy and paste the character into my Unicode explorer you can see that U+1F174 is represented in UTF-16 code units as U+D83C U+DD74. (You can work this out manually, of course.) So you could write it in a Java string literal as:
String text = "\uD83C\uDD74";
Other options include:
String text = new StringBuilder().appendCodePoint(0x1f174).toString();
String text = new String(new int[] { 0x1f174 }, 0, 1);
char[] chars = Character.toChars(0x1f174);
String text = new String(chars);
Alternatively, you can embed the surrogate pair directly in a string literal:
"\uD83C\uDD74"
Or indeed
"🅴"
Because Java characters represent UTF-16 code units rather than actual Unicode characters, you need to represent this character as a string containing the two UTF-16 surrogates.
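A small sketch tying the options above together; all of them build the same two-code-unit string for U+1F174:
public class HighCodePoint {
    public static void main(String[] args) {
        String literal = "\uD83C\uDD74";                            // surrogate-pair escape
        String fromBuilder = new StringBuilder().appendCodePoint(0x1F174).toString();
        String fromIntArray = new String(new int[] { 0x1F174 }, 0, 1);
        String fromChars = new String(Character.toChars(0x1F174));

        System.out.println(literal.equals(fromBuilder));            // true
        System.out.println(literal.equals(fromIntArray));           // true
        System.out.println(literal.equals(fromChars));              // true
        System.out.println(literal.length());                       // 2 UTF-16 code units
        System.out.println(literal.codePointCount(0, literal.length())); // 1 code point
    }
}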

Print unicode literal string as Unicode character

I need to print a unicode literal string as an equivalent unicode character.
System.out.println("\u00A5"); // prints ¥
System.out.println("\\u"+"00A5"); //prints \u0045 I need to print it as ¥
How can evaluate this string a unicode character ?
As an alternative to the other options here, you could use:
int codepoint = 0x00A5; // Generate this however you want, maybe with Integer.parseInt
String s = String.valueOf(Character.toChars(codepoint));
This would have the advantage over other proposed techniques in that it would also work with Unicode codepoints outside of the basic multilingual plane.
If you have a string:
System.out.println((char)(Integer.parseInt("00A5",16)));
probably works (haven't tested it)
Convert it to a character.
System.out.println((char) 0x00A5);
This will of course not work for very high code points, those may require 2 "characters".
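If the \uXXXX escapes arrive inside an ordinary string at runtime, one way to evaluate them is to find each escape with a regex and substitute the character it names. A rough sketch (the method name is just illustrative, and it only handles the 4-hex-digit BMP form):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnescapeUnicode {

    // Replace every \uXXXX sequence with the character it names (BMP only).
    static String unescape(String input) {
        Pattern p = Pattern.compile("\\\\u([0-9a-fA-F]{4})");
        Matcher m = p.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("\\u" + "00A5"));   // prints ¥
    }
}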

How to encode a string to replace all special characters

I have a string which contains special characters, but I need to convert it into a string without any special characters. I used Base64, but Base64 uses the equals sign (=), which is itself a special character. I want to convert the string into one that contains only alphanumeric characters. Also, I can't simply remove the special characters; I have to replace them so that two different input strings still map to different outputs. How do I achieve this, and which encoding will help me do it?
The simplest option would be to encode the text to binary using UTF-8, and then convert the binary back to text as hex (two characters per byte). It won't be terribly efficient, but it will just be alphanumeric.
You could use base32 instead to be a bit more efficient, but that's likely to be significantly more work, unless you can find a library which supports it out of the box. (Libraries to perform hex encoding are very common.)
There are a number of variations of base64, some of which don't use padding. (You still have a couple of non-alphanumeric characters for characters 62 and 63.)
The Wikipedia page on base64 goes into the details, including the "standard" variations used for a number of common use-cases. (Does yours match one of those?)
If your strings have to be strictly alphanumeric, then you'll need to use hex encoding (one byte becomes 2 hex digits), or roll your own encoding scheme. Your stated requirements are rather unusual ...
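A minimal sketch of the hex option (names are illustrative): encode the text as UTF-8 bytes and render each byte as two hex digits, so the result uses only 0-9 and a-f.
import java.nio.charset.StandardCharsets;

public class HexEncode {

    static String toHex(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));       // two hex digits per byte
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex("a=b&c"));            // 613d622663
    }
}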
Commons Codec has a URL-safe version of Base64, which emits - and _ instead of the + and / characters:
http://commons.apache.org/codec/apidocs/org/apache/commons/codec/binary/Base64.html#encodeBase64URLSafe(byte[])
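If - and _ are tolerable in the output, the same URL-safe variant has also been built into the JDK since Java 8 (java.util.Base64), so no extra dependency is needed; a quick sketch:
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class UrlSafeBase64 {
    public static void main(String[] args) {
        String input = "foo+bar/baz?";
        String encoded = Base64.getUrlEncoder()
                .withoutPadding()                  // drop the '=' padding as well
                .encodeToString(input.getBytes(StandardCharsets.UTF_8));
        System.out.println(encoded);               // alphabet: A-Z a-z 0-9 - _
    }
}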
The easiest way would be to use a regular expression to match all nonalphanumeric characters and replace them with an empty string.
// This will remove all special characters except space.
var cleaned = stringToReplace.replace(/[^\w\s]/gm, '')
Adding a character inside the character class above will exclude it from removal.
// This will remove all special characters except space and period.
var cleaned = stringToReplace.replace(/[^\w\s.]/gm, '')
A working example.
const regex = /[^\w\s]/gm;
const str = `This is a text with many special characters.
Hello, user, your password is 543#!\$32=!`;
const subst = ``;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
Regex explained.
/[^\w\s]/gm
Match a single character not present in the list below [^\w\s]
\w matches any word character (equivalent to [a-zA-Z0-9_])
\s matches any whitespace character (equivalent to [\r\n\t\f\v \u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff])
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
If you truly can only use alphanumeric characters, you will have to come up with an escaping scheme that uses one of those characters. For example, use 0 as the escape character, and then encode each special character as a 2-character hex encoding of its ASCII code. Use 000 to mean a literal 0.
e.g.
This is my special sentence with a 0.
encodes to:
This020is020my020special020sentence020with020a02000002e
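A rough sketch of this escaping scheme (names are illustrative, and it assumes the input characters fit in a single byte, e.g. Latin-1, so two hex digits are always enough):
public class ZeroEscape {

    // '0' is the escape character: any non-alphanumeric char becomes
    // '0' + two hex digits of its code, and a literal '0' becomes "000".
    static String escape(String input) {
        StringBuilder sb = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (c == '0') {
                sb.append("000");
            } else if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '1' && c <= '9')) {
                sb.append(c);                       // plain alphanumeric, kept as-is
            } else {
                sb.append(String.format("0%02x", (int) c));
            }
        }
        return sb.toString();
    }

    static String unescape(String input) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '0') {
                String hex = input.substring(i + 1, i + 3);
                sb.append(hex.equals("00") ? '0' : (char) Integer.parseInt(hex, 16));
                i += 2;                             // skip the two hex digits
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "This is my special sentence with a 0.";
        String encoded = escape(original);
        System.out.println(encoded); // This020is020my020special020sentence020with020a02000002e
        System.out.println(unescape(encoded).equals(original)); // true
    }
}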
