Removing supplementary characters from a Java string [duplicate]

This question already has answers here:
What is the regex to extract all the emojis from a string?
(18 answers)
Closed 5 years ago.
I have a Java string that contains supplementary characters (characters in the Unicode standard whose code points are above U+FFFF). These characters could for example be emojis. I want to remove those characters from the string, i.e. replace them with the empty string "".
How do I remove supplementary characters from a string?
How do I remove characters from an arbitrary code point range? (For example all characters in the range 1F000–1FFFF)?

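There are a couple of approaches. Since a regex replace is comparatively expensive, you could instead drop every surrogate code unit in a single pass:
String basic(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (char ch : s.toCharArray()) {
        // Supplementary characters are stored as a high/low surrogate pair; skip both halves.
        if (!Character.isLowSurrogate(ch) && !Character.isHighSurrogate(ch)) {
            sb.append(ch);
        }
    }
    // If nothing was removed, return the original string instead of a copy.
    return sb.length() == s.length() ? s : sb.toString();
}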

You can get the UTF-16 code unit value of a char by simply converting it to an int. A supplementary character occupies two chars, so use String.codePointAt() when you need its full code point.
Therefore, you'll want to do the following:
1. Iterate over the String, either by converting it to a char[] or by indexing into it with String.charAt() (or String.codePointAt()).
2. Check whether the value is one you want to remove.
3. If so, skip it. Strings are immutable, so build the result in a StringBuilder rather than replacing characters in place.
This is just to get you started; if you're still struggling, I can try to type out a whole example.
Good luck!
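To make the difference between a char value and a code point concrete, here is a small sketch. The class and method names are made up for illustration. It lists the hex value of every char in a string, and then the hex value of every code point:

```java
import java.util.ArrayList;
import java.util.List;

class CodeUnitDemo {
    // Hex value of every UTF-16 code unit (what casting a char to int gives you).
    static List<String> codeUnits(String s) {
        List<String> out = new ArrayList<>();
        for (char ch : s.toCharArray()) {
            out.add(Integer.toHexString((int) ch));
        }
        return out;
    }

    // Hex value of every Unicode code point (surrogate pairs are combined).
    static List<String> codePoints(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(Integer.toHexString(cp));
            i += Character.charCount(cp); // 1 for BMP characters, 2 for supplementary ones
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE02"; // "a" followed by U+1F602
        System.out.println(codeUnits(s));  // [61, d83d, de02]
        System.out.println(codePoints(s)); // [61, 1f602]
    }
}
```

The cast sees the emoji as two surrogate values, while codePointAt() reports the single supplementary code point, which is the value to compare against a range such as 1F000-1FFFF.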

Here is a code snippet that collects the characters whose code points lie strictly between 60 and 100:
public class Test {
    public static void main(String[] args) {
        new Test().go();
    }

    private void go() {
        String s = "ABC12三￮";
        StringBuilder ret = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.println("code point: " + cp);
            if (cp > 60 && cp < 100) {
                ret.appendCodePoint(cp);
            }
            // Advance by 1 or 2 chars so that surrogate pairs are not split.
            i += Character.charCount(cp);
        }
        System.out.println("result: " + ret);
    }
}
the result:
code point: 65
code point: 66
code point: 67
code point: 49
code point: 50
code point: 19977
code point: 65518
result: ABC
Hope this helps.

Java strings are UTF-16 encoded. The String type has a codePointAt() method for retrieving a decoded codepoint at a given char (codeunit) index.
So, you can do something like this, for instance:
String removeSupplementaryChars(String s)
{
    int len = s.length();
    if (len == 0)
        return "";
    StringBuilder sb = new StringBuilder(len);
    int i = 0;
    do
    {
        // A code point up to 0xFFFF fits in a single char, so copy it.
        if (s.codePointAt(i) <= 0xFFFF)
            sb.append(s.charAt(i));
        // Step over one code point (one or two chars).
        i = s.offsetByCodePoints(i, 1);
    }
    while (i < len);
    return sb.toString();
}
Or this:
String removeCodepointsinRange(String s, int lower, int upper)
{
    int len = s.length();
    if (len == 0)
        return "";
    StringBuilder sb = new StringBuilder(len);
    int i = 0;
    do
    {
        int cp = s.codePointAt(i);
        if ((cp < lower) || (cp > upper))
            sb.appendCodePoint(cp);
        i = s.offsetByCodePoints(i, 1);
    }
    while (i < len);
    return sb.toString();
}
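For comparison, the same filtering can be written more compactly. The sketch below is not the answer's code: the class and method names are hypothetical. It assumes Java 8 or later for CharSequence.codePoints(), and Java 7 or later for the \x{...} code point syntax in regular expressions:

```java
class CodePointFilter {
    // Keeps only the code points outside the inclusive range [lower, upper].
    static String removeRange(String s, int lower, int upper) {
        return s.codePoints()
                .filter(cp -> cp < lower || cp > upper)
                .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
                .toString();
    }

    // Supplementary characters are exactly the code points U+10000..U+10FFFF.
    static String removeSupplementary(String s) {
        return removeRange(s, Character.MIN_SUPPLEMENTARY_CODE_POINT, Character.MAX_CODE_POINT);
    }

    // Regex variant; shorter, but likely slower than the loop for large inputs.
    static String removeSupplementaryRegex(String s) {
        return s.replaceAll("[\\x{10000}-\\x{10FFFF}]", "");
    }

    public static void main(String[] args) {
        String s = "ABC\uD83D\uDE02\u4E09"; // "ABC", U+1F602, then the BMP character U+4E09
        System.out.println(removeRange(s, 0x1F000, 0x1FFFF).equals("ABC\u4E09")); // true
        System.out.println(removeSupplementary(s).equals("ABC\u4E09"));           // true
        System.out.println(removeSupplementaryRegex(s).equals("ABC\u4E09"));      // true
    }
}
```

All three variants leave BMP characters untouched, including non-ASCII ones such as U+4E09.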

Related

How to efficiently remove consecutive same characters in a string

I wrote a method that reduces a run of identical characters to a single character, as shown below. The logic seems correct, but according to my tutor there is room for improvement in terms of performance. Could anyone shed some light on this?
Comments on aspects other than performance are also much appreciated.
public class RemoveRepetitions {
public static String remove(String input) {
String ret = "";
String last = "";
String[] stringArray = input.split("");
for(int j=0; j < stringArray.length; j++) {
if (! last.equals(stringArray[j]) ) {
ret += stringArray[j];
}
last = stringArray[j];
}
return ret;
}
public static void main(String[] args) {
System.out.println(RemoveRepetitions.remove("foobaarrbuzz"));
}
}

We can improve performance by using a StringBuilder instead of String concatenation: each += on a String allocates a new String and copies the old contents. The split call is not needed either, and it also slows the program down.
Here is a way to solve this:
public static String remove(String input)
{
StringBuilder answer = new StringBuilder("");
int N = input.length();
int i = 0;
while (i < N)
{
char c = input.charAt(i);
answer.append( c );
while (i<N && input.charAt(i)==c)
++i;
}
return answer.toString();
}
The idea is to iterate over all characters of the input string and keep appending every new character to the answer and skip all the same consecutive characters.

Some things to consider about your code:
Time complexity: your code makes a single O(n) pass over the input, which is asymptotically optimal, although the repeated String concatenation adds copying on top of that.
Space complexity: your code uses extra memory for the array produced by split.
Question to ask: can you produce the same output without that extra array? (Character-by-character traversal is possible directly on the string.)
I could provide the code here, but it would be great if you tried it on your own first. Once you are done with your attempts, you can look up a solution here (you are almost there):
https://www.geeksforgeeks.org/remove-consecutive-duplicates-string/
Good luck!

As mentioned before, it is much better to access the characters with String::charAt, or at least to iterate over the char array returned by String::toCharArray, than to split the input into a String array.
However, Java strings may contain characters outside the Basic Multilingual Plane of Unicode (e.g. emojis 😂😍😊 or some rarer CJK ideographs), each of which is stored as two chars, so String::codePointAt should be used. Accordingly, Character.charCount should be used to compute the offset of the next code point while iterating over the input string.
The input string should also be checked for null or empty, so the resulting code may look like this:
public static String dedup(String str) {
if (null == str || str.isEmpty()) {
return str;
}
int prev = -1;
int n = str.length();
System.out.println("length = " + n + " of [" + str + "], real length: " + str.codePointCount(0, n));
StringBuilder sb = new StringBuilder(n);
for (int i = 0; i < n; ) {
int cp = str.codePointAt(i);
if (i == 0 || cp != prev) {
sb.appendCodePoint(cp);
}
prev = cp;
i += Character.charCount(cp); // for emojis it returns 2
}
return sb.toString();
}
A version with String::charAt may look like this:
public static String dedup2(String str) {
if (null == str || str.isEmpty()) {
return str;
}
int n = str.length();
StringBuilder sb = new StringBuilder(n);
sb.append(str.charAt(0));
for (int i = 1; i < n; i++) {
if (str.charAt(i) != str.charAt(i - 1)) {
sb.append(str.charAt(i));
}
}
return sb.toString();
}
The following test shows that the charAt version fails to deduplicate repeated emojis, because it compares individual surrogate chars rather than whole code points:
System.out.println("codePoint: " + dedup ("😂😂😍😍😊😊😂 hello"));
System.out.println("charAt: " + dedup2("😂😂😍😍😊😊😂 hello"));
Output:
length = 20 of [😂😂😍😍😊😊😂 hello], real length: 13
codePoint: 😂😍😊😂 helo
charAt: 😂😂😍😍😊😊😂 helo

How to i mask all string characters except for the last 4 characters in Java using parameters?

I would like to know how to mask all characters of a string except the last 4.
I want to mask them using "X".
For example:
Number: "S1234567B"
Result:
Number: "XXXXX567B"
Thank you.

Solution 1
You can do it with a regular expression.
This is the shortest solution.
static String mask(String input) {
return input.replaceAll(".(?=.{4})", "X");
}
The regex matches any single character (.) that is followed (zero-width positive lookahead) by at least 4 characters ((?=.{4})). Replace each such single character with an X.
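Since the question asks for a version that takes parameters, the lookahead idea can be generalised. The following is a minimal sketch and not part of the original answer: the three-argument mask helper and the class name are hypothetical:

```java
import java.util.regex.Matcher;

class RegexMask {
    // Masks every character that is followed by at least `visible` more characters.
    static String mask(String input, int visible, char maskChar) {
        if (visible < 0) {
            throw new IllegalArgumentException("visible must be >= 0");
        }
        String regex = ".(?=.{" + visible + "})";
        // quoteReplacement keeps mask characters such as '$' from being read as group references.
        return input.replaceAll(regex, Matcher.quoteReplacement(String.valueOf(maskChar)));
    }

    public static void main(String[] args) {
        System.out.println(mask("S1234567B", 4, 'X')); // XXXXX567B
        System.out.println(mask("1234", 4, 'X'));      // 1234 (nothing to mask)
        System.out.println(mask("S1234567B", 2, '*')); // *******7B
    }
}
```

Inputs no longer than the visible suffix come back unchanged, because no character in them has enough characters after it for the lookahead to succeed.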
Solution 2
You can do it by getting a char[] (see note 1 below), updating it, and building a new string.
This is the fastest solution, and uses the least amount of memory.
static String mask(String input) {
if (input.length() <= 4)
return input; // Nothing to mask
char[] buf = input.toCharArray();
Arrays.fill(buf, 0, buf.length - 4, 'X');
return new String(buf);
}
1) Better than using a StringBuilder.
Solution 3
You can do it using the repeat(int count) method that was added to String in Java 11.
This is likely the easiest solution to understand.
static String mask(String input) {
int maskLen = input.length() - 4;
if (maskLen <= 0)
return input; // Nothing to mask
return "X".repeat(maskLen) + input.substring(maskLen);
}

Here is a Kotlin extension function that takes the number of stars to prepend and the number of trailing characters to keep. For example, to turn the string "12345678912345" into ****2345, you would have:
fun String.maskStringWithStars(numberOfStars: Int, numberOfDigitsToBeShown: Int): String {
var stars = ""
for (i in 1..numberOfStars) {
stars += "*"
}
return if (this.length > numberOfDigitsToBeShown) {
val lastDigits = this.takeLast(numberOfDigitsToBeShown)
"$stars$lastDigits"
} else {
stars
}
}
Usage:
companion object{
const val DEFAULT_NUMBER_OF_STARS = 4
const val DEFAULT_NUMBER_OF_DIGITS_TO_BE_SHOWN = 4
}
yourString.maskStringWithStars(DEFAULT_NUMBER_OF_STARS,DEFAULT_NUMBER_OF_DIGITS_TO_BE_SHOWN)

You can do it with the help of StringBuilder in Java as follows (this assumes the input has at least 4 characters):
String value = "S1234567B";
String formattedString = new StringBuilder(value)
.replace(0, value.length() - 4, new String(new char[value.length() - 4]).replace("\0", "X")).toString();
System.out.println(formattedString);

You can use a StringBuilder.
StringBuilder sb = new StringBuilder("S1234567B");
for (int i = 0 ; i < sb.length() - 4 ; i++) { // note the upper limit of the for loop
// sets every character to X until the fourth to last character
sb.setCharAt(i, 'X');
}
String result = sb.toString();

My class to mask simple String
class MaskFormatter(private val pattern: String, private val splitter: Char? = null) {
fun format(text: String): String {
val patternArr = pattern.toCharArray()
val textArr = text.toCharArray()
var textI = 0
for (patternI in patternArr.indices) {
if (patternArr[patternI] == splitter) {
continue
}
if (patternArr[patternI] == 'A' && textI < textArr.size) {
patternArr[patternI] = textArr[textI]
}
textI++
}
return String(patternArr)
}
}
Example use
MaskFormatter("XXXXXAAAA").format("S1234567B") // XXXXX567B
MaskFormatter("XX.XXX.AAAA", '.').format("S1234567B") // XX.XXX.567B
MaskFormatter("**.***.AAAA", '.').format("S1234567B") // **.***.567B
MaskFormatter("AA-AAA-AAAA",'-').format("123456789") // 12-345-6789

Why can't I store Japanese UTF-8 characters in char array in Java?

I have a string "1234567(Asics (アシックスワーキング) )". It contains Unicode characters; some of them are ASCII and some are not. As far as I understand, Java uses one byte for an ASCII character and two bytes for other Unicode characters.
Some part of my program is unable to process the string in this format, so I wanted to encode the values as escape sequences.
So the string
"1234567(Asics (アシックスワーキング) )"
would map to
"\u0031\u0032\u0033\u0034\u0035\u0036\u0037\u0028\u0041\u0073\u0069\u0063\u0073\u0020\u0028\u30a2\u30b7\u30c3\u30af\u30b9\u30ef\u30fc\u30ad\u30f3\u30b0\u0029\u0020\u0029"
.
I wrote this function to do this :-
public static String convertToEscaped(String utf8) throws java.lang.Exception
{
char[] str = utf8.toCharArray();
StringBuilder unicodeStringBuilder = new StringBuilder();
for(int i = 0; i < str.length; i++){
char charValue = str[i];
int intValue = (int) charValue;
String hexValue = Integer.toHexString(intValue);
unicodeStringBuilder.append("\\u");
for (int length = hexValue.length(); length < 4; length++) {
unicodeStringBuilder.append("0");
}
unicodeStringBuilder.append(hexValue);
}
return unicodeStringBuilder.toString();
}
This worked fine outside of my program but caused issues inside it. The problem occurred at the line char[] str = utf8.toCharArray();
Somehow I was losing my Japanese characters, apparently because each of them was being split in two in the char array.
So I decided to go with byte [] instead.
public static String convertToEscaped(String utf8) throws java.lang.Exception
{
byte str[] = utf8.getBytes();
StringBuilder unicodeStringBuilder = new StringBuilder();
for(int i = 0; i < str.length - 1 ; i+=2){
int intValue = (int) str[i]* 256 + (int)str[i+1];
String hexValue = Integer.toHexString(intValue);
unicodeStringBuilder.append("\\u");
for (int length = hexValue.length(); length < 4; length++) {
unicodeStringBuilder.append("0");
}
unicodeStringBuilder.append(hexValue);
}
return unicodeStringBuilder.toString();
}
Output :
\u3132\u3334\u3536\u3738\u2841\u7369\u6373\u2028\uffffe282\uffffa1e3\uffff81b7\uffffe283\uffff82e3\uffff81af\uffffe282\uffffb8e3\uffff82af\uffffe283\uffffbbe3\uffff81ad\uffffe283\uffffb2e3\uffff81b0\u2920
But this is also wrong as I am merging two single byte characters into one. What can I do to overcome this?

I don't know your other code's specific requirements. But my advice is to not reinvent the wheel and use the built-in encoding capabilities of the API.
For instance call getBytes with either StandardCharsets.UTF_16BE or StandardCharsets.UTF_16LE based on the endian-ness you need:
String s = "1234567(Asics (アシックスワーキング) )";
byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
byte[] utf16 = s.getBytes(StandardCharsets.UTF_16BE); // high order byte first
System.out.println(s.length()); // 28
System.out.println(utf8.length); // 48
System.out.println(utf16.length); // 56 (2 bytes for each char)

As noted in the comments above, Java strings are represented internally as UTF-16, so each char is one 16-bit code unit and maps directly to one \uXXXX escape. String.format with %04x takes care of zero-padding the value to four hex digits.
I renamed the parameter to theString and removed the throws Exception clause from your original method, since no exception is thrown (declaring such generic exceptions is bad practice in general).
public static String convertToEscaped(String theString) {
    StringBuilder sb = new StringBuilder(theString.length() * 6);
    for (int i = 0; i < theString.length(); i++) {
        // Each char is one UTF-16 code unit; always emit exactly four hex digits.
        sb.append(String.format("\\u%04x", (int) theString.charAt(i)));
    }
    return sb.toString();
}

Remove duplicate characters in a string in Java

I have started reading the famous book "Cracking the Coding Interview". One of the exercises is:
Design an algorithm and write code to remove the duplicate characters in a string
without using any additional buffer. NOTE: One or two additional variables are fine.
An extra copy of the array is not.
I found a similar topic here: Remove the duplicate characters in a string
The solution given by the author was this:
public static void removeDuplicates(char[] str) {
if (str == null) return;
int len = str.length;
if (len < 2) return;
int tail = 1;
for (int i = 1; i < len; ++i) {
int j;
for (j = 0; j < tail; ++j) {
if (str[i] == str[j]) break;
}
if (j == tail) {
str[tail] = str[i];
++tail;
}
}
str[tail] = 0;
}
The problem here is that the author takes an array as the argument of this function. So my question is: how can you write such an algorithm with a String as the argument? It seems much easier to use an array here, and doing so feels like avoiding the difficulty of the exercise (in my opinion; I'm new to Java).
How can you write such an algorithm?

Java strings are immutable, so you can't do it with a string without copying the array into a buffer.
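To illustrate the point, a String-based version can only wrap the array algorithm: copy the characters out, deduplicate the array in place, and build a new String from the kept prefix. The sketch below uses made-up names, and the array method returns the new length instead of writing a 0 terminator. Note that the copy made by toCharArray() is itself the kind of extra buffer the exercise forbids:

```java
class DedupString {
    // Removes duplicates in place and returns the number of chars kept at the front.
    static int dedupInPlace(char[] str) {
        if (str == null) return 0;
        if (str.length < 2) return str.length;
        int tail = 1; // str[0..tail) holds the distinct chars seen so far
        for (int i = 1; i < str.length; i++) {
            int j;
            for (j = 0; j < tail; j++) {
                if (str[i] == str[j]) break; // already kept, so skip it
            }
            if (j == tail) {
                str[tail++] = str[i];
            }
        }
        return tail;
    }

    // String wrapper: the copy is unavoidable because String is immutable.
    static String removeDuplicates(String s) {
        if (s == null) return null;
        char[] chars = s.toCharArray();
        int len = dedupInPlace(chars);
        return new String(chars, 0, len);
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicates("abracadabra")); // abrcd
    }
}
```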

For this to work with a String, you'd have to return a String from the method that represents the modified str with no duplicates. I'm not sure whether that goes against the rules, but here's how I'd solve the problem with Strings:
For each character in the string, I would split the string at that character and remove all instances of that character from the latter substring. I would then concatenate the former substring with the modified latter substring, making sure that the character is still kept in its place. Something like this:
public static String removeDuplicates( String str ) {
    if( str == null || str.length() < 2 )
        return str;
    String temp;
    for( int x = 0; x + 1 < str.length(); x++ ) {
        temp = str.charAt( x ) + "";
        // replace() treats temp literally; replaceAll() would interpret it as a regex.
        str = str.substring( 0, x ) + temp + str.substring( x + 1 ).replace( temp, "" );
    }
    return str;
}

In Java 8 we can do it like this
private void removeduplicatecharactersfromstring() {
String myString = "aabcd eeffff ghjkjkl";
StringBuilder builder = new StringBuilder();
System.out.println(myString);
Arrays.asList(myString.split(" "))
.forEach(s -> {
builder.append(Stream.of(s.split(""))
.distinct().collect(Collectors.joining()).concat(" "));
});
System.out.println(builder); // abcd ef ghjkl
}

Java: Remove non alphabet character from a String without regex

Is there a way to remove all non-alphabetic characters from a String without regex?
I'm trying to check whether the String is a palindrome.
This is what I have tried so far:
public static boolean isPalindrome( String text )
{
int textLength = text.length() - 1;
String reformattedText = text.trim().toLowerCase();
for( int i = 0; i <= textLength; i++ )
{
if( reformattedText.charAt( i ) != reformattedText.charAt( textLength - i ) )
{
return false;
}
}
return true;
}
But if the input is:
System.out.println( isPalindrome( "Are we not pure? No sir! Panama’s moody"
+ "Noriega brags. It is garbage! Irony dooms a man; a prisoner up to new era." ) );
It should be true.
I'm really having a hard time thinking of how to remove or ignore those non alphabet characters on the String.

I would do something like this:
public static String justAlphaChars(String text) {
StringBuilder builder = new StringBuilder();
for (char ch : text.toCharArray())
if (Character.isAlphabetic(ch))
builder.append(ch);
return builder.toString();
}
I just tested the method above with your example, as shown below, and it worked: it returned true.
System.out.println( isPalindrome( justAlphaChars ( "Are we not pure? No sir! Panama’s moody"
+ "Noriega brags. It is garbage! Irony dooms a man; a prisoner up to new era." ) ) );

OOPS. Java, not Python.
You can still use list-like access in Java, just a bit more work.
char[] letters = text.toCharArray();
int nletters = 0;
for (int i = 0; i < letters.length; ++i) {
    if (Character.isLetter(letters[i])) {
        // Compact the letters to the front of the array, normalised to upper case.
        letters[nletters++] = Character.toUpperCase(letters[i]);
    }
}
// print out letters in array:
System.out.print("letters only: ");
for (int i = 0; i < nletters; ++i) {
    System.out.print(letters[i]);
}
System.out.println();
Now use only the first nletters positions of the letters array, since those positions hold the uppercased letters from the input. The example above simply prints them.
Now write a loop to compare letters[0] with letters[nletters-1], letters[1] with letters[nletters-2], and so on. If all pairs are equal, you have a palindrome.
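The comparison loop described above could look like the following sketch, which combines the compaction step with a two-index scan from both ends. The class name is an assumption for illustration:

```java
class PalindromeCheck {
    static boolean isPalindrome(String text) {
        char[] letters = text.toCharArray();
        int nletters = 0;
        // Keep only letters, upper-cased, in the first nletters positions.
        for (int i = 0; i < letters.length; ++i) {
            if (Character.isLetter(letters[i])) {
                letters[nletters++] = Character.toUpperCase(letters[i]);
            }
        }
        // Compare letters[0] with letters[nletters-1], letters[1] with letters[nletters-2], ...
        for (int lo = 0, hi = nletters - 1; lo < hi; ++lo, --hi) {
            if (letters[lo] != letters[hi]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPalindrome("A man, a plan, a canal: Panama")); // true
        System.out.println(isPalindrome("Hello"));                          // false
    }
}
```

The scan stops as soon as the two indices meet, so each pair of letters is compared only once.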

String removeNonAlpha(final String word) {
final StringBuilder result = new StringBuilder();
for (final char ch : word.toCharArray()) {
final int ascii = ch;
if (((ascii >= 65) && (ascii <= 90)) || ((ascii >= 97) && (ascii <= 122))) {
result.append(ch);
}
}
return result.toString();
}
Explanation:
The method returns a string containing only the characters A-Z and a-z.
It simply checks the ASCII code of each char: 65-90 are the uppercase letters and 97-122 are the lowercase ones.
Please refer to an ASCII code table.
