StringBuilder#appendCodePoint(int) behaves unexpectedly - java

java.lang.StringBuilder's appendCodePoint(...) method, to me, behaves in an unexpected manner.
For Unicode code points above Character.MAX_VALUE (which need 4 bytes to encode in UTF-8, which is my Eclipse workspace setting), it behaves strangely.
I append a String's Unicode code points one by one to a StringBuilder, but its output looks different in the end.
I suspect that a call to Character.toSurrogates(codePoint, value, count) in AbstractStringBuilder#appendCodePoint(...) causes this, but I don't know how to work around it.
My code:
// returns random string in range of unicode code points 0x2F800 to 0x2FA1F
// e.g. 槪𥥼報悔𦖨嘆汧犕尢𦔣洴真硎尢趼犀㠯弢卿𢛔芋玥峀䔫䩶莭型築𡷦𩐊
String s = getRandomChineseJapaneseKoreanStringCompatibilitySupplementOfMaxLength(length);
System.out.println(s);
StringBuilder sb = new StringBuilder();
for (int i = 0; i < getCodePointCount(s); i++) {
sb.appendCodePoint(s.codePointAt(i));
}
// prints some of the CJK characters, but between them there is a '?'
// e.g. 槪?𥥼?報?悔?𦖨?嘆?汧?犕?尢?𦔣?洴?真?硎?尢?趼?
System.out.println(sb.toString());
// returns random string in range of unicode code points 0x20000 to 0x2A6DF
// e.g. 𤸥𤈍𪉷𪉔𤑺𡹋𠋴𨸁𦧖𣯠𨚾𣥷𪂶𦄃𧊈𤧘𢙕𪚋𤧒𥩛𧆞𨕌𣸑𡚊𥽚𡛳𣐸𩆟𩣞𥑡
s = getRandomChineseJapaneseKoreanStringExtensionBOfMaxLength(length);
// prints the CJK characters correctly
System.out.println(s);
sb = new StringBuilder();
for (int i = 0; i < getCodePointCount(s); i++) {
sb.appendCodePoint(s.codePointAt(i));
}
// prints some of the CJK characters, but between them there is a '?'
// e.g. 𤸥?𤈍?𪉷?𪉔?𤑺?𡹋?𠋴?𨸁?𦧖?𣯠?𨚾?𣥷?𪂶?𦄃?𧊈?
System.out.println(sb.toString());
With:
public static int getCodePointCount(String s) {
return s.codePointCount(0, s.length());
}
public static String getRandomChineseJapaneseKoreanStringExtensionBOfMaxLength(int length) {
return getRandomStringOfMaxLengthInRange(length, 0x20000, 0x2A6DF);
}
public static String getRandomChineseJapaneseKoreanStringCompatibilitySupplementOfMaxLength(int length) {
return getRandomStringOfMaxLengthInRange(length, 0x2F800, 0x2FA1F);
}
private static String getRandomStringOfMaxLengthInRange(int length, int from, int to) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < length; i++) {
// try to find a valid character MAX_TRIES times
for (int j = 0; j < MAX_TRIES; j++) {
int unicodeInt = from + random.nextInt(to - from);
if (Character.isValidCodePoint(unicodeInt) &&
(Character.isLetter(unicodeInt) || Character.isDigit(unicodeInt) ||
Character.isWhitespace(unicodeInt))) {
sb.appendCodePoint(unicodeInt);
break;
}
}
}
return new String(sb.toString().getBytes(), "UTF-8");
}

You're iterating over the code points incorrectly. You should use the strategy presented by Jonathan Feinberg:
final int length = s.length();
for (int offset = 0; offset < length; ) {
final int codepoint = s.codePointAt(offset);
// do something with the codepoint
offset += Character.charCount(codepoint);
}
or since Java 8
s.codePoints().forEach(/* do something */);
Note the Javadoc of String#codePointAt(int)
Returns the character (Unicode code point) at the specified index. The
index refers to char values (Unicode code units) and ranges from 0 to
length()- 1.
You were iterating i from 0 to codePointCount and passing it to codePointAt, which expects a char index, not a code point index. If the char at that index is not the start of a high-low surrogate pair, it's returned alone. In that case, your index should only increase by 1. Otherwise, it should be increased by 2 (Character#charCount(int) deals with this), as you're getting the code point corresponding to the pair.

Change your loops from this:
for (int i = 0; i < getCodePointCount(s); i++) {
to this:
for (int i = 0; i < s.length(); i = s.offsetByCodePoints(i, 1)) {
In Java, a char is a single UTF-16 code unit. Supplementary code points take up two chars in a String.
But your loop advances one char at a time, so i is a char index. This means that you are reading each supplementary code point you reach twice: the first time, you are reading both of its UTF-16 surrogate chars; the second time, you are reading and appending just the low surrogate char. And because the loop stops after codePointCount iterations, it only gets through the first half of the string.
Consider a string which contains only one codepoint, 0x2f8eb. A Java String representing that codepoint would actually contain this:
"\ud87e\udceb"
If you loop through each individual char index, then your loop would effectively do this:
sb.appendCodePoint(0x2f8eb); // codepoint found at index 0
sb.appendCodePoint(0xdceb); // codepoint found at index 1
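To make the difference concrete, here is a minimal sketch that copies a two-code-point string both ways. The class and method names (CodePointCopy, copyByCharIndex, copyByCodePoint) are made up for illustration; only the String, StringBuilder and Character calls come from the JDK.

```java
class CodePointCopy {
    // Mirrors the loop from the question: i is used as a char index,
    // but the loop bound is the number of code points.
    static String copyByCharIndex(String s) {
        StringBuilder sb = new StringBuilder();
        int count = s.codePointCount(0, s.length());
        for (int i = 0; i < count; i++) {
            sb.appendCodePoint(s.codePointAt(i));
        }
        return sb.toString();
    }

    // Advances the char offset by 1 or 2, depending on the code point just read.
    static String copyByCodePoint(String s) {
        StringBuilder sb = new StringBuilder();
        for (int offset = 0; offset < s.length(); ) {
            int codePoint = s.codePointAt(offset);
            sb.appendCodePoint(codePoint);
            offset += Character.charCount(codePoint);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two supplementary code points, i.e. four chars.
        String s = new StringBuilder()
                .appendCodePoint(0x2F8EB)
                .appendCodePoint(0x20000)
                .toString();

        String bad = copyByCharIndex(s);
        String good = copyByCodePoint(s);

        System.out.println(good.equals(s)); // true
        System.out.println(bad.length());   // 3: one full pair plus a lone low surrogate
        System.out.println(Integer.toHexString(bad.charAt(2))); // dceb
    }
}
```

The lone low surrogate at the end of the faulty copy has no valid encoding on its own, which is most likely why it shows up as '?' on the console.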

Not able to understand the code to Count Duplicates in a string?

This program finds the count of duplicates in a string.
Example 1:
Input:
"abbdde"
Output:
2
Explanation:
"b" and "d" are the two duplicates.
Example 2:
Input:
"eefggghii22"
Output:
3
Explanation:
duplicates are "e", "g", and "2".
Help me with this code.
public class CountingDuplicates {
public static int duplicateCount(String str1) {
// Write your code here
int c = 0;
str1 = str1.toLowerCase();
final int MAX_CHARS = 256;
int ctr[] = new int[MAX_CHARS];
countCharacters(str1, ctr);
for (int i = 0; i < MAX_CHARS; i++) {
if(ctr[i] > 1) {
// System.out.printf("%c appears %d times\n", i, ctr[i]);
c = ctr[i];
}
}
return c;
}
static void countCharacters(String str1, int[] ctr)
{
for (int i = 0; i < str1.length(); i++)
ctr[str1.charAt(i)]++;
}
}
You need to keep a separate counter and increment it once for every character whose tally exceeds 1. Your code instead overwrites c with the tally of the last such character (c = ctr[i]).
Return the counter to get the number of duplicated characters.
I've added comments to make the code easier to follow.
public class CountingDuplicates {
public static int duplicateCount(String str1) {
// Initialised integer to count the duplicates
int count = 0;
// Converting the string to lowercase so that upper- and lowercase letters count as the same character
str1 = str1.toLowerCase();
// An 8-bit character set has at most 256 characters (ASCII itself only defines 128),
// so an array of size 256 holds one tally per possible character value.
final int MAX_CHARS = 256;
int ctr[] = new int[MAX_CHARS];
countCharacters(str1, ctr);
for (int i = 0; i < MAX_CHARS; i++) {
if(ctr[i] > 1) {
// System.out.printf("%c appears %d times\n", i, ctr[i]);
count = count + 1;
}
}
return count;
}
static void countCharacters(String str1, int[] ctr)
{
for (int i = 0; i < str1.length(); i++)
ctr[str1.charAt(i)]++;
}
}
In short, countCharacters tallies how many times each character appears in the string str1 and stores the tallies in the ctr array.
How? ctr is an int array of length 256, so it has one slot for each index from 0 to 255. str1 is the input string, and charAt(i) returns the character at index i, much like reading an element of an array.
Assuming your input only ever contains ASCII characters, each character has a numeric value between 0 and 127 (for example, 'a' is 97), which fits inside the array. A character with a value of 256 or more would throw an ArrayIndexOutOfBoundsException. The ++ operator after a variable adds 1 to it, i.e. c++ means c = c + 1.
Now look at the loop body, ctr[str1.charAt(i)]++;. The loop runs from index 0 to the last index of str1. If the first character of str1 is 'a', str1.charAt(0) returns 'a'; because a char used as an array index is automatically widened to its int value, the statement becomes ctr[97]++; for that iteration. It increments the element at index 97 (initially 0) by 1, so the value is now 1.
In this way the loop only increments the slots whose index matches the value of a character in the string, thus counting how many times each character occurs.
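If the input is not guaranteed to stay within the 256-slot array, the same tallying idea can be sketched with a map instead. This is only an illustration under that assumption, not the code from the question; the class name DuplicateCounter is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class DuplicateCounter {
    // Counts how many distinct characters occur more than once, ignoring case.
    static int duplicateCount(String text) {
        Map<Integer, Integer> tallies = new HashMap<>();
        // Tally every code point; merge() inserts 1 or adds 1 to the existing tally.
        text.toLowerCase().codePoints()
                .forEach(cp -> tallies.merge(cp, 1, Integer::sum));

        int duplicates = 0;
        for (int tally : tallies.values()) {
            if (tally > 1) {
                duplicates++; // one increment per duplicated character
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        System.out.println(duplicateCount("abbdde")); // 2
        System.out.println(duplicateCount("aA11"));   // 2
        System.out.println(duplicateCount("abc"));    // 0
    }
}
```

A map trades a little speed for not having to know the range of character values in advance.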

How to create a char[] using data from a boolean array?

I have a boolean array and I am trying to make a corresponding char array, so that each true corresponds to a 1 in the new array and each false to a 0. This is my code, but the new array seems to be empty because nothing prints; the boolean nums[] prints fine.
char[] digits = new char[n];
for (int i = 0; i < n; i++) {
if (nums[i]) {
digits[i] = 1;
}
else if (!nums[i]) {
digits[i] = 0;
}
}
for (int k = 0; k < n; k++) {
System.out.print (digits[k]);
}
Your problem is that you don't have quotes surrounding the 1 and 0.
for (int i = 0; i < n; i++) {
if (nums[i]) {
digits[i] = '1';
}
else {
digits[i] = '0';
}
}
Without the quotes, 1 and 0 are int constants that are implicitly converted to the chars with those numeric values. 0 is the null character (NUL), and 1 is the start-of-heading control character (SOH); neither is printable. Java chars are UTF-16 code units (16 bits long). The characters '0' and '1' have the values 48 and 49 respectively (in decimal).
EDIT: Rather than an ASCII table, look at the Unicode character set. Unicode is a superset of ASCII, so it'll probably be more useful.
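A small sketch of the difference between the int constant and the char literal. The class and method names (CharLiteralDemo, toDigits) are invented for this example.

```java
class CharLiteralDemo {
    // Maps true to the character '1' and false to the character '0'.
    static char[] toDigits(boolean[] flags) {
        char[] digits = new char[flags.length];
        for (int i = 0; i < flags.length; i++) {
            digits[i] = flags[i] ? '1' : '0';
        }
        return digits;
    }

    public static void main(String[] args) {
        char fromInt = 1;       // the control character U+0001
        char fromLiteral = '1'; // the digit character U+0031

        System.out.println((int) fromInt);                    // 1
        System.out.println((int) fromLiteral);                // 49
        System.out.println(Character.isISOControl(fromInt));  // true
        System.out.println(Character.isDigit(fromLiteral));   // true

        System.out.println(new String(toDigits(new boolean[] {true, false, true}))); // 101
    }
}
```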
According to Primitive Data Types in the Language Basics lesson of trail Learning the Java Language in Oracle's Java tutorials:
The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).
Unicode value 0 (zero) is a non-printing character, as is Unicode value 1 (one). That's why you aren't seeing anything printed. Either change digits to an int array or fill it with character literals such as '0' or '1'.
If you use an int array, the following code will suffice:
int[] digits = new int[n];
for (int i=0; i<n; i++) {
if (nums[i]) {
digits[i] = 1;
}
}
for (int k=0; k<n; k++) {
System.out.print (digits[k]);
}
Note that an int array is implicitly initialized so that all of its elements are initially 0 (zero).
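If you instead want to pack every 16 booleans into a single char, treating them as the bits of a binary number, you can do something like this:
char[] myChars = new char[nums.length / 16];
for (int i = 0; i < nums.length / 16; i++) {
String myChar = "";
for (int j = 0; j < 16; j++) {
if (nums[i * 16 + j])
myChar += "1";
else
myChar += "0";
}
// parseInt returns an int, so an explicit cast to char is required
myChars[i] = (char) Integer.parseInt(myChar, 2);
}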
You can convert like this:
public static void main(String[] args) {
int n = 5;
boolean[] nums = { true, false, true, false, true };
char[] digits = new char[n];
for (int i = 0; i < n; i++) {
digits[i] = nums[i] ? '1' : '0';
}
}

Unsure how to implement for loop

Hello, I am having trouble implementing this function.
Function:
Decompress the String s. Each character in the string is preceded by a number. The number tells you how many times to repeat the character. Return a new string.
"3d1v0m" becomes "dddv"
I realize my code is incorrect thus far. I am unsure on how to fix it.
My code thus far is :
int start = 0;
for(int j = 0; j < s.length(); j++){
if (s.isDigit(charAt(s.indexOf(j)) == true){
Integer.parseInt(s.substring(0, s.index(j))
Assuming the input is in the correct format (a single digit followed by a single character, repeated), the following is a simple solution using a for loop. It isn't the most elegant code; you could write something more concise and functional in style using Commons Lang or Guava.
StringBuilder builder = new StringBuilder();
for (int i = 0; i < s.length(); i += 2) {
final int n = Character.getNumericValue(s.charAt(i));
for (int j = 0; j < n; j++) {
builder.append(s.charAt(i + 1));
}
}
System.out.println(builder.toString());
Here is a solution that uses a regex:
String query = "3d1v0m";
StringBuilder result = new StringBuilder();
String[] digitsA = query.split("\\D+");
String[] letterA = query.split("[0-9]+");
for (int arrIndex = 0; arrIndex < digitsA.length; arrIndex++)
{
for (int count = 0; count < Integer.parseInt(digitsA[arrIndex]); count++)
{
result.append(letterA[arrIndex + 1]);
}
}
System.out.println(result);
Output
dddv
This solution also scales to numbers with more than one digit and to groups of more than one letter.
For example:
Input
3vs1a10m
Output
vsvsvsammmmmmmmmm
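The same idea can also be sketched with a single pattern and a Matcher rather than two split calls, which keeps each count next to the letters it applies to. This is an alternative sketch, not the code from the answer above; the class name Decompressor and the pattern constant RUN are made up here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Decompressor {
    // One or more digits (the repeat count) followed by one or more non-digits.
    private static final Pattern RUN = Pattern.compile("(\\d+)(\\D+)");

    static String decompress(String compressed) {
        StringBuilder out = new StringBuilder();
        Matcher matcher = RUN.matcher(compressed);
        while (matcher.find()) {
            int count = Integer.parseInt(matcher.group(1));
            String letters = matcher.group(2);
            for (int i = 0; i < count; i++) {
                out.append(letters);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decompress("3d1v0m"));   // dddv
        System.out.println(decompress("3vs1a10m")); // vs three times, a once, m ten times
    }
}
```

Text that does not match the pattern (for example, letters before the first number) is silently skipped in this sketch; a stricter version would need to validate the input first.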
Nami's answer is terse and good, but I'm adding my solution for variety. It is built as a static method and uses a while loop inside the for loop rather than a nested for loop. It requires that the input string has an even number of characters and that the first character of every pair is a digit.
public static String decompress_string(String compressed_string)
{
String decompressed_string = "";
for(int i=0; i<compressed_string.length(); i = i+2) //Skip by 2 characters in the compressed string
{
if(compressed_string.substring(i, i+1).matches("\\d")) //Check for a number at odd positions
{
int reps = Integer.parseInt(compressed_string.substring(i, i+1)); //Take the first number
String character = compressed_string.substring(i+1, i+2); //Take the next character in sequence
int count = 1;
while(count<=reps)//check if at least one repetition is required
{
decompressed_string = decompressed_string + character; //append the character to end of string
count++;
};
}
else
{
//In case the first character of the code pair is not a number
//Or when the string has uneven number of characters
return("Incorrect compressed string!!");
}
}
return decompressed_string;
}

Vowel check - array out of bounds error

I'm trying to write a program which accepts a word in lowercase, converts it to uppercase and changes each vowel in the word to the next letter of the alphabet. So far, I've done this:
import java.util.*;
class prg11
{
public static void main(String args[])
{
Scanner sc = new Scanner(System.in);
System.out.println("Enter a word in lowercase.");
String word = sc.next();
word = word.toUpperCase();
int length = word.length();
char ch[] = new char[length+1];
for (int i = 0; i<=length; i++)
{
ch[i] = word.charAt(i);
if("aeiou".indexOf(ch[i]) == 0)
{
ch[i]+=1;
}
}
String str = new String(ch);
System.out.println(str);
}
}
The code compiles fine. But when I run the program and enter a word, say 'hey', the word is only printed in uppercase. The vowels in it (in this case, 'e') do not get changed to the next letter.
How do I resolve this? TIA.
You need to change three places in the code from the question.
word = word.toUpperCase();
int length = word.length();
// yours: char ch[] = new char[length + 1];
// resulting array needs to be as same length as the original word
// if not, there will be array index out of bound issues
char ch[] = new char[length];
// yours: for (int i = 0; i<=length; i++)
// need to go through valid indexes of the array - 0 to length-1
for (int i = 0; i < length; i++) {
ch[i] = word.charAt(i);
// yours: if ("aeiou".indexOf(ch[i]) == 0) {
// two problems when used like that
// 1. indexOf() methods are all case-sensitive
// since you've uppercased your word, need to use AEIOU
// 2. indexOf() returns the index of the given character
// which would be >= 0 when that character exist inside the string
// or -1 if it does not exist
// so need to see if the returned value represents any valid index, not just 0
if ("AEIOU".indexOf(ch[i]) >= 0) {
ch[i] += 1;
}
}
Here's a slightly more concise version. Note the changes I've made.
String word = sc.next().toUpperCase();
char ch[] = word.toCharArray();
for (int i = 0; i < ch.length; i++) {
if ("AEIOU".indexOf(ch[i]) >= 0) {
ch[i] += 1;
}
}
Javadoc of indexOf():
public int indexOf(int ch)
Returns the index within this string of the first occurrence of the specified character.
If a character with value ch occurs in the character sequence represented by this String object,
then the index (in Unicode code units) of the first such occurrence is returned.
For values of ch in the range from 0 to 0xFFFF (inclusive), this is the smallest value k such that:
this.charAt(k) == ch
is true. For other values of ch, it is the smallest value k such that:
this.codePointAt(k) == ch
is true. In either case, if no such character occurs in this string, then -1 is returned.
Parameters:
ch - a character (Unicode code point).
Returns:
the index of the first occurrence of the character in the character sequence represented by this object,
or -1 if the character does not occur.
I think this should do it, let me know if it doesn't
import java.util.Scanner;
public class prg11 {
public static void main(String[] args) {
Scanner sc = new Scanner(System.in);
System.out.println("Enter a word.");
String word = sc.next();
sc.close();
word = word.toUpperCase();
int length = word.length();
// same length as the word; an extra slot would leave a trailing NUL char in the output
char ch[] = new char[length];
for (int i = 0; i<length; i++) {
ch[i] = word.charAt(i);
if("AEIOU".indexOf(ch[i]) > -1) {
ch[i]+=1;
}
}
String str = new String(ch);
System.out.println(str);
}
}
Let me know if it works.
Happy coding ;) -Charlie
Use:
for (int i = 0; i<length; i++)
instead as the last index is length-1.
Use for (int i = 0; i<=length-1; i++) instead of for (int i = 0; i<=length; i++), and if("AEIOU".indexOf(ch[i]) != -1) instead of if("aeiou".indexOf(ch[i]) == 0).
Reasons:
1. Array indexes start at 0, so the last valid index is length-1.
2. You have already converted your string to uppercase, so check against "AEIOU".
3. indexOf returns -1 for every non-vowel character, so use if("AEIOU".indexOf(ch[i]) != -1).
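The points above can be put together in a short, self-contained sketch. The class and method names (VowelShift, shiftVowels) are hypothetical, and the word is passed in directly instead of being read with a Scanner.

```java
class VowelShift {
    // Uppercases the word and replaces each vowel with the next letter.
    static String shiftVowels(String word) {
        char[] ch = word.toUpperCase().toCharArray();
        for (int i = 0; i < ch.length; i++) {
            // indexOf returns the position (0..4) for a vowel and -1 otherwise
            if ("AEIOU".indexOf(ch[i]) >= 0) {
                ch[i]++;
            }
        }
        return new String(ch);
    }

    public static void main(String[] args) {
        System.out.println("AEIOU".indexOf('A')); // 0
        System.out.println("AEIOU".indexOf('E')); // 1
        System.out.println("AEIOU".indexOf('H')); // -1
        System.out.println(shiftVowels("hey"));   // HFY
    }
}
```

The first three lines of output show why comparing against 0 only ever matched 'A'.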
"aeiou".indexOf(ch[i]) == 0 will only match 'a' characters (since that is the character at index 0). You should be looking for any index that is greater than -1. Additionally, since you've already converted the string to uppercase, you should be checking against "AEIOU" instead of "aeiou".

Why is my for loop not working?

This is my first question on this site, so I'm not sure how to do this, but my question is as follows:
This is just a small piece of a program with multiple methods.
I need to print the ASCII codes of all the characters in a String (input from the user). I am trying to use a for loop which reads the first character, prints its ASCII code, then reads the next one, and so on. But at the moment it's just printing the first character's ASCII code a few times. Obviously there's something wrong with my for loop, but I've been trying to figure it out and I really can't find it.
static String zin(String zin) {
int length = zin.length();
char letter = zin.charAt(0);
int ascii = (int) letter;
for (int i = 0; i < zin.length(); i++ ) {
System.out.println((int) ascii);
}
return zin;
}
The reason is that you never reassign ascii inside the loop, so it always holds the code of the first character. Try this:
static String zin(String zin) {
int i = 0;
int length = zin.length();
for ( i = 0; i < zin.length(); i++ ) {
int ascii = (int)zin.charAt(i);
System.out.println(ascii);
}
return zin;
}
The problem in your code is that, although you have a for loop, you are not using it to iterate through the string. You only ever read the first char of the string. Use this instead:
static String zin(String zin) {
for (int i = 0; i < zin.length(); i++) {
System.out.println((int) zin.charAt(i));
}
return zin;
}
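As an alternative to indexing with charAt, String#chars() (available since Java 8) yields the numeric value of each char directly. A minimal sketch; the class and method names (CharCodes, codesOf) are made up for this example.

```java
class CharCodes {
    // Returns the numeric value of every char in the text, in order.
    static int[] codesOf(String text) {
        return text.chars().toArray();
    }

    public static void main(String[] args) {
        // Prints one code per line: 72, 105, 33
        for (int code : codesOf("Hi!")) {
            System.out.println(code);
        }
    }
}
```

Values above 127 are UTF-16 code unit values rather than ASCII codes, so this only matches an ASCII table for plain ASCII input.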
