I am not able to find a method in Java or Scala that takes a substring of a string containing non-ASCII characters based on the byte length reported by getBytes.
val string = "achâth33Franklin"
string.length
Int = 16
string.getBytes.length
Int = 17
string.substring(0,7)
String = achâth3
I need a method that returns achâth, because the string contains a non-ASCII character whose length is 2 bytes:
val test = "â"
test.getBytes.length
res26: Int = 2
To give more perspective on the problem:
The length of the field is constant (7 bytes) and it is supposed to always contain ASCII values. Sometimes they send non-ASCII values in the string.
When that happens, substring(0,7) pulls part of the next field's value into the current field.
Explanation for #VGR
scala> val string = "achâth33Franklin"
string: String = achâth33Franklin
scala> new String(string.getBytes,0,7)
res30: String = achâth
scala> string.substring(0,7)
res31: String = achâth3
One way to do that is to combine the getBytes() method with the String(byte[], int, int) constructor.
So your method would look like this:
String string = "achâth33Franklin";
string.substring(0,7); //achâth3
new String(string.getBytes(), 0, 7); //achâth (when the default charset is UTF-8)
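That constructor takes an array of bytes, an offset into the array, and the number of bytes to use. So new String(string.getBytes(), a, b - a) follows the same logic as string.substring(a, b), but per byte instead of per character. Note that getBytes() without an argument uses the platform default charset, so the byte counts above assume UTF-8.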
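Slicing raw bytes can cut a multi-byte character in half when the limit falls in the middle of one. A minimal sketch of a safer variant follows, assuming the field is UTF-8 encoded and the string is well-formed UTF-16; the class and method names (ByteSubstring, prefixByUtf8Bytes) are made up for illustration.

```java
class ByteSubstring {
    // Hypothetical helper: returns the longest prefix of s whose UTF-8
    // encoding fits into maxBytes, never splitting a character.
    static String prefixByUtf8Bytes(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            // Number of bytes this code point occupies in UTF-8
            int size = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + size > maxBytes) {
                break;
            }
            bytes += size;
            i += Character.charCount(cp); // 1 or 2 chars per code point
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        // \u00e2 is the letter a with circumflex from the question
        String s = "ach\u00e2th33Franklin";
        System.out.println(prefixByUtf8Bytes(s, 7));
        System.out.println(prefixByUtf8Bytes(s, 4));
    }
}
```

With a limit of 7 this gives the same six-character result as the byte constructor; with a limit of 4 it stops before the two-byte character instead of producing a broken one.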
I received a Unicode string that contains an emoji code, for example "U+1F44F" (from the emoji table: http://apps.timwhitlock.info/emoji/tables/unicode).
I want to convert this string to an Integer. How can I do that?
I tried this, but it crashes:
int hex = Integer.parseInt(unicodeStr, 16);
Thanks guys!
The comment from #flakes gives the correct answer. The U+ prefix only indicates that the hex number that follows is a Unicode code point. The value you want to convert into an Integer is the code point, so you have to drop the first two characters with .substring(2).
You will obtain the following code:
int hex = Integer.parseInt(unicodeStr.substring(2), 16);
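As a rough sketch of how that parsing could be wrapped with some validation and turned back into a String, assuming the input always uses the "U+" notation; the names EmojiCodePoint and parseCodePoint are hypothetical.

```java
class EmojiCodePoint {
    // Hypothetical helper: parses "U+1F44F" style notation into a code point.
    static int parseCodePoint(String notation) {
        // Accept "U+" or "u+" as the prefix
        if (!notation.regionMatches(true, 0, "U+", 0, 2)) {
            throw new IllegalArgumentException("Expected U+XXXX, got: " + notation);
        }
        int cp = Integer.parseInt(notation.substring(2), 16);
        if (!Character.isValidCodePoint(cp)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + notation);
        }
        return cp;
    }

    public static void main(String[] args) {
        int cp = parseCodePoint("U+1F44F");
        // Character.toChars yields one or two chars depending on the code point
        String emoji = new String(Character.toChars(cp));
        System.out.println(cp);             // decimal value of 0x1F44F
        System.out.println(emoji.length()); // chars used, not code points
    }
}
```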
Unicode numbers its "characters" (code points) up to a 3-byte range, such as U+1F44F.
Java String has a constructor with code points.
int[] codepoints = { 0x1F44F };
String s = new String(codepoints, 0, codepoints.length);
public static String fromCodepoints(int... codepoints) {
return new String(codepoints, 0, codepoints.length);
}
s = fromCodepoints(0x1F44F, 0x102);
A Java String holds Unicode internally as an array of chars, each char (2 bytes) being a UTF-16 code unit. In the lower ranges a single char can be a code point, so U+0102 could be written as "\u0102", containing the char '\u0102'.
Note that emoji must be representable in the font.
Font font = ...
if (!font.canDisplay(0x1F44F)) {
...
}
I can easily create a unicode character and print it with the following lines of code
String uniChar = Character.toString((char)0000);
System.out.println(uniChar);
However, now I want to retrieve the number above, add 3, and print out the new unicode character that the numbers 0003 corresponds to. Is there a way for me to retrieve the ACTUAL string of unichar? As in "\u0000"? That way I could substring just the "0000", convert it to an int, add 3, and reverse the entire process.
I think you're looking for String#codePointAt:
Returns the character (Unicode code point) at the specified index. The index refers to char values (Unicode code units) and ranges from 0 to length()- 1.
If the char value specified at the given index is in the high-surrogate range, the following index is less than the length of this String, and the char value at the following index is in the low-surrogate range, then the supplementary code point corresponding to this surrogate pair is returned. Otherwise, the char value at the given index is returned.
For instance:
// String containing smiling face with smiling eyes emoji
String str = "😊";
// Get the code point
int cp = str.codePointAt(0);
// Show it
System.out.println(str + ", code point = U+" + toHex(cp));
// Increase it
++cp;
// Get the updated string (from an array of code points)
String updated = new String(new int[] { cp }, 0, 1);
// Show it
System.out.println(updated + ", code point = U+" + toHex(cp));
(toHex is just return Integer.toString(n, 16).toUpperCase();)
That outputs:
😊, code point = U+1F60A
😋, code point = U+1F60B
This code will work in both cases: for code points from the Unicode BMP and for code points from the supplementary planes, which take 4 bytes to encode in UTF-8. Such a code point requires 2 Java chars to be stored, so in that case string.length() is 2.
// array will contain one or two characters
char[] chars = Character.toChars(codePoint);
// string.length will be 1 or 2
String str = new String(chars);
Unicode is a numbering of "characters" (code points) up to a 3-byte int range.
The UTF-16 encoding uses a sequence of byte pairs, and a Java char is such a byte pair. The (int) cast of a char is imperfect and covers only a part of Unicode. The correct way to convert a code point to possibly more than one char:
int codePoint = 0x263B;
char[] chars = Character.toChars(codePoint);
To work with Unicode code points, one can do:
int[] codePoints = {0x2639, 0x263a, 0x263b};
String s = new String(codePoints, 0, codePoints.length);
codePoints[0] += 2;
You could use an int array of one code point.
In Java 8 one can get an IntStream of code points:
s.codePoints().forEach(cp -> {
System.out.printf("U+%X = %s%n", cp, Character.getName(cp));
});
I want to hash a word into a fixed-width hash value, say 64-bit or 32-bit (binary).
I used the following code
long murmur_hash= MurmurHash.hash64(word);
Then murmur_hash value is converted into binary by the following function
public static String intToBinary (int n, int numOfBits) {
String binary = "";
for(int i = 0; i < numOfBits; ++i) {
// test the lowest bit first, then shift it out
if((n & 1) == 0)
{
binary="0"+binary;
}
else
binary="1"+binary;
n >>>= 1;
}
return binary;
}
Is there any direct hash method to convert into binary?
Just use this
Integer.toBinaryString(int i)
If you want to convert into a fixed binary string, that is, always get a 64-character long string with zero padding, then you have a couple of options. If you have Apache's StringUtils, you can use:
StringUtils.leftPad( Long.toBinaryString(murmurHash), Long.SIZE, "0" );
If you don't, you can write a padding method yourself:
public static String paddedBinaryFromLong( long val ) {
StringBuilder sb = new StringBuilder( Long.toBinaryString(val));
char[] zeros = new char[Long.SIZE - sb.length()];
Arrays.fill(zeros, '0');
sb.insert(0, zeros);
return sb.toString();
}
This method starts by using the Long.toBinaryString(long) method, which conveniently does the bit conversion for you. The only thing it doesn't do is pad on the left if the value is shorter than 64 characters.
The next step is to create an array of 0 characters with the missing zeros needed to pad to the left.
Finally, we insert that array of zeros at the beginning of our StringBuilder, and we have a 64-character, zero-padded bit string.
Note: there is a difference between using Long.toBinaryString(long) and Long.toString(long,radix). The difference is in negative numbers. In the first, you'll get the full, two's complement value of the number. In the second, you'll get the number with a minus sign:
System.out.println(Long.toString(-15L,2));
result:
-1111
System.out.println(Long.toBinaryString(-15L));
result:
1111111111111111111111111111111111111111111111111111111111110001
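If widths other than 64 are needed (the question also mentions 32 bits), one possible sketch reads the bits directly instead of padding the output of Long.toBinaryString. The names FixedWidthBinary and toFixedWidthBinary are made up for illustration, and values wider than the requested width are simply truncated to their low bits.

```java
class FixedWidthBinary {
    // Hypothetical helper: renders the lowest `bits` bits of value,
    // most significant first, so the result always has exactly `bits` chars.
    static String toFixedWidthBinary(long value, int bits) {
        if (bits < 1 || bits > Long.SIZE) {
            throw new IllegalArgumentException("bits must be between 1 and 64");
        }
        StringBuilder sb = new StringBuilder(bits);
        for (int b = bits - 1; b >= 0; b--) {
            // Unsigned shift keeps negative values in two's complement form
            sb.append(((value >>> b) & 1L) == 0 ? '0' : '1');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toFixedWidthBinary(5L, 8));
        System.out.println(toFixedWidthBinary(-15L, 64));
    }
}
```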
Another way is to use
Integer.toString(i, radix)
With it you get the string representation of the first argument i in the radix specified by the second argument (binary: 2, octal: 8, decimal: 10, hex: 16).
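A small illustration of the radix argument, with a made-up helper (inCommonRadixes) that collects the four representations mentioned above. Note that, unlike Integer.toBinaryString, this method prints negative numbers with a minus sign and does no zero padding.

```java
class RadixDemo {
    // Hypothetical helper: the same int in binary, octal, decimal and hex.
    static String[] inCommonRadixes(int i) {
        return new String[] {
            Integer.toString(i, 2),
            Integer.toString(i, 8),
            Integer.toString(i, 10),
            Integer.toString(i, 16)
        };
    }

    public static void main(String[] args) {
        for (String s : inCommonRadixes(10)) {
            System.out.println(s);
        }
    }
}
```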
I have a String[] with byte values
String[] s = {"110","101","100","11","10","1","0"};
Looping through s, I want to get int values out of it.
I am currently using this
Byte b = new Byte(s[0]); // s[0] = 110
int result = b.intValue(); // b.intValue() is returning 110 instead of 6
From that, I am trying to get the results {6, 5, 4, 3, 2, 1, 0}
I am not sure of where to go from here. What can I do?
Thanks guys. Question answered.
You can use the overloaded Integer.parseInt(String s, int radix) method for such a conversion. This way you can just skip the Byte b = new Byte(s[0]); piece of code.
int result = Integer.parseInt(s[0], 2); // radix 2 for binary
You're using the Byte constructor which just takes a String and parses it as a decimal value. I think you actually want Byte.parseByte(String, int) which allows you to specify the radix:
for (String text : s) {
byte value = Byte.parseByte(text, 2);
// Use value
}
Note that I've used the primitive byte value (as returned by Byte.parseByte) instead of the Byte wrapper (as returned by Byte.valueOf).
Of course, you could equally use Integer.parseInt or Short.parseShort instead of Byte.parseByte. Don't forget that bytes in Java are signed, so you've only got a range of [-128, 127]. In particular, you can't parse "10000000" with the code above. If you need a range of [0, 255] you might want to use short or int instead.
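To illustrate the range point, here is a sketch that parses into an int and checks the [0, 255] range by hand. The names UnsignedBits and parseUnsignedByte are hypothetical and not part of the JDK.

```java
class UnsignedBits {
    // Hypothetical helper: parses binary digits as a value in [0, 255].
    static int parseUnsignedByte(String bits) {
        int value = Integer.parseInt(bits, 2);
        if (value < 0 || value > 255) {
            throw new NumberFormatException("Out of range for an unsigned byte: " + bits);
        }
        return value;
    }

    public static void main(String[] args) {
        // Fits in an int even though Byte.parseByte would reject it
        System.out.println(parseUnsignedByte("10000000"));
        try {
            Byte.parseByte("10000000", 2);
        } catch (NumberFormatException e) {
            System.out.println("Byte.parseByte rejected the value");
        }
    }
}
```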
You can directly convert a binary string to its decimal representation using the Integer#parseInt() method. There is no need to convert to a Byte and then to decimal.
int decimalValue = Integer.parseInt(s[0], 2);
You should be using Byte b = Byte.valueOf(s[i], 2). Right now it parses the string as a decimal value. You should use valueOf and pass 2 as the radix.
Skip the Byte step. Just parse it into an int with Integer.parseInt(String s, int radix):
int result = Integer.parseInt(s[0], 2);
The 2 specifies base 2, whereas the code you're using treats the input strings as decimal.
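Applied to the whole array from the question, the loop might look like the following sketch; the class and method names (BinaryStrings, parseAll) are made up for illustration.

```java
import java.util.Arrays;

class BinaryStrings {
    // Hypothetical helper: parses every element as a base-2 number.
    static int[] parseAll(String[] bits) {
        int[] values = new int[bits.length];
        for (int i = 0; i < bits.length; i++) {
            values[i] = Integer.parseInt(bits[i], 2);
        }
        return values;
    }

    public static void main(String[] args) {
        String[] s = {"110", "101", "100", "11", "10", "1", "0"};
        System.out.println(Arrays.toString(parseAll(s))); // [6, 5, 4, 3, 2, 1, 0]
    }
}
```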
So I have a set of base digits like "BCDFGHJKLMNPQRSTVWXZ34679"
how do I convert a value say "D6CN96W6WT" to binary string in Java?
This should work (assuming 0 and 1 for your binary digits):
// your arbitrary digits
private static final String DIGITS = "BCDFGHJKLMNPQRSTVWXZ34679";
public String base25ToBinary(String base25Number) {
long value = 0;
char[] base25Digits = base25Number.toCharArray();
for (char digit : base25Digits) {
value = value * 25 + DIGITS.indexOf(digit);
}
return Long.toString(value, 2);
}
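The version above silently miscounts when a character is not in the digit set (indexOf returns -1) and would overflow a long for inputs longer than about 13 digits. A hedged variant using BigInteger with an explicit check could look like this; CustomBase and toBinary are illustrative names only.

```java
import java.math.BigInteger;

class CustomBase {
    // The arbitrary digit set from the question; position = digit value
    private static final String DIGITS = "BCDFGHJKLMNPQRSTVWXZ34679";

    // Hypothetical helper: converts a number written with DIGITS to base 2.
    static String toBinary(String number) {
        BigInteger base = BigInteger.valueOf(DIGITS.length());
        BigInteger value = BigInteger.ZERO;
        for (char c : number.toCharArray()) {
            int digit = DIGITS.indexOf(c);
            if (digit < 0) {
                throw new IllegalArgumentException("Not a digit in this alphabet: " + c);
            }
            value = value.multiply(base).add(BigInteger.valueOf(digit));
        }
        return value.toString(2);
    }

    public static void main(String[] args) {
        System.out.println(toBinary("D6CN96W6WT"));
    }
}
```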
Off the top of my head, for base-25 strings:
Integer.toString(Integer.valueOf(base25str, 25), 2)
It's a little unclear from your question whether you're talking about actual 0-9-Z bases or a number encoding with an arbitrary list of symbols. I'm assuming the first; if it's the latter, then you're out of luck on built-ins.