Trim a string based on its byte length

I want to trim a string based on its byte length (not its character length). How can I achieve this?
Example:
String country = "日本日本日";
One Japanese character will be 3 bytes (as in UTF-8). The string above has a length of 5 characters and a byte length of 15. If I give 3, only the 1st character should be printed. If I give 5, still only the 1st character should be printed, because 2 characters take 6 bytes. If I give 6, the first 2 characters should be printed.
Edit: The byte size varies depending on the string. It may be Japanese, Japanese with numerals, or some other language.

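Divide the required byte count by 3 and take that many characters. Note that this only works when every character takes exactly 3 bytes, as in the all-Japanese example. For example:
int requiredBytes = 5;
int requiredLength = requiredBytes / 3; // integer division: 5 / 3 == 1
System.out.println(country.substring(0, requiredLength));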
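Since the question's edit says the byte size per character can vary, dividing by 3 is not enough in general. The following is a minimal sketch of one possible approach, assuming the byte length is measured in UTF-8 and the string is well-formed UTF-16. The class name ByteTrimDemo and the helper trimToUtf8Bytes are hypothetical names introduced here for illustration; they do not come from the answers above.

```java
import java.nio.charset.StandardCharsets;

public class ByteTrimDemo {

    // Hypothetical helper (not from the original answers): returns the longest
    // prefix of s whose UTF-8 encoding is at most maxBytes long.
    // Assumes s is well-formed UTF-16 (no unpaired surrogates).
    static String trimToUtf8Bytes(String s, int maxBytes) {
        int usedBytes = 0;
        int i = 0;
        while (i < s.length()) {
            int codePoint = s.codePointAt(i);
            // Number of bytes UTF-8 needs for this code point.
            int size = codePoint < 0x80 ? 1
                     : codePoint < 0x800 ? 2
                     : codePoint < 0x10000 ? 3
                     : 4;
            if (usedBytes + size > maxBytes) {
                break; // the next character would not fit completely
            }
            usedBytes += size;
            // Advance by 1 or 2 chars so a surrogate pair is never split.
            i += Character.charCount(codePoint);
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        // Same text as the question's example, written with Unicode escapes.
        String country = "\u65E5\u672C\u65E5\u672C\u65E5";
        for (int limit : new int[] {3, 5, 6}) {
            String trimmed = trimToUtf8Bytes(country, limit);
            int bytes = trimmed.getBytes(StandardCharsets.UTF_8).length;
            System.out.println(limit + " -> " + trimmed + " (" + bytes + " bytes)");
        }
    }
}
```

With the question's string, limits of 3 and 5 both keep one character and a limit of 6 keeps two, which matches the behavior the question asks for. Walking by code point rather than by char is a design choice that keeps characters outside the Basic Multilingual Plane intact.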

Related

Java find the index of string based on the utf-8 encoding index

Consider the following string:
String text="un’accogliente villa del.";
I have the begin index of the word "accogliente", which is 5, but it was precalculated based on the UTF-8 encoding.
I want the exact index of the word in the Java string, which is 3. That is, I want to get 3 as output from 5. What is the best way of calculating it?

String text = "un’accogliente villa del."; // Unicode text
text = Normalizer.normalize(text, Form.NFC); // Normalize text
byte[] bytes = text.getBytes(StandardCharsets.UTF_8); // Index 5 UTF-8; 1 byte
char[] chars = text.toCharArray(); // Index 3 UTF-16; 2 bytes (indexOf)
int[] codePoints = text.codePoints().toArray(); // Index 3 UTF-32; 4 bytes
int charIndex = text.indexOf("accogliente");
int codePointIndex = (int) text.substring(0, charIndex).codePoints().count();
int byteIndex = text.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
UTF-32 corresponds to the Unicode code points, the numbering of all symbols as U+XXXX, where there may be from 4 to 6 hexadecimal digits.
Text normalization is needed because é could be one code point, or two code points: an e followed by a zero-width combining ´.
Converting a UTF-8 byte index to a UTF-16 char index:
int charIndex = new String(text.getBytes(StandardCharsets.UTF_8),
0, byteIndex, StandardCharsets.UTF_8).length();
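The conversion above can be wrapped into a small runnable program. This is only a sketch: the class name Utf8IndexDemo and the method charIndexFromUtf8ByteIndex are hypothetical names chosen for illustration, and it assumes the byte index falls on a character boundary of the UTF-8 encoding.

```java
import java.nio.charset.StandardCharsets;

public class Utf8IndexDemo {

    // Hypothetical helper: maps an offset into the UTF-8 bytes of text to the
    // corresponding char (UTF-16) index. Assumes byteIndex lies on a
    // character boundary and is not larger than the encoded length.
    static int charIndexFromUtf8ByteIndex(String text, int byteIndex) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        // Decode only the first byteIndex bytes and count the resulting chars.
        return new String(utf8, 0, byteIndex, StandardCharsets.UTF_8).length();
    }

    public static void main(String[] args) {
        // \u2019 is the right single quotation mark, 3 bytes in UTF-8.
        String text = "un\u2019accogliente villa del.";
        int charIndex = charIndexFromUtf8ByteIndex(text, 5);
        System.out.println(charIndex);                          // 3
        System.out.println(text.startsWith("accogliente", charIndex)); // true
    }
}
```

Encoding the whole string on every call is simple but wasteful for long texts; if many indices have to be converted, encoding once and reusing the byte array would be a reasonable refinement.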

The code below returns 3. Am I missing something in your question?
String text="un’accogliente villa del.";
text.indexOf("accogliente");

Well, assuming that this start index can only point at a letter (an ASCII one), you could do:
String text = "un’accogliente villa del.";
char c = text.charAt(5);
String normalized = Normalizer.normalize(text, Normalizer.Form.NFD);
normalized = normalized.replaceAll("[^\\p{ASCII}]", " ");
Pattern p = Pattern.compile("\\p{L}*?" + c + "\\p{L}*?[$|\\s]");
Matcher m = p.matcher(normalized);
if (m.find()) {
    System.out.println(m.start(0));
}

Java String to Char Array

String c = "12345";
for (char k : c.toCharArray())
    System.out.print(k + 4);
This program outputs:
5354555657
I don't really understand why this is outputting those numbers. The only pattern I see is that it prints a "5", then takes the "1" from the string and adds 2 to make "3". Then it prints a "5", takes the "2" from the string and adds 2 to make "4", then prints a "5", and so on.

The characters in the array, when promoted to int for the addition of 4, take on their underlying Unicode value, of which the ASCII values are a subset. The digits 0-9 are represented by codes 48-57 respectively. Characters '1' through '5' are 49-53, then you add 4 and get 53-57.
After adding, cast the sum back to char so print can interpret it as a char.
System.out.print( (char) (k+4));
Output:
56789

This is because you're adding an int to a char: the char is promoted to an int, 4 is added to it, and the resulting int is printed.
You need to do:
System.out.println(Character.getNumericValue(k) + 4);
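To compare the original loop with the two fixes suggested above, here is a small sketch that collects each variant's output in a string. The class name CharArithmeticDemo and its three method names are hypothetical, chosen only for this illustration.

```java
public class CharArithmeticDemo {

    // k + 4 is an int expression, so the decimal value of the code is appended.
    static String addAsInt(String s) {
        StringBuilder out = new StringBuilder();
        for (char k : s.toCharArray()) {
            out.append(k + 4);
        }
        return out.toString();
    }

    // Casting the sum back to char appends the character with that code.
    static String addAsChar(String s) {
        StringBuilder out = new StringBuilder();
        for (char k : s.toCharArray()) {
            out.append((char) (k + 4));
        }
        return out.toString();
    }

    // Converting each digit character to its numeric value first, then adding.
    static String addAsDigit(String s) {
        StringBuilder out = new StringBuilder();
        for (char k : s.toCharArray()) {
            out.append(Character.getNumericValue(k) + 4);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(addAsInt("12345"));   // 5354555657
        System.out.println(addAsChar("12345"));  // 56789
        System.out.println(addAsDigit("12345")); // 56789
    }
}
```

The last two variants agree for this input only because every sum stays a single digit; for a digit such as '7' the char version would produce ';' while the numeric version would produce 11.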

How to print the middle three characters of a String in Java?

I am new to programming and working on a Java String assignment. The question says:
Declare a variable of type String named middle3 (put your declaration with the other declarations near the top of the program) and use an assignment statement and the substring method to assign middle3 the substring consisting of the middle three characters of phrase (the character at the middle index together with the character to the left of that and the one to the right). Add a println statement to print out the result. Save, compile, and run to test what you have done so far.
Here is what I have done:
String phrase = new String ("This is a String test.");
String middle3 = new String ("tri"); //I want to print the middle 3 characters in "phrase" which I think is "tri".
middle3 = phrase.substring (9, 11); // this only prints out 1 letter instead of 3
System.out.println ("middle3: " + middle3);
Here is my output:
Original phrase: This is a String test.
Length of the phrase: 18 characters
middle3: S
I also think that there are 18 characters in the string "phrase", but if there isn't, please let me know. Thank you in advance for all your help!

Think about how to retrieve the middle 3 characters without hardcoding the bounds of the substring() method. You can use the length() method in this regard. For example, the middle character of a string of odd length will always be at index str.length()/2, while the two middle characters of a string of even length will always be at index str.length()/2 and (str.length()/2) - 1. So the definition of the three middle characters will be up to you. But for our sakes, we'll just make the 3 middle characters at indexes (str.length()/2)-1, str.length()/2, and (str.length()/2)+1. With this information you can modify this line of code you had before,
middle3 = phrase.substring (9, 11);
to
middle3 = phrase.substring (phrase.length()/2 - 1, phrase.length()/2 + 2);
As to why the original line of code you had before was returning only one letter, it has to do with the parameters of the substring method. The first parameter is inclusive, but the second parameter isn't. Therefore, you were only retrieving characters from 9 to 10.
This is a String test.
         ^^^
         9,10,11 (11 is not included)
The three characters I pointed to are at indices 9, 10, and 11 respectively. You only retrieved the chars ' ' and 'S', which together is just " S". That explains the one-letter output before.

First off, there are 22 characters in phrase.
"This is a String test."
^^^^^^^^^^^^^^^^^^^^^^ --> 22
Note that spaces (' ') count towards the character count. Additionally, String has a method length() which will give you this number.
String phrase = "This is a String test.";
int phraseLength = phrase.length(); //22
From there, we can get the middle three characters by working off of the value phraseLength/2. The middle three characters start one before the middle position and stop one after it. However, because the substring(int, int) method treats the end index as exclusive, we should increase it by one.
String phrase = "This is a String test.";
int phraseLength = phrase.length(); //22
String middle3 = phrase.substring(phraseLength/2 - 1, phraseLength/2 + 2); // will have the middle 3 chars.
If the length of phrase is odd, this will return its middle 3 chars. If the length of phrase is even (as it is here), there are two middle characters, and this will return the 3 chars centered on the one at index phraseLength/2, which is the right-hand one of the two. (For example, in 1,2,3,4 the two middle elements are 2 and 3, and the result would be 2,3,4.)
Another note, it's both unnecessary and bad practice to write new String("asdf"). Simply use the string literal instead.
String phrase = new String ("This is a String test."); //Bad
String phrase = "This is a String test."; //Good
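Putting the pieces of this answer together, a complete program could look like the sketch below. The class name MiddleThreeDemo and the helper middleThree are hypothetical names used for illustration, and the length check is an added assumption about how short inputs should be handled.

```java
public class MiddleThreeDemo {

    // Hypothetical helper: returns the three characters centered on index
    // s.length() / 2. Rejects strings that are too short to have them.
    static String middleThree(String s) {
        if (s.length() < 3) {
            throw new IllegalArgumentException("need at least 3 characters");
        }
        int mid = s.length() / 2;
        // The end index of substring is exclusive, hence mid + 2.
        return s.substring(mid - 1, mid + 2);
    }

    public static void main(String[] args) {
        String phrase = "This is a String test.";
        System.out.println("Length of the phrase: " + phrase.length()); // 22
        System.out.println("middle3: " + middleThree(phrase));          // Str
    }
}
```

For the 22-character phrase the middle index is 11, so the call reduces to phrase.substring(10, 13).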

You have an error in the begin index and the end index passed to substring. This is the corrected code, which prints "tri":
String phrase = new String("This is a String test.");
String middle3 = phrase.substring(11, 14);
System.out.println("middle3: " + middle3);

A strange character

String str = "ิ";
System.out.println(str.length());
byte[] b = str.getBytes();
System.out.println(b[0]);
System.out.println(b[1]);
System.out.println(b[2]);
Above is my code. There is a special char in str. Its length is one, but it takes three bytes. Why? And how can I make it one byte? How do I print this char using Java code? Also, on my Android phone this char can't be deleted.

It's because the string is encoded into bytes. According to the documentation of getBytes():
Encodes this String into a sequence of bytes using the platform's default charset, storing the
result into a new byte array.
The behavior of this method when this string cannot be encoded in the default charset is unspecified.
The CharsetEncoder class should be used when more control over the encoding process is required.

It seems like your special character is encoded using UTF-8. UTF-8 uses between 1 and 4 bytes per character, depending on the range in which the character's code point falls.
You can find the algorithm on the Wikipedia page for UTF-8 and see how the size is determined.
From the Java String length() documentation:
The length is equal to the number of Unicode code units in the string.
Your character is a single UTF-16 code unit, so length() returns 1. Its UTF-8 encoding, however, takes 3 bytes, which is why getBytes() returns an array of 3 bytes rather than 1 as you expect.
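A short sketch can make the difference between string length and encoded size concrete. It assumes the character in the question is U+0E34 (as identified in another answer) and passes explicit charsets instead of relying on the platform default; the class name LengthVsBytesDemo is hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class LengthVsBytesDemo {

    public static void main(String[] args) {
        String str = "\u0E34"; // assumed to be the character from the question

        // Number of UTF-16 code units (chars) in the string.
        System.out.println("length: " + str.length());                             // 1
        // Number of Unicode code points.
        System.out.println("code points: " + str.codePointCount(0, str.length())); // 1
        // The size in bytes depends on the charset used for encoding.
        System.out.println("UTF-8 bytes: "
                + str.getBytes(StandardCharsets.UTF_8).length);                    // 3
        System.out.println("UTF-16BE bytes: "
                + str.getBytes(StandardCharsets.UTF_16BE).length);                 // 2
    }
}
```

Naming the charset explicitly also removes the dependence on the platform default that the quoted documentation warns about.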

Length is NOT bytes
You have only 1 character, but this character is 3 bytes long. A String is made of several characters, but that doesn't mean that a 1-character string would be 1 byte.
About that character "ิ"
Java uses Unicode by default. "ิ" is actually U+0E34, this value being THAI CHARACTER SARA I.
About your encoding issue
You need to change the way your application does its charset encoding and use UTF-8 encoding instead.

Besides all the other comments, here is a small snippet to visualize it.
// requires: import java.nio.charset.StandardCharsets;
String str = "ิ"; // \u0E34
System.out.println("character length: " + str.length());
System.out.print("bytes: ");
// StandardCharsets.UTF_8 avoids the checked UnsupportedEncodingException of getBytes("UTF-8")
for (byte b : str.getBytes(StandardCharsets.UTF_8)) {
    System.out.append(Integer.toHexString(b & 0xFF).toUpperCase() + " ");
}
System.out.println("");
int codePoint = Character.codePointAt(str, 0);
System.out.println("unicode name of the codepoint: " + Character.getName(codePoint));
output
character length: 1
bytes: E0 B8 B4
unicode name of the codepoint: THAI CHARACTER SARA I

How to format numbers to a hex strings?

I want to format int numbers as hex strings. System.out.println(Integer.toHexString(1)); prints 1 but I want it as 0x00000001. How do I do that?

Try this
System.out.println(String.format("0x%08X", 1));

You can use the String.format to format an integer as a hex string.
System.out.println(String.format("0x%08X", 1));
That is, pad with zeros, and make the total width 8. The 1 is converted to hex for you. The above line gives: 0x00000001 and
System.out.println(String.format("0x%08X", 234));
gives: 0x000000EA
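As a runnable illustration of the format string used above, the sketch below wraps it in a method. The class name HexFormatStringDemo and the method toHex are hypothetical names; the negative example is an addition to show that an int is formatted as its 32-bit two's complement pattern.

```java
public class HexFormatStringDemo {

    // Hypothetical helper: "0x" followed by exactly 8 uppercase hex digits.
    // %X formats the int as hex, 0 requests zero padding, 8 is the width.
    static String toHex(int value) {
        return String.format("0x%08X", value);
    }

    public static void main(String[] args) {
        System.out.println(toHex(1));   // 0x00000001
        System.out.println(toHex(234)); // 0x000000EA
        System.out.println(toHex(-1));  // 0xFFFFFFFF
    }
}
```

Using a lowercase x in the format string (0x%08x) would produce lowercase digits instead.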

From formatting syntax documented on Java's Formatter class:
Integer intObject = Integer.valueOf(1);
String s = String.format("0x%08x", intObject);
System.out.println(s);

Less verbose:
System.out.printf("0x%08x", 1); // use %08X for upper case letters

I don't know Java too intimately, but there must be a way you can pad the output from the toHexString function with a '0' to a length of 8. If "0x" will always be at the beginning, just tack on that string to the beginning.
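The manual padding idea described above could be sketched as follows. The class name ManualHexPadDemo and the method padHex are hypothetical names for illustration; the width of 8 digits is taken from the question.

```java
public class ManualHexPadDemo {

    // Hypothetical helper: pads the output of Integer.toHexString with
    // leading zeros up to 8 digits and prepends "0x".
    static String padHex(int value) {
        String hex = Integer.toHexString(value); // lowercase, no leading zeros
        StringBuilder out = new StringBuilder("0x");
        for (int i = hex.length(); i < 8; i++) {
            out.append('0');
        }
        return out.append(hex).toString();
    }

    public static void main(String[] args) {
        System.out.println(padHex(1));   // 0x00000001
        System.out.println(padHex(255)); // 0x000000ff
        System.out.println(padHex(-1));  // 0xffffffff
    }
}
```

This gives the same digits as the "0x%08x" format string, so the String.format version is usually the shorter choice; the loop mainly shows what the padding does.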

Java 17+
There is a new immutable class, java.util.HexFormat, dedicated to converting to and formatting hexadecimal strings. The easiest way to go is using HexFormat::toHexDigits, which includes leading zeroes:
String hex = "0x" + HexFormat.of().toHexDigits(1);
// 0x00000001
Beware, one has to concatenate the "0x" prefix manually, because this method ignores any defined prefix and suffix, so the following snippet doesn't work as expected (only the HexFormat::formatHex methods use them):
String hex = HexFormat.of().withPrefix("0x").toHexDigits(1);
// 00000001
The documentation of toHexDigits(int) states:
Returns the eight hexadecimal characters for the int value. Each nibble (4 bits) from most significant to least significant of the value is formatted as if by toLowHexDigit(nibble). The delimiter, prefix and suffix are not used.
Alternatively, take advantage of HexFormat::formatHex, which formats each byte as two hexadecimal characters, and pass a StringBuilder that already contains "0x" as the Appendable. The documentation of formatHex states:
Each byte value is formatted as the prefix, two hexadecimal characters selected from uppercase or lowercase digits, and the suffix.
StringBuilder hex = HexFormat.of()
.formatHex(new StringBuilder("0x"), new byte[] {0, 0, 0, 1});
// 0x00000001
StringBuilder hex = HexFormat.of()
.formatHex(new StringBuilder("0x"), ByteBuffer.allocate(4).putInt(1).array());
// 0x00000001

You can use a java.util.Formatter or the printf method on a PrintStream.

This is a String extension for Kotlin that pads a hex string with leading zeros:
// if lengthOfResultTextNeeded = 3 and the input String is "AC", the result is "0AC"
// if lengthOfResultTextNeeded = 4 and the input String is "AC", the result is "00AC"
fun String.unSignedHex(lengthOfResultTextNeeded: Int): String {
    // number of zeros to prepend (nothing is added if it is zero or negative)
    val count = lengthOfResultTextNeeded - this.length
    val buildHex4DigitString = StringBuilder()
    var i = 1
    while (i <= count) {
        buildHex4DigitString.append("0")
        ++i
    }
    return buildHex4DigitString.toString() + this
}
