Correct way to split UTF-8 String - java

I want to split a utf-8 string.
I have tried the StringTokenizer but it fails.
The title should be "0" but it shows as "عُدي_صدّام_حُسين".
String test = "en.m عُدي_صدّام_حُسين 1 0";
StringTokenizer stringTokenizer = new StringTokenizer(test);
String code = stringTokenizer.nextToken();
String title = stringTokenizer.nextToken();
What is the correct way to split a utf-8 string?

The problem here is that the Arabic text isn't "at the end" of the string.
For example, if I select the contents of the string literal (in Chrome), moving my mouse from left to right, it selects the en.m first, then all of the Arabic text, then the 0 1. The text just looks "at the end" because that's how it is being rendered.
The string, as specified in your Java source code actually does have the عُدي_صدّام_حُسين as the second token. So, you're splitting it correctly, you're just not splitting what you think you're splitting.
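To check this for yourself, a minimal sketch along the following lines prints each token next to its index, which sidesteps the bidirectional rendering. The class name LogicalOrderDemo and the helper tokens() are made up for illustration, and String.split is used here in place of StringTokenizer.

```java
class LogicalOrderDemo {
    /** Splits on runs of whitespace; tokens come back in logical (storage) order. */
    static String[] tokens(String s) {
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String test = "en.m عُدي_صدّام_حُسين 1 0";
        String[] parts = tokens(test);
        // Printing the index next to each token makes the logical order visible
        // even when the console renders the Arabic text right-to-left.
        for (int i = 0; i < parts.length; i++) {
            System.out.println(i + ": " + parts[i]);
        }
    }
}
```

With the string from the question, index 1 holds the Arabic text and index 3 holds the "0", so the fourth token is the one to read if "0" is the value you are after.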

There is no single correct way, but I normally use the substring() method of the String class. You can pass it either a begin index, in which case it returns the substring from that index to the end of the original String, or both a begin and an end index to bound the substring. If you do not know the index of a character, the indexOf() method of the same class will locate it for you.
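As a rough sketch of that approach, the helpers below pull the first and the last space-separated field out of a string. The class and method names are hypothetical, and lastIndexOf() is used alongside indexOf() as an extra assumption.

```java
class SubstringDemo {
    /** Returns the text before the first space, or the whole string if there is none. */
    static String firstField(String s) {
        int space = s.indexOf(' ');
        return space < 0 ? s : s.substring(0, space);
    }

    /** Returns the text after the last space, or the whole string if there is none. */
    static String lastField(String s) {
        int space = s.lastIndexOf(' ');
        // When no space exists, space is -1 and substring(0) returns the whole string.
        return s.substring(space + 1);
    }

    public static void main(String[] args) {
        String line = "en.m title 1 0";
        System.out.println(firstField(line));
        System.out.println(lastField(line));
    }
}
```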

Related

Shorten String between Chars until max char length is reached

In my GUI I'm displaying a path string via a JLabel, and a MouseListener opens the folder when the label is clicked.
I want to shorten the displayed string between and after the first directory slashes until the whole string is under a certain length e.g. 20.
e.g.:
String regularPath="C:\Users\xy\Desktop\d1\d2\d3\d4\d5"; //->34 chars
String newPath= "C:\...\d1\d2\d3\d4 " //->20 chars
I couldn't figure out the logic to implement this, and I would appreciate your help (indexOf("\\") always led to an OutOfBoundsException). Thanks in advance!
Your problem is a good candidate for a regex replacement using lookarounds. You can match the following pattern and replace it with an ellipsis.
(?<=\w:\\).*?(?=.{14}$)
(?<=\w:\\) asserts that a drive-letter prefix (e.g. C:\) precedes
.*? matches and consumes as little as possible until
(?=.{14}$) exactly 14 characters remain before the end of the path
Note that only the text matched by .*? gets replaced with the ellipsis; everything on either side remains as is. The 14 comes from the 20-character target minus 3 characters for the drive prefix and 3 for the ellipsis. Also, paths with fewer than 14 characters after the drive prefix (under 17 characters in total) will not match this pattern, and therefore will be printed in their entirety.
String regularPath = "C:\\Users\\xy\\Desktop\\d1\\d2\\d3\\d4\\d5";
regularPath = regularPath.replaceAll("(?<=\\w:\\\\).*?(?=.{14}$)", "...");
System.out.println(regularPath);
C:\...d1\d2\d3\d4\d5
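If the maximum length should be configurable, the same pattern can be built from a parameter. The following is only a sketch: the class PathShortener, the method shorten() and the maxLen parameter are invented for illustration, and it assumes a three-character drive prefix such as C:\ plus a three-character ellipsis.

```java
class PathShortener {
    /**
     * Shortens a Windows-style path to maxLen characters by replacing the
     * part right after the drive prefix with "...".
     */
    static String shorten(String path, int maxLen) {
        if (path.length() <= maxLen) {
            return path; // already short enough, leave untouched
        }
        int tail = maxLen - 6; // 3 chars for the drive prefix + 3 for "..."
        if (tail < 0) {
            throw new IllegalArgumentException("maxLen must be at least 6");
        }
        String regex = "(?<=\\w:\\\\).*?(?=.{" + tail + "}$)";
        // A path without a drive prefix does not match and is returned unchanged.
        return path.replaceFirst(regex, "...");
    }

    public static void main(String[] args) {
        String regularPath = "C:\\Users\\xy\\Desktop\\d1\\d2\\d3\\d4\\d5";
        System.out.println(shorten(regularPath, 20));
    }
}
```

Because the cut is made by character count, the kept tail can start in the middle of a directory name for other inputs.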

Replace a string without certain prefix and suffix in Java

I'm trying to replace all occurrences of a given string, but I have to be sure that it isn't surrounded by letters or numbers.
For example:
// Directive's block
BIT EQU $1111
BIT0 EQU $0000
// Instruction's block
ADD BIT, (**BIT**0)+
When my parser finds an EQU in the first line, it reads the instruction's block trying to find the given label ("BIT", in this case) and replaces it with its value. The result is then (which is wrong):
ADD $1111, (**$1111**0)+
This overwrites part of the name of the other label, because BIT is a substring of it. So I have to be sure that the surrounding characters are not letters or numbers; then I can be sure that it doesn't overwrite another label ID.
My code for now is:
output += operand.replace(label, value)+" ";
operand: a string containing the whole operand
label: the label to be found for replacement
value: the value to be replaced with that label
Now I'm trying to use replaceAll() and some regex:
String regex = "(?<![a-zA-Z_])"+label+"[^a-zA-Z_]";
output+= operand.replaceAll(regex, value)+" ";
But it throws the following exception:
IndexOutOfBoundsException: No group 1 (java.util.regex.Matcher.start)
Even if I left only the suffix, it throws the same error.
Does anybody know what it means?
Thanks, guys.
If you're using replaceAll() and you're trying to replace something with $1111, that won't work, because $ has a special meaning in replaceAll. Use Matcher.quoteReplacement(value) instead of value in the replaceAll() call; quoteReplacement makes sure that any special characters are "quoted" so that they no longer have special meanings. (The replacement is interpreting $1 as "replace with the contents of group 1", which is why you're getting the error.)
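Put together, a replacement along these lines might look like the sketch below. The class LabelReplacer and the method replaceLabel() are hypothetical names. Two things go beyond the advice above and are assumptions on my part: the label is wrapped in Pattern.quote() in case it contains regex metacharacters, and the suffix check is written as a negative lookahead that also excludes digits, so the character after the label is not consumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LabelReplacer {
    /** Replaces label with value only where it is not part of a longer identifier. */
    static String replaceLabel(String operand, String label, String value) {
        String regex = "(?<![A-Za-z0-9_])" + Pattern.quote(label) + "(?![A-Za-z0-9_])";
        // quoteReplacement makes "$1111" a literal string instead of a group reference.
        return operand.replaceAll(regex, Matcher.quoteReplacement(value));
    }

    public static void main(String[] args) {
        System.out.println(replaceLabel("BIT, (BIT0)+", "BIT", "$1111"));
    }
}
```

For the operand from the question this leaves BIT0 alone and only replaces the standalone BIT.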

Substring containing words up to n characters

I've got a string and I want to get the first words that together contain up to N characters.
For example:
String s = "This is some text form which I want to get some first words";
Let's say that I want to get words up to 30 characters, result should look like this:
This is some text form which
Is there any method for this? I don't want to reinvent the wheel.
EDIT: I know the substring method, but it can break words. I don't want to get something like
This is some text form whi
etc.
You could use regular expressions to achieve this. Something like below should do the job:
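import java.util.regex.Matcher;
import java.util.regex.Pattern;
String input = "This is some text form which I want to get some first words";
// Take 25 characters, then extend to the end of the current word
Pattern p = Pattern.compile("(\\b.{25}[^\\s]*)");
Matcher m = p.matcher(input);
if (m.find()) {
    System.out.println(m.group(1));
}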
This yields:
This is some text form which
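In the regular expression, \b anchors the match at a word boundary, .{25} takes the first 25 characters, and [^\s]* extends the match to the end of the current word, so the result can be longer than the number you give. I used 25 because cutting at exactly 25 characters would result in a broken substring; you can replace it with whatever value you want.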
Split your string on the space character ' ', then append each word to a new string, checking before each append whether the new string would exceed the limit.
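A minimal sketch of that idea follows. The names WordLimiter, firstWords() and maxChars are made up, and it assumes that the single separating spaces count towards the limit.

```java
class WordLimiter {
    /** Returns the leading whole words of text whose joined length is at most maxChars. */
    static String firstWords(String text, int maxChars) {
        StringBuilder out = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            // One extra character for the separating space, except before the first word.
            int extra = (out.length() == 0 ? 0 : 1) + word.length();
            if (out.length() + extra > maxChars) {
                break;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(word);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "This is some text form which I want to get some first words";
        System.out.println(firstWords(s, 29));
    }
}
```

With a limit of 29 this returns "This is some text form which". With a limit of exactly 30 the one-letter word "I" still fits, because the result is then 30 characters long.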
You could also do it like this, without regex:
String s = "This is some text form which I want to get some first words";
// Find the first space at or after index 28
int index = s.indexOf(' ', 29-1);
System.out.println(s.substring(0,index));
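The output is: This is some text form which
Obligatory edit: there is no length check in there, so take care of that yourself. indexOf() returns -1 when no space is found, and substring(0, -1) then throws a StringIndexOutOfBoundsException.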
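One way to add that guard is sketched below. The class SpaceCut and the method cutAtSpace() are hypothetical names, and returning the whole string when no space is found is just one possible choice.

```java
class SpaceCut {
    /** Cuts s at the first space at or after index from; returns s unchanged if there is none. */
    static String cutAtSpace(String s, int from) {
        int index = s.indexOf(' ', from);
        // indexOf returns -1 when no space exists at or after the given index.
        return index < 0 ? s : s.substring(0, index);
    }

    public static void main(String[] args) {
        String s = "This is some text form which I want to get some first words";
        System.out.println(cutAtSpace(s, 28));
        System.out.println(cutAtSpace("short", 28));
    }
}
```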

How do I get the 5th word in a string? Java

Say I have the text of a document in a string and I want to save the 124th word of that string in another string. How would I do this? I assume a "word" is any run of text between spaces (including things like hyphens).
Edit:
Basically, what I'm doing right now is grabbing text off a webpage, and I want to get the number values next to certain words. The webpage is saved in a string such as .....health 78 mana 32..... or something of the sort, and I want to get the 78 and the 32 and save them as variables.
If you have a string
String s = "...";
then you can get the word (separated by spaces) in the nth position using split(delimiter) which returns an array String[]:
String word = s.split("\\s+")[n-1];
Note:
The argument passed to split() is the delimiter. In this case, "\\s+" is a regular expression meaning that the method will split the string on each run of whitespace, even if there are several whitespace characters together.
Why not convert the String to a String array using StringName.split(" "), i.e. split the string on spaces? Then it is only a matter of retrieving the 124th element of the array (index 123).
For example, you have a string like this:
String a="Hello stackoverflow i am Gratin";
To see the 5th word, just write:
System.out.println(a.split("\\s+")[4]);
This is a different approach that automatically returns a blank String if there isn't a 5th word:
String word = input.replaceAll("(?s)^\\s*(?:\\S+\\s+){4}(\\S+).*|.*", "$1");
The first alternative captures the 5th word in group 1; if it cannot match, the second alternative consumes the input and $1 expands to nothing.
Solutions that rely on split() would need an extra step of checking the resulting array length to prevent getting an ArrayIndexOutOfBoundsException if there are fewer than 5 words.
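For the split() route, the extra length check could look roughly like this. NthWord and nthWord() are illustrative names, and returning an empty string for a missing word is an assumption rather than the only sensible behaviour.

```java
class NthWord {
    /** Returns the n-th whitespace-separated word (1-based), or "" if there is no such word. */
    static String nthWord(String text, int n) {
        String[] words = text.trim().split("\\s+");
        if (n < 1 || n > words.length) {
            return ""; // guards against ArrayIndexOutOfBoundsException
        }
        return words[n - 1];
    }

    public static void main(String[] args) {
        String a = "Hello stackoverflow i am Gratin";
        System.out.println(nthWord(a, 5));
        System.out.println(nthWord(a, 124).isEmpty());
    }
}
```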

Regex required to update a character

I have a String : testing<b>s<b>tringwit<b>h</b>nomean<b>s</b>ing
I want to replace the character s with some other character sequence, say <b>X</b>, but I want some occurrences of s to remain intact, i.e. the regex should not replace an s that is immediately followed by "<".
I used this Java code:
String str = "testing<b>s<b>tringwit<b>h</b>nomean<b>s</b>ing";
str = str.replace("s[^<]", "<b>X</b>");
The problem is that the regex matches 2 characters, s and the following character if it is not "<", and String.replace would replace both characters. I want only s to be replaced and not the following character.
Any help would be appreciated. Since I have lots of such replacements, I don't want to use a loop that matches each character and updates it sequentially.
There are other ways, but you could, for example, capture the second character and put it back:
str = str.replaceAll("s([^<])", "<b>X</b>$1");
Looks like you want a negative lookahead:
s(?!<)
String str = "testing<b>s<b>tringwit<b>h</b>nomean<b>s</b>ing;";
System.out.println(str.replaceAll("s(?!<)", "<b>X</b>"));
output:
te<b>X</b>ting<b>s<b>tringwit<b>h</b>nomean<b>s</b>ing;
Use lookarounds to assert, but not capture, surrounding text:
str = str.replaceAll("s(?=[^<])", "whatever");
Or, capture and put back using a back reference $1:
str = str.replaceAll("s([^<])", "whatever$1");
Note that you need to use replaceAll() (which uses regex), rather than replace() (which uses plain text).
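The differences are easiest to see side by side. The sketch below uses a made-up class name, a short made-up input and "X" as the replacement. It compares replace() with replaceAll(), and a negative lookahead with a positive one, which behave differently when s is the last character.

```java
class ReplaceVsReplaceAll {
    /** replace() treats the first argument as plain text, so the regex is never applied. */
    static String literal(String s) {
        return s.replace("s[^<]", "X");
    }

    /** Replaces every s that is not followed by '<', including an s at the very end. */
    static String negativeLookahead(String s) {
        return s.replaceAll("s(?!<)", "X");
    }

    /** Replaces every s that is followed by some character other than '<'. */
    static String positiveLookahead(String s) {
        return s.replaceAll("s(?=[^<])", "X");
    }

    public static void main(String[] args) {
        String in = "sas<b>is";
        System.out.println(literal(in));           // sas<b>is
        System.out.println(negativeLookahead(in)); // Xas<b>iX
        System.out.println(positiveLookahead(in)); // Xas<b>is
    }
}
```

Which lookahead fits depends on whether a trailing s should be replaced; neither version consumes the character after s.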
