How to split a String array? - java

The intention is to take the current line (a String that contains commas), remove the white space by replacing it with "", and finally store the split String elements in an array.
Why doesn't this work?
String[] textLine = currentInputLine.replace("\\s", "").split(",");

On regex vs non-regex methods
The String class has the following methods:
Non-regex methods:
String replace(char oldChar, char newChar)
String replace(CharSequence target, CharSequence replacement)
boolean startsWith(String prefix)
boolean endsWith(String suffix)
boolean contains(CharSequence s)
Regex methods:
String replaceAll(String regex, String replacement)
String replaceFirst(String regex, String replacement)
String[] split(String regex)
boolean matches(String regex)
So here we see the immediate cause of your problem: you're using a regex pattern in a non-regex method. Instead of replace, you want to use replaceAll.
Other common pitfalls include:
split(".") (when a literal period is meant)
matches("pattern") is a whole-string match!
There's no contains("pattern"); use matches(".*pattern.*") instead
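The pitfalls above can be checked with a short sketch. The class name RegexPitfalls, its helper methods, and the sample strings are made up for illustration; only the String methods themselves come from the JDK.

```java
public class RegexPitfalls {
    // replace() is not regex-aware: "\\s" here means the two literal characters '\' and 's'.
    public static String literalReplace(String s) {
        return s.replace("\\s", "");
    }

    // replaceAll() is regex-aware: "\\s" here means any whitespace character.
    public static String regexReplace(String s) {
        return s.replaceAll("\\s", "");
    }

    public static void main(String[] args) {
        String line = "a, b ,c";
        System.out.println(literalReplace(line));          // a, b ,c  (unchanged)
        System.out.println(regexReplace(line));            // a,b,c
        // "." matches every character, so everything is a delimiter.
        System.out.println("a.b.c".split(".").length);     // 0
        System.out.println("a.b.c".split("\\.").length);   // 3
        // matches() must match the whole string.
        System.out.println("foobar".matches("oba"));       // false
        System.out.println("foobar".matches(".*oba.*"));   // true
    }
}
```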
On Guava's Splitter
Depending on your need, String.replaceAll and split combo may do the job adequately. A more specialized tool for this purpose, however, is Splitter from Guava.
Here's an example to show the difference:
public static void main(String[] args) {
String text = " one, two, , five (three sir!) ";
dump(text.replaceAll("\\s", "").split(","));
// prints "[one] [two] [] [five(threesir!)] "
dump(Splitter.on(",").trimResults().omitEmptyStrings().split(text));
// prints "[one] [two] [five (three sir!)] "
}
static void dump(String... ss) {
dump(Arrays.asList(ss));
}
static void dump(Iterable<String> ss) {
for (String s : ss) {
System.out.printf("[%s] ", s);
}
System.out.println();
}
Note that String.split cannot omit empty strings at the beginning or in the middle of the returned array. It can omit trailing empty strings only. Also note that replaceAll may "trim" spaces excessively, because it also removes the spaces inside an element. You can make the regex more complicated, so that it only trims around the delimiter, but the Splitter solution is definitely more readable and simpler to use.
Guava also has (among many other wonderful things) a very convenient Joiner.
System.out.println(
Joiner.on("... ").skipNulls().join("Oh", "My", null, "God")
);
// prints "Oh... My... God"

I think you want replaceAll rather than replace.
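And replaceAll("\\s","") will remove all spaces, not just the redundant ones. If that's not what you want, you should try replaceAll("\\s+"," ") or something like that. (The replacement has to be a literal space: "\\s" in a replacement string is not a character class and would insert the letter s.)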

What you wrote does not match the code:
Intention is to take a current line which contains commas, store trimmed values of all space and store the line into the array.
It seems, from the code, that you want all spaces removed and the resulting string split at the commas (which is not what you described). That can be done as Paul Tomblin suggested.
String[] currentLineArray = currentInputLine.replaceAll("\\s", "").split(",");
If you want to split at the commas and remove leading and trailing spaces (trim) from the resulting parts, use:
String[] currentLineArray = currentInputLine.trim().split("\\s*,\\s*");
(trim() is needed to remove leading spaces of first part and trailing space from last part)
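To make the difference between the two variants concrete, here is a small comparison. The class name CommaSplit, the method names, and the sample line are assumptions for illustration only.

```java
import java.util.Arrays;

public class CommaSplit {
    // Removes every whitespace character, including those inside an element.
    public static String[] removeAllSpaces(String line) {
        return line.replaceAll("\\s", "").split(",");
    }

    // Trims only around the commas and at both ends of the line.
    public static String[] trimAroundCommas(String line) {
        return line.trim().split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        String line = " one, two words ,three ";
        System.out.println(Arrays.toString(removeAllSpaces(line)));  // [one, twowords, three]
        System.out.println(Arrays.toString(trimAroundCommas(line))); // [one, two words, three]
    }
}
```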

If you need to perform this operation repeatedly, I'd suggest using java.util.regex.Pattern and java.util.regex.Matcher instead.
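final Pattern pattern = Pattern.compile(regex); // compiled once, outside the loop
final List<String> results = new ArrayList<String>();
for (String inp : inps) {
// reuse the compiled pattern for every input string
results.add(pattern.matcher(inp).replaceAll(replacementString));
}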
Compiling a regex is a costly operation and using String's replaceAll repeatedly is not recommended, since each invocation involves compilation of regex followed by replacement.

Related

Java Regex Metacharacters returning extra space while splitting

I want to split a string using a regex instead of StringTokenizer. I am using String.split(regex);
The regex contains metacharacters, and when I use \[ it returns an extra space in the resulting array.
import java.util.Scanner;
public class Solution{
public static void main(String[] args) {
Scanner i= new Scanner(System.in);
String s= i.nextLine();
String[] st=s.split("[!\\[,?\\._'#\\+\\]\\s\\\\]+");
System.out.println(st.length);
for(String z:st)
System.out.println(z);
}
}
When i enter input [a\m]
It returns array length as 3 and
a m
Space is also there before a.
Can anyone please explain why this is happening and how I can correct it? I don't want the extra space in the resulting array.
Since the [ is at the beginning of the string, the first split step produces two elements: the empty string before the [ and the rest of the string. String#split only removes trailing empty elements (it is executed with limit=0 by default), so the leading empty string stays in the result.
Remove the characters you split on from the start of the string first, using .replaceAll("^[!\\[,?._'#+\\]\\s\\\\]+", "") (note the ^ at the beginning of the pattern). Here is some sample code you can build on:
String[] st="[a\\m]".replaceAll("^[!\\[,?._'#+\\]\\s\\\\]+", "")
.split("[!\\[,?._'#+\\]\\s\\\\]+");
System.out.println(st.length);
for(String z:st) {
System.out.println(z);
}
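As a side-by-side check of the explanation above, the following sketch prints the split result with and without stripping the leading separators. The class name LeadingEmptyDemo, the constant SEP, and the method names are illustrative assumptions.

```java
import java.util.Arrays;

public class LeadingEmptyDemo {
    // Same separator class as in the answer above.
    static final String SEP = "[!\\[,?._'#+\\]\\s\\\\]+";

    // Plain split: a separator at index 0 leaves an empty first element.
    public static String[] plainSplit(String s) {
        return s.split(SEP);
    }

    // Strip leading separators first, then split.
    public static String[] strippedSplit(String s) {
        return s.replaceAll("^" + SEP, "").split(SEP);
    }

    public static void main(String[] args) {
        String input = "[a\\m]"; // the five characters [a\m]
        System.out.println(Arrays.toString(plainSplit(input)));    // [, a, m]
        System.out.println(Arrays.toString(strippedSplit(input))); // [a, m]
    }
}
```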
As an addition to Wiktor Stribiżew’s answer, you may do the same without having to specify the pattern twice, by dealing with the java.util.regex package directly. Removing this redundancy may avoid potential errors and may also be more efficient as the pattern doesn’t need to be parsed twice:
Pattern p = Pattern.compile("[!\\[,?\\._'#\\+\\]\\s\\\\]+");
Matcher m = p.matcher(s);
if(m.lookingAt()) s=m.replaceFirst("");
String[] st = p.split(s);
for(String z:st)
System.out.println(z);
To be able to use the same pattern, i.e. without having to use the anchor ^ for removing a leading separator, we first check via lookingAt() whether the pattern really matches at the beginning of the text before removing the first occurrence. Then, we proceed with the split operation, but reusing the already prepared Pattern.
Regarding your issue mentioned in a comment, the split operation will always return at least one element, the input string, when there is no match, even when the string is empty. If you wish to have an empty array then, the only solution is to replace the result explicitly:
if(st.length==1 && st[0].equals(s)) st=new String[0];
or, if you only want to treat an empty string specially, you may check this beforehand:
if(s.isEmpty()) st=new String[0];
else {
// the code as shown above
}

Split string against some characters except the # character

I want to split a string against the following characters
~!#$%^&*()_+­=<>,.?/:;"'{}|[]\, \n,\t, space
I tried to use the \\s regex delimiter, but I don't want # to be included as a split character, so that a string like this is #funny should result in the values this, is and #funny.
I have tried the following, but it doesn't work.
"this is #funny".split("\\s")
Any ideas?
Just specify the characters you want in square brackets, which means "any of". Single-escape Java special characters (like \") and double-escape regex special characters (like \\[). Leave # out of the class, since you don't want to split on it:
@Test
public void testName() throws Exception
{
String[] split = "this is #funny".split("[~!$%^&*()_+­=<>,.?/:;\"'{}|\\[\\]\\\\ \\n\\t]");
for (String string : split)
{
logger.debug(string);
}
}
Use the replaceAll(String regex, String replacement) method from String.
String result = "this is #funny".replaceAll("[~!#$%^&*()_+­=<>,.?/:;\"'{}|\\[\\]\\,\\n\\t]", "");
System.out.println(result);
You can try to implement this:
String[] split = "this&is%a#funny^string".split("[^#\\p{Alnum}]|\\s+");
for (String string : split){
System.out.println(string);
}
Also check the Java API (Patterns) for more information on how to process strings.
It looks like this will work for you:
String[] split = str.split("[^a-zA-Z&&[^#]]+");
This uses a character class subtraction to split on non-letter chars, except the hash.
Here's some test code:
String str = "this is #funny";
String[] split = str.split("[^a-zA-Z&&[^#]]+");
System.out.println(Arrays.toString(split));
Output:
[this, is, #funny]
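If the nested class feels hard to read, a flat negated class that lists # next to the letters is a possible alternative. This is only a sketch under the assumption that "split on everything except letters and #" is what is wanted; the class name HashSplit and the method name are made up.

```java
import java.util.Arrays;

public class HashSplit {
    // Split on runs of characters that are neither ASCII letters nor '#'.
    public static String[] splitKeepingHash(String s) {
        return s.split("[^a-zA-Z#]+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKeepingHash("this is #funny")));
        // [this, is, #funny]
        System.out.println(Arrays.toString(splitKeepingHash("this&is%a#funny^string")));
        // [this, is, a#funny, string]
    }
}
```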

Replacing only the first space in a string

I want to replace the first space character in a string with another string listed below. The word may contain many spaces but only the first space needs to be replaced. I tried the regex below but it didn't work ...
Pattern inputSpace = Pattern.compile("^\\s", Pattern.MULTILINE);
String spaceText = "This split ";
System.out.println(inputSpace.matcher(spaceText).replaceAll("&emsp;"));
EDIT:: It is an external API that I am using and I have the constraint that I can only use "replaceAll" ..
Your code doesn't work because it doesn't account for the characters between the start of the string and the white-space.
Change your code to:
Pattern inputSpace = Pattern.compile("^([^\\s]*)\\s", Pattern.MULTILINE);
String spaceText = "This split ";
System.out.println(inputSpace.matcher(spaceText).replaceAll("$1&emsp;"));
Explanation:
[^...] is to match characters that don't match the supplied characters or character classes (\\s is a character class).
So, [^\\s]* is zero-or-more non-white-space characters. It's surrounded by () for the below.
$1 is the first thing that appears in ().
Java regex reference.
The preferred way, however, would be to use replaceFirst: (although this doesn't seem to conform to your requirements)
String spaceText = "This split ";
spaceText = spaceText.replaceFirst("\\s", "&emsp;");
You can use the String.replaceFirst() method to replace the first occurrence of the pattern
System.out.println(" all test".replaceFirst("\\s", "test"));
And String.replaceFirst() internally calls Matcher.replaceFirst(), so it's equivalent to
Pattern inputSpace = Pattern.compile("\\s", Pattern.MULTILINE);
String spaceText = "This split ";
System.out.println(inputSpace.matcher(spaceText).replaceFirst("&emsp;"));
Do it in 2 steps:
indexOf(" ") will tell you the index of the first space
result = str.substring(0, index) + replacement + str.substring(index + 1)
That is the idea; you also need to handle the case where indexOf returns -1 (there is no space in the string).
It should be faster than a regex, because it only copies two substrings and does not need to compile a pattern or run a matcher.
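A minimal runnable version of the two steps might look like this. The class name FirstSpace and the method name replaceFirstSpace are hypothetical, and the sketch only looks for the plain space character ' ', not for other whitespace.

```java
public class FirstSpace {
    // Replaces only the first ' ' in str; returns str unchanged if it has no space.
    public static String replaceFirstSpace(String str, String replacement) {
        int index = str.indexOf(' ');
        if (index < 0) {
            return str; // nothing to replace
        }
        return str.substring(0, index) + replacement + str.substring(index + 1);
    }

    public static void main(String[] args) {
        System.out.println(replaceFirstSpace("This split ", "&emsp;")); // This&emsp;split 
        System.out.println(replaceFirstSpace("NoSpace", "&emsp;"));     // NoSpace
    }
}
```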
Can use Apache StringUtils:
import org.apache.commons.lang.StringUtils;
public class substituteFirstOccurrence{
public static void main(String[] args){
String text = "Word1 Word2 Word3";
System.out.println(StringUtils.replaceOnce(text, " ", "-"));
// output: "Word1-Word2 Word3"
}
}
We can simply use yourString.replaceFirst(" ", ""); in Kotlin.

Regular Expression problem in Java

I am trying to create a regular expression for the replaceAll method in Java. The test string is abXYabcXYZ and the pattern is abc. I want to replace any symbol except the pattern with +. For example the string abXYabcXYZ and pattern [^(abc)] should return ++++abc+++, but in my case it returns ab++abc+++.
public static String plusOut(String str, String pattern) {
pattern= "[^("+pattern+")]" + "".toLowerCase();
return str.toLowerCase().replaceAll(pattern, "+");
}
public static void main(String[] args) {
String text = "abXYabcXYZ";
String pattern = "abc";
System.out.println(plusOut(text, pattern));
}
When I try to replace the pattern with + there is no problem - abXYabcXYZ with pattern (abc) returns abxy+xyz. Pattern (^(abc)) returns the string without replacement.
Is there any other way to write NOT(regex) or group symbols as a word?
What you are trying to achieve is pretty tough with regular expressions, since there is no way to express “replace strings not matching a pattern”. You will have to use a “positive” pattern, telling what to match instead of what not to match.
Furthermore, you want to replace every character with a replacement character, so you have to make sure that your pattern matches exactly one character. Otherwise, you will replace whole strings with a single character, returning a shorter string.
For your toy example, you can use negative lookaheads and lookbehinds to achieve the task, but this may be more difficult for real-world examples with longer or more complex strings, since you will have to consider each character of your string separately, along with its context.
Here is the pattern for “not ‘abc’”:
[^abc]|a(?!bc)|(?<!a)b|b(?!c)|(?<!ab)c
It consists of five sub-patterns, connected with “or” (|), each matching exactly one character:
[^abc] matches every character except a, b or c
a(?!bc) matches a if it is not followed by bc
(?<!a)b matches b if it is not preceded with a
b(?!c) matches b if it is not followed by c
(?<!ab)c matches c if it is not preceded with ab
The idea is to match every character that is not in your target word abc, plus every word character that, according to the context, is not part of your word. The context can be examined using negative lookaheads (?!...) and lookbehinds (?<!...).
You can imagine that this technique will fail once you have a target word containing one character more than once, like example. It is pretty hard to express “match e if it is not followed by x and not preceded by l”.
Especially for dynamic patterns, it is by far easier to do a positive search and then replace every character that did not match in a second pass, as others have suggested here.
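For the fixed word abc, the five-part pattern above can be tried out directly. The wrapper class NotAbc and its method name are illustrative; note that, unlike the code in the question, this sketch does not lower-case the input.

```java
public class NotAbc {
    // One alternative per case; each alternative consumes exactly one character.
    static final String NOT_ABC = "[^abc]|a(?!bc)|(?<!a)b|b(?!c)|(?<!ab)c";

    public static String plusOut(String s) {
        return s.replaceAll(NOT_ABC, "+");
    }

    public static void main(String[] args) {
        System.out.println(plusOut("abXYabcXYZ")); // ++++abc+++
        System.out.println(plusOut("abcabc"));     // abcabc
        System.out.println(plusOut("xbc"));        // +++
    }
}
```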
[^ ... ] will match one character that is not any of ...
So your pattern "[^(abc)]" is saying "match one character that is not a, b, c or the left or right bracket"; and indeed that is what happens in your test.
It is hard to say "replace all characters that are not part of the string 'abc'" in a single trivial regular expression. What you might do instead to achieve what you want could be some nasty thing like
while the input string still contains "abc"
find the next occurrence of "abc"
append to the output a string containing as many "+"s as there are characters before the "abc"
append "abc" to the output string
skip, in the input string, to a position just after the "abc" found
append to the output a string containing as many "+"s as there are characters left in the input
or possibly if the input alphabet is restricted you could use regular expressions to do something like
replace all occurrences of "abc" with a single character that does not occur anywhere in the existing string
replace all other characters with "+"
replace all occurrences of the target character with "abc"
which will be more readable but may not perform as well
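The first list of steps above could be sketched in Java roughly as follows. The class name PlusOutLoop and the guard against an empty word are additions for illustration, not part of the original answer.

```java
public class PlusOutLoop {
    // Keeps every occurrence of word and replaces all other characters with '+'.
    public static String plusOut(String input, String word) {
        if (word.isEmpty()) {
            throw new IllegalArgumentException("word must not be empty");
        }
        StringBuilder out = new StringBuilder();
        int pos = 0;
        int hit;
        // While the rest of the input still contains the word...
        while ((hit = input.indexOf(word, pos)) >= 0) {
            // ...emit one '+' per character before it, then the word itself.
            for (int i = pos; i < hit; i++) {
                out.append('+');
            }
            out.append(word);
            pos = hit + word.length(); // skip to just after the word
        }
        // Emit one '+' per character left over after the last occurrence.
        for (int i = pos; i < input.length(); i++) {
            out.append('+');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(plusOut("abXYabcXYZ", "abc")); // ++++abc+++
    }
}
```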
Negating regexps is usually troublesome. I think you might want to use negative lookahead. Something like this might work:
String pattern = "(?<!ab).(?!abc)";
I didn't test it, so it may not really work for degenerate cases. And the performance might be horrible too. It is probably better to use a multistep algorithm.
Edit: No I think this won't work for every case. You will probably spend more time debugging a regexp like this than doing it algorithmically with some extra code.
Try to solve it without regular expressions:
String out = "";
int i;
for(i=0; i<text.length() - pattern.length() + 1; ) {
if (text.substring(i, i + pattern.length()).equals(pattern)) {
out += pattern;
i += pattern.length();
}
else {
out += "+";
i++;
}
}
for(; i<text.length(); i++) {
out += "+";
}
Rather than a single replaceAll, you could always try something like:
@Test
public void testString() {
final String in = "abXYabcXYabcHIH";
final String expected = "xxxxabcxxabcxxx";
String result = replaceUnwanted(in);
assertEquals(expected, result);
}
private String replaceUnwanted(final String in) {
final Pattern p = Pattern.compile("(.*?)(abc)([^a]*)");
final Matcher m = p.matcher(in);
final StringBuilder out = new StringBuilder();
while (m.find()) {
out.append(m.group(1).replaceAll(".", "x"));
out.append(m.group(2));
out.append(m.group(3).replaceAll(".", "x"));
}
return out.toString();
}
Instead of using replaceAll(...), I'd go for a Pattern/Matcher approach:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static String plusOut(String str, String pattern) {
StringBuilder builder = new StringBuilder();
String regex = String.format("((?:(?!%s).)++)|%s", pattern, pattern);
Matcher m = Pattern.compile(regex).matcher(str.toLowerCase());
while(m.find()) {
builder.append(m.group(1) == null ? pattern : m.group().replaceAll(".", "+"));
}
return builder.toString();
}
public static void main(String[] args) {
String text = "abXYabcXYZ";
String pattern = "abc";
System.out.println(plusOut(text, pattern));
}
}
Note that you'll need to use Pattern.quote(...) if your String pattern contains regex meta-characters.
Edit: I didn't see a Pattern/Matcher approach was already suggested by toolkit (although slightly different)...
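To illustrate the Pattern.quote(...) remark, here is a variant of the Pattern/Matcher approach that quotes the word before building the regex. The class name QuotedPlusOut is made up, and this sketch deliberately skips the toLowerCase() call so that the effect of quoting is easy to see.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedPlusOut {
    public static String plusOut(String str, String word) {
        // Quote the word so that metacharacters such as '.' are matched literally.
        String q = Pattern.quote(word);
        Matcher m = Pattern.compile("((?:(?!" + q + ").)++)|" + q).matcher(str);
        StringBuilder builder = new StringBuilder();
        while (m.find()) {
            // Group 1 is a run of characters that does not start the word;
            // if it is null, the match is the word itself.
            builder.append(m.group(1) == null ? word : m.group().replaceAll(".", "+"));
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // Without quoting, "a.c" would also treat "abc" as an occurrence of the word.
        System.out.println(plusOut("abcXa.c", "a.c")); // ++++a.c
    }
}
```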

Finding tokens in a Java String

Is there a nice way to extract tokens that start with a pre-defined string and end with a pre-defined string?
For example, let's say the starting string is "[" and the ending string is "]". If I have the following string:
"hello[world]this[[is]me"
The output should be:
token[0] = "world"
token[1] = "[is"
(Note: the second token has a 'start' string in it)
I think you can use the Apache Commons Lang feature that exists in StringUtils:
substringsBetween(java.lang.String str,
java.lang.String open,
java.lang.String close)
The API docs say it:
Searches a String for substrings
delimited by a start and end tag,
returning all matching substrings in
an array.
The Commons Lang substringsBetween API can be found here:
http://commons.apache.org/lang/apidocs/org/apache/commons/lang/StringUtils.html#substringsBetween(java.lang.String,%20java.lang.String,%20java.lang.String)
Here is the way I would go to avoid dependency on commons lang.
public static String escapeRegexp(String regexp){
String specChars = "\\$.*+?|()[]{}^";
String result = regexp;
for (int i=0;i<specChars.length();i++){
Character curChar = specChars.charAt(i);
result = result.replaceAll(
"\\"+curChar,
"\\\\" + (i<2?"\\":"") + curChar); // \ and $ must have special treatment
}
return result;
}
public static List<String> findGroup(String content, String pattern, int group) {
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(content);
List<String> result = new ArrayList<String>();
while (m.find()) {
result.add(m.group(group));
}
return result;
}
public static List<String> tokenize(String content, String firstToken, String lastToken){
String regexp = lastToken.length()>1
?escapeRegexp(firstToken) + "(.*?)"+ escapeRegexp(lastToken)
:escapeRegexp(firstToken) + "([^"+lastToken+"]*)"+ escapeRegexp(lastToken);
return findGroup(content, regexp, 1);
}
Use it like this :
String content = "hello[world]this[[is]me";
List<String> tokens = tokenize(content,"[","]");
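A shorter variant is possible if the hand-written escaping is replaced with the JDK's Pattern.quote. This is a sketch, not a drop-in replacement for the code above: the class name TokenFinder and the method name tokensBetween are hypothetical, and it always uses the lazy (.*?) form.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenFinder {
    // Returns the text between each open/close pair, scanning left to right.
    public static List<String> tokensBetween(String content, String open, String close) {
        Pattern p = Pattern.compile(Pattern.quote(open) + "(.*?)" + Pattern.quote(close));
        Matcher m = p.matcher(content);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokensBetween("hello[world]this[[is]me", "[", "]"));
        // [world, [is]
    }
}
```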
StringTokenizer? Set the search string to "[]" and the "include tokens" flag to false and I think you're set.
The normal StringTokenizer won't work for this requirement; you have to tweak it or write your own.
There's one way you can do this. It isn't particularly pretty. What it involves is going through the string character by character. When you reach a "[", you start putting the characters into a new token. When you reach a "]", you stop. This would be best done using a data structure not an array since arrays are of static length.
Another solution, which may be possible, is to use regexes with String's split method. The only problem I have is coming up with a regex which would split the way you want it to. What I can come up with is (]string of characters[) XOR (string of characters[) XOR (]string of characters). Each set of parentheses denotes a different regex. You should evaluate them in this order so you don't accidentally remove anything you want. I'm not familiar with regexes in Java, so I used "string of characters" to denote that there are characters in between the brackets.
Try a regular expression like:
(.*?\[(.*?)\])
The second capture should contain all of the information between the set of []. This will however not work properly if the string contains nested [].
StringTokenizer won't cut it for the specified behavior. You'll need your own method. Something like:
public List<String> extractTokens(String txt, String str, String end) {
int so=0,eo;
List<String> lst=new ArrayList<String>();
while(so<txt.length() && (so=txt.indexOf(str,so))!=-1) {
so+=str.length(); // move past the start marker
if(so<txt.length() && (eo=txt.indexOf(end,so))!=-1) {
lst.add(txt.substring(so,eo)); // text between the two markers
so=eo+end.length(); // continue after the end marker
}
}
return lst;
}
The regular expression \\[[\\[\\w]+\\] gives us
[world] and
[[is]
