Efficiently split large strings in Java

I have a large string that should be split at a certain character, if it is not preceded by another certain character.
What is the most efficient way to do this?
An example: Split this string at ':', but not at "?:":
part1:part2:https?:example.com:anotherstring
What I have tried so far:
Regex (?<!\?):. Very slow.
First getting the indices where to split the string and then split it. Only efficient if there are not many split characters in the string.
Iterating over the string character by character. Efficient if there are not many protect characters (e.g. '?').

I fear you would have to go through the string and check if a ":" is preceded by a "?"
int lastIndex = 0;
// Continue the search after the current ':' so that a protected "?:" cannot be found again.
for (int index = string.indexOf(':'); index >= 0; index = string.indexOf(':', index + 1)) {
    if (index == 0 || string.charAt(index - 1) != '?') {
        String splitString = string.substring(lastIndex, index);
        // add splitString to list or array
        lastIndex = index + 1;
    }
}
// add string.substring(lastIndex) to list or array
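The loop above can be wrapped into a small, self-contained method. This is only a sketch: the class name ProtectedSplit and the method split are made up for illustration, and it assumes a single separator character and a single protect character, as in the question.

```java
import java.util.ArrayList;
import java.util.List;

class ProtectedSplit {
    // Splits s at every separator that is not directly preceded by the protect character.
    static List<String> split(String s, char separator, char protect) {
        List<String> parts = new ArrayList<>();
        int start = 0; // start index of the part currently being collected
        for (int i = s.indexOf(separator); i >= 0; i = s.indexOf(separator, i + 1)) {
            if (i == 0 || s.charAt(i - 1) != protect) {
                parts.add(s.substring(start, i));
                start = i + 1;
            }
        }
        parts.add(s.substring(start)); // remainder after the last split point
        return parts;
    }

    public static void main(String[] args) {
        // prints [part1, part2, https?:example.com, anotherstring]
        System.out.println(split("part1:part2:https?:example.com:anotherstring", ':', '?'));
    }
}
```

Because indexOf jumps directly from one separator to the next, the work per character stays small even when the string contains many protect characters.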

You will have to test this very carefully (since I didn't do that), but using a regular expression in the split() might produce the results you want:
public static void main(String[] args) {
String s = "Start.Teststring.Teststring1?.Teststring2.?Teststring3.?.End";
String[] result = s.split("(?<!\\?)\\.(?!\\.)");
System.out.println(String.join("|", result));
}
Output:
Start|Teststring|Teststring1?.Teststring2|?Teststring3|?.End
Note:
This example splits at a dot instead of a colon, and only if the dot is neither preceded by a question mark nor followed by another dot.
I don't think you will get a much more performant solution than the regex...
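Applied to the colon example from the question, the regex approach might look like the sketch below. The class and constant names are hypothetical. The pattern is compiled once and reused, which avoids recompiling it on every call; whether that is fast enough for a given input would have to be measured.

```java
import java.util.regex.Pattern;

class RegexProtectedSplit {
    // A ':' that is not directly preceded by '?' (negative lookbehind).
    private static final Pattern COLON_NOT_AFTER_QUESTION_MARK = Pattern.compile("(?<!\\?):");

    static String[] split(String s) {
        return COLON_NOT_AFTER_QUESTION_MARK.split(s);
    }

    public static void main(String[] args) {
        String s = "part1:part2:https?:example.com:anotherstring";
        // prints part1|part2|https?:example.com|anotherstring
        System.out.println(String.join("|", split(s)));
    }
}
```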

Related

String.replace() not replacing all occurrences

I have a very long string which looks similar to this.
355,356,357,358,359,360,361,382,363,364,365,366,360,361,363,366,360,361,363,366,360,361,363,366,360,361,363,366,360,361,363,366,360,361,363,366,360,361,363,366,368,369,313,370,371,372,373,374,375,376,377,378,379,380,381,382,382,382,382,382,382,383,384,385,380,381,382,382,382,382,382,386,387,388,389,380,381,382,382,382,382,382,382,390,391,380,381,382,382,382,382,382,392,393,394,395,396,397,398,399,....
When I tried using the following code to remove the number 382 from the string.
String str = "355,356,357,358,359,360,361,382,363,364,365,366,360,361,363,366,360,361,363,366,360,361,363,366,360,361,363,366,360,361,363,366,360,361,363,366,360,361,363,366,368,369,313,370,371,372,373,374,375,376,377,378,379,380,381,382,382,382,382,382,382,383,384,385,380,381,382,382,382,382,382,386,387,388,389,380,381,382,382,382,382,382,382,390,391,380,381,382,382,382,382,382,392,393,394,395,396,397,398,399,...."
str = str.replace(",382,", ",");
But it seems that not all occurrences are being replaced. The string which originally had above 3000 occurrences still was left with about 630 occurrences after replacing.
Is the capability of String.replace() limited? If so, is there a possible way of achieving what I need?
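You need to replace the trailing comma as well (if one exists, which it won't if 382 is last in the list):
str = str.replaceAll("\\b382\\b,?", "");
Note the \b word boundaries, which prevent matching inside longer numbers such as "1382" or "3820".
The above will convert:
382,111,382,1382,222,382
to:
111,1382,222,
A trailing comma remains when 382 is the last element, because the comma in front of it is not part of the match.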
I think the issue is your first argument to replace(), in particular the commas (,) before and after 382. If you have "382,382,383", you will only match the inner ",382," and leave the initial one behind. Try:
str = str.replace("382,", "");
However, this will fail to match a "382" at the very end, as it does not have a comma after it.
A full solution might entail two method calls:
str = str.replace("382", ""); // Remove all instances of 382
str = str.replaceAll(",,+", ","); // Compress all duplicates, triplicates, etc. of commas
This combines the two approaches:
str = str.replaceAll("382,?", ""); // Remove 382 and an optional comma after it.
Note: both of the last two approaches leave a trailing comma if 382 is at the end, and both also match "382" inside longer numbers such as "1382".
try this
str = str.replaceAll(",382,", ",");
First, remove the leading comma from the search string. Then collapse the resulting runs of commas into a single comma with a regular expression.
String input = "355,356,357,358,359,360,361,382,363,364,365,366,360,361,363,366,360,361,363,366,360,361,363,366,360,361,363,366,360,361,363,366,360,361,363,366,360,361,363,366,368,369,313,370,371,372,373,374,375,376,377,378,379,380,381,382,382,382,382,382,382,383,384,385,380,381,382,382,382,382,382,386,387,388,389,380,381,382,382,382,382,382,382,390,391,380,381,382,382,382,382,382,392,393,394,395,396,397,398,399";
String result = input.replace("382,", ","); // remove the preceding comma
String result2 = result.replaceAll("[,]+", ","); // replace duplicate commas
System.out.println(result2);
As dave already said, the problem is that your pattern overlaps. In the string "...,382,382,..." there are two occurrences of ",382,":
"...,382,382,..."
    -----          first occurrence
        -----      second occurrence
These two occurrences overlap at the shared comma, so Java can replace only one of them. The search for occurrences runs on the original string and does not take the replacement into account, so it never sees the new ",382," that appears once the first occurrence has been replaced by a single comma.
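The overlap can be reproduced with a short string. The following sketch uses a hypothetical helper named replaceOnce; it only wraps the replace() call from the question.

```java
class OverlapDemo {
    // One pass of the replacement from the question.
    static String replaceOnce(String s) {
        return s.replace(",382,", ",");
    }

    public static void main(String[] args) {
        String once = replaceOnce("1,382,382,2");
        System.out.println(once);              // prints 1,382,2 (the second 382 survives)
        System.out.println(replaceOnce(once)); // prints 1,2 (a second pass removes it)
    }
}
```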
If your data is known not to contain numbers with more than 3 digits, then you might do:
str = str.replace("382,", "");
and then handle an occurrence at the end as a special case. But if your data can contain bigger numbers, then in "...,1382,..." the "382," will be removed as well, leaving the 1 fused onto the following number, which probably is not what you want.
Here are two solutions that do not have the above problem:
First, simply repeat the replacement until no changes occur anymore:
String oldString = str;
str = str.replace(",382,", ",");
while (!str.equals(oldString)) {
oldString = str;
str = str.replace(",382,", ",");
}
After that, you will have to handle possible occurrences at the end of the string.
Second, if you have Java 8, you can do a little more work yourself and use Java streams:
str = Arrays.stream(str.split(","))
.filter(s -> !s.equals("382"))
.collect(Collectors.joining(","));
This first splits the string at ",", then filters out all strings which are equal to "382", and then concatenates the remaining strings again with "," in between.
(Both code snippets are untested.)
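For reference, here is the stream variant as a complete class with the imports it needs. The names RemoveToken and removeToken are chosen for this sketch only.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class RemoveToken {
    // Removes every list element that equals token exactly, so "1382" is left alone.
    static String removeToken(String csv, String token) {
        return Arrays.stream(csv.split(","))
                .filter(s -> !s.equals(token))
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // prints 111,1382,222
        System.out.println(removeToken("382,111,382,1382,222,382", "382"));
    }
}
```

Since whole elements are compared, occurrences at the start or end of the string need no special handling.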
Traditional way:
String str = ",abc,null,null,0,0,7,8,9,10,11,12,13,14";
String newStr = "", word = "";
for (int i=0; i<str.length(); i++) {
if (str.charAt(i) == ',') {
if (word.equals("null") || word.equals("0"))
word = "";
newStr += word+",";
word = "";
} else {
word += str.charAt(i);
if (i == str.length()-1)
newStr += word;
}
}
System.out.println(newStr);
Output:
,abc,,,,,7,8,9,10,11,12,13,14

String.split(String pattern) Java method is not working as intended

I'm using String.split() to break some strings containing IP addresses into their parts, but it's returning an empty array. I worked around the problem with String.substring(), but I'm wondering why split() is not working as intended. My code is:
// filtrarIPs("196.168.0.1 127.0.0.1 255.23.44.1 100.168.100.1 90.168.0.1","168");
public static String filtrarIPs(String ips, String filtro) {
String resultado = "";
String[] lista = ips.split(" ");
for (int c = 0; c < lista.length; c++) {
String[] ipCorta = lista[c].split("."); // Returns an empty array
if (ipCorta[1].compareTo(filtro) == 0) {
resultado += lista[c] + " ";
}
}
return resultado.trim();
}
It should return a String[] like {"196", "168", "0", "1"}....
split works with regular expressions. In regular expression notation, '.' matches any single character. To split on an actual dot, you must escape it like this: split("\\.").
Use
String[] ipCorta = lista[c].split("\\.");
In regular expressions, the . matches almost any character.
If you want to match a literal dot, you have to escape it: \\.
Your statement
lista[c].split(".")
will split the first String "196.168.0.1" at every character, because String.split takes a regular expression as its argument and . matches any character.
However, the reason you are getting an empty array is that split also removes all trailing empty strings from the result.
For example, consider the following statement:
String[] tiles = "aaa".split("a");
This splits the String into nothing but empty values, like ["", "", "", ""]. Because trailing empty values are removed, the resulting array is empty: [].
If you have the following statement:
String[] tiles = "aaab".split("a");
it will split the String into three empty values and one filled value b, like ["", "", "", "b"].
Since the empty values are not trailing here, the result keeps all four values.
To avoid splitting on every character, escape the dot in the regular expression like this:
lista[c].split("\\.")
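The difference between the two calls can be checked with a few lines. This is a small sketch; the class and method names are hypothetical.

```java
import java.util.Arrays;

class DotSplitDemo {
    // Unescaped: "." is a regex that matches every character.
    static String[] splitOnAnyChar(String s) {
        return s.split(".");
    }

    // Escaped: splits on a literal dot only.
    static String[] splitOnLiteralDot(String s) {
        return s.split("\\.");
    }

    public static void main(String[] args) {
        String ip = "196.168.0.1";
        System.out.println(splitOnAnyChar(ip).length);                  // prints 0
        System.out.println(Arrays.toString(splitOnLiteralDot(ip)));     // prints [196, 168, 0, 1]
        System.out.println(Arrays.toString("aaab".split("a")));         // prints [, , , b]
    }
}
```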
String.split() takes a regular expression as its parameter, so you have to escape the period (which otherwise matches any character). So use split("\\.") instead.
This may help you:
public static void main(String[] args){
String ips = "196.168.0.1 127.0.0.1 255.23.44.1 100.168.100.1 90.168.0.1";
String[] lista = ips.split(" ");
for(String s: lista){
for(String s2: s.split("\\."))
System.out.println(s2);
}
}

Split Strings separated by an arbitrary character

Say we would like to write a method that receives an entire book as a string, plus an arbitrary single-character delimiter, and returns the separated strings. I came up with the following implementation in Java (assume there are no consecutive delimiters, etc.):
ArrayList<String> separater(String book, char delimiter) {
    ArrayList<String> ret = new ArrayList<>();
    String word = "";
    for (int i = 0; i < book.length(); ++i) {
        if (book.charAt(i) != delimiter) {
            word += book.charAt(i);
        } else {
            ret.add(word);
            word = "";
        }
    }
    ret.add(word); // add the text after the last delimiter
    return ret;
}
Question: Is there any way to leverage String.split() for a shorter solution? I ask because I could not find a general way of defining a regex for an arbitrary delimiter character:
String.split("\\.") if the delimiter is '.'
String.split("\\s+"); if the delimiter is ' ' // space character
That means I could not find a general way of generating the input regex of method split() from the input character delimiter. Any suggestions?
String[] array = string.split(Pattern.quote(String.valueOf(delimiter)));
That said, the Guava Splitter is much more versatile and better behaved than String.split().
And a note on your method: concatenating to a String in a loop is very inefficient. Because Strings are immutable, it produces a lot of temporary Strings and StringBuilders. You should use a StringBuilder instead.
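Both suggestions can be put side by side in a short sketch. The class name DelimiterSplit and the two method names are made up for this example; Guava is not used here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class DelimiterSplit {
    // Pattern.quote turns the delimiter into a literal, so '.', '|' or '*' need no manual escaping.
    static List<String> separateWithSplit(String book, char delimiter) {
        return Arrays.asList(book.split(Pattern.quote(String.valueOf(delimiter))));
    }

    // Manual loop with a StringBuilder instead of repeated String concatenation.
    static List<String> separateManually(String book, char delimiter) {
        List<String> ret = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < book.length(); i++) {
            char c = book.charAt(i);
            if (c != delimiter) {
                word.append(c);
            } else {
                ret.add(word.toString());
                word.setLength(0); // reuse the builder for the next word
            }
        }
        ret.add(word.toString()); // text after the last delimiter
        return ret;
    }

    public static void main(String[] args) {
        System.out.println(separateWithSplit("a.b.c", '.')); // prints [a, b, c]
        System.out.println(separateManually("x|y|z", '|'));  // prints [x, y, z]
    }
}
```

The two methods differ on input that ends with the delimiter: split() drops trailing empty strings, while the manual loop keeps them.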

About splits in Java

I'm a Java beginner, so please bear with me if this is an extremely easy answer.
Say I have code that looks like this:
String str;
String [] splits;
str = "The words never line up in such a way ";
splits = str.split(" ");
for (int i = 0; i < splits.length; i++)
System.out.println(splits[i]);
What does Java do at the end of the string? After "way" there is a space; since there is no value after the space does Java decide not to split again?
Thanks so much!
According to the Java documentation for split(), http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split(java.lang.String),
split(String r) is equivalent to split(String r, 0), which does not include trailing empty strings in the result. Specifically, from the docs:
"Trailing empty strings are therefore not included in the resulting
array."
So the last element in the array after the split will be "way"
You can confirm this by executing the code you mentioned.
You will not get any empty strings for delimiters at the end of the string if you use the split method. Example:
class Main
{
public static void main (String[] args)
{
String str;
String [] splits;
str = "The words never line up in such a way "; // some empty string after delimiter at end
splits = str.split(" ");
for (int i = 0; i < splits.length; i++)
System.out.println(splits[i]);
System.out.println("END");
}
}
OUTPUT
The
words
never
line
up
in
such
a
way
END
See: no extra strings are produced for the delimiters at the end.
Now
class Main
{
public static void main (String[] args)
{
String str;
String [] splits;
str = "The words never line up in such a way  yeah"; // two spaces before "yeah"
splits = str.split(" ");
for (int i = 0; i < splits.length; i++)
System.out.println(splits[i]);
System.out.println("END");
}
}
OUTPUT
The
words
never
line
up
in
such
a
way

yeah
END
Here the two consecutive spaces produce an extra empty string between "way" and "yeah". It is not a trailing empty string, so it stays in the array and is printed as an empty line.
I've been looking at the Javadoc, and here is what it says about String.split:
This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
So this method calls the two-argument split, whose limit parameter is documented as follows:
The limit parameter controls the number of times the pattern is
applied and therefore affects the length of the resulting array. If
the limit n is greater than zero then the pattern will be applied at
most n - 1 times, the array's length will be no greater than n, and
the array's last entry will contain all input beyond the last matched
delimiter. If n is non-positive then the pattern will be applied as
many times as possible and the array can have any length. If n is zero
then the pattern will be applied as many times as possible, the array
can have any length, and trailing empty strings will be discarded.
thanks
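The effect of the limit argument described in the quoted documentation can be tried out with a short sketch. The class name SplitLimitDemo and the helper split are hypothetical; the input has two trailing spaces on purpose.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Thin wrapper so the three limit cases can be compared on the same input.
    static String[] split(String s, int limit) {
        return s.split(" ", limit);
    }

    public static void main(String[] args) {
        String s = "a b c  "; // two trailing spaces
        System.out.println(split(s, 0).length);          // prints 3: trailing empty strings are discarded
        System.out.println(split(s, -1).length);         // prints 5: trailing empty strings are kept
        System.out.println(Arrays.toString(split(s, 2))); // two elements: "a" and the unsplit rest
    }
}
```

A negative limit is the usual choice when trailing empty fields must be preserved.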

Need to split a string into two parts in java

I have a string which contains a contiguous chunk of digits and then a contiguous chunk of characters. I need to split them into two parts (one integer part, and one string).
I tried using String.split("\\D", 1), but it eats up the first character.
I checked all the String API and didn't find a suitable method.
Is there any method for doing this thing?
Use lookarounds: str.split("(?<=\\d)(?=\\D)")
String[] parts = "123XYZ".split("(?<=\\d)(?=\\D)");
System.out.println(parts[0] + "-" + parts[1]);
// prints "123-XYZ"
\d is the character class for digits; \D is its negation. So this zero-width assertion matches the position where the preceding character is a digit (?<=\d) and the following character is a non-digit (?=\D).
References
regular-expressions.info/Lookarounds and Character Class
Related questions
Java split is eating my characters.
Is there a way to split strings with String.split() and include the delimiters?
Alternate solution using limited split
The following also works:
String[] parts = "123XYZ".split("(?=\\D)", 2);
System.out.println(parts[0] + "-" + parts[1]);
This splits just before we see a non-digit. This is much closer to your original solution, except that since it doesn't actually match the non-digit character, it doesn't "eat it up". Also, it uses limit of 2, which is really what you want here.
API links
String.split(String regex, int limit)
If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter.
There's always an old-fashioned way:
private String[] split(String in) {
int indexOfFirstChar = 0;
for (char c : in.toCharArray()) {
if (Character.isDigit(c)) {
indexOfFirstChar++;
} else {
break;
}
}
return new String[]{in.substring(0,indexOfFirstChar), in.substring(indexOfFirstChar)};
}
(hope it works with digit-only or char-only Strings too - can't test it here - if not, take it as a general idea)
