How to efficiently remove all instances of a String from another String? - java

The problem I am solving is removing all occurrences of one String from another String.
I solved this problem fairly easily on codingbat.com by using String.replaceAll, and doing it until the first String no longer contains the other String.
However, I dislike this method as it is very slow. I have tried searching this website for more efficient methods, and came across these questions:
Fastest way to perform a lot of strings replace in Java
String.replaceAll is considerably slower than doing the job yourself
They solved the problem by using StringUtils and Patterns. I still think these methods are too slow!
When I code problems like these, I like to get my runtime under two seconds with Java. I'm testing this with a String of 1,000,000 characters. String.replaceAll went well over two seconds, and so did the other two methods.
Does anyone have a fast solution for this problem? Thanks!
EDIT: Unfortunately, the answers I received still run too slowly. And yes, I did mean make a new String, not change the old String, sorry for that mistake.
I'm not sure how it would work, but I think looping over each char and checking might work. Something with algorithms.

Strings are immutable so you can't remove stuff from them. Which means that you need to create a new String without the stuff that you want removed. When you use String.replace that is pretty much what it does: it creates a new String.
Beware of String.replaceAll since it uses a regular expression that gets compiled every time you call it (so never use it in a long loop). This is likely your problem.
If you need to use regular expressions, use the Pattern class to compile your regex and reuse the instance to create a new Matcher for each string you process. If you don't reuse your Pattern instance, it is going to be slow.
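To make the Pattern advice concrete, here is a minimal sketch. The class name, the method name and the "and" pattern are made up for illustration; the point is only that the Pattern is compiled once and a fresh Matcher is created per input.

```java
import java.util.regex.Pattern;

class PatternReuseSketch {
    // Compiled once and shared. Pattern instances are immutable and thread-safe.
    // The literal "and" is just an illustrative pattern.
    private static final Pattern AND = Pattern.compile("and");

    // Hypothetical helper: removes every match of the precompiled pattern.
    static String strip(String input) {
        // A new Matcher per input, but no recompilation of the regex.
        return AND.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String[] inputs = {"bands and brands", "sand"};
        for (String s : inputs) {
            System.out.println(strip(s));
        }
    }
}
```

Whether this is fast enough for a 1,000,000-character input is something you would have to measure yourself.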
If you don't need a regular expression, StringUtils has a replaceEach() that does not rely on regular expressions.
If you are processing a large String, you may want to do things in a streaming fashion: loop over the characters and copy the ones you want to keep to a StringBuilder.
Alternatively, you could use a regular expression to search for a particular pattern in the String and loop over the matches it finds and for each match append everything from the previous match to the current match to a StringBuilder.
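A rough sketch of the "copy everything between matches to a StringBuilder" idea, using plain indexOf instead of a regex (my substitution, since the target here is a literal). The class and method names are hypothetical. Note that this is a single pass: it does not re-scan for occurrences that only appear after an earlier removal joins two pieces together.

```java
class SinglePassRemoveSketch {
    // Hypothetical helper: removes each non-overlapping occurrence of target, scanning left to right once.
    static String removeAll(String source, String target) {
        if (source == null || target == null || target.isEmpty()) return source;
        int match = source.indexOf(target);
        if (match < 0) return source; // nothing to remove, no copy needed

        StringBuilder out = new StringBuilder(source.length());
        int copiedUpTo = 0; // everything before this index is already handled
        while (match >= 0) {
            out.append(source, copiedUpTo, match); // text between the previous match and this one
            copiedUpTo = match + target.length();  // skip over the match itself
            match = source.indexOf(target, copiedUpTo);
        }
        out.append(source, copiedUpTo, source.length()); // tail after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeAll("banana", "an"));
    }
}
```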

The problem is that your String is enormous and you only want to move/copy it once; all the solutions that use multiple calls to replace will still end up doing an enormous amount of unnecessary work.
What you really want to use is Apache StringUtils.replaceEachRepeatedly, as that method handles searching for multiple strings while only building the result string once.

Apart from the time that each method (replace, StringUtils, Patterns, ...) takes, you only have one thread doing the work.
If you can split the work between two or more threads, for example with each thread handling a specific range of positions in the string, you may be able to get a faster solution.
The tricky part is dividing the work and then joining the results back together.
That will depend on, for example, how you read the string and where you write it in the end.
Regards,

I have faced the same problem some time ago and came to this post: Replace all occurrences of a String using StringBuilder?
Using the implementation given in the post:
public static void main(String[] args) {
String from = "A really long string full of ands and ors";
String replaceFrom = "and";
String replaceTo = "or";
long initTime = System.nanoTime();
String result1 = from.replace(replaceFrom, replaceTo);
System.out.println("Time1: " + (System.nanoTime() - initTime));
System.out.println(result1);
StringBuilder sb1 = new StringBuilder(from);
initTime = System.nanoTime();
replaceAll(sb1, replaceFrom, replaceTo);
System.out.println("Time2: " + (System.nanoTime() - initTime));
System.out.println(sb1.toString());
}
// From https://stackoverflow.com/questions/3472663/replace-all-occurences-of-a-string-using-stringbuilder
public static void replaceAll(StringBuilder builder, String from, String to) {
int index = builder.indexOf(from);
while (index != -1) {
builder.replace(index, index + from.length(), to);
index += to.length(); // Move to the end of the replacement
index = builder.indexOf(from, index);
}
}
The explanation for the better performance of the second solution is that it relies on StringBuilder, a mutable object, rather than on String, an immutable one. See Immutability of Strings in Java for a better explanation.
This solution will work both using StringBuffer and StringBuilder, but as explained in Difference between StringBuilder and StringBuffer StringBuffer is synchronized and StringBuilder is not, so if you don't need synchronisation you better use StringBuilder.

I just tried this, which resulted in the following timings (in nanoseconds; removeAll first, then the replaceAll loop):
100960923
197642683484
import java.util.Stack;
public class Test {
public static String removeAll(final String stringToModify, final String stringToFindAndRemove) {
// Return the input as-is in the trivial cases; new String(null) would throw a NullPointerException
if (stringToModify==null||stringToModify.length()==0) return stringToModify;
if (stringToFindAndRemove==null||stringToFindAndRemove.length()==0) return stringToModify;
if (stringToModify.length()<stringToFindAndRemove.length()) return stringToModify;
int lastChar = 0;
int buffPos=0;
Stack<Integer>stack = new Stack<Integer>();
char[] chars = stringToModify.toCharArray();
char[] ref = stringToFindAndRemove.toCharArray();
char[] ret = new char[chars.length];
for (int a=0;a<chars.length;a++) {
if (chars[a]==ref[buffPos]) {
if (buffPos==ref.length-1) {
buffPos=0;
stack.pop();
} else {
if (buffPos==0) stack.push(lastChar);
buffPos++;
}
} else {
if (buffPos!=0) {
for (int b=0;b<buffPos;b++) {
ret[lastChar]=ref[b];
lastChar++;
}
a--;
buffPos = 0;
} else {
ret[lastChar]=chars[a];
lastChar++;
}
}
if (stack.size()>0&&(lastChar-stack.peek()>=ref.length)) {
while(stack.size()>0 && (lastChar-stack.peek()>=ref.length)) {
int top = stack.pop();
boolean f = true;
for (int foo=0;foo<ref.length;foo++) {
if (ret[top+foo]!=ref[foo]) {
f=false;
break;
}
}
if (f) lastChar=top;
}
}
}
if (buffPos!=0) {
for (int b=0;b<buffPos;b++) {
ret[lastChar]=ref[b];
lastChar++;
}
}
char[] out = new char[lastChar];
System.arraycopy(ret,0,out,0,lastChar);
return new String(out);
}
public static void main(final String[] args) {
StringBuffer s = new StringBuffer();
StringBuffer un = new StringBuffer();
for (int a=0;a<100000;a++) {
s.append("s");
un.append("un");
}
StringBuffer h = new StringBuffer(s);
h.append(un);
h.append("m");
String huge = h.toString();
String t = "sun";
long startTime = System.nanoTime();
String rep = removeAll(huge,t);
long endTime = System.nanoTime();
long duration = (endTime - startTime);
//System.out.println(rep);
System.out.println(duration);
startTime = System.nanoTime();
rep = new String(huge);
int pos = rep.indexOf(t);
while (pos!=-1) {
rep = rep.replaceAll(t,"");
pos = rep.indexOf(t);
}
endTime = System.nanoTime();
duration = (endTime - startTime);
//System.out.println(rep);
System.out.println(duration);
}
}
I'd be interested to see how fast this runs on someone else's machine. Because my boss thinks my machine is fast enough! :)
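For comparison, the same "remove until nothing is left" idea can be sketched more compactly by treating a StringBuilder as a stack: append one char at a time and, whenever the tail equals the target, cut it off. This is not the code above, just a sketch under my own naming, and it is untimed. For targets that can overlap with themselves, the result may differ from the repeated replaceAll loop.

```java
class RepeatedRemoveSketch {
    // Hypothetical helper: removes target, including occurrences formed by earlier removals.
    static String removeRepeatedly(String source, String target) {
        if (source == null || target == null || target.isEmpty()) return source;
        int m = target.length();
        char last = target.charAt(m - 1);
        StringBuilder out = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            out.append(c);
            // Only compare the tail when the last char could complete a match.
            if (c == last && out.length() >= m && endsWith(out, target)) {
                out.setLength(out.length() - m); // pop the matched tail
            }
        }
        return out.toString();
    }

    // True if the last target.length() chars of sb equal target.
    private static boolean endsWith(StringBuilder sb, String target) {
        int offset = sb.length() - target.length();
        for (int j = 0; j < target.length(); j++) {
            if (sb.charAt(offset + j) != target.charAt(j)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(removeRepeatedly("ssunun", "sun").isEmpty());
    }
}
```

Each input char is appended once and each tail comparison costs at most the target length, so the work is bounded by roughly input length times target length.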

Related

Is there a way to concatenate Java strings in less than O(n) time?

My homework question involves joining strings in a particular sequence. We are first given the strings, followed by a set of instructions that tell us how to concatenate them; finally we print the output string.
I have used the Kattis FastIO class to handle buffered input and output. Below is my algorithm, which iterates through the instructions to concatenate the strings. I have tried making the array of normal strings, StringBuffers and StringBuilders.
The program seems to work as intended, but it gives a time limit error on my submission platform due to inefficiency. It seems like appending the way I did is O(n); is there any faster way?
public class JoinStrings {
public static void main(String[] args) {
Kattio io = new Kattio(System.in, System.out);
ArrayList<StringBuilder> stringList = new ArrayList<StringBuilder>();
int numStrings = io.getInt();
StringBuilder[] stringArray = new StringBuilder[numStrings];
for (int i = 0; i < numStrings; i++) {
String str = io.getWord();
stringArray[i] = new StringBuilder(str);
}
StringBuilder toPrint = stringArray[0];
while (io.hasMoreTokens()) {
int a = io.getInt();
int b = io.getInt();
stringArray[a-1].append(stringArray[b-1]); // this is the line that is done N times
toPrint = stringArray[a-1];
}
io.println(toPrint.toString());
io.flush();
}
}
StringBuilder.append() copies the chars from the new string into the existing one. It's fast, but not free.
Instead of repeatedly appending to the StringBuilders in the array, keep track of the indexes of the strings that need to be appended, in order. Then, at the end, append the strings in that recorded order once and print the result.
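One way the index-tracking idea could look is sketched below. I am assuming that once string b has been appended to string a, b is never used again (that is how I read the problem); the class name, method name and the next/tail arrays are my own invention. Each instruction then only links two chains together in constant time, and the chars are copied once at the end.

```java
class JoinByIndexSketch {
    // words: the input strings; ops: pairs {a, b} (1-based) meaning "append string b to string a".
    // Assumption: after an op, string b is never referenced again.
    static String join(String[] words, int[][] ops) {
        int n = words.length;
        int[] next = new int[n]; // next[i] = index that follows i in its chain, or -1
        int[] tail = new int[n]; // tail[i] = last index of the chain that starts at i
        for (int i = 0; i < n; i++) {
            next[i] = -1;
            tail[i] = i;
        }
        int head = 0; // chain to print; defaults to the first string, as in the question
        for (int[] op : ops) {
            int a = op[0] - 1;
            int b = op[1] - 1;
            next[tail[a]] = b;  // link the end of a's chain to the start of b's chain
            tail[a] = tail[b];  // a's chain now ends where b's chain ended
            head = a;
        }
        StringBuilder out = new StringBuilder();
        for (int i = head; i != -1; i = next[i]) {
            out.append(words[i]); // every string is copied exactly once
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] words = {"cute", "cat", "kattis", "is"};
        int[][] ops = {{3, 2}, {4, 1}, {3, 4}};
        System.out.println(join(words, ops));
    }
}
```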

Efficient Text Processing Java

I have created an application to process log files, but I am hitting a bottleneck when the number of files reaches about 20.
The issue comes from a particular method which takes roughly a second on average to complete, and as you can imagine this isn't practical when it needs to be done more than 50 times.
private String getIdFromLine(String line){
String[] values = line.split("\t");
String newLine = substringBetween(values[4], "Some String : ", "Value=");
String[] split = newLine.split(" ");
return split[1].substring(4, split[1].length());
}
private String substringBetween(String str, String open, String close) {
if (str == null || open == null || close == null) {
return null;
}
int start = str.indexOf(open);
if (start != -1) {
int end = str.indexOf(close, start + open.length());
if (end != -1) {
return str.substring(start + open.length(), end);
}
}
return null;
}
A line comes from the reading of a file which is very efficient so I don't feel a need to post that code unless someone asks.
Is there any way to improve the performance of this at all?
Thanks for your time
A few things are likely problematic:
Whether or not you realized it, you are using regular expressions. The argument to String.split() is treated as a regex. Using String.indexOf() will almost certainly be a faster way to find the particular portion of the String that you want. As HRgiger points out, Guava's splitter is a good choice because it does just that.
You're allocating a bunch of stuff you don't need. Depending on how long your lines are, you could be creating a ton of extra Strings and String[]s (and then garbage collecting them). Another reason to avoid String.split().
I also recommend using String.startsWith() and String.endsWith() rather than all of this stuff that you're doing with indexOf(), if only because it'd be easier to read.
I would try to use regular expressions.
One of the main problems in this code is the "split" method.
For example this one:
private String getIdFromLine3(String line) {
int t_index = -1;
for (int i = 0; i < 3; i++) {
t_index = line.indexOf("\t", t_index+1);
if (t_index == -1) return null;
}
//String[] values = line.split("\t");
String newLine = substringBetween(line.substring(t_index + 1), "Some String : ", "Value=");
// String[] split = newLine.split(" ");
int p_index = newLine.indexOf(" ");
if (p_index == -1) return null;
int p_index2 = newLine.indexOf(" ", p_index+1);
if (p_index2 == -1) return null;
String split = newLine.substring(p_index+1, p_index2);
// return split[1].substring(4, split[1].length());
return split.substring(4, split.length());
}
UPD: It could be 3 times faster.
I would recommend using VisualVM to find the bottleneck before optimising.
If you need performance in your application, you will need profiling anyway.
As an optimisation, I would write a custom loop to replace your substringBetween method and get rid of the multiple indexOf calls.
Google Guava's Splitter is pretty fast as well.
Could you try the regex anyway and post results please just for comparison:
Pattern p = Pattern.compile("(Some String : )(.*?)(Value=)"); // remove the first and last group if not needed (adjust m.group(x) to match)
@Test
public void test2(){
String str = "Long java line with Some String : and some object with Value=154345 ";
System.out.println(substringBetween(str));
}
private String substringBetween(String str) {
Matcher m = p.matcher(str);
if(m.find()){ // find() searches from the start; find(2) would start the search at index 2
return m.group(2);
}else{
return null;
}
}
If this is faster, find a regex that combines both functions.

Most efficient way to fill a String with a specified length with a specified character?

Basically given an int, I need to generate a String with the same length containing only the specified character. Related question here, but it relates to C# and it does matter what's in the String.
This question, and my answer to it are why I am asking this one. I'm not sure what's the best way to go about it performance wise.
Example
Method signature:
String getPattern(int length, char character);
Usage:
//returns "zzzzzz"
getPattern(6, 'z');
What I've tried
String getPattern(int length, char character) {
String result = "";
for (int i = 0; i < length; i++) {
result += character;
}
return result;
}
Is this the best that I can do performance-wise?
You should use StringBuilder instead of concatenating chars this way. Use StringBuilder.append().
StringBuilder will give you better performance. The problem with concatenating the way you are doing it is that each time a new String is created (String is immutable), the old string is copied, the new character is appended, and the old String is thrown away. It's a lot of extra work that over a period of time (like in a big for loop) will cause performance degradation.
StringUtils from commons-lang or Strings from guava are your friends. As already stated avoid String concatenations.
StringUtils.repeat("a", 3) // => "aaa"
Strings.repeat("hey", 3) // => "heyheyhey"
Use primitive char arrays & some standard util classes like Arrays
public class Test {
static String getPattern(int length, char character) {
char[] cArray = new char[length];
Arrays.fill(cArray, character);
// return Arrays.toString(cArray);
return new String(cArray);
}
static String buildPattern(int length, char character) {
StringBuilder sb= new StringBuilder(length);
for (int i = 0; i < length; i++) {
sb.append(character);
}
return sb.toString();
}
public static void main(String args[]){
long time = System.currentTimeMillis();
getPattern(10000000,'c');
time = System.currentTimeMillis() - time;
System.out.println(time); //prints 93
time = System.currentTimeMillis();
buildPattern(10000000,'c');
time = System.currentTimeMillis() - time;
System.out.println(time); //prints 188
}
}
EDIT Arrays.toString() gave lower performance since it eventually used a StringBuilder, but the new String did the magic.
Yikes, no.
A String is immutable in java; you can't change it. When you say:
result += character;
You're creating a new String every time.
You want to use a StringBuilder and append to it, then return a String with its toString() method.
I think it would be more efficient to do it like following,
String getPattern(int length, char character)
{
char[] list = new char[length];
for(int i =0;i<length;i++)
{
list[i] = character;
}
return new String(list);
}
Concatenating a String is never the most efficient, since String is immutable, for better performance you should use StringBuilder, and append()
String getPattern(int length, char character) {
StringBuilder sb= new StringBuilder(length);
for (int i = 0; i < length; i++) {
sb.append(character);
}
return sb.toString();
}
Performance-wise, I think you'd have better results creating a small String and concatenating (using StringBuilder of course) until you reach the requested size: concatenating/appending "zzz" to "zzz" probably performs better than concatenating 'z' three times (well, maybe not for such small numbers, but when you reach 100 or so chars, doing ten concatenations of 'z' followed by ten concatenations of "zzzzzzzzzz" is probably better than 100 concatenations of 'z').
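A sketch of the "append bigger and bigger chunks" idea, for what it's worth. I swapped the StringBuilder for a plain char[] and System.arraycopy, and the doubling strategy is my own choice rather than something from this answer. It is untimed, so treat any speed claim as a guess until you benchmark it.

```java
class RepeatByDoublingSketch {
    // Fills a buffer by copying the already-filled prefix onto the rest,
    // doubling the filled region each round instead of writing one char at a time.
    static String getPattern(int length, char character) {
        if (length <= 0) return "";
        char[] buf = new char[length];
        buf[0] = character;
        int filled = 1; // number of chars at the front of buf that are already set
        while (filled < length) {
            int chunk = Math.min(filled, length - filled); // never copy past the end
            System.arraycopy(buf, 0, buf, filled, chunk);
            filled += chunk;
        }
        return new String(buf);
    }

    public static void main(String[] args) {
        System.out.println(getPattern(6, 'z'));
    }
}
```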
Also, because you ask about GWT, results will vary a lot between DevMode (pure Java) and "production mode" (running in JS in the browser), and is likely to vary depending on the browser.
The only way to really know is to benchmark, everything else is pure speculation.
And possibly use deferred binding to use the most performing variant in each browser (that's exactly how StringBuilder is emulated in GWT).

Is there a way to build a Java String using an SLF4J-style formatting function?

I've heard that using StringBuilder is faster than using string concatenation, but I'm tired of wrestling with StringBuilder objects all of the time. I was recently exposed to the SLF4J logging library and I love the "just do the right thing" simplicity of its formatting when compared with String.format. Is there a library out there that would allow me to write something like:
int myInteger = 42;
MyObject myObject = new MyObject(); // Overrides toString()
String result = CoolFormatingLibrary.format("Simple way to format {} and {}",
myInteger, myObject);
Also, is there any reason (including performance but excluding fine-grained control of date and significant digit formatting) why I might want to use String.format over such a library if it does exist?
Although the Accepted answer is good, if (like me) one is interested in exactly Slf4J-style semantics, then the correct solution is to use Slf4J's MessageFormatter
Here is an example usage snippet:
public static String format(String format, Object... params) {
return MessageFormatter.arrayFormat(format, params).getMessage();
}
(Note that this example discards a last argument of type Throwable)
For concatenating strings one time, the old reliable "str" + param + "other str" is perfectly fine (it's actually converted by the compiler into a StringBuilder).
StringBuilders are mainly useful if you have to keep adding things to the string, but you can't get them all into one statement. For example, take a for loop:
String str = "";
for (int i = 0; i < 1000000; i++) {
str += i + " "; // ignoring the last-iteration problem
}
This will run much slower than the equivalent StringBuilder version:
StringBuilder sb = new StringBuilder(); // for extra speed, define the size
for (int i = 0; i < 1000000; i++) {
sb.append(i).append(" ");
}
String str = sb.toString();
But these two are functionally equivalent:
String str = var1 + " " + var2;
String str2 = new StringBuilder().append(var1).append(" ").append(var2).toString();
Having said all that, my actual answer is:
Check out java.text.MessageFormat. Sample code from the Javadocs:
int fileCount = 1273;
String diskName = "MyDisk";
Object[] testArgs = {new Long(fileCount), diskName};
MessageFormat form = new MessageFormat("The disk \"{1}\" contains {0} file(s).");
System.out.println(form.format(testArgs));
Output:
The disk "MyDisk" contains 1,273 file(s).
There is also a static format method which does not require creating a MessageFormat object.
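A minimal sketch of that static method, reusing the Javadoc pattern from above. The wrapper class and method are only for illustration. Keep in mind that MessageFormat treats apostrophes specially and formats numbers according to the default locale.

```java
import java.text.MessageFormat;

class StaticMessageFormatSketch {
    // Hypothetical wrapper around the static MessageFormat.format(String, Object...) method.
    static String describe(String diskName, int fileCount) {
        // {0} and {1} refer to the arguments by position, so they can appear in any order.
        return MessageFormat.format("The disk \"{1}\" contains {0} file(s).", fileCount, diskName);
    }

    public static void main(String[] args) {
        System.out.println(describe("MyDisk", 12));
    }
}
```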
All such libraries will boil down to string concatenation at their most basic level, so there won't be much performance difference from one to another.
Plus, it is worth bearing in mind that String.format() is a bad implementation of sprintf done with regexps, so if you profile your code you will see Patterns and int[]s that you were not expecting.
MessageFormat and the SLF4J MessageFormatter are generally faster and allocate less junk.

Replace the first letter of a String in Java?

I'm trying to convert the first letter of a string to lowercase.
value.substring(0,1).toLowerCase() + value.substring(1)
This works, but are there any better ways to do this?
I could use a replace function, but Java's replace doesn't accept an index. You have to pass the actual character/substring. It could be done like this:
value.replaceFirst(value.charAt(0), value.charAt(0).toLowerCase())
Except that replaceFirst expects 2 strings, so the value.charAt(0)s would probably need to be replaced with value.substring(0,1).
Is there any standard way to replace the first letter of a String?
I would suggest you take a look at the Commons Lang library from Apache. It has a StringUtils class that lets you do a lot of tasks with Strings. In your case, just use
StringUtils.uncapitalize( value )
Read here about uncapitalize as well as the other functionality of the class.
Added: in my experience Commons Lang is quite well optimized, so if you want to know what is better from an algorithmic point of view, you could take a look at its source from Apache.
The downside of the code you used (and I've used in similar situations) is that it seems a bit clunky and in theory generates at least two temporary strings that are immediately thrown away. There's also the issue of what happens if your string is fewer than two characters long.
The upside is that you don't reference those temporary strings outside the expression (leaving it open to optimization by the bytecode compiler or the JIT optimizer) and your intent is clear to any future code maintainer.
Barring your needing to do several million of these any given second and detecting a noticeable performance issue doing so, I wouldn't worry about performance and would prefer clarity. I'd also bury it off in a utility class somewhere. :-) See also jambjo's response to another answer pointing out that there's an important difference between String#toLowerCase and Character.toLowerCase. (Edit: The answer and therefore comment have been removed. Basically, there's a big difference related to locales and Unicode and the docs recommend using String#toLowerCase, not Character.toLowerCase; more here.)
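On the short-string concern mentioned above: as far as I can tell, a one-character string is fine with the original expression, and only the empty string (or null) actually fails, because substring(0,1) throws. A utility method along these lines would cover that. The names are mine, and Locale.ROOT is an assumption on my part; the original expression uses the default locale.

```java
import java.util.Locale;

class UncapitalizeSketch {
    // Hypothetical utility: lowercases the first character, leaving null and "" untouched.
    static String uncapitalize(String value) {
        if (value == null || value.isEmpty()) return value; // substring(0, 1) would throw on ""
        // String-based lowercasing, as recommended above; Locale.ROOT is an assumed choice.
        return value.substring(0, 1).toLowerCase(Locale.ROOT) + value.substring(1);
    }

    public static void main(String[] args) {
        System.out.println(uncapitalize("Testing"));
    }
}
```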
Edit Because I'm in a weird mood, I thought I'd see if there was a measurable difference in performance in a simple test. There is. It could be because of the locale difference (e.g., apples vs. oranges):
public class Uncap
{
public static final void main(String[] params)
{
String s;
String s2;
long start;
long end;
int counter;
// Warm up
s = "Testing";
start = System.currentTimeMillis();
for (counter = 1000000; counter > 0; --counter)
{
s2 = uncap1(s);
s2 = uncap2(s);
s2 = uncap3(s);
}
// Test v2
start = System.currentTimeMillis();
for (counter = 1000000; counter > 0; --counter)
{
s2 = uncap2(s);
}
end = System.currentTimeMillis();
System.out.println("2: " + (end - start));
// Test v1
start = System.currentTimeMillis();
for (counter = 1000000; counter > 0; --counter)
{
s2 = uncap1(s);
}
end = System.currentTimeMillis();
System.out.println("1: " + (end - start));
// Test v3
start = System.currentTimeMillis();
for (counter = 1000000; counter > 0; --counter)
{
s2 = uncap3(s);
}
end = System.currentTimeMillis();
System.out.println("3: " + (end - start));
System.exit(0);
}
// The simple, direct version; also allows the library to handle
// locales and Unicode correctly
private static final String uncap1(String s)
{
return s.substring(0,1).toLowerCase() + s.substring(1);
}
// This will *not* handle locales and unicode correctly
private static final String uncap2(String s)
{
return Character.toLowerCase(s.charAt(0)) + s.substring(1);
}
// This will *not* handle locales and unicode correctly
private static final String uncap3(String s)
{
StringBuffer sb;
sb = new StringBuffer(s);
sb.setCharAt(0, Character.toLowerCase(sb.charAt(0)));
return sb.toString();
}
}
I mixed up the order in various tests (moving them around and recompiling) to avoid issues of ramp-up time (and tried to force some initially anyway). Very unscientific, but uncap1 was consistently slower than uncap2 and uncap3 by about 40%. Not that it matters, we're talking a difference of 400ms across a million iterations on an Intel Atom processor. :-)
So: I'd go with your simple, straightforward code, wrapped up in a utility function.
Watch out for any of the character functions in strings. Because of unicode, it is not always a 1 to 1 mapping. Stick to string based methods unless char is really what you want. As others have suggested, there are string utils out there, but even if you don't want to use them for your project, just make one yourself as you work. The worst thing you can do is to make a special function for lowercase and hide it in a class and then use the same code slightly differently in 12 different places. Put it somewhere it can easily be shared.
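A small illustration of the "not always 1 to 1" point. I am using uppercase rather than lowercase here because the German sharp s is the example I am sure about: the String method expands it to two characters, while the char method has no single-char answer and returns it unchanged. The class and method names are made up.

```java
import java.util.Locale;

class CaseMappingSketch {
    // String-based mapping: may change the length of the text.
    static String viaString(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    // char-based mapping: always one char in, one char out.
    static char viaChar(char c) {
        return Character.toUpperCase(c);
    }

    public static void main(String[] args) {
        String sharpS = "\u00DF"; // German sharp s
        System.out.println(viaString(sharpS).length());
        System.out.println(viaChar(sharpS.charAt(0)) == sharpS.charAt(0));
    }
}
```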
Use StringBuffer:
buffer.setCharAt(0, Character.toLowerCase(buffer.charAt(0)));
