I would like an answer that explains why the idea described below, shown on a very simple example, is commonly considered bad, and what its weaknesses are.
I have a sentence and my goal is to make every second word uppercase. The starting point for both approaches is exactly the same:
String sentence = "Hi, this is just a simple short sentence";
String[] split = sentence.split(" ");
The traditional and procedural approach is:
StringBuilder stringBuilder = new StringBuilder();
for (int i=0; i<split.length; i++) {
if (i%2==0) {
stringBuilder.append(split[i]);
} else {
stringBuilder.append(split[i].toUpperCase());
}
if (i<split.length-1) { stringBuilder.append(" "); }
}
When I want to use java-stream, I am limited by the constraint that variables used in a lambda expression must be final or effectively final. I have to use a workaround with a single-element array, which was suggested in the first comment on my question How to increment a value in Java Stream. Here is the example:
int index[] = {0};
String result = Arrays.stream(split)
.map(i -> index[0]++%2==0 ? i : i.toUpperCase())
.collect(Collectors.joining(" "));
Yeah, it's a bad solution, and I have heard a few good reasons hidden somewhere in the comments of a question I am unable to find (if you remind me of some of them, I'd upvote twice if possible). But what if I use AtomicInteger: does it make any difference, and is it a good and safe way with no side effects compared to the previous one?
AtomicInteger atom = new AtomicInteger(0);
String result = Arrays.stream(split)
.map(i -> atom.getAndIncrement()%2==0 ? i : i.toUpperCase())
.collect(Collectors.joining(" "));
Regardless of how ugly it might look to anyone, I am asking for a description of the possible weaknesses and the reasons behind them. I don't care about performance, only about the design and possible weaknesses of the 2nd solution.
Please don't associate AtomicInteger with multi-threading issues here. I used this class because it reads, increments and stores the value in the way I need for this example.
As I often say in my answers, the Java Stream API is not a silver bullet for everything. My goal is to explore and find the edge where that statement applies, since I find the last snippet quite clear, readable and brief compared to the StringBuilder snippet.
Edit: Is there an alternative approach that works for the snippets above and avoids these issues when both the item and its index are needed while iterating with the Stream API?
The documentation of the java.util.stream package states that:
Side-effects in behavioral parameters to stream operations are, in general, discouraged, as they can often lead to unwitting violations of the statelessness requirement, as well as other thread-safety hazards.
[...]
The ordering of side-effects may be surprising. Even when a pipeline is constrained to produce a result that is consistent with the encounter order of the stream source (for example, IntStream.range(0,5).parallel().map(x -> x*2).toArray() must produce [0, 2, 4, 6, 8]), no guarantees are made as to the order in which the mapper function is applied to individual elements, or in what thread any behavioral parameter is executed for a given element.
This means that the elements may be processed out of order, and thus the Stream-solutions may produce wrong results.
This is (at least for me) a killer argument against the two Stream-solutions.
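To make the hazard concrete, here is a minimal sketch contrasting the stateful mapper from the question with a stateless, index-based one. The class and method names are made up for illustration. The stateful variant is run in parallel on purpose; its output is not guaranteed and may differ from run to run, so no expected output is given for it.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class EverySecondUpperDemo {

    // Stateful mapper (the AtomicInteger idea from the question), run in parallel.
    // The counter value seen for a word depends on when the mapper happens to run,
    // so the "every second word" pattern is not guaranteed.
    static String stateful(String[] words) {
        AtomicInteger counter = new AtomicInteger(0);
        return Arrays.stream(words)
                .parallel()
                .map(w -> counter.getAndIncrement() % 2 == 0 ? w : w.toUpperCase())
                .collect(Collectors.joining(" "));
    }

    // Stateless mapper: the result for index i depends only on i and words[i],
    // so the order in which the mapper is applied does not matter.
    static String stateless(String[] words) {
        return IntStream.range(0, words.length)
                .parallel()
                .mapToObj(i -> i % 2 == 0 ? words[i] : words[i].toUpperCase())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String[] split = "Hi, this is just a simple short sentence".split(" ");
        System.out.println("stateful : " + stateful(split));
        System.out.println("stateless: " + stateless(split));
    }
}
```

Only the stateless version can be relied on, because Collectors.joining assembles the pieces in encounter order regardless of which thread mapped which element.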
By the process of elimination, we only have the "traditional solution" left. And honestly, I do not see anything wrong with this solution. If we wanted to get rid of the for-loop, we could re-write this code using a foreach-loop:
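StringBuilder stringBuilder = new StringBuilder();
boolean toUpper = false; // 1st String is not capitalized
for (String word : split) {
    stringBuilder.append(toUpper ? word.toUpperCase() : word).append(' ');
    toUpper = !toUpper;
}
if (split.length > 0) { stringBuilder.setLength(stringBuilder.length() - 1); } // drop the trailing space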
For a streamified and (as far as I know) correct solution, take a look at Octavian R.'s answer.
Your question wrt. the "limits of streams" is opinion-based.
The answer to the question(s) ends here. The rest is my opinion and should be regarded as such.
In Octavian R.'s solution, an artificial index set is created through an IntStream, which is then used to access the String[]. For me, this has a higher cognitive complexity than a simple for- or foreach-loop and I do not see any benefit in using streams instead of loops in this situation.
In Java, compared with Scala, you must be inventive. One solution without mutation is this one:
String sentence = "Hi, this is just a simple short sentence";
String[] split = sentence.split(" ");
String result = IntStream.range(0, split.length)
.mapToObj(i -> i%2==0 ? split[i] : split[i].toUpperCase())
.collect(Collectors.joining(" "));
System.out.println(result);
In Java streams you should avoid mutation. Your solution with AtomicInteger is ugly and it's bad practice.
Kind regards!
As explained in Turing85’s answer, your stream solutions are not correct, as they rely on the processing order, which is not guaranteed. This can lead to incorrect results with parallel execution today, but even if it happens to produce the desired result with a sequential stream, that’s only an implementation detail. It’s not guaranteed to work.
Besides that, there is no advantage in rewriting code to use the Stream API with a logic that basically still is a loop, but obfuscated with a different API. The best way to describe the idea of the new APIs, is to say that you should express what to do but not how.
Starting with Java 9, you could implement the same thing as
String result = Pattern.compile("( ?+[^ ]* )([^ ]*)").matcher(sentence)
.replaceAll(m -> m.group(1)+m.group(2).toUpperCase());
which expresses the wish to replace every second word with its upper case form, but doesn’t express how to do it. That’s up to the library, which likely uses a single StringBuilder instead of splitting into an array of strings, but that’s irrelevant to the application logic.
As long as you’re using Java 8, I’d stay with the loop and even when switching to a newer Java version, I would consider replacing the loop as not being an urgent change.
The pattern in the above example has been written in a way to do exactly the same as your original code splitting at single space characters. Usually, I’d encode “replace every second word” more like
String result = Pattern.compile("(\\w+\\W+)(\\w+)").matcher(sentence)
.replaceAll(m -> m.group(1)+m.group(2).toUpperCase());
which would behave differently when encountering multiple spaces or other separators, but usually is closer to the actual intention.
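As a rough sketch of how the two patterns can be compared side by side (Java 9 or later, since it relies on Matcher.replaceAll accepting a function): the class name, constant names and the second sample input are invented for illustration. Wrapping the result in Matcher.quoteReplacement is an addition of mine, so that a word containing `$` or `\` is not interpreted as a group reference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EverySecondWordRegex {

    // Mirrors splitting at single space characters.
    static final Pattern SINGLE_SPACE = Pattern.compile("( ?+[^ ]* )([^ ]*)");

    // Treats runs of word characters as words and anything else as separators.
    static final Pattern WORDS = Pattern.compile("(\\w+\\W+)(\\w+)");

    static String upperSecond(Pattern pattern, String sentence) {
        return pattern.matcher(sentence).replaceAll(m ->
                // quoteReplacement keeps '$' and '\' in the text literal
                Matcher.quoteReplacement(m.group(1) + m.group(2).toUpperCase()));
    }

    public static void main(String[] args) {
        String sentence = "Hi, this is just a simple short sentence";
        System.out.println(upperSecond(SINGLE_SPACE, sentence));
        System.out.println(upperSecond(WORDS, sentence));

        // With repeated separators the two patterns can disagree.
        String doubled = "one  two three  four";
        System.out.println(upperSecond(SINGLE_SPACE, doubled));
        System.out.println(upperSecond(WORDS, doubled));
    }
}
```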
I'm fairly inexperienced with using objects so I would really like some input.
I'm trying to remove comments from a list that have certain "unwanted words" in them, both the comments and the list of "unwanted words" are in ArrayList objects.
This is inside of a class called FormHelper, which contains the private member comments as an ArrayList, the auditList ArrayList is created locally in a member function called populateComments(), which then calls this function (below). PopulateComments() is called by the constructor, and so this function only gets called once, when an instance of FormHelper is created.
private void filterComments(ArrayList <String> auditList) {
for(String badWord : auditList) {
for (String thisComment : this.comments) {
if(thisComment.contains(badWord)) {
int index = this.comments.indexOf(thisComment);
this.comments.remove(index);
}
}
}
}
Something about the way I implemented this doesn't feel right, and I'm also concerned that I'm using ArrayList functions inefficiently. Is my suspicion correct?
It is not particularly efficient. However, finding a more efficient solution is not straightforward.
Let's step back to a simpler problem.
private void findBadWords(List <String> wordList, List <String> auditList) {
for(String badWord : auditList) {
for (String word : wordList) {
if (word.equals(badWord)) {
System.err.println("Found a bad word");
}
}
}
}
Suppose that wordList contains N words and auditList contains M words. Some simple analysis will show that the inner loop is executed N x M times. The N factor is unavoidable, but the M factor is disturbing. It means that the more "bad" words you have to check for the longer it takes to check.
There is a better way to do this:
private void findBadWords(List <String> wordList, HashSet<String> auditWords) {
for (String word : wordList) {
if (auditWords.contains(word)) {
System.err.println("Found a bad word");
}
}
}
Why is that better? It is better (faster) because HashSet::contains doesn't need to check all of the audit words one at a time. In fact, in the optimal case it will check none of them (!) and the average case just one or two of them. (I won't go into why, but if you want to understand read the Wikipedia page on hash tables.)
But your problem is more complicated. You are using String::contains to test if each comment contains each bad word. That is not a simple string equality test (as per my simplified version).
What to do?
Well, one potential solution is to split the comments into an array of words (e.g. using String::split) and then use the HashSet lookup approach. However:
That changes the behavior of your code. (In a good way actually: read up on the Scunthorpe problem!) You will now only match the audit words if they are actual words in the comment text.
Splitting a string into words is not cheap. If you use String::split it entails creating and using a Pattern object to find the word boundaries, creating substrings for each word and putting them into an array. You can probably do better, but it is always going to be a non-trivial calculation.
So the real question will be whether the optimization is going to pay off. That is ultimately going to depend on the value of M, i.e. the number of bad words you are looking for. The larger M is, the more likely it is that splitting the comments into words and using a HashSet to test them will pay off.
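A minimal sketch of the split-and-lookup idea, assuming that "word" means a run of regex word characters and that matching is case-sensitive. The class and method names are hypothetical and not part of the original FormHelper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class WordSetFilter {

    // True if any whole word of the comment is in the audit set.
    static boolean containsAuditWord(String comment, Set<String> auditWords) {
        for (String word : comment.split("\\W+")) { // assumption: words are separated by non-word characters
            if (auditWords.contains(word)) {
                return true;
            }
        }
        return false;
    }

    // Removes every comment that contains at least one audit word.
    static void filterComments(List<String> comments, Set<String> auditWords) {
        comments.removeIf(c -> containsAuditWord(c, auditWords));
    }

    public static void main(String[] args) {
        Set<String> audit = new HashSet<>(Arrays.asList("spam", "scam"));
        List<String> comments = new ArrayList<>(
                Arrays.asList("nice article", "this is spam", "spammers unite"));
        filterComments(comments, audit);
        System.out.println(comments);
    }
}
```

Note that "spammers unite" survives here, because only whole words are compared against the set; that is the behavior change described above.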
Another possible solution doesn't involve splitting the comments. You could take the list of audit words and assemble them into a single regex like this: \b(word-1|word-2|...|word-n)\b. Then use this regex with Matcher::find to search each comment string for bad words. The performance will depend on the optimizing capability of the regex engine in your Java platform. It has the potential to be faster than splitting.
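The single-regex idea could look roughly like this. It is a sketch under a few assumptions: the audit list is non-empty (an empty alternation would match at every word boundary), and each audit word starts and ends with a word character so that `\b` behaves as intended. Pattern.quote is used so that audit words containing regex metacharacters are taken literally. The class and method names are invented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class AuditRegex {

    // Builds \b(?:word-1|word-2|...|word-n)\b from the audit words.
    static Pattern compile(List<String> auditWords) {
        if (auditWords.isEmpty()) {
            throw new IllegalArgumentException("audit list must not be empty");
        }
        String alternation = auditWords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b");
    }

    static boolean containsAuditWord(Pattern auditPattern, String comment) {
        return auditPattern.matcher(comment).find();
    }

    public static void main(String[] args) {
        Pattern p = compile(Arrays.asList("spam", "scam"));
        System.out.println(containsAuditWord(p, "this is spam!"));
        System.out.println(containsAuditWord(p, "spammers unite"));
    }
}
```

The pattern is compiled once and reused for every comment, which is where any speed-up over repeated splitting would come from.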
My advice would be to benchmark and profile your entire application before you start. Only optimize:
when the benchmarking says that the overall performance of the requests where this comment checking occurs is concerning. (If it is OK, don't waste your time optimizing.)
when the profiling says that this method is a performance hotspot. (There is a good chance that the real hotspots are somewhere else. If so, you should optimize them rather than this method.)
Note there is an assumption that you have (sufficiently) completed your application and created a realistic benchmark for it before you think about optimizing. (Premature optimization is a bad idea ... unless you really know what you are doing.)
As a general approach, removing individual elements from an ArrayList in a loop is inefficient, because it requires shifting all of the "following" elements along one position in the array.
A B C D E
^ if you remove this
^---^ you have to shift these 3 along by one
/ / /
A C D E
If you remove lots of elements, this will have a substantial impact on the time complexity. It's better to identify the elements to remove, and then remove them all at once.
I suggest that a neater way to do this would be using removeIf, which (at least for collection implementations such as ArrayList) does this "all at once" removal:
this.comments.removeIf(
c -> auditList.stream().anyMatch(c::contains));
This is concise, but probably quite slow because it has to keep checking the entire comment string to see if it contains each bad word.
A probably faster way would be to use regex:
Pattern p = Pattern.compile(
auditList.stream()
.map(Pattern::quote)
.collect(Collectors.joining("|")));
this.comments.removeIf(
c -> p.matcher(c).find());
This would be better because the compiled regex would search for all of the bad words in a single pass over each comment.
The other advantage of a regex-based approach is that you can check case insensitively, by supplying the appropriate flag when compiling the regex.
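For the case-insensitive variant, a sketch along these lines should do. The class and method names are hypothetical; the guard for an empty audit list is my addition, because an empty alternation would match (and therefore remove) every comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class CaseInsensitiveFilter {

    static void removeMatching(List<String> comments, List<String> auditList) {
        if (auditList.isEmpty()) {
            return; // nothing to filter; avoids compiling an empty pattern
        }
        Pattern p = Pattern.compile(
                auditList.stream()
                        .map(Pattern::quote)
                        .collect(Collectors.joining("|")),
                // UNICODE_CASE extends case folding beyond US-ASCII
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        comments.removeIf(c -> p.matcher(c).find());
    }

    public static void main(String[] args) {
        List<String> comments = new ArrayList<>(
                Arrays.asList("Nice post", "Buy SPAM now", "great"));
        removeMatching(comments, Arrays.asList("spam"));
        System.out.println(comments);
    }
}
```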
In Java there are a bunch of methods that all have to do with manipulating Strings.
The simplest example is the String.split("something") method.
Now the actual definition of many of those methods is that they all take a regular expression as their input parameter(s). Which makes them all very powerful building blocks.
Now there are two effects you'll see in many of those methods:
They recompile the expression each time the method is invoked. As such they impose a performance impact.
I've found that in most "real-life" situations these methods are called with "fixed" texts. The most common usage of the split method is even worse: It's usually called with a single char (usually a ' ', a ';' or a '&') to split by.
So it's not only that the default methods are powerful, they also seem overpowered for what they are actually used for. Internally we've developed a "fastSplit" method that splits on fixed strings. I wrote a test at home to see how much faster I could do it if it was known to be a single char. Both are significantly faster than the "standard" split method.
So I was wondering: why was the Java API chosen the way it is now?
What was the good reason to go for this instead of having something like split(char), split(String) and splitRegex(String)?
Update: I slapped together a few calls to see how much time the various ways of splitting a string would take.
Short summary: It makes a big difference!
I did 10000000 iterations for each test case, always using the input
"aap,noot,mies,wim,zus,jet,teun"
and always using ',' or "," as the split argument.
This is what I got on my Linux system (it's an Atom D510 box, so it's a bit slow):
fastSplit STRING
Test 1 : 11405 milliseconds: Split in several pieces
Test 2 : 3018 milliseconds: Split in 2 pieces
Test 3 : 4396 milliseconds: Split in 3 pieces
homegrown fast splitter based on char
Test 4 : 9076 milliseconds: Split in several pieces
Test 5 : 2024 milliseconds: Split in 2 pieces
Test 6 : 2924 milliseconds: Split in 3 pieces
homegrown splitter based on char that always splits in 2 pieces
Test 7 : 1230 milliseconds: Split in 2 pieces
String.split(regex)
Test 8 : 32913 milliseconds: Split in several pieces
Test 9 : 30072 milliseconds: Split in 2 pieces
Test 10 : 31278 milliseconds: Split in 3 pieces
String.split(regex) using precompiled Pattern
Test 11 : 26138 milliseconds: Split in several pieces
Test 12 : 23612 milliseconds: Split in 2 pieces
Test 13 : 24654 milliseconds: Split in 3 pieces
StringTokenizer
Test 14 : 27616 milliseconds: Split in several pieces
Test 15 : 28121 milliseconds: Split in 2 pieces
Test 16 : 27739 milliseconds: Split in 3 pieces
As you can see it makes a big difference if you have a lot of "fixed char" splits to do.
To give you guys some insight; I'm currently in the Apache logfiles and Hadoop arena with the data of a big website. So to me this stuff really matters :)
Something I haven't factored in here is the garbage collector. As far as I can tell compiling a regular expression into a Pattern/Matcher/.. will allocate a lot of objects, that need to be collected some time. So perhaps in the long run the differences between these versions is even bigger .... or smaller.
My conclusions so far:
Only optimize this if you have a LOT of strings to split.
If you use the regex methods always precompile if you repeatedly use the same pattern.
Forget the (obsolete) StringTokenizer
If you want to split on a single char then use a custom method, especially if you only need to split it into a specific number of pieces (like ... 2).
P.S. I'm giving you all my homegrown split by char methods to play with (under the license that everything on this site falls under :) ). I never fully tested them .. yet. Have fun.
private static String[]
stringSplitChar(final String input,
final char separator) {
int pieces = 0;
// First we count how many pieces we will need to store ( = separators + 1 )
int position = -1; // start before index 0 so that a separator at the very first position is counted too
do {
pieces++;
position = input.indexOf(separator, position + 1);
} while (position != -1);
// Then we allocate memory
final String[] result = new String[pieces];
// And start cutting and copying the pieces.
int previousposition = 0;
int currentposition = input.indexOf(separator);
int piece = 0;
final int lastpiece = pieces - 1;
while (piece < lastpiece) {
result[piece++] = input.substring(previousposition, currentposition);
previousposition = currentposition + 1;
currentposition = input.indexOf(separator, previousposition);
}
result[piece] = input.substring(previousposition);
return result;
}
private static String[]
stringSplitChar(final String input,
final char separator,
final int maxpieces) {
if (maxpieces <= 0) {
return stringSplitChar(input, separator);
}
int pieces = maxpieces;
// Then we allocate memory
final String[] result = new String[pieces];
// And start cutting and copying the pieces.
int previousposition = 0;
int currentposition = input.indexOf(separator);
int piece = 0;
final int lastpiece = pieces - 1;
while (currentposition != -1 && piece < lastpiece) {
result[piece++] = input.substring(previousposition, currentposition);
previousposition = currentposition + 1;
currentposition = input.indexOf(separator, previousposition);
}
result[piece] = input.substring(previousposition);
// All remaining array elements are uninitialized and assumed to be null
return result;
}
private static String[]
stringChop(final String input,
final char separator) {
String[] result;
// Find the separator.
final int separatorIndex = input.indexOf(separator);
if (separatorIndex == -1) {
result = new String[1];
result[0] = input;
}
else {
result = new String[2];
result[0] = input.substring(0, separatorIndex);
result[1] = input.substring(separatorIndex + 1);
}
return result;
}
Note that the regex need not be recompiled each time. From the Javadoc:
An invocation of this method of the form str.split(regex, n) yields the same result as the expression
Pattern.compile(regex).split(str, n)
That is, if you are worried about performance, you may precompile the pattern and then reuse it:
Pattern p = Pattern.compile(regex);
...
String[] tokens1 = p.split(str1);
String[] tokens2 = p.split(str2);
...
instead of
String[] tokens1 = str1.split(regex);
String[] tokens2 = str2.split(regex);
...
I believe that the main reason for this API design is convenience. Since regular expressions include all "fixed" strings/chars too, it simplifies the API to have one method instead of several. And if someone is worried about performance, the regex can still be precompiled as shown above.
My feeling (which I can't back with any statistical evidence) is that in most cases String.split() is used in a context where performance is not an issue. E.g. it is a one-off action, or the performance difference is negligible compared to other factors. IMO the cases where you split strings using the same regex thousands of times in a tight loop, where performance optimization indeed makes sense, are rare.
It would be interesting to see a performance comparison of a regex matcher implementation with fixed strings/chars compared to that of a matcher specialized to these. The difference might not be big enough to justify the separate implementation.
I wouldn't say most string manipulations are regex-based in Java. Really we are only talking about split and replaceAll/replaceFirst. But I agree, it's a big mistake.
Apart from the ugliness of having a low-level language feature (strings) becoming dependent on a higher-level feature (regex), it's also a nasty trap for new users who might naturally assume that a method with the signature String.replaceAll(String, String) would be a string-replace function. Code written under that assumption will look like it's working, until a regex-special character creeps in, at which point you've got confusing, hard-to-debug (and maybe even security-significant) bugs.
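A small sketch of the trap, with invented class and method names. The first method treats its argument as a regex, so a harmless-looking "." matches every character; the other two show two ways to get literal behavior using standard library calls.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceTrap {

    // Looks like a plain string replace, but 'find' is compiled as a regex.
    static String regexReplace(String s, String find, String with) {
        return s.replaceAll(find, with);
    }

    // Literal replace through the regex API: quote both the pattern and the replacement.
    static String quotedReplace(String s, String find, String with) {
        return s.replaceAll(Pattern.quote(find), Matcher.quoteReplacement(with));
    }

    // Literal replace using String.replace(CharSequence, CharSequence).
    static String literalReplace(String s, String find, String with) {
        return s.replace(find, with);
    }

    public static void main(String[] args) {
        System.out.println(regexReplace("a.b.c", ".", "-"));   // prints "-----"
        System.out.println(quotedReplace("a.b.c", ".", "-"));  // prints "a-b-c"
        System.out.println(literalReplace("a.b.c", ".", "-")); // prints "a-b-c"
    }
}
```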
It's amusing that a language that can be so pedantically strict about typing made the sloppy mistake of treating a string and a regex as the same thing. It's less amusing that there's still no built-in method to do a plain string split: you have to pass a Pattern.quote()d string to the regex-based split. A plain replace does exist, String.replace(CharSequence, CharSequence), but like Pattern.quote it only arrived in Java 5. Hopeless.
@Tim Pietzcker:
Are there other languages that do the same?
JavaScript's Strings are partly modelled on Java's and are also messy in the case of replace(). By passing in a string, you get a plain string replace, but it only replaces the first match, which is rarely what's wanted. To get a replace-all you have to pass in a RegExp object with the /g flag, which again has problems if you want to create it dynamically from a string (there is no built-in RegExp.quote method in JS). Luckily, split() is purely string-based, so you can use the idiom:
s.split(findstr).join(replacestr)
Plus of course Perl does absolutely everything with regexen, because it's just perverse like that.
(This is a comment more than an answer, but is too big for one. Why did Java do this? Dunno, they made a lot of mistakes in the early days. Some of them have since been fixed. I suspect if they'd thought to put regex functionality in the box marked Pattern back in 1.0, the design of String would be cleaner to match.)
I imagine a good reason is that they can simply pass the buck on to the regex method, which does all the real heavy lifting for all of the string methods. I'm guessing they thought that if they already had a working solution, it was less efficient, from a development and maintenance standpoint, to reinvent the wheel for each string manipulation method.
Interesting discussion!
Java was not originally intended as a batch programming language. As such, the out-of-the-box APIs are tuned more towards doing one "replace", one "parse", etc., except at application initialization, when the app may be expected to parse a bunch of configuration files.
Hence optimization of these APIs was sacrificed on the altar of simplicity, IMO. But the question brings up an important point. Python's desire to keep the regex distinct from the non-regex in its API stems from the fact that Python can be used as an excellent scripting language as well. In UNIX too, the original versions of fgrep did not support regex.
I was engaged in a project where we had to do some amount of ETL work in java. At that time, I remember coming up with the kind of optimizations that you have alluded to, in your question.
I suspect that the reason why things like String#split(String) use regexp under the hood is because it involves less extraneous code in the Java Class Library. The state machine resulting from a split on something like , or space is so simple that it is unlikely to be significantly slower to execute than a statically implemented equivalent using a StringCharacterIterator.
Beyond that the statically implemented solution would complicate runtime optimization with the JIT because it would be a different block of code that also requires hot code analysis. Using the existing Pattern algorithms regularly across the library means that they are more likely candidates for JIT compilation.
Very good question..
I suppose when the designers sat down to look at this (and not for very long, it seems), they came at it from a point of view that it should be designed to suit as many different possibilities as possible. Regular Expressions offered that flexibility.
They didn't think in terms of efficiencies. There is the Java Community Process available to raise this.
Have you looked at using the java.util.regex.Pattern class, where you compile the expression once and then use it on different strings?
Pattern exp = Pattern.compile(":");
String[] array = exp.split(sourceString1);
String[] array2 = exp.split(sourceString2);
In looking at the Java String class, the uses of regex seem reasonable, and there are alternatives if regex is not desired:
http://java.sun.com/javase/6/docs/api/java/lang/String.html
boolean matches(String regex) - A regex seems appropriate, otherwise you could just use equals
String replaceAll/replaceFirst(String regex, String replacement) - There are equivalents that take CharSequence instead, preventing regex.
String[] split(String regex, int limit) - A powerful but expensive split, you can use StringTokenizer to split by tokens.
These are the only functions I saw that took regex.
Edit: After seeing that StringTokenizer is legacy, I would defer to Péter Török's answer to precompile the regex for split instead of using the tokenizer.
The answer to your question is that the Java core API did it wrong. For day to day work you can consider using Guava libraries' CharMatcher which fills the gap beautifully.
...why was the Java API chosen the way it is now?
Short answer: it wasn't. Nobody ever decided to favor regex methods over non-regex methods in the String API, it just worked out that way.
I always understood that Java's designers deliberately kept the string-manipulation methods to a minimum, in order to avoid API bloat. But when regex support came along in JDK 1.4, of course they had to add some convenience methods to String's API.
So now users are faced with a choice between the immensely powerful and flexible regex methods, and the bone-basic methods that Java always offered.
I would like to know other people's opinion on the following style of writing a for loop:
for (int rep = numberOfReps; rep --> 0 ;) {
// do something that you simply want to repeat numberOfReps times
}
The reason why I invented this style is to distinguish it from the more general case of for loops. I only use this when I need to simply repeat something numberOfReps times and the body of the loop does not use the values of rep and numberOfReps in any way.
As far as I know, standard Java for example doesn't have a simple way of saying "just repeat this N times", and that's why I came up with this. I'd even go as far as saying that the body of the loop must not continue or break, unless explicitly documented at the top of the for loop, because as I said the whole purpose is to make the code easier to understand by coming up with a distinct style to express simple repetitions.
The idea is that if what you're doing is not simple (dependency on the value of an increasing/decreasing index, breaks, continues, etc), then use the standard for loop. If what you are doing is simple repetition, on the other hand, then this distinct style communicates that "fact" (once you know the purpose of the style, of course).
I said "fact" because the style can be abused, of course. I'm operating under the assumption that you have competent programmers whose objective is to make their code easier to understand, not harder.
A comment was made that alludes to the principle that for should only be used for simple iteration, and while should be used otherwise (e.g. if the loop variables are modified in the body).
If that's the case, then I'm merely extending that principle to say that if it's even simpler than your simple for loops (i.e. you don't even care about the iteration index, or whether it's increasing or decreasing, etc, you just want to repeat doing something N times), then use the winking arrow for loop construct instead.
What a coincidence, Josh Bloch just tweeted the following:
Goes-to Considered Harmful:
public static void main(String[] a) {
int i = 10;
while (i --> 0) /* i goes-to 0 */ {
System.out.println(i);
}
}
Unfortunately no explanation was given, but it seems that at least this pseudo operator has a name. It has also been discussed before on SO: What is the name of this operator: “-->”?
You have the language-agnostic tag, but this question isn't really language agnostic. That pattern would be fine if there wasn't already a well established idiom for doing something n times in your language.
You go on to mention Java, which already has a well-established idiom for doing something n times:
for (int i = 0; i < numberOfReps; i++) {
// do something that you simply want to repeat numberOfReps times
}
While your pattern works just as well, it's confusing to others. When I first saw it my thoughts were:
What's that weird arrow?
Why is that line winking at me?
Unless you develop a pattern that has a significant advantage over the standard idiom, it's best to stick with the standard so your fellow coders don't end up scratching their heads.
Nearly every language these days has lambda, so you can write a function like
nTimes(n, body)
that takes an int and a lambda, and more directly communicate intent. In F#, for example
let nTimes(n,f) =
for i in 1..n do f()
nTimes(3, fun() -> printfn "Hello")
or if you prefer extension methods
type System.Int32 with
member this.Times(f) =
for i in 1..this do f()
(3).Times(fun() -> printfn "Hello")
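Since the question mentions Java, here is a sketch of the same helper there (Java 8 or later). The class name Repeat is made up; nTimes mirrors the F# function above and takes a Runnable as the body.

```java
class Repeat {

    // Runs 'body' exactly n times; does nothing when n is zero or negative.
    static void nTimes(int n, Runnable body) {
        for (int i = 0; i < n; i++) {
            body.run();
        }
    }

    public static void main(String[] args) {
        nTimes(3, () -> System.out.println("Hello"));
    }
}
```

The loop index stays hidden inside the helper, so the call site only states the repetition count and the action.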
Given a string with replacement keys in it, how can I most efficiently replace these keys with runtime values, using Java? I need to do this often, fast, and on reasonably long strings (say, on average, 1-2kb). The form of the keys is my choice, since I'm providing the templates here too.
Here's an example (please don't get hung up on it being XML; I want to do this, if possible, cheaper than using XSL or DOM operations). I'd want to replace all #[^#]*?# patterns in this with property values from bean properties, true Property properties, and some other sources. The key here is fast. Any ideas?
<?xml version="1.0" encoding="utf-8"?>
<envelope version="2.3">
<delivery_instructions>
<delivery_channel>
<channel_type>#CHANNEL_TYPE#</channel_type>
</delivery_channel>
<delivery_envelope>
<chan_delivery_envelope>
<queue_name>#ADDRESS#</queue_name>
</chan_delivery_envelope>
</delivery_envelope>
</delivery_instructions>
<composition_instructions>
<mime_part content_type="application/xml">
<content><external_uri>#URI#</external_uri></content>
</mime_part>
</composition_instructions>
</envelope>
The naive implementation is to use String.replaceAll() but I can't help but think that's less than ideal. If I can avoid adding new third-party dependencies, so much the better.
The appendReplacement method in Matcher looks like it might be useful, although I can't vouch for its speed.
Here's the sample code from the Javadoc:
Pattern p = Pattern.compile("cat");
Matcher m = p.matcher("one cat two cats in the yard");
StringBuffer sb = new StringBuffer();
while (m.find()) {
m.appendReplacement(sb, "dog");
}
m.appendTail(sb);
System.out.println(sb.toString());
EDIT: If this is as complicated as it gets, you could probably implement your own state machine fairly easily. You'd pretty much be doing what appendReplacement is already doing, although a specialized implementation might be faster.
It's premature to leap to writing your own. I would start with the naive replace solution, and actually benchmark that. Then I would try a third-party templating solution. THEN I would take a stab at the custom stream version.
Until you get some hard numbers, how can you be sure it's worth the effort to optimize it?
Does Java have a form of regexp replace() where a function gets called?
I'm spoiled by the Javascript String.replace() method. (For that matter you could run Rhino and use Javascript, but somehow I don't think that would be anywhere near as fast as a pure Java call even if the Javascript compiler/interpreter were efficient)
edit: never mind, @mmyers probably has the best answer.
gratuitous point-groveling: (and because I wanted to see if I could do it myself :)
Pattern p = Pattern.compile("#([^#]*?)#");
Matcher m = p.matcher(s);
StringBuffer sb = new StringBuffer();
while (m.find())
{
m.appendReplacement(sb,substitutionTable.lookupKey(m.group(1)));
}
m.appendTail(sb);
// replace "substitutionTable.lookupKey" with your routine
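For what it's worth, from Java 9 onwards Matcher.replaceAll accepts a function, which comes close to the JavaScript style. A minimal sketch, assuming the values live in a Map (a stand-in for the bean and Properties lookups in the question) and that unknown keys should be left untouched; the class and method names are invented.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FunctionReplace {

    private static final Pattern KEY = Pattern.compile("#([^#]*?)#");

    static String fill(String template, Map<String, String> values) {
        return KEY.matcher(template).replaceAll(m ->
                // quoteReplacement keeps '$' and '\' in the values literal;
                // an unknown key is replaced by itself, i.e. left as "#KEY#"
                Matcher.quoteReplacement(values.getOrDefault(m.group(1), m.group())));
    }

    public static void main(String[] args) {
        String template = "<queue_name>#ADDRESS#</queue_name> via #CHANNEL_TYPE#";
        System.out.println(fill(template, Map.of("ADDRESS", "orders.in", "CHANNEL_TYPE", "jms")));
    }
}
```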
You really want to write something custom so you can avoid processing the string more than once. I can't stress this enough - as most of the other solutions I see look like they are ignoring that problem.
Optionally turn the text into a stream. Read it char by char, forwarding each char to an output string/stream until you see a #; then read up to the next #, slurp out the key, and write the key's substitution to the output. Repeat until the end of the stream.
I know it's plain old brute force - but it's probably the best.
I'm assuming you have some reasonable guarantee that '#' doesn't just 'show up' in the input independent of your token keys. :)
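A rough sketch of that single-pass scan, without regular expressions. The class name SinglePassReplacer and the Map lookup are made up for illustration, and it uses indexOf on an in-memory String rather than a true character stream. Copying unknown keys and unpaired '#' characters through unchanged is an assumption, not something the question specifies.

```java
import java.util.Map;

// Hypothetical helper: one left-to-right pass over the template.
final class SinglePassReplacer {
    static String render(String template, Map<String, String> values) {
        StringBuilder out = new StringBuilder(template.length());
        int pos = 0;
        while (pos < template.length()) {
            int open = template.indexOf('#', pos);
            if (open < 0) break;                      // no more tokens
            int close = template.indexOf('#', open + 1);
            if (close < 0) break;                     // unpaired '#': treat the rest as literal
            out.append(template, pos, open);          // literal text before the token
            String value = values.get(template.substring(open + 1, close));
            if (value != null) {
                out.append(value);
            } else {
                out.append(template, open, close + 1); // unknown key: keep "#key#" as is
            }
            pos = close + 1;
        }
        out.append(template, pos, template.length()); // trailing literal text
        return out.toString();
    }
}
```

Each character of the input is visited once and the replacement text is never rescanned.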
please don't get hung up on it being XML; I want to do this, if possible, cheaper than using XSL or DOM operations
Whatever's downstream from your process will get hung up if you don't also process the inserted strings for character escapes. That isn't to say you can't do it yourself if you have good cause, but it does mean you have to make sure your patterns are all in text nodes and that you correctly escape the replacement text.
What exact advantage does #Foo# have over the standard &Foo; syntax already built into the XML libraries which ship with Java?
Text processing is always going to be bounded if you don't shift your paradigm. I don't know how flexible your domain is, so I'm not sure if this is applicable, but here goes:
Try creating an index of where your text substitutions are. This is especially good if the template doesn't change often, because the index becomes part of the "compile" of the template into a binary object that can take in the values required for the substitutions and blit out the entire string as a byte array. This object can be cached/saved, and next time you can substitute in the new values and use it again. I.e., you save on parsing the document every time. (Implementation is left as an exercise to the reader =D )
But please use a profiler to check whether this is actually the bottleneck that you say it is before embarking on writing a custom templating engine. The problem may actually be elsewhere.
As others have said, appendReplacement() and appendTail() are the tools you need, but there's something you have to watch out for. If the replacement string contains any dollar signs, the method will try to interpret them as capture-group references. If there are any backslashes (which are used to escape the dollar sign), it will either eat them or throw an exception.
If your replacement string is dynamically generated, you may not know in advance whether it will contain any dollar signs or backslashes. To prevent problems, you can append the replacement directly to the StringBuffer, like so:
Pattern p = Pattern.compile("#([^#]*?)#");
Matcher m = p.matcher(s);
StringBuffer sb = new StringBuffer();
while (m.find())
{
m.appendReplacement(sb, "");
sb.append(substitutionTable.lookupKey(m.group(1)));
}
m.appendTail(sb);
You still have to call appendReplacement() each time, because that's what keeps you in sync with the match position. But this trick avoids a lot of pointless processing, which could give you a noticeable performance boost as a bonus.
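An alternative worth knowing about is Matcher.quoteReplacement(), which escapes dollar signs and backslashes so the value can be passed to appendReplacement() as a literal. A small sketch follows; the class name QuotedReplacement and the Map standing in for substitutionTable are assumptions, as is substituting an empty string for unknown keys.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the Map plays the role of substitutionTable.
final class QuotedReplacement {
    private static final Pattern TOKEN = Pattern.compile("#([^#]*?)#");

    static String render(String s, Map<String, String> table) {
        Matcher m = TOKEN.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Assumption: unknown keys are replaced with an empty string.
            String value = table.getOrDefault(m.group(1), "");
            // quoteReplacement makes '$' and '\' in the value literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(value));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

This keeps the loop in the familiar shape, at the cost of the extra escaping pass that the direct-append trick avoids.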
This is what I use, from the Apache Commons project:
http://commons.apache.org/lang/api/org/apache/commons/lang/text/StrSubstitutor.html
I also have a non-regexp based substitution library, available here. I have not tested its speed, and it doesn't directly support the syntax in your example. But it would be easy to extend to support that syntax; see, for instance, this class.
Take a look at a library that specializes in this, e.g., Apache Velocity. If nothing else, you can bet their implementation for this part of the logic is fast.
I wouldn't be so sure the accepted answer is faster than String.replaceAll(String,String). For comparison, here is the implementation of String.replaceAll and of the Matcher.replaceAll that it uses under the covers. It looks very similar to what the OP is looking for, and I'm guessing it's probably more optimized than this simplistic solution.
public String replaceAll(String s, String s1)
{
return Pattern.compile(s).matcher(this).replaceAll(s1);
}
public String replaceAll(String s)
{
reset();
boolean flag = find();
if(flag)
{
StringBuffer stringbuffer = new StringBuffer();
boolean flag1;
do
{
appendReplacement(stringbuffer, s);
flag1 = find();
} while(flag1);
appendTail(stringbuffer);
return stringbuffer.toString();
} else
{
return text.toString();
}
}
... Chii is right.
If this is a template that has to be run so many times that speed matters, find the index of your substitution tokens so you can get to them directly without having to start at the beginning each time. Abstract the 'compilation' into an object with the nice property that it only needs updating after a change to the template.
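To make the 'compile once, render many times' idea concrete, here is one possible sketch. The class CompiledTemplate, its fields and the Map-based render method are all invented for illustration. It keeps the literal pieces as strings rather than byte arrays, and it assumes unknown keys render as an empty string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical precompiled template: the #key# scan happens once, in the constructor.
final class CompiledTemplate {
    // literals.get(i) precedes the value for keys.get(i); literals has one extra trailing entry.
    private final List<String> literals = new ArrayList<>();
    private final List<String> keys = new ArrayList<>();

    CompiledTemplate(String template) {
        int pos = 0;
        while (true) {
            int open = template.indexOf('#', pos);
            int close = open < 0 ? -1 : template.indexOf('#', open + 1);
            if (close < 0) break;                 // no further complete #key# token
            literals.add(template.substring(pos, open));
            keys.add(template.substring(open + 1, close));
            pos = close + 1;
        }
        literals.add(template.substring(pos));    // text after the last token
    }

    // Rendering only concatenates; the template text is not parsed again.
    String render(Map<String, String> values) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < keys.size(); i++) {
            out.append(literals.get(i));
            out.append(values.getOrDefault(keys.get(i), ""));
        }
        out.append(literals.get(keys.size()));
        return out.toString();
    }
}
```

An instance can be cached and reused with different value maps; it only has to be rebuilt when the template itself changes.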
Rythm, a Java template engine, has now been released with a new feature called String interpolation mode, which allows you to do something like:
String result = Rythm.render("Hello #who!", "world");
The above case shows that you can pass arguments to the template by position. Rythm also allows you to pass arguments by name:
Map<String, Object> args = new HashMap<String, Object>();
args.put("title", "Mr.");
args.put("name", "John");
String result = Rythm.render("Hello #title #name", args);
Since your template content is relatively long, you could put it into a file and then call Rythm.render using the same API:
Map<String, Object> args = new HashMap<String, Object>();
// ... prepare the args
String result = Rythm.render("path/to/my/template.xml", args);
Note that Rythm compiles your template into Java bytecode, so it's fairly fast: about 2 times faster than String.format.
Links:
Check the full featured demonstration
read a brief introduction to Rythm
download the latest package or
fork it
Which one is recommended considering readability, memory usage, other reasons?
1.
String strSomething1 = someObject.getSomeProperties1();
strSomething1 = doSomeValidation(strSomething1);
String strSomething2 = someObject.getSomeProperties2();
strSomething2 = doSomeValidation(strSomething2);
String strSomeResult = strSomething1 + strSomething2;
someObject.setSomeProperties(strSomeResult);
2.
someObject.setSomeProperties(doSomeValidation(someObject.getSomeProperties1()) +
doSomeValidation(someObject.getSomeProperties2()));
If you would do it some other way, what would that be? Why would you do it that way?
I'd go with:
String strSomething1 = someObject.getSomeProperties1();
String strSomething2 = someObject.getSomeProperties2();
// clean-up spaces
strSomething1 = removeTrailingSpaces(strSomething1);
strSomething2 = removeTrailingSpaces(strSomething2);
someObject.setSomeProperties(strSomething1 + strSomething2);
My personal preference is to organize by action, rather than sequence. I think it just reads better.
I would probably go in-between:
String strSomething1 = doSomeValidation(someObject.getSomeProperties1());
String strSomething2 = doSomeValidation(someObject.getSomeProperties2());
someObject.setSomeProperties(strSomething1 + strSomething2);
Option #2 seems like a lot to do in one line. It's readable, but takes a little effort to parse. In option #1, each line is very readable and clear in intent, but the verbosity slows me down when I'm going over it. I'd try to balance brevity and clarity as above, with each line representing a simple "sentence" of code.
I prefer the second. You can make it just as readable with a little bit of formatting, without declaring the extra intermediate references.
someObject.setSomeProperties(
doSomeValidation( someObject.getSomeProperties1() ) +
doSomeValidation( someObject.getSomeProperties2() ));
Your method names provide all the explanation needed.
Option 2 for readability. I don't see any memory concerns here if the methods only do what their names indicate. I would be wary of the concatenations, though. Performance definitely takes a hit as the number of string concatenations grows, because Java Strings are immutable.
Just curious to know: did you really write your own removeTrailingSpaces() method, or is it just an example?
I try to have one operation per line. The main reason is this:
setX(getX().getY()+getA().getB())
If you have a NPE here, which method returned null? So I like to have intermediate results in some variable which I can see after the code fell into the strong arms of the debugger and without having to restart!
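As a small illustration of that point, here is a sketch with one call per line plus an explicit null check on each intermediate result. The Part class and its getValue() method are hypothetical stand-ins for whatever getX().getY() and getA().getB() return; Objects.requireNonNull is the standard JDK method.

```java
import java.util.Objects;

final class NullCheckDemo {
    // Hypothetical stand-in for the objects in the one-liner above.
    static final class Part {
        private final Integer value;
        Part(Integer value) { this.value = value; }
        Integer getValue() { return value; }
    }

    // One call per line: each intermediate result has a name and its own null check.
    static int sum(Part x, Part a) {
        Integer y = Objects.requireNonNull(x.getValue(), "x.getValue() returned null");
        Integer b = Objects.requireNonNull(a.getValue(), "a.getValue() returned null");
        return y + b;
    }
}
```

If one of the getters returns null, the exception message and the stack trace line both identify which one, and the named variables are visible in the debugger.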
For me, it depends on the context and the surrounding code.
[EDIT: does not make any sense, sorry]
if it was in method like "setSomeObjectProperties()", I'd prefer variant 2 but perhaps would create a private method "getProperty(String name)" which removes the trailing spaces if removing the spaces is not an important operation
[/EDIT]
If validating the properties is an important step of your method, then I'd call the method "setValidatedProperties()" and would prefer a variant of your first suggestion:
validatedProp1 = doValidation(someObject.getSomeProperty1());
validatedProp2 = doValidation(someObject.getSomeProperty2());
someObject.setSomeProperties(validatedProp1, validatedProp2);
If validation is not an important part of this method (e.g. there's no point in returning properties which are not validated), I'd try to put the validation step in "getSomePropertyX()".
Personally, I prefer the second one. It's less cluttered and I don't have to keep track of those temporary variables.
Might change easily with more complex expressions, though.
I like both Greg's and Bill's versions; I think I would more naturally write code like Greg's. One advantage of intermediary variables: the code is easier to debug (in the general case).