String capitalize - better way

String capitalize - better way - java

What method of capitalizing is better?
mine:
char[] charArray = string.toCharArray();
charArray[0] = Character.toUpperCase(charArray[0]);
return new String(charArray);
or
commons lang - StringUtils.capitalize:
return new StringBuffer(strLen)
.append(Character.toTitleCase(str.charAt(0)))
.append(str.substring(1))
.toString();
I think mine is better, but i would rather ask.

I guess your version will be a little bit more performant, since it does not allocate as many temporary String objects.
I'd go for this (assuming the string is not empty):
StringBuilder strBuilder = new StringBuilder(string);
strBuilder.setCharAt(0, Character.toUpperCase(strBuilder.charAt(0))));
return strBuilder.toString();
However, note that they are not equivalent in that one uses toUpperCase() and the other uses toTitleCase().
From a forum post:
Titlecase <> uppercase
Unicode
defines three kinds of case mapping:
lowercase, uppercase, and titlecase.
The difference between uppercasing and
titlecasing a character or character
sequence can be seen in compound
characters (that is, a single
character that represents a compount
of two characters).
For example, in Unicode, character
U+01F3 is LATIN SMALL LETTER DZ. (Let
us write this compound character
using ASCII as "dz".) This character
uppercases to character U+01F1, LATIN
CAPITAL LETTER DZ. (Which is
basically "DZ".) But it titlecases to
to character U+01F2, LATIN CAPITAL
LETTER D WITH SMALL LETTER Z. (Which
we can write "Dz".)
character uppercase titlecase
--------- --------- ---------
dz DZ Dz

If I were to write a library, I'd try to make sure I got my Unicode right beofre worrying about performance. Off the top of my head:
int len = str.length();
if (len == 0) {
return str;
}
int head = Character.toUpperCase(str.codePointAt(0));
String tail = str.substring(str.offsetByCodePoints(0, 1));
return new String(new int[] { head }).concat(tail);
(I'd probably also look up the difference between title and upper case before I committed.)

Performance is equal.
Your code copies the char[] calling string.toCharArray() and new String(charArray).
The apache code on buffer.append(str.substring(1)) and buffer.toString(). The apache code has an extra string instance that has the base char[1,length] content. But this will not be copied when the instance String is created.

StringBuffer is declared to be thread safe, so it might be less effective to use it (but one shouldn't bet on it before actually doing some practical tests).

StringBuilder (from Java 5 onwards) is faster than StringBuffer if you don't need it to be thread safe but as others have said you need to test if this is better than your solution in your case.

Have you timed both?
Honestly, they're equivalent.. so the one that performs better for you is the better one :)

Not sure what the difference between toUpperCase and toTitleCase is, but it looks as if your solution requires one less instantiation of the String class, while the commons lang implementation requires two (substring and toString create new Strings I assume, since String is immutable).
Whether that's "better" (I guess you mean faster) I don't know. Why don't you profile both solutions?

look at this question titlecase-conversion . apache FTW.

/**
* capitalize the first letter of a string
*
* #param String
* #return String
* */
public static String capitalizeFirst(String s) {
if (s == null || s.length() == 0) {
return "";
}
char first = s.charAt(0);
if (Character.isUpperCase(first)) {
return s;
} else {
return Character.toUpperCase(first) + s.substring(1);
}
}

If you only capitalize limited words, you better cache it.
#Test
public void testCase()
{
String all = "At its base, a shell is simply a macro processor that executes commands. The term macro processor means functionality where text and symbols are expanded to create larger expressions.\n" +
"\n" +
"A Unix shell is both a command interpreter and a programming language. As a command interpreter, the shell provides the user interface to the rich set of GNU utilities. The programming language features allow these utilities to be combined. Files containing commands can be created, and become commands themselves. These new commands have the same status as system commands in directories such as /bin, allowing users or groups to establish custom environments to automate their common tasks.\n" +
"\n" +
"Shells may be used interactively or non-interactively. In interactive mode, they accept input typed from the keyboard. When executing non-interactively, shells execute commands read from a file.\n" +
"\n" +
"A shell allows execution of GNU commands, both synchronously and asynchronously. The shell waits for synchronous commands to complete before accepting more input; asynchronous commands continue to execute in parallel with the shell while it reads and executes additional commands. The redirection constructs permit fine-grained control of the input and output of those commands. Moreover, the shell allows control over the contents of commands’ environments.\n" +
"\n" +
"Shells also provide a small set of built-in commands (builtins) implementing functionality impossible or inconvenient to obtain via separate utilities. For example, cd, break, continue, and exec cannot be implemented outside of the shell because they directly manipulate the shell itself. The history, getopts, kill, or pwd builtins, among others, could be implemented in separate utilities, but they are more convenient to use as builtin commands. All of the shell builtins are described in subsequent sections.\n" +
"\n" +
"While executing commands is essential, most of the power (and complexity) of shells is due to their embedded programming languages. Like any high-level language, the shell provides variables, flow control constructs, quoting, and functions.\n" +
"\n" +
"Shells offer features geared specifically for interactive use rather than to augment the programming language. These interactive features include job control, command line editing, command history and aliases. Each of these features is described in this manual.";
String[] split = all.split("[\\W]");
// 10000000
// upper Used 606
// hash Used 114
// 100000000
// upper Used 5765
// hash Used 1101
HashMap<String, String> cache = Maps.newHashMap();
long start = System.currentTimeMillis();
for (int i = 0; i < 100000000; i++)
{
String upper = split[i % split.length].toUpperCase();
// String s = split[i % split.length];
// String upper = cache.get(s);
// if (upper == null)
// {
// cache.put(s, upper = s.toUpperCase());
//
// }
}
System.out.println("Used " + (System.currentTimeMillis() - start));
}
The text is picked from here.
Currently, I need to upper case the table name and columns, many many more times, but they are limited.Use the hashMap to cache will be better.
:-)

use this method for capitalizing of string. its totally working without any bug
public String capitalizeString(String value)
{
String string = value;
String capitalizedString = "";
System.out.println(string);
for(int i = 0; i < string.length(); i++)
{
char ch = string.charAt(i);
if(i == 0 || string.charAt(i-1)==' ')
ch = Character.toUpperCase(ch);
capitalizedString += ch;
}
return capitalizedString;
}

Related

Is String.trim() faster than String.replace()?

Let's say there has a string like " world ". This String only has the blank at front and end. Is the trim() faster than replace()?
I used the replace once and my mentor said don't use it since the trim() probably faster.
If not, what's the advantage of trim() than replace()?

If we look at the source code for the methods:
replace():
public String replace(CharSequence target, CharSequence replacement) {
String tgtStr = target.toString();
String replStr = replacement.toString();
int j = indexOf(tgtStr);
if (j < 0) {
return this;
}
int tgtLen = tgtStr.length();
int tgtLen1 = Math.max(tgtLen, 1);
int thisLen = length();
int newLenHint = thisLen - tgtLen + replStr.length();
if (newLenHint < 0) {
throw new OutOfMemoryError();
}
StringBuilder sb = new StringBuilder(newLenHint);
int i = 0;
do {
sb.append(this, i, j).append(replStr);
i = j + tgtLen;
} while (j < thisLen && (j = indexOf(tgtStr, j + tgtLen1)) > 0);
return sb.append(this, i, thisLen).toString()
}
Vs trim():
public String trim() {
int len = value.length;
int st = 0;
char[] val = value; /* avoid getfield opcode */
while ((st < len) && (val[st] <= ' ')) {
st++;
}
while ((st < len) && (val[len - 1] <= ' ')) {
len--;
}
return ((st > 0) || (len < value.length)) ? substring(st, len) : this;
}
As you can see replace() calls multiple other methods and iterates throughout the entire String, while trim() simply iterates over the beginning and ending of the String until the character isn't a white space. So in the single respect of trying to only remove white space before and after a word, trim() is more efficient.
We can run some benchmarks on this:
public static void main(String[] args) {
long testStartTime = System.nanoTime();;
trimTest();
long trimTestTime = System.nanoTime() - testStartTime;
testStartTime = System.nanoTime();
replaceTest();
long replaceTime = System.nanoTime() - testStartTime;
System.out.println("Time for trim(): " + trimTestTime);
System.out.println("Time for replace(): " + replaceTime);
}
public static void trimTest() {
for(int i = 0; i < 1000000; i ++) {
new String(" string ").trim();
}
}
public static void replaceTest() {
for(int i = 0; i < 1000000; i ++) {
new String(" string ").replace(" ", "");
}
}
Output:
Time for trim(): 53303903
Time for replace(): 485536597
//432,232,694 difference

Assuming that the people writing the Java library code are doing a good job1, you can assume that a special purpose method (like trim()) will be as fast, and probably faster than a general purpose method (like replace(...)) doing the same thing.
Two reasons:
If the special purpose method is slower, its implementation can be rewritten as equivalent calls to the general purpose one, making the performance equivalent in most cases. A competent programmer will do this because it reduces maintenance costs.
In the special purpose method, it is likely that there will be optimizations that can be made that don't apply in the general-purpose case.
In this case we know that trim() only needs to look at the start and end of the string ... whereas replace(...) needs to look at all of the characters in the string. (We can infer this from the description of what the respective methods do.)
If we assume "competence" then we can infer that the developers will have done the analysis and not implemented trim() sub-optimally2; i.e. they won't code trim() to examine all characters.
There is another reason to use the special purpose method over the general purpose. It makes your code simpler, easier to read, and easier to inspect for correctness. This may well be more important than performance.
This clearly applies in the case of trim() versus replace(...).
1 - We can in this case. There are lots of eyes looking at this code, and lots of people who will complain loudly about egregious performance issues.
2 - Unfortunately, it is not always as straightforward as this. A library method needs to be optimized for "typical" behavior, but it also needs to avoid pathological performance in edge-cases. It is not always possible to achieve both things.

trim() is definitely faster to type, yes. It doesn't take any parameters.
It is also much faster to understand what you where trying to do. You were trying to trim the string, rather than replacing all the spaces it contains with the empty string, knowing from other context that there is only space at the beginning and the end of the string.
Indeed much faster no matter how you look at it. Don't complicate the life of the persons who're trying to read your code. Most of the time, it will be you months later, or at least someone you don't hate.

Trim will prune the outter characters until they are non white space. I believe they trim space, tab, and new lines.
Replace will scan the entire string (so, it could be a sentense) and would replace inner " " with "", essentially compressing them together.
They have different use cases though, obviously 1 is to clean up user input where the other is to update a string where matches are found with something else.
That being said, run times: Replace will run in N time, as it will look for all matching characters. Trim will run in O(N), but most likely a just a few characters off of each end.
The idea behind trim i think came around from people would would type and input things but accidentally press space before submitting their forms, essentially trying to save the field "Foo " instead of "Foo"

s.trim() shortens a String s. This means no characters has to be moved from an index to another. It starts at the first character (s.toCharArray()[0]) of the String and shortens the String character by character until the first non-whitespace character occurs. It works the same way to shorten the String at the end. So it compresses the String. If a String has no leading and trailing whitespace trim will be ready after checking the first and the last character.
In case of " world ".trim() two steps are needed: one to remove the first leading whitespace as it is on the first index and the the second to remove the last whitespace as it is on the last index.
" world ".replace(" ", "") will need at least n = " world ".length() steps. It has to check every character if it has to be replaced. But if we take into account that the implementation of String.replace(...) needs to compile a Pattern, build a Matcher and then to replace all the matched regions it's seems far complex comparing to shorten a String.
We also have to consider that " world ".replace(" ", "") does not replace whitespaces but only the String " ". Since String replace(CharSequence target, CharSequence replacement) compiles the target using Pattern.LITERAL we cannot use the character class \s. To be more accurate we would have to compare " world ".trim() to " world ".replaceAll("\\s", ""). It is still not the same because a whitespace in String trim() is defined as c <= ' ' for each c in s.toCharArray().
Summarizing: String.trim() should be faster - especially for long strings
The description how the methods work is based on the implementation of String in Java 8. But implementations can change.
But the question should be: What do you intent to do with the string? Do you want to trim it or to replace some characters? According to it use the corresponding method.

How to compare Chinese characters in Java using 'equals()'

I want to compare a string portion (i.e. character) against a Chinese character. I assume due to the Unicode encoding it counts as two characters, so I'm looping through the string with increments of two. Now I ran into a roadblock where I'm trying to detect the '兒' character, but equals() doesn't match it, so what am I missing ? This is the code snippet:
for (int CharIndex = 0; CharIndex < tmpChar.length(); CharIndex=CharIndex+2) {
// Account for 'r' like in dianr/huir
if (tmpChar.substring(CharIndex,CharIndex+2).equals("兒")) {
Also, feel free to suggest a more elegant way to parse this ...
[UPDATE] Some pics from the debugger, showing that it doesn't match, even though it should. I pasted the Chinese character from the spreadsheet I use as input, so I don't think it's a copy and paste issue (unless the unicode gets lost along the way)
oh, dang, apparently it does not work simply copy and pasting:

Use CharSequence.codePoints(), which returns a stream of the codepoints, rather than having to deal with chars:
tmpChar.codePoints().forEach(c -> {
if (c == '兒') {
// ...
}
});
(Of course, you could have used tmpChar.codePoints().filter(c -> c == '兒').forEach(c -> { /* ... */ })).

Either characters, accepting 兒 as substring.
String s = ...;
if (s.contains("兒")) { ... }
int position = s.indexOf("兒");
if (position != -1) {
int position2 = position + "兒".length();
s = s.substring(0, position) + "*" + s.substring(position2);
}
if (s.startsWith("兒", i)) {
// At position i there is a 兒.
}
Or code points where it would be one code point. As that is not really easier, variable substring seem fine.

if (tmpChar.substring(CharIndex,CharIndex+2).equals("兒")) {
Is your problem. 兒 is only one UTF-16 character. Many Chinese characters can be represented in UTF-16 in one code unit; Java uses UTF-16. However, other characters are two code units.
There are a variety of APIs on the String class for coping.
As offered in another answer, obtaining the IntStream from codepoints allows you to get a 32-bit code point for each character. You can compare that to the code point value for the character you are looking for.
Or, you can use the ICU4J library with a richer set of facilities for all of this.

which code is more efficient?

which of the following is an efficient way to reverse words in a string ?
public String Reverse(StringTokenizer st){
String[] words = new String[st.countTokens()];
int i = 0;
while(st.hasMoreTokens()){
words[i] = st.nextToken();i++}
for(int j = words.length-1;j--)
output = words[j]+" ";}
OR
public String Reverse(StringTokenizer st, String output){
if(!st.hasMoreTokens()) return output;
output = st.nextToken()+" "+output;
return Reverse(st, output);}
public String ReverseMain(StringTokenizer st){
return Reverse(st, "");}
while the first way seems more readable and straight forward, there are two loops in it. In the 2nd method, I've tried doing it in tail-recursive way. But I am not sure whether java does optimize tail-recursive code.

you could do this in just one loop
public String Reverse(StringTokenizer st){
int length = st.countTokens();
String[] words = new String[length];
int i = length - 1;
while(i >= 0){
words[i] = st.nextToken();i--}
}

But I am not sure whether java does optimize tail-recursive code.
It doesn't. Or at least the Sun/Oracle Java implementations don't, up to and including Java 7.
References:
"Tail calls in the VM" by John Rose # Oracle.
Bug 4726340 - RFE: Tail Call Optimization
I don't know whether this makes one solution faster than the other. (Test it yourself ... taking care to avoid the standard micro-benchmarking traps.)
However, the fact that Java doesn't implement tail-call optimization means that the 2nd solution is liable to run out of stack space if you give it a string with a large (enough) number of words.
Finally, if you are looking for a more space efficient way to implement this, there is clever way that uses just a StringBuilder.
Create a StringBuilder from your input String
Reverse the characters in the StringBuilder using reverse().
Step through the StringBuilder, identifying the start and end offset of each word. For each start/end offset pair, reverse the characters between the offsets. (You have to do this using a loop.)
Turn the StringBuilder back into a String.

You can test results by timing both of them on a large amount of results
eg. You reverse 100000000 strings and see how many seconds it takes. You could also compare start and end system timestamps to get the exact difference between the two functions.

StringTokenizer is not deprecated but if you read the current JavaDoc...
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
String[] strArray = str.split(" ");
StringBuilder sb = new StringBuilder();
for (int i = strArray.length() - 1; i >= 0; i--)
sb.append(strArray[i]).append(" ");
String reversedWords = sb.substring(0, sb.length -1) // strip trailing space

Replace the first letter of a String in Java?

I'm trying to convert the first letter of a string to lowercase.
value.substring(0,1).toLowerCase() + value.substring(1)
This works, but are there any better ways to do this?
I could use a replace function, but Java's replace doesn't accept an index. You have to pass the actual character/substring. It could be done like this:
value.replaceFirst(value.charAt(0), value.charAt(0).toLowerCase())
Except that replaceFirst expects 2 strings, so the value.charAt(0)s would probably need to be replaced with value.substring(0,1).
Is there any standard way to replace the first letter of a String?

I would suggest you to take a look at Commons-Lang library from Apache. They have a class
StringUtils
which allows you to do a lot of tasks with Strings. In your case just use
StringUtils.uncapitalize( value )
read here about uncapitalize as well as about other functionality of the class suggested
Added: my experience tells that Coomon-Lang is quite good optimized, so if want to know what is better from algorithmistic point of view, you could take a look at its source from Apache.

The downside of the code you used (and I've used in similar situations) is that it seems a bit clunky and in theory generates at least two temporary strings that are immediately thrown away. There's also the issue of what happens if your string is fewer than two characters long.
The upside is that you don't reference those temporary strings outside the expression (leaving it open to optimization by the bytecode compiler or the JIT optimizer) and your intent is clear to any future code maintainer.
Barring your needing to do several million of these any given second and detecting a noticeable performance issue doing so, I wouldn't worry about performance and would prefer clarity. I'd also bury it off in a utility class somewhere. :-) See also jambjo's response to another answer pointing out that there's an important difference between String#toLowerCase and Character.toLowerCase. (Edit: The answer and therefore comment have been removed. Basically, there's a big difference related to locales and Unicode and the docs recommend using String#toLowerCase, not Character.toLowerCase; more here.)
Edit Because I'm in a weird mood, I thought I'd see if there was a measureable difference in performance in a simple test. There is. It could be because of the locale difference (e.g., apples vs. oranges):
public class Uncap
{
public static final void main(String[] params)
{
String s;
String s2;
long start;
long end;
int counter;
// Warm up
s = "Testing";
start = System.currentTimeMillis();
for (counter = 1000000; counter > 0; --counter)
{
s2 = uncap1(s);
s2 = uncap2(s);
s2 = uncap3(s);
}
// Test v2
start = System.currentTimeMillis();
for (counter = 1000000; counter > 0; --counter)
{
s2 = uncap2(s);
}
end = System.currentTimeMillis();
System.out.println("2: " + (end - start));
// Test v1
start = System.currentTimeMillis();
for (counter = 1000000; counter > 0; --counter)
{
s2 = uncap1(s);
}
end = System.currentTimeMillis();
System.out.println("1: " + (end - start));
// Test v3
start = System.currentTimeMillis();
for (counter = 1000000; counter > 0; --counter)
{
s2 = uncap3(s);
}
end = System.currentTimeMillis();
System.out.println("3: " + (end - start));
System.exit(0);
}
// The simple, direct version; also allows the library to handle
// locales and Unicode correctly
private static final String uncap1(String s)
{
return s.substring(0,1).toLowerCase() + s.substring(1);
}
// This will *not* handle locales and unicode correctly
private static final String uncap2(String s)
{
return Character.toLowerCase(s.charAt(0)) + s.substring(1);
}
// This will *not* handle locales and unicode correctly
private static final String uncap3(String s)
{
StringBuffer sb;
sb = new StringBuffer(s);
sb.setCharAt(0, Character.toLowerCase(sb.charAt(0)));
return sb.toString();
}
}
I mixed up the order in various tests (moving them around and recompiling) to avoid issues of ramp-up time (and tried to force some initially anyway). Very unscientific, but uncap1 was consistently slower than uncap2 and uncap3 by about 40%. Not that it matters, we're talking a difference of 400ms across a million iterations on an Intel Atom processor. :-)
So: I'd go with your simple, straightforward code, wrapped up in a utility function.

Watch out for any of the character functions in strings. Because of unicode, it is not always a 1 to 1 mapping. Stick to string based methods unless char is really what you want. As others have suggested, there are string utils out there, but even if you don't want to use them for your project, just make one yourself as you work. The worst thing you can do is to make a special function for lowercase and hide it in a class and then use the same code slightly differently in 12 different places. Put it somewhere it can easily be shared.

Use StringBuffer:
buffer.setCharAt(0, Character.toLowerCase(buffer.charAt(0)));

Evaluation of environment variables in command run by Java's Runtime.exec()

I have a scenarios where I have a Java "agent" that runs on a couple of platforms (specifically Windows, Solaris & AIX). I'd like to factor out the differences in filesystem structure by using environment variables in the command line I execute.
As far as I can tell there is no way to get the Runtime.exec() method to resolve/evaluate any environment variables referenced in the command String (or array of Strings).
I know that if push comes to shove I can write some code to pre-process the command String(s) and resolve enviroment variables by hand (using getEnv() etc). However I'm wondering if there is a smarter way to do this since I'm sure I'm not the only person wanting to do this and I'm sure there are pitfalls in "knocking up" my own implementation.
Your guidance and suggestions are most welcome.
edit:
I would like to refer to environment variables in the command string using some consistent notation such as $VAR and/or %VAR%. Not fussed which.
edit:
To be clear I'd like to be able to execute a command such as:
perl $SCRIPT_ROOT/somePerlScript.pl args
on Windows and Unix hosts using Runtime.exec(). I specify the command in config file that describes a list of jobs to run and it has to be able to work cross platform, hence my thought that an environment variable would be useful to factor out the filesystem differences (/home/username/scripts vs C:\foo\scripts). Hope that helps clarify it.
Thanks.
Tom

I think I misunderstood your question with my original answer. If you are trying to resolve environment variable references on the command lines that you generated from with-in Java, I think you may have to "roll your own".
There are many different standards for how these are expanded, depending on the operating system. In addition, this is typically a function of the shell, so even on the same OS there could be different ways. In fact, standard operating system process activation functions (e.g. exec in Unix) do not do command line expansion.
This is really not that difficult, with Java 5 and later. Define a standard for yourself, I typically use the Java standard that you see in security policy files and some enhanced property file definitions - ${var} expands to the variable/property name reference. Something like:
private static String expandCommandLine(final String cmd) {
final Pattern vars = Pattern.compile("[$]\\{(\\S+)\\}");
final Matcher m = vars.matcher(cmd);
final StringBuffer sb = new StringBuffer(cmd.length());
int lastMatchEnd = 0;
while (m.find()) {
sb.append(cmd.substring(lastMatchEnd, m.start()));
final String envVar = m.group(1);
final String envVal = System.getenv(envVar);
if (envVal == null)
sb.append(cmd.substring(m.start(), m.end()));
else
sb.append(envVal);
lastMatchEnd = m.end();
}
sb.append(cmd.substring(lastMatchEnd));
return sb.toString();
}

Is there some reason system properties won't work? Use the -D flag on the command line, you can then retrieve it via System.getProperty

The regex
[$]\\{(\\S+)\\}
is greedy (S+) and in case of xx ${a} xx ${b} xx it will match A} xx ${B instead of seperately A and B. Did you ever test it with multiple vars in the cmd?
So if there are multiple variables to be replaced. It should be
[$]\\{(\\S+?)\\}

Just because I have the Windows environment variable resolution code in my clipboard (coded by yours truly) which works on a string, I'm just going to post it up here. It's perhaps the most efficient method (maybe regex is far less efficient, I'm not sure).
int from = 0;
int startperc = -1;
int endperc = -1;
while (true) {
int index = tok.indexOf("%", from);
if (index == -1) break;
if (startperc == -1) {
startperc = index;
} else if (endperc == -1) {
endperc = index;
}
from = index + 1;
if (startperc >= 0 && endperc > startperc) {
String startbit = tok.substring(0, startperc);
String middlebit = System.getenv(tok.substring(startperc + 1, endperc));
String endbit = endperc <= tok.length() ? tok.substring(endperc + 1) : ""; // substr up to end
// integrate
tok = startbit + middlebit + endbit;
// reset
startperc = -1;
endperc = -1;
}
}

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.