Trim String in Java while preserve full word

Trim String in Java while preserve full word - java

I need to trim a String in java so that:
The quick brown fox jumps over the laz dog.
becomes
The quick brown...
In the example above, I'm trimming to 12 characters. If I just use substring I would get:
The quick br...
I already have a method for doing this using substring, but I wanted to know what is the fastest (most efficient) way to do this because a page may have many trim operations.
The only way I can think off is to split the string on spaces and put it back together until its length passes the given length. Is there an other way? Perhaps a more efficient way in which I can use the same method to do a "soft" trim where I preserve the last word (as shown in the example above) and a hard trim which is pretty much a substring.
Thanks,

Below is a method I use to trim long strings in my webapps.
The "soft" boolean as you put it, if set to true will preserve the last word.
This is the most concise way of doing it that I could come up with that uses a StringBuffer which is a lot more efficient than recreating a string which is immutable.
public static String trimString(String string, int length, boolean soft) {
if(string == null || string.trim().isEmpty()){
return string;
}
StringBuffer sb = new StringBuffer(string);
int actualLength = length - 3;
if(sb.length() > actualLength){
// -3 because we add 3 dots at the end. Returned string length has to be length including the dots.
if(!soft)
return escapeHtml(sb.insert(actualLength, "...").substring(0, actualLength+3));
else {
int endIndex = sb.indexOf(" ",actualLength);
return escapeHtml(sb.insert(endIndex,"...").substring(0, endIndex+3));
}
}
return string;
}
Update
I've changed the code so that the ... is appended in the StringBuffer, this is to prevent needless creations of String implicitly which is slow and wasteful.
Note: escapeHtml is a static import from apache commons:
import static org.apache.commons.lang.StringEscapeUtils.escapeHtml;
You can remove it and the code should work the same.

Here is a simple, regex-based, 1-line solution:
str.replaceAll("(?<=.{12})\\b.*", "..."); // How easy was that!? :)
Explanation:
(?<=.{12}) is a negative look behind, which asserts that there are at least 12 characters to the left of the match, but it is a non-capturing (ie zero-width) match
\b.* matches the first word boundary (after at least 12 characters - above) to the end
This is replaced with "..."
Here's a test:
public static void main(String[] args) {
String input = "The quick brown fox jumps over the lazy dog.";
String trimmed = input.replaceAll("(?<=.{12})\\b.*", "...");
System.out.println(trimmed);
}
Output:
The quick brown...
If performance is an issue, pre-compile the regex for an approximately 5x speed up (YMMV) by compiling it once:
static Pattern pattern = Pattern.compile("(?<=.{12})\\b.*");
and reusing it:
String trimmed = pattern.matcher(input).replaceAll("...");

Please try following code:
private String trim(String src, int size) {
if (src.length() <= size) return src;
int pos = src.lastIndexOf(" ", size - 3);
if (pos < 0) return src.substring(0, size);
return src.substring(0, pos) + "...";
}

Try searching for the last occurence of a space that is in a position less or more than 11 and trim the string there, by adding "...".

Your requirements aren't clear. If you have trouble articulating them in a natural language, it's no surprise that they'll be difficult to translate into a computer language like Java.
"preserve the last word" implies that the algorithm will know what a "word" is, so you'll have to tell it that first. The split is a way to do it. A scanner/parser with a grammar is another.
I'd worry about making it work before I concerned myself with efficiency. Make it work, measure it, then see what you can do about performance. Everything else is speculation without data.

How about:
mystring = mystring.replaceAll("^(.{12}.*?)\b.*$", "$1...");

I use this hack : suppose that the trimmed string must have 120 of length :
String textToDisplay = textToTrim.substring(0,(textToTrim.length() > 120) ? 120 : textToTrim.length());
if (textToDisplay.lastIndexOf(' ') != textToDisplay.length() &&textToDisplay.length()!=textToTrim().length()) {
textToDisplay = textToDisplay + textToTrim.substring(textToDisplay.length(),textToTrim.indexOf(" ", textToDisplay.length()-1))+ " ...";
}

Related

Count the Characters in a String Recursively & treat "eu" as a Single Character

I am new to Java, and I'm trying to figure out how to count Characters in the given string and threat a combination of two characters "eu" as a single character, and still count all other characters as one character.
And I want to do that using recursion.
Consider the following example.
Input:
"geugeu"
Desired output:
4 // g + eu + g + eu = 4
Current output:
2
I've been trying a lot and still can't seem to figure out how to implement it correctly.
My code:
public static int recursionCount(String str) {
if (str.length() == 1) {
return 0;
}
else {
String ch = str.substring(0, 2);
if (ch.equals("eu") {
return 1 + recursionCount(str.substring(1));
}
else {
return recursionCount(str.substring(1));
}
}
}

OP wants to count all characters in a string but adjacent characters "ae", "oe", "ue", and "eu" should be considered a single character and counted only once.
Below code does that:
public static int recursionCount(String str) {
int n;
n = str.length();
if(n <= 1) {
return n; // return 1 if one character left or 0 if empty string.
}
else {
String ch = str.substring(0, 2);
if(ch.equals("ae") || ch.equals("oe") || ch.equals("ue") || ch.equals("eu")) {
// consider as one character and skip next character
return 1 + recursionCount(str.substring(2));
}
else {
// don't skip next character
return 1 + recursionCount(str.substring(1));
}
}
}

Recursion explained
In order to address a particular task using Recursion, you need a firm understanding of how recursion works.
And the first thing you need to keep in mind is that every recursive solution should (either explicitly or implicitly) contain two parts: Base case and Recursive case.
Let's have a look at them closely:
Base case - a part that represents a simple edge-case (or a set of edge-cases), i.e. a situation in which recursion should terminate. The outcome for these edge-cases is known in advance. For this task, base case is when the given string is empty, and since there's nothing to count the return value should be 0. That is sufficient for the algorithm to work, outcomes for other inputs should be derived from the recursive case.
Recursive case - is the part of the method where recursive calls are made and where the main logic resides. Every recursive call eventually hits the base case and stars building its return value.
In the recursive case, we need to check whether the given string starts from a particular string like "eu". And for that we don't need to generate a substring (keep in mind that object creation is costful). instead we can use method String.startsWith() which checks if the bytes of the provided prefix string match the bytes at the beginning of this string which is chipper (reminder: starting from Java 9 String is backed by an array of bytes, and each character is represented either with one or two bytes depending on the character encoding) and we also don't bother about the length of the string because if the string is shorter than the prefix startsWith() will return false.
Implementation
That said, here's how an implementation might look:
public static int recursionCount(String str) {
if(str.isEmpty()) {
return 0;
}
return str.startsWith("eu") ?
1 + recursionCount(str.substring(2)) : 1 + recursionCount(str.substring(1));
}
Note: that besides from being able to implement a solution, you also need to evaluate it's Time and Space complexity.
In this case because we are creating a new string with every call time complexity is quadratic O(n^2) (reminder: creation of the new string requires allocating the memory to coping bytes of the original string). And worse case space complexity also would be O(n^2).
There's a way of solving this problem recursively in a linear time O(n) without generating a new string at every call. For that we need to introduce the second argument - current index, and each recursive call should advance this index either by 1 or by 2 (I'm not going to implement this solution and living it for OP/reader as an exercise).
In addition
In addition, here's a concise and simple non-recursive solution using String.replace():
public static int count(String str) {
return str.replace("eu", "_").length();
}
If you would need handle multiple combination of character (which were listed in the first version of the question) you can make use of the regular expressions with String.replaceAll():
public static int count(String str) {
return str.replaceAll("ue|au|oe|eu", "_").length();
}

Is String.trim() faster than String.replace()?

Let's say there has a string like " world ". This String only has the blank at front and end. Is the trim() faster than replace()?
I used the replace once and my mentor said don't use it since the trim() probably faster.
If not, what's the advantage of trim() than replace()?

If we look at the source code for the methods:
replace():
public String replace(CharSequence target, CharSequence replacement) {
String tgtStr = target.toString();
String replStr = replacement.toString();
int j = indexOf(tgtStr);
if (j < 0) {
return this;
}
int tgtLen = tgtStr.length();
int tgtLen1 = Math.max(tgtLen, 1);
int thisLen = length();
int newLenHint = thisLen - tgtLen + replStr.length();
if (newLenHint < 0) {
throw new OutOfMemoryError();
}
StringBuilder sb = new StringBuilder(newLenHint);
int i = 0;
do {
sb.append(this, i, j).append(replStr);
i = j + tgtLen;
} while (j < thisLen && (j = indexOf(tgtStr, j + tgtLen1)) > 0);
return sb.append(this, i, thisLen).toString()
}
Vs trim():
public String trim() {
int len = value.length;
int st = 0;
char[] val = value; /* avoid getfield opcode */
while ((st < len) && (val[st] <= ' ')) {
st++;
}
while ((st < len) && (val[len - 1] <= ' ')) {
len--;
}
return ((st > 0) || (len < value.length)) ? substring(st, len) : this;
}
As you can see replace() calls multiple other methods and iterates throughout the entire String, while trim() simply iterates over the beginning and ending of the String until the character isn't a white space. So in the single respect of trying to only remove white space before and after a word, trim() is more efficient.
We can run some benchmarks on this:
public static void main(String[] args) {
long testStartTime = System.nanoTime();;
trimTest();
long trimTestTime = System.nanoTime() - testStartTime;
testStartTime = System.nanoTime();
replaceTest();
long replaceTime = System.nanoTime() - testStartTime;
System.out.println("Time for trim(): " + trimTestTime);
System.out.println("Time for replace(): " + replaceTime);
}
public static void trimTest() {
for(int i = 0; i < 1000000; i ++) {
new String(" string ").trim();
}
}
public static void replaceTest() {
for(int i = 0; i < 1000000; i ++) {
new String(" string ").replace(" ", "");
}
}
Output:
Time for trim(): 53303903
Time for replace(): 485536597
//432,232,694 difference

Assuming that the people writing the Java library code are doing a good job1, you can assume that a special purpose method (like trim()) will be as fast, and probably faster than a general purpose method (like replace(...)) doing the same thing.
Two reasons:
If the special purpose method is slower, its implementation can be rewritten as equivalent calls to the general purpose one, making the performance equivalent in most cases. A competent programmer will do this because it reduces maintenance costs.
In the special purpose method, it is likely that there will be optimizations that can be made that don't apply in the general-purpose case.
In this case we know that trim() only needs to look at the start and end of the string ... whereas replace(...) needs to look at all of the characters in the string. (We can infer this from the description of what the respective methods do.)
If we assume "competence" then we can infer that the developers will have done the analysis and not implemented trim() sub-optimally2; i.e. they won't code trim() to examine all characters.
There is another reason to use the special purpose method over the general purpose. It makes your code simpler, easier to read, and easier to inspect for correctness. This may well be more important than performance.
This clearly applies in the case of trim() versus replace(...).
1 - We can in this case. There are lots of eyes looking at this code, and lots of people who will complain loudly about egregious performance issues.
2 - Unfortunately, it is not always as straightforward as this. A library method needs to be optimized for "typical" behavior, but it also needs to avoid pathological performance in edge-cases. It is not always possible to achieve both things.

trim() is definitely faster to type, yes. It doesn't take any parameters.
It is also much faster to understand what you where trying to do. You were trying to trim the string, rather than replacing all the spaces it contains with the empty string, knowing from other context that there is only space at the beginning and the end of the string.
Indeed much faster no matter how you look at it. Don't complicate the life of the persons who're trying to read your code. Most of the time, it will be you months later, or at least someone you don't hate.

Trim will prune the outter characters until they are non white space. I believe they trim space, tab, and new lines.
Replace will scan the entire string (so, it could be a sentense) and would replace inner " " with "", essentially compressing them together.
They have different use cases though, obviously 1 is to clean up user input where the other is to update a string where matches are found with something else.
That being said, run times: Replace will run in N time, as it will look for all matching characters. Trim will run in O(N), but most likely a just a few characters off of each end.
The idea behind trim i think came around from people would would type and input things but accidentally press space before submitting their forms, essentially trying to save the field "Foo " instead of "Foo"

s.trim() shortens a String s. This means no characters has to be moved from an index to another. It starts at the first character (s.toCharArray()[0]) of the String and shortens the String character by character until the first non-whitespace character occurs. It works the same way to shorten the String at the end. So it compresses the String. If a String has no leading and trailing whitespace trim will be ready after checking the first and the last character.
In case of " world ".trim() two steps are needed: one to remove the first leading whitespace as it is on the first index and the the second to remove the last whitespace as it is on the last index.
" world ".replace(" ", "") will need at least n = " world ".length() steps. It has to check every character if it has to be replaced. But if we take into account that the implementation of String.replace(...) needs to compile a Pattern, build a Matcher and then to replace all the matched regions it's seems far complex comparing to shorten a String.
We also have to consider that " world ".replace(" ", "") does not replace whitespaces but only the String " ". Since String replace(CharSequence target, CharSequence replacement) compiles the target using Pattern.LITERAL we cannot use the character class \s. To be more accurate we would have to compare " world ".trim() to " world ".replaceAll("\\s", ""). It is still not the same because a whitespace in String trim() is defined as c <= ' ' for each c in s.toCharArray().
Summarizing: String.trim() should be faster - especially for long strings
The description how the methods work is based on the implementation of String in Java 8. But implementations can change.
But the question should be: What do you intent to do with the string? Do you want to trim it or to replace some characters? According to it use the corresponding method.

Java IF Statement necessary?

I have:
String str = "Hello, how, are, you";
I want to create a helper method that removes the commas from any string. Which of the following is more accurate?
private static String removeComma(String str){
if(str.contains(",")){
str = str.replaceAll(",","");
}
return str;
}
OR
private static String removeComma(String str){
str = str.replaceAll(",","");
return str;
}
Seems like I don't need the IF statement but there might be a case where I do.
If there is a better way let me know.

Both are functionally equivalent but the former is more verbose and will probably be slower because it runs an extra operation.
Also note that you don't need replaceAll (which accepts a regular expression): replace will do.
So I would go for:
private static String removeComma(String str){
return str.replace(",", "");
}

The IF statement is unnecessary, unless you're handling "large" strings (we're talking megabytes or more).
If you're using the IF statement, your code will first search for the first occurance of a comma, and then execute the replacement. This could be costly if the comma is near the end of the string and your string is large, since it will have to be traversed twice.
Without the IF statement, commas will be replaced if they exist. If the answer is negative, your string will be untouched.
Bottom rule: use the version without the IF statement.

Both are correct, but the second one is cleaner since the IF statement of the first alternative is not needed.

It's a matter of what is the probability to have strings with comma in your universe of strings.
If you have a high probability, call the method replaceAll without checking first.
BUT If you are not using extremely huge strings, I guess you will see no difference in perfomance at all.

Just another solution with time complexity O(n), space complexity O(n):
public static String removeComma(String str){
int length = str.length();
StringBuffer sb = new StringBuffer();
for (int i = 0; i < length; i++) {
char c = str.charAt(i);
if (c != ',') {
sb.append(c);
}
}
return sb.toString();
}

How can i extract specific terms from string lines in Java?

I have a serious problem with extracting terms from each string line. To be more specific, I have one csv formatted file which is actually not csv format (it saves all terms into line[0] only)
So, here's just example string line among thousands of string lines:
(split() doesn't work.!!! )
test.csv
"31451 CID005319044 　　15939353　　 C8H14O3S2 　　　beta-lipoic acid　　 C1C[S#](=O)S[C##H]1CCCCC(=O)O "
"12232 COD05374044 23439353　　C924O3S2 　　　saponin　　 CCCC(=O)O "
"9048 　 CTD042032　23241　　C3HO4O3S2　Berberine　 [C##H]1CCCCC(=O)O "
I want to extract "beta-lipoic acid" ,"saponin" and "Berberine" only which is located in 5th position.
You can see there are big spaces between terms, so that's why I said 5th position.
In this case, how can I extract terms located in 5th position for each line?
One more thing: the length of whitespace between each of the six terms is not always equal. the length could be one, two, three, four, or five, or something like that.
Because the length of whitespace is random, I can not use the .split() function.
For example, in the first line I would get "beta-lipoic" instead "beta-lipoic acid.**

Here is a solution for your problem using the string split and index of,
import java.util.ArrayList;
public class StringSplit {
public static void main(String[] args) {
String[] seperatedStr = null;
int fourthStrIndex = 0;
String modifiedStr = null, finalStr = null;
ArrayList<String> strList = new ArrayList<String>();
strList.add("31451 CID005319044 　　15939353　　 C8H14O3S2 beta-lipoic acid C1C[S#](=O)S[C##H]1CCCCC(=O)O ");
strList.add("12232 COD05374044 23439353 C924O3S2 saponin CCCC(=O)O ");
strList.add("9048 CTD042032 23241 C3HO4O3S2 Berberine [C##H]1CCCCC(=O)O ");
for (String item: strList) {
seperatedStr = item.split("\\s+");
fourthStrIndex = item.indexOf(seperatedStr[3]) + seperatedStr[3].length();
modifiedStr = item.substring(fourthStrIndex, item.length());
finalStr = modifiedStr.substring(0, modifiedStr.indexOf(seperatedStr[seperatedStr.length - 1]));
System.out.println(finalStr.trim());
}
}
}
Output:
beta-lipoic acid
saponin
Berberine

Option 1 : Use spring.split and check for multiple consecutive spaces. Like the code below:
String s[] = str.split("\\s\\s+");
for (String string : s) {
System.out.println(string);
}
Option 2 : Implement your own string split logic by browsing through all the characters. Sample code below (This code is just to give an idea. I didnot test this code.)
public static List<String> getData(String str) {
List<String> list = new ArrayList<>();
String s="";
int count=0;
for(char c : str.toCharArray()){
System.out.println(c);
if (c==' '){
count++;
}else {
s = s+c;
}
if(count>1&&!s.equalsIgnoreCase("")){
list.add(s);
count=0;
s="";
}
}
return list;
}

This would be a relatively easy fix if it weren't for beta-lipoic acid...
Assuming that only spaces/tabs/other whitespace separate terms, you could split on whitespace.
Pattern whitespace = Pattern.compile("\\s+");
String[] terms = whitespace.split(line); // Not 100% sure of syntax here...
// Your desired term should be index 4 of the terms array
While this would work for the majority of your terms, this would also result in you losing the "acid" in "beta-lipoic acid"...
Another hacky solution would be to add in a check for the 6th spot in the array produced by the above code and see if it matches English letters. If so, you can be reasonably confident that the 6th spot is actually part of the same term as the 5th spot, so you can then concatenate those together. This falls apart pretty quickly though if you have terms with >= 3 words. So something like
Pattern possibleEnglishWord = Pattern.compile([[a-zA-Z]*); // Can add dashes and such as needed
if (possibleEnglishWord.matches(line[5])) {
// return line[4].append(line[5]) or something like that
}
Another thing you can try is to replace all groups of spaces with a single space, and then remove everything that isn't made up of just english letters/dashes
line = whitespace.matcher(line).replaceAll("");
Pattern notEnglishWord = Pattern.compile("^[a-zA-Z]*"); // The syntax on this is almost certainly wrong
notEnglishWord.matcher(line).replaceAll("");
Then hopefully the only thing that is left would be the term you're looking for.
Hopefully this helps, but I do admit it's rather convoluted. One of the issues is that it appears that non-term words may have only one space between them, which would fool Option 1 as presented by Hirak... If that weren't the case that option should work.
Oh by the way, if you do end up doing this, put the Pattern declarations outside of any loops. They only need to be created once.

Splitting string N into N/X strings

I would like some guidance on how to split a string into N number of separate strings based on a arithmetical operation; for example string.length()/300.
I am aware of ways to do it with delimiters such as
testString.split(",");
but how does one uses greedy/reluctant/possessive quantifiers with the split method?
Update: As per request a similar example of what am looking to achieve;
String X = "32028783836295C75546F7272656E745C756E742E657865000032002E002E005C0"
Resulting in X/3 (more or less... done by hand)
X[0] = 32028783836295C75546F
X[1] = 6E745C756E742E6578650
x[2] = 65000032002E002E005C0
Dont worry about explaining how to put it into the array, I have no problem with that, only on how to split without using a delimiter, but an arithmetic operation

You could do that by splitting on (?<=\G.{5}) whereby the string aaaaabbbbbccccceeeeefff would be split into the following parts:
aaaaa
bbbbb
ccccc
eeeee
fff
The \G matches the (zero-width) position where the previous match occurred. Initially, \G starts at the beginning of the string. Note that by default the . meta char does not match line breaks, so if you want it to match every character, enable DOT-ALL: (?s)(?<=\G.{5}).
A demo:
class Main {
public static void main(String[] args) {
int N = 5;
String text = "aaaaabbbbbccccceeeeefff";
String[] tokens = text.split("(?<=\\G.{" + N + "})");
for(String t : tokens) {
System.out.println(t);
}
}
}
which can be tested online here: http://ideone.com/q6dVB
EDIT
Since you asked for documentation on regex, here are the specific tutorials for the topics the suggested regex contains:
\G, see: http://www.regular-expressions.info/continue.html
(?<=...), see: http://www.regular-expressions.info/lookaround.html
{...}, see: http://www.regular-expressions.info/repeat.html

If there's a fixed length that you want each String to be, you can use Guava's Splitter:
int length = string.length() / 300;
Iterable<String> splitStrings = Splitter.fixedLength(length).split(string);
Each String in splitStrings with the possible exception of the last will have a length of length. The last may have a length between 1 and length.
Note that unlike String.split, which first builds an ArrayList<String> and then uses toArray() on that to produce the final String[] result, Guava's Splitter is lazy and doesn't do anything with the input string when split is called. The actual splitting and returning of strings is done as you iterate through the resulting Iterable. This allows you to just iterate over the results without allocating a data structure and storing them all or to copy them into any kind of Collection you want without going through the intermediate ArrayList and String[]. Depending on what you want to do with the results, this can be considerably more efficient. It's also much more clear what you're doing than with a regex.

How about plain old String.substring? It's memory friendly (as it reuses the original char array).

well, I think this is probably as efficient a way to do this as any other.
int N=300;
int sublen = testString.length()/N;
String[] subs = new String[N];
for(int i=0; i<testString.length(); i+=sublen){
subs[i] = testString.substring(i,i+sublen);
}
You can do it faster if you need the items as a char[] array rather as individual Strings - depending on how you need to use the results - e.g. using testString.toCharArray()

Dunno, you'll probably need a method that takes string and int times and returns a list of strings. Pseudo code (haven't checked if it works or not):
public String[] splintInto(String splitString, int parts)
{
int dlength = splitString.length/parts
ArrayList<String> retVal = new ArrayList<String>()
for(i=0; i<splitString.length;i+=dlength)
{
retVal.add(splitString.substring(i,i+dlength)
}
return retVal.toArray()
}

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.