Remove all punctuation from the end of a string - java

Examples:
// A B C. -> A B C
// !A B C! -> !A B C
// A? B?? C??? -> A? B?? C
Here's what I have so far:
while (endsWithRegex(word, "\\p{P}")) {
word = word.substring(0, word.length() - 1);
}
public static boolean endsWithRegex(String word, String regex) {
return word != null && !word.isEmpty() &&
word.substring(word.length() - 1).replaceAll(regex, "").isEmpty();
}
This current solution works, but since it's already calling String.replaceAll within endsWithRegex, we should be able to do something like this:
word = word.replaceAll(/* regex */, "");
Any advice?

I suggest using
\s*\p{Punct}+\s*$
It will match optional whitespace and punctuation at the end of the string.
If you do not care about the whitespace, just use \p{Punct}+$.
Do not forget that in Java strings, backslashes should be doubled to denote literal backslashes (that must be used as regex escape symbols).
Java demo
String word = "!Words word! ";
word = word.replaceAll("\\s*\\p{Punct}+\\s*$", "");
System.out.println(word); // => !Words word

You can use:
str = str.replaceFirst("\\p{P}+$", "");
To include space also:
str = str.replaceFirst("[\\p{Space}\\p{P}]+$", "");

How about this, if you can take a minor hit in efficiency:
1. Reverse the input string.
2. Keep removing characters from the front until you hit a letter.
3. Reverse the string again and return it.
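The steps above could be sketched roughly as follows. The class and method names are made up for illustration, and the sketch assumes that digits as well as letters should stop the trimming; note that it also drops trailing whitespace, since whitespace is neither a letter nor a digit.

```java
final class TrailingPunctuation {

    // Hypothetical helper: strips everything after the last letter or digit.
    static String stripTrailing(String word) {
        if (word == null || word.isEmpty()) {
            return word;
        }
        // Step 1: reverse the input.
        StringBuilder reversed = new StringBuilder(word).reverse();
        // Step 2: skip leading characters until a letter or digit is found.
        int start = 0;
        while (start < reversed.length()
                && !Character.isLetterOrDigit(reversed.charAt(start))) {
            start++;
        }
        // Step 3: drop the skipped prefix and reverse back.
        return reversed.delete(0, start).reverse().toString();
    }
}
```

For example, `TrailingPunctuation.stripTrailing("A? B?? C???")` gives `A? B?? C`. Scanning backwards from the end with an index would avoid the two reversals.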

I have modified the logic of your method
public static boolean endsWithRegex(String word, String regex) {
return word != null && !word.isEmpty() && word.matches(regex);
}
and your regex is : regex = ".*[^a-zA-Z]$";

Related

Regex to consolidate multiple rules

I'm looking at optimising my string manipulation code and consolidating all of my replaceAll's to just one pattern if possible
Rules -
strip all special chars except -
replace space with -
condense consecutive - 's to just one -
Remove leading and trailing -'s
My code -
public static String slugifyTitle(String value) {
String slugifiedVal = null;
if (StringUtils.isNotEmpty(value))
slugifiedVal = value
.replaceAll("[ ](?=[ ])|[^-A-Za-z0-9 ]+", "") // strips all special chars except -
.replaceAll("\\s+", "-") // converts spaces to -
.replaceAll("--+", "-"); // replaces consecutive -'s with just one -
slugifiedVal = StringUtils.stripStart(slugifiedVal, "-"); // strips leading -
slugifiedVal = StringUtils.stripEnd(slugifiedVal, "-"); // strips trailing -
return slugifiedVal;
}
Does the job but obviously looks shoddy.
My test assertions -
Heading with symbols *~!##$%^&()_+-=[]{};',.<>?/ ==> heading-with-symbols
Heading with an asterisk* ==> heading-with-an-asterisk
Custom-id-&-stuff ==> custom-id-stuff
--Custom-id-&-stuff-- ==> custom-id-stuff
Disclaimer: I don't think a regex approach to this problem is wrong, or that this is an objectively better approach. I am merely presenting an alternative approach as food for thought.
I have a tendency against regex approaches to problems where you have to ask how to solve with regex, because that implies you're going to struggle to maintain that solution in the future. There is an opacity to regexes where "just do this" is obvious, when you know just to do this.
Some problems typically solved with regex, like this one, can be solved using imperative code. It tends to be more verbose, but it uses simple, apparent, code constructs; it's easier to debug; and can be faster because it doesn't involve the full "machinery" of the regex engine.
static String slugifyTitle(String value) {
boolean appendHyphen = false;
StringBuilder sb = new StringBuilder(value.length());
// Go through value one character at a time...
for (int i = 0; i < value.length(); i++) {
char c = value.charAt(i);
if (isAppendable(c)) {
// We have found a character we want to include in the string.
if (appendHyphen) {
// We previously found character(s) that we want to append a single
// hyphen for.
sb.append('-');
appendHyphen = false;
}
sb.append(c);
} else if (requiresHyphen(c)) {
// We want to replace hyphens or spaces with a single hyphen.
// Only append a hyphen if it's not going to be the first thing in the output.
// Doesn't matter if this is set for trailing hyphen/whitespace,
// since we then never hit the "isAppendable" condition.
appendHyphen = sb.length() > 0;
} else {
// Other characters are simply ignored.
}
}
// You can lowercase when appending the character, but `Character.toLowerCase()`
// recommends using `String.toLowerCase` instead.
return sb.toString().toLowerCase(Locale.ROOT);
}
// Some predicate on characters you want to include in the output.
static boolean isAppendable(char c) {
return (c >= 'A' && c <= 'Z')
|| (c >= 'a' && c <= 'z')
|| (c >= '0' && c <= '9');
}
// Some predicate on characters you want to replace with a single '-'.
static boolean requiresHyphen(char c) {
return c == '-' || Character.isWhitespace(c);
}
(This code is wildly over-commented, for the purpose of explaining it in this answer. Strip out the comments and unnecessary things like the else, it's actually not super complicated).
Consider the following regex parts:
Any special chars other than -: [\p{S}\p{P}&&[^-]]+ (character class subtraction)
Any one or more whitespace characters or hyphens: [\s-]+ (this will be used to replace with a single -)
You will still need to remove leading/trailing hyphens, it will be a separate post-processing step. If you wish, you can use a ^-+|-+$ regex.
So, you can only reduce this to three .replaceAll invocations keeping the code precise and readable:
public static String slugifyTitle(String value) {
String slugifiedVal = null;
if (value != null && !value.trim().isEmpty())
slugifiedVal = value.toLowerCase()
.replaceAll("[\\p{S}\\p{P}&&[^-]]+", "") // strips all special chars except -
.replaceAll("[\\s-]+", "-") // converts spaces/hyphens to -
.replaceAll("^-+|-+$", ""); // remove trailing/leading hyphens
return slugifiedVal;
}
See the Java demo:
List<String> strs = Arrays.asList("Heading with symbols *~!##$%^&()_+-=[]{};',.<>?/",
"Heading with an asterisk*",
"Custom-id-&-stuff",
"--Custom-id-&-stuff--");
for (String str : strs) {
System.out.println("\"" + str + "\" => " + slugifyTitle(str));
}
Output:
"Heading with symbols *~!##$%^&()_+-=[]{};',.<>?/" => heading-with-symbols
"Heading with an asterisk*" => heading-with-an-asterisk
"Custom-id-&-stuff" => custom-id-stuff
"--Custom-id-&-stuff--" => custom-id-stuff
NOTE: if your strings can contain any Unicode whitespace, replace "[\\s-]+" with "(?U)[\\s-]+".
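To illustrate the note, here is a small sketch comparing the two patterns on a string containing a no-break space (U+00A0), which the default \s does not match. The class and method names are hypothetical.

```java
final class UnicodeWhitespaceDemo {

    // Default \s: only ASCII whitespace is matched.
    static String asciiOnly(String s) {
        return s.replaceAll("[\\s-]+", "-");
    }

    // (?U) enables UNICODE_CHARACTER_CLASS, so \s matches Unicode whitespace too.
    static String unicodeAware(String s) {
        return s.replaceAll("(?U)[\\s-]+", "-");
    }
}
```

With the input `"a\u00A0b"`, `asciiOnly` leaves the string unchanged while `unicodeAware` returns `a-b`.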

Java: How to replace consecutive characters with a single character?

How can I replace consecutive characters with a single character in java?
String fileContent = "def mnop.UVW";
String oldDelimiters = " .";
String newDelimiter = "!";
for (int i = 0; i < oldDelimiters.length(); i++){
Character character = oldDelimiters.charAt(i);
fileContent = fileContent.replace(String.valueOf(character), newDelimiter);
}
Current output: def!!mnop!UVW
Desired output: def!mnop!UVW
Notice the two spaces are replaced with two exclamation marks. How can I replace consecutive delimiters with one delimiter?
Since you want to match consecutive characters from the old delimiter, a regex solution doesn't seem to be feasible here. You can instead match char by char if it belongs to one of the old delimiter chars and then set it with the new one as shown below.
import java.util.*;
public class Main{
public static void main(String[] args) {
String fileContent = "def mnop.UVW";
String oldDelimiters = " .";
// add all old delimiters in a set for fast checks
Set<Character> set = new HashSet<>();
for(int i=0;i<oldDelimiters.length();++i) set.add(oldDelimiters.charAt(i));
/*
match all consecutive chars at once, check if it belongs to an old delimiter
and replace it with the new one
*/
String newDelimiter = "!";
StringBuilder res = new StringBuilder("");
for(int i=0;i<fileContent.length();++i){
if(set.contains(fileContent.charAt(i))){
while(i + 1 < fileContent.length() && fileContent.charAt(i) == fileContent.charAt(i+1)) i++;
res.append(newDelimiter);
}else{
res.append(fileContent.charAt(i));
}
}
System.out.println(res.toString());
}
}
Demo: https://onlinegdb.com/r1BC6qKP8
s = s.replaceAll("([ \\.])[ \\.]+", "$1");
Or if only several same delimiters have to be replaced:
s = s.replaceAll("([ \\.])\\1+", "$1");
[....] is a character class (a set of alternative characters)
First (...) is group 1, $1
\\1 is the text of the first group
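The difference between the two patterns can be seen in a small sketch (class and method names are made up for illustration). Both keep the first delimiter of a run rather than substituting a new one.

```java
final class CollapseDemo {

    // Collapses any run of mixed delimiters (space or dot) to its first character.
    static String collapseAnyRun(String s) {
        return s.replaceAll("([ \\.])[ \\.]+", "$1");
    }

    // Collapses only runs of the SAME delimiter, using the backreference \1.
    static String collapseSameRun(String s) {
        return s.replaceAll("([ \\.])\\1+", "$1");
    }
}
```

For an input of `a`, two spaces, two dots, `b`, the first method yields `a b` (the whole run collapses to the leading space), while the second yields `a .b` (each run collapses separately).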
While not using regex, I thought a solution with Streams was needed, because everyone loves streams:
private static class StatefulFilter implements Predicate<String> {
private final String needle;
private String last = null;
public StatefulFilter(String needle) {
this.needle = needle;
}
@Override
public boolean test(String value) {
boolean duplicate = last != null && last.equals(value) && value.equals(needle);
last = value;
return !duplicate;
}
}
public static void main(String[] args) {
System.out.println(
"def mnop.UVW"
.codePoints()
.sequential()
.mapToObj(c -> String.valueOf((char) c))
.filter(new StatefulFilter(" "))
.map(x -> x.equals(" ") ? "!" : x)
.collect(Collectors.joining(""))
);
}
Runnable example: https://onlinegdb.com/BkY0R2twU
Explanation:
Theoretically, you aren't really supposed to have a stateful filter, but technically, as long as the stream is not parallelized, it works fine:
.codePoints() - splits the String into a Stream
.sequential() - since we care about the order of characters, our Stream may not be processed in parallel
.mapToObj(c -> String.valueOf((char) c)) - the comparison in the filter is more intuitive if we convert to String, but it's not really needed
.filter(new StatefulFilter(" ")) - here we filter out any space that comes after another space
.map(x -> x.equals(" ") ? "!" : x) - now we can replace the remaining spaces with exclamation marks
.collect(Collectors.joining("")) - and finally we can join the characters together to reconstitute a String
The StatefulFilter itself is pretty straightforward: it checks a) whether we have a previous character at all, b) whether the previous character is the same as the current character and c) whether the current character is the delimiter (space). It returns false (meaning the character gets deleted) only if all of a, b and c are true.
The biggest difficulty to using a regex for this, is to create an expression from your oldDelimiters string. For example:
String oldDelimiters = " .";
String expression = "\\" + String.join("+|\\", oldDelimiters.split("")) + "+";
String text = "def mnop.UVW;abc .df";
String result = text.replaceAll(expression, "!");
(Edit: since characters in the expression are now escaped anyway, I removed the character classes and edited the following text to reflect that change.)
Where the generated expression looks like \ +|\.+, i.e. each character is quantified and constitutes one alternative of the expression. The engine will match and replace one alternative at a time if it can be matched. result now contains:
def!mnop!UVW;abc!!df
Not sure how backwards compatible this is, because split() behaved differently in older versions of Java (splitting on the empty string produced a leading empty string), but with current versions this should be fine.
Edit: As it is, this breaks if the delimiter characters include digits or letters, because prefixing them with a backslash turns them into regex tokens (e.g. \1 becomes a backreference and \b a word boundary).
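One possible way around that limitation is to quote each delimiter with Pattern.quote instead of prefixing a backslash. The following is only a sketch under that assumption; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class DelimiterCollapser {

    // Builds one alternative per delimiter, e.g. (?:\Q \E)+|(?:\Q.\E)+,
    // so that a run of the same delimiter is replaced by a single newDelimiter.
    static String collapse(String text, String oldDelimiters, String newDelimiter) {
        String expression = oldDelimiters.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .map(ch -> "(?:" + Pattern.quote(ch) + ")+")
                .collect(Collectors.joining("|"));
        // quoteReplacement keeps '$' and '\' in the new delimiter literal.
        return text.replaceAll(expression, Matcher.quoteReplacement(newDelimiter));
    }
}
```

With this version, delimiters such as `1` or `b` are treated as literal characters, so `collapse("a11b", "1b", "-")` gives `a--`.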

Regex to find last letter in word for any String

I have a small problem with Regular Expressions ( regex ).
I want to remove any "T" at the end of each word in the string.
This is the code I am using to display all the words ending with "T".
public static void main(String[] args) {
String name = "PHYLAURHEIMSMET hello tttttyyuolktttb fedqs jhgjt";
Pattern p = Pattern.compile("([a-z0-9]+)?[t]");
Matcher m = p.matcher(name);
while (m.find()) {
System.out.println(m.group());
}
}
Thank you for all your help.
I am trying to remove any "T" at the end of each word
string.replaceAll("T(?!\\S)", "");
This matches every T that appears at the end of a word.
OR
string.replaceAll("T(?=\\s|$)", "");
This matches a T only if it is followed by a whitespace character or the end-of-line anchor.
To do a case-insensitive replacement:
string.replaceAll("(?i)T(?!\\S)", "");
I would use the \b word-boundary to test if the letter is at the end of the word, you can see the result of replacing t\b with an empty string on this regex101.
I'm not sure if you want to only remove upper case T's, in that case, remove the i all the way to the right.
In Java that would be:
string.replaceAll("T\\b", ""); // <-- only upper case T's
string.replaceAll("(?i)T\\b", ""); // <-- both t and T's
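As a quick check of the word-boundary approach, here is a small sketch run against the string from the question. The class and method names are invented for illustration. Note that in a Java string literal the backslash has to be doubled ("\\b"), because a single "\b" is the backspace character.

```java
final class TrailingT {

    // Removes an upper-case T that sits directly before a word boundary.
    static String removeUpperCaseT(String s) {
        return s.replaceAll("T\\b", "");
    }

    // Same, but case-insensitive via the inline (?i) flag.
    static String removeAnyCaseT(String s) {
        return s.replaceAll("(?i)t\\b", "");
    }
}
```

For `"PHYLAURHEIMSMET hello tttttyyuolktttb fedqs jhgjt"`, the first method returns `PHYLAURHEIMSME hello tttttyyuolktttb fedqs jhgjt` and the second returns `PHYLAURHEIMSME hello tttttyyuolktttb fedqs jhgj`. Only one trailing T per word is removed.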
This is just a suggestion: why not use the String class's built-in methods, something like this (note that it only checks the last character of the whole string):
String name = "PHYLAURHEIMSMET hello tttttyyuolktttb fedqs jhgjt";
if ('t' == name.charAt(name.length() - 1) || 'T' == name.charAt(name.length() - 1)) {
System.out.println("contains last char 't' or 'T'");
}

Return a substring which is almost the same as the original String

I have a String literal that contains a sequence of [a-z] characters followed by a digit character. I want to create a new String literal whose contents will be the same as the old String except the last digit character. How can I do this in the most optimal way in Java?
eg:
String str = "sometext2";
String newString = "sometext5"; //The digit part is dynamic and I have that value already computed
return newString;
You could try this regex
\d*$ // 0 or more digits at the end of the string
Example:
@Test
public void replaceTrailingDigits() {
String str = "sometext2".replaceFirst("\\d*$", Integer.toString(5));
Assert.assertEquals("sometext5", str);
str = "sometext226782".replaceFirst("\\d*$", Integer.toString(897623));
Assert.assertEquals("sometext897623", str);
str = "sometext".replaceFirst("\\d*$", Integer.toString(4));
Assert.assertEquals("sometext4", str);
}
As you can see in the 3rd test, the regexp allows the new number to be appended if the original str does not have any trailing digits. If you want to prevent that, you could change the multiplicity to one or more, i.e. \d+$
Try this
StringBuilder sb = new StringBuilder();
String newString = sb.append(str.substring(0, str.length() - 1)).append(digit).toString();
if(str != null && !str.isEmpty())
return (str.substring(0, str.length() - 1)).concat(YOUR_OTHER_DIGIT);
else
return str; // Handle appropriately what action you want if str is null or empty
If it is only one digit, use substrings
return str.substring(0, str.length() - 1) + nextDigit;
If there are several digits, use a regex
return str.replaceAll("[0-9]+", "" + nextNumber);
If this is in a tight loop, reuse the regex instead of using replaceAll
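Reusing the regex could look roughly like this: the pattern is compiled once and only the matcher is created per call. The class, constant, and method names are made up, and anchoring the pattern with $ is an added assumption so that only trailing digits are touched.

```java
import java.util.regex.Pattern;

final class TrailingDigits {

    // Compiled once and shared; Pattern instances are immutable and thread-safe.
    private static final Pattern TRAILING_DIGITS = Pattern.compile("[0-9]+$");

    // Replaces the trailing run of digits, if any, with nextNumber.
    static String replaceTrailingDigits(String str, int nextNumber) {
        return TRAILING_DIGITS.matcher(str).replaceFirst(Integer.toString(nextNumber));
    }
}
```

For example, `TrailingDigits.replaceTrailingDigits("sometext2", 5)` returns `sometext5`, and a string without trailing digits is returned unchanged.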
If it is only one digit, you can do that:
return originalString.substring(0, originalString.length()-1) + newDigit;

Regular expression troubles, escaped quotes

Basically, I'm being passed a string and I need to tokenise it in much the same manner as command line options are tokenised by a *nix shell
Say I have the following string
"Hello\" World" "Hello Universe" Hi
How could I turn it into a 3 element list
Hello" World
Hello Universe
Hi
The following is my first attempt, but it's got a number of problems
It leaves the quote characters
It doesn't catch the escaped quote
Code:
public void test() {
String str = "\"Hello\\\" World\" \"Hello Universe\" Hi";
List<String> list = split(str);
}
public static List<String> split(String str) {
Pattern pattern = Pattern.compile(
"\"[^\"]*\"" + /* double quoted token*/
"|'[^']*'" + /*single quoted token*/
"|[A-Za-z']+" /*everything else*/
);
List<String> opts = new ArrayList<String>();
Scanner scanner = new Scanner(str).useDelimiter(pattern);
String token;
while ((token = scanner.findInLine(pattern)) != null) {
opts.add(token);
}
return opts;
}
So the incorrect output of the following code is
"Hello\"
World
" "
Hello
Universe
Hi
EDIT I'm totally open to a non regex solution. It's just the first solution that came to mind
If you decide you want to forego regex, and do parsing instead, there are a couple of options. If you are willing to have just a double quote or a single quote (but not both) as your quote, then you can use StreamTokenizer to solve this easily:
public static List<String> tokenize(String s) throws IOException {
List<String> opts = new ArrayList<String>();
StreamTokenizer st = new StreamTokenizer(new StringReader(s));
st.quoteChar('\"');
while (st.nextToken() != StreamTokenizer.TT_EOF) {
opts.add(st.sval);
}
return opts;
}
If you must support both quotes, here is a naive implementation that should work (caveat that a string like '"blah \" blah"blah' will yield something like 'blah " blahblah'. If that isn't OK, you will need to make some changes):
public static List<String> splitSSV(String in) throws IOException {
ArrayList<String> out = new ArrayList<String>();
StringReader r = new StringReader(in);
StringBuilder b = new StringBuilder();
int inQuote = -1;
boolean escape = false;
int c;
// read each character
while ((c = r.read()) != -1) {
if (escape) { // if the previous char is escape, add the current char
b.append((char)c);
escape = false;
continue;
}
switch (c) {
case '\\': // deal with escape char
escape = true;
break;
case '\"':
case '\'': // deal with quote chars
if (c == '\"' || c == '\'') {
if (inQuote == -1) { // not in a quote
inQuote = c; // now we are
} else {
inQuote = -1; // we were in a quote and now we aren't
}
}
break;
case ' ':
if (inQuote == -1) { // if we aren't in a quote, then add token to list
out.add(b.toString());
b.setLength(0);
} else {
b.append((char)c); // else append space to current token
}
break;
default:
b.append((char)c); // append all other chars to current token
}
}
if (b.length() > 0) {
out.add(b.toString()); // add final token to list
}
return out;
}
I'm pretty sure you can't do this by just tokenising on a regex. If you need to deal with nested and escaped delimiters, you need to write a parser. See e.g. http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html
There will be open source parsers which can do what you want, although I don't know any. You should also check out the StreamTokenizer class.
To recap, you want to split on whitespace, except when surrounded by double quotes, which are not preceded by a backslash.
Step 1: tokenize the input: /([ \t]+)|(\\")|(")|([^ \t"]+)/
This gives you a sequence of SPACE, ESCAPED_QUOTE, QUOTE and TEXT tokens.
Step 2: build a finite state machine matching and reacting to the tokens:
State: START
SPACE -> return empty string
ESCAPED_QUOTE -> Error (?)
QUOTE -> State := WITHIN_QUOTES
TEXT -> return text
State: WITHIN_QUOTES
SPACE -> add value to accumulator
ESCAPED_QUOTE -> add quote to accumulator
QUOTE -> return and clear accumulator; State := START
TEXT -> add text to accumulator
Step 3: Profit!!
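The two steps above might be sketched in Java as follows. This is an illustration under a few assumptions rather than a definitive implementation: the class and method names are invented, the TEXT token is narrowed so that it stops before a backslash-quote pair (otherwise the backslash would be swallowed by TEXT), SPACE outside quotes is skipped instead of returning an empty string, and an ESCAPED_QUOTE outside quotes or an unbalanced quote is reported as an error.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuoteAwareSplitter {

    // Groups: 1 = SPACE, 2 = ESCAPED_QUOTE, 3 = QUOTE, 4 = TEXT.
    private static final Pattern TOKEN = Pattern.compile(
            "([ \\t]+)|(\\\\\")|(\")|((?:[^ \\t\"\\\\]|\\\\(?!\"))+)");

    static List<String> split(String input) {
        List<String> result = new ArrayList<>();
        StringBuilder accumulator = new StringBuilder();
        boolean withinQuotes = false; // false = START, true = WITHIN_QUOTES
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (!withinQuotes) {
                if (m.group(1) != null) {
                    continue; // SPACE separates tokens (assumption: skipped)
                } else if (m.group(2) != null) {
                    throw new IllegalArgumentException("escaped quote outside quotes");
                } else if (m.group(3) != null) {
                    withinQuotes = true; // QUOTE opens a quoted token
                } else {
                    result.add(m.group(4)); // TEXT is a token on its own
                }
            } else {
                if (m.group(1) != null) {
                    accumulator.append(m.group(1)); // SPACE is kept inside quotes
                } else if (m.group(2) != null) {
                    accumulator.append('"'); // ESCAPED_QUOTE becomes a literal quote
                } else if (m.group(3) != null) {
                    result.add(accumulator.toString()); // closing QUOTE emits the token
                    accumulator.setLength(0);
                    withinQuotes = false;
                } else {
                    accumulator.append(m.group(4)); // TEXT is accumulated
                }
            }
        }
        if (withinQuotes) {
            throw new IllegalArgumentException("unbalanced quote");
        }
        return result;
    }
}
```

For the input from the question, `"Hello\" World" "Hello Universe" Hi`, this produces the three elements `Hello" World`, `Hello Universe` and `Hi`.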
I think if you use pattern like this:
Pattern pattern = Pattern.compile("\".*?(?<!\\\\)\"|'.*?(?<!\\\\)'|[A-Za-z']+");
Then it will give you desired output. When I ran with your input data I got this list:
["Hello\" World", "Hello Universe", Hi]
I used [A-Za-z']+ from your own question but shouldn't it be just : [A-Za-z]+
EDIT
Change your opts.add(token); line to:
opts.add(token.replaceAll("^\"|\"$|^'|'$", ""));
The first thing you need to do is stop thinking of the job in terms of split(). split() is meant for breaking down simple strings like this/that/the other, where / is always a delimiter. But you're trying to split on whitespace, unless the whitespace is within quotes, except if the quotes are escaped with backslashes (and if backslashes escape quotes, they probably escape other things, like other backslashes).
With all those exceptions-to-exceptions, it's just not possible to create a regex to match all possible delimiters, not even with fancy gimmicks like lookarounds, conditionals, reluctant and possessive quantifiers. What you want to do is match the tokens, not the delimiters.
In the following code, a token that's enclosed in double-quotes or single-quotes may contain whitespace as well as the quote character if it's preceded by a backslash. Everything except the enclosing quotes is captured in group #1 (for double-quoted tokens) or group #2 (single-quoted). Any character may be escaped with a backslash, even in non-quoted tokens; the "escaping" backslashes are removed in a separate step.
public static void test()
{
String str = "\"Hello\\\" World\" 'Hello Universe' Hi";
List<String> commands = parseCommands(str);
for (String s : commands)
{
System.out.println(s);
}
}
public static List<String> parseCommands(String s)
{
String rgx = "\"((?:[^\"\\\\]++|\\\\.)*+)\"" // double-quoted
+ "|'((?:[^'\\\\]++|\\\\.)*+)'" // single-quoted
+ "|\\S+"; // not quoted
Pattern p = Pattern.compile(rgx);
Matcher m = p.matcher(s);
List<String> commands = new ArrayList<String>();
while (m.find())
{
String cmd = m.start(1) != -1 ? m.group(1) // strip double-quotes
: m.start(2) != -1 ? m.group(2) // strip single-quotes
: m.group();
cmd = cmd.replaceAll("\\\\(.)", "$1"); // remove escape characters
commands.add(cmd);
}
return commands;
}
output:
Hello" World
Hello Universe
Hi
This is about as simple as it gets for a regex-based solution--and it doesn't really deal with malformed input, like unbalanced quotes. If you're not fluent in regexes, you might be better off with a purely hand-coded solution or, even better, a dedicated command-line interpreter (CLI) library.
