Java regular expression for multiple spaces

Can I write a single line regular expression instead of the five lines below?
strTestIn = strTestIn.replaceAll("^\\s+", "");
strTestIn = strTestIn.replaceAll("[ ]+", " ");
strTestIn = strTestIn.replaceAll("(\\r\\n)+", "\r\n");
strTestIn = strTestIn.replaceAll("(\\t)+", " ");
strTestIn = strTestIn.replaceAll("\\s+$", "");
What's the difference between these regular expressions?

strTestIn = strTestIn.replaceAll("^\\s+", "");
removes whitespace at the start of the string.
strTestIn = strTestIn.replaceAll("\\s+$", "");
removes whitespace at the end of the string.
strTestIn = strTestIn.replaceAll("[ ]+", " ");
condenses multiple spaces into a single space.
strTestIn = strTestIn.replaceAll("(\\r\\n)+", "\r\n");
removes empty lines by replacing adjacent newlines with a single newline.
strTestIn = strTestIn.replaceAll("(\\t)+", " ");
condenses tabs into a single space.
So they all do different things. A combination is possible for those that have the same replacement string:
strTestIn = strTestIn.replaceAll("^\\s+|\\s+$", "");
strTestIn = strTestIn.replaceAll(" {2,}|\t+", " ");
strTestIn = strTestIn.replaceAll("(\r\n)+", "\r\n");
The combined versions also tidy the regexes a bit: some unnecessary backslashes are gone, and the space pattern now requires at least two spaces, so a single space is no longer replaced by itself.
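As a rough check, the three combined calls can be wrapped in one helper and run on a small input. The names WhitespaceCleanup and normalize, and the sample string, are made up for illustration and are not from the question.

```java
final class WhitespaceCleanup {

    // Hypothetical helper: applies the three combined replacements in order.
    static String normalize(String s) {
        s = s.replaceAll("^\\s+|\\s+$", "");  // trim whitespace at both ends
        s = s.replaceAll(" {2,}|\t+", " ");   // a run of spaces, or a run of tabs, becomes one space
        s = s.replaceAll("(\r\n)+", "\r\n");  // consecutive CRLF pairs become one
        return s;
    }

    public static void main(String[] args) {
        String in = "  a   b\t\tc\r\n\r\n\r\nd  ";
        // Make CR and LF visible in the printed result.
        System.out.println(normalize(in).replace("\r", "\\r").replace("\n", "\\n"));
        // prints a b c\r\nd
    }
}
```

Note that a space directly next to a tab still ends up as two spaces, which is also what the original five lines produce.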

Related

Java: how to replace space and \r in string with delimiter value?

How to replace space and \r in string with delimiter value?
String oldDelimiter = " ";
String newDelimiter = "o";
String fileContent = "try some **random** %!(chars)!% ##\r" +
"or line break$# \r" +
":(";
fileContent = fileContent.replaceAll("[" + oldDelimiter + "]+", newDelimiter);
fileContent = fileContent.replaceAll("\r", newDelimiter);
current output: tryosomeo**random**o%!(chars)!%o##oorolineobreak$#oo:(
desired output: tryosomeo**random**o%!(chars)!%o##oorolineobreak$#o:(
Notice the extra letter o towards the end of the string after the # symbol.
Update: it should only replace with o if there is a space or \r. However, if a space and \r are next to each other, then only one o delimiter should replace them.
Like Andreas said, use the character class "[ \r]+" to replace each run of \r and " " characters with a single delimiter.
fileContent = fileContent.replaceAll("[ \r]+", "o");
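If the delimiters come from variables, as in the question, they may contain regex metacharacters. A minimal sketch of one way to guard against that, assuming a non-empty old delimiter; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DelimiterSwap {

    // Hypothetical helper: each run of oldDelimiter and/or \r becomes one newDelimiter.
    // Assumes oldDelimiter is not empty.
    static String swap(String content, String oldDelimiter, String newDelimiter) {
        // Pattern.quote keeps a delimiter such as "." or "|" from being read as regex syntax.
        String regex = "(?:" + Pattern.quote(oldDelimiter) + "|\r)+";
        // quoteReplacement keeps "$" and "\" in the new delimiter literal.
        return content.replaceAll(regex, Matcher.quoteReplacement(newDelimiter));
    }

    public static void main(String[] args) {
        String fileContent = "try some **random** %!(chars)!% ##\r"
                + "or line break$# \r"
                + ":(";
        System.out.println(swap(fileContent, " ", "o"));
        // prints tryosomeo**random**o%!(chars)!%o##oorolineobreak$#o:(
    }
}
```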

Java regular expression: new line with forward slash in split

I want to match the text below with a regular expression.
/
--//
or it can be
--//
I tried:
public static final String DELIMITER = "/\\n--//|-//";
ddl.addAll(..).split(DELIMITER)));
and other combinations, but nothing works.
I am running this on Windows.
You don't need to escape so much. You are missing a second newline in your delimiter:
String newline = System.getProperty("line.separator");
String text = "before" + "/" + newline + newline + "--//" + "after";
System.out.println(text);
String delimiter = "/" + newline + newline + "--//";
String[] parts = text.split(delimiter);
System.out.println(parts[0]); // prints "before"
System.out.println(parts[1]); // prints "after"
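If the line endings of the input are not known in advance, one option is the \R linebreak matcher (available since Java 8), which accepts \r\n, \n and \r alike. This is a sketch under the assumption that the delimiter is a slash, one or more line breaks, then --//; DdlSplitter and splitStatements are hypothetical names.

```java
import java.util.Arrays;

final class DdlSplitter {

    // Assumed delimiter shape: "/", one or more line breaks of any style, then "--//".
    static final String DELIMITER = "/\\R+--//";

    static String[] splitStatements(String text) {
        return text.split(DELIMITER);
    }

    public static void main(String[] args) {
        String windows = "before/\r\n\r\n--//after";
        String unix = "before/\n--//after";
        System.out.println(Arrays.toString(splitStatements(windows))); // prints [before, after]
        System.out.println(Arrays.toString(splitStatements(unix)));    // prints [before, after]
    }
}
```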

Cleaning a text with Regular expression

I am reading a text file and I want to extract the correct tokens from the text, but I have a problem with the dot at the end of sentences. My code is below; query is the input string:
query = query.replaceAll("[^\\p{L}\\s0-9-_/.]", "");
query = query.replaceAll("\t", " ");
query = query.replaceAll("\r", " ");
query = query.replaceAll("\n", " ");
StringTokenizer tokens = new StringTokenizer(query, " ");
while (tokens.hasMoreTokens()) {
    String str = tokens.nextToken();
    String regex = "\\d+.\\d+";
    if (!str.matches(regex)) // <- second problem
        System.out.println(str);
}
For example; the Input text is the following line
THE WORLD OF UNIQUE VENDING CARTS. fy_lkaris#yahoo.com www.ubc_lib?9867.come/homepage 876454 9890-9999-9099.
I want the following string as output
THE WORLD OF UNIQUE VENDING CARTS
fy_lkaris#yahoo.com
www.ubc_lib?9867.come/homepage
9890-9999-9099
But my actual output has a dot at the end of the first and last lines.
I cannot simply delete the dot (.), since that would remove it everywhere.
THE
WORLD
OF
UNIQUE
VENDING
CARTS.ff_lashkariyahoo.com *<-problem*
www.unb_lib9867.come/homepage
9890-9999-9099. *<-problem*
Also, I want to delete only numbers like 4,764,90.900, not ones like 76-098-098, and I could not find anything better than using the matches function. Is there a way to solve this problem too?
Could you please help me?
One problem is the unescaped hyphen in the middle of the character class. A hyphen is safe to leave unescaped only when it is the first or last character inside the class.
Use this:
query = query.replaceAll("[^\\p{L}\\s0-9_/.-]", "");
Between two single characters a hyphen acts as a range operator: [9-_], for example, spans everything from the digit 9 (ASCII 57) to the underscore (ASCII 95). Keeping the hyphen last rules out any such reading.
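The range behaviour is easy to check in isolation. The sketch below uses a small two-character class, [9-_], rather than the full class from the question; the class name is hypothetical.

```java
final class HyphenInClass {

    public static void main(String[] args) {
        // Between two characters the hyphen forms a range: [9-_] covers code points 57..95.
        System.out.println("?".matches("[9-_]")); // true: '?' is 63, inside the range
        System.out.println("-".matches("[9-_]")); // false: '-' is 45, outside the range

        // As the last character of the class the hyphen is a literal.
        System.out.println("?".matches("[9_-]")); // false
        System.out.println("-".matches("[9_-]")); // true
    }
}
```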
I found a way to solve my problems. I changed my code to the following, and it works.
query = query.replaceAll("[^\\p{L}\\s0-9-_/.#]", "");
query = query.replaceAll("\t", " ");
query = query.replaceAll("\r", " ");
query = query.replaceAll("\n", " ");
StringTokenizer tokens = new StringTokenizer(query, " ");
while (tokens.hasMoreTokens()) {
    String str = tokens.nextToken();
    str = str.replaceAll("\\.\\B", " "); // <- new line
    String regex = "\\d+.\\d+";
    if (!str.matches(regex)) // <- second problem
        System.out.println(str);
}
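Another possible way to handle both problems is to strip dots only at the end of each token and to test for numbers with an escaped dot. This is a sketch under assumptions, not the asker's method: '#' and '?' are allowed because they appear in the desired output, a "number" is taken to mean digits with an optional decimal part, and the names TokenCleaner and clean are hypothetical. It emits one token per entry rather than rejoining the sentence.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenCleaner {

    static List<String> clean(String query) {
        // Keep letters, whitespace, digits and a few symbols; the hyphen is last so it stays literal.
        // Allowing '#' and '?' is an assumption based on the desired output.
        query = query.replaceAll("[^\\p{L}\\s0-9_/.#?-]", "");

        List<String> result = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            token = token.replaceAll("\\.+$", "");          // drop dots only at the end of a token
            if (token.isEmpty()) continue;                   // nothing left, e.g. empty input
            if (token.matches("\\d+(\\.\\d+)?")) continue;   // skip plain numbers such as 876454
            result.add(token);
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "THE WORLD OF UNIQUE VENDING CARTS. fy_lkaris#yahoo.com "
                + "www.ubc_lib?9867.come/homepage 876454 9890-9999-9099.";
        clean(input).forEach(System.out::println);
    }
}
```

Because commas are removed by the first replacement, a token like 4,764,90.900 becomes 476490.900 and is skipped as a number, while 76-098-098 is kept.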

How to replace multiple spaces and newlines with one blank line

How can I remove multiple spaces and newlines in a string, but preserve at least one blank line for each group of blank lines?
For example, change:
"This is
a string.
Something."
to
"This is
a string.
Something."
I'm using .trim() to strip whitespace from the beginning and end of a string, but I couldn't find anything for removing multiple spaces and newlines in a string.
I would like to keep just one whitespace and one newline.
The one-line solution to remove multiple spaces/newlines, but preserve at least one blank line from multiple blank lines:
str = str.replaceAll("(?m)(^ *| +(?= |$))", "").replaceAll("(?m)^$([\r\n]+?)(^$[\r\n]+?^)+", "$1");
Each individual line is trimmed too.
Here's some test code:
String str = " This is\r\n " +
"\r\n" +
" \r\n " +
" \r \n \n " +
"\r\n" +
" a string. ";
str = str.trim().replaceAll("(?m)(^ *| +(?= |$))", "").replaceAll("(?m)^$([\r\n]+?)(^$[\r\n]+?^)+", "$1");
System.out.println(str);
Output:
This is
a string.
The previous advice will trim all whitespace, including the linefeeds, and replace it with a single space.
text = text.replaceAll("\\n\\s*\\n", "\n").replaceAll("[ \\t\\x0B\\f]+", " ").trim();
First it replaces any run of linefeeds that have only whitespace between them with a single linefeed; then it reduces every other run of whitespace to a single space, leaving the linefeeds alone.
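A small sketch of the two-step approach just described, with a hypothetical class and method name. The replacement is written as a real linefeed ("\n"), because in a replaceAll replacement string a backslash followed by n is read as an escaped letter n.

```java
final class BlankLineCollapse {

    static String collapse(String text) {
        return text
                .replaceAll("\\n\\s*\\n", "\n")      // linefeeds separated only by whitespace -> one linefeed
                .replaceAll("[ \\t\\x0B\\f]+", " ")  // other whitespace runs -> one space
                .trim();
    }

    public static void main(String[] args) {
        // Prints "This is" and "a string." on two consecutive lines.
        System.out.println(collapse("This   is\n\n\n\na  string."));

        // With a backslash-n replacement the linefeeds turn into the letter n.
        System.out.println("a\n\nb".replaceAll("\\n\\s*\\n", "\\n")); // prints anb
    }
}
```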
Here is what I came up with after a bit of testing...
public String keepOneWS(String str) {
Pattern p = Pattern.compile("(\\s+)");
Matcher m = p.matcher(str);
Pattern pBlank = Pattern.compile("[ \t]+");
String newLineReplacement = System.getProperty("line.separator") +
System.getProperty("line.separator");
StringBuffer sb = new StringBuffer();
while (m.find()) {
if(pBlank.matcher(m.group(1)).matches()) {
m.appendReplacement(sb, " ");
} else {
m.appendReplacement(sb, newLineReplacement);
}
}
m.appendTail(sb);
return sb.toString().trim();
}
public void testKeepOneWS() {
String str = " This \t is\r\n " +
"\r\n" +
" \r\n " +
" \r \n \t \n " +
"\r\n" +
" a \t string. \t ";
String expected = "This is" + System.getProperty("line.separator")+
System.getProperty("line.separator") + "a string.";
String actual = keepOneWS(str);
System.out.println("'" + actual + "'");
assertEquals(expected, actual);
}
After a group of whitespace is captured, it is checked whether it consists only of spaces and tabs. If so, the group is replaced by a single space; otherwise the group contains at least one line terminator, and it is replaced by two line separators, which leaves exactly one blank line.
The output is:
'This is

a string.'

Using String's ReplaceAll with regex

How can I replace the following String combination:
word1="word2"
With the following String combination:
word1="word3"
Using word boundaries \b.
I used the following, but it didn't work:
String word2 = "word2";
String word3 = "word3";
String oldLine = "word1=\"" + word2 + "\"";
String newLine = "word1=\"" + word3 + "\"";
String lineToReplace = "\\b" + oldLine + "\\b";
String changedCont = cont.replaceAll(lineToReplace, newLine);
Where cont is a String that contains a lot of characters including word1="word2" String combinations.
Remove the last \b. It will not do what you think: " is not a word character, so a \b right after it only matches when the next character is a word character.
String input = "alma word1=\"word2\"";
String replacement = "word1=\"word3\"";
String output = input.replaceAll("\\bword1=\\\"word2\\\"", replacement);
If you replace your lineToReplace line by this:
String lineToReplace = "\\b" + oldLine + "(?!\\w)";
It should work the way you want.
Your string ends in a non-word character (the "), and you put a word boundary \b after it in your regexp, so the pattern cannot match where you expect. Remove that last \b.
The only word boundary you need is at the front; the rest of the match is already delimited by the literal characters (the quotes etc.).
This will work:
cont.replaceAll("\\bword1=\"word2\"", "word1=\"word3\"");
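The difference between the two patterns can be checked on a small input. In this sketch the sample string and the names AttributeReplace and replaceLine are hypothetical; Pattern.quote and Matcher.quoteReplacement are used so that the old and new lines are treated as literal text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AttributeReplace {

    // Hypothetical helper: replaces oldLine with newLine when oldLine starts at a word
    // boundary and is not directly followed by a word character.
    static String replaceLine(String cont, String oldLine, String newLine) {
        String regex = "\\b" + Pattern.quote(oldLine) + "(?!\\w)";
        return cont.replaceAll(regex, Matcher.quoteReplacement(newLine));
    }

    public static void main(String[] args) {
        String cont = "alma word1=\"word2\" xword1=\"word2\"";

        // Trailing \b: nothing is replaced, because the closing quote is followed by
        // a space or the end of the string, so there is no word boundary after it.
        System.out.println(cont.replaceAll("\\bword1=\"word2\"\\b", "word1=\"word3\""));

        // Leading \b only: the first occurrence is replaced, "xword1" is left alone.
        System.out.println(replaceLine(cont, "word1=\"word2\"", "word1=\"word3\""));
    }
}
```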
