Problems with replaceAll (I want to remove all occurrences of [*]) - java

I have a text with some markers like [1], [2], [3], etc.
For example: houses both permanent[1] collections and temporary[2]
exhibitions of contemporary art and photography.[6]
I want to remove these words, so the string must be like this:
For example: houses both permanent collections and temporary
exhibitions of contemporary art and photography.
I tried using: s = s.replaceAll("[.*]", ""); but it just removes the dots (.) from the text.
Which is the correct way to achieve it?
Thanks.

It's because [ and ] are special characters in a regex (they delimit a character class). This should work:
s = s.replaceAll("\\[\\d+\\]","");
(assuming that you always have numbers within the []).
If it could be any characters:
s = s.replaceAll("\\[.*?\\]","");
(thanks #PeterLawrey).

Use:
s.replaceAll("\\[[^]]+\\]", "")
[ and ] are special in a regular expression: they are the delimiters of a character class, so you need to escape them. Your original regex was a character class looking for either a dot or a star.

Step 1: get a better (safer) pattern. Your current one will probably remove most of your string, even if you do get it working as written. Aim for as specific as possible. This one should do (only match brackets that have digits between them).
[\d+]
Step 2: escape special regex characters. [] has a special meaning in regex syntax (character classes) so they need escaping.
\[\d+\]
Step 3: escape for string literal. \ has a special meaning in string literals (escape character) so they also need escaping.
"\\[\\d+\\]"
And now we should have some nicely working code.
s = s.replaceAll("\\[\\d+\\]", "");
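To see the difference between the original attempt and the escaped patterns, here is a minimal sketch. The class and method names (BracketMarkerDemo, stripNumericMarkers, stripAnyMarkers) are made up for illustration, and the sample strings are shortened versions of the text in the question.

```java
public class BracketMarkerDemo {
    // Removes markers such as [1] or [42]: an escaped '[', one or more digits, an escaped ']'.
    static String stripNumericMarkers(String s) {
        return s.replaceAll("\\[\\d+\\]", "");
    }

    // Removes any bracketed text; the reluctant .*? stops at the nearest ']'.
    static String stripAnyMarkers(String s) {
        return s.replaceAll("\\[.*?\\]", "");
    }

    public static void main(String[] args) {
        String s = "permanent[1] collections and photography.[6]";

        // Unescaped, "[.*]" is a character class that matches a literal '.' or '*'.
        System.out.println(s.replaceAll("[.*]", ""));  // permanent[1] collections and photography[6]

        System.out.println(stripNumericMarkers(s));    // permanent collections and photography.
        System.out.println(stripAnyMarkers("see[a] also[12]")); // see also
    }
}
```

The digit-only version leaves bracketed text such as [a] untouched, which is usually the safer default when the markers are known to be numeric.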

Try:
public class StringTest {
    public static void main(String[] args) {
        String str = "houses both permanent[1] collections and temporary[2] exhibitions of contemporary art and photography.[6]";
        // Remove every "[digits]" marker; the closing ] needs no escape outside a character class.
        String result = str.replaceAll("\\[[0-9]*]", "");
        System.out.println(result);
    }
}
output:
houses both permanent collections and temporary exhibitions of contemporary art and photography.

Related

Removing a comma from a string

I’m learning java and I’m just trying to figure how to go about solving this problem. I have a string ex:
The <object> <verb>, on the <object>.
Every word enclosed in <> (the <> are only for clarification and are not in the original string) is a key to a hash map that returns a random value.
I break the string into an array of strings, then loop through the array and look each element up in the hash map; if the key exists, a value is returned. Here is where I run into a problem: in the example above, <verb> is a key, but <verb>, (with the comma) is not.
How can I strip the comma for the lookup but still keep it in the returned result?
The end result should look like this:
The dog sat, on the cat.
I don't need the full code for this, just ideas on how to solve this particular problem.
You can use regular expressions to extract all words (consisting only of letters), and then search the map as you wish. I know that you ask only for comma, but I assume this is the use-case.
List<String> allMatches = new ArrayList<String>();
Matcher m = Pattern.compile("[a-zA-Z]+") //regex for letter-strings only
.matcher(yourString); // e.g. "The dog sat, on the cat."
while (m.find()) {
allMatches.add(m.group());
}
Result will be a following list:
{"The", "dog", "sat", "on", "the", "cat"}
Then you can iterate allMatches to find appropriate results from your data structure.
P.S. When you use regex, try to compile the pattern only once and reuse it if it is required again, as compiling is not a cheap operation.
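Building on the matcher loop above, one possible way to keep the punctuation in place is to replace each matched word in the original string instead of collecting the words into a list. This is only a sketch under assumptions: the class name TemplateFill, the method fill and the fixed map values are invented for illustration (the question's map returns random values), and the pattern is compiled once as a constant, as the P.S. suggests.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateFill {
    // Compiled once and reused for every call.
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    // Replaces every word that is a key in the map; everything else,
    // including commas and periods, is copied through unchanged.
    static String fill(String template, Map<String, String> values) {
        Matcher m = WORD.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String word = m.group();
            String replacement = values.getOrDefault(word, word);
            // quoteReplacement protects any '$' or '\' in the looked-up value.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("object", "dog");
        values.put("verb", "sat");
        System.out.println(fill("The object verb, on the object.", values));
        // The dog sat, on the dog.
    }
}
```

Because only the matched letters are replaced, the comma after the verb never has to be stripped and re-attached.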
The split method of the String class in Java takes a regular expression, and regular expressions in Java have an "or" operator, similar to the one in an if condition.
So instead of splitting on a single character like a space, you can split on several different things: a comma followed by a space, just a space, or a period.
String sentence = "The dog sat, on the cat.";
String [] words = sentence.split(", | |\\.");
The pipe character | is the or operator for regex.
Note that I added the \\. which will remove the '.' from 'cat'.
In regex, the dot means "any character", so to match an actual dot (period, end of sentence) you need to escape it with a \.
And in a Java string literal (anything between ""), \\ means put an actual \ in the string, so \\. will become \. when split receives it as a parameter.
There is even a more general way:
String [] words = sentence.split("\\W+");
\W means "any non-word character" - anything other than a letter, a digit or an underscore, and + means "appearing one or more times in a row".
So this will split the string on anything that is not a word.
Remember that in Java, String class is immutable - its contents can not be changed once it is created.
So no matter which solution you use - replaceAll suggested by Spara, word search using Matcher suggested by nemanja228, or the split in my answer, there will always be copies of the original string created, and the original will not be changed, so you just need to keep a reference to it to preserve it for future use (don't change the variable that holds it).
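A small sketch of the \\W+ split and of the immutability point made above. The class name SplitDemo and the helper splitWords are illustrative only.

```java
import java.util.Arrays;

public class SplitDemo {
    // Splits on runs of non-word characters (anything but letters, digits, underscore).
    static String[] splitWords(String sentence) {
        return sentence.split("\\W+");
    }

    public static void main(String[] args) {
        String sentence = "The dog sat, on the cat.";
        String[] words = splitWords(sentence);

        System.out.println(Arrays.toString(words)); // [The, dog, sat, on, the, cat]
        // split returned new strings; the original is untouched.
        System.out.println(sentence);               // The dog sat, on the cat.
    }
}
```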

Matching The Arabic punctuation marks in Java

I want to edit REGEX_PATTERN2 in this code so that it works with the matches() method for Arabic punctuation marks.
String REGEX_PATTERN = "[\\.|,|:|;|!|_|\\?]+";
String s1 = "My life :is happy, stable";
String[] result = s1.split(REGEX_PATTERN);
for (String myString : result) {
System.out.println(myString);
}
String REGEX_PATTERN2 = "[\\.|,|:|;|!|_|،|؛|؟\\?]+";
String s2 = " حياتي ؛ سعيدة، مستقر";
String[] result2 = s2.split(REGEX_PATTERN2);
for (String myString : result2) {
System.out.println(myString);
}
The output I wanted
My life
is happy
stable
حياتي
سعيدة
مستقر
How can I edit this code to use matches() instead of split() and get the same output with Arabic punctuation marks?
There are a few problems here. First this example:
if (word.matches("[\\.|,|:|;|!|\\?]+"))
That is mildly incorrect (see note 1) for the following reasons:
A . does not need to be escaped in a character class.
A | does not mean alternation in a character class.
A ? does not need to be escaped in a character class.
(For more details, read the javadoc or a tutorial on Java regexes.)
So you can rewrite the above as:
if (word.matches("[.,:;!?]+"))
... assuming that you don't want to classify the pipe character as punctuation.
Now this:
if (word.matches("[\.|,|:|;|!|،|؛|..|...|؟|\?]+"))
You have the same problems as above. In addition, you seem to have used two and three full-stop / period characters instead of (presumably) some Unicode character. I suspect they might be \ufbb7, \u061e or \u06db, but I'm no linguist. (Certainly 2 or 3 full-stops is incorrect.)
So what are the punctuation characters in Arabic?
To be honest, I think that the answer depends on what source you look at, but Wikipedia states:
Only the Arabic question mark ⟨؟⟩ and the Arabic comma ⟨،⟩ are used in regular Arabic script typing and the comma is often substituted for the Latin script comma (,).
Note 1: By mildly incorrect, I mean that the mistakes in this example are mostly harmless. However, your inclusion of (multiple instances of) the | character in the class does mean that you will incorrectly classify a "pipe" as punctuation.
[] denotes a regex character class, which means it only matches single characters. ... is 3 characters, so it cannot be used in a character class.
In a character class, you don't separate characters with |, and you don't need to escape . and ?.
You probably meant this, which is a list of alternate character sequences:
"(?:\\.|,|:|;|!|\\?|،|؛|؟|\\.\\.|\\.\\.\\.)+"
You might get better performance if you do use a character class where you can:
"(?:\\.{1,3}|[,:;!?،؛؟])+"
Of course, with the + at the end, matching 1-3 periods in each iteration is rather redundant, so this will do:
"[.,:;!?،؛؟]+"
Here's a different approach, that uses Unicode properties instead of specific characters (In case you care about more Arabic marks than just the question mark and comma mentioned in another answer):
"(?=^[\\p{InArabic}.,:;!?]+$)^\\p{IsPunctuation}+$"
It matches an entire string of characters that have a punctuation category, that also are either in the Arabic block or are one of the other punctuation characters you listed in your efforts.
It'll match strings like "؟،" or "؟،:", but not "؟،ؠ" or "؟،a".
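As a rough illustration of the plain character class suggested earlier in this thread, the sketch below uses it both for split() and for matches(). The class and method names are assumptions made for this example. The three Arabic marks are written as Unicode escapes (\u060C Arabic comma, \u061B Arabic semicolon, \u061F Arabic question mark) so the source does not depend on file encoding. The underscore from the question's pattern is left out, as in that answer.

```java
import java.util.ArrayList;
import java.util.List;

public class PunctuationSplit {
    // Latin punctuation plus the Arabic comma, semicolon and question mark.
    static final String PUNCTUATION = "[.,:;!?\u060C\u061B\u061F]+";

    // Splits on punctuation runs and trims the surrounding spaces from each piece.
    static List<String> splitOnPunctuation(String text) {
        List<String> parts = new ArrayList<>();
        for (String piece : text.split(PUNCTUATION)) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }

    // matches() must cover the whole token, so this is true only for pure punctuation.
    static boolean isPunctuationOnly(String token) {
        return token.matches(PUNCTUATION);
    }

    public static void main(String[] args) {
        System.out.println(splitOnPunctuation("My life :is happy, stable")); // [My life, is happy, stable]
        System.out.println(isPunctuationOnly("\u061F\u060C"));  // true
        System.out.println(isPunctuationOnly("\u061F\u060Ca")); // false
    }
}
```

To get the question's output with matches() rather than split(), one option would be to tokenize the text first and use isPunctuationOnly to skip the punctuation tokens.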

Splitting a string in java on more than one symbol

I want to split a string whenever one of the following symbols is encountered: "+,-,*,/,="
I am using the split function, but it takes only one argument. Moreover, it is not working on "+".
I am using the following code:
Stringname.split("Symbol");
Thanks.
String.split takes a regular expression as argument.
This means you can combine several symbols or patterns in a single argument in order to split your String.
See the String.split documentation (Javadoc).
Here's an example in your case:
String toSplit = "a+b-c*d/e=f";
String[] splitted = toSplit.split("[-+*/=]");
for (String split: splitted) {
System.out.println(split);
}
Output:
a
b
c
d
e
f
Notes:
Reserved characters for Patterns must be double-escaped with \\. Edit: Not needed here.
The [] brackets in the pattern indicate a character class.
More on patterns in the java.util.regex.Pattern Javadoc.
You can use a regular expression:
String[] tokens = input.split("[+*/=-]");
Note: - should be placed in first or last position to make sure it is not considered as a range separator.
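The note about the position of - can be checked with a short sketch. OperatorSplit and splitOnOperators are illustrative names. In the second pattern the hyphen sits between * and /, so it forms the range from * to /, which also contains the comma and the period.

```java
import java.util.Arrays;

public class OperatorSplit {
    // The hyphen comes first, so it is a literal '-' and not a range operator.
    static String[] splitOnOperators(String expr) {
        return expr.split("[-+*/=]");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnOperators("a+b-c*d/e=f"))); // [a, b, c, d, e, f]
        System.out.println(Arrays.toString(splitOnOperators("1.5+2")));       // [1.5, 2]

        // Misplaced hyphen: "*-/" is a range covering * + , - . /
        System.out.println(Arrays.toString("1.5+2".split("[+*-/=]")));        // [1, 5, 2]
    }
}
```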
You need a regular expression. Additionally, you need the regex OR operator:
String[]tokens = Stringname.split("\\+|\\-|\\*|\\/|\\=");
For that, you need to use an appropriate regex. Most of the symbols you listed are reserved in regex, so you'll have to escape them with \.
A very baseline expression would be \+|\-|/|\*|=. It is relatively easy to understand: each reserved symbol is escaped with \, and the symbols are separated by the | (or) operator. If, for example, you wanted to add ^ as well, all you would need to do is append |\^ to that expression. In a Java string literal, each backslash must itself be doubled.
For testing and quick expressions, I like to use www.regexpal.com

Java: Do I have an effective regex to eliminate symbols & rename a file?

I have a series of link names from which I'm trying to eliminate special characters. From a brief filewalk, my biggest concerns appear to be brackets, parentheses and colons. After unsuccessfully wrestling with escape characters to SELECT : [ and (, I decided instead to exclude everything I wanted to KEEP in the filename.
Consider:
String foo = inputFilename; // SAMPLE DATA: [Phone]_Michigan_billing_(automatic).html
String scrubbedFoo = foo.replaceAll("[^a-zA-Z-._]", "");
Expected Result: Phone_Michigan_billing_automatic.html
My escape-character regex was approaching 60 characters when I ditched it. The last version I saved before changing strategies was [:.(\\[)|(\\()|(\\))|(\\])] where I thought I was asking for escape-character-[() and ].
The blanket exclude seems to work just fine. Is the Regex really that simple? Any input on how effective this strategy will be? I feel like I'm missing something and need a couple sets of eyes.
In my opinion, you're using the wrong tool for this job. StringUtils has a method named replaceChars that will replace all occurrences of a char with another one. Here's the documentation:
public static String replaceChars(String str,
String searchChars,
String replaceChars)
Replaces multiple characters in a String in one go. This method can also be used to delete characters.
For example:
replaceChars("hello", "ho", "jy") = jelly.
A null string input returns null. An empty ("") string input returns an empty string. A null or empty set of search characters returns the input string.
The length of the search characters should normally equal the length of the replace characters. If the search characters is longer, then the extra search characters are deleted. If the search characters is shorter, then the extra replace characters are ignored.
StringUtils.replaceChars(null, *, *) = null
StringUtils.replaceChars("", *, *) = ""
StringUtils.replaceChars("abc", null, *) = "abc"
StringUtils.replaceChars("abc", "", *) = "abc"
StringUtils.replaceChars("abc", "b", null) = "ac"
StringUtils.replaceChars("abc", "b", "") = "ac"
StringUtils.replaceChars("abcba", "bc", "yz") = "ayzya"
StringUtils.replaceChars("abcba", "bc", "y") = "ayya"
StringUtils.replaceChars("abcba", "bc", "yzx") = "ayzya"
So in your example:
String translated = StringUtils.replaceChars("[Phone]_Michigan_billing_(automatic).html", "[]():", null);
System.out.println(translated);
Will output:
Phone_Michigan_billing_automatic.html
This will be more straightforward and easier to understand than any regex you could write.
I think your regex is the way to go. In general, whitelisting values instead of blacklisting them is almost always better (that is, only allowing characters you KNOW are good instead of eliminating all characters you think are bad). From a security standpoint this regex should be preferred: you will never end up with an inputFilename that has invalid characters.
suggested regex: [^a-zA-Z-._]
I think your regex can be as simple as \W which will match everything that is not a word character (letters, digits, and underscores). This is the negation of \w
So your code becomes:
foo.replaceAll("\\W", "");
As pointed out in the comments, the above also removes periods. This will keep them:
foo.replaceAll("[^\\w.]", "");
Details: this removes everything that is not (the ^ inside the character class) a digit, underscore or letter (the \w) or a period (the .). Inside a Java string literal the backslash has to be doubled, hence \\w.
As noted above, there may be other chars you want to whitelist, like -. Just include them in your character class as you go along.
foo.replaceAll("[^\\w.\\-]", "");
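A runnable sketch of the whitelist approach, using the sample filename from the question. The class name FilenameScrub and the method scrub are made up for this example. Note that \w in Java matches only ASCII letters, digits and underscore by default, so accented letters would be removed as well.

```java
public class FilenameScrub {
    // Keeps word characters, periods and hyphens; removes everything else.
    static String scrub(String name) {
        return name.replaceAll("[^\\w.\\-]", "");
    }

    public static void main(String[] args) {
        String foo = "[Phone]_Michigan_billing_(automatic).html";
        // Strings are immutable, so the result must be assigned.
        String scrubbed = scrub(foo);
        System.out.println(scrubbed); // Phone_Michigan_billing_automatic.html
    }
}
```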

regex to find instances of a single \

I am looking to replace \n with \\n but so far my regex attempts are not working (Really it is any \ by itself, \n just happens to be the use case I have in the data).
What I need is something along the lines of:
any-non-\ followed by \ followed by any-non-\
Ultimately I'll be passing the regex to java.lang.String.replaceAll so a regex formatted for that would be great, but I can probably translate another style regex into what I need.
For example, I am after this program printing out "true"...
public class Main
{
    public static void main(String[] args)
    {
        final String original;
        final String altered;
        final String expected;
        original = "hello\nworld";
        expected = "hello\\nworld";
        altered = original.replaceAll("([^\\\\])\\\\([^\\\\])", "$1\\\\$2");
        System.out.println(altered.equals(expected));
    }
}
Using this does work:
altered = original.replaceAll("\\n", "\\\\n");
The string should be
"[^\\\\]\\\\[^\\\\]"
You have to quadruple backslashes in a String constant that's meant for a regex; if you only doubled them, they would be escaped for the String but not for the regex.
So the actual code would be
myString = myString.replaceAll("([^\\\\])\\\\([^\\\\])", "$1\\\\$2");
Note that in the replacement, a quadruple backslash is now interpreted as two backslashes rather than one, since the regex engine is not parsing it. Edit: Actually, the regex engine does parse it since it has to check for the backreferences.
Edit: The above was assuming that there was a literal \n in the input string, which is represented in a string literal as "\\n". Since it apparently has a newline instead (represented as "\n"), the correct substitution would be
myString = myString.replaceAll("\\n", "\\\\n");
This must be repeated for any other special characters (\t, \r, \0, \\, etc.). As above, the replacement string looks exactly like the regex string but isn't.
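The newline case can be sketched as follows. NewlineEscape and escapeNewlines are illustrative names. The sketch also shows String.replace, which treats both arguments literally and so avoids the regex and replacement-string escaping altogether; that alternative is not mentioned in the answers above.

```java
public class NewlineEscape {
    // Literal replacement: a newline character becomes backslash + 'n'.
    static String escapeNewlines(String s) {
        return s.replace("\n", "\\n");
    }

    public static void main(String[] args) {
        String original = "hello\nworld";   // contains a real newline character
        String expected = "hello\\nworld";  // contains a backslash followed by 'n'

        // Regex version: the pattern \n matches a newline; in the replacement,
        // \\ stands for one literal backslash, so four are typed in the Java literal.
        String viaRegex = original.replaceAll("\\n", "\\\\n");

        System.out.println(viaRegex.equals(expected));                 // true
        System.out.println(escapeNewlines(original).equals(expected)); // true
    }
}
```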
So whenever there is 1 backslash, you want 2, but if there is 2, 3 or 4... in a row, leave them alone?
you want to replace
(?<=[^\\])\\(?!\\+)([^\\])
with
\\$1
That changes the string
hello\nworld and hello\\nworld and hello\\\nworld
into
hello\\nworld and hello\\nworld and hello\\\nworld
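For input that really contains backslash characters, a variation on the lookaround idea above might look like the sketch below. It is not the exact regex from the answer: it uses a negative lookbehind and a negative lookahead, and the names LoneBackslash and doubleLoneBackslashes are invented here. Matcher.quoteReplacement is used so the replacement backslashes do not have to be escaped by hand.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LoneBackslash {
    // A backslash that is neither preceded nor followed by another backslash.
    private static final Pattern LONE = Pattern.compile("(?<!\\\\)\\\\(?!\\\\)");

    // Doubles every lone backslash; runs of two or more are left alone.
    static String doubleLoneBackslashes(String input) {
        // "\\\\" is two backslash characters; quoteReplacement makes them literal.
        return LONE.matcher(input).replaceAll(Matcher.quoteReplacement("\\\\"));
    }

    public static void main(String[] args) {
        System.out.println(doubleLoneBackslashes("hello\\nworld"));     // hello\\nworld
        System.out.println(doubleLoneBackslashes("hello\\\\nworld"));   // hello\\nworld
        System.out.println(doubleLoneBackslashes("hello\\\\\\nworld")); // hello\\\nworld
    }
}
```

The comments show the printed characters, not Java literals: the first input holds one backslash and is doubled, while the inputs with two and three backslashes come back unchanged.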
I don't know exactly what you need it for, but you could have a look at StringEscapeUtils from Commons Lang. They have plenty of methods doing things like that, and if you don't find exactly what you're searching for, you could have a look at the source to find inspiration :)
What's wrong with using altered = original.replaceAll("\\n", "\\\\n"); ? That's exactly what I would have done.
