Removing a comma from a string - java

I’m learning java and I’m just trying to figure how to go about solving this problem. I have a string ex:
The <object> <verb>, on the <object>.
Every word enclosed in <> (the <> are only for clarification and are not in the original string) is a key to a hash map that returns a random value.
I break the string into an array of strings, then loop through the array and look each element up in the hash map; if the key exists, I return a value. Here is where I encounter a problem: in the example above, <verb> is a key, but <verb>, (with the comma) is not.
How can I look up the key without the comma but still return the value with the comma attached?
So the end result would be:
The dog sat, on the cat.
I don't need the full code for this, just ideas on how to solve this particular problem.

You can use regular expressions to extract all words (consisting only of letters), and then search the map as you wish. I know that you ask only for comma, but I assume this is the use-case.
List<String> allMatches = new ArrayList<String>();
Matcher m = Pattern.compile("[a-zA-Z]+") //regex for letter-strings only
.matcher(yourString); // e.g. "The dog sat, on the cat."
while (m.find()) {
allMatches.add(m.group());
}
The result will be the following list:
{"The", "dog", "sat", "on", "the", "cat"}
Then you can iterate allMatches to find appropriate results from your data structure.
P.S. When you use regex, try to compile the pattern only once and reuse it whenever it is needed again, as compiling is not a cheap operation.
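If the goal is to swap the keys for values while keeping the comma and period where they are, one possible sketch is to let the Matcher do the replacing instead of collecting the words first. This assumes the keys appear in the text as plain letter-only words; the names TemplateFiller, fill and choices are made up for illustration and are not from the question.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemplateFiller {
    // Compiled once and reused for every call.
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    // Replaces every word that is a key in 'choices' with a randomly picked value.
    // Text between words (spaces, commas, periods) is copied through unchanged.
    static String fill(String template, Map<String, List<String>> choices, Random rng) {
        Matcher m = WORD.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            List<String> options = choices.get(m.group());
            String replacement = (options == null)
                    ? m.group()                                  // not a key: keep the word
                    : options.get(rng.nextInt(options.size()));  // key: pick a random value
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy whatever follows the last word
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> choices = new HashMap<>();
        choices.put("object", Arrays.asList("dog", "cat"));
        choices.put("verb", Arrays.asList("sat", "slept"));
        System.out.println(fill("The object verb, on the object.", choices, new Random()));
    }
}
```

Because only the matched word is replaced, the comma after the verb never has to be stripped or re-attached.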

The split method of the String class in Java takes a regular expression, and regular expressions have an "or" operator (alternation), similar to || in an if condition.
So instead of splitting on a single character like a space, you can split on several different things, such as a comma followed by a space, just a space, or a period.
String sentence = "The dog sat, on the cat.";
String [] words = sentence.split(", | |\\.");
The pipe character | is the or operator for regex.
Note that I added the \\. which will remove the '.' from 'cat'.
In regex, the dot means "any character", so to match an actual dot (period, end of sentence) you need to escape it with a \.
And in a Java string literal (anything between "") \\ means put an actual \ in the string, so \\. will become \. when split receives it as a parameter.
There is even a more general way:
String [] words = sentence.split("\\W+");
\W means "any non-word character" - anything other than a letter, a digit or an underscore, and + means "appearing one or more times in a row".
So this will split the string on anything that is not a word.
Remember that in Java, String class is immutable - its contents can not be changed once it is created.
So no matter which solution you use - replaceAll suggested by Spara, word search using Matcher suggested by nemanja228, or the split in my answer, there will always be copies of the original string created, and the original will not be changed, so you just need to keep a reference to it to preserve it for future use (don't change the variable that holds it).
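As a quick check of the two split patterns above, here is a small sketch; the class and method names are only for illustration.

```java
import java.util.Arrays;

class SplitDemo {
    // Splits on ", " or " " or "." (the alternation from the answer).
    static String[] splitOnListedDelimiters(String s) {
        return s.split(", | |\\.");
    }

    // Splits on any run of non-word characters.
    static String[] splitOnNonWordRuns(String s) {
        return s.split("\\W+");
    }

    public static void main(String[] args) {
        String sentence = "The dog sat, on the cat.";
        System.out.println(Arrays.toString(splitOnListedDelimiters(sentence))); // [The, dog, sat, on, the, cat]
        System.out.println(Arrays.toString(splitOnNonWordRuns(sentence)));      // [The, dog, sat, on, the, cat]
        // The original string is untouched, because String is immutable.
        System.out.println(sentence); // The dog sat, on the cat.
    }
}
```

Note that the trailing period does not produce an empty last element, since split drops trailing empty strings by default.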

Related

Split String 2 times but with different splits ";" and "."

Original String: "12312123;www.qwerty.com"
With this Model.getList().get(0).split(";")[1]
I get: "www.qwerty.com"
I tried doing this: Model.getList().get(0).split(";")[1].split(".")[1]
But it didn't work; I get an exception. How can I solve this?
I want only "qwerty"
Try this, to achieve "qwerty":
Model.getList().get(0).split(";")[1].split("\\.")[1]
You need to escape the dot symbol.
Try to use split(";|\\.") like this:
for (String string : "12312123;www.qwerty.com".split(";|\\.")) {
System.out.println(string);
}
Output:
12312123
www
qwerty
com
You can split a string which has multiple delimiters. Example below:
String abc = "11;xyz.test.com";
String[] tokens = abc.split(";|\\.");
System.out.println(tokens[tokens.length-2]);
Indexing element 1 of that array doesn't work here: it throws an ArrayIndexOutOfBoundsException.
This is because splitting on "." doesn't work the way you want it to. You need to escape the period by writing "\\." instead, because in a regular expression "." means something completely different.
You'd need to escape the ., i.e. "\\.". Period is a special character in regular expressions, meaning "any character".
What your current split means is "split on any character"; this means that it splits the string into a number of empty strings, since there is nothing between consecutive occurrences of "any character".
There is a subtle gotcha in the behaviour of the String.split method, which is that it discards trailing empty strings from the token array (unless you pass a negative number as the second parameter).
Since your entire token array consists of empty strings, all of these are discarded, so the result of the split is a zero-length array - hence the exception when you try to access one of its elements.
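The behaviour described above can be seen in a few lines. This is only a sketch; the class name and the helper secondLabel are hypothetical.

```java
import java.util.regex.Pattern;

class DotSplitDemo {
    // Returns the part between the first and second dot.
    static String secondLabel(String host) {
        return host.split("\\.")[1];
    }

    public static void main(String[] args) {
        String host = "www.qwerty.com";
        // Unescaped dot: every character is a delimiter, all tokens are empty
        // and all of them are discarded as trailing empty strings.
        System.out.println(host.split(".").length);     // 0
        // A negative limit keeps the empty strings: 14 delimiters give 15 tokens.
        System.out.println(host.split(".", -1).length); // 15
        // Escaped dot: splits on literal periods only.
        System.out.println(secondLabel(host));          // qwerty
        // Pattern.quote is another way to get a literal delimiter.
        System.out.println(host.split(Pattern.quote("."))[1]); // qwerty
    }
}
```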
Don't use split, use a regular expression (directly). It's safer, and faster.
String input = "12312123;www.qwerty.com";
String regex = "([^.;]+)\\.[^.;]+$";
Matcher m = Pattern.compile(regex).matcher(input);
if (m.find()) {
System.out.println(m.group(1)); // prints: qwerty
}

Java String split regexp returns empty strings with multiple delimiters

I have a problem that I can't seem to find an answer here for, so I'm asking it.
The thing is that I have a string and I have delimiters. I want to create an array of strings from the things which are between those delimiters (might be words, numbers, etc). However, if I have two delimiters next to one another, the split method will return an empty string for one of the instances.
I tested this against even more delimiters that are in succession. I found out that if I have n delimiters, I will have n-1 empty strings in the result array. In other words, if I have both "," and " " as delimiters, and the sentence "This is a very nice day, isn't it", then the array with results would be like:
{... , "day", "", "isn't" ...}
I want to get those extra empty strings out and I can't figure out how to do that. A sample regex for the delimiters that I have is:
"[\\s,.-\\'\\[\\]\\(\\)]"
Also can you explain why there are extra empty strings in the result array?
P.S. I read some of the similar posts which included information about the second parameter of the split method. I tried negative, zero, and positive numbers, and I didn't get the result that I'm looking for. (One of the questions had an answer saying that -1 as a parameter might solve the problem, but it didn't.)
You can use this regex for splitting:
[\\s,.'\\[\\]()-]+
Keep an unescaped hyphen at the first or last position in the character class; otherwise it is treated as a range, like A-Z or 0-9.
You must use the quantifier + to match 1 or more delimiters.
I think your problem is just the regex itself. You should use a greedy quantifier (the hyphen is also escaped here so that it is not read as a range):
"[\\s,.\\-'\\[\\]\\(\\)]+"
See http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#sum
X+ ... X, one or more times
Your regular expression describes just one single character. If you want it to match multiple separators at once, use a quantifier:
String s = "This is a very nice day, isn't it";
String[] tokens = s.split("[\\s,.\\-\\[\\]()']+");
(Note the '+' at the end of the expression)
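To see the difference the quantifier makes, here is a small sketch that runs the same character class with and without the +; the class and method names are only illustrative.

```java
import java.util.Arrays;

class QuantifierDemo {
    // One delimiter character per match: adjacent delimiters leave "" between them.
    static String[] splitSingle(String s) {
        return s.split("[\\s,.\\-\\[\\]()']");
    }

    // One or more delimiter characters per match: a run counts as one separator.
    static String[] splitRuns(String s) {
        return s.split("[\\s,.\\-\\[\\]()']+");
    }

    public static void main(String[] args) {
        String s = "This is a very nice day, isn't it";
        System.out.println(Arrays.toString(splitSingle(s))); // [This, is, a, very, nice, day, , isn, t, it]
        System.out.println(Arrays.toString(splitRuns(s)));   // [This, is, a, very, nice, day, isn, t, it]
    }
}
```

The empty string appears because ", " is two delimiters in a row with nothing between them. Also note that with ' in the class, "isn't" is split into "isn" and "t".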
If you want to get rid of empty strings, you can use the Guava project Splitter class.
on method:
Returns a splitter that uses the given fixed string as a separator.
Example (ignoring empty strings):
System.out.println(
Splitter.on(',')
.trimResults()
.omitEmptyStrings()
.split("foo,bar,, qux")
);
Output:
[foo, bar, qux]
onPattern method:
Returns a splitter that considers any subsequence matching a given
pattern (regular expression) to be a separator.
Example (ignoring empty strings):
System.out.println(
Splitter
.onPattern("([,.|])")
.trimResults()
.omitEmptyStrings()
.split("foo|bar,, qux.hi")
);
Output:
[foo, bar, qux, hi]
For more details, consult Splitter documentation.
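If adding Guava is not an option, something similar can be sketched with the standard library alone (Java 8 streams). This is not the Splitter API, just an approximation of trimResults plus omitEmptyStrings; the class name NonEmptyTokens is made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class NonEmptyTokens {
    // Splits on the given regex, trims each token and drops the empty ones.
    static List<String> split(String input, String delimiterRegex) {
        return Arrays.stream(input.split(delimiterRegex))
                .map(String::trim)
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(split("foo,bar,, qux", ","));        // [foo, bar, qux]
        System.out.println(split("foo|bar,, qux.hi", "[,.|]")); // [foo, bar, qux, hi]
    }
}
```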

Why the space appears as sub string in this split instruction?

I have a string with spaces and some non-informative characters and substrings that need to be excluded, keeping only some important sections. I used split as below:
String myString[]={"01: Hi you look tired today? Can I help you?"};
myString=myString[0].split("[\\s+]");// Split based on any white spaces
for(int ii=0;ii<myString.length;ii++)
System.out.println(myString[ii]);
The result is :
01:
Hi
you
look
tired
today?
Can
I
help
you?
The spaces appeared after the split as substrings when the regex is "[\s+]" but disappeared when the regex is "\s+". I am confused and not able to find an answer in the related Stack Overflow pages. The link regex-Pattern made me more confused.
Please help, I am new with java.
19/1/2015:Edit
After your valuable advice, I reached a point in my program where a conditional statement needs to be decomposed and processed. The case I have is:
String s1="01:IF rd.h && dq.L && o.LL && v.L THEN la.VHB , av.VHR with 0.4610;";
String [] s2=s1.split(("[\\s\\&\\,]+"));
for(int ii=0;ii<s2.length;ii++)System.out.println(s2[ii]);
The result is fine till now as:
01:IF
rd.h
dq.L
o.LL
v.L
THEN
la.VHB
av.VHR
with
0.4610;
My next step is to add string "with" to the regex and get rid of this word while doing the split.
I tried it this way:
String s1="01:IF rd.h && dq.L && o.LL && v.L THEN la.VHB , av.VHR with 0.4610;";
String [] s2=s1.split(("[\\s\\&\\, with]+"));
for(int ii=0;ii<s2.length;ii++)System.out.println(s2[ii]);
The result is not perfect, because I got an unwanted extra split at every letter "h":
01:IF
rd.
dq.L
o.LL
v.L
THEN
la.VHB
av.VHR
0.4610;
Any advice on how to specify string with mixed white spaces and separation marks?
Many thanks.
Inside square brackets, [\s+] is a character class containing the whitespace characters plus the plus sign. It matches only one character, so a sequence of spaces produces many empty strings, as Todd noted, and a literal + is also treated as a separator.
You should use \s+ (without brackets) as the separator. That means one or more whitespace characters.
myString=myString[0].split("\\s+");
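A short sketch comparing the two patterns on a string with a double space and on one containing a plus sign (class and method names are only illustrative):

```java
import java.util.Arrays;

class WhitespaceClassDemo {
    // Character class: ONE whitespace character or ONE literal '+'.
    static String[] splitWithClass(String s) {
        return s.split("[\\s+]");
    }

    // Quantified escape: one or more whitespace characters.
    static String[] splitWithRun(String s) {
        return s.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitWithClass("a  b"))); // [a, , b]
        System.out.println(Arrays.toString(splitWithRun("a  b")));   // [a, b]
        System.out.println(Arrays.toString(splitWithClass("1+2")));  // [1, 2]
        System.out.println(Arrays.toString(splitWithRun("1+2")));    // [1+2]
    }
}
```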
Your biggest problem is not understanding enough about regular expressions to write them properly. One key point you don't comprehend is that [...] is a character class, which is a list of characters any one of which can match. For example:
[abc] matches either a, b or c (it does not match "abc")
[\\s+] matches any whitespace or "+" character
[with] matches a single character that is either w, i, t or h
[.$&^?] matches those literal characters - most characters lose their special regex meaning when in a character class
To split on any number of whitespace, comma and ampersand and consume "with" (if it appears), do this:
String [] s2 = s1.split("[\\s,&]+(with[\\s,&]+)?");
You can try it easily here Online Regex and get useful comments.
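For reference, here is a sketch that runs the suggested pattern on the string from the question and contrasts it with the [with] character class; the class and method names are hypothetical.

```java
import java.util.Arrays;

class RuleSplitDemo {
    // Delimiter run, optionally followed by the word "with" and another delimiter run.
    static String[] tokens(String rule) {
        return rule.split("[\\s,&]+(with[\\s,&]+)?");
    }

    public static void main(String[] args) {
        String s1 = "01:IF rd.h && dq.L && o.LL && v.L THEN la.VHB , av.VHR with 0.4610;";
        System.out.println(Arrays.toString(tokens(s1)));
        // [01:IF, rd.h, dq.L, o.LL, v.L, THEN, la.VHB, av.VHR, 0.4610;]

        // Inside [...] the letters w, i, t, h are four separate delimiter
        // characters, which is why the trailing h of "rd.h" was eaten.
        System.out.println(Arrays.toString("rd.h".split("[with]+"))); // [rd.]
    }
}
```

The word "with" is only consumed when it sits between delimiters, so tokens that merely contain those letters are left alone.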

Java split by alphabeta char creates an empty value in array

I want to split my string on every occurrence of an alphabetic character.
For example:
"s1l1e13" to an array of: ["s1","l1","e13"]
When trying to use this simple split by regex, I get some weird results:
testStr = "s1l1e13"
Arrays.toString(testStr.split("(?=[a-z])"))
gives me the array of:
["","s1","l1","e13"]
How can I create the split without the empty array element?
I tried a couple more things:
testStr = "s1"
Arrays.toString(testStr.split("(?=[a-z])"))
does return the correct array: ["s1"]
But when trying to use substring
testStr = "s1l1e13"
Arrays.toString(testStr.substring(1).split("(?=[a-z])"))
I get in return ["1","l1","e13"]
What am I missing?
Your Lookahead marks each position before any character of a to z; marking the following positions:
s1 l1 e13
^ ^ ^
So by splitting using just the Lookahead, it returns ["", "s1", "l1", "e13"]
You can use a Negative Lookbehind here. This looks behind to see if there is not the beginning of the string.
String s = "s1l1e13";
String[] parts = s.split("(?<!\\A)(?=[a-z])");
System.out.println(Arrays.toString(parts)); //=> [s1, l1, e13]
Your problem is that (?=[a-z]) means "place before [a-z]" and in your text
s1l1e13
you have 3 such places. I will mark them with |
|s1|l1|e13
so split (unfortunately correctly) produces "" "s1" "l1" "e13" and doesn't automatically remove the leading empty element for you.
To solve this problem you have at least two options:
Make sure that there is something before the place you want to split on (so that it is not at the start of your string). You can use for instance (?<=\\d)(?=[a-z]) if you want to split after a digit but before a letter.
(PREFERRED SOLUTION) Start using Java 8, which automatically removes the empty string at the start of the result array if the regex match at the start of the string is zero-length (look-arounds are zero-length).
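The first option can be sketched like this; it behaves the same regardless of Java version because the lookbehind rules out the position at the start of the string. The class and method names are only illustrative.

```java
import java.util.Arrays;

class LookaroundSplitDemo {
    // Splits at every position that has a digit behind it and a lowercase letter ahead.
    static String[] splitBeforeLetters(String s) {
        return s.split("(?<=\\d)(?=[a-z])");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitBeforeLetters("s1l1e13"))); // [s1, l1, e13]
        System.out.println(Arrays.toString(splitBeforeLetters("s1")));      // [s1]
    }
}
```

This assumes every token is a letter followed by at least one digit, as in the question's example.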
The first match produces "" because the regex only looks ahead for a letter; this is a zero-width lookahead, so it doesn't need to consume any characters. The "s" at the beginning is a letter, so the position before it (the very start of the string) counts as a match.
If you want the regex to match something always, use ".+(?=[a-z])"
The problem is that the initial "s" counts as an alphabetic character, so the regex splits at the position just before it.
Since there is nothing before the s, split represents that nothing as an empty string at the start of the array. This does not happen at the end: with this lookahead a trailing letter simply becomes the last element, and split discards trailing empty strings by default anyway.
If this is the only string you're splitting, or if every string you split starts with a letter, just drop the first element of the array. Otherwise, you'll probably need to loop through each array as you make it so that you can drop empty elements.
So it seems your matches have the pattern x###, where x is a letter and each # is a digit.
I'd make the following Regex:
([a-z][0-9]+)
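Used with a Matcher instead of split, that regex could be applied roughly as follows. This is a sketch; the class name TokenMatcher and the method findTokens are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenMatcher {
    // One lowercase letter followed by one or more digits.
    private static final Pattern TOKEN = Pattern.compile("[a-z][0-9]+");

    // Collects every match in order; text that does not fit the pattern is skipped.
    static List<String> findTokens(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(findTokens("s1l1e13")); // [s1, l1, e13]
    }
}
```

Matching the tokens directly avoids the empty-element question altogether, since nothing is ever split.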

Java: Do I have an effective regex to eliminate symbols & rename a file?

I have a series of link names from which I'm trying to eliminate special characters. From a brief filewalk, my biggest concerns appear to be brackets, parentheses and colons. After unsuccessfully wrestling with escape characters to SELECT : [ and (, I decided instead to exclude everything I wanted to KEEP in the filename.
Consider:
String foo = inputFilename; // SAMPLE DATA: [Phone]_Michigan_billing_(automatic).html
String scrubbedFoo = foo.replaceAll("[^a-zA-Z-._]", "");
Expected Result: Phone_Michigan_billing_automatic.html
My escape-character regex was approaching 60 characters when I ditched it. The last version I saved before changing strategies was [:.(\\[)|(\\()|(\\))|(\\])] where I thought I was asking for escape-character-[() and ].
The blanket exclude seems to work just fine. Is the Regex really that simple? Any input on how effective this strategy will be? I feel like I'm missing something and need a couple sets of eyes.
In my opinion, you're using the wrong tool for this job. StringUtils (from Apache Commons Lang) has a method named replaceChars that replaces all occurrences of the given characters with others. Here's the documentation:
public static String replaceChars(String str,
String searchChars,
String replaceChars)
Replaces multiple characters in a String in one go. This method can also be used to delete characters.
For example:
replaceChars("hello", "ho", "jy") = jelly.
A null string input returns null. An empty ("") string input returns an empty string. A null or empty set of search characters returns the input string.
The length of the search characters should normally equal the length of the replace characters. If the search characters is longer, then the extra search characters are deleted. If the search characters is shorter, then the extra replace characters are ignored.
StringUtils.replaceChars(null, *, *) = null
StringUtils.replaceChars("", *, *) = ""
StringUtils.replaceChars("abc", null, *) = "abc"
StringUtils.replaceChars("abc", "", *) = "abc"
StringUtils.replaceChars("abc", "b", null) = "ac"
StringUtils.replaceChars("abc", "b", "") = "ac"
StringUtils.replaceChars("abcba", "bc", "yz") = "ayzya"
StringUtils.replaceChars("abcba", "bc", "y") = "ayya"
StringUtils.replaceChars("abcba", "bc", "yzx") = "ayzya"
So in your example:
String translated = StringUtils.replaceChars("[Phone]_Michigan_billing_(automatic).html", "[]():", null);
System.out.println(translated);
Will output:
Phone_Michigan_billing_automatic.html
This will be more straightforward and easier to understand than any regex you could write.
I think your regex is the way to go. In general, whitelisting values instead of blacklisting them is almost always better (only allowing characters you KNOW are good instead of eliminating all characters you think are bad). From a security standpoint this regex should be preferred. You will never end up with an inputFilename which has invalid characters.
suggested regex: [^a-zA-Z-._]
I think your regex can be as simple as \W which will match everything that is not a word character (letters, digits, and underscores). This is the negation of \w
So your code becomes:
foo.replaceAll("\\W","");
As pointed out in the comments, the above also removes periods. This will keep periods as well:
foo.replaceAll("[^\\w.]","");
Details: remove everything that is not (the ^ inside the character class) a digit, underscore, or letter (the \w) or a period (the .).
As noted above, there may be other characters you want to whitelist, like -. Just include them in your character class as you go along.
foo.replaceAll("[^\\w.\\-]","");
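Put together, the whitelist approach might look like the sketch below. Remember that replaceAll returns a new string rather than changing foo, so the result has to be assigned. The class name FilenameScrubber and the method scrub are hypothetical.

```java
import java.util.regex.Pattern;

class FilenameScrubber {
    // Anything that is NOT a word character (letter, digit, underscore), a period or a hyphen.
    private static final Pattern DISALLOWED = Pattern.compile("[^\\w.\\-]");

    // Returns a copy of the name with every disallowed character removed.
    static String scrub(String filename) {
        return DISALLOWED.matcher(filename).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(scrub("[Phone]_Michigan_billing_(automatic).html"));
        // Phone_Michigan_billing_automatic.html
    }
}
```

Unlike the question's [^a-zA-Z-._], this version also keeps digits, because \w includes them.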
