Performing multiple string replacements with metacharacter regex patterns - java

I am trying to perform multiple string replacements using Java's Pattern and Matcher, where the regex pattern may include metacharacters (e.g. \b, (), etc.). For example, for the input string fit i am, I would like to apply the replacements:
\bi\b --> EYE
i --> I
I then followed the coding pattern from two questions (Java Replacing multiple different substring in a string at once, Replacing multiple substrings in Java when replacement text overlaps search text). In both, they create an or'ed search pattern (e.g foo|bar) and a Map of (pattern, replacement), and inside the matcher.find() loop, they look up and apply the replacement.
The problem I am having is that the matcher.group() function does not contain information on matching metacharacters, so I cannot distinguish between i and \bi\b. Please see the code below. What can I do to fix the problem?
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.*;
public class ReplacementExample
{
public static void main(String argv[])
{
Map<String, String> replacements = new HashMap<String, String>();
replacements.put("\\bi\\b", "EYE");
replacements.put("i", "I");
String input = "fit i am";
String result = doit(input, replacements);
System.out.printf("%s\n", result);
}
public static String doit(String input, Map<String, String> replacements)
{
String patternString = join(replacements.keySet(), "|");
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(input);
StringBuffer resultStringBuffer = new StringBuffer();
while (matcher.find())
{
System.out.printf("match found: %s at start: %d, end: %d\n",
matcher.group(), matcher.start(), matcher.end());
String matchedPattern = matcher.group();
String replaceWith = replacements.get(matchedPattern);
// Do the replacement here.
matcher.appendReplacement(resultStringBuffer, replaceWith);
}
matcher.appendTail(resultStringBuffer);
return resultStringBuffer.toString();
}
private static String join(Set<String> set, String delimiter)
{
StringBuilder sb = new StringBuilder();
int numElements = set.size();
int i = 0;
for (String s : set)
{
sb.append(Pattern.quote(s));
if (i++ < numElements-1) { sb.append(delimiter); }
}
return sb.toString();
}
}
This prints out:
match found: i at start: 1, end: 2
match found: i at start: 4, end: 5
fIt I am
Ideally, it should be fIt EYE am.

You mistyped one of your regexes:
replacements.put("\\bi\\", "EYE"); //Should be \\bi\\b
replacements.put("i", "I");
You may also want to make your regexes unique. There is no guarantee of order with map.getKeySet() so it may just be replacing i with I before checking \\bi\\b.

You could use capture groups, without straying too far from your existing design. So instead of using the matched pattern as the key, you look up based on the order within a List.
You would need to change the join method to put parantheses around each of the patterns, something like this:
private static String join(Set<String> set, String delimiter) {
StringBuilder sb = new StringBuilder();
sb.append("(");
int numElements = set.size();
int i = 0;
for (String s : set) {
sb.append(s);
if (i++ < numElements - 1) {
sb.append(")");
sb.append(delimiter);
sb.append("("); }
}
sb.append(")");
return sb.toString();
}
As a side note, the use of Pattern.quote in the original code listing would have caused the match to fail where those metacharacters were present.
Having done this, you would now need to determine which of the capture groups was responsible for the match. For simplicity I'm going to assume that none of the match patterns will themselves contain capture groups, in which case something like this would work, within the matcher while loop:
int index = -1;
for (int j=1;j<=replacements.size();j++){
if (matcher.group(j) != null) {
index = j;
break;
}
}
if (index >= 0) {
System.out.printf("Match on index %d = %s %d %d\n", index, matcher.group(index), matcher.start(index), matcher.end(index));
}
Next, we would like to use the resulting index value to index straight back into the replacements. The original code uses a HashMap, which is not suitable for this; you're going to have to refactor that to use a pair of Lists in some form, one containing the list of match patterns and the other the corresponding list of replacement strings. I won't do that here, but I hope that provides enough detail to create a working solution.

Related

Find chars in string that are not between double qotes

I want to find the occurrences of (a) specific character(s) but this String to search can't be between quotes:
Example:
"this is \"my\" example string"
If you look for the char 'm', then it should only return the index of 'm' from "example" as the other ' is between double quotes.
Another example:
"th\"i\"s \"is\" \"my\" example string"
I'm expecting something like:
public List<Integer> getOccurrenceStartIndexesThatAreNotBetweenQuotes(String snippet,String stringToFind);
One "naive" way would be to:
get all the start indexes of stringToFind in snippet
get all the indexes of the quotes in snippet
Depending of the start index of stringToFind, because you have the positions of the quotes, you can know if you are between quotes or not.
Is there a better way to do this?
EDIT:
What do I want to retrieve? The indexes of the matches.
Few things:
There can be many quoted content in the string to search in: "th\"i\"s \"is\" \"my\" example string"
In the string : "th\"i\"s \"is\" \"my\" example string", "i", "is" and "my" are between quotes.
It's not limited to letters and digits, we can have ';:()_-=+[]{} etc...
Here's one solution:
Algorithm:
Find all the "Dead Zone" regions within the String (e.g. regions that are off limits because they are within quotes)
Find all the regions where the String contains the search string in question (hitZones in the code).
Retain only the regions in the hitZones that are not contained in any deadZones. I will leave this part to you :)
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class FindStrings
{
// Just a simple model class for regions
static class Pair
{
int s = 0;
int e = 0;
public Pair (int s, int e)
{
this.s = s;
this.e = e;
}
public String toString ()
{
return "[" + s + ", " + e + "]";
}
}
public static void main(String[] args)
{
String search = "other";
String str = "this is \"my\" example other string. And \"my other\" this is my str in no quotes.";
Pattern p = Pattern.compile("\"([^\"]*)\"");
Matcher m = p.matcher(str);
List<Pair> deadZones = new ArrayList<Pair>();
while (m.find())
{
int s = m.start();
int e = m.end();
deadZones.add(new Pair(s, e - 1));
}
List<Pair> hitZones = new ArrayList<Pair>();
p = Pattern.compile(search);
m = p.matcher(str);
while (m.find())
{
int s = m.start();
int e = m.end();
hitZones.add(new Pair(s, e - 1));
}
System.out.println(deadZones);
System.out.println(hitZones);
}
}
Note: The s component of all Pairs in the hitZones, that are not within deadZones, will ultimately be what you want.
As Mamun suggested, you can remove all the quotes and strings between quotes and then search. The following is a regex solution (though I agree with Tim that it's probably not a job for a regex engine).
String snippetQuoteRemoved = snippet.replaceAll("(?:\")(\\w+)(?:\")","");
// Now simply search in snippetQuoteRemoved
NOTE: This will look for \w+ i.e. ([a-zA-Z0-9_]). Change it to whatever is suitable for your use case.
EDIT
I checked if it'd remove everything and that's not the case. Check here.
Also, for those extra special chars, just change the regex to (?:")([a-zA-Z0-9_';:()_\-=+\[\]\{\}]+)(?:").
Another solution:
get all the start indexes of stringToFind in snippet
get all the indexes of the quotes in snippet
Depending of the start index of stringToFind, because you have the positions of the quotes, you can know if you are between quotes or not.
public List<Integer> getOccurrenceIndexesNotInQuotes(String snippet,String patternToFind) {
List<Integer> allIndexes = getStartPositions(snippet,patternToFind);
List<Integer> allQuoteIndexes = getStartPositions(snippet,"\"");
TreeSet<Integer> allQuoteIndexesTree = new TreeSet<>(allQuoteIndexes);
List<Integer> finalIndexes = new ArrayList<>();
for (Integer index : allIndexes){
Integer quoteIndexValue = allQuoteIndexesTree.floor(index);
int quoteIndex = allQuoteIndexes.indexOf(quoteIndexValue);
if (quoteIndexValue == null || !isBetweenQuote(quoteIndex)){
finalIndexes.add(index);
}
}
return finalIndexes;
}
private List<Integer> getStartPositions(String stringToProcess, String regex) {
List<Integer> out = new ArrayList<>();
Matcher matcher = Pattern.compile(regex).matcher(stringToProcess);
while(matcher.find()) {
out.add(matcher.start());
}
return out;
}
private boolean isBetweenQuote(Integer indexInQuoteList){
return indexInQuoteList % 2 != 1;
}

String matching in java

I am currently struggling with my "dirty word" filter finding partial matches.
example: if I pass in these two params replaceWord("ass", "passing pass passed ass")
to this method
private static String replaceWord(String word, String input) {
Pattern legacyPattern = Pattern.compile(word, Pattern.CASE_INSENSITIVE);
Matcher matcher = legacyPattern.matcher(input);
StringBuilder returnString = new StringBuilder();
int index = 0;
while(matcher.find()) {
returnString.append(input.substring(index,matcher.start()));
for(int i = 0; i < word.length() - 1; i++) {
returnString.append('*');
}
returnString.append(word.substring(word.length()-1));
index = matcher.end();
}
if(index < input.length() - 1){
returnString.append(input.substring(index));
}
return returnString.toString();
}
I get p*sing p*s p**sed **s
When I really just want "passing pass passed **s.
Does anyone know how to avoid this partial matching with this method??
Any help would be great thanks!
This tutorial from Oracle should point you in the right direction.
You want to use a word boundary in your pattern:
Pattern p = Pattern.compile("\\bword\\b", Pattern.CASE_INSENSITIVE);
Note, however that this still is problematic (as profanity filtering always is). A "non-word character" that defines the boundary is anything not included in [0-9A-Za-z_]
So for example, _ass would not match.
You also have the problem of profanity derived terms ... where the term is prepended to say, "hole", "wipe", etc
I'm working on a dirty word filter as we speak, and the option I chose to go with was Soundex and some regex.
I first filter out strange character with \w which is [a-zA-Z_0-9].
Then use soundex(String) to make a string that you can check against the soundex string of the word you want to test.
String soundExOfDirtyWord = Soundex.soundex(dirtyWord);
String soundExOfTestWord = Soundex.soundex(testWord);
if (soundExOfTestWord.equals(soundExOfDirtyWord)) {
System.out.println("The test words sounds like " + dirtyWord);
}
I just keep a list of dirty words in the program and have SoundEx run through them to check. The algorithm is something worth looking at.
You could also use replaceAll() method from the Matcher class. It replaces all the occurences of the pattern with your specified replacement word. Something like below.
private static String replaceWord(String word, String input) {
Pattern legacyPattern = Pattern.compile("\\b" + word + "\\b", Pattern.CASE_INSENSITIVE);
Matcher matcher = legacyPattern.matcher(input);
String replacement = "";
for (int i = 0; i < word.length() - 1; i++) {
replacement += "*";
}
replacement += word.charAt(word.length() - 1);
return matcher.replaceAll(replacement);
}

Quickest way to return list of Strings by using wildcard from collection in Java

I have set of 100000 String. And for example I want to get all strings starting with "JO" from that set. What would be the best solution for that?
I was thinking Aho-Corasick but the implementation I have does not support wild cards.
If you want all the strings starting with a sequence you can add all the String into a NavigableSet like TreeSet and get the subSet(text, text+'\uFFFF') will give you all the entries starting with text This lookup is O(log n)
If you want all the Strings with end with a sequence, you can do a similar thing, except you have to reverse the String. In this case a TreeMap from reversed String to forward String would be a better structure.
If you want "x*z" you can do a search with the first set and take a union with the values of the Map.
if you want contains "x", you can use a Navigable<String, Set<String>> where the key is each String starting from the first, second, third char etc The value is a Set as you can get duplicates. You can do a search like the starts with structure.
Here's a custom matcher class that does the matching without regular expressions (it only uses regex in the constructor, to put it more precisely) and supports wildcard matching:
public class WildCardMatcher {
private Iterable<String> patternParts;
private boolean openStart;
private boolean openEnd;
public WildCardMatcher(final String pattern) {
final List<String> tmpList = new ArrayList<String>(
Arrays.asList(pattern.split("\\*")));
while (tmpList.remove("")) { /* remove empty Strings */ }
// these last two lines can be made a lot simpler using a Guava Joiner
if (tmpList.isEmpty())
throw new IllegalArgumentException("Invalid pattern");
patternParts = tmpList;
openStart = pattern.startsWith("*");
openEnd = pattern.endsWith("*");
}
public boolean matches(final String item) {
int index = -1;
int nextIndex = -1;
final Iterator<String> it = patternParts.iterator();
if (it.hasNext()) {
String part = it.next();
index = item.indexOf(part);
if (index < 0 || (index > 0 && !openStart))
return false;
nextIndex = index + part.length();
while (it.hasNext()) {
part = it.next();
index = item.indexOf(part, nextIndex);
if (index < 0)
return false;
nextIndex = index + part.length();
}
if (nextIndex < item.length())
return openEnd;
}
return true;
}
}
Here's some test code:
public static void main(final String[] args) throws Exception {
testMatch("foo*bar", "foobar", "foo123bar", "foo*bar", "foobarandsomethingelse");
testMatch("*.*", "somefile.doc", "somefile", ".doc", "somefile.");
testMatch("pe*", "peter", "antipeter");
}
private static void testMatch(final String pattern, final String... words) {
final WildCardMatcher matcher = new WildCardMatcher(pattern);
for (final String word : words) {
System.out.println("Pattern " + pattern + " matches word '"
+ word + "': " + matcher.matches(word));
}
}
Output:
Pattern foo*bar matches word 'foobar': true
Pattern foo*bar matches word 'foo123bar': true
Pattern foo*bar matches word 'foo*bar': true
Pattern foo*bar matches word 'foobarandsomethingelse': false
Pattern *.* matches word 'somefile.doc': true
Pattern *.* matches word 'somefile': false
Pattern *.* matches word '.doc': true
Pattern *.* matches word 'somefile.': true
Pattern pe* matches word 'peter': true
Pattern pe* matches word 'antipeter': false
While this is far from being production-ready, it should be fast enough and it supports multiple wild cards (including in the first and last place). But of course if your wildcards are only at the end, use Peter's answer (+1).

How to determine where a regex failed to match using Java APIs

I have tests where I validate the output with a regex. When it fails it reports that output X did not match regex Y.
I would like to add some indication of where in the string the match failed. E.g. what is the farthest the matcher got in the string before backtracking. Matcher.hitEnd() is one case of what I'm looking for, but I want something more general.
Is this possible to do?
If a match fails, then Match.hitEnd() tells you whether a longer string could have matched. In addition, you can specify a region in the input sequence that will be searched to find a match. So if you have a string that cannot be matched, you can test its prefixes to see where the match fails:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class LastMatch {
private static int indexOfLastMatch(Pattern pattern, String input) {
Matcher matcher = pattern.matcher(input);
for (int i = input.length(); i > 0; --i) {
Matcher region = matcher.region(0, i);
if (region.matches() || region.hitEnd()) {
return i;
}
}
return 0;
}
public static void main(String[] args) {
Pattern pattern = Pattern.compile("[A-Z]+[0-9]+[a-z]+");
String[] samples = {
"*ABC",
"A1b*",
"AB12uv",
"AB12uv*",
"ABCDabc",
"ABC123X"
};
for (String sample : samples) {
int lastMatch = indexOfLastMatch(pattern, sample);
System.out.println(sample + ": last match at " + lastMatch);
}
}
}
The output of this class is:
*ABC: last match at 0
A1b*: last match at 3
AB12uv: last match at 6
AB12uv*: last match at 6
ABCDabc: last match at 4
ABC123X: last match at 6
You can take the string, and iterate over it, removing one more char from its end at every iteration, and then check for hitEnd():
int farthestPoint(Pattern pattern, String input) {
for (int i = input.length() - 1; i > 0; i--) {
Matcher matcher = pattern.matcher(input.substring(0, i));
if (!matcher.matches() && matcher.hitEnd()) {
return i;
}
}
return 0;
}
You could use a pair of replaceAll() calls to indicate the positive and negative matches of the input string. Let's say, for example, you want to validate a hex string; the following will indicate the valid and invalid characters of the input string.
String regex = "[0-9A-F]"
String input = "J900ZZAAFZ99X"
Pattern p = Pattern.compile(regex)
Matcher m = p.matcher(input)
String mask = m.replaceAll('+').replaceAll('[^+]', '-')
System.out.println(input)
System.out.println(mask)
This would print the following, with a + under valid characters and a - under invalid characters.
J900ZZAAFZ99X
-+++--+++-++-
If you want to do it outside of the code, I use rubular to test the regex expressions before sticking them in the code.

Is there a way to split strings with String.split() and include the delimiters? [duplicate]

I have a multiline string which is delimited by a set of different delimiters:
(Text1)(DelimiterA)(Text2)(DelimiterC)(Text3)(DelimiterB)(Text4)
I can split this string into its parts, using String.split, but it seems that I can't get the actual string, which matched the delimiter regex.
In other words, this is what I get:
Text1
Text2
Text3
Text4
This is what I want
Text1
DelimiterA
Text2
DelimiterC
Text3
DelimiterB
Text4
Is there any JDK way to split the string using a delimiter regex but also keep the delimiters?
You can use lookahead and lookbehind, which are features of regular expressions.
System.out.println(Arrays.toString("a;b;c;d".split("(?<=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("(?=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("((?<=;)|(?=;))")));
And you will get:
[a;, b;, c;, d]
[a, ;b, ;c, ;d]
[a, ;, b, ;, c, ;, d]
The last one is what you want.
((?<=;)|(?=;)) equals to select an empty character before ; or after ;.
EDIT: Fabian Steeg's comments on readability is valid. Readability is always a problem with regular expressions. One thing I do to make regular expressions more readable is to create a variable, the name of which represents what the regular expression does. You can even put placeholders (e.g. %1$s) and use Java's String.format to replace the placeholders with the actual string you need to use; for example:
static public final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";
public void someMethod() {
final String[] aEach = "a;b;c;d".split(String.format(WITH_DELIMITER, ";"));
...
}
You want to use lookarounds, and split on zero-width matches. Here are some examples:
public class SplitNDump {
static void dump(String[] arr) {
for (String s : arr) {
System.out.format("[%s]", s);
}
System.out.println();
}
public static void main(String[] args) {
dump("1,234,567,890".split(","));
// "[1][234][567][890]"
dump("1,234,567,890".split("(?=,)"));
// "[1][,234][,567][,890]"
dump("1,234,567,890".split("(?<=,)"));
// "[1,][234,][567,][890]"
dump("1,234,567,890".split("(?<=,)|(?=,)"));
// "[1][,][234][,][567][,][890]"
dump(":a:bb::c:".split("(?=:)|(?<=:)"));
// "[][:][a][:][bb][:][:][c][:]"
dump(":a:bb::c:".split("(?=(?!^):)|(?<=:)"));
// "[:][a][:][bb][:][:][c][:]"
dump(":::a::::b b::c:".split("(?=(?!^):)(?<!:)|(?!:)(?<=:)"));
// "[:::][a][::::][b b][::][c][:]"
dump("a,bb:::c d..e".split("(?!^)\\b"));
// "[a][,][bb][:::][c][ ][d][..][e]"
dump("ArrayIndexOutOfBoundsException".split("(?<=[a-z])(?=[A-Z])"));
// "[Array][Index][Out][Of][Bounds][Exception]"
dump("1234567890".split("(?<=\\G.{4})"));
// "[1234][5678][90]"
// Split at the end of each run of letter
dump("Boooyaaaah! Yippieeee!!".split("(?<=(?=(.)\\1(?!\\1))..)"));
// "[Booo][yaaaa][h! Yipp][ieeee][!!]"
}
}
And yes, that is triply-nested assertion there in the last pattern.
Related questions
Java split is eating my characters.
Can you use zero-width matching regex in String split?
How do I convert CamelCase into human-readable names in Java?
Backreferences in lookbehind
See also
regular-expressions.info/Lookarounds
A very naive solution, that doesn't involve regex would be to perform a string replace on your delimiter along the lines of (assuming comma for delimiter):
string.replace(FullString, "," , "~,~")
Where you can replace tilda (~) with an appropriate unique delimiter.
Then if you do a split on your new delimiter then i believe you will get the desired result.
import java.util.regex.*;
import java.util.LinkedList;
public class Splitter {
private static final Pattern DEFAULT_PATTERN = Pattern.compile("\\s+");
private Pattern pattern;
private boolean keep_delimiters;
public Splitter(Pattern pattern, boolean keep_delimiters) {
this.pattern = pattern;
this.keep_delimiters = keep_delimiters;
}
public Splitter(String pattern, boolean keep_delimiters) {
this(Pattern.compile(pattern==null?"":pattern), keep_delimiters);
}
public Splitter(Pattern pattern) { this(pattern, true); }
public Splitter(String pattern) { this(pattern, true); }
public Splitter(boolean keep_delimiters) { this(DEFAULT_PATTERN, keep_delimiters); }
public Splitter() { this(DEFAULT_PATTERN); }
public String[] split(String text) {
if (text == null) {
text = "";
}
int last_match = 0;
LinkedList<String> splitted = new LinkedList<String>();
Matcher m = this.pattern.matcher(text);
while (m.find()) {
splitted.add(text.substring(last_match,m.start()));
if (this.keep_delimiters) {
splitted.add(m.group());
}
last_match = m.end();
}
splitted.add(text.substring(last_match));
return splitted.toArray(new String[splitted.size()]);
}
public static void main(String[] argv) {
if (argv.length != 2) {
System.err.println("Syntax: java Splitter <pattern> <text>");
return;
}
Pattern pattern = null;
try {
pattern = Pattern.compile(argv[0]);
}
catch (PatternSyntaxException e) {
System.err.println(e);
return;
}
Splitter splitter = new Splitter(pattern);
String text = argv[1];
int counter = 1;
for (String part : splitter.split(text)) {
System.out.printf("Part %d: \"%s\"\n", counter++, part);
}
}
}
/*
Example:
> java Splitter "\W+" "Hello World!"
Part 1: "Hello"
Part 2: " "
Part 3: "World"
Part 4: "!"
Part 5: ""
*/
I don't really like the other way, where you get an empty element in front and back. A delimiter is usually not at the beginning or at the end of the string, thus you most often end up wasting two good array slots.
Edit: Fixed limit cases. Commented source with test cases can be found here: http://snippets.dzone.com/posts/show/6453
Pass the 3rd aurgument as "true". It will return delimiters as well.
StringTokenizer(String str, String delimiters, true);
I know this is a very-very old question and answer has also been accepted. But still I would like to submit a very simple answer to original question. Consider this code:
String str = "Hello-World:How\nAre You&doing";
inputs = str.split("(?!^)\\b");
for (int i=0; i<inputs.length; i++) {
System.out.println("a[" + i + "] = \"" + inputs[i] + '"');
}
OUTPUT:
a[0] = "Hello"
a[1] = "-"
a[2] = "World"
a[3] = ":"
a[4] = "How"
a[5] = "
"
a[6] = "Are"
a[7] = " "
a[8] = "You"
a[9] = "&"
a[10] = "doing"
I am just using word boundary \b to delimit the words except when it is start of text.
I got here late, but returning to the original question, why not just use lookarounds?
Pattern p = Pattern.compile("(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)");
System.out.println(Arrays.toString(p.split("'ab','cd','eg'")));
System.out.println(Arrays.toString(p.split("boo:and:foo")));
output:
[', ab, ',', cd, ',', eg, ']
[boo, :, and, :, foo]
EDIT: What you see above is what appears on the command line when I run that code, but I now see that it's a bit confusing. It's difficult to keep track of which commas are part of the result and which were added by Arrays.toString(). SO's syntax highlighting isn't helping either. In hopes of getting the highlighting to work with me instead of against me, here's how those arrays would look it I were declaring them in source code:
{ "'", "ab", "','", "cd", "','", "eg", "'" }
{ "boo", ":", "and", ":", "foo" }
I hope that's easier to read. Thanks for the heads-up, #finnw.
I had a look at the above answers and honestly none of them I find satisfactory. What you want to do is essentially mimic the Perl split functionality. Why Java doesn't allow this and have a join() method somewhere is beyond me but I digress. You don't even need a class for this really. Its just a function. Run this sample program:
Some of the earlier answers have excessive null-checking, which I recently wrote a response to a question here:
https://stackoverflow.com/users/18393/cletus
Anyway, the code:
public class Split {
public static List<String> split(String s, String pattern) {
assert s != null;
assert pattern != null;
return split(s, Pattern.compile(pattern));
}
public static List<String> split(String s, Pattern pattern) {
assert s != null;
assert pattern != null;
Matcher m = pattern.matcher(s);
List<String> ret = new ArrayList<String>();
int start = 0;
while (m.find()) {
ret.add(s.substring(start, m.start()));
ret.add(m.group());
start = m.end();
}
ret.add(start >= s.length() ? "" : s.substring(start));
return ret;
}
private static void testSplit(String s, String pattern) {
System.out.printf("Splitting '%s' with pattern '%s'%n", s, pattern);
List<String> tokens = split(s, pattern);
System.out.printf("Found %d matches%n", tokens.size());
int i = 0;
for (String token : tokens) {
System.out.printf(" %d/%d: '%s'%n", ++i, tokens.size(), token);
}
System.out.println();
}
public static void main(String args[]) {
testSplit("abcdefghij", "z"); // "abcdefghij"
testSplit("abcdefghij", "f"); // "abcde", "f", "ghi"
testSplit("abcdefghij", "j"); // "abcdefghi", "j", ""
testSplit("abcdefghij", "a"); // "", "a", "bcdefghij"
testSplit("abcdefghij", "[bdfh]"); // "a", "b", "c", "d", "e", "f", "g", "h", "ij"
}
}
I like the idea of StringTokenizer because it is Enumerable.
But it is also obsolete, and replace by String.split which return a boring String[] (and does not includes the delimiters).
So I implemented a StringTokenizerEx which is an Iterable, and which takes a true regexp to split a string.
A true regexp means it is not a 'Character sequence' repeated to form the delimiter:
'o' will only match 'o', and split 'ooo' into three delimiter, with two empty string inside:
[o], '', [o], '', [o]
But the regexp o+ will return the expected result when splitting "aooob"
[], 'a', [ooo], 'b', []
To use this StringTokenizerEx:
final StringTokenizerEx aStringTokenizerEx = new StringTokenizerEx("boo:and:foo", "o+");
final String firstDelimiter = aStringTokenizerEx.getDelimiter();
for(String aString: aStringTokenizerEx )
{
// uses the split String detected and memorized in 'aString'
final nextDelimiter = aStringTokenizerEx.getDelimiter();
}
The code of this class is available at DZone Snippets.
As usual for a code-challenge response (one self-contained class with test cases included), copy-paste it (in a 'src/test' directory) and run it. Its main() method illustrates the different usages.
Note: (late 2009 edit)
The article Final Thoughts: Java Puzzler: Splitting Hairs does a good work explaning the bizarre behavior in String.split().
Josh Bloch even commented in response to that article:
Yes, this is a pain. FWIW, it was done for a very good reason: compatibility with Perl.
The guy who did it is Mike "madbot" McCloskey, who now works with us at Google. Mike made sure that Java's regular expressions passed virtually every one of the 30K Perl regular expression tests (and ran faster).
The Google common-library Guava contains also a Splitter which is:
simpler to use
maintained by Google (and not by you)
So it may worth being checked out. From their initial rough documentation (pdf):
JDK has this:
String[] pieces = "foo.bar".split("\\.");
It's fine to use this if you want exactly what it does:
- regular expression
- result as an array
- its way of handling empty pieces
Mini-puzzler: ",a,,b,".split(",") returns...
(a) "", "a", "", "b", ""
(b) null, "a", null, "b", null
(c) "a", null, "b"
(d) "a", "b"
(e) None of the above
Answer: (e) None of the above.
",a,,b,".split(",")
returns
"", "a", "", "b"
Only trailing empties are skipped! (Who knows the workaround to prevent the skipping? It's a fun one...)
In any case, our Splitter is simply more flexible: The default behavior is simplistic:
Splitter.on(',').split(" foo, ,bar, quux,")
--> [" foo", " ", "bar", " quux", ""]
If you want extra features, ask for them!
Splitter.on(',')
.trimResults()
.omitEmptyStrings()
.split(" foo, ,bar, quux,")
--> ["foo", "bar", "quux"]
Order of config methods doesn't matter -- during splitting, trimming happens before checking for empties.
Here is a simple clean implementation which is consistent with Pattern#split and works with variable length patterns, which look behind cannot support, and it is easier to use. It is similar to the solution provided by #cletus.
public static String[] split(CharSequence input, String pattern) {
return split(input, Pattern.compile(pattern));
}
public static String[] split(CharSequence input, Pattern pattern) {
Matcher matcher = pattern.matcher(input);
int start = 0;
List<String> result = new ArrayList<>();
while (matcher.find()) {
result.add(input.subSequence(start, matcher.start()).toString());
result.add(matcher.group());
start = matcher.end();
}
if (start != input.length()) result.add(input.subSequence(start, input.length()).toString());
return result.toArray(new String[0]);
}
I don't do null checks here, Pattern#split doesn't, why should I. I don't like the if at the end but it is required for consistency with the Pattern#split . Otherwise I would unconditionally append, resulting in an empty string as the last element of the result if the input string ends with the pattern.
I convert to String[] for consistency with Pattern#split, I use new String[0] rather than new String[result.size()], see here for why.
Here are my tests:
#Test
public void splitsVariableLengthPattern() {
String[] result = Split.split("/foo/$bar/bas", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "/foo/", "$bar", "/bas" }, result);
}
#Test
public void splitsEndingWithPattern() {
String[] result = Split.split("/foo/$bar", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "/foo/", "$bar" }, result);
}
#Test
public void splitsStartingWithPattern() {
String[] result = Split.split("$foo/bar", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "", "$foo", "/bar" }, result);
}
#Test
public void splitsNoMatchesPattern() {
String[] result = Split.split("/foo/bar", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "/foo/bar" }, result);
}
I will post my working versions also(first is really similar to Markus).
public static String[] splitIncludeDelimeter(String regex, String text){
List<String> list = new LinkedList<>();
Matcher matcher = Pattern.compile(regex).matcher(text);
int now, old = 0;
while(matcher.find()){
now = matcher.end();
list.add(text.substring(old, now));
old = now;
}
if(list.size() == 0)
return new String[]{text};
//adding rest of a text as last element
String finalElement = text.substring(old);
list.add(finalElement);
return list.toArray(new String[list.size()]);
}
And here is second solution and its round 50% faster than first one:
public static String[] splitIncludeDelimeter2(String regex, String text){
List<String> list = new LinkedList<>();
Matcher matcher = Pattern.compile(regex).matcher(text);
StringBuffer stringBuffer = new StringBuffer();
while(matcher.find()){
matcher.appendReplacement(stringBuffer, matcher.group());
list.add(stringBuffer.toString());
stringBuffer.setLength(0); //clear buffer
}
matcher.appendTail(stringBuffer); ///dodajemy reszte ciagu
list.add(stringBuffer.toString());
return list.toArray(new String[list.size()]);
}
Another candidate solution using a regex. Retains token order, correctly matches multiple tokens of the same type in a row. The downside is that the regex is kind of nasty.
package javaapplication2;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class JavaApplication2 {
/**
* #param args the command line arguments
*/
public static void main(String[] args) {
String num = "58.5+variable-+98*78/96+a/78.7-3443*12-3";
// Terrifying regex:
// (a)|(b)|(c) match a or b or c
// where
// (a) is one or more digits optionally followed by a decimal point
// followed by one or more digits: (\d+(\.\d+)?)
// (b) is one of the set + * / - occurring once: ([+*/-])
// (c) is a sequence of one or more lowercase latin letter: ([a-z]+)
Pattern tokenPattern = Pattern.compile("(\\d+(\\.\\d+)?)|([+*/-])|([a-z]+)");
Matcher tokenMatcher = tokenPattern.matcher(num);
List<String> tokens = new ArrayList<>();
while (!tokenMatcher.hitEnd()) {
if (tokenMatcher.find()) {
tokens.add(tokenMatcher.group());
} else {
// report error
break;
}
}
System.out.println(tokens);
}
}
Sample output:
[58.5, +, variable, -, +, 98, *, 78, /, 96, +, a, /, 78.7, -, 3443, *, 12, -, 3]
I don't know of an existing function in the Java API that does this (which is not to say it doesn't exist), but here's my own implementation (one or more delimiters will be returned as a single token; if you want each delimiter to be returned as a separate token, it will need a bit of adaptation):
static String[] splitWithDelimiters(String s) {
if (s == null || s.length() == 0) {
return new String[0];
}
LinkedList<String> result = new LinkedList<String>();
StringBuilder sb = null;
boolean wasLetterOrDigit = !Character.isLetterOrDigit(s.charAt(0));
for (char c : s.toCharArray()) {
if (Character.isLetterOrDigit(c) ^ wasLetterOrDigit) {
if (sb != null) {
result.add(sb.toString());
}
sb = new StringBuilder();
wasLetterOrDigit = !wasLetterOrDigit;
}
sb.append(c);
}
result.add(sb.toString());
return result.toArray(new String[0]);
}
I suggest using Pattern and Matcher, which will almost certainly achieve what you want. Your regular expression will need to be somewhat more complicated than what you are using in String.split.
I don't think it is possible with String#split, but you can use a StringTokenizer, though that won't allow you to define your delimiter as a regex, but only as a class of single-digit characters:
new StringTokenizer("Hello, world. Hi!", ",.!", true); // true for returnDelims
If you can afford, use Java's replace(CharSequence target, CharSequence replacement) method and fill in another delimiter to split with.
Example:
I want to split the string "boo:and:foo" and keep ':' at its righthand String.
String str = "boo:and:foo";
str = str.replace(":","newdelimiter:");
String[] tokens = str.split("newdelimiter");
Important note: This only works if you have no further "newdelimiter" in your String! Thus, it is not a general solution.
But if you know a CharSequence of which you can be sure that it will never appear in the String, this is a very simple solution.
Fast answer: use non physical bounds like \b to split. I will try and experiment to see if it works (used that in PHP and JS).
It is possible, and kind of work, but might split too much. Actually, it depends on the string you want to split and the result you need. Give more details, we will help you better.
Another way is to do your own split, capturing the delimiter (supposing it is variable) and adding it afterward to the result.
My quick test:
String str = "'ab','cd','eg'";
String[] stra = str.split("\\b");
for (String s : stra) System.out.print(s + "|");
System.out.println();
Result:
'|ab|','|cd|','|eg|'|
A bit too much... :-)
Tweaked Pattern.split() to include matched pattern to the list
Added
// add match to the list
matchList.add(input.subSequence(start, end).toString());
Full source
public static String[] inclusiveSplit(String input, String re, int limit) {
int index = 0;
boolean matchLimited = limit > 0;
ArrayList<String> matchList = new ArrayList<String>();
Pattern pattern = Pattern.compile(re);
Matcher m = pattern.matcher(input);
// Add segments before each match found
while (m.find()) {
int end = m.end();
if (!matchLimited || matchList.size() < limit - 1) {
int start = m.start();
String match = input.subSequence(index, start).toString();
matchList.add(match);
// add match to the list
matchList.add(input.subSequence(start, end).toString());
index = end;
} else if (matchList.size() == limit - 1) { // last one
String match = input.subSequence(index, input.length())
.toString();
matchList.add(match);
index = end;
}
}
// If no match was found, return this
if (index == 0)
return new String[] { input.toString() };
// Add remaining segment
if (!matchLimited || matchList.size() < limit)
matchList.add(input.subSequence(index, input.length()).toString());
// Construct result
int resultSize = matchList.size();
if (limit == 0)
while (resultSize > 0 && matchList.get(resultSize - 1).equals(""))
resultSize--;
String[] result = new String[resultSize];
return matchList.subList(0, resultSize).toArray(result);
}
Here's a groovy version based on some of the code above, in case it helps. It's short, anyway. Conditionally includes the head and tail (if they are not empty). The last part is a demo/test case.
List splitWithTokens(str, pat) {
def tokens=[]
def lastMatch=0
def m = str=~pat
while (m.find()) {
if (m.start() > 0) tokens << str[lastMatch..<m.start()]
tokens << m.group()
lastMatch=m.end()
}
if (lastMatch < str.length()) tokens << str[lastMatch..<str.length()]
tokens
}
[['<html><head><title>this is the title</title></head>',/<[^>]+>/],
['before<html><head><title>this is the title</title></head>after',/<[^>]+>/]
].each {
println splitWithTokens(*it)
}
An extremely naive and inefficient solution which works nevertheless.Use split twice on the string and then concatenate the two arrays
String temp[]=str.split("\\W");
String temp2[]=str.split("\\w||\\s");
int i=0;
for(String string:temp)
System.out.println(string);
String temp3[]=new String[temp.length-1];
for(String string:temp2)
{
System.out.println(string);
if((string.equals("")!=true)&&(string.equals("\\s")!=true))
{
temp3[i]=string;
i++;
}
// System.out.println(temp.length);
// System.out.println(temp2.length);
}
System.out.println(temp3.length);
String[] temp4=new String[temp.length+temp3.length];
int j=0;
for(i=0;i<temp.length;i++)
{
temp4[j]=temp[i];
j=j+2;
}
j=1;
for(i=0;i<temp3.length;i++)
{
temp4[j]=temp3[i];
j+=2;
}
for(String s:temp4)
System.out.println(s);
String expression = "((A+B)*C-D)*E";
expression = expression.replaceAll("\\+", "~+~");
expression = expression.replaceAll("\\*", "~*~");
expression = expression.replaceAll("-", "~-~");
expression = expression.replaceAll("/+", "~/~");
expression = expression.replaceAll("\\(", "~(~"); //also you can use [(] instead of \\(
expression = expression.replaceAll("\\)", "~)~"); //also you can use [)] instead of \\)
expression = expression.replaceAll("~~", "~");
if(expression.startsWith("~")) {
expression = expression.substring(1);
}
String[] expressionArray = expression.split("~");
System.out.println(Arrays.toString(expressionArray));
One of the subtleties in this question involves the "leading delimiter" question: if you are going to have a combined array of tokens and delimiters you have to know whether it starts with a token or a delimiter. You could of course just assume that a leading delim should be discarded but this seems an unjustified assumption. You might also want to know whether you have a trailing delim or not. This sets two boolean flags accordingly.
Written in Groovy but a Java version should be fairly obvious:
String tokenRegex = /[\p{L}\p{N}]+/ // a String in Groovy, Unicode alphanumeric
def finder = phraseForTokenising =~ tokenRegex
// NB in Groovy the variable 'finder' is then of class java.util.regex.Matcher
def finderIt = finder.iterator() // extra method added to Matcher by Groovy magic
int start = 0
boolean leadingDelim, trailingDelim
def combinedTokensAndDelims = [] // create an array in Groovy
while( finderIt.hasNext() )
{
def token = finderIt.next()
int finderStart = finder.start()
String delim = phraseForTokenising[ start .. finderStart - 1 ]
// Groovy: above gets slice of String/array
if( start == 0 ) leadingDelim = finderStart != 0
if( start > 0 || leadingDelim ) combinedTokensAndDelims << delim
combinedTokensAndDelims << token // add element to end of array
start = finder.end()
}
// start == 0 indicates no tokens found
if( start > 0 ) {
// finish by seeing whether there is a trailing delim
trailingDelim = start < phraseForTokenising.length()
if( trailingDelim ) combinedTokensAndDelims << phraseForTokenising[ start .. -1 ]
println( "leading delim? $leadingDelim, trailing delim? $trailingDelim, combined array:\n $combinedTokensAndDelims" )
}
If you want keep character then use split method with loophole in .split() method.
See this example:
public class SplitExample {
public static void main(String[] args) {
String str = "Javathomettt";
System.out.println("method 1");
System.out.println("Returning words:");
String[] arr = str.split("t", 40);
for (String w : arr) {
System.out.println(w+"t");
}
System.out.println("Split array length: "+arr.length);
System.out.println("method 2");
System.out.println(str.replaceAll("t", "\n"+"t"));
}
I don't know Java too well, but if you can't find a Split method that does that, I suggest you just make your own.
string[] mySplit(string s,string delimiter)
{
string[] result = s.Split(delimiter);
for(int i=0;i<result.Length-1;i++)
{
result[i] += delimiter; //this one would add the delimiter to each items end except the last item,
//you can modify it however you want
}
}
string[] res = mySplit(myString,myDelimiter);
Its not too elegant, but it'll do.

Categories