package xmlchars;
import java.util.regex.Pattern;
public class TestRegex {
public static final String SPECIAL_CHARACTERS = "(?i)^[^a-z_]|[^a-z0-9-_.]";
public static void main(String[] args) {
// TODO Auto-generated method stub
String name = "#1998St #";
Pattern pattern = Pattern.compile(SPECIAL_CHARACTERS);
System.out.println(pattern.matcher(name).replaceAll(""));//gives wrong output 1998St
}
}
Basically, what I'm trying to achieve is:
String to start only with a-z and _
String to contain a-z 0-9 _ - . after the start
Case insensitive for the whole string
You could say:
... SPECIAL_CHARACTERS = "^[a-z_][a-z0-9_.-]*$";
and define the pattern by saying:
Pattern pattern = Pattern.compile(SPECIAL_CHARACTERS, Pattern.CASE_INSENSITIVE);
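As a minimal sketch of that validation approach, the following uses matches() together with the case-insensitive flag. The class and method names are hypothetical, and the character class is assumed to include - and . because the question lists them as allowed after the first character.

```java
import java.util.regex.Pattern;

final class NameValidator {
    // Assumed pattern: first character a-z or _, then any of a-z 0-9 _ . -
    private static final Pattern VALID_NAME =
            Pattern.compile("[a-z_][a-z0-9_.-]*", Pattern.CASE_INSENSITIVE);

    static boolean isValid(String name) {
        // matches() requires the whole input to match, so ^ and $ are not needed
        return VALID_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("St_1998.a-b")); // true
        System.out.println(isValid("#1998St #"));   // false
    }
}
```

Note that this only tells you whether a name is valid; it does not strip the invalid characters the way replaceAll does.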
I managed to crack the regex. It only takes a simple change to the existing one:
"^[^a-z_]*|[^a-z_0-9-._]"
Here you go, with the working proof.
package xmlchars;
import java.util.regex.Pattern;
public class TestRegex {
public static final String SPECIAL_CHARACTERS = "^[^a-z_]*|[^a-z_0-9-._]";
public static void main(String[] args) {
// TODO Auto-generated method stub
String name = " # !`~!##$%^&*()-_=+{}[];:',<>/?19.- 98Cc#19 #/9_-8-.";
Pattern pattern = Pattern.compile(SPECIAL_CHARACTERS, Pattern.CASE_INSENSITIVE);
System.out.println(pattern.matcher(name).replaceAll("")); // output _19.-98Cc199_-8-.
}
}
I'll assume you are trying to identify anything in the String that doesn't match the pattern. What you have looks almost correct. It looks like your regex might work like this:
"(?i)^([^a-z_]|[^a-z0-9-_.])"
That would only match when one of those two groups appears at the start of the String. Instead, try this:
"(?i)(^[^a-z_])|[^a-z0-9-_.]"
To shorten it even further, you could use the predefined character class \\W, which is the same as [^a-zA-Z_0-9]. With that, you wouldn't even need the case-insensitivity. Note that - and . are themselves non-word characters, so \\W already matches them and [\\W-.] would strip them too; to keep them, negate a class that lists them instead.
"(^\\W)|[^\\w.-]"
Given a String called str, str.replaceAll("(^\\W)|[^\\w.-]",""); will remove all invalid characters.
Test for your string:
class RegexTest
{
public static void main (String[] args)
{
String str = "#1998St #";
str = str.replaceAll("(^\\W)|[^\\w.-]","");
System.out.println(str);
}
}
Output:
1998St
I am using a regex to find the start and end of a function stored in a string in Java, but I am unable to get the end index.
String regexPart1 = "((public)|(private)|(protected)) [a-zA-Z_0-9\\<\\>\\,]+ ";
String regexPart2 = "\\(.*\\) (throws .*)?\\{.*}$";
Pattern pattern = Pattern.compile(regexPart1+"run"+regexPart2);
Matcher matcher = pattern.matcher(toEval);
while (matcher.find()) {
System.out.println(" Found: " + matcher.group());
}
With
toEval = "public class ClassEval{public static initialize() throws Exception(){System.out.println("Initialize");}public static void run() throws Exception{System.out.println("This should come only")}public static void main(String[] args){System.out.println("Hello");}}";
Expected output:
Found: public static void run(){System.out.println("This should come only")}
Actual output:
Found: public static void run(){System.out.println("This should come only")}public static void main(String[] args){System.out.println("Hello");}}
* is a greedy quantifier, meaning {.*} will match everything from the first { until the last }. Change it to {.*?} if you want it to stop at the first }, and drop the trailing $, which otherwise still forces the match to run to the end of the input. Of course it still won't be able to identify nested braces, but that's a whole other issue.
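The difference between the greedy and the reluctant quantifier can be seen in a small sketch. The input string and the class name are made up for illustration and are much simpler than the question's source text. The third pattern shows that a trailing $ makes even the reluctant form run to the last }.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsReluctant {
    // Returns the first match of regex in input, or null if there is none
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "void run(){a();}void main(){b();}";
        // Greedy: runs to the last closing brace
        System.out.println(firstMatch("run\\(\\)\\{.*\\}", input));
        // prints run(){a();}void main(){b();}
        // Reluctant: stops at the first closing brace
        System.out.println(firstMatch("run\\(\\)\\{.*?\\}", input));
        // prints run(){a();}
        // Reluctant but anchored with $: forced to the end of the input again
        System.out.println(firstMatch("run\\(\\)\\{.*?\\}$", input));
        // prints run(){a();}void main(){b();}
    }
}
```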
Firstly, I'm aware of similar questions that have been asked such as here:
How to split a string, but also keep the delimiters?
However, I'm having an issue implementing a split of a string using Pattern.split() where the pattern is based on a list of delimiters that can sometimes appear to overlap. Here is the example:
The goal is to split a string based on a set of known codewords which are surrounded by slashes, where I need to keep both the delimiter (codeword) itself and the value after it (which may be empty string).
For this example, the codewords are:
/ABC/
/DEF/
/GHI/
Based on the thread referenced above, the pattern is built as follows using look-ahead and look-behind to tokenise the string into codewords AND values:
((?<=/ABC/)|(?=/ABC/))|((?<=/DEF/)|(?=/DEF/))|((?<=/GHI/)|(?=/GHI/))
Working string:
"123/ABC//DEF/456/GHI/789"
Using split, this tokenises nicely to:
"123","/ABC/","/DEF/","456","/GHI/","789"
Problem string (note single slash between "ABC" and "DEF"):
"123/ABC/DEF/456/GHI/789"
Here the expectation is that "DEF/456" is the value after "/ABC/" codeword because the "DEF/" bit is not actually a codeword, but just happens to look like one!
Desired outcome is:
"123","/ABC/","DEF/456","/GHI/","789"
Actual outcome is:
"123","/ABC","/","DEF/","456","/GHI/","789"
As you can see, the slash between "ABC" and "DEF" is getting isolated as a token itself.
I've tried solutions as per the other thread using only look-ahead OR look-behind, but they all seem to suffer from the same issue. Any help appreciated!
If you are OK with find rather than split, using some non-greedy matches, try this:
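import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
public class SampleJava {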
static final String[] CODEWORDS = {
"ABC",
"DEF",
"GHI"};
static public void main(String[] args) {
String input = "/ABC/DEF/456/GHI/789";
String codewords = Arrays.stream(CODEWORDS)
.collect(Collectors.joining("|", "/(", ")/"));
// codewords = "/(ABC|DEF|GHI)/";
Pattern p = Pattern.compile(
/* codewords */ ("(DELIM)"
/* pre-delim */ + "|(.+?(?=DELIM))"
/* final bit */ + "|(.+?$)").replace("DELIM", codewords));
Matcher m = p.matcher(input);
while(m.find()) {
System.out.print(m.group(0));
if(m.group(1) != null) {
System.out.print(" ← code word");
}
System.out.println();
}
}
}
Output:
/ABC/ ← code word
DEF/456
/GHI/ ← code word
789
Use a combination of positive and negative look arounds:
String[] parts = s.split("(?<=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI)/....)|(?=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI))");
It is also considerably simpler than the original pattern, because the alternation of codewords sits inside a single look-ahead or look-behind.
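As a quick check, here is a small sketch that runs this split pattern against the two strings from the question. The class name is hypothetical. Note that the four dots in the negative look-behind assume every codeword is exactly three letters long (three letters plus the closing slash), as in the question.

```java
import java.util.Arrays;
import java.util.List;

final class LookaroundSplitDemo {
    // Split after a codeword unless that "codeword" overlaps the previous one,
    // or before a codeword unless its leading slash closes the previous one
    static final String SPLIT =
            "(?<=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI)/....)"
          + "|(?=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI))";

    static List<String> tokens(String s) {
        return Arrays.asList(s.split(SPLIT));
    }

    public static void main(String[] args) {
        System.out.println(tokens("123/ABC//DEF/456/GHI/789"));
        // prints [123, /ABC/, /DEF/, 456, /GHI/, 789]
        System.out.println(tokens("123/ABC/DEF/456/GHI/789"));
        // prints [123, /ABC/, DEF/456, /GHI/, 789]
    }
}
```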
Following some TDD principles (Red-Green-Refactor), here is how I would implement such behaviour:
Write specs (Red)
I defined a set of unit tests that explain how I understood your "tokenization process". If any test is not correct according to what you expect, feel free to tell me and I'll edit my answer accordingly.
import static org.assertj.core.api.Assertions.assertThat;
import java.util.List;
import org.junit.Test;
public class TokenizerSpec {
Tokenizer tokenizer = new Tokenizer("/ABC/", "/DEF/", "/GHI/");
@Test
public void itShouldTokenizeTwoConsecutiveCodewords() {
String input = "123/ABC//DEF/456";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123", "/ABC/", "/DEF/", "456");
}
@Test
public void itShouldTokenizeMisleadingCodeword() {
String input = "123/ABC/DEF/456/GHI/789";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123", "/ABC/", "DEF/456", "/GHI/", "789");
}
@Test
public void itShouldTokenizeWhenValueContainsSlash() {
String input = "1/23/ABC/456";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("1/23", "/ABC/", "456");
}
@Test
public void itShouldTokenizeWithoutCodewords() {
String input = "123/456/789";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123/456/789");
}
@Test
public void itShouldTokenizeWhenEndingWithCodeword() {
String input = "123/ABC/";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123", "/ABC/");
}
@Test
public void itShouldTokenizeWhenStartingWithCodeword() {
String input = "/ABC/123";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("/ABC/", "123");
}
@Test
public void itShouldTokenizeWhenOnlyCodeword() {
String input = "/ABC//DEF//GHI/";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("/ABC/", "/DEF/", "/GHI/");
}
}
Implement according to the specs (Green)
This class makes all the tests above pass:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
public final class Tokenizer {
private final List<String> codewords;
public Tokenizer(String... codewords) {
this.codewords = Arrays.asList(codewords);
}
public List<String> splitPreservingCodewords(String input) {
List<String> tokens = new ArrayList<>();
int lastIndex = 0;
int i = 0;
while (i < input.length()) {
final int idx = i;
Optional<String> codeword = codewords.stream()
.filter(cw -> input.startsWith(cw, idx))
.findFirst();
if (codeword.isPresent()) {
if (i > lastIndex) {
tokens.add(input.substring(lastIndex, i));
}
tokens.add(codeword.get());
i += codeword.get().length();
lastIndex = i;
} else {
i++;
}
}
if (i > lastIndex) {
tokens.add(input.substring(lastIndex, i));
}
return tokens;
}
}
Improve implementation (Refactor)
Not done at the moment (I don't have enough time to spend on this answer right now). I'll gladly refactor Tokenizer later if you ask me to. :-) Or you can do it yourself quite safely, since the unit tests guard against regressions.
I need help cutting a string at the point where a given substring occurs.
Example
Initial string: 123456789abcdefgh
string to substr: abcd
result : 123456789
I checked the substr method, but it only accepts index positions. Do I need to search for the occurrence of the substring and then pass its index?
If you want to split the String right after a given character (here "a"), the code would look like this.
You can change the "a" to any character within the string:
package nl.testing.startingpoint;
public class Main {
public static void main(String args[]) {
String[] part = getSplitArray("123456789abcdefgh", "a");
System.out.println(part[0]);
System.out.println(part[1]);
}
public static String[] getSplitArray(String toSplitString, String splitChar) {
return toSplitString.split("(?<=" + splitChar + ")");
}
}
Bear in mind that toSplitString.split("(?<=" + splitChar + ")"); splits after every occurrence of that character, not just the first, and that splitChar is interpreted as a regex.
Hope this might help:
public static void main(final String[] args)
{
searchString("123456789abcdefghabcd", "abcd");
}
public static void searchString(String inputValue, final String searchValue)
{
while (!(inputValue.indexOf(searchValue) < 0))
{
System.out.println(inputValue.substring(0, inputValue.indexOf(searchValue)));
inputValue = inputValue.substring(inputValue.indexOf(searchValue) +
searchValue.length());
}
}
Output:
123456789
efgh
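For the original example, where only the part before the first occurrence is wanted, the same indexOf call can feed substring directly. This is a minimal sketch; the class and method names are hypothetical, and returning the input unchanged when the marker is absent is an assumption about the desired behaviour.

```java
final class SubstringBefore {
    // Returns everything before the first occurrence of marker,
    // or the whole input if marker does not occur (assumed behaviour)
    static String before(String input, String marker) {
        int idx = input.indexOf(marker);
        return idx < 0 ? input : input.substring(0, idx);
    }

    public static void main(String[] args) {
        System.out.println(before("123456789abcdefgh", "abcd")); // prints 123456789
    }
}
```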
Use a regular expression, like this:
// Matches the substring and everything after it
static String regex = "abcd.*";
public String remove(String string, String regex) {
    // replaceAll leaves the string unchanged when nothing matches
    return string.replaceAll(regex, "");
}
I have string something like this:
10:11:22 [UTP][ROX][ID:32424][APP STR]
I want to separate each of them. How can I do it with a regex?
I want to get "10:11:22", "UTP", "ROX", "ID:32424" and "APP STR" as separate strings.
This would be your matching pattern: /\[([^\]]+)/g
Working demo @ regex101.com
Working Java demo:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    private static final String REGEX = "\\[([^\\]]+)";
    private static final String INPUT = "10:11:22 [UTP][ROX][ID:32424][APP STR]";

    public static void main(String[] args) {
        Matcher matcher = Pattern.compile(REGEX).matcher(INPUT);
        while (matcher.find()) {
            // group(1) is the text between the brackets
            System.out.println(matcher.group(1));
        }
    }
}
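The simplest solution I can think of is to split on the brackets, together with any whitespace in front of them (in Java the backslashes must be doubled inside the string literal):
"10:11:22 [UTP][ROX][ID:32424][APP STR]".split("\\s*[\\[\\]]+")
That will return an array like this (Java's split drops the blank final entry that other languages would keep):
["10:11:22",
"UTP",
"ROX",
"ID:32424",
"APP STR"]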
If you want a regex to do the job, try the one below:
(?:^([^\s]*)|\[([^]]*)\])
All the strings you want are stored separately in groups.
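A minimal Java sketch of reading those groups follows. The class and method names are hypothetical, and the pattern is the one above with the backslashes doubled for a Java string literal and the ] inside the class escaped explicitly. Group 1 holds the leading timestamp and group 2 holds each bracketed value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BracketFields {
    private static final Pattern FIELD =
            Pattern.compile("(?:^([^\\s]*)|\\[([^\\]]*)\\])");

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            // Exactly one of the two groups participates in each match
            out.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("10:11:22 [UTP][ROX][ID:32424][APP STR]"));
        // prints [10:11:22, UTP, ROX, ID:32424, APP STR]
    }
}
```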
I would like to do a simple String replace with a regular expression in Java, but the replacement value is not static; I would like it to be computed dynamically, as is possible in JavaScript.
I know I can make:
"some string".replaceAll("some regex", "new value");
But I would like something like:
"some string".replaceAll("some regex", new SomeThinkIDontKnow() {
public String handle(String group) {
return "my super dynamic string group " + group;
}
});
Maybe there is a Java way to do this, but I am not aware of it...
You need to use the Java regex API directly.
Create a Pattern object for your regex (this is reusable), then call the matcher() method to run it against your string.
You can then call find() repeatedly to loop through each match in your string, and assemble a replacement string as you like.
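The steps above can be sketched with Matcher.appendReplacement and Matcher.appendTail, which take care of copying the text between matches. The class name, method name and handler parameter are hypothetical. Matcher.quoteReplacement is used so that $ and \ in the handler's result are taken literally.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DynamicReplace {
    // Replaces every match of regex in input with handler.apply(match)
    static String replaceAll(String input, String regex, Function<String, String> handler) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Appends the text before the match, then the computed replacement
            m.appendReplacement(sb, Matcher.quoteReplacement(handler.apply(m.group())));
        }
        // Appends whatever follows the last match
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceAll("a1b22c", "\\d+", g -> "<" + g.length() + ">"));
        // prints a<1>b<2>c
    }
}
```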
Here is how such a replacement can be implemented.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegExCustomReplacementExample
{
public static void main(String[] args)
{
System.out.println(
new ReplaceFunction() {
public String handle(String group)
{
return "«"+group.substring(1, group.length()-1)+"»";
}
}.replace("A simple *test* string", "\\*.*?\\*"));
}
}
abstract class ReplaceFunction
{
public String replace(String source, String regex)
{
final Pattern pattern = Pattern.compile(regex);
final Matcher m = pattern.matcher(source);
boolean result = m.find();
if(result) {
StringBuilder sb = new StringBuilder(source.length());
int p=0;
do {
sb.append(source, p, m.start());
sb.append(handle(m.group()));
p=m.end();
} while (m.find());
sb.append(source, p, source.length());
return sb.toString();
}
return source;
}
public abstract String handle(String group);
}
It might look a bit complicated at first, but that doesn't matter because you only need to write it once. The subclasses implementing the handle method look simpler. An alternative is to pass the Matcher instead of the matched String (group 0) to the handle method, as it offers access to all groups matched by the pattern (if the pattern defines groups).
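As a sketch of that alternative, on Java 9 or later the standard library's Matcher.replaceAll accepts a function of the MatchResult, which exposes every group. The class name and the key=value example are made up for illustration, and the Java 9 requirement is an assumption about your environment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupAwareReplace {
    private static final Pattern PAIR = Pattern.compile("(\\w+)=(\\w+)");

    // Swaps the two sides of every key=value pair, using both capture groups
    static String swapPairs(String input) {
        return PAIR.matcher(input).replaceAll(
                r -> Matcher.quoteReplacement(r.group(2) + "=" + r.group(1)));
    }

    public static void main(String[] args) {
        System.out.println(swapPairs("a=1, b=2")); // prints 1=a, 2=b
    }
}
```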