I am writing code to detect bad keywords in a file. Here are the steps that I follow:
Tokenize using StreamTokenizer
Use pattern matcher to find the matches
while(streamTokenizer.nextToken() != StreamTokenizer.TT_EOF){
if(streamTokenizer.ttype == StreamTokenizer.TT_WORD) {
String token = streamTokenizer.sval.trim().replaceAll("\\\\n", "");
final Matcher matcher = badKeywordPattern.matcher(token);
if(matcher.find()) { // bad tokens found
return true;
}
}
}
String token = streamTokenizer.sval.trim().replaceAll("\\\\n", "") is done to match token spanning multiple lines with \. Example:
bad\
token
However the replace is not working. Any suggestions? Any other ways to do this?
Assuming you want to remove every \ placed at the end of a line, together with the line separator that follows it, you could use replaceAll("\\\\\\R","").
To represent \ in regex (which is what replaceAll uses) we need to escape it with another \, which leaves us with \\. But since \ is also special in String literals we need to escape each of them again with another backslash which leaves us with "\\\\"
Since Java 8 we can use \R (which needs to be written as "\\R" since \ requires escaping) to match line separators such as \r, \n, or the \r\n pair.
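As a minimal sketch of the replacement described above: the class and method names here are hypothetical, and the input is assumed to be an already-read string that contains a backslash directly followed by a line break.

```java
import java.util.regex.Pattern;

final class ContinuationJoiner {
    // Regex \\\R: an escaped (literal) backslash followed by any line break.
    // \R requires Java 8 or later.
    private static final Pattern CONTINUATION = Pattern.compile("\\\\\\R");

    // Removes every "backslash + line break" pair, joining continued lines.
    static String joinContinuedLines(String text) {
        return CONTINUATION.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // The literal below is: bad, backslash, newline, token
        System.out.println(joinContinuedLines("bad\\\ntoken")); // prints "badtoken"
    }
}
```

A backslash that is not followed by a line break is left untouched, so ordinary escapes inside the text are not affected.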
If I understand correctly, you do not need regex at all (which is what String.replaceAll uses). Just do a literal string replacement with String.replace, and use one fewer backslash:
String token = streamTokenizer.sval.trim().replace("\\\n", "");
Based on @Pshemo's answer, which shows how \ and \n are written in a regex, you could do it like this:
String[] tkns = streamTokenizer.sval.trim().split("\\\\\\R"); // yourString = "bad\\\ntoken"
StringBuilder token = new StringBuilder();
for (String tkn : tkns)
{
token.append(tkn);
//System.out.println(tkn);
}
//final Matcher matcher = badKeywordPattern.matcher(token)
Apologies if this has already been answered.
I am using the following code to search for a substring:
String subject = "ABC";
String subString = "AB";
Pattern pattern = Pattern.compile(subString, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(subject);
while (matcher.find()){
//Matched
}
But when my subject string contains a $ in the beginning, it does not work since it is a special character.
String subject = "$ABC";
String subString = "$";
How does one handle that?
By escaping the special character in the subString. Like,
String subString = "\\$";
or telling the Pattern to match literals. Like,
Pattern pattern = Pattern.compile(subString, Pattern.LITERAL | Pattern.CASE_INSENSITIVE);
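A minimal sketch of the literal-matching approach, using Pattern.quote so the caller never has to escape anything by hand. The class and method names are hypothetical, and counting the matches is just one assumed use of the loop from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LiteralSearch {
    // Counts case-insensitive occurrences of subString in subject,
    // treating every character of subString as literal text.
    static int countOccurrences(String subject, String subString) {
        // Pattern.quote wraps the text so metacharacters such as $ lose their meaning.
        Pattern pattern = Pattern.compile(Pattern.quote(subString), Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(subject);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences("$ABC", "$")); // prints 1
    }
}
```

Quoting is usually the safer choice when the search text comes from user input, since it covers every metacharacter rather than only the ones you remembered to escape.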
Regex has a number of metacharacters. Those supported by Java's regex engine include
( ) [ ] { } \ ^ $ | ? * + . < > - = !
So $ is indeed a metacharacter here. A metacharacter conveys a special meaning to the regex engine and hence can't be used literally. To match one literally, prefix it with the escape character, which is the backslash \
Only the pattern needs escaping; the subject string stays as it is. So
String subject = "$ABC";
String subString = "\\$";
would do. The regex escape itself is a single backslash; it is doubled here because \ is also the escape character in Java string literals.
I have a string with \r\n, \r, \n or \" characters in it. How can I replace them faster?
What I already have is:
String s = "Kerner\\r\\n kyky\\r hihi\\n \\\"";
System.out.println(s.replace("\\r\\n", "\n").replace("\\r", "").replace("\\n", "").replace("\\", ""));
But my code does not look beautiful enough.
I found on the Internet something like:
replace("\\r\\n|\\r|\\n|\\", "")
I tried that, but it didn't work.
You can wrap it in a method: put \r\n, \n and \r in a list, iterate over the list, replace every such substring, and return the modified string.
public String replaceMultipleSubstrings(String original, List<String> mylist){
String tmp = original;
for(String str: mylist){
tmp = tmp.replace(str, "");
}
return tmp;
}
Test:
List<String> mylist = new ArrayList<>();
String s = "Kerner\\r\\n kyky\\r hihi\\n \\\"";
mylist.add("\\r");
mylist.add("\\r\\n");
mylist.add("\\n");
mylist.add("\\"); // add backslash
System.out.println("original:" + s);
String x = new Main().replaceMultipleSubstrings(s, mylist);
System.out.println("modified:" + x);
Output:
original:Kerner\r\n kyky\r hihi\n \"
modified:Kerner kyky hihi "
I don't know whether your current replacement logic is correct, but as written it replaces the two-character sequences \r and \n (and therefore \r\n) with the empty string, and also replaces any remaining backslash with the empty string. If that is what you want, you can try the following regex with replaceAll:
String s = "Kerner\\r\\n kyky\\r hihi\\n \\\"";
System.out.println(s.replaceAll("\\\\r|\\\\n|\\\\", ""));
One problem with your attempt is that you are calling replace(), not replaceAll(). replace() does replace every occurrence, but it treats its first argument as literal text, so the alternation with | is never interpreted as a regex.
String.replaceAll() can be used. In your question you tried String.replace(), which does not interpret regular expressions, only plain replacement strings.
You also need to escape each \\ again, i.e. write \\\\ instead of \\
String s = "Kerner\\r\\n kyky\\r hihi\\n \\\"";
System.out.println(s.replaceAll("\\\\r|\\\\n|\\\\\"", ""));
Output
Kerner kyky hihi
Note the differences between String.replaceAll() and String.replace()
String.replaceAll()
Replaces each substring of this string that matches the given regular
expression with the given replacement.
String.replace()
Replaces each substring of this string that matches the literal target
sequence with the specified literal replacement sequence.
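A small sketch of that difference, with hypothetical class and method names. Both methods replace every occurrence; the only thing that changes is whether the first argument is read as literal text or as a regular expression.

```java
final class ReplaceVsReplaceAll {
    // String.replace: "a|b" is the literal three-character sequence a, |, b.
    static String literal(String s) {
        return s.replace("a|b", "-");
    }

    // String.replaceAll: "a|b" is a regex meaning "a or b".
    static String regex(String s) {
        return s.replaceAll("a|b", "-");
    }

    public static void main(String[] args) {
        System.out.println(literal("a|b ab")); // prints "- ab"
        System.out.println(regex("a|b ab"));   // prints "-|- --"
    }
}
```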
Use a regular expression if you want to do all the replaces in one go.
http://www.javamex.com/tutorials/regular_expressions/search_replace.shtml
I'm getting a message from another program in which some characters are changed:
\n (enter) -> #
(hash) # -> \#
\ -> \\\\
When I try to reverse these changes with my code it does not work, probably because of this:
Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string. Dollar signs may be treated as references to captured subsequences as described above, and backslashes are used to escape literal characters in the replacement string.
This is my code:
public String changeChars(final String message) {
String changedMessage = message;
changedMessage = changePattern(changedMessage, "[^\\\\][#]", "\n");
changedMessage = changePattern(changedMessage, "([^\\\\][\\\\#])", "#");
changedMessage = changePattern(changedMessage, "[\\\\\\\\\\\\\\\\]", "\\\\");
return changedMessage;
}
private String changePattern(final String message, String patternString, String replaceString) {
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(message);
return matcher.replaceAll(replaceString);
}
I assume that your encoding method works like this.
replace all \ with \\\\
mark originally placed # as \#
now since we know that all originally placed # have \ before it we can use it to mark new lines \n with #.
Code for that could be something like
data = data.replace("\\", "\\\\\\\\");
data = data.replace("#", "\\#");
data = data.replace("\n", "#");
To reverse this operation we need to start from the end (from the last replacement).
We will replace all # that don't have \ before them with new line \n marks (if we started with the 2nd replacement \# -> # we wouldn't know later which of the # were replacements of \n).
After that we can safely replace \# with # (this way we get rid of the additional \ that wasn't in the original String, so it won't interfere with our last replacement step).
And lastly we replace \\\\ with \.
Here is how we can do it.
//your previous regex [^\\\\][#] describes "any character that is not \, followed by #"
//but since we don't want to consume that additional non-`\` character while replacing
//we should use the negative look-behind mechanism "(?<!prefix)"
data = data.replaceAll("(?<!\\\\)#", "\n");
//now that we got rid of the additional "#" it's time to replace `\#` with `#`
data = data.replace("\\#", "#");
//and lastly `\\\\` to `\`
data = data.replace("\\\\\\\\", "\\");
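Putting the two directions together, a round trip could be sketched as follows. The class and method names are hypothetical, and the encoding rules are the ones assumed in this answer, not a confirmed description of the other program.

```java
final class MessageCodec {
    // Assumed encoding: \ -> \\\\, then # -> \#, then newline -> #
    static String encode(String data) {
        data = data.replace("\\", "\\\\\\\\"); // one backslash becomes four
        data = data.replace("#", "\\#");       // original hashes get a backslash prefix
        return data.replace("\n", "#");        // newlines become bare hashes
    }

    // Reverses the steps above in the opposite order.
    static String decode(String data) {
        data = data.replaceAll("(?<!\\\\)#", "\n"); // bare # (no \ before it) -> newline
        data = data.replace("\\#", "#");            // \# -> #
        return data.replace("\\\\\\\\", "\\");      // four backslashes -> one
    }

    public static void main(String[] args) {
        String original = "a#b\nc\\d";
        System.out.println(decode(encode(original)).equals(original)); // prints "true"
    }
}
```

One limitation of this sketch: if the original text has a backslash immediately before a newline or a #, the look-behind sees that backslash and the round trip no longer restores the input.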
I want to split the string "Good^Evening". I used split, but it does not split the value. Please help me.
This is what I've been trying:
String Val = "Good^Evening";
String[] valArray = Val.Split("^");
I'm assuming you did something like:
String[] parts = str.split("^");
That doesn't work because the argument to split is actually a regular expression, where ^ has a special meaning. Try this instead:
String[] parts = str.split("\\^");
The \\ is really equivalent to a single \ (the first \ is required as a Java escape sequence in string literals). It is then a special character in regular expressions which means "use the next character literally, don't interpret its special meaning".
The regex you should use is "\^" which you write as "\\^" as a Java String literal; i.e.
String[] parts = "Good^Evening".split("\\^");
The regex needs a '\' escape because the caret character ('^') is a meta-character in the regex language. The 2nd '\' escape is needed because '\' is an escape in a String literal.
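An alternative sketch that avoids hand-written escapes by letting Pattern.quote build the literal regex. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class CaretSplit {
    // Splits on a literal ^ by quoting it, which is equivalent to escaping it as "\\^".
    static String[] splitOnCaret(String value) {
        return value.split(Pattern.quote("^"));
    }

    public static void main(String[] args) {
        for (String part : splitOnCaret("Good^Evening")) {
            System.out.println(part); // prints "Good", then "Evening"
        }
    }
}
```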
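Try this, which removes the caret instead of splitting on it (inside a character class a leading ^ must be escaped, otherwise it is read as negation):
String str = "Good^Evening";
String newStr = str.replaceAll("[\\^]+", "");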
I found a few references to regex filtering out non-English but none of them is in Java, aside from the fact that they are all referring to somewhat different problems than what I am trying to solve:
Replace all non-English characters with a space.
Create a method that returns true if a string contains any non-English character.
By "English text" I mean not only actual letters and numbers but also punctuation.
So far, what I have been able to come with for goal #1 is quite simple:
String.replaceAll("\\W", " ")
In fact, so simple that I suspect that I am missing something... Do you spot any caveats in the above?
As for goal #2, I could simply trim() the string after the above replaceAll(), then check if it's empty. But... Is there a more efficient way to do this?
In fact, so simple that I suspect that I am missing something... Do you spot any caveats in the above?
\W is equivalent to [^\w], and \w is equivalent to [a-zA-Z_0-9]. Using \W will replace everything which isn't a letter, a number, or an underscore, including punctuation, tabs and newline characters. Whether or not that's a problem is really up to you.
By "English text" I mean not only actual letters and numbers but also punctuation.
In that case, you might want to use a negated character class that also leaves punctuation alone; something like
[^\w.,;:'"]
Create a method that returns true if a string contains any non-English character.
Use Pattern and Matcher.
Pattern p = Pattern.compile("\\W");
boolean containsSpecialChars(String string)
{
Matcher m = p.matcher(string);
return m.find();
}
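If "English text" can be approximated as plain 7-bit ASCII (an assumption, not something the question states), both goals can be sketched with one precompiled pattern. The class and method names are hypothetical, and \p{ASCII} is swapped in here in place of \W so that punctuation and whitespace survive.

```java
import java.util.regex.Pattern;

final class AsciiTextCheck {
    // Assumption: "English" means any character in the ASCII range (letters,
    // digits, punctuation, whitespace). Everything else counts as non-English.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\p{ASCII}]");

    // Goal 1: replace every non-ASCII character with a space.
    static String replaceNonAscii(String text) {
        return NON_ASCII.matcher(text).replaceAll(" ");
    }

    // Goal 2: true if the text contains at least one non-ASCII character.
    static boolean containsNonAscii(String text) {
        return NON_ASCII.matcher(text).find();
    }
}
```

Using find() for the second goal stops at the first offending character, so there is no need to build a replaced copy of the string just to test it.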
This works for me
private static boolean isEnglish(String text) {
CharsetEncoder asciiEncoder = Charset.forName("US-ASCII").newEncoder();
CharsetEncoder isoEncoder = Charset.forName("ISO-8859-1").newEncoder();
return asciiEncoder.canEncode(text) || isoEncoder.canEncode(text);
}
Here is my solution. I assume the text may contain English words, punctuation marks and standard ASCII symbols such as @, %, # etc.
private static final String IS_ENGLISH_REGEX = "^[ \\w \\d \\s \\. \\& \\+ \\- \\, \\! \\@ \\# \\$ \\% \\^ \\* \\( \\) \\; \\\\ \\/ \\| \\< \\> \\\" \\' \\? \\= \\: \\[ \\] ]*$";
private static boolean isEnglish(String text) {
if (text == null) {
return false;
}
return text.matches(IS_ENGLISH_REGEX);
}
Assuming an English word is made up of characters from [a-zA-Z_0-9]:
To return true if a string contains any non-English character, use string.matches:
return !string.matches("^\\w+$");