How to remove special characters from input text - java

I want to remove all special characters from input text as well as some restricted words.
Whatever the things I want to remove, that will come dynamically
(Let me clarify this: Whatever the words I need to exclude they will be provided dynamically - the user will decide what needs to be excluded. That is the reason I did not include regex. restricted_words_list (see my code) will get from the database just to check the code working or not I kept statically ),
but for demonstration purposes, I kept them in a String array to confirm whether my code is working properly or not.
public class TestKeyword {
private static final String[] restricted_words_list={"#","of","an","^","#","<",">","(",")"};
private static final Pattern restrictedReplacer;
private static Set<String> restrictedWords = null;
static {
StringBuilder strb= new StringBuilder();
for(String str:restricted_words_list){
strb.append("\\b").append(Pattern.quote(str)).append("\\b|");
}
strb.setLength(strb.length()-1);
restrictedReplacer = Pattern.compile(strb.toString(),Pattern.CASE_INSENSITIVE);
strb = new StringBuilder();
}
public static void main(String[] args)
{
String inputText = "abcd abc# cbda ssef of jjj t#he g^g an wh&at ggg<g ss%ss ### (()) D^h^D";
System.out.println("inputText : " + inputText);
String modifiedText = restrictedWordCheck(inputText);
System.out.println("Modified Text : " + modifiedText);
}
public static String restrictedWordCheck(String input){
Matcher m = restrictedReplacer.matcher(input);
StringBuffer strb = new StringBuffer(input.length());//ensuring capacity
while(m.find()){
if(restrictedWords==null)restrictedWords = new HashSet<String>();
restrictedWords.add(m.group()); //m.group() returns what was matched
m.appendReplacement(strb,""); //this writes out what came in between matching words
for(int i=m.start();i<m.end();i++)
strb.append("");
}
m.appendTail(strb);
return strb.toString();
}
}
The output is :
inputText : abcd abc# cbda ssef of jjj t#he g^g an wh&at ggg
Modified Text : abcd abc# cbda ssef jjj the gg wh&at gggg ss%ss ### (()) DhD
Here the excluded words are of and an, but only some of the special characters, not all that I specified in restricted_words_list
Now I got a better Solution:
String inputText = title;// assigning input
List<String> restricted_words_list = catalogueService.getWordStopper(); // getting all stopper words from database dynamically (inside getWordStopper() method just i wrote a query and getting list of words)
String finalResult = "";
List<String> stopperCleanText = new ArrayList<String>();
String[] afterTextSplit = inputText.split("\\s"); // split and add to list
for (int i = 0; i < afterTextSplit.length; i++) {
stopperCleanText.add(afterTextSplit[i]); // adding to list
}
stopperCleanText.removeAll(restricted_words_list); // remove all word stopper
for (String addToString : stopperCleanText)
{
finalResult += addToString+";"; // add semicolon to cleaned text
}
return finalResult;

public String replaceAll(String regex,
String replacement)
Replaces each substring of this string (which matches the given regular expression) with the given replacement.
Parameters:
regex - the regular expression to which this string is to be
matched
replacement - the string to be substituted for each match.
So you just need to provide replacement parameter with an empty String.

You should change your loop
for(String str:restricted_words_list){
strb.append("\\b").append(Pattern.quote(str)).append("\\b|");
}
to this:
for(String str:restricted_words_list){
strb.append("\\b*").append(Pattern.quote(str)).append("\\b*|");
}
Because with your loop you're matching the restricted_words_list elements only if there is something before and after the match. Since abc# does not have anything after the # it will not be replaced. If you add * (which means 0 or more occurences) to the \\b on either side it will match things like abc# as well.

You may consider to use Regex directly to replace those special character with empty ''? Check it out: Java; String replace (using regular expressions)?, some tutorial here: http://www.vogella.com/articles/JavaRegularExpressions/article.html

You can also do like this :
String inputText = "abcd abc# cbda ssef of jjj t#he g^g an wh&at ggg<g ss%ss ### (()) D^h^D";
String regx="([^a-z^ ^0-9]*\\^*)";
String textWithoutSpecialChar=inputText.replaceAll(regx,"");
System.out.println("Without Special Char:"+textWithoutSpecialChar);
String yourSetofString="of|an"; // your restricted words.
String op=textWithoutSpecialChar.replaceAll(yourSetofString,"");
System.out.println("output : "+op);
o/p :
Without Special Char:abcd abc cbda ssef of jjj the gg an what gggg ssss h
output : abcd abc cbda ssef jjj the gg what gggg ssss h

String s = "abcd abc# cbda ssef of jjj t#he g^g an wh&at ggg (blah) and | then";
String[] words = new String[]{ " of ", "|", "(", " an ", "#", "#", "&", "^", ")" };
StringBuilder sb = new StringBuilder();
for( String w : words ) {
if( w.length() == 1 ) {
sb.append( "\\" );
}
sb.append( w ).append( "|" );
}
System.out.println( s.replaceAll( sb.toString(), "" ) );

Related

Get the value after a string and comma and ends if there is character '|'

If I have string variable :
String word = "wordA";
and I have another string variable :
String fullText= "wordA,A A|wordB,B B|wordC,C C|wordD,D D";
so is it possible to get the value after the comma and ends with | ?
Example
If word equals "wordA" then I get only "A A" because in fullText right after wordA and comma is 'A A' and ends with |
and if word equals "wordD" then varible result is "D D" based on the variable fullText.
So how to get this variable result in Java ?
You can use a simple regular expression. Like this:
String text = fullText.replaceAll(".*" + word + ",([^\\|]+).*", "$1");
Alternatively:
Matcher matcher = Pattern.compile(word + ",([^\\|]+)").matcher(fullText);
matcher.find();
matcher.group(1); // "A A" for word = wordA
If you are using Java8 you can use stream like so :
String result = Arrays.stream(fullText.split("\\|")) // split with |
.filter(s -> s.startsWith(word + ",")) // filter by start with word + ','
.findFirst() // find first or any
.map(a -> a.substring(word.length() + 1)) // get every thing after work + ','
.orElse(null); // or else null or any default value
How about this:
public static String search(String fullText, String key) {
Pattern re = Pattern.compile("(?:^|\\|)" + key + ",([^|]*)(?:$|\\|)");
Matcher matcher = re.matcher(fullText);
if (matcher.find()) {
return matcher.group(1);
}
return null;
}
Example:
String fullText= "wordA,A A|wordB,B B|wordC,C C|wordD,D D";
System.out.println(search(fullText, "wordA"));
System.out.println(search(fullText, "wordB"));
System.out.println(search(fullText, "wordC"));
System.out.println(search(fullText, "wordD"));
Output:
A A
B B
C C
D D
UPDATE: To avoid recompiling the regex at each search:
private static final Pattern RE = Pattern.compile("(?:^|\\|)([^,]*),([^|]*)(?:$|(?=\\|))");
public static String search(String fullText, String key) {
Matcher matcher = RE.matcher(fullText);
while (matcher.find()) {
if (matcher.group(1).equals(key)) {
return matcher.group(2);
}
}
return null;
}

How to put Java Regex matches to Resultant String?

How to tokenize an String like in lexer in java?
Please refer to the above question. I never used java regex . How to put the all substring into new string with matched characters (symbols like '(' ')' '.' '<' '>' ") separated by single space . for e.g. before regex
String c= "List<String> uncleanList = Arrays.asList(input1.split("x"));" ;
I want resultant string like this .
String r= " List < String > uncleanList = Arrays . asList ( input1 . split ( " x " ) ) ; "
Referring to the code that you linked to, matcher.group() will give you a single token. Simple use a StringBuilder to append this token and a space to get a new string where the tokens are space-separated.
String c = "List<String> uncleanList = Arrays.asList(input1.split(\"x\"));" ;
Pattern pattern = Pattern.compile("\\w+|[+-]?[0-9\\._Ee]+|\\S");
Matcher matcher = pattern.matcher(c);
StringBuilder sb = new StringBuilder();
while (matcher.find()) {
String token = matcher.group();
sb.append(token).append(" ");
}
String r = sb.toString();
System.out.println(r);
String c = "List<String> uncleanList = Arrays.asList(input1.split('x'));";
Matcher matcher = Pattern.compile("\\<|\\>|\\\"|\\.|\\(|\\)").matcher(c);
while(matcher.find()){
String symbol = matcher.group();
c = c.replace(symbol," " + symbol + " ");
}
Actually if you look deeply You can figure out that you have to separate only not alphabet symbols and space ((?![a-zA-Z]|\ ).)

Length of String within tags in java

We need to find the length of the tag names within the tags in java
{Student}{Subject}{Marks}100{/Marks}{/Subject}{/Student}
so the length of Student tag is 7 and that of subject tag is 7 and that of marks is 5.
I am trying to split the tags and then find the length of each string within the tag.
But the code I am trying gives me only the first tag name and not others.
Can you please help me on this?
I am very new to java. Please let me know if this is a very silly question.
Code part:
System.out.println(
getParenthesesContent("{Student}{Subject}{Marks}100{/Marks}{/Subject}{/Student}"));
public static String getParenthesesContent(String str) {
return str.substring(str.indexOf('{')+1,str.indexOf('}'));
}
You can use Patterns with this regex \\{(\[a-zA-Z\]*)\\} :
String text = "{Student}{Subject}{Marks}100{/Marks}{/Subject}{/Student}";
Matcher matcher = Pattern.compile("\\{([a-zA-Z]*)\\}").matcher(text);
while (matcher.find()) {
System.out.println(
String.format(
"tag name = %s, Length = %d ",
matcher.group(1),
matcher.group(1).length()
)
);
}
Outputs
tag name = Student, Length = 7
tag name = Subject, Length = 7
tag name = Marks, Length = 5
You might want to give a try to another regex:
String s = "{Abc}{Defg}100{Hij}100{/Klmopr}{/Stuvw}"; // just a sample String
Pattern p = Pattern.compile("\\{\\W*(\\w++)\\W*\\}");
Matcher m = p.matcher(s);
while(m.find()) {
System.out.println(m.group(1) + ", length: " + m.group(1).length());
}
Output you get:
Abc, length: 3
Defg, length: 4
Hij, length: 3
Klmopr, length: 6
Stuvw, length: 5
If you need to use charAt() to walk over the input String, you might want to consider using something like this (I made some explanations in the comments to the code):
String s = "{Student}{Subject}{Marks}100{/Marks}{/Subject}{/Student}";
ArrayList<String> tags = new ArrayList<>();
for(int i = 0; i < s.length(); i++) {
StringBuilder sb = new StringBuilder(); // Use StringBuilder and its append() method to append Strings (it's more efficient than "+=") String appended = ""; // This String will be appended when correct tag is found
if(s.charAt(i) == '{') { // If start of tag is found...
while(!(Character.isLetter(s.charAt(i)))) { // Skip characters that are not letters
i++;
}
while(Character.isLetter(s.charAt(i))) { // Append String with letters that are found
sb.append(s.charAt(i));
i++;
}
if(!(tags.contains(sb.toString()))) { // Add final String to ArrayList only if it not contained here yet
tags.add(sb.toString());
}
}
}
for(String tag : tags) { // Printing Strings contained in ArrayList and their length
System.out.println(tag + ", length: " + tag.length());
}
Output you get:
Student, length: 7
Subject, length: 7
Marks, length: 5
yes use regular expression, find the pattern and apply that.

complex regular expression in Java

I have a rather complex (to me it seems rather complex) problem that I'm using regular expressions in Java for:
I can get any text string that must be of the format:
M:<some text>:D:<either a url or string>:C:<some more text>:Q:<a number>
I started with a regular expression for extracting the text between the M:/:D:/:C:/:Q: as:
String pattern2 = "(M:|:D:|:C:|:Q:.*?)([a-zA-Z_\\.0-9]+)";
And that works fine if the <either a url or string> is just an alphanumeric string. But it all falls apart when the embedded string is a url of the format:
tcp://someurl.something:port
Can anyone help me adjust the above reg exp to extract the text after :D: to be either a url or a alpha-numeric string?
Here's an example:
public static void main(String[] args) {
String name = "M:myString1:D:tcp://someurl.com:8989:C:myString2:Q:1";
boolean matchFound = false;
ArrayList<String> values = new ArrayList<>();
String pattern2 = "(M:|:D:|:C:|:Q:.*?)([a-zA-Z_\\.0-9]+)";
Matcher m3 = Pattern.compile(pattern2).matcher(name);
while (m3.find()) {
matchFound = true;
String m = m3.group(2);
System.out.println("regex found match: " + m);
values.add(m);
}
}
In the above example, my results would be:
myString1
tcp://someurl.com:8989
myString2
1
And note that the Strings can be of variable length, alphanumeric, but allowing some characters (such as the url format with :// and/or . - characters
You mention that the format is constant:
M:<some text>:D:<either a url or string>:C:<some more text>:Q:<a number>
Capture groups can do this for you with the pattern:
"M:(.*):D:(.*):C:(.*):Q:(.*)"
Or you can do a String.split() with a pattern of "M:|:D:|:C:|:Q:". However, the split will return an empty element at the first index. Everything else will follow.
public static void main(String[] args) throws Exception {
System.out.println("Regex: ");
String data = "M:<some text>:D:tcp://someurl.something:port:C:<some more text>:Q:<a number>";
Matcher matcher = Pattern.compile("M:(.*):D:(.*):C:(.*):Q:(.*)").matcher(data);
if (matcher.matches()) {
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println(matcher.group(i));
}
}
System.out.println();
System.out.println("String.split(): ");
String[] pieces = data.split("M:|:D:|:C:|:Q:");
for (String piece : pieces) {
System.out.println(piece);
}
}
Results:
Regex:
<some text>
tcp://someurl.something:port
<some more text>
<a number>
String.split():
<some text>
tcp://someurl.something:port
<some more text>
<a number>
To extract the URL/text part you don't need the regular expression. Use
int startPos = input.indexOf(":D:")+":D:".length();
int endPos = input.indexOf(":C:", startPos);
String urlOrText = input.substring(startPos, endPos);
Assuming you need to do some validation along with the parsing:
break the regex into different parts like this:
String m_regex = "[\\w.]+"; //in jsva a . in [] is just a plain dot
String url_regex = "."; //theres a bunch online, pick your favorite.
String d_regex = "(?:" + url_regex + "|\\p{Alnum}+)"; // url or a sequence of alphanumeric characters
String c_regex = "[\\w.]+"; //but i'm assuming you want this to be a bit more strictive. not sure.
String q_regex = "\\d+"; //what sort of number exactly? assuming any string of digits here
String regex = "M:(?<M>" + m_regex + "):"
+ "D:(?<D>" + d_regex + "):"
+ "C:(?<D>" + c_regex + "):"
+ "Q:(?<D>" + q_regex + ")";
Pattern p = Pattern.compile(regex);
Might be a good idea to keep the pattern as a static field somewhere and compile it in a static block so that the temporary regex strings don't overcrowd some class with basically useless fields.
Then you can retrieve each part by its name:
Matcher m = p.matcher( input );
if (m.matches()) {
String m_part = m.group( "M" );
...
String q_part = m.group( "Q" );
}
You can go even a step further by making a RegexGroup interface/objects where each implementing object represents a part of the regex which has a name and the actual regex. Though you definitely lose the simplicity makes it harder to understand it with a quick glance. (I wouldn't do this, just pointing out its possible and has its own benefits)

replaceFirst for character "`"

First time here. I'm trying to write a program that takes a string input from the user and encode it using the replaceFirst method. All letters and symbols with the exception of "`" (Grave accent) encode and decode properly.
e.g. When I input
`12
I am supposed to get 28AABB as my encryption, but instead, it gives me BB8AA2
public class CryptoString {
public static void main(String[] args) throws IOException, ArrayIndexOutOfBoundsException {
String input = "";
input = JOptionPane.showInputDialog(null, "Enter the string to be encrypted");
JOptionPane.showMessageDialog(null, "The message " + input + " was encrypted to be "+ encrypt(input));
public static String encrypt (String s){
String encryptThis = s.toLowerCase();
String encryptThistemp = encryptThis;
int encryptThislength = encryptThis.length();
for (int i = 0; i < encryptThislength ; ++i){
String test = encryptThistemp.substring(i, i + 1);
//Took out all code with regard to all cases OTHER than "`" "1" and "2"
//All other cases would have followed the same format, except with a different string replacement argument.
if (test.equals("`")){
encryptThis = encryptThis.replaceFirst("`" , "28");
}
else if (test.equals("1")){
encryptThis = encryptThis.replaceFirst("1" , "AA");
}
else if (test.equals("2")){
encryptThis = encryptThis.replaceFirst("2" , "BB");
}
}
}
I've tried putting escape characters in front of the grave accent, however, it is still not encoding it properly.
Take a look at how your program works in each loop iteration:
i=0
encryptThis = '12 (I used ' instead of ` to easier write this post)
and now you replace ' with 28 so it will become 2812
i=1
we read character at position 1 and it is 1 so
we replace 1 with AA making 2812 -> 28AA2
i=2
we read character at position 2, it is 2 so
we replace first 2 with BB making 2812 -> BB8AA2
Try maybe using appendReplacement from Matcher class from java.util.regex package like
public static String encrypt(String s) {
Map<String, String> replacementMap = new HashMap<>();
replacementMap.put("`", "28");
replacementMap.put("1", "AA");
replacementMap.put("2", "BB");
Pattern p = Pattern.compile("[`12]"); //regex that will match ` or 1 or 2
Matcher m = p.matcher(s);
StringBuffer sb = new StringBuffer();
while (m.find()){//we found one of `, 1, 2
m.appendReplacement(sb, replacementMap.get(m.group()));
}
m.appendTail(sb);
return sb.toString();
}
encryptThistemp.substring(i, i + 1); The second parameter of substring is length, are you sure you want to be increasing i? because this would mean after the first iteration test would not be 1 character long. This could throw off your other cases which we cannot see!

Categories