How can I remove punctuation from input text in Java?

I am trying to read a sentence from the user in Java, and I need to make it lowercase and remove all punctuation. Here is my code:
String[] words = instring.split("\\s+");
for (int i = 0; i < words.length; i++) {
words[i] = words[i].toLowerCase();
}
String[] wordsout = new String[50];
Arrays.fill(wordsout,"");
int e = 0;
for (int i = 0; i < words.length; i++) {
if (words[i] != "") {
wordsout[e] = words[e];
wordsout[e] = wordsout[e].replaceAll(" ", "");
e++;
}
}
return wordsout;
I can't seem to find any way to remove all non-letter characters. I have tried using regexes and iterators with no luck. Thanks for any help.

This first removes all non-letter characters, folds to lowercase, then splits the input, doing all the work in a single line:
String[] words = instring.replaceAll("[^a-zA-Z ]", "").toLowerCase().split("\\s+");
Spaces are initially left in the input so the split will still work.
By removing the rubbish characters before splitting, you avoid having to loop through the elements.
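A minimal sketch of that one-liner wrapped in a method, for illustration only. The names `WordCleaner` and `toWords` are made up here, and the `trim()` call plus the empty check are additions (not part of the line above) that avoid an empty first token when the input starts with whitespace or punctuation:

```java
import java.util.Arrays;

class WordCleaner {
    // Hypothetical helper: strips everything except ASCII letters and spaces,
    // lowercases the result, then splits on runs of whitespace.
    static String[] toWords(String input) {
        String cleaned = input.replaceAll("[^a-zA-Z ]", "").toLowerCase().trim();
        if (cleaned.isEmpty()) {
            return new String[0]; // "".split(...) would return [""], not []
        }
        return cleaned.split("\\s+");
    }

    public static void main(String[] args) {
        // prints [hello, world, its, me]
        System.out.println(Arrays.toString(toWords("  Hello, World! It's me.  ")));
    }
}
```

Note that the pattern treats only the space character as a separator, so tabs and newlines are deleted rather than kept as word breaks.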

You can use the following regular expression construct:
\p{Punct} Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
inputString = inputString.replaceAll("\\p{Punct}", "");

You may try this:
Scanner scan = new Scanner(System.in);
System.out.println("Type a sentence and press enter.");
String input = scan.nextLine();
String strippedInput = input.replaceAll("\\W", "");
System.out.println("Your string: " + strippedInput);
\W (equivalent to [^\w]) matches any non-word character, so the above regular expression removes every character that is not a letter, digit or underscore. Note that this also removes the spaces between words.
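The three patterns suggested so far do not remove the same characters. The following sketch runs them side by side on one sample string; the class name, the `strip` helper and the sample input are all made up for illustration:

```java
class PatternComparison {
    // Hypothetical helper: removes every match of the given regex.
    static String strip(String input, String regex) {
        return input.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String input = "Hello, World_2! It's #1.";
        // Keeps ASCII letters and spaces only: prints [Hello World Its ]
        System.out.println("[" + strip(input, "[^a-zA-Z ]") + "]");
        // Removes ASCII punctuation, keeps digits and spaces: prints [Hello World2 Its 1]
        System.out.println("[" + strip(input, "\\p{Punct}") + "]");
        // Keeps letters, digits and underscores, drops spaces: prints [HelloWorld_2Its1]
        System.out.println("[" + strip(input, "\\W") + "]");
    }
}
```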

If you don't want to use RegEx (which seems highly unnecessary given your problem), perhaps you should try something like this:
public String modified(final String input){
final StringBuilder builder = new StringBuilder();
for(final char c : input.toCharArray())
if(Character.isLetterOrDigit(c))
builder.append(Character.isLowerCase(c) ? c : Character.toLowerCase(c));
return builder.toString();
}
It loops through the underlying char[] in the String and only appends the char if it is a letter or digit (filtering out all symbols, which I am assuming is what you are trying to accomplish) and then appends the lower case version of the char.

I don't like to use regex, so here is another simple solution.
public String removePunctuations(String s) {
String res = "";
for (Character c : s.toCharArray()) {
if(Character.isLetterOrDigit(c))
res += c;
}
return res;
}
Note: This will include both Letters and Digits

If your goal is to REMOVE punctuation, then refer to the above. If the goal is to find words, none of the above solutions does that.
INPUT: "This. and:that. with'the-other".
OUTPUT: ["This", "and", "that", "with", "the", "other"]
but what most of these "replaceAll" solutions actually give you is:
OUTPUT: ["This", "andthat", "withtheother"]
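One way to get the first output, sketched under the assumption that a "word" is simply a run of ASCII letters: split on runs of non-letters instead of deleting them. The names `WordSplitter` and `words` are hypothetical:

```java
import java.util.Arrays;

class WordSplitter {
    // Hypothetical helper: treats every run of non-letters as a separator.
    static String[] words(String input) {
        return Arrays.stream(input.split("[^a-zA-Z]+"))
                .filter(w -> !w.isEmpty()) // split can yield a leading empty string
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // prints [This, and, that, with, the, other]
        System.out.println(Arrays.toString(words("This. and:that. with'the-other")));
    }
}
```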

Related

Using regex to split sentence into tokens stripping it of all the necessary punctuation excluding punctuation that is part of a word

So I wish to split a sentence into separate tokens. However, I don't want to get rid of certain punctuation that I want to be part of a token. For example, "didn't" should stay as "didn't". At the end of a word, punctuation that is not followed by a letter should be removed, so "you?" becomes "you". The same applies at the beginning: "?you" should become "you".
String str = "..Hello ?don't #$you %know?";
String[] strArray = new String[10];
strArray = str.split("[^A-za-z]+[\\s]|[\\s]");
//strArray[strArray.length-1]
for(int i = 0; i < strArray.length; i++) {
System.out.println(strArray[i] + i);
}
This should just print out:
hello0
don't1
you2
know3

Rather than splitting, prefer using find to collect all the tokens you want with this regex:
[a-zA-Z]+(['][a-zA-Z]+)?
This regex allows a single ' sandwiched between letters. To allow other such characters, add them to the character set [']. As written, the optional group matches at most once; to allow it to repeat, change the ? at the end to a *, which makes it zero or more times.
Check out your modified Java code:
List<String> tokenList = new ArrayList<String>();
String str = "..Hello ?don't #$you %know?";
Pattern p = Pattern.compile("[a-zA-Z]+(['][a-zA-Z]+)?");
Matcher m = p.matcher(str);
while (m.find()) {
tokenList.add(m.group());
}
String[] strArray = tokenList.toArray(new String[tokenList.size()]);
for (int i = 0; i < strArray.length; i++) {
System.out.println(strArray[i] + i);
}
Prints,
Hello0
don't1
you2
know3
However, if you insist on using the split method, you can use this regex to split the values:
[^a-zA-Z]*\\s+[^a-zA-Z]*|[^a-zA-Z']+
This splits the string either on one or more whitespace characters optionally surrounded by non-alphabetic characters, or on a sequence of one or more characters that are neither alphabetic nor a single quote. Here is the sample Java code using split:
String str = ".. Hello ?don't #$you %know?";
String[] strArray = Arrays.stream(str.split("[^a-zA-Z]*\\s+[^a-zA-Z]*|[^a-zA-Z']+")).filter(x -> x.length()>0).toArray(String[]::new);
for (int i = 0; i < strArray.length; i++) {
System.out.println(strArray[i] + i);
}
Prints,
Hello0
don't1
you2
know3
Notice that I have used the filter method on streams to drop zero-length tokens, since split may produce a zero-length token at the start of the array.
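The remark about widening the character set and switching ? to * can be sketched as follows. This is an illustrative variant rather than the pattern used above: it additionally accepts hyphens inside a token, and the names `Tokenizer` and `tokens` are made up:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Tokenizer {
    // Letters, followed by any number of groups made of one ' or - plus more letters.
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z]+(['-][a-zA-Z]+)*");

    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [well-known, don't, o'clock's, end]
        System.out.println(tokens("..well-known ?don't #$o'clock's %end?"));
    }
}
```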

Java: Replace a specific character with a substring in a string at index

I am struggling with how to actually do this. Say I have this string
"This Str1ng i5 fun"
I want to replace the '1' with "One" and the '5' with "Five":
"This StrOneng iFive fun"
I have tried to loop through the string and manually replace them, but the count is off. I have also tried to use lists, arrays, StringBuilder, etc., but I cannot get it to work:
char[] stringAsCharArray = inputString.toCharArray();
ArrayList<Character> charArraylist = new ArrayList<Character>();
for(char character: stringAsCharArray) {
charArraylist.add(character);
}
int counter = startPosition;
while(counter < endPosition) {
char temp = charArraylist.get(counter);
String tempString = Character.toString(temp);
if(Character.isDigit(temp)){
char[] tempChars = digits.getDigitString(Integer.parseInt(tempString)).toCharArray(); //convert to number
charArraylist.remove(counter);
int addCounter = counter;
for(char character: tempChars) {
charArraylist.add(addCounter, character);
addCounter++;
}
counter += tempChars.length;
endPosition += tempChars.length;
}
counter++;
}
I feel like there has to be a simple way to replace a single character in a string with a substring, without having to do all this iterating. Am I wrong here?

String[][] arr = {{"1", "one"},
{"5", "five"}};
String str = "String5";
for(String[] a: arr) {
str = str.replace(a[0], a[1]);
}
System.out.println(str);
This lets you replace multiple substrings with different text.
Alternatively, you could use chained replace calls for this, e.g.:
str = str.replace("1", "One").replace("5", "Five");
For a better approach, see: Java Replacing multiple different substring in a string at once (or in the most efficient way)

You can do
string = string.replace("1", "one");
Don't use replaceAll, because that replaces based on regular expression matches (so you would have to be careful about special characters in the pattern, although that is not a problem here).
Despite the name, replace also replaces all occurrences.
Since Strings are immutable, be sure to assign the result value somewhere.
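A small sketch of the difference, using a pattern that does contain a regex special character. The class name and the sample string are made up for illustration:

```java
import java.util.regex.Pattern;

class ReplaceDemo {
    public static void main(String[] args) {
        String s = "a.b.c";
        // replace treats "." literally and replaces every occurrence
        System.out.println(s.replace(".", "-"));                    // prints a-b-c
        // replaceAll treats "." as a regex that matches any character
        System.out.println(s.replaceAll(".", "-"));                 // prints -----
        // Pattern.quote makes replaceAll treat the text literally again
        System.out.println(s.replaceAll(Pattern.quote("."), "-"));  // prints a-b-c
        // The original string is unchanged because Strings are immutable
        System.out.println(s);                                      // prints a.b.c
    }
}
```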

Try the below:
string = string.replace("1", "one");
string = string.replace("5", "five");
.replace replaces all occurrences of the given string with the specified string, and is quite useful.
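If every digit needs a replacement, one alternative to ten chained replace calls is a single pass with a StringBuilder. This is only a sketch: the `NAMES` table stands in for the question's `digits.getDigitString` helper (whose implementation is not shown), and the class and method names are made up:

```java
class DigitSpeller {
    // Assumed spellings, indexed by digit value.
    private static final String[] NAMES = {
        "Zero", "One", "Two", "Three", "Four",
        "Five", "Six", "Seven", "Eight", "Nine"
    };

    static String spellDigits(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= '0' && c <= '9') {
                out.append(NAMES[c - '0']); // swap the digit for its name
            } else {
                out.append(c);              // copy everything else unchanged
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints This StrOneng iFive fun
        System.out.println(spellDigits("This Str1ng i5 fun"));
    }
}
```

Because the output is built separately from the input, there are no indices to adjust as the text grows, which is where the counter in the question went wrong.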

Split String Using comma but ignore comma if it is brackets or quotes

I've seen many examples, but I am not getting the expected result.
Given a String:
"manikanta, Santhosh, ramakrishna(mani, santhosh), tester"
I would like to get the String array as follows:
manikanta,
Santhosh,
ramakrishna(mani, santhosh),
tester
I tried the following regex (got from another example):
"(\".*?\"|[^\",\\s]+)(?=\\s*,|\\s*$)"

This does the trick:
String[] parts = input.split(", (?![^(]*\\))");
which employs a negative lookahead to assert that the next bracket char is not a close bracket, and produces:
manikanta
Santhosh
ramakrishna(mani, santhosh)
tester
The desired output as per your question keeps the trailing commas, which I assume is an oversight, but if you really do want to keep the commas:
String[] parts = input.split("(?<=,) (?![^(]*\\))");
which produces the same, but with the trailing commas intact:
manikanta,
Santhosh,
ramakrishna(mani, santhosh),
tester

Assuming we can split on whitespace (as your example suggests), you can try this regex: \s+(?=([^\)]*\()|([^\)\(]*$))
String str = "manikanta, Santhosh, ramakrishna(mani, santhosh), ramakrishna(mani, santhosh), tester";
String[] ar = str.split("\\s+(?=([^\\)]*\\()|([^\\)\\(]*$))");
Where:
\s+ matches one or more whitespace characters
(?=...) is a positive lookahead: the text after the current position must match either ([^\\)]*\\() or ([^\\)\\(]*$)
([^\\)]*\\() matches when an opening ( comes before any closing ), so whitespace inside ( and ) is skipped
([^\\)\\(]*$) matches when no ( or ) follows up to the end of the string; it is used here to split off the part containing the word tester

As I stated in my comment to the question this problem may be impossible to solve by regular expressions.
The following code (java) gives a hint what to do:
private static List<String> parse(String string) {
    List<String> parts = new ArrayList<String>();
    boolean split = true; // false while we are inside ( ... )
    int lastEnd = 0; // index where the current part starts
    for (int i = 0; i < string.length(); i++) {
        char c = string.charAt(i);
        switch (c) {
            case '(':
                split = false;
                break;
            case ')':
                split = true;
                break;
        }
        if (split && c == ',') {
            parts.add(string.substring(lastEnd, i).trim());
            lastEnd = i + 1;
        }
    }
    parts.add(string.substring(lastEnd).trim()); // text after the last comma
    return parts;
}
Note that the code still lacks some checks (a null string, nested or unbalanced brackets, quotes, ...).
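The same scanning idea can be extended to nested brackets and double-quoted sections. The following is a sketch under stated assumptions, not a full parser: quotes are plain " characters with no escaping, unbalanced input is not reported, and the names `CommaSplitter` and `splitTopLevel` are made up:

```java
import java.util.ArrayList;
import java.util.List;

class CommaSplitter {
    // Splits on commas that are outside parentheses and outside double quotes.
    static List<String> splitTopLevel(String input) {
        List<String> parts = new ArrayList<>();
        int depth = 0;            // current parenthesis nesting level
        boolean inQuotes = false; // true between a pair of double quotes
        int start = 0;            // index where the current part begins
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (!inQuotes && c == '(') {
                depth++;
            } else if (!inQuotes && c == ')' && depth > 0) {
                depth--;
            } else if (!inQuotes && depth == 0 && c == ',') {
                parts.add(input.substring(start, i).trim());
                start = i + 1;
            }
        }
        parts.add(input.substring(start).trim()); // text after the last comma
        return parts;
    }

    public static void main(String[] args) {
        // prints [manikanta, Santhosh, ramakrishna(mani, santhosh), tester]
        System.out.println(splitTopLevel("manikanta, Santhosh, ramakrishna(mani, santhosh), tester"));
    }
}
```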

Java Finding all words begining with a letter

I am trying to get all words that begin with a letter from a long string. How would you do this in Java? I don't want to loop through every letter or do something similarly inefficient.
EDIT: I also can't use any built-in data structures (except arrays, of course); it's for a CS class. I can, however, make my own data structures (I have already created several).

You could try obtaining an array collection from your String and then iterating through it:
String s = "my very long string to test";
for(String st : s.split(" ")){
if(st.startsWith("t")){
System.out.println(st);
}
}

You need to be clear about some things. What is a "word"? You want to find only "words" starting with a letter, so I assume that words can have other characters too. But what chars are allowed? What defines the start of such a word? Whitespace, any non letter, any non letter/non digit, ...?
e.g.:
String TestInput = "test séntènce îwhere I'm want,to üfind 1words starting $with le11ers.";
String regex = "(?<=^|\\s)\\pL\\w*";
Pattern p = Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS);
Matcher matcher = p.matcher(TestInput);
while (matcher.find()) {
System.out.println(matcher.group());
}
The regex (?<=^|\s)\pL\w* will find sequences that start with a letter (\pL is the Unicode property for letters), followed by 0 or more "word" characters (Unicode letters and numbers, because of the modifier Pattern.UNICODE_CHARACTER_CLASS).
The lookbehind assertion (?<=^|\s) ensures that there is the start of the string or a whitespace before the sequence.
So my code will print:
test
séntènce ==> contains non ASCII letters
îwhere ==> starts with a non ASCII letter
I ==> 'm is missing, because `'` is not in `\w`
want
üfind ==> starts with a non ASCII letter
starting
le11ers ==> contains digits
Missing words:
,to ==> starting with a ","
1words ==> starting with a digit
$with ==> starting with a "$"

You could build a HashMap -
HashMap<String,String> map = new HashMap<String,String>();
example -
ant, bat, art, cat
Hashmap
a -> ant,art
b -> bat
c -> cat
to find all words that begin with "a", just do
map.get("a")
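A sketch of how such an index could be built, assuming built-in collections are allowed (the question's edit rules them out for the asker's class). It differs from the declaration above in that it keys on a Character and stores a list of words per letter; the names `FirstLetterIndex` and `index` are made up:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FirstLetterIndex {
    // Groups whitespace-separated words by their lowercased first letter.
    // Words that do not start with a letter are skipped.
    static Map<Character, List<String>> index(String text) {
        Map<Character, List<String>> byLetter = new HashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty() || !Character.isLetter(word.charAt(0))) {
                continue;
            }
            char key = Character.toLowerCase(word.charAt(0));
            byLetter.computeIfAbsent(key, k -> new ArrayList<>()).add(word);
        }
        return byLetter;
    }

    public static void main(String[] args) {
        // prints [ant, art]
        System.out.println(index("ant bat art cat").get('a'));
    }
}
```

Building the index costs one pass over the text; it pays off only when several lookups are made against the same string.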

You can take the first character of each word and use an API method to check whether it is a letter.
String input = "jkk ds 32";
String[] array = input.split(" ");
for (String word : array) {
char[] arr = word.toCharArray();
char c = arr[0];
if (Character.isLetter(c)) {
System.out.println( word + "\t isLetter");
} else {
System.out.println(word + "\t not Letter");
}
}
Sample output:
jkk isLetter
ds isLetter
32 not Letter

Scanner scan = new Scanner(text); // text being the string you are looking in
char test = 'x'; //whatever letter you are looking for
while(scan.hasNext()){
String wordFound = scan.next();
if(wordFound.charAt(0)==test){
//do something with the wordFound
}
}
This will do what you are looking for; inside the if statement, do whatever you want with the word.

Regexp way:
public static void main(String[] args) {
String text = "my very long string to test";
Matcher m = Pattern.compile("(^|\\W)(\\w*)").matcher(text);
while (m.find()) {
System.out.println("Found: "+m.group(2));
}
}

You can use split() method. Here is an example :
String string = "your string";
String[] parts = string.split(" C");
for(int i=0; i<parts.length; i++) {
String[] word = parts[i].split(" ");
if( i > 0 ) {
// ignore the rest words because don't starting with C
System.out.println("C" + word[0]);
}
else { // Check 1st excplicitly
for(int j=0; j<word.length; j++) {
if ( word[j].startsWith("c") || word[j].startsWith("C"))
System.out.println(word[j]);
}
}
}
where "C" is your letter. Each part after the first begins with the remainder of a word that started with "C", so "C" is prepended to its first word. parts[0] has to be checked explicitly, so the loop must start from i = 0 rather than i = 1.

Remove a specific word from a string

I'm trying to remove a specific word from a certain string using the function replace() or replaceAll() but these remove all the occurrences of this word even if it's part of another word!
Example:
String content = "is not like is, but mistakes are common";
content = content.replace("is", "");
output: "not like , but mtakes are common"
desired output: "not like , but mistakes are common"
How can I substitute only whole words from a string?

What the heck,
String regex = "\\s*\\bis\\b\\s*";
content = content.replaceAll(regex, "");
Remember you need to use replaceAll(...) to use regular expressions, not replace(...)
\\b gives you the word boundaries
\\s* sops up any white space on either side of the word being removed (if you want to remove this too).
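One caveat with consuming whitespace on both sides: when the word sits between two other words, they end up joined ("this is fun" becomes "thisfun"). A sketch of a variant that removes only the trailing whitespace and quotes the word so it is matched literally. The names `WordRemover` and `removeWord` are made up, and the word is assumed to start and end with a letter, digit or underscore so that \b applies:

```java
import java.util.regex.Pattern;

class WordRemover {
    // Removes whole-word occurrences of `word` plus any whitespace after them.
    static String removeWord(String text, String word) {
        String regex = "\\b" + Pattern.quote(word) + "\\b\\s*";
        return text.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        // prints not like , but mistakes are common
        System.out.println(removeWord("is not like is, but mistakes are common", "is"));
    }
}
```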

content = content.replaceAll("\\Wis\\W|^is\\W|\\Wis$", "");

You can try replacing " is " with " ", that is, the word "is" with a space before and after it, replaced by a single space.
Update:
To make it work for the first "is" in the sentence, also replace "is " with "". This replaces the first "is" and the space after it with an empty string.

public static void main(String[] args) {
Scanner s = new Scanner(System.in);
String input = s.nextLine();
char c = s.next().charAt(0);
System.out.println(removeAllOccurrencesOfChar(input, c));
}
public static String removeAllOccurrencesOfChar(String input, char c) {
String r = "";
for (int i = 0; i < input.length(); i ++) {
if (input.charAt(i) != c) r += input.charAt(i);
}
return r;
}
