Capitalize first letter in each word with symbols in sentence - java

I have a database that stores strings entered by users, for example movie titles.
To avoid duplicates and similar problems, I capitalize the first letter of every word regardless of what the user typed, so that all strings are saved in the same form.
This is how I do it:
String[] words = query.split( "\\s+" );
for (int i = 0; i < words.length; i++) {
    String Word = words[i].replaceAll( "[^\\w]", "" );
    words[i] = Word.substring( 0, 1 ).toUpperCase() + Word.substring( 1 );
}
query = TextUtils.join( " ", words );
However, I ran into a problem when I typed something like: Tom & Jerry.
In that case, I got an error at the &. Do I need to add if conditions that check for every symbol such as &, (, ), $ and so on?

toUpperCase handles non-letter characters just fine and simply returns the same character. The problem with your code is that it assumes each word is non-empty, which is no longer true once you remove the special characters: "&" becomes an empty string, and substring( 0, 1 ) on an empty string throws a StringIndexOutOfBoundsException.
In short, just keep the special characters and you should be OK:
for (int i = 0; i < words.length; i++) {
    String word = words[i];
    words[i] = word.substring( 0, 1 ).toUpperCase() + word.substring( 1 );
}
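The loop above still assumes that split never returns an empty token, which can happen when the input starts with whitespace or is blank. Below is a minimal sketch that guards against that case. The class and method names are hypothetical, and String.join stands in for Android's TextUtils.join so that the example is self-contained.

```java
class TitleCaser {
    // Hypothetical helper: upper-cases the first character of every
    // whitespace-separated token and keeps symbols such as & untouched.
    static String capitalizeWords(String query) {
        // trim() avoids a leading empty token; blank input still yields one
        String[] words = query.trim().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            String word = words[i];
            if (!word.isEmpty()) { // skip the empty token produced by blank input
                words[i] = word.substring(0, 1).toUpperCase() + word.substring(1);
            }
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        System.out.println(capitalizeWords("tom & jerry"));     // Tom & Jerry
        System.out.println(capitalizeWords("  the lion king")); // The Lion King
    }
}
```

Note that toUpperCase() without arguments uses the default locale, so passing an explicit Locale may be preferable if the stored form has to be identical on every device.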

Related

(hello-> h3o) How to replace in a String the middle letters for the number of letters replaced

I need to write a method that receives a String, e.g. "elephant-rides are really fun!", and returns a similar String; in this example the result should be "e6t-r3s are r4y fun!" (because e-lephan-t has 6 middle letters, r-ide-s has 3 middle letters, and so on).
To get that result I need to replace the middle letters of each word with the number of letters replaced, leaving the first and last letter of every word, and everything that isn't a letter, unchanged.
So far I have used a regex to split the received string into words and saved them in an array of strings. I also have an int array in which I store the number of middle letters, but I don't know how to join both arrays and the symbols into the correct String to return.
String string="elephant-rides are really fun!";
String[] parts = string.split("[^a-zA-Z]");
int[] sizes = new int[parts.length];
int index=0;
for(String aux: parts)
{
    sizes[index]= aux.length()-2;
    System.out.println( sizes[index]);
    index++;
}
You may use
String text = "elephant-rides are really fun!";
Pattern r = Pattern.compile("(?U)(\\w)(\\w{2,})(\\w)");
Matcher m = r.matcher(text);
StringBuffer sb = new StringBuffer();
while (m.find()) {
    m.appendReplacement(sb, m.group(1) + m.group(2).length() + m.group(3));
}
m.appendTail(sb); // append the rest of the contents
System.out.println(sb);
// => e6t-r3s are r4y fun!
Here, (?U)(\\w)(\\w{2,})(\\w) matches a Unicode word character and captures it into Group 1, then captures two or more word characters into Group 2, and finally captures a single word character into Group 3. Inside the .appendReplacement call, the contents of the second group are replaced with their length. Words of three characters or fewer do not match and are left unchanged.
Java 9+:
String text = "elephant-rides are really fun!";
Pattern r = Pattern.compile("(?U)(\\w)(\\w{2,})(\\w)");
Matcher m = r.matcher(text);
String result = m.replaceAll(x -> x.group(1) + x.group(2).length() + x.group(3));
System.out.println( result );
// => e6t-r3s are r4y fun!
For the instructions you gave us, this would be sufficient:
String [] result = string.split("[\\s-]");
for (int i=0; i<result.length; i++){
    result[i] = "" + result[i].charAt(0) + ((result[i].length())-2) + result[i].charAt(result[i].length()-1);
}
With your input, it creates the array [ "e6t", "r3s", "a1e", "r4y", "f2!" ]
It also runs on words of one or two characters, but then gives results such as:
Input: I am a small; Output: [ "I-1I", "a0m", "a-1a", "s3l" ]
Again, for the instructions you gave us this would be legal.
Hope I helped!

Using regex to split sentence into tokens stripping it of all the necessary punctuation excluding punctuation that is part of a word

I want to split a sentence into separate tokens, but I don't want to remove punctuation that is part of a token. For example, "didn't" should stay as "didn't". Punctuation at the end of a word that is not followed by a letter should be removed, so "you?" should become "you". The same applies at the beginning: "?you" should become "you".
String str = "..Hello ?don't #$you %know?";
String[] strArray = new String[10];
strArray = str.split("[^A-za-z]+[\\s]|[\\s]");
//strArray[strArray.length-1]
for(int i = 0; i < strArray.length; i++) {
    System.out.println(strArray[i] + i);
}
This should just print out:
hello0
don't1
you2
know3
Rather than splitting, you should prefer to use find to collect all the tokens you want with this regex:
[a-zA-Z]+(['][a-zA-Z]+)?
This regex allows a single ' sandwiched between letters. To allow other such characters, add them to the character set [']. As written, the optional group can occur only once; to allow it multiple times, change the ? at the end to a * so that it matches zero or more times.
Check out your modified Java code:
List<String> tokenList = new ArrayList<String>();
String str = "..Hello ?don't #$you %know?";
Pattern p = Pattern.compile("[a-zA-Z]+(['][a-zA-Z]+)?");
Matcher m = p.matcher(str);
while (m.find()) {
    tokenList.add(m.group());
}
String[] strArray = tokenList.toArray(new String[tokenList.size()]);
for (int i = 0; i < strArray.length; i++) {
    System.out.println(strArray[i] + i);
}
Prints,
Hello0
don't1
you2
know3
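The variation mentioned above (more joining characters, allowed more than once) could look like the following sketch. The hyphen as a second joiner, the non-capturing group and the class and method names are assumptions made for illustration, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordTokenizer {
    // Letters, optionally followed by any number of groups made of
    // one joiner (' or -) and at least one more letter.
    private static final Pattern TOKEN =
            Pattern.compile("[a-zA-Z]+(?:['-][a-zA-Z]+)*");

    // Hypothetical helper: collects every match of TOKEN in order.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("..Hello ?don't #$you %know?"));
        // => [Hello, don't, you, know]
        System.out.println(tokenize("mother-in-law's ?don't you-"));
        // => [mother-in-law's, don't, you]
    }
}
```

A joiner is only kept when letters follow it, so the trailing hyphen in "you-" is dropped.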
However, if you insist on using the split method, you can split on this regex:
[^a-zA-Z]*\\s+[^a-zA-Z]*|[^a-zA-Z']+
It splits the string either on one or more whitespace characters optionally surrounded by non-letter characters, or on a sequence of one or more characters that are neither letters nor single quotes. Here is sample Java code using split:
String str = ".. Hello ?don't #$you %know?";
String[] strArray = Arrays.stream(str.split("[^a-zA-Z]*\\s+[^a-zA-Z]*|[^a-zA-Z']+")).filter(x -> x.length()>0).toArray(String[]::new);
for (int i = 0; i < strArray.length; i++) {
    System.out.println(strArray[i] + i);
}
Prints,
Hello0
don't1
you2
know3
Notice that I have used the filter method on the stream to remove zero-length tokens, as split may generate a zero-length token at the start of the array.
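To see where the zero-length token comes from, the raw result of split can be printed without the filter. This is only a small demonstration; the class and method names are made up for the example.

```java
import java.util.Arrays;

class SplitDemo {
    // Same delimiter regex as in the split-based code above.
    static final String DELIMS = "[^a-zA-Z]*\\s+[^a-zA-Z]*|[^a-zA-Z']+";

    // Hypothetical helper: returns the unfiltered result of split.
    static String[] rawSplit(String s) {
        return s.split(DELIMS);
    }

    public static void main(String[] args) {
        // The input starts with a delimiter (".. "), so split reports an
        // empty leading token; trailing empty tokens are dropped by split.
        System.out.println(Arrays.toString(rawSplit(".. Hello ?don't #$you %know?")));
        // => [, Hello, don't, you, know]
    }
}
```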

Split string into repeated characters

I want to split the string "aaaabbbccccaaddddcfggghhhh" into "aaaa", "bbb", "cccc", "aa", "dddd", "c", "f" and so on.
I tried this:
String[] arr = "aaaabbbccccaaddddcfggghhhh".split("(.)(?!\\1)");
But this eats away one character, so with the above regular expression I get "aaa" while I want it to be "aaaa" as the first string.
How do I achieve this?
Try this:
String str = "aaaabbbccccaaddddcfggghhhh";
String[] out = str.split("(?<=(.))(?!\\1)");
System.out.println(Arrays.toString(out));
=> [aaaa, bbb, cccc, aa, dddd, c, f, ggg, hhhh]
Explanation: we want to split the string into groups of identical characters, so we need to find the "boundary" between each pair of groups. I'm using Java's syntax for a positive look-behind to capture the previous character, followed by a negative look-ahead with a back reference to verify that the next character is not the same as the previous one. No characters are actually consumed, because only two look-around assertions are used (that is, the regular expression is zero-width).
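As an alternative to splitting at the boundaries, the runs themselves can be matched. The sketch below swaps in a different technique, a Matcher loop over the pattern (.)\1*, and uses made-up class and method names. It assumes the input contains no line terminators, because . does not match them by default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RunSplitter {
    // One character followed by zero or more copies of itself.
    private static final Pattern RUN = Pattern.compile("(.)\\1*");

    // Hypothetical helper: returns each maximal run of identical characters.
    static List<String> runs(String s) {
        List<String> result = new ArrayList<>();
        Matcher m = RUN.matcher(s);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(runs("aaaabbbccccaaddddcfggghhhh"));
        // => [aaaa, bbb, cccc, aa, dddd, c, f, ggg, hhhh]
    }
}
```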
What about capturing in a lookbehind?
(?<=(.))(?!\1|$)
as a Java string:
(?<=(.))(?!\\1|$)
Here I take each character and check two conditions in the if statement: that the index does not run past the end of the array, and that the next character differs from the current one. If both hold, a line break is printed; otherwise the loop simply continues.
char[] arr = "aaaabbbccccaaddddcfggghhhh".toCharArray();
for (int i = 0; i < arr.length; i++) {
    char chr = arr[i];
    System.out.print(chr);
    // end the current group when the next character is different
    if (i + 1 < arr.length && arr[i + 1] != chr) {
        System.out.print(" \n");
    }
}

Java regular expression for finding two words that occur close together

I am trying to write a regular expression that will count the number of times two words co-occur within a certain proximity (within 5 words of each other) in a string, without double counting words.
For example, if I had a string:
"The man liked his big hat. The hat was very big."
In this case, the regex should see the "big hat" in the first sentence and the "hat was very big" in the second sentence, returning a total of 2. Note that in the second sentence there are several words between "hat" and "big", and they appear in a different order than in the first sentence, but they still occur within a 5-word window.
If regular expressions are not the correct way to approach this problem, please let me know what I should try instead.
A bit like Stephen C but using library classes to assist in the mechanics.
String input = "The man liked his big hat. The hat was very big";
int proximity = 5;
// split input into words
String[] words = input.split("[\\W]+");
// create a Deque of the first <proximity> words
Deque<String> haystack = new LinkedList<String>(Arrays.asList(Arrays.copyOfRange(words, 0, proximity)));
// count duplicates in the first <proximity> words
int count = haystack.size() - new HashSet<String>(haystack).size();
System.out.println("initial matches: " + count);
// process the rest of the words
for (int i = proximity; i < words.length; i++) {
    String word = words[i];
    System.out.println("matching '" + word + "' in [" + haystack + "]");
    if (haystack.contains(word)) {
        System.out.println("matched word " + word + " at index " + i);
        count++;
    }
    // remove the first word
    haystack.removeFirst();
    // add the current word
    haystack.addLast(word);
}
System.out.println("total matches:" + count);
If regular expressions are not the correct way to approach this problem, please let me know what I should try instead.
Regexes might work, but they are not the best way to do this.
A better way to do this is to break the input string into a sequence of words (e.g. using String.split(...)) and then loop through the sequence something like this:
String[] words = input.split("\\W+"); // split on non-word characters so that "hat." becomes "hat"
int count = 0;
for (int i = 0; i < words.length; i++) {
    if (words[i].equals("big")) {
        for (int j = i + 1; j < words.length && j - i < 5; j++) {
            if (words[j].equals("hat")) {
                count++;
            }
        }
    }
}
// And repeat for "hat" followed by "big".
You may need to vary that depending on exactly what you are trying to count, but that's the general idea.
If you need to do this for many, many combinations of words, then it would be worth looking for a more efficient solution. But as a once-off or low volume use-case, simplest is best.
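One way to cover both orders in a single pass, and to use each occurrence at most once, is sketched below. It assumes one particular reading of the question: "without double counting" means an occurrence that has been paired is not reused, and "within 5 words" means the two positions differ by at most maxDistance. The class name, the method name and the parameters are hypothetical, and the two words are assumed to be different.

```java
class ProximityCounter {
    // Hypothetical helper: counts non-overlapping pairs of a and b, in either
    // order, whose word positions differ by at most maxDistance.
    static int countPairs(String input, String a, String b, int maxDistance) {
        String[] words = input.split("\\W+"); // strips punctuation such as "hat."
        int count = 0;
        String pending = null;  // last unpaired occurrence (a or b), if any
        int pendingIndex = -1;  // its position in the word array
        for (int i = 0; i < words.length; i++) {
            String word = words[i];
            if (!word.equals(a) && !word.equals(b)) {
                continue;
            }
            if (pending != null && !pending.equals(word)
                    && i - pendingIndex <= maxDistance) {
                count++;        // pair found; both occurrences are used up
                pending = null;
            } else {
                pending = word; // remember this occurrence for a later partner
                pendingIndex = i;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "The man liked his big hat. The hat was very big.";
        System.out.println(countPairs(input, "big", "hat", 5)); // 2
    }
}
```

The comparison is case-sensitive; lower-casing the input first would be a natural extension if "Big" should also count.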
Gee... all that code in the other answers... how about this one line solution:
int count = input.split("big( \\b.*?){1,5}hat").length + input.split("hat( \\b.*?){1,5}big").length - 2;
This regex will match each occurrence of a word that is repeated within 5 words of itself:
([a-zA-Z]+)(?:[^ ]* ){0,5}\1[^a-zA-Z]
([a-zA-Z]+) matches a word; if your words can also contain digits, you can replace it with ([a-zA-Z0-9]+).
(?:[^ ]* ){0,5} matches between 0 and 5 words.
\1[^a-zA-Z] matches the repetition of your word.
You can then use this with a Pattern and find each occurrence of a repeated word.

How can I filter out non letters from a text file using the scanner delimiter including the single quote or apostrophe in Java

I want to keep a count of every word in a file, and the count should not include non-letters such as the apostrophe, comma, full stop, question mark, exclamation mark, etc., i.e. just letters of the alphabet.
I tried to use a delimiter like this, but it didn't include the apostrophe.
Scanner fileScanner = new Scanner("C:\\MyJavaFolder\\JavaAssignment1\\TestFile.txt");
int totalWordCount = 0;
//Firstly to count all the words in the file without the restricted characters
while (fileScanner.hasNext()) {
    fileScanner.useDelimiter(("[.,:;()?!\" \t\n\r]+")).next();
    totalWordCount++;
}
System.out.println("There are " + totalWordCount + " word(s)");
//Then later I create an array to store each individual word in the file for counting their lengths.
Scanner fileScanner2 = new Scanner("C:\\MyJavaFolder\\JavaAssignment1\\TestFile.txt");
String[] words = new String[totalWordCount];
for (int i = 0; i < totalWordCount; ++i) {
    words[i] = fileScanner2.useDelimiter(("[.,:;()?!\" \t\n\r]+")).next();
}
This doesn't seem to work!
How can I go about this?
It seems to me that you only want to split on spaces and line ends. For example, the word "they're" would be returned as two words if you used ' as a delimiter. Note also that new Scanner(String) scans the characters of the string itself, so the file path has to be wrapped in a File. Here's how you could change your original code to make it work.
Scanner fileScanner = new Scanner(new File("C:\\MyJavaFolder\\JavaAssignment1\\TestFile.txt"));
int totalWordCount = 0;
ArrayList<String> words = new ArrayList<String>();
//Firstly to count all the words in the file without the restricted characters
while (fileScanner.hasNext()) {
    //Add words to an array list so you only have to go through the scanner once
    words.add(fileScanner.next());//This defaults to whitespace
    totalWordCount++;
}
System.out.println("There are " + totalWordCount + " word(s)");
fileScanner.close();
Pattern.compile() turns your string into a regular expression. The '\s' character class is predefined in the Pattern class and matches all whitespace characters.
There is more information in the Pattern documentation.
Also, make sure to close your Scanner objects when you're done. An unclosed scanner could prevent your second scanner from opening the file.
Edit
If you want to count the letters per word you can add the following code to the above code
int totalLetters = 0;
int[] lettersPerWord = new int[words.size()];
for (int wordNum = 0; wordNum < words.size(); wordNum++)
{
    String word = words.get(wordNum);
    // strip punctuation and apostrophes so that only letters are counted
    word = word.replaceAll("[.,:;()?!\" \t\n\r\']+", "");
    lettersPerWord[wordNum] = word.length();
    totalLetters += word.length(); // accumulate instead of overwriting
}
I have tested this code and it appears to work for me. The replaceAll, according to the JavaDoc uses a regular expression to match so it should match any of those characters and essentially remove it.
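Note that Scanner.useDelimiter(String) does treat its argument as a regular expression, so with your example it splits on any run of the characters listed in "[.,:;()?!\" \t\n\r]+". The apostrophe is not in that set, so it stays inside the tokens.
As an alternative to the delimiter, you can use the regex classes directly.
Pattern and Matcher with the group method may be what you're looking for.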
// test is assumed to hold the text to search
String pattern = "(.*)[.,:;()?!\" \t\n\r]+(.*)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(test);
if (m.find()) {
    System.out.println("Found value: " + m.group(1));
}
Play with those classes and you will see it is much more similar to what you need
You could try this regex in your delimiter:
fileScanner.useDelimiter("[^a-zA-Z']+").next();
This uses any run of characters that are neither letters nor apostrophes as a delimiter. That way your words will include the apostrophe but no other non-letter character.
Then you'll have to loop through each word, check for apostrophes, and account for them if you want the length to be accurate. You could simply remove each apostrophe so that the length matches the number of letters in the word, or you could create word objects with their own length fields, so that you can print the word as is and still know the number of letters in it.
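The following is a minimal sketch of that approach, assuming a delimiter of [^a-zA-Z']+ (any run of characters that are neither letters nor apostrophes). It scans a String instead of a file to stay self-contained, and the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class WordLetterCount {
    // Hypothetical helper: returns the words of the text, keeping apostrophes.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) { // closed automatically
            scanner.useDelimiter("[^a-zA-Z']+");
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    // Number of letters in a word, ignoring any apostrophes it contains.
    static int letterCount(String word) {
        return word.replace("'", "").length();
    }

    public static void main(String[] args) {
        for (String word : words("They're here, aren't they?")) {
            System.out.println(word + " " + letterCount(word));
        }
        // prints:
        // They're 6
        // here 4
        // aren't 5
        // they 4
    }
}
```

For a real file, new Scanner(new File(path)) would replace new Scanner(text). A stand-alone apostrophe in the input would still come back as a token with zero letters, so it may need to be filtered out.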
