Storing words from a .txt file into a String array

I was going through the answers to this question, asked by someone previously, and found them very helpful. However, I have a question about the highlighted answer, and I wasn't sure whether I should ask there since it's a 6-year-old thread.
My question is about this snippet of code given in the answers:
private static boolean isAWord(String token)
{
//check if the token is a word
}
How would you check that the token is a word? Would you call .contains("\\s+") on the string and check whether there are characters between the whitespace? And what happens when you encounter a paragraph? I'm not sure how to go about this.
EDIT: I should have elaborated a bit more. Usually you'd think of a word as something surrounded by spaces, but if the file contains, for example, a hyphen that is also surrounded by spaces, you'd want isAWord() to return false. How can I verify that something is actually a word and not punctuation?

Since the question wasn't entirely clear, I made two methods. The first method, consistsOfLetters, goes through the whole string and returns false if it contains any digits or symbols. This should be enough to determine whether a token is a word, as long as you don't need to check that the word exists in a dictionary.
public static boolean consistsOfLetters(String string) {
for(int i=0; i<string.length(); i++) {
if(string.charAt(i) == '.' && (i+1) == string.length() && string.length() != 1) break; // if last char of string is ., it is still word
if((string.toLowerCase().charAt(i) < 'a' || string.toLowerCase().charAt(i) > 'z')) return false;
} // toLowerCase is used to avoid having to compare it to A and Z
return true;
}
The second method splits the original String (for example, a sentence of potential words) on the " " character. It then goes through every element and checks whether it is a word. If an element is not a word, the method returns false and skips the rest; if everything is fine, it returns true.
public static boolean isThisAWord(String string) {
String[] array = string.split(" ");
for(int i = 0; i < array.length; i++) {
if(consistsOfLetters(array[i]) == false) return false;
}
return true;
}
Also, this might not work for English as is, since English words such as "don't" contain apostrophes, so a bit of further tinkering is needed.
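One possible way to do that tinkering is sketched below. It is only an illustration: the class name WordCheck is hypothetical, and the rule it applies (letters only, plus single apostrophes that sit between two letters) is an assumption about what should count as a word. Unlike consistsOfLetters, it does not accept a trailing period.

```java
final class WordCheck {
    // Hypothetical helper: accepts letters, plus apostrophes that have
    // a letter on both sides (e.g. "don't"). Everything else is rejected.
    static boolean isAWord(String token) {
        if (token == null || token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isLetter(c)) {
                continue;
            }
            // An apostrophe is only allowed strictly inside the token,
            // directly between two letters.
            boolean innerApostrophe = c == '\''
                    && i > 0 && i < token.length() - 1
                    && Character.isLetter(token.charAt(i - 1))
                    && Character.isLetter(token.charAt(i + 1));
            if (!innerApostrophe) {
                return false;
            }
        }
        return true;
    }
}
```

With this rule, WordCheck.isAWord("don't") is true, while a lone hyphen, an empty string, or a token such as "'tis" is rejected.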

The Scanner in Java splits input using its WHITESPACE_PATTERN by default, so splitting a string like "He's my friend" would result in an array like ["He's", "my", "friend"].
If that is sufficient, just remove that if clause and don't use that method.
If you want to turn "He's" into "He", "is", you need a different approach.
In short: the method works as a verification check. If the given token is not supposed to be in the result, return false; otherwise return true.
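The default whitespace splitting can be seen in a small sketch like the following. The class and method names are made up for illustration, and the text is passed in as a String rather than read from a file to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class ScannerTokens {
    // Collects every whitespace-delimited token that Scanner produces
    // with its default delimiter.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }
}
```

Note that a standalone hyphen still comes back as its own token, which is why a verification method such as isAWord is needed if punctuation should be filtered out.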

return token.matches("[\\pL\\pM]+('(s|t))?");
matches requires the entire string to match.
This accepts letters (\pL) and zero-width combining diacritical marks (\pM, i.e. accents).
It also optionally accepts an English apostrophe suffix ('s or 't), should you want to treat doesn't and let's as one term (for instance, for translation purposes).
You might also consider hyphens.
Note that Unicode has several single-quote and dash characters.
Path path = Paths.get("..../x.txt");
Charset charset = Charset.defaultCharset();
String content = Files.readString(path, charset);
Pattern wordPattern = Pattern.compile("[\\pL\\pM]+");
Matcher m = wordPattern.matcher(content);
while (m.find()) {
String word = m.group(); ...
}
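To end up with the String array that the question title asks for, the matches can be collected into a list and converted at the end. The following is a minimal sketch under that assumption: WordArray and words are hypothetical names, and the method takes the already-loaded content (for example, the result of Files.readString) so that it can be run without a file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordArray {
    // Same pattern as above: runs of letters and combining marks.
    private static final Pattern WORD = Pattern.compile("[\\pL\\pM]+");

    // Returns every match of WORD in content, in order of appearance.
    static String[] words(String content) {
        List<String> found = new ArrayList<>();
        Matcher m = WORD.matcher(content);
        while (m.find()) {
            found.add(m.group());
        }
        return found.toArray(new String[0]);
    }
}
```

Because the pattern only covers letters and marks, standalone punctuation and digits are dropped, and a contraction such as "don't" comes back as two entries, "don" and "t".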

Related

How to capitalize first letter and lowercase the rest while keeping the word capital if it is in fully uppercase - java

getSentenceCaseText()
return a string representation of current text in sentence case. Sentence case is the
conventional way of using capital letters in a sentence or capitalizing only the first
word and any proper nouns. In addition, all capital word should remain as it is.
For this assignment, noun is limited to words that have ONE capital letter at the beginning.
**As an example the string "First SenTence. secOND sentence. tHIRD SENTENCE"
its output will be (First sentence. Second sentence. Third SENTENCE)**
This is my code for the above assignment. I could capitalize the first letter after every dot and set the rest to lowercase, but I couldn't find out how to keep a fully uppercase word as it is.
This is my code below:
public String getSentenceCaseText(String text) {
int pos = 0;
boolean capitalize = true;
StringBuilder sb = new StringBuilder(text);
while (pos < sb.length()){
sb.setCharAt(pos, Character.toLowerCase(sb.charAt(pos)));
if (sb.charAt(pos) == '.') {
capitalize = true;
} else if (capitalize && !Character.isWhitespace(sb.charAt(pos))) {
sb.setCharAt(pos, Character.toUpperCase(sb.charAt(pos)));
capitalize = false;
}
pos++;
}
return sb.toString();
}

Most of the logic you have posted works fine. The problem is words like "SENTENCE", because the logic you use to decide on capitalization does not account for them.
The biggest problem is that you iterate over the characters and, at the same time, try to decide whether the current word is fully capitalized.
The easiest way is to separate concerns: check beforehand whether the word is fully capitalized and act accordingly.
First create a method that just checks whether a word is fully capitalized. For example:
public static boolean isUpper(String s, int start) {
for(int i = start; i < s.length(); i++) {
char c = s.charAt(i);
if(c == '.' || c == ' ')
return true;
if (!Character.isUpperCase(c))
return false;
}
return true;
}
This method receives the string to be checked and an int value (start) that tells the method at which position in the string the check should start.
For the getSentenceCaseText method, use the following strategy. First check whether the current word is fully capitalized:
boolean capitalize = isUpper(text, pos);
If it is, the method should skip this word and move on to the next one. Otherwise, it capitalizes the first character and lowercases the rest. Apply the same logic to all words in the text.
The code could look like the following:
public static String getSentenceCaseText(String text) {
int pos = 0;
StringBuilder sb = new StringBuilder(text);
// Iterate over the string
while (pos < sb.length()){
// First check if the word is capitalized
boolean capitalize = isUpper(text, pos);
char c = sb.charAt(pos);
// Make the first letter capitalized for the cases that it was not
sb.setCharAt(pos, Character.toUpperCase(c));
pos++;
// Let us iterate over the word
// If it is not capitalized let us lower its characters
// otherwise just ignore and skip the characters
for (;pos < sb.length() && text.charAt(pos) != '.' && text.charAt(pos) != ' '; pos++)
if(!capitalize)
sb.setCharAt(pos, Character.toLowerCase(sb.charAt(pos)));
// Finally we need to skip all the spaces
for(; pos < sb.length() && text.charAt(pos) == ' '; pos++ );
}
return sb.toString();
}
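Use this strategy and this code as a guide, build upon it, and implement your own method. Note that, as written, the code upper-cases the first letter of every word, so you still need to restrict that step to the first word of each sentence.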

As the input string may contain multiple sentences, it should be split into sentences, and each sentence should then be split into words: a word in all caps stays the same, the first word of a sentence is capitalized, and the remaining words are turned to lowercase.
This can be done with a regular expression that splits the string while keeping the delimiters.
static String capitalizeSentence(String input) {
if (null == input || 0 == input.length()) {
return input;
}
return Arrays
.stream(input.split("((?<=[.!\\?]\\s?)|(?=[.!\\?]\\s?))"))
.flatMap(sent -> {
String[] words = sent.split("((?<=[^\\w])|(?=[^\\w]))");
return
Stream.concat(
Stream.of(words[0].matches("[A-Z]+") // process first word
? words[0]
: (Character.toUpperCase(words[0].charAt(0)) +
(words[0].length() > 1 ? words[0].substring(1).toLowerCase() : ""))
),
// process the rest of words
Arrays.stream(words)
.skip(1)
.map(word -> word.matches("[A-Z]+") ? word : word.toLowerCase())
);
})
.collect(Collectors.joining());
}
Test:
System.out.println(capitalizeSentence("o! HI!first SenTence. secOND sentence. tHIRD: SENTENCE. am I OK?! yes, I am fine!!"));
Output:
O! HI!First sentence. Second sentence. Third: SENTENCE. Am I OK?! Yes, I am fine!!

It is good practice to split a problem into smaller ones. So I recommend handling sentences by splitting the text at the "." and iterating over the sentences.
Then you handle the words of every sentence by splitting the sentence at the " ".
Then you check for every word whether it is completely in capital letters. If so, leave it unchanged.
If not, you check whether it is a noun, which here means it has a capital first letter and no other capital letters. If so, leave it unchanged; otherwise convert it to lowercase completely.
You then (as the last step) capitalize the first letter of the first word of every sentence.
This way you do not need any global flags whatsoever, and you can easily test your algorithm, both for words and for special cases like the beginning of a sentence. If you need to add characters other than "." for splitting the text into sentences, that is easy.
If you want special treatment for other kinds of words, that is easy too.
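The steps above could be sketched roughly as follows. This is one possible reading, not the only one: the class and method names are invented for the example, "." is assumed to be the only sentence terminator, and words are assumed to be separated by plain spaces. Trailing punctuation stays attached to its word and is ignored by the case checks.

```java
import java.util.Locale;

final class SentenceCase {
    // True if the word has at least one uppercase letter and no lowercase ones.
    static boolean isAllCaps(String word) {
        boolean hasUpper = false;
        for (char c : word.toCharArray()) {
            if (Character.isLowerCase(c)) return false;
            if (Character.isUpperCase(c)) hasUpper = true;
        }
        return hasUpper;
    }

    // The assignment's "noun": capital first letter, no other capitals.
    static boolean isNoun(String word) {
        if (!Character.isUpperCase(word.charAt(0))) return false;
        for (int i = 1; i < word.length(); i++) {
            if (Character.isUpperCase(word.charAt(i))) return false;
        }
        return true;
    }

    static String sentenceCase(String text) {
        StringBuilder out = new StringBuilder();
        // Split after each '.', so every sentence keeps its own period.
        for (String sentence : text.split("(?<=\\.)")) {
            boolean first = true;
            String[] words = sentence.split(" ", -1);
            for (int i = 0; i < words.length; i++) {
                if (i > 0) out.append(' ');      // restore the separator
                String w = words[i];
                if (w.isEmpty()) continue;       // piece produced by a leading space
                if (!isAllCaps(w) && !isNoun(w)) {
                    w = w.toLowerCase(Locale.ROOT);
                }
                if (first) {                     // first word of the sentence
                    w = Character.toUpperCase(w.charAt(0)) + w.substring(1);
                    first = false;
                }
                out.append(w);
            }
        }
        return out.toString();
    }
}
```

For the example from the assignment, SentenceCase.sentenceCase("First SenTence. secOND sentence. tHIRD SENTENCE") returns "First sentence. Second sentence. Third SENTENCE". Each helper can be tested on its own, which is the main benefit of splitting the problem this way.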

You should have a boolean, isCapitalize.
At first, it's true.
While iterating over the words of your text, you build the new, manipulated text word by word.
If the isCapitalize flag is true, write the word with a capital first letter. Otherwise, write it in lowercase letters. If the whole word is in capital letters (that's the reason we iterate over words), write the whole word in capital letters.
After a ".", the flag is on for one word only.
Now take the above text and formulate it into code.
Let us know if you need some help.

Set the first letter to uppercase and the rest to lowercase:
String output = someString.substring(0, 1).toUpperCase() + someString.substring(1).toLowerCase(Locale.ROOT);

Leetcode Valid Palindrome Question Problem Debugging [duplicate]

This question already has answers here:
String replace method is not replacing characters
(5 answers)
Closed 2 years ago.
I'm struggling to understand what's wrong with my code for this Leetcode problem.
Problem: Given a string, determine if it is a palindrome, considering only alphanumeric characters and ignoring cases.
Right now, I am passing 108/476 cases, and I am failing this test: "A man, a plan, a canal: Panama".
Here is my code, please help me identify the problem!
class Solution {
public boolean isPalindrome(String s) {
if (s.isEmpty()) return true;
s.replaceAll("\\s+","");
int i = 0;
int j = s.length() - 1;
while (i <= j) {
if (Character.toLowerCase(s.charAt(i)) != Character.toLowerCase(s.charAt(j))) {
return false;
}
i++;
j--;
}
return true;
}
}

Your replaceAll call is incorrect
Strings are immutable, so s.replaceAll("\\s+",""); on its own changes nothing: the result has to be assigned back to s. Also, the pattern currently only matches whitespace. It should remove every special character and keep only the characters that count. If we use the regex approach like you do, this is one of the simplest patterns for keeping only ASCII letters:
s = s.replaceAll("[^a-zA-Z]+","");
You could be tempted to use \W (or [^\w]) instead. \W matches everything outside [a-zA-Z0-9_], so digits and the underscore character would be kept as well. Since the problem statement counts alphanumeric characters, keeping digits is probably what you want here; in that case use \W, or [^a-zA-Z0-9] if underscores should be removed too. If not, stick to [^a-zA-Z].
If you want to keep all letters, no matter the language, use the following:
s = s.replaceAll("\\P{L}", "");
Note that you could drastically shorten your code like this, although it's definitely not the fastest:
class Solution {
public boolean isPalindrome(String s) {
s = s.replaceAll("\\P{L}", "");
return new StringBuilder(s).reverse().toString().equalsIgnoreCase(s);
}
}

Your regex only removes whitespace. Try this:
s = s.replaceAll("[\\W]+", "");
\W matches anything that is not a word character, i.e. not a letter, digit, or underscore.

With s.replaceAll("\\s+",""); you are only removing the whitespace (and the result is discarded, since it is never assigned back to s). You also have to remove everything that is not alphanumeric, such as the punctuation characters , and : in this test case.
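As an alternative to cleaning the string with a regex first, the two-pointer loop from the question can skip unwanted characters as it goes. The sketch below swaps in that technique; it assumes that "alphanumeric" can be approximated with Character.isLetterOrDigit, and the class name is made up for the example.

```java
final class PalindromeCheck {
    // Compares characters from both ends, skipping anything that is
    // not a letter or digit and ignoring case.
    static boolean isPalindrome(String s) {
        int i = 0;
        int j = s.length() - 1;
        while (i < j) {
            if (!Character.isLetterOrDigit(s.charAt(i))) {
                i++;                                  // skip on the left
            } else if (!Character.isLetterOrDigit(s.charAt(j))) {
                j--;                                  // skip on the right
            } else {
                if (Character.toLowerCase(s.charAt(i))
                        != Character.toLowerCase(s.charAt(j))) {
                    return false;
                }
                i++;
                j--;
            }
        }
        return true;
    }
}
```

This accepts "A man, a plan, a canal: Panama" and rejects "race a car", without allocating an intermediate string.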

How to make multiple inputs of a single character register as one character?

I'm unsure of the code for this, but if one were to input "oooooooooo" after a prompt (like in an if-statement or something where the program registers "o" as "one" or something), how could you make "oooooooooo" translate into "o"?
Would one have to write down manually various iterations of "o" (like, "oo" and "ooo" and "oooo"...etc.). Would it be similar to something like the ignore case method where O and o become the same? So "ooo..." and "o" end up as the same string.

Although probably overkill for this one use case, it would be helpful to learn how to use regexes for the future. Java provides a regex library in the java.util.regex package, built around the Pattern class. For example, the regex /o+ne/ would match any string "o...ne" with at least one "o".
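A minimal sketch of that idea with Pattern might look like this. The names StretchedWord and isOne are hypothetical, and the case-insensitive flag and the trim() call are assumptions about how forgiving the input handling should be.

```java
import java.util.regex.Pattern;

final class StretchedWord {
    // One or more "o" followed by "ne", ignoring case.
    private static final Pattern ONE =
            Pattern.compile("o+ne", Pattern.CASE_INSENSITIVE);

    // True if the whole (trimmed) input is "one" with any number of o's.
    static boolean isOne(String input) {
        return ONE.matcher(input.trim()).matches();
    }
}
```

Here "one", "oooooone" and "OOOne" are all accepted, while "ne" and "onee" are not, so there is no need to list the variants by hand.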

Using regex:
public static String getSingleCharacter(String input){
if(input == null || input.length() == 0) return null;
if(input.length() == 1) return input;
if(!input.toLowerCase().matches("^\\w*?(\\w)(?!\\1|$)\\w*$")){
return Character.toString(input.toLowerCase().charAt(0));
}
return null;
}
If the method returns null, the characters are not all the same (or the input was null or empty); otherwise it returns that single character as a string.

Use the regular expression /(.)\1+/ and String#replaceAll() to match runs of two or more of the same character and then replace the match with the value of the first match group identified with $1 as follows:
public static String squeeze(String input) {
return input.replaceAll("(.)\\1+", "$1");
}
String result = squeeze("aaaaa bbbbbbb cc d");
assert(result.equals("a b c d"));

public static String condense(String input) {
if (input == null || input.isEmpty()) {
return input; // nothing to condense
}
// Compare each character with its successor
for (int i = 0; i < input.length() - 1; i++) {
if (input.charAt(i) != input.charAt(i + 1)) {
return input; // two different characters found: leave unchanged
}
}
return input.substring(0, 1);
}
This loops through the entire string, comparing neighbouring characters. If every character in the string is the same, it returns a condensed, one-character version of the string; otherwise it returns the input unchanged.

How to check a string which only contains one word. If a string has a sentence it should return false

I know it's weird to ask a question like this, but I've got no other options. The problem is:
I've come across a requirement where I need to add a condition: given a string as input, I should accept only strings that contain exactly one word. If there are several words, I should reject the input.
How can I add such a check when I have no further specification for the string?

If the words are separated by some kind of white space, you could use a simple regular expression for this:
Pattern wordPattern = Pattern.compile("\\w+");
Matcher wordMatcher = wordPattern.matcher(inputString);
if (!wordMatcher.matches()) {
// discard user input
}
This will match all word characters ([a-zA-Z_0-9]). If your definition of "word" is different, the regex will need to be adapted.

There are many ways to achieve this.
One of the simplest is:
String str = "abc def";
String[] array = str.trim().split(" ");
if (array.length == 1) {
// allow: exactly one word
} else {
// reject: more than one word; do something else, or skip
}

You can split the string on a regular expression that represents a sequence of white spaces and then see how many parts you get. Here's a function to do it:
public static boolean is_word(String s) {
return (s.length() > 0 && s.split("\\s+").length == 1);
}
System.out.println(is_word("word"));
System.out.println(is_word("two words"));
System.out.println(is_word("word\tabc\txyz"));
System.out.println(is_word(""));
Output:
true
false
false
false
The length check on the input string is required if you want to say that an empty string is not a word, which would seem reasonable.

Counting comma and any text in java String

I'm trying to write a function to count specific Strings.
The Strings to count look like the following:
first any character except comma at least once -
the comma -
any character, at least once
example string:
test, test, test,
should count to 3
I've tried do that by doing the following:
int countSubstrings = 0;
final Pattern pattern = Pattern.compile("[^,]*,.+");
final Matcher matcher = pattern.matcher(commaString);
while (matcher.find()) {
countSubstrings++;
}
Though my solution doesn't work. It always ends up counting to one and no further.

Try this pattern instead: [^,]+
As you can see in the API, find() will give you the next subsequence that matches the pattern. So this will find your sequences of "non-commas" one after the other.

Your regex, especially the .+ part, will match any char sequence of at least length 1, so the first match greedily swallows the rest of the input. You want the match to be reluctant/lazy, so add a ?: [^,]*,.+?
Note that .+? will still match a comma that directly follows a comma, so you might want to replace .+? with [^,]+ instead (since commas can't match with this, laziness is not needed).
Besides that an easier solution might be to split the string and get the length of the array (or loop and check the elements if you don't want to allow for empty strings):
countSubstrings = commaString.split(",").length;
Edit:
Since you added an example that clarifies your expectations, you need to adjust your regex. You seem to want to count the number of strings followed by a comma so your regex can be simplified to [^,]+,. This matches any char sequence consisting of non-comma chars which is followed by a comma.
Note that this wouldn't match multiple commas or text at the end of the input, e.g. test,,test would result in a count of 1. If you have that requirement you need to adjust your regex.
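The difference between the patterns can be checked with a small counting helper such as the one below. It is a sketch only; MatchCounter and count are names invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchCounter {
    // Counts the non-overlapping matches that Matcher.find() reports
    // for the given regex in the given input.
    static int count(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }
}
```

For the input "test, test, test," the original pattern [^,]*,.+ yields 1, because the greedy .+ consumes everything after the first comma, while [^,]+, yields 3. For "test,,test" the pattern [^,]+, yields 1, as described above.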

So, quite good answers have already been given. Something like this should also work. Beware, it's not clean and probably not the fastest way to do this, but it is quite readable. :)
public int countComma(String lots_of_words) {
int count = 0;
for (int x = 0; x < lots_of_words.length(); x++) {
if (lots_of_words.charAt(x) == ',') {
count++;
}
}
return count;
}
Or even better:
public int countChar(String lots_of_words, char the_chosen_char) {
int count = 0;
for (int x = 0; x < lots_of_words.length(); x++) {
if (lots_of_words.charAt(x) == the_chosen_char) {
count++;
}
}
return count;
}
