This question already has answers here:
split a string in java into equal length substrings while maintaining word boundaries
(2 answers)
Closed 6 years ago.
In Java or Groovy, is there a library or a simple implementation
that takes a text and returns a substring of at most a given length, without breaking a word in the middle?
An example method call: substring("My very long text", 9 /*substring length*/, true /*break on whole words only*/)
Without respecting word boundaries this would return My very l. Since I want to break on whole words only, the result should be My very.
In case there are no spaces it would cut the string at the index:
substring("MyVeryLongText", 9 /*substring length*/, true /*break on whole words only*/) --> MyVeryLon
I believe there is nothing for this built into Java. We wrote our own method for a similar task. There could easily be some free library out there, but I don't think I'd introduce a new dependency for this relatively simple problem. You will want to decide what should happen if there is no space at which to break (substring("Beginning with a long word", 4, true)). Maybe the library you find doesn't do what you want in this case. If you write your own, you need to take into account the cases where the original string is too short (substring("Cat", 4, true)) and where the space comes right after the 4th char (substring("Long text", 4, true)).
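A minimal sketch of such a hand-written method might look like this. The class name WordSubstring is hypothetical, and the fallback of cutting at the index when there is no space follows the behaviour described in the question, which is only one possible choice.

```java
class WordSubstring {

    // Returns at most maxLength characters of text. When wholeWordsOnly is true,
    // the result is shortened to the last space so that no word is cut in half.
    // Assumption: if there is no space to break at, cut at maxLength anyway.
    static String substring(String text, int maxLength, boolean wholeWordsOnly) {
        if (text.length() <= maxLength) {
            return text; // original string is already short enough
        }
        String cut = text.substring(0, maxLength);
        if (!wholeWordsOnly) {
            return cut;
        }
        // A space right after the cut means the cut already ends on a word boundary.
        if (text.charAt(maxLength) == ' ') {
            return cut;
        }
        int lastSpace = cut.lastIndexOf(' ');
        return lastSpace > 0 ? cut.substring(0, lastSpace) : cut;
    }
}
```

With this sketch, substring("Cat", 4, true) returns the whole string and substring("Long text", 4, true) returns Long, which covers the two edge cases mentioned above.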
This question already has answers here:
Spelling correction for data normalization in Java
(5 answers)
Closed 5 years ago.
Is there a library in Java that can do a spell check?
I have an ArrayList of categories, which is a list of words {Fox, Lion, Wolf, Snake}. This list can be very big.
The program asks the user to input the "Animal". The user may make a spelling mistake, e.g. enter "Fix" or "Loin".
Is there a way to compare the input to the elements of the list, find the closest one in similarity, and use that element instead of the misspelled input for the rest of the program?
You are probably looking for the Levenshtein Distance between two strings. The distance grows the more dissimilar the strings are.
Apache Commons Text has an implementation: https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/similarity/LevenshteinDistance.html
If it is just a list of animals, you could also write your own code that checks whether the word is part of the list and otherwise compares its length, the number of matching characters, etc.
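As a rough illustration, picking the closest list element by Levenshtein distance could be sketched as follows. The distance is hand-written here to keep the example free of dependencies; the class and method names are hypothetical, and comparing case-insensitively is an assumption.

```java
import java.util.List;

class ClosestMatch {

    // Classic dynamic-programming Levenshtein distance, keeping only two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the candidate with the smallest distance to the input,
    // or null if the list is empty. On ties the first candidate wins.
    static String closest(String input, List<String> candidates) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : candidates) {
            int d = levenshtein(input.toLowerCase(), candidate.toLowerCase());
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }
}
```

For a very big list this linear scan may become slow; in practice you would probably also reject matches whose distance exceeds some threshold rather than always accepting the nearest word.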
This question already has answers here:
Using regular expressions to validate a numeric range
(11 answers)
Closed 7 years ago.
I'm trying to verify that a user inputs coordinates in the format: x,y
so I'm using regular expressions. x and y can be between 1 and 15, so I used
[1-15]\\d{1,2},[1-15]\\d{1,2}
but that didn't work because 15 isn't a digit obviously. I changed it to
\\d{1,2},\\d{1,2}
so at least I can confirm that it's two one- or two-digit numbers, but it could be up to 99 on either side; not good. I've tried a few other ways like
\\d{1}|[1]\\d[0-5]\\d...
but nothing works and honestly I've been looking at all this so long it doesn't make sense anymore.
More importantly, is it even good practice to use Java's regular expression feature? I could think of other ways to do this, but this is just for a personal project I'm working on, where I'm trying out different approaches to things I usually do messily.
I think you understand the [...] the wrong way: it specifies a set of characters, not a range of numbers. So [1-15] means: the range 1 to 1, plus the character 5. It is thus equivalent to 1|5.
You can however match the numbers 1 to 15 with [1-9]|1[0-5]. Since each alternative already matches the whole number, the \\d{1,2} is no longer needed. Plugging this into your regex results in:
([1-9]|1[0-5]),([1-9]|1[0-5])
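A small sketch of how this could be checked in Java, using [1-9]|1[0-5] for each coordinate with nothing appended after the group (each alternative already matches the complete number). The class name CoordinateValidator is made up for the example.

```java
import java.util.regex.Pattern;

class CoordinateValidator {

    // x,y where both x and y are in 1..15, without leading zeros.
    private static final Pattern COORDINATE =
            Pattern.compile("([1-9]|1[0-5]),([1-9]|1[0-5])");

    static boolean isValid(String input) {
        // matches() requires the whole input to match, so "1,150" is rejected.
        return COORDINATE.matcher(input).matches();
    }
}
```

Because matches() anchors the pattern at both ends, the order of the two alternatives does not matter here: for an input like "12,3" the engine backtracks from [1-9] to 1[0-5].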
I know that we can do this in the following ways:
StringBuilder
Use substring
But I am looking for a way to handle a compressed String, say a5b4c2, which means a occurs 5 times, b occurs 4 times and so on, so the String is actually aaaaabbbbcc.
So char at index 2 should return a and char at index 6 should return b.
What can be the best approach for this?
My question is more about what the best approach is to decompress the String.
My question is more about handling this compressed string rather than finding the character at specific index.
Decompress the string until you get the index you want to know. Or you could decompress the whole string and cache it.
What can be the best approach for this?
Without any more specific requirements, I believe the best approach is the simplest approach you can think of.
I would parse each letter-number pair in turn and subtract that number from the index; as soon as the remaining index is < 0, you have the letter you want.
Check what index you are searching for, and start adding up the character counts. Every time you add, check whether the index falls between the previous total and the current one. If it does, you've found your character; otherwise add again.
For example, the workflow given your string a5b4c2, if you want the character at index 7, could be like this:
current position: 0
index we are looking for: 7
add first character's count: 0+5 = 5
does 7 fall within 0 and 5? no, add again
current position: 5
add second character's count: 5+4 = 9
does 7 fall within 5 and 9? yes, so our character must be 'b'.
I'm not sure whether this is more efficient or faster than decompressing the string and just using charAt() or something; it's just a different way of approaching it.
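The running-total lookup walked through above might be sketched like this. It assumes the compressed format is always one character followed by one or more decimal digits; the class name RunLengthLookup is hypothetical.

```java
class RunLengthLookup {

    // Returns the character at the given index of the decompressed string
    // without building that string. Assumes input like "a5b4c2".
    static char charAt(String compressed, int index) {
        if (index < 0) {
            throw new IndexOutOfBoundsException("negative index: " + index);
        }
        int position = 0; // number of decompressed characters covered so far
        int i = 0;
        while (i < compressed.length()) {
            char letter = compressed.charAt(i++);
            int count = 0;
            // Read the (possibly multi-digit) repeat count.
            while (i < compressed.length() && Character.isDigit(compressed.charAt(i))) {
                count = count * 10 + (compressed.charAt(i++) - '0');
            }
            // The current run covers indices position .. position + count - 1.
            if (index < position + count) {
                return letter;
            }
            position += count;
        }
        throw new IndexOutOfBoundsException("index " + index + " is past the end");
    }
}
```

Each run is treated as the half-open interval from the previous total up to, but not including, the new total, which is why index 5 in a5b4c2 already belongs to b.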
EDIT: Since the question is more about how to decompress the string, you could use a StringBuilder and use a for loop to append the correct number of the character to your string... sounds like the simplest way to me.
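For the decompression itself, a StringBuilder loop along those lines could look like the sketch below. As before, the one-character-then-digits format is an assumption, and RunLengthDecoder is a made-up name.

```java
class RunLengthDecoder {

    // Expands "a5b4c2" into "aaaaabbbbcc".
    static String decompress(String compressed) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < compressed.length()) {
            char letter = compressed.charAt(i++);
            int count = 0;
            // Read the (possibly multi-digit) repeat count.
            while (i < compressed.length() && Character.isDigit(compressed.charAt(i))) {
                count = count * 10 + (compressed.charAt(i++) - '0');
            }
            for (int k = 0; k < count; k++) {
                out.append(letter);
            }
        }
        return out.toString();
    }
}
```

If many lookups are needed, decompressing once and caching the result is probably simpler than scanning the compressed form each time, at the cost of memory for long runs.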
I need to implement a spell checker in Java. For example, for the string "sch aproblm iseasili solved" my output should be "such a problem is easily solved". The maximum length of the string to correct is 64. As you can see, the string can have spaces inserted in the wrong places or missing altogether, and it can contain misspelled words. I need a little help finding an efficient algorithm for producing the corrected string.
Currently I delete all spaces from the string and then insert spaces at every possible position. For the word "hot" (the same applies to a sentence) I generate the following candidate strings, which are afterwards corrected word by word using Levenshtein distance: h o t; h ot; ho t; hot. That is 2^(string.length() - 1) candidate strings, so a string of length 64 produces 2^63 candidates, which is far too many. Afterwards I need to process them one by one and select the best one by the following criteria, in order:
- total edit distance (the smallest one wins)
- if several strings have the same edit distance, I have to choose the one with the fewest words
- if several strings have the same number of words, I need to choose the one whose words have the maximum total frequency (I have a dictionary of the 8000 most frequent words along with their frequencies)
- finally, if several strings have the same total frequency, I have to take the lexicographically smallest one.
So basically I generate all possible strings (inserting spaces at all possible positions in the original string), then one by one calculate their total edit distance, number of words, etc., choose the best one, and output the corrected string. I want to know whether there is a more efficient way of doing this, for example one that does not have to generate all possible combinations of strings.
EDIT: I thought I should take another approach. Here is what I have in mind: I take the first letter of my string and extract from the dictionary all the words that begin with that letter. Then I process all of them and extract from my string all possible first words. Staying with my previous example, for the word "hot" generating all possible combinations gave 4 results, but with my new algorithm I obtain only 2, "hot" and "ho", so it's already an improvement. I still need a little help creating a recursive or dynamic-programming algorithm for this. I need a way to store all possible strings for the first word, then for each of those all possible strings for the second word, and so on, and finally to concatenate all possibilities and add them to an array or something. There will still be a lot of combinations for large strings, but not as many as having to do ALL of them. Can someone help me with pseudocode or something similar, as this is not my strong suit?
EDIT2: Here is the code where I generate all the possible first words from my string: http://pastebin.com/d5AtZcth. I need to somehow extend this to do the same for the remaining words, combine each first word with each second word and so on, and store all these concatenations in an array or something.
A few tips for you:
Try correcting just small parts of the string, not everything at once.
90% of errors (IIRC) are within edit distance 1 of the intended word.
You can use a phonetic index to match words against words that sound alike.
You can assume most typos are QWERTY errors (j=>k, h=>g), and try to check them first.
A few more ideas can be found in this nice article:
http://norvig.com/spell-correct.html
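One way to avoid enumerating every split, in the spirit of correcting small parts at a time, is dynamic programming over prefixes of the space-stripped string. This is not the method from the linked article, only a sketch under stated assumptions: it handles just the first two criteria from the question (total edit distance, then fewest words), leaves out the frequency and lexicographic tie-breaks, and all names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class SegmentingCorrector {

    // Classic two-row Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + sub);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // cost[end] is the smallest total edit distance for the first `end`
    // characters; words[end] is the word count of that best solution.
    static String correct(String input, List<String> dictionary) {
        if (dictionary.isEmpty()) {
            throw new IllegalArgumentException("dictionary must not be empty");
        }
        String s = input.replace(" ", "");
        int n = s.length();
        int[] cost = new int[n + 1];
        int[] words = new int[n + 1];
        int[] cutBefore = new int[n + 1];   // where the last chunk starts
        String[] lastWord = new String[n + 1];
        Arrays.fill(cost, Integer.MAX_VALUE);
        cost[0] = 0;
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                String chunk = s.substring(start, end);
                for (String word : dictionary) {
                    int c = cost[start] + levenshtein(chunk, word);
                    int w = words[start] + 1;
                    if (c < cost[end] || (c == cost[end] && w < words[end])) {
                        cost[end] = c;
                        words[end] = w;
                        cutBefore[end] = start;
                        lastWord[end] = word;
                    }
                }
            }
        }
        // Walk the cuts backwards to rebuild the corrected sentence.
        Deque<String> result = new ArrayDeque<>();
        for (int end = n; end > 0; end = cutBefore[end]) {
            result.addFirst(lastWord[end]);
        }
        return String.join(" ", result);
    }
}
```

For a string of length n this looks at on the order of n^2 chunks instead of 2^(n-1) whole splits. With an 8000-word dictionary the inner loop is still expensive, so skipping words whose length differs too much from the chunk would be a natural next step.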
The following list contains one correct word, "disastrous", and several incorrect words that sound like the correct word:
A. disastrus
B. disasstrous
C. desastrous
D. desastrus
E. disastrous
F. disasstrous
Is it possible to automate the generation of wrong choices for a given correct word, through some kind of Java dictionary API?
No, there is nothing for this in the Java API. You can write a simple algorithm that does the job.
Just make up some rules about permuting and doubling letters, and add the generated words to a Set until you have enough of them.
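A sketch of such a rule-based generator is shown below. The four rules (double a letter, drop a letter, swap neighbours, replace a vowel) are just one made-up rule set, and DistractorGenerator is a hypothetical name; nothing here checks that the results actually sound like the original word.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class DistractorGenerator {

    // Generates up to `wanted` distinct misspellings of `word`.
    static Set<String> generate(String word, int wanted) {
        Set<String> candidates = new LinkedHashSet<>();
        String vowels = "aeiou";
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            String before = word.substring(0, i);
            String after = word.substring(i + 1);
            // Rule 1: double the letter.
            candidates.add(before + c + c + after);
            // Rule 2: drop the letter.
            candidates.add(before + after);
            // Rule 3: swap the letter with its right neighbour.
            if (i + 1 < word.length()) {
                candidates.add(before + word.charAt(i + 1) + c + word.substring(i + 2));
            }
            // Rule 4: replace a vowel with each other vowel.
            if (vowels.indexOf(c) >= 0) {
                for (char v : vowels.toCharArray()) {
                    candidates.add(before + v + after);
                }
            }
        }
        candidates.remove(word); // some rules can reproduce the correct spelling
        // Keep only the first `wanted` candidates, in generation order.
        Set<String> result = new LinkedHashSet<>();
        for (String candidate : candidates) {
            if (result.size() >= wanted) break;
            result.add(candidate);
        }
        return result;
    }
}
```

The Set takes care of duplicates such as the repeated disasstrous in the list above. Picking candidates at random rather than in generation order would probably give more varied choices.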
There are a number of algorithms for matching words by sound - 'soundex' is the one that springs to mind, but I remember uncovering a few when I did some research on this a couple of years ago. I expect the problem you would find is that they take a word and return a value that represents how the word sounds so you can see if two spellings sound similar (so the words in the question should generate similar values); but I expect doing the reverse, i.e. taking the value and generating similar sounding spellings, would be quite hard.