I am writing a Scrabble word suggestion program because I wanted to learn sets (don't worry, I at least got that part) and how to reference data not created within the program. I'm pretty new to Java (and programming in general), and I was wondering how to pull words from a word list .FIC file in order to check them against words generated from the letters entered.
To clarify, I have written a program which takes a series of letters and returns a set of every possible word created from those letters. For example:
input:
abc
would give a set containing the "words":
a, ab, ac, abc, acb, b, ba, bc, bac, bca, c, ca, cb, cab, cba
What I am asking, really, is how to check those to find the ones contained in the .FIC file.
The file is the "official crosswords" file from the Moby project word list, and I am still (very) shaky on parsing and other file-handling methods. I am continuing to research, so I don't have any prototype code for that.
Sorry if the question isn't entirely clear.
Edit: here is the method that makes the "words", to make the idea easier to understand. The part I don't understand is specifically how to pull a word (as a String) from the .FIC file.
private static Set<String> Words(String s)
{
Set<String> tempwords = new TreeSet<String>();
if (s.length() == 1)
{ // base case, last letter
tempwords.add(s);
// System.out.println(s); uncomment when debugging
}
else
{
//set up to add each letter in s
for (int i = 0; i < s.length(); i++)
{ //cut the i letter out of the string
String remaining = s.substring(0, i) + s.substring(i+1);
//recursion to add all combinations of letters onto the current letter/"word"
for (String permutation : Words(remaining))
{
// System.out.println(s.substring(i, i+1) + permutation); uncomment when debugging
//add the full length words
tempwords.add(s.substring(i, i+1) + permutation);
// System.out.println(permutation); uncomment when debugging
//add the not-full-length words
tempwords.add(permutation);
}
}
}
// System.out.println(tempwords); uncomment when debugging
return tempwords;
}
I don't know if it is the best solution, but I figured it out (hobbs, the line thing helped a lot, thank you). I found that this works:
public static void main(String[] args) throws FileNotFoundException
{
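while(true)
{
words.clear();
finalwordset.clear(); // start each round with an empty result list
String letters = enterLetters();
words.addAll(Words(letters));
// Reopen the file for every set of letters: Scanner.reset() only resets the
// delimiter, locale and radix; it does not rewind to the start of the file.
try (Scanner s = new Scanner(new FileReader("C:/Users/Sean/workspace/Imbored/bin/113809of.fic"))) {
while(s.hasNextLine()) {
String line = s.nextLine();
String finalword = checkWords(line, words);
if (finalword != null) finalwordset.add(finalword);
}
}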
System.out.println(finalwordset);
System.out.println();
System.out.println("_________________________________________________________________________");
}
}
A few things:
The checkWords method checks whether the current word from the file is in the generated list of "words".
The enterLetters method takes the letters entered by the user and returns them as a String.
The Words method returns a set of strings containing all possible combinations of the characters in the given string, with each character used at most as many times as it appears in the string and no repeated "words" in the returned set.
finalwordset and words are ArrayLists of Strings defined as static fields (I would put them in the main method, but I'm lazy and it doesn't matter for this case).
I am very sure there is a better/more efficient way to do this, but this at least works.
Finally: I decided to answer rather than delete because I didn't see this answered anywhere else. If it has been answered, feel free to delete the question or link to the other answer; at this point it is here to help other people.
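One possibly more efficient variant of the approach above, as a rough sketch rather than a tested replacement: read the word list into a HashSet once, then keep only the generated "words" that occur in it. The class and method names here (WordListLookup, loadWordList, realWords) are made up for illustration, and the sketch assumes the .FIC file is plain text with one word per line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class WordListLookup {

    // Reads the word list into a HashSet so each lookup is O(1) on average.
    // Assumption: one word per line; ISO-8859-1 is used because it never
    // fails on unexpected bytes.
    static Set<String> loadWordList(Path file) throws IOException {
        Set<String> dictionary = new HashSet<>();
        for (String line : Files.readAllLines(file, StandardCharsets.ISO_8859_1)) {
            String word = line.trim().toLowerCase();
            if (!word.isEmpty()) {
                dictionary.add(word);
            }
        }
        return dictionary;
    }

    // Returns the candidates that are real words, in alphabetical order.
    static Set<String> realWords(Set<String> candidates, Set<String> dictionary) {
        Set<String> result = new TreeSet<>(candidates);
        result.retainAll(dictionary);
        return result;
    }

    public static void main(String[] args) throws IOException {
        // The path is a placeholder; pass the real location as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "113809of.fic");
        Set<String> dictionary = loadWordList(file);
        // Stand-in for the set returned by Words("abc")
        Set<String> candidates = new HashSet<>(Arrays.asList("a", "ab", "abc", "cab", "cba"));
        System.out.println(realWords(candidates, dictionary));
    }
}
```

With this shape the file is read once at startup instead of once per set of letters, and the check runs over the generated candidates rather than over every line of the file.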
I have a dictionary with many words, and I want to find the longest concatenated word (that is, the longest word that is composed entirely of shorter words in the file). I pass the words to the method in descending order of length. How can I check that every character of the word is covered by words from the dictionary?
public boolean tryMatch(String s, List dictionary) {
String nextWord = new String();
int contaned = 0;
// Loop over every word in the dictionary
for(int i = 1; i < dictionary.size();i++) {
nextWord = (String) dictionary.get(i);
if (nextWord == s) {
nextWord = (String) dictionary.get(i + 1);
}
if (s.contains(nextWord)) {
contaned++;
}
}
if(contaned >1) {
return true;
}
return false;
}
If you have a sorted list of words, finding compound words is easy, but it will only perform well if the words are also in a Set.
Let's look at the compound word football, and of course assume that both ball and foot are in the word list.
By definition, any compound word using foot as the first sub-word must start with foot.
So, when iterating the list, remember the currently active "stem" word, e.g. when seeing foot, remember it.
Now, when seeing football, you check whether the word starts with the stem word. If not, clear the stem word and make the new word the stem word.
If it does, the new word (football) is a candidate for being a compound word. The part after the stem is ball, so we need to check whether that is a word, and if so, we have found a compound word.
Checking is easy in the simple case, i.e. wordSet.contains(remain).
However, compound words can be made up of more than 2 words, e.g. whatsoever. So after finding that it is a candidate from the stem word what, the remainder is soever.
You can simply try all lengths of that (soever, soeve, soev, soe, so, s), and if one of the shorter ones is a word, you repeat the process on what is left.
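The "try all lengths and repeat" step can be sketched roughly as follows. This is a simplified illustration, not the stem-tracking iteration described above: it just asks the Set for every prefix. The names CompoundWords, isCompound, isWordSequence and longestCompound are invented for the example.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CompoundWords {

    // True if 'word' can be split into two or more dictionary words.
    static boolean isCompound(String word, Set<String> dict) {
        // Try every proper prefix as the first sub-word (the "stem").
        for (int len = word.length() - 1; len >= 1; len--) {
            if (dict.contains(word.substring(0, len))
                    && isWordSequence(word.substring(len), dict)) {
                return true;
            }
        }
        return false;
    }

    // True if 'rest' is a dictionary word or a sequence of dictionary words.
    static boolean isWordSequence(String rest, Set<String> dict) {
        if (dict.contains(rest)) {
            return true; // simple case: wordSet.contains(remain)
        }
        // Otherwise try the shorter prefixes and repeat on what is left.
        for (int len = rest.length() - 1; len >= 1; len--) {
            if (dict.contains(rest.substring(0, len))
                    && isWordSequence(rest.substring(len), dict)) {
                return true;
            }
        }
        return false;
    }

    // Returns the longest compound word, or null if there is none.
    static String longestCompound(Collection<String> words) {
        Set<String> dict = new HashSet<>(words);
        String best = null;
        for (String w : words) {
            if ((best == null || w.length() > best.length()) && isCompound(w, dict)) {
                best = w;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "ball", "ever", "foot", "football", "so", "what", "whatsoever");
        System.out.println(longestCompound(words)); // prints "whatsoever"
    }
}
```

The plain recursion can become slow on long words with many overlapping prefixes; caching the result per remainder would be the usual remedy.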
All I am doing in my project is taking two values (read from two different Excel files) and checking how similar they are. I tried the Pattern and Matcher classes, which work perfectly well when both words are essentially the same (as in organisation and organisation/s). But my data also contains pairs like employee and employment, where I just need "employ" as the common string between the two, and in that case Pattern and Matcher fail. I have been stuck on this for a week.
I have about 700 rows in the first Excel file and about 9000 in the other. I read each cell value into the program using Java and store the two values in separate variables. Next, I tried using four nested for loops to match word by word and character by character, to find only the characters that match between the two. I have pasted the code for the for-loop implementation below. The four for loops are driving me nuts! Any help in completing this would be greatly appreciated.
String str1 = "Cover for employees of the company";
String str2 = "Employment Agencies ";
String str,strfinal;
String[] count1 = str1.split("\\s+");
String[] count2 = str2.split("\\s+");
char[] count11 = str1.toCharArray();
char[] count22 = str2.toCharArray();
for(int i=0;i<count1.length;i++)
{
for(int j=0;j<count2.length;j++)
{
for(int m=0;m<count1[i].length();m++)
{
for(int n=0;n<count2[j].length();n++)
{
if(count11[m]==count22[n])
{
// please look at the logic that I am looking for to implement
}
}
}
}
}
Expected output: employ
One more idea that I am trying to implement (to make my program more efficient):
cover is compared with employment. The first character does not match, so move on to the next word in the second string. Once all words in the second string have been checked, go to the next word in the first string and compare it with all the words in the second string.
So this is what I am looking for right now. Any help will be greatly appreciated.
Thanks!
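One possible way to get "employ" out of the two sample strings is to compute the longest common substring for every pair of words and keep the longest result. The sketch below is only an illustration of that idea: the names CommonStem, longestCommonSubstring and longestCommonPart are invented, case is ignored, and the first-character shortcut described above is not used because the shared part does not have to start at the first character.

```java
class CommonStem {

    // Longest run of characters that occurs in both words, ignoring case.
    // Classic dynamic programming: len[i][j] is the length of the common
    // run that ends at a[i-1] and b[j-1].
    static String longestCommonSubstring(String a, String b) {
        a = a.toLowerCase();
        b = b.toLowerCase();
        int[][] len = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        int endInA = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    len[i][j] = len[i - 1][j - 1] + 1;
                    if (len[i][j] > best) {
                        best = len[i][j];
                        endInA = i;
                    }
                }
            }
        }
        return a.substring(endInA - best, endInA);
    }

    // Compares every word of s1 with every word of s2 and returns the
    // longest common part found in any pair.
    static String longestCommonPart(String s1, String s2) {
        String best = "";
        for (String w1 : s1.trim().split("\\s+")) {
            for (String w2 : s2.trim().split("\\s+")) {
                String common = longestCommonSubstring(w1, w2);
                if (common.length() > best.length()) {
                    best = common;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String str1 = "Cover for employees of the company";
        String str2 = "Employment Agencies ";
        System.out.println(longestCommonPart(str1, str2)); // prints "employ"
    }
}
```

For 700 by 9000 rows this runs the word-pair comparison many times, so a minimum length threshold for a "match" would probably be worth adding.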
I have a serious problem with extracting terms from each line of text. To be more specific, I have a file with a .csv extension that is not actually in CSV format (all terms of a line end up in line[0]).
Here is an example of a few lines out of thousands:
(split() doesn't work!)
test.csv
"31451 CID005319044 15939353 C8H14O3S2 beta-lipoic acid C1C[S#](=O)S[C##H]1CCCCC(=O)O "
"12232 COD05374044 23439353 C924O3S2 saponin CCCC(=O)O "
"9048 CTD042032 23241 C3HO4O3S2 Berberine [C##H]1CCCCC(=O)O "
I want to extract only "beta-lipoic acid", "saponin" and "Berberine", which are located in the 5th position.
You can see there are big gaps between terms, which is why I say 5th position.
In this case, how can I extract the term located in the 5th position of each line?
One more thing: the amount of whitespace between the six terms is not always equal. It could be one, two, three, four, or five spaces, or something like that.
Because the amount of whitespace varies, I cannot simply use the .split() function.
For example, in the first line I would get "beta-lipoic" instead of "beta-lipoic acid".
Here is a solution to your problem using String.split and indexOf:
import java.util.ArrayList;
public class StringSplit {
public static void main(String[] args) {
String[] seperatedStr = null;
int fourthStrIndex = 0;
String modifiedStr = null, finalStr = null;
ArrayList<String> strList = new ArrayList<String>();
strList.add("31451 CID005319044 15939353 C8H14O3S2 beta-lipoic acid C1C[S#](=O)S[C##H]1CCCCC(=O)O ");
strList.add("12232 COD05374044 23439353 C924O3S2 saponin CCCC(=O)O ");
strList.add("9048 CTD042032 23241 C3HO4O3S2 Berberine [C##H]1CCCCC(=O)O ");
for (String item: strList) {
seperatedStr = item.split("\\s+");
fourthStrIndex = item.indexOf(seperatedStr[3]) + seperatedStr[3].length();
modifiedStr = item.substring(fourthStrIndex, item.length());
finalStr = modifiedStr.substring(0, modifiedStr.indexOf(seperatedStr[seperatedStr.length - 1]));
System.out.println(finalStr.trim());
}
}
}
Output:
beta-lipoic acid
saponin
Berberine
Option 1: Use String.split and split only on two or more consecutive whitespace characters, like the code below:
String s[] = str.split("\\s\\s+");
for (String string : s) {
System.out.println(string);
}
Option 2: Implement your own string split logic by walking through all the characters. Sample code below (this code is just to give an idea; I did not test it).
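public static List<String> getData(String str) {
List<String> list = new ArrayList<>();
StringBuilder s = new StringBuilder();
int count = 0; // consecutive spaces seen since the last non-space character
for (char c : str.toCharArray()) {
if (c == ' ') {
count++;
continue;
}
if (count > 1 && s.length() > 0) {
// two or more spaces end the current term
list.add(s.toString());
s.setLength(0);
} else if (count == 1 && s.length() > 0) {
// a single space belongs to the term, e.g. "beta-lipoic acid"
s.append(' ');
}
count = 0;
s.append(c);
}
if (s.length() > 0) {
list.add(s.toString()); // the last term has no separator after it
}
return list;
}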
This would be a relatively easy fix if it weren't for beta-lipoic acid...
Assuming that only spaces/tabs/other whitespace separate terms, you could split on whitespace.
Pattern whitespace = Pattern.compile("\\s+");
String[] terms = whitespace.split(line); // Not 100% sure of syntax here...
// Your desired term should be index 4 of the terms array
While this would work for the majority of your terms, this would also result in you losing the "acid" in "beta-lipoic acid"...
Another hacky solution would be to add in a check for the 6th spot in the array produced by the above code and see if it matches English letters. If so, you can be reasonably confident that the 6th spot is actually part of the same term as the 5th spot, so you can then concatenate those together. This falls apart pretty quickly though if you have terms with >= 3 words. So something like
Pattern possibleEnglishWord = Pattern.compile("[a-zA-Z]+"); // Can add dashes and such as needed
if (terms.length > 5 && possibleEnglishWord.matcher(terms[5]).matches()) {
// return terms[4] + " " + terms[5] or something like that
}
Another thing you can try is to replace all groups of spaces with a single space, and then remove every term that isn't made up of just English letters/dashes:
line = whitespace.matcher(line).replaceAll(" ");
// Matches any whitespace-delimited term that contains something other than letters and dashes
Pattern notEnglishWord = Pattern.compile("(?<!\\S)(?![a-zA-Z-]+(?!\\S))\\S+");
line = notEnglishWord.matcher(line).replaceAll("").trim();
Then hopefully the only thing that is left would be the term you're looking for.
Hopefully this helps, but I do admit it's rather convoluted. One of the issues is that it appears that non-term words may have only one space between them, which would fool Option 1 as presented by Hirak... If that weren't the case that option should work.
Oh by the way, if you do end up doing this, put the Pattern declarations outside of any loops. They only need to be created once.
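As a rough sketch of the "split on whitespace, then stitch the name back together" idea, with the Pattern created once outside any loop: the code below assumes that the first four fields and the last field never contain whitespace, so everything between them is the name. That assumption comes from the three sample lines only. FifthField and extractName are names invented for this example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class FifthField {

    // Compiled once and reused for every line.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Returns the 5th field, which may consist of several words.
    // Assumption: fields 1-4 and the last field contain no whitespace.
    static String extractName(String line) {
        // Drop the surrounding quotes shown in the sample file, then split.
        String[] terms = WHITESPACE.split(line.replace("\"", "").trim());
        if (terms.length < 6) {
            return ""; // not enough fields to contain a name
        }
        // Everything from index 4 up to (not including) the last term
        return String.join(" ", Arrays.copyOfRange(terms, 4, terms.length - 1));
    }

    public static void main(String[] args) {
        String[] lines = {
            "\"31451 CID005319044 15939353 C8H14O3S2 beta-lipoic acid C1C[S#](=O)S[C##H]1CCCCC(=O)O \"",
            "\"12232 COD05374044 23439353 C924O3S2 saponin CCCC(=O)O \"",
            "\"9048 CTD042032 23241 C3HO4O3S2 Berberine [C##H]1CCCCC(=O)O \""
        };
        for (String line : lines) {
            System.out.println(extractName(line));
        }
    }
}
```

Unlike the check on the 6th spot, this also copes with names of three or more words, but it breaks as soon as any other field contains a space.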
Given a string, I need to print out all permutations of the string. How should I do that? I have tried
for(int i = 0; i<word.length();i++)
{
for(int j='a';j<='z';j++){
word = word.charAt(i)+""+(char)j;
System.out.println(word);
}
}
Is there a good way about doing this?
I'm not 100% sure that I understand what you are trying to do. I'm going by your original wording of the question and your comment on @ErstwhileIII's answer, which make me think that it's not really "permutations" (i.e. rearrangements of the letters in the word) that you are looking for, but rather possible single-letter modifications (not sure what a better word for this would be either), like this:
Take a word like "hello" and print a list of all "versions" you can get by adding one "typo" to it:
hello -> aello, bello, cello, ..., zello, hallo, hbllo, hcllo, ..., hzllo, healo, heblo, ...
If that's indeed what you're looking for, the following code will do that for you pretty efficiently:
public void process(String word) {
// Convert word to array of letters
char[] letters = word.toCharArray();
// Run through all positions in the word
for (int pos=0; pos<letters.length; pos++) {
// Run through all letters for the current position
for (char letter='a'; letter<='z'; letter++) {
// Replace the letter
letters[pos] = letter;
// Re-create a string and print it out
System.out.println(new String(letters));
}
// Set the current letter back to what it was
letters[pos] = word.charAt(pos);
}
}
To print out all permutations of a string, consider your algorithm first. What is the definition of "all permutations"? For example:
String "a" would have the answer: a
String "ab" would have the answer: ab, ba
String "abc" would have the answer: abc, acb, bca, bac, cba, cab
Reflect on the algorithm you would use (write it down in English), then translate it to Java code.
While not the most efficient, a recursive solution might be easiest to use (i.e. for a string of length n, go through each of the characters and follow that with the permutations of the string with that character removed).
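The recursive idea just described could look roughly like this. It is a minimal sketch with invented names (Permutations, permutations), and it does not remove duplicates, so a string with repeated letters yields repeated entries.

```java
import java.util.ArrayList;
import java.util.List;

class Permutations {

    // For each character, put it first and follow it with every
    // permutation of the string with that character removed.
    static List<String> permutations(String s) {
        List<String> result = new ArrayList<>();
        if (s.length() <= 1) {
            result.add(s); // base case: nothing left to rearrange
            return result;
        }
        for (int i = 0; i < s.length(); i++) {
            String rest = s.substring(0, i) + s.substring(i + 1);
            for (String p : permutations(rest)) {
                result.add(s.charAt(i) + p);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(permutations("abc"));
        // prints [abc, acb, bac, bca, cab, cba]
    }
}
```

A string of length n has n! permutations, so this is only practical for short inputs.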
EDIT: Ok... you changed your request. Permutations is a whole other story. I think this will help: Generating all permutations of a given string
Not sure what you are trying to do... Example 1 is to get the alphabet one letter next to another. Example 2 is to print whatever you gave us there as an example.
//Example 1
String word=""; //empty string
for(int i = 65; i<=90;i++){ //65-90 are the Ascii numbers for capital letters
word+=(char)i; //cast int to char
}
System.out.println(word);
//Example 2
String word="";
for (int i=65;i<=90;i++){
word+=(char)i+"rse";
if(i!=90){ //you don't want this at the end of your sentence i suppose :)
word+=", ";
}
}
System.out.println(word);
I'm using the BreakIterator class in Java to break a paragraph into sentences. This is my code:
public Map<String, Double> breakSentence(String document) {
sentences = new HashMap<String, Double>();
BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
bi.setText(document);
Double tfIdf = 0.0;
int start = bi.first();
for(int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
String sentence = document.substring(start, end);
sentences.put(sentence, tfIdf);
}
return sentences;
}
The problem occurs when the paragraph contains titles or numbers, for example:
"Prof. Roberts trying to solve a problem by writing a 1.200 lines of code."
What my code will produce is :
sentences :
Prof
Roberts trying to solve a problem by writing a 1
200 lines of code
Instead of 1 single sentence because of the period in titles and numbers.
Is there a way to fix this to handle titles and numbers with Java?
Well this is a bit of a tricky situation, and I've come up with a sticky solution, but it works nevertheless. I'm new to Java myself so if a seasoned veteran wants to edit this or comment on it and make it more professional by all means, please make me look better.
I basically added some control measures to what you already have to check and see if words exist like Dr. Prof. Mr. Mrs. etc. and if those words exist, it just skips over that break and moves to the next break (keeping the original start position) looking for the NEXT end (preferably one that doesn't end after another Dr. or Mr. etc.)
I'm including my complete program so you can see it all:
import java.text.BreakIterator;
import java.util.*;
public class TestCode {
private static final String[] ABBREVIATIONS = {
"Dr." , "Prof." , "Mr." , "Mrs." , "Ms." , "Jr." , "Ph.D."
};
public static void main(String[] args) throws Exception {
String text = "Prof. Roberts and Dr. Andrews trying to solve a " +
"problem by writing a 1.200 lines of code. This will " +
"work if Mr. Java writes solid code.";
for (String s : breakSentence(text)) {
System.out.println(s);
}
}
public static List<String> breakSentence(String document) {
List<String> sentenceList = new ArrayList<String>();
BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
bi.setText(document);
int start = bi.first();
int end = bi.next();
int tempStart = start;
while (end != BreakIterator.DONE) {
String sentence = document.substring(start, end);
if (! hasAbbreviation(sentence)) {
sentence = document.substring(tempStart, end);
tempStart = end;
sentenceList.add(sentence);
}
start = end;
end = bi.next();
}
return sentenceList;
}
private static boolean hasAbbreviation(String sentence) {
if (sentence == null || sentence.isEmpty()) {
return false;
}
for (String w : ABBREVIATIONS) {
// Only a chunk that ends with an abbreviation was cut off too early
if (sentence.trim().endsWith(w)) {
return true;
}
}
return false;
}
}
What this does is basically set up two starting points. The original starting point (the one you used) still does the same thing, but tempStart doesn't move unless the string looks ready to be made into a sentence. It takes the first chunk:
"Prof."
and checks whether it broke because of a weird word (i.e. does the chunk contain Prof., Dr., or the like, which might have caused that break). If it does, tempStart doesn't move; it stays there and waits for the next chunk to come back. In my slightly more elaborate sentence the next chunk also has a weird word messing up the breaks:
"Roberts and Dr."
It takes that chunk, and because it has a Dr. in it, it continues on to the third chunk of the sentence:
"Andrews trying to solve a problem by writing a 1.200 lines of code."
Once it reaches the third chunk, which has no weird titles that may have caused a false break, it takes the text from tempStart (which is still at the beginning) to the current end, basically joining all three parts together.
Now it sets tempStart to the current end and continues.
Like I said, this may not be a glamorous way to get what you want, but nobody else volunteered and it works. *shrug*
It appears that Prof. Roberts only gets split if Roberts begins with a capital letter.
If Roberts begins with a lowercase r, it does not get split.
So... I guess that's how BreakIterator deals with periods.
I'm sure further reading of the documentation will explain how this behavior can be modified.
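To check this observation on your own JDK, a small sketch like the following prints the chunks for both spellings. The class and method names (SentenceChunks, chunks) are invented for the example, and no output is shown here because the exact breaks depend on the rules shipped with the runtime.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceChunks {

    // Returns the pieces that BreakIterator considers separate sentences.
    static List<String> chunks(String text) {
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
        bi.setText(text);
        List<String> out = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        String[] samples = {
            "Prof. Roberts is trying to solve a problem.",
            "Prof. roberts is trying to solve a problem."
        };
        for (String sample : samples) {
            System.out.println(sample);
            for (String chunk : chunks(sample)) {
                System.out.println("  [" + chunk + "]");
            }
        }
    }
}
```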