Having issues with apostrophes in strings (Scala) - java

I'm running into some weird issues in Scala right now. I'm writing a spell checker and the dictionary is in a .txt file that is being read in and stored in a map. In my dictionary is the word "Boston's". I did a check to see if "Boston's" was in the map by using the contains method and it's there. However, the real issue arises when I do the spell check on a document.
"Boston's" is being read in from the document I'm spell checking and stored in a ListBuffer, but when I check if my "dictionary" map contains it, it says it doesn't. So I did a println on both instances of "Boston's" (in my "dictionary" map and in my "wordToBeChecked" list) and I noticed something odd:
Both are there, but they look different. The one in my wordToBeChecked list looks as if it contains a single quote rather than an apostrophe. I've been trying to fix this for hours, but now I'm officially stumped.
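One possible explanation, offered only as an assumption because the document being checked is not shown: the document may use the typographic apostrophe (U+2019) while the dictionary file uses the ASCII apostrophe (U+0027). These are different characters, so a map lookup fails. A minimal Java sketch of normalizing words before the lookup (the class and method names are hypothetical, and the same calls can be made from Scala):

```java
class ApostropheSketch {
    // Replaces typographic single quotes with the ASCII apostrophe so that
    // both spellings of "Boston's" compare equal.
    // Assumption: U+2018 and U+2019 are the only variants in the input.
    static String normalize(String word) {
        return word
                .replace('\u2018', '\'')   // left single quotation mark
                .replace('\u2019', '\'');  // right single quotation mark
    }
}
```

Applying the same normalization both when loading the dictionary and when reading the document keeps the two sides consistent. It is also worth checking that both files are read with the same character encoding.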

Related

Java processing lines in file and data structures

I have read a bit about multidimensional arrays. Would it make sense to solve this problem using such data structures in Java, or how should I proceed?
Problem
I have a text file containing records which contain multiple lines. One record is anything between <SUBBEGIN and <SUBEND.
The lines in the record follow no predefined order and may be absent from a record. In the input file (see below) I am only interested in the MSISDN, CB, CF and ODBIC fields.
For each of these fields I would like to apply regular expressions to extract the value to the right of the equals sign.
The output file would be a comma-separated file containing these values, for example:
MSISDN=431234567893 the value 431234567893 is written to the output file
error checking
NoMSISDNnofound when no MSISDN is found in a record
noCFUALLPROVNONE when no CFU-ALL-PROV-NONE is found in a record
Search and replace operations
CFU-ALL-PROV-NONE should be replaced by CFU-ALL-PROV-1/1/1
CFU-TS10-ACT-914369223311 should be replaced by CFU-TS10-ACT-1/1/0/4369223311
Output for first record
431234567893,BAOC-ALL-PROV,BOIC-ALL-PROV,BOICEXHC-ALL-PROV,BICROAM-ALL-PROV,CFU-ALL-PROV-1/1/1,CFB-ALL-PROV-1/1/1,CFNRY-ALL-PROV-1/1/1,CFNRY-ALL-PROV-1/1/1,CFU-TS10-ACT-1/1/1/4369223311,BAIC,BAOC
Input file
<BEGINFILE>
<SUBBEGIN
IMSI=11111111111111;
MSISDN=431234567893;
CB=BAOC-ALL-PROV;
CB=BOIC-ALL-PROV;
CB=BOICEXHC-ALL-PROV;
CB=BICROAM-ALL-PROV;
IMEISV=4565676567576576;
CW=CW-ALL-PROV;
CF=CFU-ALL-PROV-NONE-YES-NO-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFB-ALL-PROV-NONE-YES-YES-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFNRY-ALL-PROV-NONE-YES-YES-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFNRC-ALL-PROV-NONE-YES-NO-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFU-TS10-ACT-914369223311-YES-YES-25-YES-65535-YES-YES-NO-NO-NO-YES-YES-
YES-YES-NO;
ODBIC=BAIC;
ODBOC=BAOC;
ODBROAM=ODBOHC;
ODBPRC=ENTER;
ODBPRC=INFO;
ODBPLMN=NONE;
ODBPOS=NOBPOS-BOTH;
ODBECT=OdbAllECT;
ODBDECT=YES;
ODBMECT=YES;
ODBPREMSMS=YES;
ODBADULTSMS=YES;
<SUBEND
<SUBBEGIN
IMSI=11111111111133;
MSISDN=431234567899;
CB=BAOC-ALL-PROV;
CB=BOIC-ALL-PROV;
CB=BOICEXHC-ALL-PROV;
CB=BICROAM-ALL-PROV;
CW=CW-ALL-PROV;
CF=CFU-ALL-PROV-NONE-YES-NO-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO+-NO-NO;
CF=CFB-ALL-PROV-NONE-YES-YES-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFNRY-ALL-PROV-NONE-YES-YES-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFNRC-ALL-PROV-NONE-YES-NO-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFU-TS10-ACT-914369223311-YES-NO-NONE-YES-65535-YES-YES-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFD-TS10-REG-91430000000-YES-YES-25-YES-65535-YES-YES-NO-NO-NO-YES-YES-YES-YES-NO;
ODBIC=BICCROSSDOMESTIC;
ODBOC=BAOC;
ODBROAM=ODBOH;
ODBPRC=INFO;
ODBPLMN=PLMN1
ODBPLMN=PLMN3;
ODBPOS=NOBPOS-BOTH;
ODBECT=OdbAllECT;
ODBDECT=YES;
ODBMECT=YES;
ODBPREMSMS=NO;
ODBADULTSMS=YES;
<SUBEND
From what I understand, you are simply reading a text file and processing it and maybe replacing some words. You do not therefore need a data structure to store the words in. Instead you can simply read the file line by line and pass it through a bunch of if statements (maybe a couple booleans to check if the specific parameters you are searching for have been found?) and then rewrite the line you want to a new file.
When dealing with big files to feed data into machine learning algorithms, I read the whole file contents into a variable and then used the String.split("delimiter") method (available since Java 1.4) to break the contents into a one-dimensional array, where each cell holds the text that came before a delimiter.
First read the file via a Scanner or whatever way you prefer (let content be the variable holding your data), and then break it up with
content.split("<SUBEND");
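A rough sketch of how that split could be combined with a regular expression to pull out the fields named in the question. The class name, method name and pattern are illustrative assumptions, and the search-and-replace rules and error markers from the question are deliberately left out:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordSplitSketch {
    // Matches lines such as "MSISDN=431234567893;" and captures the value
    // to the right of the equals sign. Only the four keys of interest match.
    private static final Pattern FIELD =
            Pattern.compile("^(MSISDN|CB|CF|ODBIC)=([^;\\r\\n]*);?[ \\t]*$", Pattern.MULTILINE);

    // Returns one comma-separated line per record found in the content.
    static List<String> toCsvLines(String content) {
        List<String> lines = new ArrayList<>();
        for (String record : content.split("<SUBEND")) {
            // Skip fragments that are not records (e.g. trailing whitespace).
            if (!record.contains("<SUBBEGIN")) {
                continue;
            }
            List<String> values = new ArrayList<>();
            Matcher m = FIELD.matcher(record);
            while (m.find()) {
                values.add(m.group(2));
            }
            lines.add(String.join(",", values));
        }
        return lines;
    }
}
```

Values are emitted in the order they appear in the record. A value that wraps onto a second line, as one CF line in the sample input does, is only captured up to the line break in this sketch.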

how to find whether a substring in file is already present in hashmap?

I have a HashMap (a Guava BiMap) in which both keys and values are strings. I want to write a program which parses a given file and replaces every string in the file that is also a key in the BiMap with the corresponding value from the BiMap.
For example, I have a file called test.txt with the following text:
Java is a set of several computer software and specifications developed by Sun Microsystems.
and my BiMap has
"java i" => "value1"
"everal computer" => "value2" etc..
So now I want my program to take test.txt and the BiMap as input and give an output which looks something like this:
value1s a set of svalue2 software and specifications developed by Sun Microsystems.
Please point me towards any algorithm which can do this. The program takes large files as input, so brute force may not be a good idea.
Edit: I'm using fixed length strings for keys and values.
That example was just intended to show the operation.
Thanks.
For a batch operation like this, I would avoid holding a lot of data in memory. Therefore I'd recommend writing the new content into a new file. If the result must end up in the exact same file, you can still replace one file with the other at the end of the process. Read, write and flush each new line separately, and you won't have any memory issues.
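A minimal sketch of that line-by-line approach. It assumes a plain java.util.Map (a Guava BiMap implements Map, so it could be passed in directly), case-sensitive matching, and a naive per-key replacement; it illustrates the streaming I/O only and is not an efficient multi-pattern matcher. The class and method names are hypothetical:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.Map;

class StreamingReplaceSketch {
    // Applies every replacement to a single line. Replacements are applied
    // one after another, so text inserted by one key can be matched by a later key.
    static String replaceInLine(String line, Map<String, String> replacements) {
        for (Map.Entry<String, String> e : replacements.entrySet()) {
            line = line.replace(e.getKey(), e.getValue());
        }
        return line;
    }

    // Copies the input to the output one line at a time, so only the
    // current line is held in memory.
    static void copyWithReplacements(Reader in, Writer out, Map<String, String> replacements)
            throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(replaceInLine(line, replacements));
            writer.newLine();
        }
        writer.flush();
    }
}
```

Because the file is processed per line, a key that spans a line break would not be found by this sketch.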

What type of Trie is this?

I want to add words to an open-source Java word splitting program for Khmer (a language that does not have spaces between words). The developers have not worked on it in a long time, and I haven't been able to contact them for details (http://sourceforge.net/projects/khmer/files/Khmer%20Word%20Breaking/Khmer%20Word%20Breaking%20program%20V1.0/). Supposedly the list was created from a Khmer dictionary, and I would like to re-create the file to include more words.
Can anyone identify what format the word dictionary is in (I believe it is some type of Trie)? Here are the first few lines:
0ឳមអគណជយឍឫហកដពទឱលថឦឡញឩខនឧផប។ឋវឭឈឃឥឌឰឪសងចភធឯតឆរ
1ទ
0ក
1
1ីែមគួណជយ៍ៀហកទុលេញ៉ឺនំឹៃូឈឃោាឿសងចិ្ធើតៅរ
1គនសងរ
0ទ
0ា
0យ
0ព
0ន
1
1រ
0ា
0ស
0ី
1
And does anyone know how I would go about making a new one (I have a large wordlist, but I am not sure how to get it into this format).
Thanks!
After a quick look through the code, I have a theory.
Create a SearchTree which extends TreeItem. For each word in your dictionary, call addWord from TreeItem. When the iteration is done, call export on SearchTree. Use the new file as the word input file.
Additionally, there may be an undocumented parameter for khwrdbrk.jar, --create, that will read the words for the new tree from standard input.
Again, just a theory, but let me know what happens if you test it out.

is there a dictionary i can download for java?

Is there a dictionary I can download for Java?
I want to have a program that takes a few random letters and sees if they can be rearranged into a real word by checking them against the dictionary.
"Is there a dictionary I can download for Java?"
Others have already answered this... Maybe you weren't simply talking about a dictionary file but about a spellchecker?
"I want to have a program that takes a few random letters and sees if they can be rearranged into a real word by checking them against the dictionary."
That is different. How fast do you want this to be? How many words in the dictionary and how many words, up to which length, do you want to check?
In case you want a spellchecker (which is not entirely clear from your question), Jazzy is a spellchecker for Java that has links to a lot of dictionaries. It's not bad, but the various implementations are horribly inefficient (it's OK for small dictionaries, but it's an amazing waste when you have several hundred thousand words).
Now if you just want to solve the specific problem you describe, you can:
parse the dictionary file and create a map : (letters in sorted order, set of matching words)
then for any number of random letters: sort them, see if you have an entry in the map (if you do the entry's value contains all the words that you can do with these letters).
abracadabra : (aaaaabbcdrr, (abracadabra))
carthorse : (acehorrst, (carthorse) )
orchestra : (acehorrst, (carthorse,orchestra) )
etc...
Now you take, say, nine random letters and get "hsotrerca"; you sort them to get "acehorrst", and using that as a key you get all the (valid) anagrams...
This works because what you described is a special (easy) case: all you need is sort your letters and then use an O(1) map lookup.
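The sorted-letters index described above could be sketched as follows. The class and method names are made up for illustration, and lowercasing the input is an added assumption:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class AnagramIndexSketch {
    // Returns the letters of a word in sorted order, e.g. "orchestra" -> "acehorrst".
    static String sortLetters(String s) {
        char[] chars = s.toLowerCase(Locale.ROOT).toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // Maps each sorted-letters key to the set of dictionary words that share it.
    static Map<String, Set<String>> buildIndex(Collection<String> words) {
        Map<String, Set<String>> index = new HashMap<>();
        for (String word : words) {
            index.computeIfAbsent(sortLetters(word), k -> new TreeSet<>()).add(word);
        }
        return index;
    }

    // Looks up all dictionary words that use exactly the given letters.
    static Set<String> anagramsOf(Map<String, Set<String>> index, String letters) {
        return index.getOrDefault(sortLetters(letters), Collections.emptySet());
    }
}
```

With an index built from "abracadabra", "carthorse" and "orchestra", looking up "hsotrerca" yields both "carthorse" and "orchestra".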
To handle more complicated spell checking, where there may be errors, you need something to come up with "candidates" (words that may be correct but misspelled) [like, say, using the soundex, metaphone or double metaphone algos] and then use things like the Levenshtein edit-distance algorithm to check candidates versus known good words (or the much more complicated tree made of Levenshtein edit distances that Google uses for its "find as you type"):
http://en.wikipedia.org/wiki/Levenshtein_distance
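For reference, a compact sketch of the Levenshtein distance using the standard two-row dynamic programming formulation. This is a generic textbook version with a hypothetical class name, not the code of any particular spellchecker:

```java
class LevenshteinSketch {
    // Returns the minimum number of single-character insertions, deletions
    // and substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // substitution or match
            }
            // Swap rows so that prev holds the row just computed.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

A candidate word would then be accepted if its distance to a known good word is below some small threshold; the threshold is a tuning choice.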
As a funny sidenote, optimized dictionary representations can store hundreds of thousands and even millions of words in less than 10 bits per word (yup, you've read correctly: less than 10 bits per word) and yet allow very fast lookup.
Dictionaries are usually programming-language agnostic. If you search for one without using the keyword "java", you may get better results. For example, searching for "free dictionary download" turns up dicts.info, among others.
OpenOffice dictionaries are easy to parse line-by-line.
You can read it in memory (remember it's a lot of memory):
List<String> words = IOUtils.readLines(new FileInputStream("dicfile.txt"), StandardCharsets.UTF_8); (from commons-io; adjust the charset to match your file)
Thus you get a List of all words. Alternatively you can use the LineIterator if you encounter memory problems.
If you are on a Unix-like OS, look in /usr/share/dict.
Here's one:
http://java.sun.com/docs/books/tutorial/collections/interfaces/examples/dictionary.txt
You can use the standard Java file handling to read the word on each line:
http://www.java-tips.org/java-se-tips/java.io/how-to-read-file-in-java.html
Check out - http://sourceforge.net/projects/test-dictionary/, it might give you some clue
I am not sure if there are any such libraries available for download! But I guess you can definitely dig through sourceforge.net to see if there are any, or how people have used dictionaries - http://sourceforge.net/search/?type_of_search=soft&words=java+dictionary

Are there some better ways to implement find as you type in Java with a fairly small data set?

I've got about 2500 short phrases in a file. I want to be able to find phrases as I type possible substrings of them. My app has a text box and a list of phrases. The text box is initially empty and the list contains all 2500 phrases, since the empty string is a substring of all of them. As I type in the text box, the list updates so that it always only contains phrases which contain the text box's value as a substring.
At the moment I have one of Google's Multimaps, specifically:
LinkedHashMultimap<String, String>
with every single possible substring mapped to its possible matches. This takes a while to load (about a second) and I think it must be taking up quite a bit of space (which may be a concern in the future.) It's very fast with the lookups though.
Is there a way I could do this with some other data structure or strategy that would be quicker to load and take less space (possibly at the expense of the speed of the lookups)?
If your list only contains 2500 elements, a simple loop and checking contains() on all of them should be fast enough.
If it grows bigger and/or is too slow, you can apply some easy optimizations:
Don't search immediately as the user types each character, but introduce some delay. So if he types "foobar" really fast, you only search for "foobar", not first "f" then "fo" then "foo",...
Reuse your previous results: if the user first types "foo" and then extends that to "foobar", don't search in the whole original list again, but search inside the results for "foo" (because everything that contains "foobar" must contain "foo").
In my experience, these basic optimizations already get you quite far.
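A small sketch of the plain contains() loop combined with reuse of the previous results. The class is hypothetical, the matching is case-sensitive, and the typing delay is left out:

```java
import java.util.ArrayList;
import java.util.List;

class IncrementalFilterSketch {
    private final List<String> all;      // every phrase
    private List<String> current;        // matches for the last query
    private String lastQuery = "";       // the empty query matches everything

    IncrementalFilterSketch(List<String> phrases) {
        this.all = new ArrayList<>(phrases);
        this.current = this.all;
    }

    // Returns the phrases containing the query as a substring.
    List<String> update(String query) {
        // If the new query contains the previous one, every match for the new
        // query is already among the previous results, so search only those.
        List<String> base = query.contains(lastQuery) ? current : all;
        List<String> matches = new ArrayList<>();
        for (String phrase : base) {
            if (phrase.contains(query)) {
                matches.add(phrase);
            }
        }
        current = matches;
        lastQuery = query;
        return matches;
    }
}
```

When the user deletes characters or types something unrelated, the sketch falls back to scanning the full list, which for 2500 phrases should still be cheap.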
Now, if the list grows so big that even that is too slow, some "smarter" optimizations as proposed in other answers here (tries, suffix trees,...) would be needed.
You'll want to look into using the Trie data structure.
Try simply looping over the entire list and calling contains() - doing that 2500 times is probably completely unnoticeable.
You definitely need a Suffix Tree. (wiki)
(I think this implementation could be OK: link)
EDIT:
I've read your comment. You shouldn't blindly check if the string is a substring somewhere in your phrase; you usually start with a word, not with a space. So maybe it's better to tokenize the words inside your phrase?
Are you allowed to do that? Otherwise the best way is to build an automaton for every phrase or to use similar algorithms (for example the Karp-Rabin string search algorithm).
Wouter Coekaerts has a good approach, but I would go a bit further.
Don't bring up anything when the textbox contains a single character. The results won't be useful. You may find that this is true for two characters as well.
Precompute the results for two characters. When there are two characters bring up the precomputed list.
When a third character is added do the 'contains' search on the list you have currently displayed (anything that doesn't contain c1c2 can't contain c1c2c3). By now the list should be small enough that 'contains' has perfectly adequate performance.
Similarly for four characters etc.
As said above, put in a little delay before starting the search. Or better still arrange for a search to be killed if another character is typed before it finishes.
