Markov Chain Text Generation - java

We were just assigned a new project in my data structures class -- generating text with Markov chains.
Overview
Given an input text file, we create an initial seed of length n characters. We add that to our output string and choose our next character based on frequency analysis.
This is the cat and there are two dogs.
Initial seed: "Th"
Possible next letters -- i, e, e
Therefore, probability of choosing i is 1/3, e is 2/3.
Now, say we choose i. We add "i" to the output string. Then our seed becomes "hi" and the process continues.
My solution
I have 3 classes: Node, ConcreteTrie, and Driver.
Of course, the ConcreteTrie class isn't a trie in the traditional sense. Here is how it works:
Given the sentence with k=2:
This is the cat and there are two dogs.
I generate Nodes Th, hi, is, ... + ... , gs, s.
Each of these nodes has children that are the letters that follow them. For example, Node Th would have children i and e. I maintain counts in each of those nodes so that I can later generate the probabilities for choosing the next letter.
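For illustration only, here is a minimal sketch of that idea in Java; the names are mine and this is not the poster's actual Node/ConcreteTrie/Driver code. Each k-character context maps to the counts of the characters seen after it, and the next character is drawn in proportion to those counts.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Random;

    public class MarkovCounts {
        private final int k;
        private final Map<String, Map<Character, Integer>> counts = new HashMap<>();
        private final Random rng = new Random();

        public MarkovCounts(int k) { this.k = k; }

        public void train(String text) {
            // Lowercase so that "Th" and "th" count as the same context, as in the example.
            text = text.toLowerCase();
            for (int i = 0; i + k < text.length(); i++) {
                String context = text.substring(i, i + k);
                char next = text.charAt(i + k);
                counts.computeIfAbsent(context, c -> new HashMap<>())
                      .merge(next, 1, Integer::sum);
            }
        }

        // Weighted random choice among the observed followers of the context.
        public Character next(String context) {
            Map<Character, Integer> followers = counts.get(context);
            if (followers == null) return null;
            int total = followers.values().stream().mapToInt(Integer::intValue).sum();
            int r = rng.nextInt(total);
            for (Map.Entry<Character, Integer> e : followers.entrySet()) {
                r -= e.getValue();
                if (r < 0) return e.getKey();
            }
            return null;   // unreachable
        }

        public static void main(String[] args) {
            MarkovCounts model = new MarkovCounts(2);
            model.train("This is the cat and there are two dogs.");
            System.out.println(model.next("th"));   // 'e' with probability 2/3, 'i' with probability 1/3
        }
    }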
My question:
First of all, what is the most efficient way to complete this project? My solution seems to be very fast, but I really want to knock my professor's socks off. (On my last project, a variation of the edit distance problem, I did A*, a genetic algorithm, BFS, and simulated annealing -- and I know that the problem is NP-hard.)
Second, what's the point of this assignment? It doesn't really seem to relate to much of what we've covered in class. What are we supposed to learn?

On the relevance of this assignment to what you covered in class (your second question): the idea of a 'data structures' class is to expose students to the very many structures frequently encountered in CS (lists, stacks, queues, hashes, trees of various types, graphs at large, matrices of various creed and greed, etc.) and to provide some insight into their common implementations, their strengths and weaknesses, and generally their various fields of application.
Since most any game / puzzle / problem can be mapped to some set of these structures, there is no lack of subjects upon which to base lectures and assignments. Your class seems interesting because while keeping some focus on these structures, you are also given a chance to discover real applications.
For example, in a thinly disguised fashion, the "cat and two dogs" thing is an introduction to statistical models applied to linguistics. Your curiosity and motivation prompted you to make the connection with Markov models, and that's a good thing, because chances are you'll meet "Markov" a few more times before graduation ;-) and certainly in a professional life in CS or a related domain. So, yes! It may seem that you're butterflying around many applications, but so long as you get a feel for which structures and algorithms to select in particular situations, you're not wasting your time!
Now, a few hints on possible approaches to the assignment
The trie seems like a natural support for this type of problem. Maybe you can ask yourself, however, how this approach would scale if you had to index, say, a whole book rather than this short sentence. It seems to scale mostly linearly, although this depends on how each choice is made over the three hops in the trie (for this 2nd-order Markov chain): as the number of choices increases, picking a path may become less efficient.
A possible alternative storage for building the index is a stochastic matrix (actually a 'plain', if sparse, matrix during the statistics-gathering process, turned stochastic at the end when you normalize each row, or column, depending on how you set it up, so that it sums to one (100%)). Such a matrix would be roughly 729 x 28, and would allow indexing, in one single operation, a two-letter tuple and its associated following letter. (I got 28 by including the "start" and "stop" signals; details...)
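To make the matrix idea concrete, here is a minimal sketch under my own simplifying assumptions: 27 symbols (letters plus a catch-all space bucket), a row per two-letter context (27 x 27 = 729 rows) and an extra "stop" column (28 columns); the "start" signal is left out for brevity. Normalizing a row turns the counts into the transition probabilities of the 2nd-order chain.

    public class TransitionMatrix {
        static final int SYMBOLS = 27;     // 'a'..'z' plus a space/other bucket
        static final int STOP = 27;        // extra column: end of text
        final int[][] counts = new int[SYMBOLS * SYMBOLS][SYMBOLS + 1];

        static int code(char c) {
            char lower = Character.toLowerCase(c);
            return (lower >= 'a' && lower <= 'z') ? lower - 'a' : 26;   // everything else lumped with space
        }

        // Row index for the two-character context "c1 c2".
        static int row(char c1, char c2) {
            return code(c1) * SYMBOLS + code(c2);
        }

        void index(String text) {
            for (int i = 0; i + 2 <= text.length(); i++) {
                int r = row(text.charAt(i), text.charAt(i + 1));
                int col = (i + 2 < text.length()) ? code(text.charAt(i + 2)) : STOP;
                counts[r][col]++;
            }
        }

        // Probability of 'next' following the context "c1 c2" (row-normalized on the fly).
        double probability(char c1, char c2, char next) {
            int[] rowCounts = counts[row(c1, c2)];
            int total = 0;
            for (int c : rowCounts) total += c;
            return total == 0 ? 0.0 : (double) rowCounts[code(next)] / total;
        }

        public static void main(String[] args) {
            TransitionMatrix m = new TransitionMatrix();
            m.index("This is the cat and there are two dogs.");
            System.out.println(m.probability('t', 'h', 'e'));   // 2/3 for the example sentence
        }
    }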
The cost of this more efficient indexing is the use of extra space. Space-wise, the trie is very efficient, storing only the combinations of letter triplets actually in existence; the matrix, however, wastes some space (you can bet that in the end it will be very sparsely populated, even after indexing much more text than the "dog/cat" sentence).
This size vs. CPU compromise is very common, although some algorithms/structures are sometimes better than others on both counts... Furthermore, the matrix approach wouldn't scale nicely, size-wise, if the problem were changed to base the choice of letters on the preceding, say, three characters.
Nonetheless, maybe look into the matrix as an alternate implementation. It is very much in the spirit of this class to try various structures and see why/where they are better than others (in the context of a specific task).
A small side trip you can take is to create a tag cloud based on the probabilities of the letter pairs (or triplets): both the trie and the matrix contain all the data necessary for that; the matrix, with all its interesting properties, may be more suited for this.
Have fun!

You are using a bigram approach with characters, but it is usually applied to words, because the output will be more meaningful if we use just a simple generator, as in your case.
1) From my point of view, you are doing it right. But maybe you should try to slightly randomize the selection of the next node? E.g. select a random node from the 5 highest (a small sketch of this follows below). I mean, if you always select the node with the highest probability, your output string will be too uniform.
2) I've done exactly the same homework at my university. I think the point is to show students that Markov chains are powerful, but that without extensive study of the application domain, the output of the generator will be ridiculous.
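A small sketch of the top-5 idea from point 1), assuming the follower counts come from a structure like the count map sketched earlier (the class and method names are hypothetical):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    public class TopFivePicker {
        // Pick uniformly among the (up to) 5 most frequent followers instead of always
        // taking the single most probable one, so the output is less uniform.
        static char randomOfTopFive(Map<Character, Integer> followerCounts, Random rng) {
            List<Map.Entry<Character, Integer>> sorted = new ArrayList<>(followerCounts.entrySet());
            sorted.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));   // most frequent first
            int limit = Math.min(5, sorted.size());
            return sorted.get(rng.nextInt(limit)).getKey();
        }
    }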

Related

Algorithm to remove words in corpus with small occurrence

I have a large (+/- 300,000 lines) dataset of text fragments that contains some noisy elements. By noisy I mean slang words, typos, etc. I wish to filter out these noisy elements to obtain a cleaner dataset.
I read some papers that propose filtering these out by keeping track of the occurrence count of each word. By setting a threshold (e.g. fewer than 20 occurrences) we can assume these words are noise and can safely be removed from the corpus.
Maybe there are some libraries or algorithms that do this in a fast and efficient way. Of course I tried it myself first, but this is EXTREMELY slow!
So to summarize, I am looking for an algorithm that can filter out words that occur less often than a particular threshold, in a fast and efficient way. Here is a small example:
This is just an example of whaat I wish to acccomplish.
The words 'whaat' and 'acccomplish' are misspelled and thus likely to occur less often (if we assume we live in a perfect world and typos are rare...). I wish to end up with
This is just an example of I wish to.
Thanks!
PS: If possible, I'd like to have an algorithm in Java (or pseudo-code so I can write it myself)
I think you are overcomplicating it with the approach suggested in the comments.
You can do it with 2 passes on the data:
Build a histogram: a Map<String,Integer> that counts the number of occurrences of each word
For each word, print it to the new 'clean' file if and only if map.get(word) > THRESHOLD
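A minimal sketch of those two passes, assuming whitespace-separated words; the file names and the threshold are placeholders:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.HashMap;
    import java.util.Map;

    public class CorpusFilter {
        static final int THRESHOLD = 20;

        public static void main(String[] args) throws IOException {
            Map<String, Integer> histogram = new HashMap<>();

            // Pass 1: count the occurrences of every word.
            try (BufferedReader in = new BufferedReader(new FileReader("corpus.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    for (String word : line.split("\\s+")) {
                        if (!word.isEmpty()) histogram.merge(word, 1, Integer::sum);
                    }
                }
            }

            // Pass 2: keep a word only if it occurs at least THRESHOLD times.
            try (BufferedReader in = new BufferedReader(new FileReader("corpus.txt"));
                 PrintWriter out = new PrintWriter(new FileWriter("clean.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    StringBuilder cleaned = new StringBuilder();
                    for (String word : line.split("\\s+")) {
                        if (histogram.getOrDefault(word, 0) >= THRESHOLD) {
                            if (cleaned.length() > 0) cleaned.append(' ');
                            cleaned.append(word);
                        }
                    }
                    out.println(cleaned);
                }
            }
        }
    }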
As a side note, I think a fixed threshold approach is not the best choice; I personally would filter words that occur less than MEAN - 3*STD times, where MEAN is the average occurrence count of a word and STD is the standard deviation. (3 standard deviations means you are catching words that fall outside the expected normal distribution with a probability of ~99%.) You can 'play' with the constant factor and find what best suits your needs.

Genetic Algorithm - Grouping people: Only find solutions containing criteria X,Y and Z

I am trying to solve the following problem:
I have a list of 30 people.
These people need to be divided into groups of 6.
Each person has given the names of 3 other people who they would like to be in a group with.
I thought of solving this problem using a genetic algorithm.
The fitness function could evaluate all the groups, and assign a fitness score based on how many people per room have all their preferences met. (or a scoring system similar to that)
Example:
One of the generated solutions is: 1,3,19,5,22,2,7,8,11,12,13,14,15,13,17....etc
I would assume the first 5 people are in the first group, and the next 5 in the next group, and calculate a fitness value from that.
I think that this solution would work - does anyone see a better way of doing this?
My main question is this:
If I want to make sure person A and B are definitely in the same group, I could implement the fitness function to check for this and assign a terrible fitness if this condition isn't met. Is this the best way to do it? It seems quite inefficient.
Is there a way to 'lock' certain parts of the solution ("certain genes") and just solve for the remainder?
Any help or insights will be appreciated.
Thanks in advance.
AK
Just to clarify a bit, your problem isn't about genetic programming but about genetic algorithms, which are two different things. Genetic programming is about generating (using evolutionary algorithms) executable individuals that will generate your solutions, while in genetic algorithms the individuals directly represent your solutions.
That being said, your two assumptions are correct. Data representation is a key element of evolutionary algorithms in general, and a bad representation may hinder efficient exploration of the solution space. Your current data representation seems correct to me, given that groups are only allowed to have exactly 5 individuals. Your second thought about the way to enforce some criteria is also right. Assigning a large fitness value (preferably one that can't represent a potentially valid, even if bad, solution), such as infinity (if your library / language allows it easily), is the preferred way to express invalid solutions in the literature. This has multiple advantages over simply deleting invalid individuals: during the selection stage, bad individuals won't be selected, and thus the solution space they represent won't be explored as much as interesting ones, which is computationally good because it surely won't contain optimal solutions. Knowing a solution is bad is good knowledge, after all. At the same time, genetic diversity is really important in evolutionary algorithms in order to avoid stagnation, so at least some bad individuals should be kept for the sake of genetic diversity, in order to explore the solution spaces between currently represented zones.
The goal of genetic algorithms is to compute solutions that are either impossible or too hard to compute analytically or by brute force. Trying to dynamically lock down some genes with heuristics would require much knowledge about the inner workings of your problem as well as the underlying evolution mechanisms, and would defeat the purpose of using evolutionary algorithms. The effective goal of evolutionary algorithms is to lock down genes that seem correct.
In fact, if you are a priori absolutely certain that some given genes must have a given value, don't represent them in your individuals. For instance, make your first group 3 individuals long if you are sure that the 2 others must have some given value. You can then code your evaluation function as if there were 5 individuals in the first group, but you won't be evolving / searching to replace the 2 fixed ones.
What does your crossover operation look like? The way you have it laid out in your description, I'm not sure how you would implement it cleanly. For instance, if you have two solutions:
1, 2, 3, 4, 5, ....., 30
and
1, 2, 30, 29,......,10
Assuming you're using a single-point crossover function, you would have the potential to get multiple assignments for the same people, and other people not assigned at all, using the genomes above.
I would use a genome with 30 values, where each value defines a person's group assignment (1-6). It would look like 656324113255632...etc. So person 1 is assigned group 6, person 2 group 5, etc. This makes the crossover operation easier to implement, because any result of the crossover is automatically a valid assignment (every person gets exactly one group), even if it is not optimal.
The fitness function would assign a penalty for each group not having the proper number of members (5), and additional penalties for group member assignments that are suboptimal. I would make the first penalty significantly larger than the second, and then adjust these to get the results you're looking for.
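A minimal sketch of that representation and fitness in Java, with illustrative penalty constants (6 groups of 5 as described; groups are numbered 0-5 here, and all names are my own):

    public class GroupingFitness {
        static final int PEOPLE = 30, GROUPS = 6, GROUP_SIZE = 5;
        static final double SIZE_PENALTY = 100.0;   // a group with the wrong number of members
        static final double PREF_PENALTY = 1.0;     // an unmet preference

        // genome[p] is the group assigned to person p; prefs[p] holds the 3 people
        // that person p wants to be grouped with.
        static double penalty(int[] genome, int[][] prefs) {
            double total = 0;

            // Penalize groups that do not have exactly GROUP_SIZE members.
            int[] sizes = new int[GROUPS];
            for (int g : genome) sizes[g]++;
            for (int size : sizes) total += SIZE_PENALTY * Math.abs(size - GROUP_SIZE);

            // Penalize every preference that is not satisfied.
            for (int p = 0; p < PEOPLE; p++) {
                for (int friend : prefs[p]) {
                    if (genome[p] != genome[friend]) total += PREF_PENALTY;
                }
            }
            return total;   // lower is better; fitness could be -total or 1/(1+total)
        }
    }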
This can be modeled as a generalized quadratic assignment problem (GQAP). This problem lets you specify a number of pieces of equipment (people) that demand a certain capacity, a number of locations (groups) that offer a capacity, a weights matrix that specifies the closeness between pieces of equipment, and a distance matrix specifying the distance between locations. Additionally, there are installation costs, but these are not required for your problem. I have implemented this problem in HeuristicLab. It's not part of the trunk, but I can send you the plugin if you're interested (or you can compile it yourself).
It seems that the most challenging part of using a genetic algorithm for this problem is implementing the crossover. Here's how I would do it:
First choose a constant, C. C will stay constant throughout all generations, and I will explain its purpose in a moment.
I will use a smaller example than 5 groups of 6 to demonstrate this crossover, but the premise is the same. Say we have 2 parents, each consisting of 3 groups of 3. Let's make one [[1,2,3],[4,5,6],[7,8,9]], and the other [[9,4,3],[5,7,8],[6,1,2]].
1. Make a list of possible numbers (1 through the total number of people); in this case it is simply [1,2,3,4,5,6,7,8,9]. Remove 1 random number from the list. Let's say we remove 2. The list becomes [1,3,4,5,6,7,8,9].
2. Assign each remaining number a probability. The probability starts at 1 and goes up by C for any matches with the parents. For example, in parent 1, 3 and 2 are in the same group, so 3 would have a probability of 1+C. Same thing with 6, because it forms a match in parent 2. 1 would have a probability of 1+2C, because it is in the same group as 2 in both parents. Based on these probabilities, use a roulette-wheel type selection. Let's say we pick 6.
3. Now we have 2 and 6 in the same group. We similarly look for matches with these numbers and assign probabilities. For each parent, we add C if a candidate matches with only 2 or only 6, and 2C if it matches with both. Continue this until the group is done (for 3x3 this is the last selection, but for 5x6 there would be a few more).
4. Choose a new random number that has not been picked and continue for the other groups.
One of the good things about this crossover is that it basically includes mutation already: there are chances built in to group people who were not grouped together in either parent.
Credit: I adapted the idea from the Omicron Genetic Algorithm
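For what it's worth, here is a sketch of how that crossover could look in Java, under my own assumptions about the data layout (parents as lists of groups, a tunable constant C); this is an adaptation of the description above, not the Omicron implementation:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;
    import java.util.Set;

    public class GroupCrossover {
        private static final double C = 1.0;   // tunable bonus constant
        private static final Random RNG = new Random();

        // p1.get(g) and p2.get(g) are the sets of people in group g of each parent.
        public static List<List<Integer>> crossover(List<Set<Integer>> p1, List<Set<Integer>> p2,
                                                    int groupSize, int totalPeople) {
            List<Integer> remaining = new ArrayList<>();
            for (int person = 1; person <= totalPeople; person++) remaining.add(person);

            List<List<Integer>> child = new ArrayList<>();
            while (!remaining.isEmpty()) {
                List<Integer> group = new ArrayList<>();
                // Seed the group with a uniformly random unpicked person.
                group.add(remaining.remove(RNG.nextInt(remaining.size())));
                while (group.size() < groupSize && !remaining.isEmpty()) {
                    group.add(pickWeighted(remaining, group, p1, p2));
                }
                child.add(group);
            }
            return child;
        }

        // Roulette-wheel pick: weight = 1, plus C for every (parent, already-picked member)
        // pair in which the candidate shares a group with that member.
        private static int pickWeighted(List<Integer> remaining, List<Integer> group,
                                        List<Set<Integer>> p1, List<Set<Integer>> p2) {
            double[] weights = new double[remaining.size()];
            double total = 0;
            for (int i = 0; i < remaining.size(); i++) {
                int candidate = remaining.get(i);
                double w = 1.0;
                for (int member : group) {
                    if (sameGroup(p1, candidate, member)) w += C;
                    if (sameGroup(p2, candidate, member)) w += C;
                }
                weights[i] = w;
                total += w;
            }
            double r = RNG.nextDouble() * total;
            for (int i = 0; i < weights.length; i++) {
                r -= weights[i];
                if (r <= 0) return remaining.remove(i);
            }
            return remaining.remove(remaining.size() - 1);   // numeric fallback
        }

        private static boolean sameGroup(List<Set<Integer>> parent, int a, int b) {
            for (Set<Integer> g : parent) {
                if (g.contains(a) && g.contains(b)) return true;
            }
            return false;
        }

        public static void main(String[] args) {
            List<Set<Integer>> p1 = List.of(Set.of(1, 2, 3), Set.of(4, 5, 6), Set.of(7, 8, 9));
            List<Set<Integer>> p2 = List.of(Set.of(9, 4, 3), Set.of(5, 7, 8), Set.of(6, 1, 2));
            System.out.println(crossover(p1, p2, 3, 9));
        }
    }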

Addition Chains [duplicate]

How can you compute a shortest addition chain (sac) for an arbitrary n <= 600 within one second?
Notes
This is the programming competition on codility for this month.
Addition chains are numerically very important, since they are the most economical way to compute x^n (by consecutive multiplications).
Knuth's Art of Computer Programming, Volume 2, Seminumerical Algorithms has a nice introduction to addition chains and some interesting properties, but I didn't find anything that enabled me to fulfill the strict performance requirements.
What I've tried (spoiler alert)
Firstly, I constructed a (highly branching) tree (with the start 1-> 2 -> ( 3 -> ..., 4 -> ...)) such that for each node n, the path from the root to n is a sac for n. But for values >400, the runtime is about the same as for making a coffee.
Then I used that program to find some useful properties for reducing the search space. With that, I'm able to build all solutions up to 600 while making a coffee. But for n, I need to compute all solutions up to n. Unfortunately, codility measures the class initialization's runtime, too...
Since the problem is probably NP-hard, I ended up hard-coding a lookup table. But since codility asked to construct the sac, I don't know if they had a lookup table in mind, so I feel dirty and like a cheater. Hence this question.
Update
If you think a hard-coded, full lookup table is the way to go, can you give an argument why you think a full computation/partly computed solutions/heuristics won't work?
I have just got my Golden Certificate for this problem. I will not provide a full solution because the problem is still available on the site. I will instead give you some hints:
You might consider doing a depth-first search.
There exists a minimal star-chain for each n < 12509
You need to know how to prune your search space.
You need a good lower bound for the length of the chain you are looking for.
Remember that you need just one solution, not all.
Good luck.
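For what it's worth, here is a hedged sketch of the kind of pruned, iterative-deepening depth-first search those hints point toward (all names are mine, the lower bound and the pruning are deliberately simple, and this is certainly not the certified solution):

    import java.util.ArrayList;
    import java.util.List;

    public class AdditionChain {

        public static List<Integer> shortest(int n) {
            if (n == 1) return List.of(1);
            // floor(log2 n) is a valid lower bound: a chain can at most double per step.
            int limit = 31 - Integer.numberOfLeadingZeros(n);
            while (true) {
                int[] chain = new int[limit + 1];
                chain[0] = 1;
                if (dfs(chain, 0, limit, n)) {
                    List<Integer> result = new ArrayList<>();
                    for (int value : chain) result.add(value);
                    return result;
                }
                limit++;   // deepen and retry
            }
        }

        // Try to extend chain[0..depth] so that the chain reaches n within 'limit' steps.
        private static boolean dfs(int[] chain, int depth, int limit, int n) {
            if (chain[depth] == n) return true;
            if (depth == limit) return false;
            // Prune: even doubling at every remaining step cannot reach n.
            if ((long) chain[depth] << (limit - depth) < n) return false;
            // Star steps only (enough for this range, per the hints): the new element
            // is the last element plus some earlier element.
            for (int i = depth; i >= 0; i--) {
                int next = chain[depth] + chain[i];
                if (next > n || next <= chain[depth]) continue;
                chain[depth + 1] = next;
                if (dfs(chain, depth + 1, limit, n)) return true;
            }
            return false;
        }

        public static void main(String[] args) {
            System.out.println(shortest(15));   // [1, 2, 4, 5, 10, 15]
        }
    }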
Addition chains are numerically very important, since they are the most economical way to compute x^n (by consecutive multiplications).
This is not true. They are not always the most economical way to compute x^n. Graham et al. proved that:
If each step in an addition chain is assigned a cost equal to the product of the numbers at that step, "binary" addition chains are shown to minimize the cost.
The situation changes dramatically when we compute x^n (mod m), which is a common case, for example in cryptography.
Now, to answer your question. Apart from hard-coding a table with answers, you could try a Brauer chain.
A Brauer chain (aka star chain) is an addition chain where each new element is formed as the sum of the previous element and some earlier element (possibly the previous one itself). A Brauer chain is a sac for every n < 12509. Quoting Daniel J. Bernstein:
Brauer's algorithm is often called "the left-to-right 2^k-ary method", or simply "2^k-ary method". It is extremely popular. It is easy to implement; constructing the chain for n is a simple matter of inspecting the bits of n. It does not require much storage.
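As an illustration of the quoted method, here is a sketch of the k = 1 (plain left-to-right binary) case in Java; it just inspects the bits of n, doubling and adding one as it goes, and produces a valid, though not always shortest, addition chain:

    import java.util.ArrayList;
    import java.util.List;

    public class BinaryChain {
        static List<Long> chain(long n) {
            List<Long> chain = new ArrayList<>();
            long current = 1;
            chain.add(current);
            int highestBit = 63 - Long.numberOfLeadingZeros(n);
            for (int bit = highestBit - 1; bit >= 0; bit--) {
                current *= 2;                        // doubling step: x -> x + x
                chain.add(current);
                if (((n >> bit) & 1) == 1) {
                    current += 1;                    // add-one step: x -> x + 1
                    chain.add(current);
                }
            }
            return chain;
        }

        public static void main(String[] args) {
            System.out.println(chain(23));   // [1, 2, 4, 5, 10, 11, 22, 23]
        }
    }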
BTW. Does anybody know a decent C/C++ implementation of Brauer's chain computation? I'm working partially on a comparison of exponentiation times using binary and Brauer's chains for both cases: x^n and x^n (mod m).

Percentage Similarity Analysis (Java)

I have following situation:
String a = "A Web crawler is a computer program that browses the World Wide Web internet automatically";
String b = "Web Crawler computer program browses the World Wide Web";
Is there any idea or standard algorithm to calculate the percentage of similarity?
For instance, in the above case, the similarity estimated by looking manually should be 90%+.
My idea is to tokenize both Strings and compare the number of tokens matched. Something like
(7 tokens / 10 tokens) * 100. But, of course, this method is not effective at all. Comparing the number of characters matched also seems ineffective...
Can anyone give some guidelines???
Above is part of my project, Plagiarism Analyzer.
Hence, the words matched will be exactly the same, without any synonyms.
The only thing that matters in this case is how to calculate a reasonably accurate percentage of similarity.
Thanks a lot for any help.
As Konrad pointed out, your question depends heavily on what you mean by "similar".
In general, I would say the following guidelines should be of use:
normalize the input by reducing each word to its base form and lowercasing it
use a word frequency list (easily obtainable on the web) and make each word's "similarity relevance" inversely proportional to its position on the frequency list
calculate the total sentence similarity as the aggregated similarity of the words appearing in both sentences, divided by the total similarity relevance of the two sentences
You can refine the technique to include differences between word forms, sentence word order, synonym lists, etc. Although you'll never get perfect results, you have a lot of tweaking possibilities, and I believe that in general you can get quite valuable measures of similarity.
That depends on your idea of similarity. Formally, you need to define a metric of what you consider "similar" strings in order to apply statistics to them. Usually, this is done via the hypothetical question: "how likely is it that the first string is a modified version of the second string into which errors (e.g. from typing) were introduced?"
A very simple yet effective measure of such similarity (or rather, its inverse) is the edit distance of two strings, which can be computed using dynamic programming in O(nm) time in general, where n and m are the lengths of the strings.
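For reference, a standard dynamic-programming sketch of that edit distance (the classic Levenshtein recurrence, O(nm) time and space):

    public class EditDistance {
        public static int editDistance(String a, String b) {
            int n = a.length(), m = b.length();
            int[][] d = new int[n + 1][m + 1];
            for (int i = 0; i <= n; i++) d[i][0] = i;   // delete all of a's prefix
            for (int j = 0; j <= m; j++) d[0][j] = j;   // insert all of b's prefix
            for (int i = 1; i <= n; i++) {
                for (int j = 1; j <= m; j++) {
                    int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                                d[i][j - 1] + 1),     // insertion
                                       d[i - 1][j - 1] + substitution);
                }
            }
            return d[n][m];
        }

        public static void main(String[] args) {
            System.out.println(editDistance("kitten", "sitting"));   // 3
        }
    }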
Depending on your usage, more elaborate measures (or completely unrelated ones, such as the soundex metric) might be required.
In your case, if you straightforwardly apply a token match (i.e. mere word count) you will never get a > 90% similarity. To get such a high similarity in a meaningful way would require advanced semantic analysis. If you get this done, please publish the paper, because this is as yet a largely unsolved problem.
I second what Konrad Rudolph has already said.
Others may recommend different distance metrics. What I'm going to say accompanies those, but looks more at the problem of matching semantics.
Given what you seem to be looking for, I recommend that you apply some of the standard text processing methods. All of these have potential downfalls, so I list them in order of both application and difficulty to do well
Sentence splitting. Figure out your units of comparison.
stop-word removal: take out a, an, the, of, etc.
bag of words percentage: what percentage of the overall words match, independent of ordering (see the sketch after this list)
(much more aggressive) you could try synonym expansion, which counts synonyms as matched words.
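As a sketch of the stop-word removal and bag-of-words steps above, here is one way to read "percentage of the overall words": drop stop words, then take the share of the combined vocabulary that appears in both texts. The stop-word list is illustrative only.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class BagOfWords {
        private static final Set<String> STOP_WORDS =
                new HashSet<>(Arrays.asList("a", "an", "the", "of", "is", "that"));

        static Set<String> tokens(String text) {
            Set<String> result = new HashSet<>();
            for (String token : text.toLowerCase().split("\\W+")) {
                if (!token.isEmpty() && !STOP_WORDS.contains(token)) result.add(token);
            }
            return result;
        }

        // Percentage of the combined vocabulary present in both texts (Jaccard * 100).
        static double similarityPercent(String a, String b) {
            Set<String> ta = tokens(a), tb = tokens(b);
            Set<String> union = new HashSet<>(ta);
            union.addAll(tb);
            Set<String> intersection = new HashSet<>(ta);
            intersection.retainAll(tb);
            return union.isEmpty() ? 100.0 : 100.0 * intersection.size() / union.size();
        }

        public static void main(String[] args) {
            String a = "A Web crawler is a computer program that browses the World Wide Web internet automatically";
            String b = "Web Crawler computer program browses the World Wide Web";
            System.out.println(similarityPercent(a, b));   // roughly 78 for the two example sentences
        }
    }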
The problem with this question is that the similarity may be either a humanized similarity (as you say, "+- 90% similarity") or a statistical similarity (Konrad Rudolph's answer).
The human similarity can never be easily calculated. Consider, for instance, these three word pairs:
cellphone / mobile
car / automobile
message / post
The statistical similarity is very low, while the pairs are actually quite similar in meaning. Thus it will be hard to solve this problem, and the only thing I can point you to is Bayesian filtering or artificial intelligence with Bayesian networks.
One common measure is the Levenshtein distance, a special case of the string edit distance. It is also included in the Apache Commons StringUtils library.
The longest common subsequence is well known as a string dissimilarity metric; it can be implemented with dynamic programming.
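A dynamic-programming sketch of the LCS length; turning it into a dissimilarity (for example 1 - 2*LCS/(|a|+|b|)) is one possible choice, not a fixed convention:

    public class LcsSimilarity {
        static int lcs(String a, String b) {
            int n = a.length(), m = b.length();
            int[][] dp = new int[n + 1][m + 1];
            for (int i = 1; i <= n; i++) {
                for (int j = 1; j <= m; j++) {
                    if (a.charAt(i - 1) == b.charAt(j - 1)) {
                        dp[i][j] = dp[i - 1][j - 1] + 1;           // extend the common subsequence
                    } else {
                        dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                    }
                }
            }
            return dp[n][m];
        }

        public static void main(String[] args) {
            System.out.println(lcs("AGGTAB", "GXTXAYB"));   // 4 ("GTAB")
        }
    }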

Way to store a large dictionary with low memory footprint + fast lookups (on Android)

I'm developing an Android word game app that needs a large (~250,000 word) dictionary available. I need:
reasonably fast lookups, e.g. constant time preferable; I need to do maybe 200 lookups a second on occasion to solve a word puzzle, and maybe 20 lookups within 0.2 seconds more often to check words the user just spelled.
EDIT: Lookups are typically asking "Is <word> in the dictionary?". I'd like to support up to two wildcards in the word as well, but this is easy enough by just generating all possible letters the wildcards could have been and checking the generated words (i.e. 26 * 26 lookups for a word with two wildcards).
as it's a mobile app, using as little memory as possible and requiring only a small initial download for the dictionary data is top priority.
My first naive attempt used Java's HashMap class, which caused an out-of-memory exception. I've looked into using the SQLite databases available on Android, but this seems like overkill.
What's a good way to do what I need?
You can achieve your goals with more lowly approaches as well... if it's a word game, then I suspect you are handling a 27-letter alphabet. So suppose an alphabet of not more than 32 letters, i.e. 5 bits per letter. You can then cram 12 letters (12 x 5 = 60 bits) into a single Java long by using a trivial 5-bits-per-letter encoding.
This means that if you don't have words longer than 12 letters, you can just represent your dictionary as a set of Java longs. If you have 250,000 words, a trivial representation of this set as a single sorted array of longs should take 250,000 words x 8 bytes/word = 2,000,000 bytes ~ 2 MB of memory. Lookup is then by binary search, which should be very fast given the small size of the data set (fewer than 20 comparisons, since 2^20 takes you above one million).
IF you have words longer than 12 letters, then I would store the >12-letter words in another array, where 1 word would be represented by 2 concatenated Java longs in an obvious manner.
NOTE: the reason why this works, and is likely more space-efficient than a trie while being very simple to implement, is that the dictionary is constant... search trees are good if you need to modify the data set, but if the data set is constant, you can often get away with a simple binary search.
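A minimal sketch of that packing scheme (5 bits per letter, at most 12 lowercase letters per long, binary search over a sorted array); the class name and the handling of longer words are my own simplifications:

    import java.util.Arrays;

    public class PackedDictionary {
        private final long[] packedWords;   // sorted once at build time

        public PackedDictionary(String[] words) {
            packedWords = new long[words.length];
            for (int i = 0; i < words.length; i++) packedWords[i] = pack(words[i]);
            Arrays.sort(packedWords);
        }

        public boolean contains(String word) {
            return Arrays.binarySearch(packedWords, pack(word)) >= 0;
        }

        // Encode 'a'..'z' as 1..26 (5 bits per letter); 12 letters fit in 60 bits.
        static long pack(String word) {
            if (word.length() > 12) throw new IllegalArgumentException("use the >12-letter table");
            long packed = 0;
            for (int i = 0; i < word.length(); i++) {
                packed = (packed << 5) | (word.charAt(i) - 'a' + 1);
            }
            return packed;
        }

        public static void main(String[] args) {
            PackedDictionary dict = new PackedDictionary(new String[] {"cat", "dog", "crawler"});
            System.out.println(dict.contains("dog"));   // true
            System.out.println(dict.contains("cow"));   // false
        }
    }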
I am assuming that you want to check whether a given word belongs to the dictionary.
Have a look at a Bloom filter.
A Bloom filter can answer "does X belong to a predefined set" type queries with very small storage requirements. If the answer to a query is yes, it has a small (and adjustable) probability of being wrong; if the answer is no, then it is guaranteed to be correct.
According to the Wikipedia article, you would need less than 4 MB of space for your dictionary of 250,000 words with a 1% error probability.
The Bloom filter will correctly answer "is in dictionary" if the word actually is contained in the dictionary. If the dictionary does not contain the word, the Bloom filter may falsely answer "is in dictionary" with some small probability.
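For illustration, a from-scratch Bloom filter sketch (not a recommendation of any particular library); the second hash is deliberately cheap and only meant to show the mechanics:

    import java.util.BitSet;

    public class SimpleBloomFilter {
        private final BitSet bits;
        private final int size;
        private final int hashCount;

        public SimpleBloomFilter(int sizeInBits, int hashCount) {
            this.bits = new BitSet(sizeInBits);
            this.size = sizeInBits;
            this.hashCount = hashCount;
        }

        public void add(String word) {
            for (int index : indexes(word)) bits.set(index);
        }

        // "Maybe in the set": false positives are possible, false negatives are not.
        public boolean mightContain(String word) {
            for (int index : indexes(word)) {
                if (!bits.get(index)) return false;
            }
            return true;
        }

        // hashCount indexes derived from two base hashes via double hashing.
        private int[] indexes(String word) {
            int h1 = word.hashCode();
            int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;   // cheap second hash, illustrative only
            int[] result = new int[hashCount];
            for (int i = 0; i < hashCount; i++) {
                result[i] = Math.floorMod(h1 + i * h2, size);
            }
            return result;
        }

        public static void main(String[] args) {
            SimpleBloomFilter filter = new SimpleBloomFilter(4 * 1024 * 1024 * 8, 7);   // ~4 MB of bits
            filter.add("crawler");
            System.out.println(filter.mightContain("crawler"));   // true
            System.out.println(filter.mightContain("zzzzz"));     // almost certainly false
        }
    }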
A very efficient way to store a dictionary is a Directed Acyclic Word Graph (DAWG).
Here are some links:
Directed Acyclic Word Graph or DAWG description with sourcecode
Construction of the CDAWG for a Trie
Implementation of directed acyclic word graph
You'll be wanting some sort of trie. Perhaps a ternary search trie (TST) would be good, I think. They give very fast look-ups and low memory usage. This paper gives some more info about TSTs. It also talks about sorting, so not all of it will apply. This article might be a little more applicable. As the article says, TSTs
combine the time efficiency of digital tries with the space efficiency of binary search trees.
As this table shows, the look-up times are very comparable to using a hash table.
You could also use the Android NDK and do the structure in C or C++.
The devices that I worked on basically worked from a compressed binary file, with a topology that resembled the structure of a binary tree. At the leaves, you would have the Huffman-compressed text. Finding a node would involve having to skip to various locations of the file, and then load only the portion of the data really needed.
Very cool idea, as suggested by Antti Huima: store the dictionary words packed into longs, and then search using binary search.
