I have a program that is randomly generating sentences based on a bunch of text documents of all the nouns, verbs, adjectives, and adverbs. Does anyone know a way to determine if a noun/verb is plural or singular, or if there are any text documents that contain a list of singular nouns/verbs and plural nouns? I'm doing this all in Java, and I have a decent idea of how to get information off of a website, so if there are any websites that could do that as well, I'd also appreciate those.
I am afraid you cannot solve this with a fixed list of words, especially for verbs. Consider these sentences:
You are free. We are free.
In the first one, are is singular; in the second one, it is plural. Using a proper tagger, as @jdaz suggests, is the only way to do it reliably.
If you work with English or a few other supported languages, StanfordNLP is an excellent choice. If you need broad language coverage, you can use UDPipe, which is written in C++ but has a Java binding.
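If you do go the tagger route, a minimal sketch with Stanford CoreNLP could look like the following (assuming the stanford-corenlp dependency and its English models are on the classpath; the Penn Treebank tags NN/NNS and NNP/NNPS distinguish singular from plural nouns, and VBZ marks third-person-singular present verbs):

import java.util.Properties;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class NumberCheck {
    public static void main(String[] args) {
        // Tokenize, split sentences and tag parts of speech only.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument doc = new CoreDocument("The oxen eat grass. The ox eats grass.");
        pipeline.annotate(doc);

        for (CoreLabel token : doc.tokens()) {
            String tag = token.tag();                  // Penn Treebank POS tag
            boolean pluralNoun = tag.equals("NNS") || tag.equals("NNPS");
            boolean singularVerb = tag.equals("VBZ"); // third person singular present
            System.out.printf("%-8s %-5s plural-noun=%b singular-verb=%b%n",
                    token.word(), tag, pluralNoun, singularVerb);
        }
    }
}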
The first step would be to look it up in a list. For English you can reduce the size of the list by only including singular nouns, and then apply some basic string processing to find plurals: if your word ends in -s and is not in the list, cut off the -s and look again. If it is now in the list, it was a simple plural (car/cars). If not, continue. If it ends in -ies, remove that, append -y and look again. Now you will capture remedies/remedy. There are a number of such patterns you can use.
Some irregular nouns need to be in an exception list (ox/oxen), but there aren't that many. Some words of course are unspecified, like sheep, data, or police. Here you need to look at the context: if the noun is followed by a singular verb (e.g. eats, or is), then it would be singular as well.
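A minimal sketch of this rule-based lookup, where the word list and the exception map are tiny illustrative stand-ins for real dictionary files (Java 9+ for Set.of/Map.of):

import java.util.Map;
import java.util.Set;

public class NounNumber {
    // Tiny illustrative lists; a real program would load these from dictionary files.
    private static final Set<String> SINGULAR = Set.of("car", "remedy", "ox", "sheep");
    private static final Map<String, String> IRREGULAR_PLURALS =
            Map.of("oxen", "ox", "children", "child", "mice", "mouse");

    /** Returns the singular form if the word looks like a plural, or null otherwise. */
    static String singularOf(String word) {
        if (SINGULAR.contains(word)) {
            return null;                            // already known as singular
        }
        if (IRREGULAR_PLURALS.containsKey(word)) {
            return IRREGULAR_PLURALS.get(word);     // exception list (ox/oxen)
        }
        if (word.endsWith("ies")) {                 // remedies -> remedy
            String candidate = word.substring(0, word.length() - 3) + "y";
            if (SINGULAR.contains(candidate)) return candidate;
        }
        if (word.endsWith("s")) {                   // cars -> car
            String candidate = word.substring(0, word.length() - 1);
            if (SINGULAR.contains(candidate)) return candidate;
        }
        return null;                                // unknown; needs context
    }

    public static void main(String[] args) {
        System.out.println(singularOf("remedies")); // remedy
        System.out.println(singularOf("oxen"));     // ox
        System.out.println(singularOf("sheep"));    // null -> look at the context
    }
}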
With (English) verbs you can generally only identify the third person singular (with a similar procedure to the one used for nouns; you'd need a list of exceptions for verbs ending in -s, such as kiss). Forms of to be are more helpful, but the second person singular is an issue (are). However, unless you have direct speech in your texts, it will not be used very frequently.
Part-of-speech taggers can also only make these decisions based on context, so I don't think they will be much of a help here. They are likely to be overkill. A couple of word lists and simple heuristic rules will probably give you equal or better accuracy using far fewer resources. This is the way these things were done before large amounts of annotated data were available.
In the end it depends on your circumstances. It might be quicker to simply use an existing tagger, but for this limited problem you might get better accuracy and speed with the rule-based approach (or even a combined one for accuracy).
Related
I have several lists of Strings already classified, like:
<string> <tag>
088 9102355 PHONE NUMBER
091 910255 PHONE NUMBER
...
Alfred St STREET
German St STREET
...
RE98754TO IDENTIFIER
AUX9654TO IDENTIFIER
...
service open all day long DESCRIPTION
service open from 8 to 22 DESCRIPTION
...
jhon.smith@email.com EMAIL
jhon.smith@anothermail.com EMAIL
...
www.serviceSite.com URL
...
Florence CITY
...
with a lot of strings per tag, and I have to make a Java program which, given
a new List of Strings (all assumed to have the same tag), assigns a probability for each tag to the list.
The program has to be completely language independent and all the knowledge has to come from the lists of tagged strings like the ones described above.
I think that this problem can be solved with NER approaches (i.e. machine learning algorithms like CRF), but those are usually for unstructured text like a chapter from a book or a paragraph of a web page, and not for lists of independent strings.
I thought of using a CRF (i.e. Conditional Random Field) because I found a similar approach used in the Karma Data Integration Tool, as described in this article, paragraph 3.1,
where the "semantic types" are my tags.
To tackle the problem I downloaded the Stanford Named Entity Recognizer (NER) and played a bit
with its Java API through NERDemo.java, finding two problems:
the training file for the CRFClassifier has to have one word per row, so I haven't found a way to classify groups of words with a single tag
I don't understand if I have to make one classifier per tag or a single classifier for all, because a single string could be classified with n different tags and it is the user who chooses between them. So I'm more interested in the probability assigned by the classifiers than in the exact class matching. Furthermore,
I don't have any "no tag" strings, so I don't know how the classifier behaves without them when assigning the probabilities.
Is this the right approach to the problem? Is there a way to use the Stanford NER
or another Java API with CRF, or another suitable machine learning algorithm, to do it?
Update
I managed to train the CRF classifier, first with each word classified independently with the tag and each group of words separated by two commas (classified as "no tag" (0)), then with each group of words as a single word with underscores replacing the spaces, but I had very disappointing results in the little test I made. I haven't quite figured out which features to include and which to exclude from the ones described in the NERFeatureFactory javadoc, considering they can't have anything to do with language.
Update 2
The test results are beginning to make sense. I've separated each string (tagging every token) from the others with two new lines, instead of the horrible "two commas labeled with 0", and I've used the Stanford PTBTokenizer instead of the one that I made. Moreover I've tuned the features, turning on the usePrev and useNext features and using suffix/prefix n-grams up to 6 characters in length, among other things.
The training file named training.tsv has this format:
rt05201201010to identifier

1442955884000 identifier

rt100005154602cv identifier

Alfred street
Street street

Robert street
Street street
and these are the flags in the properties file:
# these are the features we'd like to train with
# some are discussed below, the rest can be
# understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
# word character ngrams will be included up to length 6 as prefixes
# and suffixes only
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useTags=false
useWordPairs=false
useDisjunctive=true
useSequences=false
usePrevSequences=true
useNextSequences=true
# the next flag can have these values: IO, IOB1, IOB2, IOE1, IOE2, SBIEO
entitySubclassification=IO
printClassifier=HighWeight
cacheNGrams=true
# the last 4 properties deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
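For completeness, here is a sketch of how a properties file like the one above can be used to train and serialize a model from Java; the trainFile and serializeTo values and the file names are my own additions for illustration, not part of the flags shown:

import java.io.FileReader;
import java.util.Properties;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class TrainTagger {
    public static void main(String[] args) throws Exception {
        // Load the feature flags shown above, then add the input/output locations.
        Properties props = new Properties();
        try (FileReader in = new FileReader("ner.properties")) {
            props.load(in);
        }
        props.setProperty("trainFile", "training.tsv");
        props.setProperty("serializeTo", "my-tagger.ser.gz");

        CRFClassifier<CoreLabel> crf = new CRFClassifier<>(props);
        crf.train();                                      // reads trainFile
        crf.serializeClassifier(props.getProperty("serializeTo"));
    }
}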
However, I found another problem: I managed to train only 39 labels with 100 strings each, even though I have about 150 labels with more than 1,000 strings each. Even so, it takes about 5 minutes to train, and if I raise these numbers a bit it throws a Java heap OutOfMemoryError.
Is there a way to scale up to those numbers with a single classifier? Is it better to train 150 (or fewer, maybe with two or three labels each) little classifiers and combine them later? Do I need to train with 1,000+ strings per label, or can I stop at 100 (maybe choosing them to be quite different from one another)?
The first thing you should be aware of is that (linear chain) CRF taggers are not designed for this purpose. They came about as a very nice solution for context-based prediction, i.e. when you have words before and after named entities and you look for clues in a limited window (e.g. 2 words before/after the current word). This is why you had to insert double lines: to delimit sentences. They also provide coherence between the tags assigned to words, which is indeed a good thing in your case.
A CRF tagger should work, but with an extra cost in the learning step which could be avoided by using simpler (maximum entropy, SVM) but still accurate machine learning methods. In Java, for your task, wouldn't Weka be a better solution? I would also consider BIO tagging as not relevant in your case.
Whatever software/coding you use, it is not surprising that n-grams at the character level give good improvements, but I believe you could add dedicated features. For instance, since morphological clues are important (the presence of an "@", upper case or digit characters), you could use codes (see ref [1]), which are a very convenient way to describe strings. You'll also most probably obtain better results by using lists of names (a lexicon) that can be triggered as additional features.
[1] Michael Collins (2002). Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted Perceptron.
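To illustrate the idea of such codes (this particular mapping is only an example in the spirit of [1], not the exact feature set used there): collapse character classes so that strings with the same morphology produce the same shape, and feed the shape to the learner as an extra feature:

public class WordShape {
    /** Collapses runs of character classes into a compact shape code. */
    static String shape(String s) {
        StringBuilder shape = new StringBuilder();
        char last = 0;
        for (char c : s.toCharArray()) {
            char code;
            if (Character.isUpperCase(c))      code = 'A';
            else if (Character.isLowerCase(c)) code = 'a';
            else if (Character.isDigit(c))     code = '0';
            else                               code = c;   // keep '@', '.', '-', ...
            if (code != last) {                             // collapse repeated classes
                shape.append(code);
                last = code;
            }
        }
        return shape.toString();
    }

    public static void main(String[] args) {
        System.out.println(shape("RE98754TO"));             // A0A
        System.out.println(shape("jhon.smith@email.com"));  // a.a@a.a
        System.out.println(shape("088 9102355"));           // 0 0
    }
}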
Suppose I've got a sentence that I'm forming dynamically based on data passed in from someone else. Let's say it's a list of food items, like: ["apples", "pears", "oranges"] or ["bread", "meat"]. I want to take these lists and form a sentence: "I like to eat apples, pears and oranges" and "I like to eat bread and meat."
It's easy to do this for just English, but I'm fairly sure that conjunctions don't work the same way in all languages (in fact I bet some of them are wildly different). How do you localize a sentence with a dynamic number of items, joined in some manner?
If it helps, I'm working on this for Android, so you will be able to use any libraries that it provides (or one for Java).
I would ask native speakers of the languages you want to target. If all the languages use the same construct (a comma between all the elements except for the last one), then simply build a comma-separated list of the first n - 1 elements, use an externalized pattern, and use java.util.MessageFormat to build the whole sentence with the n - 1 elements string as the first argument and the nth element as the second argument. You might also externalize the separator (a comma in English) to make it more flexible if needed.
If some languages use another construct, then define several Strategy implementations, externalize the name of the strategy in order to know which strategy to use for a given locale, and then ask the appropriate strategy to format the list for you. The strategy for English would use the algorithm described above. The strategy for another language would use a different way of joining the elements together, but it could be reused for several languages using externalized patterns if those languages use the same construct but with different words.
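A minimal sketch of the English strategy described above; the pattern and separator strings here are illustrative and would normally be pulled from a per-locale resource bundle:

import java.text.MessageFormat;
import java.util.List;

public class SentenceJoiner {
    // In practice these would be loaded from a per-locale ResourceBundle.
    private static final String PATTERN   = "I like to eat {0} and {1}.";
    private static final String SEPARATOR = ", ";

    static String join(List<String> items) {
        if (items.isEmpty()) {
            return "";
        }
        if (items.size() == 1) {
            return MessageFormat.format("I like to eat {0}.", items.get(0));
        }
        // Join the first n - 1 elements, then slot them and the last one into the pattern.
        String allButLast = String.join(SEPARATOR, items.subList(0, items.size() - 1));
        String last = items.get(items.size() - 1);
        return MessageFormat.format(PATTERN, allButLast, last);
    }

    public static void main(String[] args) {
        System.out.println(join(List.of("apples", "pears", "oranges")));
        // I like to eat apples, pears and oranges.
        System.out.println(join(List.of("bread", "meat")));
        // I like to eat bread and meat.
    }
}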
Note that it's because this is so difficult that most programs simply produce something like "List of what I like: bread, meat."
The problem, of course, is knowing the other languages. I don't know how you'd do it accurately without finding native speakers for your target languages. Google Translate is an option, but you should still have someone who speaks the language proofread it.
Android has this built-in functionality in the Plurals string resources.
Their documentation provides examples in English (res/values/strings.xml) and also in another language (placed in res/values-pl/strings.xml).
Valid quantities: {zero, one, two, few, many, other}
Their example:
<?xml version="1.0" encoding="utf-8"?>
<resources>
    <plurals name="numberOfSongsAvailable">
        <item quantity="one">One song found.</item>
        <item quantity="other">%d songs found.</item>
    </plurals>
</resources>
So it seems like their approach is to use longer sentence fragments than just the individual words and plurals.
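At runtime the matching item is picked with Resources.getQuantityString; a short usage sketch, assuming the plurals resource from their example is defined in your app (R being your app's generated resource class):

import android.content.res.Resources;

public final class SongMessages {
    /** Returns "One song found." or e.g. "7 songs found." depending on count. */
    static String songsFoundMessage(Resources res, int count) {
        // The quantity argument selects the right <item>; count also fills in %d.
        return res.getQuantityString(R.plurals.numberOfSongsAvailable, count, count);
    }
}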
Very difficult. Take French, for example: it uses articles in front of each noun. Your example would be:
J'aime manger des pommes, des poires et des oranges.
Or better:
J'aime les pommes, les poires et les oranges.
(Literally: I like the apples, the pears and the oranges).
So it's no longer only a problem of conjunctions. It's a matter of different grammatical rules that go beyond conjunctions. Needless to say, other languages may raise totally different issues.
This is unfortunately nearly impossible to do. That is because of conjugation on one side and different forms depending on gender on the other. English doesn't have that distinction, but many (most?) other languages do. To make matters worse, the verb form might depend on the following noun (both the gender-dependent form and the conjugation).
Instead of such nice sentences, we tend to use lists of things, for example:
The list of things I would like to eat:
apples
pears
oranges
I am afraid that this is the only reasonable thing to do (variable, modifiable list).
What would be the best approach to compare two hexadecimal file signatures against each other for similarities?
More specifically, what I would like to do is to take the hexadecimal representation of an .exe file and compare it against a series of virus signature. For this approach I plan to break the file (exe) hex representation into individual groups of N chars (ie. 10 hex chars) and do the same with the virus signature. I am aiming to perform some sort of heuristics and therefore statistically check whether this exe file has X% of similarity against the known virus signature.
The simplest, and likely very wrong, way I thought of doing this is to compare exe[n, n-1] against virus[n, n-1], where each element in the array is a sub-array, and therefore exe1[0,9] against virus1[0,9]. Each subset would be graded statistically.
As you can imagine, there would be a massive number of comparisons and hence it would be very, very slow. So I thought to ask whether you can think of a better approach to such a comparison, for example combining different data structures.
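For what it's worth, here is a minimal sketch of the chunking idea that avoids the all-pairs comparison by hashing the exe's chunks first; the chunk length and the overlap measure (fraction of signature chunks found anywhere in the exe) are arbitrary choices of mine, not an established technique:

import java.util.HashSet;
import java.util.Set;

public class ChunkSimilarity {
    /** Splits a hex string into overlapping chunks of the given length. */
    static Set<String> chunks(String hex, int n) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + n <= hex.length(); i++) {
            out.add(hex.substring(i, i + n));
        }
        return out;
    }

    /** Fraction of the signature's chunks that occur anywhere in the exe. */
    static double similarity(String exeHex, String signatureHex, int n) {
        Set<String> exeChunks = chunks(exeHex, n);
        Set<String> sigChunks = chunks(signatureHex, n);
        if (sigChunks.isEmpty()) return 0.0;
        long hits = sigChunks.stream().filter(exeChunks::contains).count();
        return (double) hits / sigChunks.size();
    }

    public static void main(String[] args) {
        System.out.println(similarity("deadbeefcafebabe", "cafebabe", 4)); // 1.0
        System.out.println(similarity("deadbeefcafebabe", "0badf00d", 4)); // 0.0
    }
}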
This is for a project I am doing for my BSc, where I am trying to develop an algorithm to detect polymorphic malware. This is only one part of the whole system; the other is based on genetic algorithms to evolve the static virus signature. Any advice, comments, or general information such as resources are very welcome.
Definition: Polymorphic malware (virus, worm, ...) maintains the same functionality and payload as its "original" version, while having apparently different structures (variants). It achieves that by code obfuscation, thus altering its hex signature. Some of the techniques used for polymorphism are: format alteration (inserting/removing blanks), variable renaming, statement rearrangement, junk code addition, statement replacement (x=1 changes to x=y/5 where y=5), and swapping of control statements. So, much like the flu virus mutates so that vaccination is not effective, polymorphic malware mutates to avoid detection.
Update: After the advice you gave me regarding what to read, I did that, but it somewhat confused me more. I found several distance algorithms that could apply to my problem, such as:
Longest common subsequence
Levenshtein algorithm
Needleman–Wunsch algorithm
Smith–Waterman algorithm
Boyer Moore algorithm
Aho Corasick algorithm
But now I don't know which one to use; they all seem to do the same thing in different ways. I will continue to do research so that I can understand each one better, but in the meantime could you give me your opinion on which might be more suitable, so that I can give it priority during my research and study it more deeply.
Update 2: I ended up using an amalgamation of the LCSubsequence, LCSubstring and Levenshtein Distance. Thank you all for the suggestions.
There is a copy of the finished paper on GitHub
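For reference, the Levenshtein distance used as one of the pieces above is the standard dynamic-programming edit distance; a sketch in Java:

public class Levenshtein {
    /** Number of single-character insertions, deletions and substitutions needed. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;          // roll the rows
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("deadbeef", "deedbeef")); // 1
        System.out.println(distance("kitten", "sitting"));    // 3
    }
}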
For algorithms like these I suggest you look into the bioinformatics area. There is a similar problem setting there in that you have large files (genome sequences) in which you are looking for certain signatures (genes, special well-known short base sequences, etc.).
Also, when it comes to polymorphic malware, this field should offer you a lot, because in biology it seems similarly difficult to get exact matches. (Unfortunately, I am not aware of appropriate approximate searching/matching algorithms to point you to.)
One example from this direction would be to adapt something like the Aho Corasick algorithm in order to search for several malware signatures at the same time.
Similarly, algorithms like the Boyer Moore algorithm give you fantastic search runtimes especially for longer sequences (average case of O(N/M) for a text of size N in which you look for a pattern of size M, i.e. sublinear search times).
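As a sketch of the multi-pattern search idea, assuming the third-party org.ahocorasick (Robert Bor) library is on the classpath, and with made-up signature strings:

import org.ahocorasick.trie.Emit;
import org.ahocorasick.trie.Trie;

public class SignatureScan {
    public static void main(String[] args) {
        // Build one automaton over all known signatures, then scan the file once.
        Trie trie = Trie.builder()
                .addKeyword("deadbeefcafebabe")     // made-up signature fragments
                .addKeyword("0badf00d0badf00d")
                .build();

        String fileHex = "...deadbeefcafebabe...";  // hex dump of the exe
        for (Emit hit : trie.parseText(fileHex)) {
            System.out.println("signature " + hit.getKeyword()
                    + " found at offset " + hit.getStart());
        }
    }
}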
A number of papers have been published on finding near-duplicate documents in a large corpus of documents in the context of web search. I think you will find them useful. For example, see this presentation.
There has been a serious amount of research recently into automating the detection of duplicate bug reports in bug repositories. This is essentially the same problem you are facing. The difference is that you are using binary data. They are similar problems because you will be looking for strings that have the same basic pattern, even though the patterns may have some slight differences. A straight-up distance algorithm probably won't serve you well here.
This paper gives a good summary of the problem as well as some approaches in its citations that have been tried.
ftp://ftp.computer.org/press/outgoing/proceedings/Patrick/apsec10/data/4266a366.pdf
As somebody has pointed out, similarity with known string and bioinformatics problem might help. Longest common substring is very brittle, meaning that one difference can halve the length of such a string. You need a form of string alignment, but more efficient than Smith-Waterman. I would try and look at programs such as BLAST, BLAT or MUMMER3 to see if they can fit your needs. Remember that the default parameters, for these programs, are based on a biology application (how much to penalize an insertion or a substitution for instance), so you should probably look at re-estimating parameters based on your application domain, possibly based on a training set. This is a known problem because even in biology different applications require different parameters (based, for instance, on the evolutionary distance of two genomes to compare). It is also possible, though, that even at default one of these algorithms might produce usable results. Best of all would be to have a generative model of how viruses change and that could guide you in an optimal choice for a distance and comparison algorithm.
I have a list of people that I'd like to search through. I need to know 'how much' each item matches the string it is being tested against.
The list is rather small, currently 100+ names, and it probably won't reach 1000 anytime soon.
Therefore I assumed it would be OK to keep the whole list in memory and do the searching using something Java offers out-of-the-box, or using some tiny library that just implements one or two testing algorithms. (In other words, without bringing in any complicated/overkill solution that stores indexes or relies on a database.)
What would be your choice in such case please?
EDIT: It seems like Levenshtein is closest to what I need, from what has been advised. It just gets easily fooled when the search query is "John" and the names in the list are significantly longer.
You should look at various string comparison algorithms and see which one suits your data best. Options are Jaro-Winkler, Smith-Waterman, etc. Look up SimMetrics, an F/OSS library that offers a very comprehensive set of string comparison algorithms.
If you are looking for a 'how much' match, you should use Soundex. Here is a Java implementation of this algorithm.
Check out Double Metaphone, an improved soundex from 1990.
http://commons.apache.org/codec/userguide.html
http://svn.apache.org/viewvc/commons/proper/codec/trunk/src/java/org/apache/commons/codec/language/DoubleMetaphone.java?view=markup
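A short usage sketch with the Commons Codec class linked above; the names are arbitrary examples:

import org.apache.commons.codec.language.DoubleMetaphone;

public class PhoneticMatch {
    public static void main(String[] args) {
        DoubleMetaphone dm = new DoubleMetaphone();

        // Both encode to the same phonetic key, so they "sound alike".
        System.out.println(dm.doubleMetaphone("Smith"));                  // e.g. SM0
        System.out.println(dm.doubleMetaphone("Smyth"));                  // e.g. SM0
        System.out.println(dm.isDoubleMetaphoneEqual("Smith", "Smyth"));  // true
        System.out.println(dm.isDoubleMetaphoneEqual("Smith", "Jones"));  // false
    }
}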
In my opinion, the Jaro-Winkler algorithm will suit your requirement best.
Here is a short summary of the Jaro-Winkler distance algorithm.
And one of the PDFs which compares different algorithms --> Link to PDF
Is there a dictionary I can download for Java?
I want to have a program that takes a few random letters and sees if they can be rearranged into a real word by checking them against the dictionary.
Is there a dictionary I can download for Java?
Others have already answered this... Maybe you weren't simply talking about a dictionary file but about a spellchecker?
I want to have a program that takes a few random letters and sees if they can be rearranged into a real word by checking them against the dictionary
That is different. How fast do you want this to be? How many words in the dictionary and how many words, up to which length, do you want to check?
In case you want a spellchecker (which is not entirely clear from your question), Jazzy is a spellchecker for Java that has links to a lot of dictionaries. It's not bad, but the various implementations are horribly inefficient (it's OK for small dictionaries, but it's an amazing waste when you have several hundred thousand words).
Now if you just want to solve the specific problem you describe, you can:
parse the dictionary file and create a map: (letters in sorted order -> set of matching words)
then for any number of random letters: sort them and see if you have an entry in the map (if you do, the entry's value contains all the words that you can make with these letters). For example:
abracadabra : (aaaaabbcdrr, (abracadabra))
carthorse : (acehorrst, (carthorse) )
orchestra : (acehorrst, (carthorse,orchestra) )
etc...
Now you take some random letters, say "hsotrerca"; you sort them to get "acehorrst" and, using that as a key, you get all the (valid) anagrams...
This works because what you described is a special (easy) case: all you need to do is sort your letters and then use an O(1) map lookup.
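A minimal sketch of that sorted-letters map; the three dictionary words are the ones from the example above, and a real run would load the full word list:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AnagramLookup {
    private final Map<String, List<String>> byLetters = new HashMap<>();

    /** Key is the word's letters in sorted order, e.g. "carthorse" -> "acehorrst". */
    private static String key(String word) {
        char[] letters = word.toLowerCase().toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    void add(String word) {
        byLetters.computeIfAbsent(key(word), k -> new ArrayList<>()).add(word);
    }

    /** All dictionary words that use exactly these letters. */
    List<String> anagramsOf(String letters) {
        return byLetters.getOrDefault(key(letters), List.of());
    }

    public static void main(String[] args) {
        AnagramLookup dict = new AnagramLookup();
        for (String w : List.of("abracadabra", "carthorse", "orchestra")) {
            dict.add(w);
        }
        System.out.println(dict.anagramsOf("hsotrerca")); // [carthorse, orchestra]
    }
}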
To handle more complicated spell checking, where there may be errors, you need something to come up with "candidates" (words the misspelled input might correspond to) [like, say, using the Soundex, Metaphone or Double Metaphone algorithms] and then use things like the Levenshtein edit-distance algorithm to check the candidates against known good words (or the much more complicated tree built on the Levenshtein edit distance that Google uses for its "find as you type"):
http://en.wikipedia.org/wiki/Levenshtein_distance
As a funny side note, an optimized dictionary representation can store hundreds of thousands and even millions of words in less than 10 bits per word (yup, you read that correctly: less than 10 bits per word) and yet allow very fast lookups.
Dictionaries are usually programming language agnostic. If you try to Google it without using the keyword "java", you may get better results. E.g. searching for free dictionary download gives, among others, dicts.info.
OpenOffice dictionaries are easy to parse line-by-line.
You can read it into memory (remember, it's a lot of memory):
List<String> words = IOUtils.readLines(new FileInputStream("dicfile.txt"), "UTF-8"); (from commons-io)
That way you get a List of all the words. Alternatively you can use LineIterator if you run into memory problems.
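A sketch of the LineIterator variant, using the same placeholder file name as above:

import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;

public class DictionaryStream {
    public static void main(String[] args) throws Exception {
        LineIterator it = FileUtils.lineIterator(new File("dicfile.txt"), "UTF-8");
        try {
            while (it.hasNext()) {
                String word = it.nextLine();
                // process one word at a time instead of holding the whole list in memory
            }
        } finally {
            it.close();
        }
    }
}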
If you are on a unix like OS look in /usr/share/dict.
Here's one:
http://java.sun.com/docs/books/tutorial/collections/interfaces/examples/dictionary.txt
You can use the standard Java file handling to read the word on each line:
http://www.java-tips.org/java-se-tips/java.io/how-to-read-file-in-java.html
Check out http://sourceforge.net/projects/test-dictionary/; it might give you some clues.
I am not sure if there are any such libraries available for download! But I guess you can definitely dig through sourceforge.net to see if there are any, or how people have used dictionaries: http://sourceforge.net/search/?type_of_search=soft&words=java+dictionary