Given a set of words tagged for part of speech, I want to find those that are obscenities in mainstream English. How might I do this? Should I just make a huge list, and check for the presence of anything in the list? Should I try to use a regex to capture a bunch of variations on a single root?
If it makes it easier: I don't want to filter anything out, just to get a count. So some false positives aren't the end of the world, as long as the over-counting rate is more or less uniform.
A huge list, and think of the target audience. Is there a 3rd-party service that specialises in this which you could use, rather than rolling your own?
Some quick thoughts:
The Scunthorpe problem (and follow the links to "Swear filter" for more)
British or American English? fanny, fag, etc.
Political correctness: "black" or "Afro-American"?
Edit:
Be very careful here as well. Normal words can offend, whether by choice or ignorance.
Is the phrase "I want to stick my long-necked giraffe up your fluffy white bunny" obscene?
I'd make a huge list.
Regexes have the problem of misfiring when applied to natural language, especially given the number of exceptions English has.
Note that any NLP logic like this will be subject to "character replacement" attacks:
For example, I can write "hello" as "he11o", replacing the L's with ones. The same goes for obscenities. So while there's no perfect answer, a "blacklist" approach of "bad words" might work. Watch out for false positives (I'd run my blacklist against a large book to see what comes up).
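To make the blacklist idea concrete, here is a minimal sketch; the word list and the substitution table are placeholders, not a real profanity list.

import java.util.Set;

public class BlacklistCounter {
    // Illustrative placeholder list; a real blacklist would be much larger.
    private static final Set<String> BLACKLIST = Set.of("badword", "otherbadword");

    // Undo a few common substitutions ("he11o" -> "hello") before checking.
    static String normalize(String token) {
        return token.toLowerCase()
                    .replace('1', 'l')
                    .replace('0', 'o')
                    .replace('3', 'e')
                    .replace('@', 'a')
                    .replace('$', 's');
    }

    public static long countObscenities(String[] tokens) {
        long count = 0;
        for (String token : tokens) {
            if (BLACKLIST.contains(normalize(token))) {
                count++;
            }
        }
        return count;
    }
}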
One problem with filters of this kind is their tendency to flag entirely proper English town names like Scunthorpe. While that can be reduced by checking the whole word rather than parts, you then find people taking advantage by merging their offensive words with adjacent text.
It depends what your text source is, but I'd go for some kind of established and proven pattern matching algorithm, using a Trie for example.
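For example, a minimal trie for exact word membership could look like the following sketch (class and method names are my own, not from any particular library).

import java.util.HashMap;
import java.util.Map;

class WordTrie {
    private final Map<Character, WordTrie> children = new HashMap<>();
    private boolean terminal;

    // Insert a word character by character.
    void add(String word) {
        WordTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new WordTrie());
        }
        node.terminal = true;
    }

    // Exact whole-word lookup; substrings of stored words are not matched.
    boolean contains(String word) {
        WordTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return false;
        }
        return node.terminal;
    }
}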
Use the morphy lemmatizer built into WordNet, and then determine whether the lemma is an obscenity. This will solve the problem of different verb forms, plurals, etc...
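As a rough sketch, assuming the MIT JWI library (whose WordnetStemmer wraps WordNet's morphy) and a local WordNet installation whose path you supply; the path below is an example only.

import java.io.File;
import java.util.List;

import edu.mit.jwi.Dictionary;
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.POS;
import edu.mit.jwi.morph.WordnetStemmer;

public class LemmaCheck {
    public static void main(String[] args) throws Exception {
        // Path to the WordNet "dict" directory is an assumption; adjust for your install.
        IDictionary dict = new Dictionary(new File("/usr/local/WordNet-3.0/dict"));
        dict.open();
        WordnetStemmer stemmer = new WordnetStemmer(dict);

        // Reduce an inflected form to its candidate lemmas, then check each
        // lemma against your obscenity list.
        List<String> lemmas = stemmer.findStems("running", POS.VERB);
        System.out.println(lemmas); // e.g. [run, running]
        dict.close();
    }
}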
I would advocate a large list of simple regexes - smaller than a list of all the variants, but not trying to capture anything more than letter alternatives in any given expression: like "f[u_#$%^&*.-]ck" (hyphen last so it isn't read as a range).
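A minimal sketch of that approach; the two patterns are placeholders, and a real list would be much longer.

import java.util.List;
import java.util.regex.Pattern;

public class ObscenityRegexCounter {
    // Each pattern allows a few character substitutions for one root word.
    // The hyphen sits last in the class so it isn't treated as a range.
    private static final List<Pattern> PATTERNS = List.of(
            Pattern.compile("\\bf[u_#$%^&*.-]ck\\b", Pattern.CASE_INSENSITIVE),
            Pattern.compile("\\bsh[i1!]t\\b", Pattern.CASE_INSENSITIVE));

    // Count all matches across the whole text.
    public static long count(String text) {
        return PATTERNS.stream()
                       .mapToLong(p -> p.matcher(text).results().count())
                       .sum();
    }
}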
You want to use Bayesian Analysis to solve this problem. Bayesian probability is a powerful technique used by spam filters to detect spam/phishing messages in your email inbox. You can train your analysis engine so that it can improve over time. The ability to detect a legitimate email vs. a spam email sounds identical to the problem you are experiencing.
Here are a couple of useful links:
A Plan For Spam - The first proposal to use Bayesian analysis to combat spam.
Data Mining (ppt) - This was written by a colleague of mine.
Classifier4J - A text classifier library written in Java (they exist for every language, but you tagged this question with Java).
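To show the idea without tying it to a particular library's API (this is not Classifier4J code, just a bare-bones naive Bayes word-probability sketch with crude smoothing):

import java.util.HashMap;
import java.util.Map;

public class TinyBayes {
    private final Map<String, Integer> obsceneCounts = new HashMap<>();
    private final Map<String, Integer> cleanCounts = new HashMap<>();
    private int obsceneDocs, cleanDocs;

    // Train on one labelled text.
    public void teach(String text, boolean obscene) {
        Map<String, Integer> counts = obscene ? obsceneCounts : cleanCounts;
        for (String w : text.toLowerCase().split("\\W+")) {
            counts.merge(w, 1, Integer::sum);
        }
        if (obscene) obsceneDocs++; else cleanDocs++;
    }

    // Rough P(obscene | text), using naive Bayes with crude add-one smoothing.
    public double classify(String text) {
        double logObscene = Math.log((obsceneDocs + 1.0) / (obsceneDocs + cleanDocs + 2.0));
        double logClean = Math.log((cleanDocs + 1.0) / (obsceneDocs + cleanDocs + 2.0));
        for (String w : text.toLowerCase().split("\\W+")) {
            logObscene += Math.log((obsceneCounts.getOrDefault(w, 0) + 1.0) / (totalWords(obsceneCounts) + 2.0));
            logClean += Math.log((cleanCounts.getOrDefault(w, 0) + 1.0) / (totalWords(cleanCounts) + 2.0));
        }
        return 1.0 / (1.0 + Math.exp(logClean - logObscene));
    }

    private static int totalWords(Map<String, Integer> counts) {
        return counts.values().stream().mapToInt(Integer::intValue).sum();
    }
}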
There are webservices that do this kind of thing in English.
I'm sure there are others, but I've used WebPurify in a project for precisely this reason before.
At Melissa Data, when my manager (the director of Massachusetts Research and Development) and I refactored a Data Profiler targeted at relational databases, we counted profanities by the number of Levenshtein distance matches, where the number of insertions, deletions, and substitutions is tunable by the user to allow for spelling mistakes, Germanic equivalents of English words, plurals, and whitespace and non-whitespace punctuation. We sped up the Levenshtein distance calculation by looking only in a diagonal band of the n-by-n matrix.
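A sketch of that banded idea, assuming a fixed maximum edit distance; cells outside the diagonal band are never computed.

public class BandedLevenshtein {
    // Returns the edit distance if it is <= maxDist, otherwise maxDist + 1.
    public static int distance(String a, String b, int maxDist) {
        int n = a.length(), m = b.length();
        if (Math.abs(n - m) > maxDist) return maxDist + 1;
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++) prev[j] = j;
        for (int i = 1; i <= n; i++) {
            // Only fill cells within maxDist of the diagonal; the rest stay at the sentinel.
            int from = Math.max(1, i - maxDist);
            int to = Math.min(m, i + maxDist);
            java.util.Arrays.fill(curr, maxDist + 1);
            curr[0] = i;
            for (int j = from; j <= to; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return Math.min(prev[m], maxDist + 1);
    }
}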
Related
I have 140-character texts and a set of keywords.
What I want to do is write an algorithm that computes a matching percentage between my text and the keywords, in order to decide whether a text announces an IT event.
For example:
Text: "Tomorrow will take place our weekly event which about computer. We will discuss about how to implement algorithms. This will be very great."
keyword: "event, computer, database, Software, algorithms"
Here the match is 3 words out of 5 keywords, which is 60%.
Does it make sense to use the word count and compare it to the number of keywords? Is this approach accurate?
Has anyone dealt with something like this before?
Thanks for your support.
Yes, this definitely makes sense. However, you will have to evaluate in practice whether it is precise enough for your purpose. That depends very much on the texts you are dealing with.
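A minimal sketch of the overlap percentage described in the question (tokenization is deliberately crude):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class KeywordMatch {
    // Percentage of keywords that appear at least once in the text.
    public static double matchPercent(String text, Set<String> keywords) {
        if (keywords.isEmpty()) return 0.0;
        Set<String> words = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
        long hits = keywords.stream().map(String::toLowerCase).filter(words::contains).count();
        return 100.0 * hits / keywords.size();
    }
}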
If you want to try something that is a bit more advanced but not too complex: Cosine similarity is another common measure to compare texts.
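For comparison, a sketch of cosine similarity over simple term-count vectors:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CosineSimilarity {
    // Count each word's occurrences.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+")) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // Cosine of the angle between the two count vectors (0..1 for non-negative counts).
    public static double cosine(String a, String b) {
        Map<String, Integer> va = termCounts(a), vb = termCounts(b);
        Set<String> vocab = new HashSet<>(va.keySet());
        vocab.addAll(vb.keySet());
        double dot = 0, normA = 0, normB = 0;
        for (String term : vocab) {
            int x = va.getOrDefault(term, 0), y = vb.getOrDefault(term, 0);
            dot += x * y;
            normA += x * x;
            normB += y * y;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}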
There are tons of algorithms and libraries for text classification. LingPipe is a nice Java library that might help you.
If you are interested in using a library, you will find a good overview in the top answer to this Quora question.
Recently I've been assigned to build a translation memory for a new project. The idea is that the TM is a cache layer on top of the RPC layer, which calls the Google Translate API when there is no match in the TM. I'm considering using the source text as the key in the TM, and I need a fuzzy-matching algorithm to match a query text against the keys. If the score is above some threshold, like 0.85 (on a 0-1 scale), the cached translated text will be used instead of calling the Google service.
I've read a lot of articles/blogs/papers, but still don't know where to start.
TF-IDF + cosine similarity doesn't seem good enough? Levenshtein distance?
What about semantic similarity? But how?
I read about this
In the comments @mbatchkarov seems to provide a sound direction.
Does anyone have similar experience with this subject? Any suggestions are welcome.
A lot of the time the accepted answer to the question you linked to can get you quite far. You can compare the word (lemma) overlap between a query and all queries in the cache. To improve performance, you can incorporate word similarity to help you link semantically similar words. The thesaurus-building software I linked to in my answer is BSD-licensed, so you are free to use it as you see fit. If you need any help using it, the developers (disclaimer: I am part of the team) will be happy to help out. In fact, I've got a few pre-built thesauri lying around. These should probably be a part of the software, but they are too large to upload to GitHub.
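A sketch of that overlap idea as a cache lookup; plain word sets stand in for lemma sets here (a real implementation would lemmatize first), and the 0.85 threshold comes from the question.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TmLookup {
    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
    }

    // Jaccard overlap of the two word sets (0..1).
    static double overlap(String a, String b) {
        Set<String> sa = tokens(a), sb = tokens(b);
        Set<String> intersection = new HashSet<>(sa);
        intersection.retainAll(sb);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    // Returns the cached translation of the closest key above the threshold, or null.
    public static String lookup(Map<String, String> tm, String query, double threshold) {
        String best = null;
        double bestScore = threshold;
        for (Map.Entry<String, String> entry : tm.entrySet()) {
            double score = overlap(query, entry.getKey());
            if (score >= bestScore) {
                bestScore = score;
                best = entry.getValue();
            }
        }
        return best;
    }
}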
Whichever approach you go for, be aware that there will be many cases where this does not work well. This is because the approaches discussed in that question are about semantic similarity, and your application may require semantic equivalence. For example, "I like big ginger cats" and "We like big ginger cats" or "We like small ginger cats" are very similar in meaning, but it would be wrong to use the translation of one as a translation of the other.
I have come across similar problems a few times in the past and want to know what language (methodology), if any, is used to solve them (I am a J2EE/Java developer):
Problem: out of a probable set of words, with a given rule (say a word can be a combination of A and X, always starts with an X, and each word is delimited by a space), you have to read a sequence of words and parse the input to decide which of the words are syntactically correct. In a nutshell, these are problems that involve parsing techniques - say, simulating the logic of a vending machine in Java.
So what I want to know is: what are the techniques/best approaches for solving problems that involve parsing input? Like the Alien Language problem in Google Code Jam:
Google code jam problem
Do we use something like ANTLR or some other library in Java?
I know this question is slightly generic, but I had no other way of expressing it.
P.S.: I do not want a solution; I am looking for the best way to solve such recurring problems.
You can use JavaCC for complex parsing.
For relatively simple parsing and event processing I use enum(s) as a state machine, especially as a push parser.
For very simple parsing, you can use indexOf or split(" ") with equals, switch or startsWith
If you want to simulate the logic of something that is essentially a finite state automaton, you can simply code the FSA by hand. This is a standard computer science solution. A less obvious way to do this is to use a lexer generator (there are lots of them) to generate the FSA from descriptions of the valid sequences of events (in lexer-generator speak, these are called "characters", but you can cheat and substitute event occurrences for characters).
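As a sketch of a hand-coded FSA, here is the rule from the question (words made of A and X that must start with X) expressed as an enum-based state machine:

public class AxWordChecker {
    private enum State { START, IN_WORD, REJECT }

    // Accepts words that start with 'X' and contain only 'A' and 'X'.
    public static boolean isValid(String word) {
        State state = State.START;
        for (char c : word.toCharArray()) {
            switch (state) {
                case START:
                    state = (c == 'X') ? State.IN_WORD : State.REJECT;
                    break;
                case IN_WORD:
                    state = (c == 'A' || c == 'X') ? State.IN_WORD : State.REJECT;
                    break;
                default:
                    return false;
            }
        }
        return state == State.IN_WORD;
    }

    public static void main(String[] args) {
        for (String word : "XAX AXA X XAAB".split(" ")) {
            System.out.println(word + " -> " + isValid(word));
        }
    }
}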
If you have complex recursive rules about matching, you'll want a more traditional parser.
You can code these by hand, too, if the grammar isn't complicated; see my SO answer on "how to build a recursive descent parser". If your grammar is complex or changes quickly, you'll want to use a standard parser generator. Other answers here suggest specific ones, but there are many to choose from, all generally very capable.
[FWIW, I applied parser generators to recognizing valid transaction sequences in 1974, in TRW POS terminals at the May Company department store. Worked pretty well.]
You can use ANTLR, which is good; it will help with complex problems. But you can also use regular expressions, e.g. split("\\s+").
I am developing a financial manager in my free time with Java and a Swing GUI. When the user adds a new entry, he is prompted to fill in: money amount, date, comment and section (e.g. Car, Salary, Computer, Food, ...).
The sections are created "on the fly": when the user enters a new section, it is added to the section JComboBox for further selection. The other point is that the comments could be in different languages, so a list of hard-coded words and synonyms would be enormous.
So, my question is: is it possible to analyse the comment (e.g. "Fuel", "Car service", "Lunch at **") and preselect a fitting section?
My first thought was to do it with a neural network and learn from the input whenever the user selects a different section.
But my problem is that I don't know how to start at all. I tried "encog" with Eclipse and did some tutorials (XOR, ...), but all of them only use doubles as input/output.
Could anyone give me a hint on how to start, or any other possible solution for this?
Here is a runnable JAR (current development state, requires Java 7) and the Sourceforge page.
Forget about neural networks. This is a highly technical and specialized field of artificial intelligence, which is probably not suitable for your problem and requires solid expertise. Besides, there are a lot of simpler and better solutions for your problem.
First obvious solution: build a list of words and synonyms for all your sections and parse for these synonyms. You can then collect comments online for synonym analysis, or parse the comments/sections provided by your users to statistically detect relations between words, etc.
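A minimal sketch of that keyword approach; the section names and keyword lists are made up for illustration.

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class SectionGuesser {
    // Illustrative seed keywords; in practice this map would grow as users assign sections.
    private static final Map<String, Set<String>> SECTION_KEYWORDS = new HashMap<>();
    static {
        SECTION_KEYWORDS.put("Car", Set.of("fuel", "petrol", "service", "tires"));
        SECTION_KEYWORDS.put("Food", Set.of("lunch", "dinner", "restaurant", "groceries"));
        SECTION_KEYWORDS.put("Salary", Set.of("salary", "wage", "payroll"));
    }

    // Returns the section whose keywords overlap most with the comment, or null if nothing matches.
    public static String guess(String comment) {
        String best = null;
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : SECTION_KEYWORDS.entrySet()) {
            int hits = 0;
            for (String word : comment.toLowerCase().split("\\W+")) {
                if (entry.getValue().contains(word)) hits++;
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}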
There is an infinite number of possible solutions, ranging from the simplest to the most overkill. Now you need to decide whether this feature of your system is critical (prefilling? probably not, then)... and what any development effort will bring you. One hour of work could give you an 80%-satisfying feature, while aiming for 90% could cost a week of work. Is it really worth it?
Go for the simplest solution and tackle the real challenge of any dev project: delivering. Once your app is delivered, then you can always go back and improve as needed.
String comment = paramInput.toLowerCase();
if (comment.contains("fuel")) {
    // do the fuel functionality, e.g. preselect the "Car" section
}
In a simple app, if you only have a few specific sections, you can take the string from the comment, check whether it contains certain keywords, and change the value of the Section accordingly.
If you have a lot of categories, I would use something like Apache Lucene, where you could index all the categories with their names and potential keywords/phrases that might appear in a user's description. Then you could simply run the description through Lucene and use the top-matched category as a "best guess".
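A rough sketch of that idea, assuming a recent Lucene version (8.x here; class names differ slightly across versions), with made-up category keywords:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class CategoryGuesser {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory dir = new ByteBuffersDirectory();
        // Index each category as one document of keywords.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            addCategory(writer, "Car", "fuel petrol gas service repair tires insurance");
            addCategory(writer, "Food", "lunch dinner restaurant groceries pizza");
            addCategory(writer, "Salary", "salary wage payroll bonus");
        }
        // Search the user's description and take the top hit as the best guess.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("keywords", analyzer);
            ScoreDoc[] hits = searcher.search(
                    parser.parse(QueryParser.escape("Lunch at the corner cafe")), 1).scoreDocs;
            if (hits.length > 0) {
                System.out.println("Best guess: " + searcher.doc(hits[0].doc).get("name"));
            }
        }
    }

    private static void addCategory(IndexWriter writer, String name, String keywords) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("name", name, Field.Store.YES));
        doc.add(new TextField("keywords", keywords, Field.Store.NO));
        writer.addDocument(doc);
    }
}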
P.S. Neural network inputs and outputs will always be doubles or floats with values between 0 and 1. As for how to implement string matching with one, I wouldn't even know where to start.
It seems to me that following will do:
hard word statistics
maybe a stemming class (English/Spanish) which reduces a word like "lunches" to "lunch"
a list of most frequent non-words (the, at, a, for, ...)
The best fit is a linear problem, so in theory a neural net would fit, but why not go straight for the numerical best fit?
A machine learning algorithm such as an Artificial Neural Network doesn't seem like the best solution here. ANNs can be used for multi-class classification (i.e. 'to which of the provided pre-trained classes does the input belong?', not just 'does the input represent an X?'), which fits your use case. The problem is that they are supervised learning methods, and as such you need to provide a list of pairs of keywords and classes (Sections) that spans every possible input your users will provide. This is impossible, and in practice ANNs are re-trained when more data is available to produce better results and a more accurate decision boundary / representation of the function that maps inputs to outputs. This also assumes that you know all possible classes before you start and that each of those classes has training input that you provide.
The issue is that the input to your ANN (a list of characters or a numerical hash of the string) provides no context by which to classify. There's no higher level information provided that describes the word's meaning. This means that a different word that hashes to a numerically close value can be misclassified if there was insufficient training data.
(As maclema said, the output from an ANN will always be floats with each value representing proximity to a class - or a class with a level of uncertainty.)
A better solution would be to employ some kind of word-relation or synonym graph. A Bag of words model might be useful here.
Edit: in light of your comment that you don't know the Sections beforehand,
an easy solution to program would be to provide a list of keywords in a file that gets updated as people use the program. Simply storing a mapping of provided comments -> Sections, which you will already have in your database, would allow you to filter out non-keywords (and, or, the, ...). One option is then to find the list of Sections that the typed keywords belong to, suggest multiple Sections, and let the user pick one; the feedback you get from user selections would improve suggestions in the future. Another is to calculate a Bayesian probability - the probability that this word belongs to Section X given the previously stored mappings - for all keywords and Sections, and either take the modal Section or normalise over each unique keyword and take the mean. The probabilities will of course need to be updated as you gather more information; perhaps this could be done with every new addition in a background thread.
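A sketch of the per-keyword probability idea; counts would come from the stored comment -> Section mappings, and all names are illustrative.

import java.util.HashMap;
import java.util.Map;

public class SectionProbabilities {
    // counts.get(keyword).get(section) = how often the keyword appeared in a comment filed under that section
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    // Call this for each keyword of a comment once the user has chosen a section.
    public void record(String keyword, String section) {
        counts.computeIfAbsent(keyword.toLowerCase(), k -> new HashMap<>())
              .merge(section, 1, Integer::sum);
    }

    // P(section | keyword), estimated from the stored counts.
    public double probability(String keyword, String section) {
        Map<String, Integer> perSection = counts.getOrDefault(keyword.toLowerCase(), Map.of());
        int total = perSection.values().stream().mapToInt(Integer::intValue).sum();
        return total == 0 ? 0.0 : (double) perSection.getOrDefault(section, 0) / total;
    }

    // Mean of P(section | keyword) over the typed keywords, as described above.
    public double meanProbability(String[] keywords, String section) {
        if (keywords.length == 0) return 0.0;
        double sum = 0;
        for (String k : keywords) sum += probability(k, section);
        return sum / keywords.length;
    }
}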
I'm looking to compare two documents to determine what percentage of their text matches based on keywords.
To do this I could easily chop them into sets of sanitised words and compare, but I would like something a bit smarter, something that can match words based on their root, i.e. even if their tense or plurality is different. This sort of technique seems to be used in full-text searches, but I have no idea what to look for.
Does such an engine (preferably applicable to Java) exist?
Yes, you want a stemmer. Lauri Karttunen did some amazing work with finite state machines, but sadly I don't think there's an available implementation to use. As mentioned, Lucene has stemmers for a variety of languages, and the OpenNLP and GATE projects might help you as well. Also, how were you planning to "chop them up"? This is a little trickier than most people think because of punctuation, possessives, and the like, and just splitting on whitespace doesn't work at all in many languages. Take a look at OpenNLP for that too.
Another thing to consider is that just comparing the non-stop-words of the two documents might not be the best approach for good similarity, depending on what you are actually trying to do, because you lose locality information. For example, a common approach to plagiarism detection is to break the documents into chunks of n tokens and compare those. There are algorithms that let you compare many documents at the same time this way, much more efficiently than doing a pairwise comparison between each pair of documents.
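A sketch of that chunking (shingling) idea; n = 3 below is an arbitrary choice.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ShingleOverlap {
    // Builds the set of all n-token chunks ("shingles") in the text.
    static Set<String> shingles(String text, int n) {
        String[] tokens = text.toLowerCase().split("\\W+");
        Set<String> result = new HashSet<>();
        for (int i = 0; i + n <= tokens.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return result;
    }

    // Fraction of a's shingles that also occur in b (0..1).
    public static double containment(String a, String b, int n) {
        Set<String> sa = shingles(a, n), sb = shingles(b, n);
        if (sa.isEmpty()) return 0.0;
        long shared = sa.stream().filter(sb::contains).count();
        return (double) shared / sa.size();
    }
}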
I don't know of a pre-built engine, but if you decide to roll your own (e.g., if you can't find pre-written code that does what you want), searching for "Porter stemmer" should get you started on an algorithm that strips (most) suffixes reasonably well.
I think Lucene might be along the lines of what you're looking for. From my experience it's pretty easy to use.
EDIT: I just reread the question and thought about it some more. Lucene is a full-text search engine for Java. However, I'm not quite sure how hard it would be to repurpose it for what you're trying to do. Either way, it might be a good resource to start with and go from there.