Text classifier with word splitting using StanfordNLP classifier - java

After a quite successful start with StanfordNLP (including the German module) I tried out classifying numerical data. This also finished with good results.
Then I tried to set up a classifier for categorizing text documents (both mails and scanned documents), but this was quite frustrating. What I want is a classifier that works on a word basis, not with n-grams. My training file has two columns: the first with the category of the text, the second with the text itself, without tabs or line breaks.
The properties file has the following content:
1.splitWordsWithPTBTokenizer=true
1.splitWordsRegexp=false
1.splitWordsTokenizerRegexp=false
1.useSplitWords=true
But when I start training the classifier like this...
ColumnDataClassifier cdc = new ColumnDataClassifier("classifier.properties");
Classifier<String, String> classifier =
cdc.makeClassifier(cdc.readTrainingExamples("data.train"));
...then I get many lines starting with the following hint:
[main] INFO edu.stanford.nlp.classify.ColumnDataClassifier - Warning: regexpTokenize pattern false didn't match on
My questions are:
1) Any idea what is wrong with my properties? I think my training file is okay.
2) I want to use the words/tokens that I got from CoreNLP with the German model. Is this possible?
Thanks for any answers!

The numbering is correct; you don't have to put 2s at the beginning of the lines, as another answer here states. 1 stands for the first data column, not for the first column overall in your training file (which is the category). Options starting with 2. would apply to the second data column, i.e. the third column overall in your training file - which you don't have.
I don't know about using the words/tokens you got from CoreNLP, but it also took me a while to find out how to use word n-grams, so maybe for some people this will be helpful:
# regex for splitting on whitespaces
1.splitWordsRegexp=\\s+
# enable word n-grams, just like character n-grams are used
1.useSplitWordNGrams=true
# range of values of n for your n-grams. (1-grams to 4-grams in this example)
1.minWordNGramLeng=1
1.maxWordNGramLeng=4
# use word 1-grams (just single words as features), obsolete if you're using
# useSplitWordNGrams with minWordNGramLeng=1
1.useSplitWords=true
# use adjacent word 2-grams, obsolete if you're using
# useSplitWordNGrams with minWordNGramLeng<=2 and maxWordNGramLeng>=2
1.useSplitWordPairs=true
# use word 2-grams in every possible combination, not just adjacent words
1.useAllSplitWordPairs=true
# same as the pairs but 3-grams, also not just adjacent words
1.useAllSplitWordTriples=true
For more information, have a look at http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/classify/ColumnDataClassifier.html

You say your training file has two columns, the first with the category of the text, the second with the text itself. Based on this, your properties file is incorrect, because you are adding rules for the first column there.
Modify your properties so they apply to the column containing the text, as follows:
2.splitWordsWithPTBTokenizer=true
2.splitWordsRegexp=false
2.splitWordsTokenizerRegexp=false
2.useSplitWords=true
Furthermore, I would suggest working through the Software/Classifier/20 Newsgroups wiki, which shows some practical examples of how to work with the Stanford Classifier and how to set up options through the properties file.
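For reference, once the properties are fixed, training and classifying follow the usual ColumnDataClassifier pattern; a minimal sketch (file names are placeholders, API as in the CoreNLP ColumnDataClassifierDemo in recent versions):
import edu.stanford.nlp.classify.Classifier;
import edu.stanford.nlp.classify.ColumnDataClassifier;
import edu.stanford.nlp.ling.Datum;
import edu.stanford.nlp.objectbank.ObjectBank;

public class ClassifierDemo {
    public static void main(String[] args) {
        // Train from the properties file and the tab-separated training file.
        ColumnDataClassifier cdc = new ColumnDataClassifier("classifier.properties");
        Classifier<String, String> classifier =
                cdc.makeClassifier(cdc.readTrainingExamples("data.train"));
        // Classify each line of a test file that has the same column layout.
        for (String line : ObjectBank.getLineIterator("data.test", "utf-8")) {
            Datum<String, String> d = cdc.makeDatumFromLine(line);
            System.out.println(line + " ==> " + classifier.classOf(d));
        }
    }
}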

Related

Find words in multiple files starting with a specific set of characters and replace the whole word with another word

I need to read through multiple files and check for all occurrences of words that start with a specific pattern, and replace them in all the files with another word. For example, I need to find all words beginning with 'Hello' in a set of files, which may contain words like 'Hellotoall', and I want the word to be replaced with 'Greetings' (just an example). I have tried:
content = content.replaceAll("/Hello(\\w)+/g", "Greetings");
This code results in: Greetingstoall, but I want the whole word to be replaced with 'Greetings'. I.e. if the file has the line:
Today i say Hellotoall present here. After the replacement the line should be: Today i say Greetings present here.
How can I achieve this with a better regex?
You need just "Hello(\\w)*".
Isn't the output Greetingsoall? The match would be Hellot, so the first thing is that you may want to replace + with *.
As talex pointed out, there is sed syntax mixed in, which doesn't work with Java.
content = content.replaceAll("Hello\\w*", "Greetings");
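Putting it together, a minimal self-contained sketch (the sample line is taken from the question; note the double backslash required in a Java string literal):
public class ReplaceDemo {
    public static void main(String[] args) {
        String content = "Today i say Hellotoall present here.";
        // "Hello\\w*" matches "Hello" plus any following word characters,
        // so the whole word "Hellotoall" is replaced.
        content = content.replaceAll("Hello\\w*", "Greetings");
        System.out.println(content);  // Today i say Greetings present here.
    }
}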

Stanford NER - Unable to identify Phone number

I am training my NER on the entity type Phonenumber, whose part of speech is a number. However, when I test on the same data that I trained on, the phone number is not identified by the classifier.
Is that because the part of speech (POS) of a phone number is a number (CD)?
You might want to use regexner instead for this use case.
Consider this sentence (put it in phone-number-example.txt):
You can reach the office at 555 555-5555.
If you make a regexner rules file like this (call it phone_number.rules, as used in the command below; note each column is tab-separated):
[0-9]{3}\W[0-9]{3}-[0-9]{4} PHONE_NUMBER MISC,NUMBER 1
And run this command:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,regexner -regexner.mapping phone_number.rules -file phone-number-example.txt -outputFormat text
It will identify the phone number in the output NER tagging.
One issue to look out for: you will note the tokenizer turns "555 555-5555" into one token. The first column of the rules file is a regex that matches a token. The regexner patterns are a space-separated list of patterns, one per token you want to NER-tag.
So in this example, the rule I made has a "\W" to capture the space. The rule wasn't working when I used "\s", etc., so I think there is an issue with writing regexes for tokens that contain spaces. Typically tokens don't contain spaces, for that matter.
So you might want to work around this by expanding on "\W" and excluding other characters that you don't want, since "\W" just means non-word characters. Also, you can obviously make the pattern I listed more complicated to capture the various phone number formats.
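For completeness, the same pipeline can be run from Java; a minimal sketch (assuming CoreNLP and its models are on the classpath, and the phone_number.rules file from above):
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class PhoneNumberNerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,regexner");
        props.setProperty("regexner.mapping", "phone_number.rules");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("You can reach the office at 555 555-5555.");
        pipeline.annotate(doc);
        // Print each token with its NER tag; the number should come out as PHONE_NUMBER.
        for (CoreLabel token : doc.get(CoreAnnotations.TokensAnnotation.class)) {
            System.out.println(token.word() + "\t"
                    + token.get(CoreAnnotations.NamedEntityTagAnnotation.class));
        }
    }
}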
More info on RegexNER can be found here:
http://nlp.stanford.edu/software/regexner.html

Configuring the tokenisation of the search term in an elasticsearch query

I am doing a general search against elasticsearch (1.7) using a match query against a number of specified fields. This is done in a Java app with one box to enter search terms in. Various search options are allowed (for example, surrounding a phrase with quotes to look for the phrase, not the component words). This means I am doing full text searches.
All is well except my account refs have forward slashes in them and a search on an account ref produces thousands of results. If I surround the account ref with quotes I get just the result I want. I assume an account ref of AC/1234/A01 is searching for [AC OR 1234 OR A01]. Initially I thought this was a regex issue but I don’t think it is.
I raised a similar question a while ago, and one suggestion which I had thought worked was to add "analyzer": "keyword" to the query (in my code: queryStringQueryBuilder.analyzer("keyword")).
The problem with this is that many of the other fields searched are not keyword fields, and it stops a lot of flexible search options from working (case sensitivity etc.). I assume this has become something along the lines of an exact match in the text search.
I've looked at this the wrong way around for a while now and as I see it I can't fix it in the index or even in the general analyser settings as even if the account ref field is tokenised and analysed perfectly for my requirement the search will still search all the other fields for [AC OR 1234 OR A01].
Is there a way of configuring the search query to not split the account number on forward slashes? I could test ignoring all punctuation if it is possible to only split by whitespaces although I would prefer not to make such a radical change...
So I guess what I am asking is whether there is another built-in analyzer which would still do a full text search but would not split the search term up using punctuation? If not, is this something I could do with a custom analyzer (without applying it to the index itself)?
Thanks.
The simplest way to do it is to replace / with some character that doesn't cause the word to be split into two tokens but doesn't interfere with your other terms (_, . or ' should work), or to remove / completely using a mapping char filter. There is a similar example here: https://stackoverflow.com/a/23640832/783043
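If you go the char filter route, a sketch of what the analysis settings could look like (the analyzer and filter names here are made up, not from the linked answer):
"analysis": {
  "char_filter": {
    "slash_to_underscore": {
      "type": "mapping",
      "mappings": ["/=>_"]
    }
  },
  "analyzer": {
    "ref_safe_analyzer": {
      "type": "custom",
      "tokenizer": "standard",
      "char_filter": ["slash_to_underscore"],
      "filter": ["lowercase"]
    }
  }
}
Applied at both index and search time, this turns AC/1234/A01 into ac_1234_a01, which the standard tokenizer keeps as a single token, so the ref matches exactly without quotes.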

Java parsing text file

I need to write a parser for text files (at least 20 kB), and I need to determine whether words out of a set of words (about 400 words and numbers) appear in the text file. So I am looking for the most efficient possibility to do this (if a match is found, I need to do some further processing of that line and its previous line).
What I currently do is exclude lines that surely do not contain any information (kind of metadata lines) and then compare word by word - but I don't think that comparing word by word is the most efficient possibility.
Can anyone please provide some tips/hints/ideas/...
Thank you very much
It depends on what you mean by "efficient".
If you want a very straightforward way to code it, keep in mind that the String class in Java has the method String.contains(CharSequence sequence).
Then you could put the file content into a String and iterate over your keywords to see if any of them appear in the String, using contains().
How about the following (a runnable sketch follows this outline):
Put all your keywords in a HashSet (Set<String> keywords;)
Read the file one line at a time
For each line in the file:
    Tokenize it into words
    For each word in the line:
        If the word is contained in keywords (keywords.contains(word)):
            Process the actual line
            If a previous line is available:
                Process the previous line
    Keep track of the previous line (prevLine = line;)
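A minimal Java sketch of this outline (the file path and whitespace tokenization are assumptions):
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class KeywordScanner {
    public static void main(String[] args) throws IOException {
        Set<String> keywords = new HashSet<>();
        keywords.add("example");  // ...load your ~400 keywords here

        try (BufferedReader reader = Files.newBufferedReader(Paths.get("input.txt"))) {
            String prevLine = null;
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : line.split("\\s+")) {
                    if (keywords.contains(word)) {
                        process(line);            // process the matching line
                        if (prevLine != null) {
                            process(prevLine);    // ...and the line before it
                        }
                        break;                    // one match per line is enough
                    }
                }
                prevLine = line;
            }
        }
    }

    static void process(String line) {
        System.out.println("match context: " + line);
    }
}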

Lucene Index problems with "-" character

I'm having trouble with a Lucene index which has indexed words that contain "-" characters.
It works for some words that contain "-", but not for all, and I can't find the reason why it's not working.
The field I'm searching in is analyzed and contains versions of the word with and without the "-" character.
I'm using the analyzer: org.apache.lucene.analysis.standard.StandardAnalyzer
Here is an example:
If I search for "gsx-*" I get a result; the indexed field contains
"SUZUKI GSX-R 1000 GSX-R1000 GSXR"
But if I search for "v-*" I get no result. The indexed field of the expected result contains:
"SUZUKI DL 1000 V-STROM DL1000V-STROMVSTROM V STROM"
If I search for "v-strom" without "*" it works, but if I just search for "v-str", for example, I don't get the result. (There should be a result because it's for a live search for a webshop.)
So, what's the difference between the two expected results? Why does it work for "gsx-" but not for "v-"?
StandardAnalyzer will treat the hyphen as whitespace, I believe. So it turns your query "gsx-*" into "gsx*" and "v-*" into nothing, because it also eliminates single-letter tokens. What you see as the field contents in the search result is the stored value of the field, which is completely independent of the terms that were indexed for that field.
So what you want is for "v-strom" as a whole to be an indexed term. StandardAnalyzer is not suited to this kind of text. Maybe have a go with the WhitespaceAnalyzer or SimpleAnalyzer. If that still doesn't cut it, you also have the option of throwing together your own analyzer, or just starting off from those two mentioned and composing them with further TokenFilters. A very good explanation is given in the Lucene Analysis package Javadoc.
BTW there's no need to enter all the variants in the index, like V-strom, V-Strom, etc. The idea is for the same analyzer to normalize all these variants to the same string, both in the index and while parsing the query.
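To see the difference, here is a minimal sketch that prints the tokens each analyzer produces (not from the original answer; Lucene 4.x API assumed, adapt the Version constant to your release):
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        String text = "SUZUKI DL 1000 V-STROM";
        printTokens(new StandardAnalyzer(Version.LUCENE_47), text);   // splits on the hyphen: suzuki dl 1000 v strom
        printTokens(new WhitespaceAnalyzer(Version.LUCENE_47), text); // keeps the term intact: SUZUKI DL 1000 V-STROM
    }

    static void printTokens(Analyzer analyzer, String text) throws Exception {
        TokenStream ts = analyzer.tokenStream("field", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.print(term.toString() + " ");
        }
        ts.end();
        ts.close();
        System.out.println();
    }
}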
ClassicAnalyzer handles '-' as a useful, non-delimiter character. As I understand it, ClassicAnalyzer handles '-' like the pre-3.1 StandardAnalyzer because it uses ClassicTokenizer, which treats numbers with an embedded '-' as a product code, so the whole thing is tokenized as one term.
When I was at Regenstrief Institute I noticed this after upgrading Luke, as the LOINC standard medical terms (LOINC was initiated by R.I.) are identified by a number followed by a '-' and a checkdigit, like '1-8' or '2857-1'. My searches for LOINCs like '45963-6' failed using StandardAnalyzer in Luke 3.5.0, but succeeded with ClassicAnalyzer (and this was because we built the index with the 2.9.2 Lucene.NET).
(Based on Lucene 4.7) StandardTokenizer splits hyphenated words into two: for example, "chat-room" becomes "chat" and "room", and the two words are indexed separately instead of as a single whole word. It is quite common for separate words to be connected with a hyphen: "sport-mad", "camera-ready", "quick-thinking", and so on. A significant number are hyphenated names, such as "Emma-Claire". When doing a whole-word search or query, users expect to find the word within those hyphens. While there are some cases where they are separate words, that is why Lucene keeps the hyphen out of the default definition.
To add support for the hyphen in StandardAnalyzer, you have to make changes in StandardTokenizerImpl.java, which is a class generated from jFlex.
Refer to this link for a complete guide.
You have to add the following line in SUPPLEMENTARY.jflex-macro, which is included by the StandardTokenizerImpl.jflex file:
MidLetterSupp = ( [\u002D] )
After making the changes, provide the StandardTokenizerImpl.jflex file as input to the jFlex engine and click generate. The output will be StandardTokenizerImpl.java.
Then rebuild the index using that class file.
The ClassicAnalyzer is recommended for indexing text containing product codes like 'GSX-R1000'. It recognizes this as a single term and does not split up its parts. But, for example, the text 'Europe/Berlin' will be split up by the ClassicAnalyzer into the words 'Europe' and 'Berlin'. This means if you have a text indexed by the ClassicAnalyzer containing the phrase
Europe/Berlin GSX-R1000
you can search for "europe", "berlin" or "GSX-R1000".
But be careful which analyzer you use for the search. I think the best choice for searching a Lucene index is the KeywordAnalyzer. With the KeywordAnalyzer you can also search for specific fields in a document, and you can build complex queries like:
(processid:4711) (berlin)
This query will search for documents with the phrase 'berlin' but also a field 'processid' containing the number 4711.
But if you search the index for the phrase "europe/berlin" you will get no result! This is because the KeywordAnalyzer does not change your search phrase, but the phrase 'Europe/Berlin' was split up into two separate words by the ClassicAnalyzer. This means you have to search for 'europe' and 'berlin' separately.
To solve this conflict you can translate a search term entered by the user into a search query that fits your needs, using the following code:
// Parse the raw input with the same analyzer that was used for indexing...
QueryParser parser = new QueryParser("content", new ClassicAnalyzer());
Query result = parser.parse(searchTerm);
// ...then render the analyzed query back to a string for the actual search.
searchTerm = result.toString("content");
This code will translate the search phrase
Europe/Berlin
into
europe berlin
which will result in the expected document set.
Note: This will also work for more complex situations. The search term
Europe/Berlin GSX-R1000
will be translated into:
(europe berlin) GSX-R1000
which will search correctly for all phrases in combination using the KeywordAnalyzer.
