Below is the sample data set that I need to group. If you look closely, the lines are mostly similar, differing only in minute details such as the person id or ID.
Unexpected error:java.lang.RuntimeException:Data not found for person 1X99999123 . Clear set not defined . Dump
Unexpected error:java.lang.RuntimeException:Data not found for person 2X99999123 . Clear set not defined . Dump
Unexpected error:java.lang.RuntimeException:Data not found for person 31X9393912 . Clear set not defined . Dump
Unexpected error:java.lang.RuntimeException:Data not found for person 36X9393912 . Clear set not defined . Dump
Exception in thread "main" javax.crypto.BadPaddingException: ID 1 Given final block not properly padded
Exception in thread "main" javax.crypto.BadPaddingException: ID 2 Given final block not properly padded
Unexpected error:java.lang.RuntimeException:Data not found for person 5 . Clear set not defined . Dump
Unexpected error:java.lang.RuntimeException:Data not found for person 6 . Clear set not defined . Dump
Exception in thread "main" java.lang.NullPointerException at TripleDESTest.encrypt(TripleDESTest.java:18)
I want to group them so that the final result is like
Unexpected error:java.lang.RuntimeException:Data not found - 6
Exception in thread "main" javax.crypto.BadPaddingException - 2
Exception in thread "main" java.lang.NullPointerException at - 1
Is there an existing API or algorithm available to handle such cases?
Thanks in Advance.
Cheers
Shakti
The question is tagged machine-learning, so I am going to suggest a classification approach.
You can tokenize each string and use every token from the training set as a boolean feature: an instance has the feature if it contains that token.
Now, using this data, you can build (for instance) a C4.5 decision tree. Make sure the tree is pruned once it is built, and that the minimum number of examples per leaf is greater than 1.
Once the tree is built, the "clustering" is done by the tree itself! Each leaf contains the examples which are considered similar to each other.
You can now extract this data by traversing the classification tree and extracting the samples stored in each leaf into its relevant cluster.
Notes:
This algorithm will fail for the sample data you provided, because it does not handle a unique message well (the NPE in your example) - it will probably end up in the same leaf as the BadPaddingException messages.
No need to reinvent the wheel - you can use Weka, an open-source machine learning library in Java, or other existing libraries for these algorithms.
Instead of using the tokens as binary features, they can also be numerical features: for example, the token's position in the string - is it the 1st or the 10th token?
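To illustrate the tokens-as-boolean-features idea, here is a plain-Java sketch (the class name, vocabulary method, and sample messages are all mine, made up for illustration):

```java
import java.util.*;

public class TokenFeatures {
    // Build the vocabulary: every distinct token seen across all messages.
    static List<String> vocabulary(List<String> messages) {
        Set<String> vocab = new LinkedHashSet<>();
        for (String msg : messages) {
            vocab.addAll(Arrays.asList(msg.split("\\s+")));
        }
        return new ArrayList<>(vocab);
    }

    // One boolean feature per vocabulary token: does the message contain it?
    static boolean[] features(String message, List<String> vocab) {
        Set<String> tokens = new HashSet<>(Arrays.asList(message.split("\\s+")));
        boolean[] f = new boolean[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) {
            f[i] = tokens.contains(vocab.get(i));
        }
        return f;
    }

    public static void main(String[] args) {
        List<String> messages = Arrays.asList(
            "Data not found for person 5",
            "Given final block not properly padded");
        List<String> vocab = vocabulary(messages);
        System.out.println(vocab.size()); // 11 distinct tokens ("not" is shared)
        System.out.println(Arrays.toString(features(messages.get(0), vocab)));
    }
}
```

These boolean vectors are exactly what you would feed to a decision tree learner such as Weka's J48 (its C4.5 implementation).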
I think you should probably create a method that parses the text and filters out the pattern you wish to remove... However, I am not entirely sure of what you want to do...
I think what you want to do can probably be achieved through the StringTokenizer class...
If you know the format of the messages, the easiest way is to use regular expressions and count the matches.
Regular expressions are fully supported in Java, and using them is surely faster than a clustering algorithm.
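As a rough sketch of this approach (the group labels and patterns below are illustrative; you would write one pattern per message family you expect):

```java
import java.util.*;
import java.util.regex.Pattern;

public class LogGrouper {
    // One pattern per known message family; the variable parts (person id,
    // ID number) are matched with wildcards.
    static final Map<String, Pattern> GROUPS = new LinkedHashMap<>();
    static {
        GROUPS.put("Unexpected error:java.lang.RuntimeException:Data not found",
            Pattern.compile("^Unexpected error:java\\.lang\\.RuntimeException:Data not found for person \\S+ .*"));
        GROUPS.put("Exception in thread \"main\" javax.crypto.BadPaddingException",
            Pattern.compile("^Exception in thread \"main\" javax\\.crypto\\.BadPaddingException: ID \\d+ .*"));
    }

    // Count how many log lines fall into each message family.
    static Map<String, Integer> countByGroup(List<String> logs) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logs) {
            for (Map.Entry<String, Pattern> e : GROUPS.entrySet()) {
                if (e.getValue().matcher(line).matches()) {
                    counts.merge(e.getKey(), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> logs = Arrays.asList(
            "Unexpected error:java.lang.RuntimeException:Data not found for person 1X99999123 . Clear set not defined . Dump",
            "Unexpected error:java.lang.RuntimeException:Data not found for person 5 . Clear set not defined . Dump",
            "Exception in thread \"main\" javax.crypto.BadPaddingException: ID 1 Given final block not properly padded");
        countByGroup(logs).forEach((k, n) -> System.out.println(k + " - " + n));
    }
}
```

Unlike the clustering approach, this requires you to know the message formats in advance, but it produces exactly the "prefix - count" output you described.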
The first XPath works whereas the second does not:
First:
"//*[@id='j_idt46:j_username']";
Second:
"//*[contains(@id,'username']";
Why?
From what can be figured out of the information provided, the way you are using contains() is possibly incorrect:
As mentioned by @TuringTux, //*[contains(@id,'username')] - with the closing parenthesis - is the likely fix if the line appears as-is in your code.
Also, a good practice with //*[contains(@id,'username')] would be to replace * with the actual HTML element type.
And lastly, when you access elements using //*[contains(@id,'username')], you may end up getting a list of similar WebElements while you are trying to access only a single one.
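If you want to verify the contains() syntax outside of Selenium, you can run it against the JDK's built-in XPath engine; the tiny XML document below is made up to stand in for the page:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.*;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathContainsDemo {
    // Minimal stand-in for the page: one input with a JSF-style generated id.
    static final String XML =
        "<form><input id='j_idt46:j_username'/><input id='j_idt46:j_password'/></form>";

    // Evaluate an XPath expression and return how many nodes it selects.
    static int countMatches(String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(XML.getBytes("UTF-8")));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Well-formed: ')' closes contains(), then ']' closes the predicate.
        System.out.println(countMatches("//*[contains(@id,'username')]")); // 1

        // The variant from the question is not valid XPath and fails to compile.
        try {
            countMatches("//*[contains(@id,'username']");
        } catch (XPathExpressionException e) {
            System.out.println("invalid XPath expression");
        }
    }
}
```

Selenium's XPath handling behaves the same way: the expression missing its closing parenthesis is rejected as invalid rather than silently matching nothing.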
I have several lists of Strings already classified like
<string> <tag>
088 9102355 PHONE NUMBER
091 910255 PHONE NUMBER
...
Alfred St STREET
German St STREET
...
RE98754TO IDENTIFIER
AUX9654TO IDENTIFIER
...
service open all day long DESCRIPTION
service open from 8 to 22 DESCRIPTION
...
jhon.smith@email.com EMAIL
jhon.smith@anothermail.com EMAIL
...
www.serviceSite.com URL
...
Florence CITY
...
with a lot of strings per tag, and I have to write a Java program which, given
a new list of strings (assumed to all have the same tag), assigns a probability for each tag to the list.
The program has to be completely language independent, and all the knowledge has to come from the lists of tagged strings like the one described above.
I think this problem can be solved with NER approaches (i.e. machine learning algorithms like CRF), but those are usually for unstructured text like a chapter from a book or a paragraph of a web page, not for lists of independent strings.
I thought of using a CRF (Conditional Random Field) because I found a similar approach used in the Karma data integration tool, as described in this article, paragraph 3.1,
where the "semantic types" are my tags.
To tackle the problem I downloaded the Stanford Named Entity Recognizer (NER) and played a bit
with its Java API through NERDemo.java, finding two problems:
the training file for the CRFClassifier has to have one word per row, so I haven't found a way to classify a group of words with a single tag
I don't understand whether I have to make one classifier per tag or a single classifier for all, because a single string could be classified with n different tags and it is the user who chooses between them. So I'm interested in the probability assigned by the classifiers rather than in exact class matching. Furthermore,
I don't have any "no tag" strings, so I don't know how the classifier behaves without them when assigning the probabilities.
Is this the right approach to the problem? Is there a way to use the Stanford NER
or another Java API with CRF, or another suitable machine learning algorithm, to do it?
Update
I managed to train the CRF classifier, first with each word classified independently with its tag and each group of words separated by two commas (classified as "no tag" (0)), then with each group of words as a single word with underscores replacing the spaces, but I got very disappointing results in the small test I made. I haven't quite figured out which features to include and which to exclude from the ones described in the NERFeatureFactory javadoc, considering they can't have anything to do with language.
Update 2
The test results are beginning to make sense. I've separated each string (tagging every token) from the others with two newlines, instead of the horrible "two commas labeled with 0", and I've used the Stanford PTBTokenizer instead of the one I made. Moreover, I've tuned the features, turning on the usePrev and useNext features, using suffix/prefix n-grams up to 6 characters in length, and other things.
The training file named training.tsv has this format:
rt05201201010to identifier
1442955884000 identifier
rt100005154602cv identifier
Alfred street
Street street
Robert street
Street street
and these are the flags in the properties file:
# these are the features we'd like to train with
# some are discussed below, the rest can be
# understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
# word character ngrams will be included up to length 6 as prefixes
# and suffixes only
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useTags=false
useWordPairs=false
useDisjunctive=true
useSequences=false
usePrevSequences=true
useNextSequences=true
# the next flag can have these values: IO, IOB1, IOB2, IOE1, IOE2, SBIEO
entitySubclassification=IO
printClassifier=HighWeight
cacheNGrams=true
# the last 4 properties deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
However, I found another problem: I managed to train only 39 labels with 100 strings each, though I have about 150 labels with more than 1000 strings each; even so, it takes about 5 minutes to train, and if I raise these numbers a bit it throws a Java heap out-of-memory error.
Is there a way to scale up to those numbers with a single classifier? Is it better to train 150 (or fewer, maybe one per two or three labels) small classifiers and combine them later? Do I need to train with 1000+ strings per label, or can I stop at 100 (maybe choosing them to be quite different from one another)?
The first thing you should be aware of is that (linear chain) CRF taggers are not designed for this purpose. They came about as a very nice solution for context-based prediction, i.e. when you have words before and after named entities and you look for clues in a limited window (e.g. 2 words before/after the current word). This is why you had to insert double lines: to delimit sentences. They also provide coherence between the tags assigned to words, which is indeed a good thing in your case.
A CRF tagger should work, but with an extra cost in the learning step which could be avoided by using simpler (maximum entropy, SVM) but still accurate machine learning methods. In Java, for your task, wouldn't Weka be a better solution? I would also consider BIO tagging as not relevant in your case.
Whatever software/coding you use, it is not surprising that n-grams at the character level give good improvements, but I believe you could add dedicated features. For instance, since morphological clues are important (presence of an "@", upper case or digit characters), you may use codes (see ref [1]), which are a very convenient method for describing strings. You'll also most probably obtain better results by using lists of names (a lexicon) that can be triggered as additional features.
[1] Ranking algorithms for named-entity extraction: Boosting and the voted perceptron (Michael Collins, 2002)
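As a rough illustration of the "codes" idea (the exact shape classes and the run-compression choice below are mine, not taken from the paper):

```java
public class WordShape {
    // Collapse each character to a shape class: 'A' for upper case, 'a' for
    // lower case, '0' for digits, anything else kept as-is; consecutive runs
    // of the same class are compressed to a single character.
    static String shape(String s) {
        StringBuilder sb = new StringBuilder();
        char prev = 0;
        for (char c : s.toCharArray()) {
            char cls;
            if (Character.isUpperCase(c)) cls = 'A';
            else if (Character.isLowerCase(c)) cls = 'a';
            else if (Character.isDigit(c)) cls = '0';
            else cls = c;
            if (cls != prev) sb.append(cls);
            prev = cls;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(shape("RE98754TO"));            // A0A
        System.out.println(shape("jhon.smith@email.com")); // a.a@a.a
        System.out.println(shape("088 9102355"));          // 0 0
    }
}
```

Notice how the identifiers, emails, and phone numbers from your training lists each collapse to a distinctive code, which makes this a strong single feature for your tags.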
I am trying to match a URL using the following regex, in Java
^http(s*):\\/\\/.+:[1-65535]/v2/.+/component/.+$
Test fails using URL: https://box:1234/v2/something/component/a/b
I suspect it's the number range that's causing it. Help me understand what I am missing here, please?
See http://www.regular-expressions.info/numericranges.html. You can't just write [1-65535] to match 1 through 65535. That is a character class matching a single digit: 1-6, 5, or 3.
The expression you need is quite verbose, in this case:
([1-9][0-9]{0,3}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])
(Credit to http://utilitymill.com/utility/Regex_For_Range)
Another issue is your http(s*). That needs to be https? because in its current form it might allow httpsssssssss://. If your regex takes public input, this is a concern.
^http(s*) is wrong, it would allow httpssssss://...
You need ^https?
This doesn't affect the given test though.
The character class [1-65535] basically means a single digit from 1 to 6, or 5, or 3.
It would still compile, but you need a + (or *) after the class to match multi-digit ports.
To match the port more precisely you could use [1-6][0-9]{0,4}. That gets you really close, but also allows e.g. 69999 - {m,n} specifies how often the preceding element may repeat (m to n times).
Also take care of that (s*) thing the others pointed out!
That would result in:
^https?:\\/\\/.+:[1-6][0-9]{0,4}/v2/.+/component/.+$
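Putting the pieces together, a small self-contained check of the corrected pattern (using the verbose 1-65535 range from the other answer, and the test URL from the question) might look like this:

```java
import java.util.regex.Pattern;

public class UrlPortRegex {
    // 1-65535, spelled out digit range by digit range.
    static final String PORT =
        "([1-9][0-9]{0,3}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])";

    // https? instead of http(s*); forward slashes need no escaping in Java regex.
    static final Pattern URL = Pattern.compile(
        "^https?://.+:" + PORT + "/v2/.+/component/.+$");

    public static void main(String[] args) {
        System.out.println(URL.matcher("https://box:1234/v2/something/component/a/b").matches());  // true
        System.out.println(URL.matcher("httpsss://box:1234/v2/something/component/a/b").matches()); // false
        System.out.println(URL.matcher("https://box:70000/v2/something/component/a/b").matches());  // false
    }
}
```

The third case shows the advantage of the verbose range over [1-6][0-9]{0,4}: out-of-range ports like 70000 are rejected.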
I'm writing some Java code using MongoDB with the Java API, and I'm unsure about part of the Javadoc.
In a multi-threaded context I use DBCollection#update(com.mongodb.DBObject, com.mongodb.DBObject) to update a unique document, but I saw that two threads could try to write concurrently. In this context I observed that only one write was done, as MongoDB seems to use an optimistic write lock, but I wanted to find out programmatically which thread's write succeeded and which did not. As the "no update" behavior was silent (I mean, no exception or anything), I searched the API for some way to answer my question and after some tests found this method: WriteResult#getN()
public int getN()
Gets the "n" field
Returns:
The description is, hum... not really exhaustive. My tests showed that the thread that wins the write gets 1 from getN(), and the other gets 0.
So my question is: could someone confirm this?
From the getLastError() documentation:
The return value from the command is an object with various fields. The common fields are listed below; there may also be other fields.
ok - true indicates the getLastError command completed successfully. This does NOT indicate there wasn't a last error.
err - if non-null, indicates an error occurred. Value is a textual description of the error.
code - if set, indicates the error code which occurred.
connectionId - the id of the connection
lastOp - the op-id from the last operation
For updates:
n - if an update was done, this is the number of documents updated.
So in this context, 'get "n" field' means get n which is the number of documents updated. Without "multi" being set to true it can only be either 0 or 1.
I recently took a logic quiz/test with questions like: what is the next character in the sequence a, c, b, d, c? Although not complicated, I only managed to complete about half of them within the given time limit.
So I would like for my next try to use: either a script built by me or a tool from the Internet.
Do you have any ideas how to approach this using Java? Are there any classes I could use, or do I have to build it from scratch? I found a tutorial on Java's regex Pattern & Matcher classes, but I'm pretty sure that's not what I am looking for.
Note: It's always a-z chars & usually sets of 6 (+/-1)
What is the legal alphabet for the sequence? Is it always a-z? If so, then predicting the sequence isn't that difficult. You could map the letters to 1-26 for a reasonable 'guesstimator'.
In this example:
1, 3, 2, 4, 3...
+2, -1, +2, -1...
You really need to qualify the question to determine how much modeling is required to solve the problem.
The Simple Problem
In your case, it appears you are picking the nth and n+2th letters in turn (modulo the alphabet length) to continually generate the next letters in the sequence... The sequence might also be staggered by some constant... But in either case, the exact solution can be precisely decoded by a human and implemented in any language.
However, other comments on your question point out that this problem hints at a full-blown, much more interesting problem which is not easily solved by a human, but rather requires heuristics. This prediction problem is relevant to bioinformaticians and artificial intelligence engineers, where we want to predict the next letter or word (e.g. from a text stream or amino acid sequence) in a string given the preceding word/letter sequence...
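A minimal sketch of the simple case, assuming the sequence is generated by a repeating two-step difference pattern (the period-2 assumption is mine; it won't cover every quiz question):

```java
public class SequencePredictor {
    // Predict the next letter assuming the sequence follows a repeating
    // two-step difference pattern (e.g. +2, -1, +2, -1, ...).
    static char next(String seq) {
        int[] diffs = new int[seq.length() - 1];
        for (int i = 0; i < diffs.length; i++) {
            diffs[i] = seq.charAt(i + 1) - seq.charAt(i);
        }
        // Reuse the difference from two positions back (the period-2 assumption).
        int nextDiff = diffs[diffs.length - 2];
        int c = seq.charAt(seq.length() - 1) - 'a' + nextDiff;
        return (char) ('a' + Math.floorMod(c, 26)); // wrap around the alphabet
    }

    public static void main(String[] args) {
        System.out.println(next("acbdc")); // e
    }
}
```

For the quiz's short a-z sequences of about 6 letters this kind of hand-coded rule is enough; a more general solver would try several candidate difference periods and report the one that fits.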
The Full-Blown Problem
This is a classic problem in artificial intelligence which requires machine learning.
The particular type of problem would take, as input:
the preceding sequence,
and output:
a single next character in the sequence.
There is an amino acid predictor algorithm on GitHub which we designed to deal with this problem using machine learning, running in Clojure (see the jayunit100/Rudolf project), if you are interested in a full-blown approach to solving this problem over a 22-amino-acid alphabet.