Apache OpenNLP maxent: is it possible to set a 'probability distribution' label?
I have read football.dat, gameLocation.dat, and realTeam.data and tried CreateModel.java and Predict.java in the 'sports' package. The prediction results are class probability distributions such as lose[0.3686] win[0.4416] tie[0.1899], whereas the label at the end of each training line is always a single class, such as win.
Is it possible to use probability distribution labels such as lose[0.3686] win[0.4416] tie[0.1899] in the training data? If not, what are proper ways to handle 'probability distribution' labels, beyond just taking the highest-probability class as the label? For example, is duplicating examples with class labels in proportion to their probabilities a principled approach, or are there other systematic methods?
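As a rough illustration of the duplication idea raised in the question, here is a minimal sketch in plain Java. It does not use any OpenNLP API; the class name SoftLabelExpander, the method expand, and the feature string in the usage note are hypothetical. For a model trained by maximum likelihood, repeating an example in proportion to its label probabilities weights the log-likelihood terms by those probabilities, which matches cross-entropy against the soft label up to rounding. Note that duplication also inflates feature counts, which may interact with any count-based cutoff the trainer applies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of OpenNLP: turns one soft-labelled example
// into several single-label training lines.
final class SoftLabelExpander {

    // context:      the feature part of a training line (everything before the label)
    // distribution: label -> probability, assumed to sum to roughly 1
    // copies:       how many lines to emit in total for this example (approximate,
    //               because each label's share is rounded independently)
    static List<String> expand(String context, Map<String, Double> distribution, int copies) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Double> entry : distribution.entrySet()) {
            long repeats = Math.round(entry.getValue() * copies);
            for (long k = 0; k < repeats; k++) {
                // One line per copy, with the single class label at the end.
                lines.add(context + " " + entry.getKey());
            }
        }
        return lines;
    }
}
```

For instance, calling expand with a made-up context such as "home=teamA", the distribution from the question, and copies = 100 would produce lines ending in lose, win, and tie in roughly a 37:44:19 ratio.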
I'm defining a lot of counters in my app (using Java Micrometer). To trigger alerts, I tag the counters I want to monitor with "error":"alert", so a query like {error="alert"} returns multiple range vectors:
error_counter_component1{error="alert", label2="random"}
error_counter_component2{error="alert", label2="random2"}
error_counter_component3{error="none", label2="random3"}
I don't control the names of the counters; I can only add the label to the counters I want to use in my alert. The alert I want should fire if all the counters labeled with error="alert" increase by more than 3 in one hour, so I could use this kind of query: increase({error="alert"}[1h]) > 3. But I get the following error in Prometheus: Error executing query: vector cannot contain metrics with the same labelset
Is there a way to merge two range vectors or should I include some kind of tag in the name of the counter? Or should I have a single counter for errors and the tags should specify the source something like this:
errors_counter{source="component1", use_in_alerts="yes"}
errors_counter{source="component2", use_in_alerts="yes"}
errors_counter{source="component3", use_in_alerts="no"}
The version with the source="componentX" label fits the Prometheus data model much better. This assumes that error_counter is really one metric and that, apart from the source label value, it will have the same labels (for example, because it is emitted by the same library or framework).
Adding something like a use_in_alerts label is not a great solution. Such a label does not identify a time series.
I'd suggest keeping the list of components to alert on wherever your alerting queries are constructed, and dynamically creating separate alerting rules (without adding such a label to the raw data).
Another solution is to have a separate pseudo-metric that is only used to provide metadata about components, like:
component_alert_on{source="component2"} 1
and combine it in the alerting rule so that you only alert on the components you need. It can be generated in any way you like; one possibility is to add it in a static recording rule. The downside is that this complicates the alerting query somewhat.
But of course the use_in_alerts label will probably also work (at least while you are only alerting on this metric).
I am attempting to render around 500 points of data at a time for a hierarchical clustering operation as a demonstration using Java and OpenGL. In order to show its steps, I would like to color individual points so that they are easily distinguishable and, when clusters merge, it's obvious which ones are merging.
I have used this list. But after removing hard-to-distinguish colors and colors that are too light for my white background, I'm left with fewer than 50.
Is there a method to create unique, easily distinguished colors? I would need around 500 generated. I'd prefer a method if possible so that I do not have to handcode (or awk/sed) a list of them.
After some experimentation this problem seemed fairly difficult, if not downright impossible. Before I move on to numbering each cluster and hoping the numbers render correctly, I wanted to ask whether this is possible and, if so, what the best method to achieve it would be.
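One common way to generate such a palette programmatically is to step the hue by the golden-ratio conjugate and vary saturation and brightness in a few bands. The sketch below assumes that approach; the class name DistinctColors, the band values, and the brightness cap (chosen so colors stay visible on a white background) are illustrative choices, not a known-best answer. It uses only java.awt.Color.getHSBColor from the standard library. With 500 colors, neighbours in the palette will still be close, so this probably helps more for a few dozen clusters than for all 500 points.

```java
import java.awt.Color;
import java.util.ArrayList;
import java.util.List;

// Illustrative palette generator; the constants are assumptions to tune by eye.
final class DistinctColors {

    // Stepping the hue by this amount spreads consecutive hues around the color wheel.
    private static final double GOLDEN_RATIO_CONJUGATE = 0.6180339887498949;

    static List<Color> generate(int count) {
        List<Color> colors = new ArrayList<>(count);
        double hue = 0.0;
        for (int i = 0; i < count; i++) {
            // Split the sequence into 9 consecutive blocks; each block gets its own
            // saturation/brightness band, and hues are spread out within the block.
            int band = (int) ((long) i * 9 / count);
            float saturation = 0.55f + 0.15f * (band % 3);  // 0.55, 0.70, 0.85
            float brightness = 0.45f + 0.15f * (band / 3);  // 0.45, 0.60, 0.75 (never near white)
            colors.add(Color.getHSBColor((float) hue, saturation, brightness));
            hue = (hue + GOLDEN_RATIO_CONJUGATE) % 1.0;
        }
        return colors;
    }
}
```

The resulting java.awt.Color values can be converted to OpenGL float components with getRGBColorComponents(null).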
I am using 10-fold cross-validation to train on 200K records. The target class attribute is:
Status {PASS,FAIL}
PASS has ~144K and FAIL has ~6K instances.
When I train the model using J48, it is not able to find the failures. The accuracy is 95%, but in most cases it just predicts success, whereas in our case we need to find the failures that are actually happening.
So my question is mainly a hypothetical analysis.
Does the distribution of class instances (in my case PASS and FAIL) really matter during training?
Which parameter values for Weka's J48 tree would help it train better, given that I see about 2% failures in every 1,000 records I pass in? As it stands, the number of predicted successes will only grow if we add more success scenarios.
What should the ratio between the classes be in order to train better?
I could not find anything in the API concerning this ratio.
I am not adding the code because this happens both with the Java API and with the Weka GUI tool.
Many Thanks.
The problem here is that your dataset is very unbalanced. You do have a few options on how to help your classification task:
Generate synthetic instances for your minority class using an algorithm like SMOTE. This should increase your performance.
It's not possible in every case, but you could maybe try splitting your majority class into a couple of smaller classes. This would help the balance.
I believe Weka has a One Class Classifier. It learns the decision boundary of the larger class and treats the minority class as outliers, which hopefully allows for better classification. See here for Weka's implementation.
Edit:
You could also use a classifier that weights classifications based on whether they are correct or not, so that some mistakes cost more than others. Again, Weka has this as a meta classifier that can be applied to most base classifiers; see here again.
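To make the imbalance concrete, the following plain-Java sketch (no Weka calls; the class and method names are hypothetical) evaluates a baseline that always predicts PASS on the class counts quoted in the question, about 144K PASS and 6K FAIL. It also computes inverse-frequency class weights, which are one possible starting point for the weighting options above, not a recommended setting.

```java
// Hypothetical helper for reasoning about an imbalanced PASS/FAIL split.
final class ImbalanceBaseline {

    // Fraction of all predictions that are correct.
    static double accuracy(long truePos, long trueNeg, long falsePos, long falseNeg) {
        long total = truePos + trueNeg + falsePos + falseNeg;
        return (truePos + trueNeg) / (double) total;
    }

    // Fraction of actual positives (here: FAIL records) that were found.
    static double recall(long truePos, long falseNeg) {
        long positives = truePos + falseNeg;
        return positives == 0 ? 0.0 : truePos / (double) positives;
    }

    // Inverse-frequency weight: total / (numClasses * classCount).
    static double balancedWeight(long total, int numClasses, long classCount) {
        return total / (double) (numClasses * classCount);
    }

    public static void main(String[] args) {
        long pass = 144_000, fail = 6_000;  // approximate counts from the question
        // Always predicting PASS, with FAIL as the positive class:
        // no true positives, every FAIL is a false negative.
        System.out.println("accuracy    = " + accuracy(0, pass, 0, fail));
        System.out.println("FAIL recall = " + recall(0, fail));
        System.out.println("PASS weight = " + balancedWeight(pass + fail, 2, pass));
        System.out.println("FAIL weight = " + balancedWeight(pass + fail, 2, fail));
    }
}
```

On these counts the always-PASS baseline reaches 96% accuracy while finding none of the failures, which is why recall on FAIL (or a per-class confusion matrix) is a more useful measure here than overall accuracy.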
I am working on a project in Java that involves fitting a simple linear regression line through a rolling / sliding window of n data points. For each new point added the linear regression slope and intercept need to be re-calculated. We currently use org.apache.commons.math3.stat.regression.SimpleRegression to do this calculation, but it is expensive to re-calculate the entire window each time.
So I have two questions.
The SimpleRegression offered by Apache Commons has a removeData method for use in a "streaming mode" (see the quotation below, taken from the API). However, there is no other information on this "streaming mode": proper implementation, accumulated error, etc. Does anyone have an example of how to use it properly in streaming mode? Or can anyone point me to a better source of information?
"This method permits the use of SimpleRegression instances in streaming mode where the regression is applied to a sliding "window" of observations, however the caller is responsible for maintaining the set of observations in the window."
Are there any other libraries available that can do streaming linear regression in constant time? Surely Apache Commons SimpleRegression is not the only one... but that seems to be the only one I can find...
Thanks,
g
Looking at the source code, I suppose this only means that there is a removeData method which reverses the calculations done by addData. So the caller has to remember the data in the window and remove each point when it "exits" the window.
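The bookkeeping described above can be sketched without any library. The class below is a hypothetical stand-in, not the Commons Math implementation: it keeps the window in an ArrayDeque and maintains running sums, so each update costs constant time. With SimpleRegression the same pattern would call addData(x, y) for the new point and removeData(x, y) for the point leaving the window. Repeatedly adding and subtracting floating-point sums can accumulate rounding error, so a long-running stream may need an occasional full recomputation from the stored window.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical constant-time sliding-window least-squares fit of y = intercept + slope * x.
final class SlidingWindowRegression {
    private final int capacity;
    private final Deque<double[]> window = new ArrayDeque<>();  // each entry is {x, y}
    private double sumX, sumY, sumXX, sumXY;

    SlidingWindowRegression(int capacity) {
        if (capacity < 2) {
            throw new IllegalArgumentException("window must hold at least 2 points");
        }
        this.capacity = capacity;
    }

    // Adds a point; if the window is full, the oldest point is removed first.
    void add(double x, double y) {
        if (window.size() == capacity) {
            double[] old = window.removeFirst();
            sumX -= old[0];
            sumY -= old[1];
            sumXX -= old[0] * old[0];
            sumXY -= old[0] * old[1];
        }
        window.addLast(new double[] {x, y});
        sumX += x;
        sumY += y;
        sumXX += x * x;
        sumXY += x * y;
    }

    int size() {
        return window.size();
    }

    // Returns NaN when the slope is undefined (fewer than 2 points or all x equal).
    double slope() {
        int n = window.size();
        double denominator = n * sumXX - sumX * sumX;
        if (n < 2 || denominator == 0.0) {
            return Double.NaN;
        }
        return (n * sumXY - sumX * sumY) / denominator;
    }

    double intercept() {
        return (sumY - slope() * sumX) / window.size();
    }
}
```

Feeding it points on the line y = 2x + 1 with a window of 4 should give a slope close to 2 and an intercept close to 1, regardless of how many points have passed through.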
I am wondering if there are libraries that could help me draw such figures on screen quickly using Java.
The dataset and number of nodes etc need to be parametrized.
If no such libraries exist, which tools in Swing would get me started? I want a quick and dirty way to represent this information.
Edit:
It would also help if you could tell me what to search for on Google to get results for such a specific query.
You can call GraphViz from within Java, converting any Java-based tree structure into the necessary GraphViz formats, and then reading the resulting .png image back into Java. That is probably the easiest approach, in terms of code-to-write (credit goes to SyntaxT3rr0r for proposing it first).
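A minimal sketch of the first step, converting a Java tree into GraphViz's DOT text, might look like the following. The Node class and the TreeToDot name are hypothetical, and the sketch assumes node labels are unique and contain no double quotes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tree-to-DOT converter; labels are assumed unique and quote-free.
final class TreeToDot {

    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label) {
            this.label = label;
        }

        // Returns this node so calls can be chained while building a tree.
        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Produces a DOT digraph with one declaration per node and one edge per parent-child link.
    static String toDot(Node root) {
        StringBuilder sb = new StringBuilder("digraph Tree {\n");
        append(root, sb);
        return sb.append("}\n").toString();
    }

    private static void append(Node node, StringBuilder sb) {
        sb.append("  \"").append(node.label).append("\";\n");
        for (Node child : node.children) {
            sb.append("  \"").append(node.label).append("\" -> \"")
              .append(child.label).append("\";\n");
            append(child, sb);
        }
    }
}
```

The resulting text can be written to a file such as tree.dot and rendered with the GraphViz command line (dot -Tpng tree.dot -o tree.png), for example via ProcessBuilder; the PNG can then be loaded with javax.imageio.ImageIO.read. This assumes GraphViz is installed and on the PATH.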
Customizing JGraph would also work, but I doubt that any of the default node-types would cut it. There are examples in the manual covering how to code your own node types and representations. JGraph allows easy graphical editing of node labels and positions, has hierarchical layouts (the type you use for trees); and it supports "ports" of origin (and also destination) for your parent-child edges. You can try their editor demo (included in their default download) if you just want a quick test.