How can I add a new parameter to TF-IDF in Solr? - java

I'm new to Java, and my research is about improving TF-IDF in Solr. My question is:
How can I add a new parameter (besides freq) to the tf method in Solr?
Should I use overloading?
Thanks
@Override
public float tf(float freq) {
    return (float) Math.sqrt(freq);
}

Yes.
Create your own custom similarity, which will allow you to use any parameter / calculation for each of the different parts of the current formula for the DefaultSimilarity.
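For illustration, a minimal sketch (assuming a Lucene/Solr 4.x-era DefaultSimilarity; the class moved in later versions). The tf(float) signature is fixed by the API, so rather than overloading it you pass any extra parameter in through the similarity itself, e.g. via its constructor; boostFactor here is a hypothetical example:
import org.apache.lucene.search.similarities.DefaultSimilarity;

public class MyCustomSimilarity extends DefaultSimilarity {
    private final float boostFactor; // your extra parameter, besides freq

    public MyCustomSimilarity(float boostFactor) {
        this.boostFactor = boostFactor;
    }

    @Override
    public float tf(float freq) {
        // Combine the extra parameter with the standard sqrt(freq) TF.
        return boostFactor * (float) Math.sqrt(freq);
    }
}
You would then point Solr at it via a custom SimilarityFactory in schema.xml so it gets used at query time.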
If you want to go even deeper, look at Build your own custom lucene query and scorer.


How can I use a custom data model with Deeplearning4j?

The base problem is trying to use a custom data model to create a DataSetIterator to be used in a deeplearning4j network.
The data model I am trying to work with is a java class that holds a bunch of doubles, created from quotes on a specific stock, such as timestamp, open, close, high, low, volume, technical indicator 1, technical indicator 2, etc.
I query an internet source (example; also several other indicators from the same site) which provides JSON strings that I convert into my data model for easier access and to store in an SQLite database.
Now I have a List of these data models that I would like to use to train an LSTM network, each double being a feature. Per the Deeplearning4j documentation and several examples, the way to use training data is to use the ETL processes described here to create a DataSetIterator which is then used by the network.
I don't see a clean way to convert my data model using any of the provided RecordReaders without first converting them to some other format, such as a CSV or other file. I would like to avoid this because it would use up a lot of resources. It seems like there would be a better way to do this simple case. Is there a better approach that I am just missing?
Ethan!
First of all, Deeplearning4j uses ND4J as its backend, so your data will eventually have to be converted into INDArray objects in order to be used in your model. If your training data is two arrays of doubles, inputsArray and desiredOutputsArray, you can do the following:
INDArray inputs = Nd4j.create(inputsArray, new int[]{numSamples, inputDim});
INDArray desiredOutputs = Nd4j.create(desiredOutputsArray, new int[]{numSamples, outputDim});
And then you can train your model using those vectors directly:
for (int epoch = 0; epoch < nEpochs; epoch++)
model.fit(inputs, desiredOutputs);
Alternatively you can create a DataSet object and use it for training:
DataSet ds = new DataSet(inputs, desiredOutputs);
for (int epoch = 0; epoch < nEpochs; epoch++)
model.fit(ds);
But creating a custom iterator is the safest approach, especially with larger sets, since it gives you more control over your data and keeps things organized.
In your DataSetIterator implementation you must pass in your data, and in the implementation of the next() method you should return a DataSet object comprising the next batch of your training data. It would look like this:
public class MyCustomIterator implements DataSetIterator {
    private INDArray inputs, desiredOutputs;
    private int itPosition = 0; // the iterator position in the set.

    public MyCustomIterator(float[] inputsArray,
                            float[] desiredOutputsArray,
                            int numSamples,
                            int inputDim,
                            int outputDim) {
        inputs = Nd4j.create(inputsArray, new int[]{numSamples, inputDim});
        desiredOutputs = Nd4j.create(desiredOutputsArray, new int[]{numSamples, outputDim});
    }

    public DataSet next(int num) {
        // get a view containing the next num samples and desired outs.
        INDArray dsInput = inputs.get(
            NDArrayIndex.interval(itPosition, itPosition + num),
            NDArrayIndex.all());
        INDArray dsDesired = desiredOutputs.get(
            NDArrayIndex.interval(itPosition, itPosition + num),
            NDArrayIndex.all());
        itPosition += num;
        return new DataSet(dsInput, dsDesired);
    }

    // implement the remaining virtual methods...
}
The NDArrayIndex methods you see above are used to access parts of an INDArray. Now you can use it for training:
MyCustomIterator it = new MyCustomIterator(
    inputsArray,
    desiredOutputsArray,
    numSamples,
    inputDim,
    outputDim);
for (int epoch = 0; epoch < nEpochs; epoch++)
model.fit(it);
This example will be particularly useful to you, since it implements an LSTM network and has a custom iterator implementation (which can serve as a guide for implementing the remaining methods). Also, for more information on NDArray, this is helpful: it gives detailed information on creating, modifying and accessing parts of an NDArray.
deeplearning4j creator here.
You should not, except in very special settings, create a data set iterator yourself. You should be using DataVec. We cover this in numerous places, ranging from our DataVec page to our examples:
https://deeplearning4j.konduit.ai/datavec/overview
https://github.com/eclipse/deeplearning4j-examples
DataVec is our dedicated library for doing data transformations. You create custom record readers for your use case. Deeplearning4j, for legacy reasons, has a few "special" iterators for certain datasets; many of those came before DataVec existed. We built DataVec as a way of pre-processing data.
Now you use the RecordReaderDataSetIterator, SequenceRecordReaderDataSetIterator (see our Javadoc for more information) and their multi-dataset equivalents.
If you do this, you don't have to worry about masking, thread safety, or anything else that involves fast loading of data.
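For illustration, a hedged sketch of that path for in-memory data like yours, using DataVec's CollectionRecordReader; the Quote class, its getters, the quotes list, and the label layout are all assumptions standing in for your data model:
import org.datavec.api.records.reader.impl.collection.CollectionRecordReader;
import org.datavec.api.writable.DoubleWritable;
import org.datavec.api.writable.Writable;
import org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import java.util.ArrayList;
import java.util.List;

// Build one Writable row per quote object (Quote is hypothetical).
List<List<Writable>> rows = new ArrayList<>();
for (Quote q : quotes) {
    List<Writable> row = new ArrayList<>();
    row.add(new DoubleWritable(q.getOpen()));
    row.add(new DoubleWritable(q.getClose()));
    row.add(new DoubleWritable(q.getVolume()));
    row.add(new DoubleWritable(q.getTarget())); // label in the last column
    rows.add(row);
}
CollectionRecordReader reader = new CollectionRecordReader(rows);
int labelIndex = 3, batchSize = 32;
// Regression-style iterator: label column range plus regression = true.
DataSetIterator iter =
        new RecordReaderDataSetIterator(reader, batchSize, labelIndex, labelIndex, true);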
As an aside, I would love to know where you are getting the idea to create your own iterator, we now have it right in our readme not to do that. If there's another place you were looking that is not obvious, we would love to fix that.
Edit:
I've updated the links to the new pages. This post is very old now.
Please see the new links here:
https://deeplearning4j.konduit.ai/datavec/overview
https://github.com/eclipse/deeplearning4j-examples

Edit an AST using visitors in ANTLR

I am new to ANTLR and I am struggling to do the following:
What I want to do is after I have parsed a source file (for which I have a valid grammar of course) and I have the AST in memory, to go and change some stuff and then print it back out though the visitor API.
e.g.
int foo() {
    y = x ? 1 : 2;
}
and turn it into:
int foo() {
    if (x) {
        y = 1;
    } else {
        y = 2;
    }
}
Up to now I have the appropriate grammar to parse such syntax and I have also made some visitor methods that are getting called when I am on the correct position. What baffles me is that during visiting I can't change the text.
Ideally I would like to have something like this:
public Void visitTernExpr(SimpleCParser.TernExprContext ctx) {
    ctx.setText("something");
    return null;
}
and in my Main I would like to have this AST edited by different visitors, each of which is specialised in something, like this:
ANTLRInputStream input = new ANTLRInputStream(new FileInputStream(filename));
SimpleCLexer lexer = new SimpleCLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
SimpleCParser parser = new SimpleCParser(tokens);
ProgramContext ctx = parser.program();
MyChecker1 mc1 = new MyChecker1();
mc1.visit(ctx);
MyChecker2 mc2 = new MyChecker2();
mc2.visit(ctx);
ctx.printToFile("myfile");
Is there any way of doing this kind of thing in ANTLR, or am I headed in the wrong direction?
You can do this in ANTLR by smashing the AST nodes and links. You'll have to create all the replacement subtree nodes and splice them in place. Then you'll have to implement the "spit source text" tree walk; I suggest you investigate "string templates" for this purpose.
But ultimately you have to do a lot of work to achieve this effect. This is because the goal of the ANTLR tool is largely focused on "parsing", which pushes the rest onto you.
If what you want to do is to replace one set of syntax with another, what you really want is a program transformation system. These are tools designed to have all of the above built in already, so you don't have to reinvent it all. They also usually have source-to-source transformations, which make accomplishing tasks like the one you have shown much, much easier to implement.
To accomplish your example with our DMS program transformation engine, you'd write a transformation rule and then apply it:
rule replace_ternary_assignment_by_ifthenelse
(l: left_hand_side, c: expression, e1: expression, e2: expression):
statement -> statement
"\l = \c ? \e1 : \e2;"
=> " if (\c) \l = \e1; else \l = \e2 ";
DMS parses your code, builds ASTs, finds matches for the rewrites, and constructs/splices all the replacement nodes for you. Finally, DMS has built-in prettyprinters to regenerate the text. The point of all this is to let you get on with your task of modifying your code, rather than creating a whole new engineering job before you can do your task. Read my essay "Life After Parsing", easily found via my bio or a Google search, for more on this topic. [If you go to the DMS Wikipedia page, you will amusingly find the inverse of this transform used as an example.]
I would use a listener, and yes, you can modify the AST while you are walking through it.
You can create a new instance of the if/else context and then replace the ternary operator context with it. This is possible because you have a reference to the rule's parent and an extensive API to handle every rule's children.
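A related, often simpler technique is ANTLR 4's TokenStreamRewriter, which records edits against the token stream instead of mutating the tree. A hedged sketch (SimpleCBaseListener, TernExprContext and the exitTernExpr hook follow the question's grammar and are assumptions):
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.TokenStreamRewriter;

public class TernaryRewriter extends SimpleCBaseListener {
    final TokenStreamRewriter rewriter;

    public TernaryRewriter(CommonTokenStream tokens) {
        rewriter = new TokenStreamRewriter(tokens);
    }

    @Override
    public void exitTernExpr(SimpleCParser.TernExprContext ctx) {
        // Replace the source span of the whole ternary expression;
        // build the if/else text from the subtrees as needed.
        rewriter.replace(ctx.start, ctx.stop, "/* rewritten if/else */");
    }
}
After walking the tree with a ParseTreeWalker, rewriter.getText() returns the edited source while the original tree stays untouched.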

Use of DerivativeStructure in Apache Commons Math

I am having a hard time understanding how to use DerivativeStructure in Apache Commons Math.
I have a Logit function for which I would like to get the first order derivative. Then I would like to get the value of that derivative on multiple distinct values.
Logit logit = new Logit(0.1, 10.0);
DerivativeStructure ds = // How to instantiate?
DerivativeStructure dsRes = logit.value(ds);
// How to use dsRes to get the value of the derivative function applied on
// several values?
In addition, if there is any document describing how to use DerivativeStructure, I am highly interested!
Thanks for your help.
In the Apache Commons Math User Guide, the section on numerical analysis differentiation gives a reasonable introduction to applying DerivativeStructure.
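For illustration, a minimal sketch against Commons Math 3.x (the sample points are made up). DerivativeStructure is instantiated with the number of free parameters, the derivation order, the parameter index and the point; the derivative value is read back with getPartialDerivative:
import org.apache.commons.math3.analysis.differentiation.DerivativeStructure;
import org.apache.commons.math3.analysis.function.Logit;

Logit logit = new Logit(0.1, 10.0);
double[] points = {0.5, 1.0, 5.0}; // arbitrary sample points inside (0.1, 10.0)
for (double x : points) {
    // one free parameter, derivation order 1, parameter index 0, value x
    DerivativeStructure ds = new DerivativeStructure(1, 1, 0, x);
    DerivativeStructure dsRes = logit.value(ds);
    System.out.println("logit'(" + x + ") = " + dsRes.getPartialDerivative(1));
}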

Weka's PCA is taking too long to run

I am trying to use Weka for feature selection using PCA algorithm.
My original feature space contains ~9000 attributes, in 2700 samples.
I tried to reduce dimensionality of the data using the following code:
AttributeSelection selector = new AttributeSelection();
PrincipalComponents pca = new PrincipalComponents();
Ranker ranker = new Ranker();
selector.setEvaluator(pca);
selector.setSearch(ranker);
Instances instances = SamplesManager.asWekaInstances(trainSet);
try {
selector.SelectAttributes(instances);
return SamplesManager.asSamplesList(selector.reduceDimensionality(instances));
} catch (Exception e ) {
...
}
However, it did not finish running within 12 hours. It is stuck in the method selector.SelectAttributes(instances).
My questions are:
Is such a long computation time expected for Weka's PCA? Or am I using PCA wrongly?
If the long run time is expected:
How can I tune the PCA algorithm to run much faster? Can you suggest an alternative (plus example code showing how to use it)?
If it is not:
What am I doing wrong? How should I invoke PCA using weka and get my reduced dimensionality?
Update: The comments confirm my suspicion that it is taking much more time than expected.
I'd like to know: How can I get PCA in java - using weka or an alternative library.
Added a bounty for this one.
After digging into the Weka code, the bottleneck is creating the covariance matrix and then calculating the eigenvectors of that matrix. Even switching to a sparse matrix implementation (I used COLT's SparseDoubleMatrix2D) did not help.
The solution I came up with was to first reduce the dimensionality using a fast method (I used information gain ranking and filtering based on document frequency), and then apply PCA on the reduced dimensionality to reduce it further.
The code is more complex, but it essentially comes down to this:
Ranker ranker = new Ranker();
InfoGainAttributeEval ig = new InfoGainAttributeEval();
Instances instances = SamplesManager.asWekaInstances(trainSet);
ig.buildEvaluator(instances);
firstAttributes = ranker.search(ig, instances);
candidates = Arrays.copyOfRange(firstAttributes, 0, FIRST_SIZE_REDUCTION);
instances = reduceDimensions(instances, candidates);
PrincipalComponents pca = new PrincipalComponents();
pca.setVarianceCovered(var);
ranker = new Ranker();
ranker.setNumToSelect(numFeatures);
selection = new AttributeSelection();
selection.setEvaluator(pca);
selection.setSearch(ranker);
selection.SelectAttributes(instances);
instances = selection.reduceDimensionality(instances);
However, this method scored worse than using greedy information gain and a ranker when I cross-validated for estimated accuracy.
It looks like you're using the default configuration for the PCA, and judging by the long runtime, it is likely doing far too much work for your purposes.
Take a look at the options for PrincipalComponents.
I'm not sure if -D means it will normalize the data for you or if you have to do it yourself. You want your data to be normalized (centered about the mean), though, so I would do this manually first.
-R sets the amount of variance you want accounted for. Default is 0.95. The correlation in your data might not be good so try setting it lower to something like 0.8.
-A sets the maximum number of attributes to include. I presume the default is all of them. Again, you should try setting it to something lower.
I suggest first starting out with very lax settings (e.g. -R=0.1 and -A=2), then working your way up to acceptable results.
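For illustration, a hedged sketch of those settings through the Java API; the option string maps directly to the flags above, Center is one way to do the manual mean-centering, exact option support may vary by Weka version, and exception handling is omitted:
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.PrincipalComponents;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.Utils;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Center;

// Center the data about the mean first (the manual normalization step).
Center center = new Center();
center.setInputFormat(instances);
Instances centered = Filter.useFilter(instances, center);

// Lax PCA settings to start with: -R variance covered, -A max attributes.
PrincipalComponents pca = new PrincipalComponents();
pca.setOptions(Utils.splitOptions("-R 0.1 -A 2"));
AttributeSelection selector = new AttributeSelection();
selector.setEvaluator(pca);
selector.setSearch(new Ranker());
selector.SelectAttributes(centered);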
For the construction of your covariance matrix, you can use the following formula, which is also used by MATLAB; it is faster than the Apache library. Here Matrix is an m x n matrix (m --> #databaseFaces).

Looking for an expression evaluator

I'm looking for an evaluator for simple condition expressions.
Expressions should include variables (read only), strings, numbers and some basic operators.
E.g. expressions something like this:
${a} == "Peter" && ( ${b} == null || ${c} > 10 )
So far I implemented a rather "magical" parser that returns an AST that I can evaluate, but I can't believe that I'm the first one to solve this problem.
What existing code could I use instead?
Have you looked at MVEL? They provide a getting started guide and performance analysis.
Here's one of their simple examples:
// The compiled expression is serializable and can be cached for re-use.
CompiledExpression compiled = MVEL.compileExpression("x * y");
Map vars = new HashMap();
vars.put("x", new Integer(5));
vars.put("y", new Integer(10));
// Executes the compiled expression
Integer result = (Integer) MVEL.executeExpression(compiled, vars);
assert result.intValue() == 50;
Also (answering my own question) MVEL seems to provide some support for bytecode generation.
Other alternatives, culled from the above answers and my own:
Java Expression Parser (JEP) -- and note there is an old version available for free
Apache Commons JEXL
With regard to Rhino, here's a dude who did some arithmetic evaluation in that context (looks messy)
Sounds like JEXL might work well for you. Check out its syntax reference.
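For illustration, a hedged sketch with the JEXL 3 API (older JEXL 2 code instantiates JexlEngine directly), using the question's expression without the ${} wrappers:
import org.apache.commons.jexl3.JexlBuilder;
import org.apache.commons.jexl3.JexlContext;
import org.apache.commons.jexl3.JexlEngine;
import org.apache.commons.jexl3.JexlExpression;
import org.apache.commons.jexl3.MapContext;

JexlEngine jexl = new JexlBuilder().create();
JexlExpression expr = jexl.createExpression(
        "a == 'Peter' && (b == null || c > 10)");
JexlContext context = new MapContext();
context.set("a", "Peter");
context.set("b", null);
context.set("c", 42);
Object result = expr.evaluate(context); // true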
What about SpEL (Spring Expression Language)? http://static.springsource.org/spring/docs/3.0.x/reference/expressions.html
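For illustration, a hedged sketch; note that SpEL references variables as #a rather than ${a}:
import org.springframework.expression.ExpressionParser;
import org.springframework.expression.spel.standard.SpelExpressionParser;
import org.springframework.expression.spel.support.StandardEvaluationContext;

ExpressionParser parser = new SpelExpressionParser();
StandardEvaluationContext context = new StandardEvaluationContext();
context.setVariable("a", "Peter");
context.setVariable("b", null);
context.setVariable("c", 42);
Boolean result = parser.parseExpression(
        "#a == 'Peter' && (#b == null || #c > 10)")
        .getValue(context, Boolean.class); // Boolean.TRUE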
Why don't you use Rhino? It's a JavaScript engine already present inside the JDK.
It can evaluate whatever you like to write in JS; take a look here.
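For illustration, a hedged sketch through the javax.script API; on Java 6/7 the bundled "JavaScript" engine is Rhino (Nashorn on Java 8+), and the expression syntax maps almost directly. ScriptException handling is omitted:
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
engine.put("a", "Peter");
engine.put("b", null);
engine.put("c", 42);
Object result = engine.eval("a == 'Peter' && (b == null || c > 10)"); // true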
This simple recursive descent parser evaluates constants as named functions having no parameters.
A very simple and easy-to-use alternative with a lot of built-in Excel functions for string, date and number formatting.
The library also allows easy addition of custom functions. A lot of examples are available on the Git page. A simple example using variables:
ExpressionsEvaluator evalExpr = ExpressionsFactory.create("LEFT(City, 3)");
Map<String, Object> variables = new HashMap<String, Object>();
variables.put("City", "New York");
assertEquals("New", evalExpr.eval(variables));
Here is a little library I've worked on that supports expression evaluation (including variables, strings, booleans, etc.).
A little example:
String expression = "EXP(var)";
ExpressionEvaluator evaluator = new ExpressionEvaluator();
evaluator.putVariable(new Variable("var", VariableType.NUMBER, new BigDecimal(20)));
System.out.println("Value of exp(var) : " + evaluator.evaluate(expression).getValue());
