How to get facet ranges in solr results? - java

Assume I have a field called price on my Solr documents and that field is faceted. I want to get the facets as ranges of values (e.g. 0-100, 100-500, 500-1000, etc.). How can I do that?
I can specify the ranges beforehand, but I would also like to know whether it is possible to calculate the ranges (say, 5 buckets) automatically based on the values in the documents.

To answer your first question, you can get facet ranges by using the generic facet query support. Here's an example:
http://localhost:8983/solr/select?q=video&rows=0&facet=true&facet.query=price:[*+TO+500]&facet.query=price:[500+TO+*]
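Since the question is tagged java: the same two facet queries issued through SolrJ might look roughly like the sketch below (the client class and core URL depend on your SolrJ version and setup; getFacetQuery() returns the count for each facet.query string):
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import java.util.Map;

public class PriceFacetExample {
    public static void main(String[] args) throws Exception {
        // URL assumes a core named "collection1"; adjust to your setup
        SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build();

        SolrQuery query = new SolrQuery("video");
        query.setRows(0);                        // only the facet counts are needed
        query.setFacet(true);
        query.addFacetQuery("price:[* TO 500]");
        query.addFacetQuery("price:[500 TO *]");

        QueryResponse response = client.query(query);
        Map<String, Integer> counts = response.getFacetQuery();  // facet.query string -> count
        counts.forEach((range, count) -> System.out.println(range + " -> " + count));

        client.close();
    }
}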
As for your second question (automatically suggesting facet ranges), that's not yet implemented. Some argue that this kind of querying would be best implemented in your application rather than letting Solr "guess" the best facet ranges.
Here are some discussions on the topic:
(Archived) https://web.archive.org/web/20100416235126/http://old.nabble.com/Re:-faceted-browsing-p3753053.html
(Archived) https://web.archive.org/web/20090430160232/http://www.nabble.com/Re:-Sorting-p6803791.html
(Archived) https://web.archive.org/web/20090504020754/http://www.nabble.com/Dynamically-calculated-range-facet-td11314725.html

I have worked out how to calculate sensible dynamic facets for product price ranges. The solution involves some pre-processing of documents and some post-processing of the query results, but it requires only one query to Solr, and should even work on old versions of Solr like 1.4.
Round up prices before submission
First, before submitting the document, round up the price to the nearest "nice round facet boundary" and store it in a "rounded_price" field. Users like their facets to look like "250-500" not "247-483", and rounding also means you get back hundreds of price facets not millions of them. With some effort the following code can be generalised to round nicely at any price scale:
public static decimal RoundPrice(decimal price)
{
if (price < 25)
return Math.Ceiling(price);
else if (price < 100)
return Math.Ceiling(price / 5) * 5;
else if (price < 250)
return Math.Ceiling(price / 10) * 10;
else if (price < 1000)
return Math.Ceiling(price / 25) * 25;
else if (price < 2500)
return Math.Ceiling(price / 100) * 100;
else if (price < 10000)
return Math.Ceiling(price / 250) * 250;
else if (price < 25000)
return Math.Ceiling(price / 1000) * 1000;
else if (price < 100000)
return Math.Ceiling(price / 2500) * 2500;
else
return Math.Ceiling(price / 5000) * 5000;
}
Permissible prices go 1,2,3,...,24,25,30,35,...,95,100,110,...,240,250,275,300,325,...,975,1000 and so forth.
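The snippet above is C#; since the question is tagged java, a rough Java equivalent of the same rounding scheme might look like this (a sketch using double for brevity; for real money you would likely work with BigDecimal):
public static long roundPrice(double price) {
    // granularity of the rounding grows with the price, as in the C# version above
    if (price < 25)          return (long) Math.ceil(price);
    else if (price < 100)    return (long) (Math.ceil(price / 5) * 5);
    else if (price < 250)    return (long) (Math.ceil(price / 10) * 10);
    else if (price < 1000)   return (long) (Math.ceil(price / 25) * 25);
    else if (price < 2500)   return (long) (Math.ceil(price / 100) * 100);
    else if (price < 10000)  return (long) (Math.ceil(price / 250) * 250);
    else if (price < 25000)  return (long) (Math.ceil(price / 1000) * 1000);
    else if (price < 100000) return (long) (Math.ceil(price / 2500) * 2500);
    else                     return (long) (Math.ceil(price / 5000) * 5000);
}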
Get all facets on rounded prices
Second, when submitting the query, request all facets on rounded prices sorted by price: facet.field=rounded_price. Thanks to the rounding, you'll get at most a few hundred facets back.
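In SolrJ (reusing the client from the earlier sketch) that request could look roughly like this; facet.limit=-1 returns every bucket and facet.sort=index keeps the buckets in value (price) order:
SolrQuery query = new SolrQuery("*:*");
query.setRows(0);
query.setFacet(true);
query.addFacetField("rounded_price");
query.setFacetLimit(-1);             // return all buckets, not just the most frequent ones
query.set("facet.sort", "index");    // order buckets by value rather than by count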
Combine adjacent facets into larger facets
Third, after you have the results, the user wants to see only 3 to 7 facets, not hundreds. So combine adjacent facets into a few large facets (called "segments"), trying to get a roughly equal number of documents in each segment. The following rather more complicated code does this, returning tuples of (start, end, count) suitable for performing range queries. The counts returned will be correct provided prices have been rounded up to the nearest boundary:
public static List<Tuple<string, string, int>> CombinePriceFacets(int nSegments, ICollection<KeyValuePair<string, int>> prices)
{
var ranges = new List<Tuple<string, string, int>>();
int productCount = prices.Sum(p => p.Value);
int productsRemaining = productCount;
if (nSegments < 2)
return ranges;
int segmentSize = productCount / nSegments;
string start = "*";
string end = "0";
int count = 0;
int totalCount = 0;
int segmentIdx = 1;
foreach (KeyValuePair<string, int> price in prices)
{
end = price.Key;
count += price.Value;
totalCount += price.Value;
productsRemaining -= price.Value;
if (totalCount >= segmentSize * segmentIdx)
{
ranges.Add(new Tuple<string, string, int>(start, end, count));
start = end;
count = 0;
segmentIdx += 1;
}
if (segmentIdx == nSegments)
{
ranges.Add(new Tuple<string, string, int>(start, "*", count + productsRemaining));
break;
}
}
return ranges;
}
Filter results by selected facet
Fourth, suppose ("250","500",38) was one of the resulting segments. If the user selects "$250 to $500" as a filter, simply do a filter query fq=price:[250 TO 500]
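In SolrJ (again reusing the query from the earlier sketch) the selected segment simply becomes a filter query:
query.addFilterQuery("price:[250 TO 500]");   // restrict the result set to the chosen price segment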

There may well be a better Solr-specific answer, but I work with straight Lucene, and since you're not getting much traction I'll take a stab. There, I'd create and populate a Filter with a FilteredQuery wrapping the original Query. Then I'd get a FieldCache for the field of interest. Enumerate the hits in the filter's bitset, and for each hit get the value of the field from the field cache and add it to a SortedSet. When you've got all of the hits, divide the set into the number of ranges you want (five to seven is a good number according to the user-interface guys), and rather than a single-valued constraint, your facets will be range queries with the lower and upper bounds of each of those subsets.
I'd recommend using some special-case logic for a small number of values; obviously, if you only have four distinct values, it doesn't make sense to try and make 5 range refinements out of them. Below a certain threshold (say 3*your ideal number of ranges), you just show the facets normally rather than ranges.
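To illustrate just the splitting step in plain Java, independent of the Lucene plumbing (here a TreeMap of value -> hit count stands in for the SortedSet so that duplicate values are weighted):
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RangeSplitter {
    // Splits observed values (value -> hit count) into at most nRanges [lower, upper] pairs
    // holding roughly equal numbers of hits. Assumes the map is non-empty.
    public static List<double[]> split(TreeMap<Double, Integer> valueCounts, int nRanges) {
        int total = valueCounts.values().stream().mapToInt(Integer::intValue).sum();
        int perRange = Math.max(1, total / nRanges);
        List<double[]> ranges = new ArrayList<>();
        double lower = valueCounts.firstKey();
        int inRange = 0;
        for (Map.Entry<Double, Integer> e : valueCounts.entrySet()) {
            inRange += e.getValue();
            if (inRange >= perRange && ranges.size() < nRanges - 1) {
                ranges.add(new double[] { lower, e.getKey() });   // close this range at the current value
                lower = e.getKey();
                inRange = 0;
            }
        }
        ranges.add(new double[] { lower, valueCounts.lastKey() }); // last range takes the remainder
        return ranges;
    }
}
Each returned pair then becomes the lower and upper bound of one range query, as described above.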

You can use Solr's built-in range faceting:
http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range

Related

Optimization: Finding the best Simple Moving Average takes too much time

I've created a simple Spring application with a MySQL DB.
The DB contains 20 years of stock data (5,694 rows).
The goal is to find the best moving average (N) for those 20 years of stock data. The inputs are the closing prices of every trading day.
The calculated average depends on N. For example, if N=3, the average for a reference day t is given by ((t-1)+(t-2)+(t-3))/N.
The output is the best moving average (N) and the result achieved with all the buying & selling transactions for that best N.
I did not find a proper algorithm on the Internet, so I implemented the following:
For every N (249 times) the program performs these steps:
SQL query: calculate the averages & return the list
@Repository
public interface StockRepository extends CrudRepository<Stock, Integer> {
/*
* This SQL query calculates the moving average for the value n
*/
@Query(value = "SELECT a.date, a.close, Round( ( SELECT SUM(b.close) / COUNT(b.close) FROM stock AS b WHERE DATEDIFF(a.date, b.date) BETWEEN 0 AND ?1 ), 2 ) AS 'avg' FROM stock AS a ORDER BY a.date", nativeQuery = true)
List<AverageDTO> calculateAverage(int n);
}
Simulate buying & selling -> calculate the result
Compare result with bestResult
Next N
@RestController
public class ApiController {
@Autowired
private StockRepository stockRepository;
@CrossOrigin(origins = "*")
@GetMapping("/getBestValue")
/*
* This function tries all possible values in the interval [min,max], calculates
* the moving avg and simulates the gains for each value to choose the best one
*/
public ResultDTO getBestValue(@PathParam("min") int min, @PathParam("max") int max) {
Double best = 0.0;
int value = 0;
for (int i = min; i <= max; i++) {
Double result = simulate(stockRepository.calculateAverage(i));
if (result > best) {
value = i;
best = result;
}
}
return new ResultDTO(value, best);
}
/*
* This function gets the close price and moving average of a stock as input and
* simulates the buying/selling process
*/
public Double simulate(List<AverageDTO> list) {
Double result = 0.0;
Double lastPrice = list.get(0).getClose();
for (int i = 1; i < list.size(); i++) {
if (list.get(i - 1).getClose() < list.get(i - 1).getAvg()
&& list.get(i).getClose() > list.get(i).getAvg()) {
// buy
lastPrice = list.get(i).getClose();
} else if (list.get(i - 1).getClose() > list.get(i - 1).getAvg()
&& list.get(i).getClose() < list.get(i).getAvg()) {
// sell
result += (list.get(i).getClose() - lastPrice);
lastPrice = list.get(i).getClose();
}
}
return result;
}
}
When I set Min=2 and Max=250 it takes 45 minutes to finish.
Since I'm a beginner in Java & Spring, I don't know how I can optimize it.
I'd appreciate any input.
This problem is equivalent to finding the best moving sum of N values; simply divide by N afterwards. Given such a slice, the next slice is obtained by subtracting its first value and adding the next value at the end. This could lead to an algorithm for finding local growth where a[i + N] - a[i] >= 0.
However in this case a simple sequential ordered query with
double[] slice = new double[N];
double sum = 0.0;
suffices. (A skipping algorithm on a database is probably too complicated.)
Simply walk through the table, keeping the slice as a sliding window of N values (and their keys), and maintaining the maximum so far.
Use the primitive type double instead of the object wrapper Double.
If the database transport is a serious factor, a stored procedure would do. Keeping a massive table as that many entities just for a running maximum is unfortunate.
It would be better to have a condensed table, or better yet a field holding the sum of N values.
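A minimal sketch of that sliding window in plain Java (method and variable names are made up; closes holds the closing prices in date order):
public static double[] movingAverages(double[] closes, int n) {
    // avg[i] is the mean of closes[i-n+1..i]; entries before index n-1 are left at 0
    double[] avg = new double[closes.length];
    double sum = 0.0;
    for (int i = 0; i < closes.length; i++) {
        sum += closes[i];            // newest value enters the window
        if (i >= n) {
            sum -= closes[i - n];    // oldest value leaves the window
        }
        if (i >= n - 1) {
            avg[i] = sum / n;        // window is full
        }
    }
    return avg;
}
Loading the closing prices once and computing the averages like this for every N from 2 to 250 is on the order of a million simple operations, so it is essentially instant; the 249 SQL round trips, each running a correlated subquery per row, are most likely where the 45 minutes go.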

CPLEX maximization of revenue

I'm working on a project in CPLEX and this is the case:
it's a chemical plant where 2 final products are produced and sold
there are 3 reactors and each reactor can perform different tasks, one at a time
the objective function maximizes the total profit
the solution shows the value calculated for this objective function, the sequence of activation of each reactor, and the profit from each cycle
Problem: Now I've been given the choice to add one more reactor (it can be any of the 3, each with a different price) or not to buy one at all.
The objective remains the same: to maximize the revenue. I can't seem to put this decision into code so that I can obtain the best-case result, because:
profit and cost (of the renewable resources, i.e. the reactants) depend on the amount r produced and the time t
the InitialStock depends on the number of reactors as well, so it depends on the decision of how many reactors are running, and that in turn depends on the maximum revenue of each case
this is my first project :S
// Data Declaration
int MaxTime = ...;
range Time = 0..MaxTime;
{int} Tasks = ...;
{string} nrenuableR=...;
{string} renuableR=...;
{string} renuableRusedbyT[Tasks]=...;
{string} Resources= nrenuableR union renuableR;
int procTime[Tasks]= ...;
int minbatchsize[renuableR][Tasks] =...;
int maxbatchsize [renuableR][Tasks] =...;
int MaxAmountStock_nR[nrenuableR]=...;
int maxRenuableR[renuableR][Time] =...;
int InitialStock[Resources]=...;
int Profit[nrenuableR]=...;
float nRcosts[nrenuableR]=...;
int MaxTheta = ...;
range Theta=0..MaxTheta;
float Mu[Tasks][Resources][Theta] = ...;
float Nu[Tasks][Resources][Theta] = ...;
//Decision Variables
dvar boolean N[Tasks][Time];
dvar float+ Csi[Tasks][Time];
dvar int+ R[Resources][Time];
//Objective Function
dexpr float ObjFunction = sum (r in nrenuableR)(R[r][MaxTime] - InitialStock[r])*(Profit[r] - nRcosts[r]);
maximize ObjFunction;
//Constraints
subject to {
//Resources Capacity
forall (r in renuableR) forall(t in Time) R[r][t] <= maxRenuableR[r][t];
forall (r in nrenuableR) forall (t in Time) R[r][t] <= MaxAmountStock_nR[r];
//Batch Size + linking constraints
forall (k in Tasks, r in renuableRusedbyT[k], t in Time) minbatchsize[r][k] * N[k][t] <= Csi[k][t];
forall (k in Tasks, r in renuableRusedbyT[k], t in Time) maxbatchsize[r][k]*N[k][t] >= Csi[k][t];
//Resource Balance
forall(r in Resources) R[r][0] == InitialStock[r] + sum(k in Tasks) (Mu[k][r][0] * N[k][0] + Nu[k][r][0] * Csi[k][0]);
forall(r in Resources,t in Time: t>0) R[r][t] == R[r][t-1] + sum(k in Tasks,theta in Theta: t - theta >=0) (Mu[k][r][theta] * N[k][t - theta] + Nu[k][r][theta] * Csi[k][t - theta]);
}
I am not clear about the meaning of your decision variables, so I cannot give a detailed answer.
A general approach to extend the model is this:
Create a new decision variable IsUsed that is 1 for each reactor if and only if the respective reactor is used.
Add a constraint that says that if IsUsed is 0 for a reactor then the number of tasks performed on this reactor is 0.
Add to the objective function a term IsUsed * Cost for each reactor, modeling the fixed cost of opening that reactor.
For the initial stock, you can multiply the initial stock by IsUsed for each reactor. Then the initial stock is 0 if the reactor is not used, and the original initial stock if it is. A rough sketch of this extension follows.
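The model above is in OPL, but since this thread is otherwise Java, here is a very rough sketch of the same idea using the CPLEX Concert Java API. Everything in it is hypothetical (reactor-indexed start variables nStart, an existing profitExpr, per-reactor purchase costs); it only illustrates the IsUsed linking constraint and the fixed-cost term, not the full model:
import ilog.concert.IloException;
import ilog.concert.IloIntVar;
import ilog.concert.IloLinearNumExpr;
import ilog.concert.IloNumExpr;
import ilog.cplex.IloCplex;

public class ReactorChoiceSketch {
    // nStart[r][k][t]: hypothetical binary "task k starts on reactor r at time t" variables;
    // profitExpr: the existing profit expression; reactorCost[r]: purchase cost of reactor r.
    public static void addReactorChoice(IloCplex cplex, IloIntVar[][][] nStart,
                                        IloNumExpr profitExpr, double[] reactorCost) throws IloException {
        int nReactors = nStart.length;
        IloIntVar[] isUsed = cplex.boolVarArray(nReactors);
        double bigM = 10000;   // any valid upper bound on the number of task starts per reactor
        for (int r = 0; r < nReactors; r++) {
            IloLinearNumExpr starts = cplex.linearNumExpr();
            for (IloIntVar[] byTime : nStart[r]) {
                for (IloIntVar v : byTime) {
                    starts.addTerm(1.0, v);
                }
            }
            // if the reactor is not bought/used, no task may start on it
            cplex.addLe(starts, cplex.prod(bigM, isUsed[r]));
        }
        // fixed purchase cost is paid only for reactors that are actually used
        IloLinearNumExpr fixedCost = cplex.linearNumExpr();
        for (int r = 0; r < nReactors; r++) {
            fixedCost.addTerm(reactorCost[r], isUsed[r]);
        }
        cplex.addMaximize(cplex.diff(profitExpr, fixedCost));
    }
}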

Java Optimizing arithmetic and Assignment Operators for large input

I have a piece of code that must run extremely fast in terms of clock speed. The algorithm is already O(N). It takes 2 seconds; it needs to take 1 second. For most inputs with A.length ~ 100,000 it takes 0.3 s, unless a particular line of code is invoked an extreme number of times. (This is for an esoteric programming challenge.)
It uses the partial sums of the arithmetic series 1,2,...,N -> 1,3,6,10,15,...
which can be represented by n*(n+1)/2.
I loop through this equation hundreds of thousands of times.
I do not have access to the input, nor can I display it. The only information I am able to get returned is the time it took to run.
In particular, the statement is:
s+=(n+c)-((n*(n+1))/2);
s and c can have values ranging from 0 to 1 billion
n can range from 0 to 100,000
What is the most efficient way to write this statement in terms of clock speed?
I have heard division takes more time than multiplication, but beyond that I could not determine whether writing this on one line or as multiple assignment statements is more efficient.
Dividing and then multiplying versus multiplying and then dividing?
Also, would creating custom integer types significantly help?
Edit as per request, full code with small input case (sorry if it's ugly, I've just kept stripping it down):
public static void main(String[] args) {
int A[]={3,4,8,5,1,4,6,8,7,2,2,4};//output 44
int K=6;
//long start = System.currentTimeMillis();;
//for(int i=0;i<100000;i++){
System.out.println(mezmeriz4r(A,K));
//}
//long end = System.currentTimeMillis();;
// System.out.println((end - start) + " ms");
}
public static int mezmeriz4r(int[]A,int K){
int s=0;
int ml=s;
int mxl=s;
int sz=1;
int t=s;
int c=sz;
int lol=50000;
int end=A.length;
for(int i=sz;i<end;i++){
if(A[i]>A[mxl]){
mxl=i;
}else if(A[i]<A[ml]){
ml=i;
}
if(Math.abs(A[ml]-A[mxl])<=K){
sz++;
if(sz>=lol)return 1000000000;
if(sz>1){
c+=sz;
}
}else{
if(A[ml]!=A[i]){
t=i-ml;
s+=(t+c)-((t*(t+1))/(short)2);
i=ml;
ml++;
mxl=ml;
}else{
t=i-mxl;
s+=(t+c)-((t*(t+1))/(short)2);
i=mxl;
mxl++;
ml=mxl;
}
c=1;
sz=0;
}
}
if(s>1000000000)return 1000000000;
return s+c;
}
Returned from Challenge:
Detected time complexity: O(N)

test              description                                time      result
example           example test                               0.290 s   OK
single            single element                             0.290 s   OK
double            two elements                               0.290 s   OK
small_functional  small functional tests                     0.280 s   OK
small_random      small random sequences, length ~100        0.300 s   OK
small_random2     small random sequences, length ~100        0.300 s   OK
medium_random     chaotic medium sequences, length ~3,000    0.290 s   OK
large_range       large range test, length ~100,000          2.200 s   TIMEOUT ERROR (running time: >2.20 s, time limit: 1.02 s)
large_random      random large sequences, length ~100,000    0.310 s   OK
large_answer      test with large answer                     0.320 s   OK
large_extreme     all maximal values, ~100,000               0.340 s   OK
With a little algebra, you can simplify the expression (n+c)-((n*(n+1))/2) to c-((n*(n-1))/2), removing an addition. Then you can replace the division by 2 with a right shift by 1, which is faster than division (and safe here because n*(n-1) is never negative). Try replacing
s+=(n+c)-((n*(n+1))/2);
with
s+=c-((n*(n-1))>>1);
I don't have access to all the inputs or the timing setup to validate this, but it runs in O(N) and should be an improvement. Run it and let me know your feedback; I will provide details if necessary.
public static int solution(int[]A,int K){
int minIndex=0;
int maxIndex=0;
int end=A.length;
int slize = end;
int startIndex = 0;
int diff = 0;
int minMaxIndexDiff = 0;
for(int currIndex=1;currIndex<end;currIndex++){
if(A[currIndex]>A[maxIndex]){
maxIndex=currIndex;
}else if(A[currIndex]<A[minIndex]){
minIndex=currIndex;
}
if( (A[maxIndex]-A[minIndex]) >K){
minMaxIndexDiff= currIndex- startIndex;
if (minMaxIndexDiff > 1){
slize+= ((minMaxIndexDiff*(minMaxIndexDiff-1)) >> 1);
if (diff > 0 ) {
slize = slize + (diff * minMaxIndexDiff);
}
}
if (minIndex == currIndex){
diff = currIndex - (maxIndex + 1);
}else{
diff = currIndex - (minIndex + 1);
}
if (slize > 1000000000) {
return 1000000000;
}
minIndex = currIndex;
maxIndex = currIndex;
startIndex = currIndex;
}
}
if ( (startIndex +1) == end){
return slize;
}
if (slize > 1000000000) {
return 1000000000;
}
minMaxIndexDiff= end- startIndex;
if (minMaxIndexDiff > 1){
slize+= ((minMaxIndexDiff*(minMaxIndexDiff-1)) >> 1);
if (diff > 0 ) {
slize = slize + (diff * minMaxIndexDiff);
}
}
return slize;
}
Get rid of the System.out.println() in the for loop :) you will be amazed how much faster your calculation will be
Nested assignments, i.e. instead of
t=i-ml;
s+=(t+c)-((t*(t+1))/(short)2);
i=ml;
ml++;
mxl=ml;
something like
s+=((t=i-ml)+c);
s-=((t*(t+1))/(short)2);
i=ml;
mxl=++ml;
sometimes occur in the OpenJDK sources. It mainly results in replacing *load bytecode instructions with *dup ones. According to my experiments it gives only a very small speedup, and it is ultra hardcore; I don't recommend writing such code manually.
I would try the following and profile the code after each change to check if there is any gain in speed.
replace:
if(Math.abs(A[ml]-A[mxl])<=K)
by
int diff = A[ml]-A[mxl];
if(diff<=K && diff>=-K)
replace
/2
by
>>1
replace
ml++;
mxl=ml;
by
mxl=++ml;
Maybe avoid repeated array accesses of the same element (Java's internal bounds checks may take some time).
So store at least A[i] in a local variable, for example:
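(Reusing names from the loop in the question; a micro-optimisation that may or may not be measurable once the JIT has done its work.)
int ai = A[i];        // read each array element only once per iteration
int aml = A[ml];
int amxl = A[mxl];
if (ai > amxl) {
    mxl = i;
} else if (ai < aml) {
    ml = i;
}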
I would create a C version first and see how fast it can go with "direct access to the metal". Chances are you are trying to optimize a calculation that is already optimized to the limit.
I would try to eliminate the line if(Math.abs(A[ml]-A[mxl])<=K)
by using a faster, self-calculated abs version that is inlined, not a method call!
The cast to (short) does not help,
but try the right shift operator x >> 1 instead of x / 2.
Removing the System.out.println() can speed things up by a factor of 1000.
But be careful: otherwise your whole algorithm can be removed by the VM because you don't use its result.
Old code:
for(int i=0;i<100000;i++){
System.out.println(mezmeriz4r(A,K));
}
New code:
int dummy = 0;
for(int i=0;i<100000;i++){
dummy = mezmeriz4r(A,K);
}
//Use dummy otherwise optimisation can remove mezmeriz4r
System.out.print("finished: " + dummy);

Server side sorting on huge data

As of now we provide client-side sorting on a Dojo DataGrid. Now we need to add server-side sorting, meaning the sort should apply across all pages of the grid. We have 4 tables joined to the main table, with about 200,000 (2 lakh) records as of now, and it may grow. Executing the SQL takes 5-8 minutes to fetch all the records into my Java code, where I need to apply some calculations over them; I then provide custom sorting using Comparators, one Comparator per column.
My worry is how to get all that data to the service-layer code in a short time. Is there a way to increase execution speed through data source configuration?
return new Comparator<QueryHS>() {
public int compare(QueryHS object1, QueryHS object2) {
int tatAbs = object1.getTatNb().intValue() - object1.getExternalUnresolvedMins().intValue();
String negative = "";
if (tatAbs < 0) {
negative = "-";
}
String tatAbsStr = negative + FormatUtil.pad0(String.valueOf(Math.abs(tatAbs / 60)), 2) + ":"
+ FormatUtil.pad0(String.valueOf(Math.abs(tatAbs % 60)), 2);
// object1.setTatNb(tatAbs);
object1.setAbsTat(tatAbsStr.trim());
int tatAbs2 = object2.getTatNb().intValue() - object2.getExternalUnresolvedMins().intValue();
negative = "";
if (tatAbs2 < 0) {
negative = "-";
}
String tatAbsStr2 = negative + FormatUtil.pad0(String.valueOf(Math.abs(tatAbs2 / 60)), 2) + ":"
+ FormatUtil.pad0(String.valueOf(Math.abs(tatAbs2 % 60)), 2);
// object2.setTatNb(tatAbs2);
object2.setAbsTat(tatAbsStr2.trim());
if(tatAbs > tatAbs2)
return 1;
if(tatAbs < tatAbs2)
return -1;
return 0;
}
};
You should not fetch all 200,000 records from the database into your application. You should only fetch what is needed.
As you have said, you have 4 tables joined to a main table, so you must have Hibernate entity classes for them with the corresponding mappings. Use pagination to fetch only the number of rows you are showing to the user (see the sketch below); Hibernate knows the tricks to make this work efficiently on your particular database.
You can even use aggregate functions: count(), min(), max(), sum(), and avg() in your HQL to fetch the relevant data.
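A minimal sketch of that kind of pagination with a plain Hibernate query (entity and property names are invented; setFirstResult/setMaxResults translate to LIMIT/OFFSET on MySQL, and the createQuery overload with a result class needs Hibernate 5.2+):
import java.util.List;
import org.hibernate.Session;

public class QueryHsDao {
    // page is 0-based; pageSize is the number of rows the grid shows per page
    public List<QueryHS> findPage(Session session, int page, int pageSize,
                                  String sortProperty, boolean ascending) {
        String hql = "from QueryHS q order by q." + sortProperty + (ascending ? " asc" : " desc");
        return session.createQuery(hql, QueryHS.class)
                      .setFirstResult(page * pageSize)   // skip the earlier pages
                      .setMaxResults(pageSize)           // fetch only one page of rows
                      .list();
    }
}
Note that the value your Comparator derives (tatNb minus externalUnresolvedMins) would have to be expressed in the query, or stored as a column, for the database to sort on it.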

Extracting Values used for Normalization in Weka Multilayer Perceptron

I have a machine learning scheme in which I use the Java classes from Weka inside a MATLAB script. I then upload the trained classifier model to a database, since I need to perform the classification on a different machine in a different language (Obj-C). Evaluating the network was fairly straightforward to program, but I need the values that Weka used to normalize the data set before training so I can use them when evaluating the network later. Does anyone know how to get the normalization factors that Weka would use for training a Multilayer Perceptron network? I would prefer the answer to be in Java.
After some digging through the WEKA source code and documentation, this is what I've come up with. Even though there is a filter in WEKA called "Normalize", the Multilayer Perceptron doesn't use it; instead it uses a bit of code internally that looks like this:
m_attributeRanges = new double[inst.numAttributes()];
m_attributeBases = new double[inst.numAttributes()];
for (int noa = 0; noa < inst.numAttributes(); noa++) {
min = Double.POSITIVE_INFINITY;
max = Double.NEGATIVE_INFINITY;
for (int i=0; i < inst.numInstances();i++) {
if (!inst.instance(i).isMissing(noa)) {
value = inst.instance(i).value(noa);
if (value < min) {
min = value;
}
if (value > max) {
max = value;
}
}
}
m_attributeRanges[noa] = (max - min) / 2;
m_attributeBases[noa] = (max + min) / 2;
if (noa != inst.classIndex() && m_normalizeAttributes) {
for (int i = 0; i < inst.numInstances(); i++) {
if (m_attributeRanges[noa] != 0) {
inst.instance(i).setValue(noa, (inst.instance(i).value(noa)
- m_attributeBases[noa]) /
m_attributeRanges[noa]);
}
else {
inst.instance(i).setValue(noa, inst.instance(i).value(noa) -
m_attributeBases[noa]);
}
So the only values that I should need to transmit to the other system I'm trying to use to evaluate this network would be the min and the max. Luckily for me, there turned out to be a method on the filter weka.filters.unsupervised.attribute.Normalize that returns a double array of the mins and the maxes for a processed dataset. All I had to do then was tell the multilayer perceptron to not automatically normalize my data, and to process it separately with the filter so I could extract the mins and maxes to send to the database along with the weights and everything else.

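A rough sketch of that workflow (the exact getter names can differ between Weka releases; getMinArray()/getMaxArray() on the Normalize filter and setNormalizeAttributes(false) on the MLP are assumptions to verify against your version):
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

public class NormalizationExport {
    // 'data' is the training set with the class index already set
    public static void train(Instances data) throws Exception {
        Normalize normalize = new Normalize();
        normalize.setInputFormat(data);
        Instances normalized = Filter.useFilter(data, normalize);

        // per-attribute values to ship to the database alongside the network weights
        double[] mins = normalize.getMinArray();
        double[] maxs = normalize.getMaxArray();

        // train on the pre-normalized data and switch off the MLP's own normalization
        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setNormalizeAttributes(false);
        mlp.buildClassifier(normalized);
        // ... store mins, maxs and the model wherever the Obj-C side can read them
    }
}
Bear in mind the Normalize filter scales to [0, 1] by default, whereas the MLP's internal code above maps attributes to roughly [-1, 1]; either convention works as long as exactly the same min/max transformation is reproduced on the Obj-C side.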