Server side sorting on huge data - java

At the moment we provide client-side sorting on a Dojo datagrid. Now we need to add server-side sorting, meaning the sort should apply across all pages of the grid. We have 4 tables joined to the main table and roughly 200,000 (2 lakh) records so far, and that number may grow. Executing the SQL takes 5-8 minutes to fetch all records into my Java code, where I need to apply some calculations over them, and I provide custom sorting using Comparators, one comparator per column.
My worry is how to get the whole data set into the service layer code in a short time. Is there a way to increase execution speed through data source configuration?
return new Comparator<QueryHS>() {
    public int compare(QueryHS object1, QueryHS object2) {
        // Difference in minutes for the first object
        int tatAbs = object1.getTatNb().intValue() - object1.getExternalUnresolvedMins().intValue();
        String negative = "";
        if (tatAbs < 0) {
            negative = "-";
        }
        String tatAbsStr = negative + FormatUtil.pad0(String.valueOf(Math.abs(tatAbs / 60)), 2) + ":"
                + FormatUtil.pad0(String.valueOf(Math.abs(tatAbs % 60)), 2);
        // object1.setTatNb(tatAbs);
        object1.setAbsTat(tatAbsStr.trim());

        // Difference in minutes for the second object
        int tatAbs2 = object2.getTatNb().intValue() - object2.getExternalUnresolvedMins().intValue();
        negative = "";
        if (tatAbs2 < 0) {
            negative = "-";
        }
        String tatAbsStr2 = negative + FormatUtil.pad0(String.valueOf(Math.abs(tatAbs2 / 60)), 2) + ":"
                + FormatUtil.pad0(String.valueOf(Math.abs(tatAbs2 % 60)), 2);
        // object2.setTatNb(tatAbs2);
        object2.setAbsTat(tatAbsStr2.trim());

        // Note: setting the formatted value inside compare() is a side effect;
        // the ordering itself only uses the raw minute values.
        return Integer.compare(tatAbs, tatAbs2);
    }
};

You should not fetch all 200,000 records from the database into your application. You should only fetch what is needed.
Since you have 4 tables joined to the main table, you presumably have Hibernate entity classes for them with the corresponding mappings. Use pagination to fetch only the rows you actually show to the user, and let the database do the ORDER BY for the selected column so the sort applies across all pages (see the sketch below). Hibernate knows the tricks to make this work efficiently on your particular database.
You can even use aggregate functions: count(), min(), max(), sum(), and avg() with your HQL to fetch the relevant data.
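As a rough illustration of the pagination idea, here is a minimal sketch using a Hibernate Session. The entity and property names (QueryHS, tatNb, externalUnresolvedMins) are only assumptions based on the comparator in the question; the point is to let the database perform the ORDER BY and return one page at a time.
// Hedged sketch: fetch one sorted page instead of all rows.
// QueryHS, tatNb and externalUnresolvedMins are assumed names taken from the question.
public List<QueryHS> fetchPage(Session session, int pageNumber, int pageSize, boolean ascending) {
    String hql = "from QueryHS q order by (q.tatNb - q.externalUnresolvedMins) "
            + (ascending ? "asc" : "desc");
    return session.createQuery(hql, QueryHS.class)
            .setFirstResult(pageNumber * pageSize)   // row offset of the requested page
            .setMaxResults(pageSize)                 // rows per page
            .getResultList();
}
The grid then requests page by page, and the sort column the user picks simply changes the ORDER BY expression.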

Related

Optimization: Finding the best Simple Moving Average takes too much time

I've created a simple Spring application with a MySQL DB.
In the DB there are 20 years of stock data (5694 rows).
The goal is to find the best moving average (N) for those 20 years of stock data. Inputs are the closing prices of every trading day.
The calculated average depends on N. For example, if N=3 the average for a reference day t is given by ((t-1) + (t-2) + (t-3)) / N.
Output is the best moving average (N) and the result made with all the buying & selling transactions for that best N.
I did not find a proper algorithm on the Internet, so I implemented the following:
For every N (249 times) the program does the following steps:
SQL query: calculate the averages & return the list
@Repository
public interface StockRepository extends CrudRepository<Stock, Integer> {

    /*
     * This SQL query calculates the moving average for the value n
     */
    @Query(value = "SELECT a.date, a.close, Round( ( SELECT SUM(b.close) / COUNT(b.close) FROM stock AS b WHERE DATEDIFF(a.date, b.date) BETWEEN 0 AND ?1 ), 2 ) AS 'avg' FROM stock AS a ORDER BY a.date", nativeQuery = true)
    List<AverageDTO> calculateAverage(int n);
}
Simulate buying & selling -> calculate the result
Compare result with bestResult
Next N
@RestController
public class ApiController {

    @Autowired
    private StockRepository stockRepository;

    /*
     * This function tries all possible values in the interval [min, max], calculates
     * the moving avg and simulates the gains for each value to choose the best one.
     */
    @CrossOrigin(origins = "*")
    @GetMapping("/getBestValue")
    public ResultDTO getBestValue(@PathParam("min") int min, @PathParam("max") int max) {
        Double best = 0.0;
        int value = 0;
        for (int i = min; i <= max; i++) {
            Double result = simulate(stockRepository.calculateAverage(i));
            if (result > best) {
                value = i;
                best = result;
            }
        }
        return new ResultDTO(value, best);
    }

    /*
     * This function takes the close and moving average of a stock and
     * simulates the buying/selling process.
     */
    public Double simulate(List<AverageDTO> list) {
        Double result = 0.0;
        Double lastPrice = list.get(0).getClose();
        for (int i = 1; i < list.size(); i++) {
            if (list.get(i - 1).getClose() < list.get(i - 1).getAvg()
                    && list.get(i).getClose() > list.get(i).getAvg()) {
                // buy
                lastPrice = list.get(i).getClose();
            } else if (list.get(i - 1).getClose() > list.get(i - 1).getAvg()
                    && list.get(i).getClose() < list.get(i).getAvg()) {
                // sell
                result += (list.get(i).getClose() - lastPrice);
                lastPrice = list.get(i).getClose();
            }
        }
        return result;
    }
}
When I put Min=2 and Max=250 it takes 45 minutes to finish.
Since I'm a beginner in Java & Spring, I do not know how to optimize it.
I'm grateful for any input.
This problem is equivalent to finding the best moving sum of N values; simply divide by N afterwards. Given one such slice, the next slice is formed by subtracting the first value and adding a new value at the end. This could even lead to an algorithm for finding local growth with a[i + N] - a[i] >= 0.
However, in this case a simple sequential, ordered query with
double[] slice = new double[N];
double sum = 0.0;
suffices. (A skipping algorithm on a database is probably too complicated.)
Simply walk through the table once, keeping the slice as a sliding window of N values (and their keys), and maintaining the maximum seen so far; a sketch of this is given below.
Use the primitive type double instead of the object wrapper Double.
If the database transport is a serious factor, a stored procedure would do. Keeping a massive table as that many entities just to compute a running maximum is unfortunate.
It would be better to have a condensed table, or better yet a field, with the sum of N values.
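As a rough sketch of that sliding-window idea (an illustration under assumptions, not the poster's code): load the closing prices once, in date order, with a single query such as SELECT close FROM stock ORDER BY date, and then evaluate every N in memory instead of issuing one SQL query per N.
// Hedged sketch: moving averages for one N, computed in memory with a sliding window.
// The input array is assumed to hold the closing prices in ascending date order.
public static double[] movingAverage(double[] close, int n) {
    double[] avg = new double[close.length];
    double windowSum = 0.0;
    for (int i = 0; i < close.length; i++) {
        windowSum += close[i];              // newest value enters the window
        if (i >= n) {
            windowSum -= close[i - n];      // oldest value leaves the window
        }
        avg[i] = windowSum / Math.min(i + 1, n); // partial window at the start
    }
    return avg;
}
Each call is O(number of rows), so trying every N from 2 to 250 over roughly 5,700 rows is on the order of a million simple in-memory operations, with a single round trip to the database.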

Google Bigtable Filter with Java

I want to filter all rows that match this condition: with input value x, return all records between two quantifier values in Java.
Example: With input value x = 15,
a record with quantifier q1 = 10 and q2 = 20 will match,
a record with quantifier q1 = 1 and q2 = 10 will not match
You are trying to filter rows that contain a minimum numerical qualifier that is < x, as well as a maximum numerical qualifier that is > x, and then perhaps to narrow the data from those rows down to the cells between those qualifiers.
This is pretty much the opposite of the access pattern one tries to achieve when setting up a Bigtable, and it has a code smell. Having said that, you can successfully achieve this sort of query using a combination of filters. However, these filters cannot be chained together, as far as I can tell.
First, use a filter to get the keys that have a column qualifier < x. Next, send a query to Bigtable for each key from the first step, filtering on the key as well as on qualifier > x. That is an optimized way. An even more optimized way might be to limit the first filter to 1 element (i.e. get the minimum element) and only run the unlimited less-than query after the second step.
My implementation below is slightly more naive, in that the second step filters only on qualifier > x and not on the keys from the first step, but the gist is the same:
val x = "15"

// Zero-pad qualifiers to a fixed width so that lexicographic ranges behave numerically.
val padWidth = Ints.max(Int.MinValue.toString.length, Int.MaxValue.toString.length)
def pad(s: String): String = s.reverse.padTo(padWidth, '0').reverse

val a = new mutable.HashMap[ByteString, Row] // rows with a qualifier < x
val b = new mutable.HashMap[ByteString, Row] // rows with a qualifier > x
val c = new mutable.HashMap[ByteString, Row] // rows with a qualifier == x

dataClient.readRows(Query.create(tableId)
    .filter(Filters.FILTERS.qualifier().rangeWithinFamily("cf")
      .startClosed(pad(Int.MinValue.toString)).endOpen(pad(x))))
  .forEach(r => a.put(r.getKey, r))

dataClient.readRows(Query.create(tableId)
    .filter(Filters.FILTERS.qualifier().rangeWithinFamily("cf")
      .startOpen(pad(x)).endClosed(pad(Int.MaxValue.toString))))
  .forEach(r => b.put(r.getKey, r))

dataClient.readRows(Query.create(tableId)
    .filter(Filters.FILTERS.qualifier().exactMatch(pad(x))))
  .forEach(r => c.put(r.getKey, r))

// Keep only the rows that have qualifiers on both sides of x, then collect their cells.
val all_cells = a.keys.toSet.intersect(b.keys.toSet).flatMap(k =>
  a(k).getCells.toArray.toSeq ++ b(k).getCells.toArray.toSeq
    ++ c.get(k).map(_.getCells.toArray.toSeq).getOrElse(Seq.empty))
Can you tell me more about your use case?
It is possible to create a filter on a range of values, but it will depend on how you are encoding them. If they are encoded as strings, you would use the ValueRange filter like so:
Filter filter = FILTERS.value().range().startClosed("10").endClosed("20");
Then perform your read with the filter
try (BigtableDataClient dataClient = BigtableDataClient.create(projectId, instanceId)) {
    Query query = Query.create(tableId).filter(filter);
    ServerStream<Row> rows = dataClient.readRows(query);
    for (Row row : rows) {
        printRow(row);
    }
} catch (IOException e) {
    System.out.println(
        "Unable to initialize service client, as a network error occurred: \n" + e.toString());
}
You can also pass bytes to the range, so if your numbers are encoded in some way, you could encode them as bytes in the same way and pass that into startClosed and endClosed.
You can read more about filters in the Cloud Bigtable Documentation.
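One caveat when numbers are stored as plain decimal strings: the range is compared lexicographically, so "9" would sort after "15". A common workaround (sketched below, with an arbitrarily chosen width of 10 digits) is to zero-pad the values to a fixed width both when writing them and when building the filter.
// Hedged sketch: encode numeric values as fixed-width, zero-padded strings so that
// the byte-wise comparison used by the value range matches numeric order.
static String encode(long value) {
    return String.format("%010d", value); // e.g. 15 -> "0000000015"
}

Filter paddedRange = FILTERS.value().range()
    .startClosed(encode(10))
    .endClosed(encode(20));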

Extracting cassandra's bloom filter

I have a cassandra server that is queried by another service and I need to reduce the amount of queries.
My first thought was to create a bloom filter of the whole database every couple of minutes and send it to the service.
But as I have a couple of hundred gigabytes in the database (expected to grow to a couple of terabytes), it doesn't seem like a good idea to overload the database every few minutes.
After a while of searching for a better solution, I remembered that cassandra maintains its own bloom filter.
Is it possible to copy the *-Filter.db files and use them in my code instead of creating my own bloom filter?
I created a table test:
CREATE TABLE test (
    a int PRIMARY KEY,
    b int
);
Inserted 1 row:
INSERT INTO test(a,b) VALUES(1, 10);
After flushing the data to disk, we can use the *-Filter.db file. In my case it was la-2-big-Filter.db.
Here is sample code to check whether a partition key exists:
// Cassandra internal classes (from the cassandra-all jar)
import org.apache.cassandra.db.DecoratedKey;
import org.apache.cassandra.db.marshal.Int32Type;
import org.apache.cassandra.dht.Murmur3Partitioner;
import org.apache.cassandra.utils.FilterFactory;
import org.apache.cassandra.utils.IFilter;

Murmur3Partitioner partitioner = new Murmur3Partitioner();
try (DataInputStream in = new DataInputStream(new FileInputStream(new File("la-2-big-Filter.db")));
     IFilter filter = FilterFactory.deserialize(in, true)) {
    // Check candidate partition keys 1..10 against the SSTable's bloom filter
    for (int i = 1; i <= 10; i++) {
        DecoratedKey decoratedKey = partitioner.decorateKey(Int32Type.instance.decompose(i));
        if (filter.isPresent(decoratedKey)) {
            System.out.println(i + " is present ");
        } else {
            System.out.println(i + " is not present ");
        }
    }
}
Output :
1 is present
2 is not present
3 is not present
4 is not present
5 is not present
6 is not present
7 is not present
8 is not present
9 is not present
10 is not present

OutOfMemoryError: Java heap space

I'm having a problem with a Java OutOfMemoryError. The program basically looks at MySQL tables (which I work with through MySQL Workbench), queries them to extract certain information, and then writes it to CSV files.
The program works just fine with a smaller data set, but once I use a larger data set (hours of logging information as opposed to perhaps 40 minutes) I get this error, which suggests the problem comes from having a huge data set that the program doesn't handle well, or from it not being possible to handle this amount of data the way I have.
Setting the Java VM argument -Xmx1024m worked for a slightly larger data set, but I need it to handle even bigger ones and then the error comes back.
Here is the method which I am quite sure is the cause of the problem somewhere:
// CSV is csvwriter (external lib), sment are Statements, rs is a ResultSet
public void pidsforlog() throws IOException {
    String[] procs;
    int count = 0;
    String temp = "";
    System.out.println("Commence getting PID's out of Log");
    try {
        sment = con.createStatement();
        sment2 = con.createStatement();
        String query1a = "SELECT * FROM log, cpuinfo, memoryinfo";
        rs = sment.executeQuery(query1a);
        procs = new String[countThrough(rs)];
        // SIMPLY GETS UNIQUE PROCESSES OUT OF TABLES AND STORES IN ARRAY
        while (rs.next()) {
            temp = rs.getString("Process");
            if (Arrays.asList(procs).contains(temp)) {
            } else {
                procs[count] = temp;
                count++;
            }
        }
        // BELIEVE THE PROBLEM LIES BELOW HERE. SIZE OF THE RESULTSET TOO BIG?
        for (int i = 0; i < procs.length; i++) {
            if (procs[i] == null) {
            } else {
                String query = "SELECT DISTINCT * FROM log, cpuinfo, memoryinfo WHERE log.Process = " + "'" + procs[i] + "'" + " AND cpuinfo.Process = " + "'" + procs[i] + "'" + " AND memoryinfo.Process = " + "'" + procs[i] + "' AND log.Timestamp = cpuinfo.Timestamp = memoryinfo.Timestamp";
                System.out.println(query);
                rs = sment.executeQuery(query);
                writer = new CSVWriter(new FileWriter(procs[i] + ".csv"), ',');
                writer.writeAll(rs, true);
                writer.flush();
            }
        }
        writer.close();
    } catch (SQLException e) {
        notify("Error pidslog", e);
    }
} // end of method
Please feel free to ask if you want source code or more information as I'm desperate to get this fixed!
Thanks.
SELECT * FROM log, cpuinfo, memoryinfo will certainly give a huge result set: it produces the Cartesian product of all rows in all 3 tables.
Without seeing the table structure (or knowing the desired result) it's hard to pinpoint a solution, but I suspect that you either want some kind of join conditions to limit the result set, or to use a UNION, à la:
SELECT Process FROM log
UNION
SELECT Process FROM cpuinfo
UNION
SELECT Process FROM memoryinfo
...which will just give you all distinct values for Process in all 3 tables.
Your second SQL statement also looks a bit strange:
SELECT DISTINCT *
FROM log, cpuinfo, memoryinfo
WHERE log.Process = @param1
  AND cpuinfo.Process = @param1
  AND memoryinfo.Process = @param1
  AND log.Timestamp = cpuinfo.Timestamp = memoryinfo.Timestamp
Looks like you're trying to select from all 3 logs simultaneously, but ending up with another cartesian product. Are you sure you're getting the result set you're expecting?
You could limit the results returned by your SQL queries with the LIMIT statement.
For example:
SELECT * FROM `your_table` LIMIT 100
This will return the first 100 results
SELECT * FROM `your_table` LIMIT 100, 200
This will return 200 results starting after the first 100 (the first number is the offset, the second is the row count), i.e. rows 101 to 300.
Obviously you can iterate over those values so that you eventually reach every row in the database, no matter how many there are; a sketch of such a loop is shown below.
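As a rough illustration of that pagination loop (a sketch only; it assumes the usual java.sql imports, an open Connection con, and a table/column naming that matches the question only loosely):
// Hedged sketch: page through a large table in fixed-size chunks instead of one huge query.
int pageSize = 1000;
try (PreparedStatement ps = con.prepareStatement(
        "SELECT * FROM log ORDER BY Timestamp LIMIT ? OFFSET ?")) {
    for (int offset = 0; ; offset += pageSize) {
        ps.setInt(1, pageSize);
        ps.setInt(2, offset);
        boolean sawRows = false;
        try (ResultSet page = ps.executeQuery()) {
            while (page.next()) {
                sawRows = true;
                // process one row here, e.g. append it to the CSV writer
            }
        }
        if (!sawRows) {
            break; // no more rows left
        }
    }
}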
I think you are loading too much data into memory at the same time. Try to use OFFSET and LIMIT in your SQL statements so that you can avoid this problem.
Your Java code is doing things that the database could do more efficiently. From query1a, it looks like all you really want is the unique processes. select distinct Process from ... should be sufficient to do that.
Then, think carefully about what table or tables are needed in that query. Do you really need log, cpuinfo, and memoryinfo? As Joachim Isaksson mentioned, this is going to return the Cartesian product of those three tables, giving you x * y * z rows (where x, y, and z are the row counts in each of those three tables) and a + b + c columns (where a, b, and c are the column counts in each of the tables). I doubt that's what you want or need. I assume you could get those unique processes from one table, or a union (rather than join) of the three tables.
Lastly, your second loop and query are essentially doing a join, something again better and more efficiently left to the database.
Like others said, fetching the data in smaller chunks (or streaming the result set, as sketched after the link below) might resolve the issue.
This is one of the other threads on Stack Overflow that talks about this issue:
How to read all rows from huge table?
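For completeness, a minimal sketch of result-set streaming with MySQL Connector/J (the query and the con variable are placeholders). A forward-only, read-only statement with a fetch size of Integer.MIN_VALUE is the documented way to make the MySQL driver stream rows instead of buffering the whole result set in heap:
// Hedged sketch: stream rows from MySQL instead of holding the entire ResultSet in memory.
try (Statement stmt = con.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
    stmt.setFetchSize(Integer.MIN_VALUE); // Connector/J switches to row-by-row streaming
    try (ResultSet rows = stmt.executeQuery("SELECT Process, Timestamp FROM log ORDER BY Timestamp")) {
        while (rows.next()) {
            // handle one row at a time, e.g. write it straight to the CSV file
        }
    }
}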

How to get facet ranges in solr results?

Assume that I have a field called price for the documents in Solr and I have that field faceted. I want to get the facets as ranges of values (e.g. 0-100, 100-500, 500-1000, etc.). How can I do that?
I can specify the ranges beforehand, but I also want to know whether it is possible to calculate the ranges (say, 5 of them) automatically based on the values in the documents.
To answer your first question, you can get facet ranges by using the generic facet query support. Here's an example:
http://localhost:8983/solr/select?q=video&rows=0&facet=true&facet.query=price:[*+TO+500]&facet.query=price:[500+TO+*]
As for your second question (automatically suggesting facet ranges), that's not yet implemented. Some argue that this kind of logic is best implemented in your application rather than letting Solr "guess" the best facet ranges.
Here are some discussions on the topic:
(Archived) https://web.archive.org/web/20100416235126/http://old.nabble.com/Re:-faceted-browsing-p3753053.html
(Archived) https://web.archive.org/web/20090430160232/http://www.nabble.com/Re:-Sorting-p6803791.html
(Archived) https://web.archive.org/web/20090504020754/http://www.nabble.com/Dynamically-calculated-range-facet-td11314725.html
I have worked out how to calculate sensible dynamic facets for product price ranges. The solution involves some pre-processing of documents and some post-processing of the query results, but it requires only one query to Solr, and should even work on old versions of Solr like 1.4.
Round up prices before submission
First, before submitting the document, round up the price to the nearest "nice round facet boundary" and store it in a "rounded_price" field. Users like their facets to look like "250-500" not "247-483", and rounding also means you get back hundreds of price facets rather than millions of them. With some effort the following code can be generalised to round nicely at any price scale:
public static decimal RoundPrice(decimal price)
{
    if (price < 25)
        return Math.Ceiling(price);
    else if (price < 100)
        return Math.Ceiling(price / 5) * 5;
    else if (price < 250)
        return Math.Ceiling(price / 10) * 10;
    else if (price < 1000)
        return Math.Ceiling(price / 25) * 25;
    else if (price < 2500)
        return Math.Ceiling(price / 100) * 100;
    else if (price < 10000)
        return Math.Ceiling(price / 250) * 250;
    else if (price < 25000)
        return Math.Ceiling(price / 1000) * 1000;
    else if (price < 100000)
        return Math.Ceiling(price / 2500) * 2500;
    else
        return Math.Ceiling(price / 5000) * 5000;
}
Permissible prices go 1,2,3,...,24,25,30,35,...,95,100,110,...,240,250,275,300,325,...,975,1000 and so forth.
Get all facets on rounded prices
Second, when submitting the query, request all facets on rounded prices sorted by price: facet.field=rounded_price. Thanks to the rounding, you'll get at most a few hundred facets back.
Combine adjacent facets into larger facets
Third, after you have the results, the user wants to see only 3 to 7 facets, not hundreds of facets. So, combine adjacent facets into a few large facets (called "segments"), trying to get a roughly equal number of documents in each segment. The following rather more complicated code does this, returning tuples of (start, end, count) suitable for performing range queries. The counts returned will be correct provided prices have been rounded up to the nearest boundary:
public static List<Tuple<string, string, int>> CombinePriceFacets(int nSegments, ICollection<KeyValuePair<string, int>> prices)
{
    var ranges = new List<Tuple<string, string, int>>();
    int productCount = prices.Sum(p => p.Value);
    int productsRemaining = productCount;
    if (nSegments < 2)
        return ranges;
    int segmentSize = productCount / nSegments;
    string start = "*";
    string end = "0";
    int count = 0;
    int totalCount = 0;
    int segmentIdx = 1;
    foreach (KeyValuePair<string, int> price in prices)
    {
        end = price.Key;
        count += price.Value;
        totalCount += price.Value;
        productsRemaining -= price.Value;
        if (totalCount >= segmentSize * segmentIdx)
        {
            ranges.Add(new Tuple<string, string, int>(start, end, count));
            start = end;
            count = 0;
            segmentIdx += 1;
        }
        if (segmentIdx == nSegments)
        {
            ranges.Add(new Tuple<string, string, int>(start, "*", count + productsRemaining));
            break;
        }
    }
    return ranges;
}
Filter results by selected facet
Fourth, suppose ("250","500",38) was one of the resulting segments. If the user selects "$250 to $500" as a filter, simply do a filter query fq=price:[250 TO 500]
There may well be a better Solr-specific answer, but I work with straight Lucene, and since you're not getting much traction I'll take a stab. There, I'd create and populate a Filter with a FilteredQuery wrapping the original Query. Then I'd get a FieldCache for the field of interest. Enumerate the hits in the filter's bitset, and for each hit, get the value of the field from the field cache and add it to a SortedSet. When you've got all of the hits, divide the set into the number of ranges you want (five to seven is a good number according to the user interface guys), and rather than a single-valued constraint, your facets will be range queries using the lower and upper bounds of each of those subsets.
I'd recommend using some special-case logic for a small number of values; obviously, if you only have four distinct values, it doesn't make sense to try and make 5 range refinements out of them. Below a certain threshold (say 3*your ideal number of ranges), you just show the facets normally rather than ranges.
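A minimal, Lucene-agnostic sketch of the range-splitting step described above, assuming the distinct field values of the matching documents have already been collected into a SortedSet:
// Hedged sketch: split the sorted distinct values into roughly equal-sized buckets and
// return (lower, upper) bounds that can be turned into range facets / range queries.
static List<double[]> rangeBounds(SortedSet<Double> values, int nRanges) {
    List<Double> sorted = new ArrayList<>(values);   // already in ascending order
    List<double[]> bounds = new ArrayList<>();
    int bucketSize = Math.max(1, (int) Math.ceil(sorted.size() / (double) nRanges));
    for (int start = 0; start < sorted.size(); start += bucketSize) {
        int end = Math.min(start + bucketSize, sorted.size()) - 1;
        bounds.add(new double[] { sorted.get(start), sorted.get(end) });
    }
    return bounds;
}
With very few distinct values this naturally degenerates to one bucket per value, which matches the special-case advice above.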
You can use solr facet ranges
http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range
