I have a dataset of pairs that represent HTTP request samples
(request time, request duration)
By using Apache Commons Math's EmpiricalDistribution I can calculate the average request count over time like this:
double[] input = /* request times from the pairs */;
EmpiricalDistribution dist = new EmpiricalDistribution(/* number of buckets */);
dist.load(input);
dist.getBinStats()
    .stream()
    .filter(stat -> stat.getN() > 0)
    .forEach(stat -> /* logic to store the data in the proper format */);
This way I can store the data in a chart-friendly format and plot it later on. What I can't do yet is calculate the average request duration over time.
In systems like Prometheus, this is done by using queries like
rate(http_server_requests_seconds_sum[5m])
/
rate(http_server_requests_seconds_count[5m])
I want to achieve the same thing (if possible) in my Java code.
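For context, spelled out by hand, the computation I'm after would look something like the sketch below. It skips EmpiricalDistribution entirely and just buckets the (time, duration) pairs, then divides the per-bucket duration sum by the per-bucket count, which is the batch analogue of the Prometheus ratio. The array names, sample values and bucket count are placeholders, not real data.

import java.util.Arrays;

public class AverageDurationPerBucket {
    public static void main(String[] args) {
        // Placeholder sample data: parallel arrays of request time and duration
        double[] requestTimes     = {0.5, 1.2, 1.7, 2.3, 2.9};
        double[] requestDurations = {0.10, 0.25, 0.15, 0.30, 0.20};
        int bucketCount = 3; // placeholder bucket count

        double min = Arrays.stream(requestTimes).min().orElse(0);
        double max = Arrays.stream(requestTimes).max().orElse(1);
        double width = (max - min) / bucketCount;

        // Per-bucket sum of durations and request count
        double[] sums = new double[bucketCount];
        long[] counts = new long[bucketCount];
        for (int i = 0; i < requestTimes.length; i++) {
            int bucket = (int) Math.min(bucketCount - 1, (requestTimes[i] - min) / width);
            sums[bucket] += requestDurations[i];
            counts[bucket]++;
        }

        // Average duration per time bucket = sum(duration) / count, like rate(sum)/rate(count)
        for (int b = 0; b < bucketCount; b++) {
            if (counts[b] > 0) {
                System.out.printf("bucket %d: avg duration = %.4f%n", b, sums[b] / counts[b]);
            }
        }
    }
}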
I am trying to make a service that will calculate statistics for each month.
I did something like this:
public Map<String, BigDecimal> getStatistic() {
    List<Order> orders = orderService.findAll(Sort.by(Sort.Direction.ASC, "creationDate")).toList();
    SortedMap<String, BigDecimal> statisticsMap = new TreeMap<>();
    MathContext mc = new MathContext(3);
    for (Order order : orders) {
        List<FraudDishV1Response> dishesOfOrder = order.getDishIds()
                .stream()
                .map(dishId -> dishV1Client.getDishById(dishId))
                .collect(Collectors.toList());
        BigDecimal total = calculateTotal(dishesOfOrder);
        String date = order.getCreatedDate().format(DateTimeFormatter.ofPattern("yyyy-MM"));
        statisticsMap.merge(date, total, (a, b) -> a.add(b, mc));
    }
    return statisticsMap;
}
But it takes a long time if there are lots of entries in the database.
Are there any best practices for working with statistics in REST API applications?
I'd also like to know whether it is a good idea to save the statistics in a separate repository. It would save time when calculating the statistics, but whenever a record is created in the database, the statistics store would have to be updated as well.
With your approach you'll eventually run out of memory while trying to load a huge amount of data from the database. You could do the processing in batches, but that will only get you so far. Ideally, any kind of statistical data or on-demand reporting would be served by long-running scheduled jobs that periodically do the processing in the background and generate the desired data for you. You could dump the result into a table and then serve it from there via an API.
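For example, a rough sketch of such a background job with Spring scheduling could look like the following. MonthlyStatisticRepository, MonthlyStatistic and orderService.calculateMonthlyTotals() are hypothetical names used only for illustration, and scheduling has to be enabled with @EnableScheduling; the point is that the heavy calculation runs in the background and the REST endpoint only reads the precomputed rows.

import java.math.BigDecimal;
import java.util.SortedMap;

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class MonthlyStatisticsJob {

    private final OrderService orderService;                      // assumed existing service
    private final MonthlyStatisticRepository statisticRepository; // hypothetical repository

    public MonthlyStatisticsJob(OrderService orderService,
                                MonthlyStatisticRepository statisticRepository) {
        this.orderService = orderService;
        this.statisticRepository = statisticRepository;
    }

    @Scheduled(fixedDelay = 3_600_000) // recompute every hour; the interval is a placeholder
    public void recomputeMonthlyStatistics() {
        // calculateMonthlyTotals() is a hypothetical method that returns month -> total
        SortedMap<String, BigDecimal> totals = orderService.calculateMonthlyTotals();
        totals.forEach((month, total) ->
                statisticRepository.save(new MonthlyStatistic(month, total))); // hypothetical entity
    }
}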
Another approach is to do real-time processing. If you can develop a streaming pipeline in your application, then I would highly suggest exploring the Apache Flink project.
Well, I didn't stop there and worked out several solutions step by step...
Step 1: Use streams. Before that, calculating statistics for 10,000 OrderEntity records took 18 seconds. With streams it came down to 14 seconds.
Step 2: Use parallelStream instead of plain streams. Parallel streams accelerated the statistics calculation to 6 seconds! I was even surprised.
public SortedMap<String, BigDecimal> getStatisticsByParallelStreams() {
    List<OrderEntity> orders = new ArrayList<>();
    orderService.findAll(Sort.by(Sort.Direction.ASC, "createdDate")).forEach(orders::add);
    MathContext mc = new MathContext(3);
    return orders.stream().collect(Collectors.toMap(
            order -> order.getCreatedDate().format(DateTimeFormatter.ofPattern("yyyy-MM")),
            order -> calculateTotal(order.getDishIds()
                    .parallelStream()
                    .map(dishId -> dishV1Client.getDishById(dishId))
                    .collect(Collectors.toList())),
            (a, b) -> a.add(b, mc),
            TreeMap::new
    ));
}
Step 3: Optimize requests to the other microservice. I attached JProfiler to the app and found out that I was often making redundant requests to the other microservice. So now I first make a single request to fetch all dishes, and then use the received list of dishes while calculating the statistics.
That sped it up to 1.5 seconds:
public SortedMap<String, BigDecimal> getStatisticsByParallelStreams() {
    List<OrderEntity> orders = new ArrayList<>();
    orderService.findAll(Sort.by(Sort.Direction.ASC, "createdDate")).forEach(orders::add);
    List<FraudDishV1Response> dishes = dishV1Client.getDishes();
    MathContext mc = new MathContext(3);
    return orders.stream().collect(Collectors.toMap(
            order -> order.getCreatedDate().format(DateTimeFormatter.ofPattern("yyyy-MM")),
            order -> calculateTotal(order.getDishIds()
                    .parallelStream()
                    .map(dishId -> getDishResponseById(dishes, dishId))
                    .collect(Collectors.toList())),
            (a, b) -> a.add(b, mc),
            TreeMap::new
    ));
}
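For completeness, getDishResponseById is not shown above. One possible shape for it is the simple list scan below (the getId() accessor and the Long id type are assumptions); indexing the dishes by id once and looking them up in a map makes each lookup O(1) instead of a scan per dish of every order, which these helpers would sit alongside in the same service class.

// Presumed list-scanning variant of the helper used above
private FraudDishV1Response getDishResponseById(List<FraudDishV1Response> dishes, Long dishId) {
    return dishes.stream()
            .filter(dish -> dish.getId().equals(dishId)) // getId() is an assumed accessor
            .findFirst()
            .orElseThrow(() -> new IllegalArgumentException("Unknown dish id: " + dishId));
}

// Cheaper variant: build the index once, then look dishes up by id in constant time
private Map<Long, FraudDishV1Response> indexById(List<FraudDishV1Response> dishes) {
    return dishes.stream()
            .collect(Collectors.toMap(FraudDishV1Response::getId, Function.identity()));
}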
I am trying to aggregate streaming data for each hour (like 12:00 to 12:59 and 01:00 to 01:59) in a Dataflow/Apache Beam job.
Following is my use case:
Data is streaming from Pub/Sub and has a timestamp (the order date). I want to count the number of orders received in each hour, and I also want to allow data to arrive up to 5 hours late. Following is the sample code I am using:
LOG.info("Start Running Pipeline");
DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
Pipeline pipeline = Pipeline.create(options);
PCollection<String> directShipmentFeedData = pipeline.apply("Get Direct Shipment Feed Data", PubsubIO.readStrings().fromSubscription(directShipmentFeedSubscription));
PCollection<String> tibcoRetailOrderConfirmationFeedData = pipeline.apply("Get Tibco Retail Order Confirmation Feed Data", PubsubIO.readStrings().fromSubscription(tibcoRetailOrderConfirmationFeedSubscription));
PCollection<String> flattenData = PCollectionList.of(directShipmentFeedData).and(tibcoRetailOrderConfirmationFeedData)
        .apply("Flatten Data from PubSub", Flatten.<String>pCollections());
flattenData
        .apply(ParDo.of(new DataParse())).setCoder(SerializableCoder.of(SalesAndUnits.class))
        // Adding Window
        .apply(
                Window.<SalesAndUnits>into(
                        SlidingWindows.of(Duration.standardMinutes(15))
                                .every(Duration.standardMinutes(1)))
        )
        // Data Enrich with Dimensions
        .apply(ParDo.of(new DataEnrichWithDimentions()))
        // Group And Hourly Sum
        .apply(new GroupAndSumSales())
        .apply(ParDo.of(new SQLWrite())).setCoder(SerializableCoder.of(SalesAndUnits.class));
pipeline.run();
LOG.info("Finish Running Pipeline");
I'd use a window that matches the requirements you have. Something along the lines of:
Window.into(
        FixedWindows.of(Duration.standardHours(1))
).withAllowedLateness(Duration.standardHours(5))
Possibly followed by a count, as that's what I understood you need.
Hope it helps
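A rough sketch of how that could slot into your pipeline is below, reusing flattenData, DataParse and SalesAndUnits from your question and leaving the default trigger in place (the enrichment and SQL steps are omitted). With accumulatingFiredPanes(), each hourly window re-emits an updated count when data arrives up to 5 hours late.

// Hourly order counts with up to 5 hours of allowed lateness
PCollection<Long> hourlyOrderCounts = flattenData
        .apply(ParDo.of(new DataParse())).setCoder(SerializableCoder.of(SalesAndUnits.class))
        .apply(Window.<SalesAndUnits>into(FixedWindows.of(Duration.standardHours(1)))
                .withAllowedLateness(Duration.standardHours(5))
                .accumulatingFiredPanes())
        // withoutDefaults() is needed because the input is in non-global windows
        .apply(Count.<SalesAndUnits>globally().withoutDefaults());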
I'm trying to calculate a 2-tailed Student's t-distribution using commons-math. I'm using Excel to compare values and validate whether my results are correct.
So, using Excel to calculate TDIST(x, df, t) with x = 5.968191467, df = 8, tails t = 2:
=TDIST(ABS(5.968191467),8,2)
I get the result: 0.000335084
Using Commons Math like so:
TDistribution tDistribution = new TDistribution(8);
System.out.println(BigDecimal.valueOf(tDistribution.density(5.968191467)));
I get the result: 0.00018738010608336254
What should I be using to get the result exactly like the TDIST value?
To replicate your Excel formula you can use the CDF:
2*(1.0 - tDistribution.cumulativeProbability(5.968191467))
The right formula for a general x is:
2*(1.0 - tDistribution.cumulativeProbability(Math.abs(x)))
(thanks to ryuichiro). Do not forget to take the absolute value, because the 2-tailed TDIST in Excel is a symmetric function of x, that is
TDIST(-x,df,2) = TDIST(x,df,2)
and ryuichiro's version would not work for negative x's. Check also the docs or this.
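If you want to check it end to end, a small self-contained example would be something like the following (assuming the commons-math3 package; x and df taken from your question):

import org.apache.commons.math3.distribution.TDistribution;

public class TwoTailedTDist {
    public static void main(String[] args) {
        double x = 5.968191467;
        TDistribution tDistribution = new TDistribution(8); // df = 8

        // Two-tailed p-value: 2 * (1 - CDF(|x|))
        double twoTailed = 2 * (1.0 - tDistribution.cumulativeProbability(Math.abs(x)));

        System.out.println(twoTailed); // expected to be roughly 0.000335, matching Excel's TDIST
    }
}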
I want to implement Apache Spark's ALS machine learning algorithm. I found that the best model should be chosen to get the best results. I have split the data into three sets, Training, Validation and Test, as suggested on forums.
I've found the following code sample to train the model on these sets.
val ranks = List(8, 12)
val lambdas = List(1.0, 10.0)
val numIters = List(10, 20)
var bestModel: Option[MatrixFactorizationModel] = None
var bestValidationRmse = Double.MaxValue
var bestRank = 0
var bestLambda = -1.0
var bestNumIter = -1
for (rank <- ranks; lambda <- lambdas; numIter <- numIters) {
  val model = ALS.train(training, rank, numIter, lambda)
  val validationRmse = computeRmse(model, validation, numValidation)
  if (validationRmse < bestValidationRmse) {
    bestModel = Some(model)
    bestValidationRmse = validationRmse
    bestRank = rank
    bestLambda = lambda
    bestNumIter = numIter
  }
}
val testRmse = computeRmse(bestModel.get, test, numTest)
This code trains a model for each combination of rank and lambda and compares the RMSE (root mean squared error) against the validation set. These iterations give the best model, which we can say is represented by the (rank, lambda) pair. But it doesn't do much after that with the test set.
It just computes the RMSE on the test set.
My question is how the model can be further tuned with the test set data.
No, one would never fine-tune the model using test data. If you do that, it stops being your test data.
I'd recommend this section of Prof. Andrew Ng's famous course that discusses the model training process: https://www.coursera.org/learn/machine-learning/home/week/6
Depending on what you observe in the error values on the validation data set, you might want to add/remove features, get more data, make changes to the model, or maybe even try a different algorithm altogether. If the cross-validation and test RMSE look reasonable, then you are done with the model and you can use it for the purpose (some prediction, I would assume) that made you build it in the first place.
I'm trying to create a simple real-time report using ZoomData.
I create a DataSource (Upload API) in the ZoomData admin interface and add a visualization to it (vertical bars).
I also disable all other visualizations for this DS.
My DS has 2 fields:
timestamp - ATTRIBUTE
count - INTEGER AVG
In the visualization:
group by: timestamp
group by sort: count
y axis: count avg
colors: count avg
Every second I send a POST request to the ZoomData server to add data to the DS.
I do it from Java (I also tried sending it from Postman).
My problem is: the data arrives via POST and is added to the DS, but the visualization properties reset to their defaults:
group by sort: volume
y axis: volume
colors: volume
but group by stays timestamp.
I can't understand why the visualization properties always change after data arrives in a POST request.