How to shift data in a distribution in Java

I am designing software in Java, and one of its features is computing the cumulative distribution of a value within a distribution.
For example: the average age at marriage in a country is 28 (the mean of the distribution). The distribution I am using is chi-square (class ChiSquaredDistribution) with 3 degrees of freedom, since it resembles the real-world distribution of age at marriage.
My goal: when the user types their age, the output is the approximate percentage of people marrying at that age (within a one-year boundary) under that distribution. For example: input 30 years >>> output 5.1%; input 28 years >>> output 6%; input 56 years >>> output 0.8%. The input is an int, the output a double.
The problem is that the distribution starts at 0 and its mean is, I believe, 3 by default. The following code displays the marriage probability for ages 0 to 70. My question is how to shift it to 18 and over, with the mean at the average age at marriage?
ChiSquaredDistribution x = new ChiSquaredDistribution(3);
Random r = new Random();
for (int UserAtAge = 0; UserAtAge < 70; UserAtAge++) {
    System.out.println((x.cumulativeProbability(UserAtAge + 1)
                      - x.cumulativeProbability(UserAtAge)) * 100);
}
Two images were attached showing the current results and the intended results. Any code and help would be highly appreciated.

Shift your distribution by subtracting 18 from each value, so 18 maps to 0, 28 maps to 10, 70 maps to 52, etc. The mean of an unshifted chi-square is its degrees of freedom. Using a chi-square(3) would yield a mean of 21 for the shifted data, so you'll want to bump that up to a chi-square(10) to yield a mean of 28 with the shift.
With some cleanup (lower-case start for local variables, r was unused), the shifted version is:
ChiSquaredDistribution x = new ChiSquaredDistribution(10);
for (int userAge = 18; userAge < 71; userAge++) {
    System.out.println((x.cumulativeProbability(userAge + 1 - 18)
                      - x.cumulativeProbability(userAge - 18)) * 100);
}
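To answer the original question directly (int in, double out), the shifted lookup can be wrapped in a small method. This is a minimal sketch assuming Apache Commons Math's ChiSquaredDistribution; the class and method names here (MarriageAge, marriageProbability) are illustrative, not from the post:
import org.apache.commons.math3.distribution.ChiSquaredDistribution;

public class MarriageAge {
    private static final int SHIFT = 18;        // distribution starts at age 18
    private static final ChiSquaredDistribution DIST =
            new ChiSquaredDistribution(10);     // mean 10, so 10 + SHIFT = 28

    /** Approximate probability (in percent) of marrying at the given age. */
    public static double marriageProbability(int age) {
        if (age < SHIFT) return 0.0;
        return (DIST.cumulativeProbability(age + 1 - SHIFT)
              - DIST.cumulativeProbability(age - SHIFT)) * 100;
    }
}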

Related

Competitive Coding - Clearing all levels with minimum cost : Not passing all test cases

I was solving problems on a competitive coding website when I came across this. The problem states that:
In this game there are N levels and M types of available weapons. The levels are numbered from 0 to N-1 and the weapons are numbered from 0 to M-1 . You can clear these levels in any order. In each level, some subset of these M weapons is required to clear this level. If in a particular level, you need to buy x new weapons, you will pay x^2 coins for it. Also note that you can carry all the weapons you have currently to the next level . Initially, you have no weapons. Can you find out the minimum coins required such that you can clear all the levels?
Input Format
The first line of input contains 2 space separated integers:
N = the number of levels in the game
M = the number of types of weapons
N lines follow. The ith of these lines contains a binary string of length M. If the jth character of this string is 1, it means we need a weapon of type j to clear the ith level.
Constraints
1 <= N <= 20
1 <= M <= 20
Output Format
Print a single integer which is the answer to the problem.
Sample TestCase 1
Input
1 4
0101
Output
4
Explanation
There is only one level in this game. We need 2 types of weapons - 1 and 3. Since Ben initially has no weapons, he will have to buy these, which will cost him 2^2 = 4 coins.
Sample TestCase 2
Input
3 3
111
001
010
Output
3
Explanation
There are 3 levels in this game. The 0th level (111) requires all 3 types of weapons. The 1st level (001) requires only a weapon of type 2. The 2nd level requires only a weapon of type 1. If we clear the levels in the given order (0-1-2), the total cost = 3^2 + 0^2 + 0^2 = 9 coins. If we clear them in the order 1-2-0, it will cost 1^2 + 1^2 + 1^2 = 3 coins, which is the optimal way.
Approach
I was able to figure out that we can calculate the minimum cost by traversing the Binary Strings in a way that we purchase minimum possible weapons at each level.
One possible way could be traversing the array of binary strings and calculating the cost for each level while the array is already arranged in the correct order. The correct order should be when the Strings are already sorted i.e. 001, 010, 111 as in case of the above test case. Traversing the arrays in this order and summing up the cost for each level gives the correct answer.
Also, the sort method in Java works fine to sort these binary strings before running a loop over the array to sum up the cost for each level.
Arrays.sort(weapons);
This approach works fine for some of the test cases; however, more than half of the test cases still fail and I can't understand what's wrong with my logic. I am using bitwise operators to calculate the number of weapons needed at each level and returning their square.
Unfortunately, I cannot see the test cases that are failing. Any help is greatly appreciated.
This can be solved by dynamic programming.
The state will be the bit mask of weapons we currently own.
The transitions will be to try clearing each of the n possible levels in turn from the current state, acquiring the additional weapons we need and paying for them.
In each of the n resulting states, we take the minimum cost of the current way to achieve it and all previously observed ways.
When we already have some weapons, some levels will actually require no additional weapons to be bought; such transitions will automatically be disregarded since in such case, we arrive at the same state having paid the same cost.
We start at the state of m zeroes, having paid 0.
The end state is the bitwise OR of all the given levels, and the minimum cost to get there is the answer.
In pseudocode:
let mask[1], mask[2], ..., mask[n] be the given bit masks of the n levels
p2m = 2 to the power of m
f[0] = 0
all f[1], f[2], ..., f[p2m-1] = infinity
for state = 0, 1, 2, ..., p2m-1:
    current_cost = f[state]
    current_ones = popcount(state)          // popcount is the number of 1 bits
    for level = 1, 2, ..., n:
        new_state = state | mask[level]     // the operation is bitwise OR
        new_cost = current_cost + square(popcount(new_state) - current_ones)
        f[new_state] = min(f[new_state], new_cost)
mask_total = mask[1] | mask[2] | ... | mask[n]
the answer is f[mask_total]
The complexity is O(2^m * n) time and O(2^m) memory, which should be fine for m <= 20 and n <= 20 in most online judges.
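For reference, a direct Java translation of this pseudocode could look like the following sketch (the method and variable names are mine, not from the answer):
static int minCoins(int[] mask, int m) {
    int n = mask.length;
    int p2m = 1 << m;
    int[] f = new int[p2m];
    java.util.Arrays.fill(f, Integer.MAX_VALUE);
    f[0] = 0;                                       // start with no weapons, cost 0
    for (int state = 0; state < p2m; state++) {
        if (f[state] == Integer.MAX_VALUE) continue; // unreachable state
        int currentOnes = Integer.bitCount(state);
        for (int level = 0; level < n; level++) {
            int newState = state | mask[level];
            int bought = Integer.bitCount(newState) - currentOnes;
            int newCost = f[state] + bought * bought;
            if (newCost < f[newState]) f[newState] = newCost;
        }
    }
    int maskTotal = 0;
    for (int x : mask) maskTotal |= x;
    return f[maskTotal];
}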
The dynamic programming idea by @Gassa could be extended using A*, by estimating the minimum and maximum of the remaining cost, where
minRemaining(s) = bitCount(maxState - s)
maxRemaining(s) = bitCount(maxState - s)^2
Start with a priority queue based on cost+minRemaining, containing just the empty state, and then repeatedly replace a state from this queue that has not reached maxState with at most n new states based on the n levels:
Keep track of bound = min(cost(s) + maxRemaining(s)) over the queue,
and initialize all costs with bitCount(maxState)^2 + 1

extract the state with the lowest cost
if state != maxState
    remove state from queue
    for j in 1..n
        if (state | level[j]) != state
            cost(state | level[j]) = min(cost(state | level[j]),
                                         cost(state) + bitCount((state | level[j]) - state)^2)
            if cost(state | level[j]) + minRemaining(state | level[j]) <= bound
                add/replace state | level[j] in queue
else break
The idea is to skip dead ends. So consider an example from a comment:
11100 cost 9 min 2 max 4
11110 cost 16 min 1 max 1
11111 cost 25 min 0 max 0
00011 cost 4 min 3 max 9
bound 13
remove 00011 and replace with 11111 (skipping 00011 since no change)
11111 cost 13 min 0 max 0
11100 cost 9 min 2 max 4
11110 cost 16 min 1 max 1
remove 11100 and replace with 11110 11111 (skipping 11100 since no change):
11111 cost 13 min 0 max 0
11110 cost 10 min 1 max 1
bound 11
remove 11110 and replace with 11111 (skipping 11110 since no change)
11111 cost 11 min 0 max 0
bound 11
The number of operations should be similar to the dynamic programming solution in the worst case, but in many cases it will be better; I don't know whether the worst case can actually occur.
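For concreteness, here is a minimal Java sketch of this best-first search. It uses only the minRemaining heuristic as the A* estimate and omits the bound/maxRemaining pruning described above for brevity; all names (minCoinsAStar, levelMasks) are illustrative:
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

static int minCoinsAStar(int[] levelMasks) {
    int maskTotal = 0;
    for (int mask : levelMasks) maskTotal |= mask;

    Map<Integer, Integer> best = new HashMap<>();   // state -> cheapest known cost
    PriorityQueue<int[]> pq =                       // queue entries: {g + h, state}
            new PriorityQueue<>(Comparator.comparingInt(e -> e[0]));
    best.put(0, 0);
    pq.add(new int[]{Integer.bitCount(maskTotal), 0});

    while (!pq.isEmpty()) {
        int[] entry = pq.poll();
        int state = entry[1];
        int g = best.get(state);
        // skip stale queue entries for states that were improved later
        if (entry[0] > g + Integer.bitCount(maskTotal & ~state)) continue;
        if (state == maskTotal) return g;           // heuristic is admissible
        for (int mask : levelMasks) {
            int next = state | mask;
            if (next == state) continue;            // level adds no new weapons
            int bought = Integer.bitCount(next) - Integer.bitCount(state);
            int ng = g + bought * bought;
            if (ng < best.getOrDefault(next, Integer.MAX_VALUE)) {
                best.put(next, ng);
                pq.add(new int[]{ng + Integer.bitCount(maskTotal & ~next), next});
            }
        }
    }
    return -1; // unreachable for non-empty input
}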
The logic behind this approach is that each time you find the binary string needing the minimum count of additional set bits relative to a pattern holding the weapons acquired so far.
For example, we have this data:
4 3
101 - 2 bits
010 - 1 bit
110 - 2 bits
101 - 2 bits
Now, as 010 has the fewest set bits, we compute its cost first and then update the current pattern (using bitwise OR), so the current pattern is 010.
Next we find the number needing the fewest new set bits with respect to the current pattern.
I used the following logic: first XOR the current pattern with the given number, then AND with that number: (A ^ B) & B, where A is the current pattern.
So the bits become like this after the operation:
(101 ^ 010) & 101 -> 101 - 2 bits
(110 ^ 010) & 110 -> 100 - 1 bit
Now we know 110 needs the fewest new bits, so we pick it, compute the cost, update the pattern, and so on.
This method returns the cost of a string with respect to the current pattern:
private static int computeCost(String currPattern, String costString) {
    int a = currPattern.isEmpty() ? 0 : Integer.parseInt(currPattern, 2);
    int b = Integer.parseInt(costString, 2);
    int c = (a ^ b) & b;                  // bits required by b that a lacks
    int newBits = Integer.bitCount(c);    // countSetBits in the original post
    return newBits * newBits;             // cost is the square of new weapons
}

Most efficient data type for an input of fixed numbers

I have some JSON coming to my Java program. It has a particular field with six fixed numbers: 0, 30, 60, 120, 240 or 480. Is it possible in Java to choose a better data type than short? Maybe by using an enum in some form or by representing input in bits by taking advantage of knowing the fixed input in advance?
Regarding enums, they seem to be made for a different use case. From the Oracle Java docs for enum, it looks like if I use an enum it will still end up creating an int internally, so I don't see any ultimate advantage in speed or memory. Is there anything I am missing?
I tried to google but couldn't get an appropriate answer yet.
First, observe that the numbers from your example follow a certain pattern - they are constructed from powers of two multiplied by 30:
0  = 0     -> 0 * 30
1  = 2^0   -> 1 * 30
2  = 2^1   -> 2 * 30
4  = 2^2   -> 4 * 30
8  = 2^3   -> 8 * 30
16 = 2^4   -> 16 * 30
If you store a small number between 0 and 5, inclusive, you can compute the target number back either with a look-up table, or with a simple bit shifting expression:
byte b = ...; // store the value in a variable of type "byte"
int num = b != 0 ? 30 * (1 << (b - 1)) : 0;
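The look-up table variant mentioned above could be a sketch along these lines (the VALUES name is illustrative):
// the stored byte b is an index 0..5 into the table
static final short[] VALUES = {0, 30, 60, 120, 240, 480};
...
short num = VALUES[b];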
Note: Since enum is a full-blown class, it would generally use as much or more space than a primitive number.

Hashing Functions and Hash Tables

Suppose we have a set of keys: <54, 18, 10, 25, 28, 36, 38, 41, 12, 90>. Use the hashing function key % N to map each key into the following array. If there is a collision, use the separate chaining technique.
Below this was a picture of the array, labelled A, of size 13, with cells numbered 0-12; N = 13.
My understanding so far of hashing for this problem is that I need to arrange the keys given into the array using the function key % 13 (N being equal to 13). But my book doesn't give examples of different functions. The only example it uses is an alphabetizing one with first letters of last names.
Can anyone give me a brief explanation without just giving me the answer?
As you mentioned, your hash function is h = key % 13.
Suppose there are memory locations addressed 0 through 12 (the array cells).
Apply this function to every element in your array:
1) h1 = 54 % 13 = 2  => this goes to location 2.
2) h2 = 18 % 13 = 5  => this goes to location 5.
3) h3 = 10 % 13 = 10 => this goes to location 10.
4) h4 = 25 % 13 = 12 => this goes to location 12.
5) h5 = 28 % 13 = 2  => here a collision occurs, since 54 is already at location 2.
Now the solution is to use separate chaining.
Separate chaining means appending the colliding element to a linked list kept at location 2; a new linked list is maintained at every location where a collision occurs.
(The original answer attached a diagram of separate chaining here, with different elements but the same mechanism.)
You appear to understand the general process of inserting a value into a hash table. All you need to do is relate your textbook example to your homework assignment question.
Determine which bucket you need to put the value in based on the hashing function. In the textbook example, the hashing function takes the first letter of the last name. In your assignment, the hashing function is key % 13.
Resolve any collisions and perform the actual insertion. You don't mention what your textbook example uses as a collision resolution strategy, but your assignment asks you to use separate chaining.
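To make the mechanics concrete, a minimal separate-chaining insert in Java might look like the sketch below; this is not required for the written exercise, and the class and method names are mine:
import java.util.LinkedList;

public class ChainedHashTable {
    private static final int N = 13;            // table size from the assignment
    private final LinkedList<Integer>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedHashTable() {
        buckets = new LinkedList[N];
        for (int i = 0; i < N; i++) buckets[i] = new LinkedList<>();
    }

    public void insert(int key) {
        int h = key % N;          // the assignment's hash function
        buckets[h].add(key);      // colliding keys simply chain in the list
    }
}
Inserting 54 and then 28, for instance, hashes both to bucket 2, so the second key lands behind the first in that bucket's list.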

How to calculate Centered Moving Average of a set of data in Hadoop Map-Reduce?

I want to calculate the centered moving average of a set of data.
Example input format:
quarter | sales
Q1'11 | 9
Q2'11 | 8
Q3'11 | 9
Q4'11 | 12
Q1'12 | 9
Q2'12 | 12
Q3'12 | 9
Q4'12 | 10
Mathematical representation of the data, with the moving average and then the centered moving average calculated:
Period   Value   MA      Centered
1        9
1.5
2        8
2.5              9.5
3        9               9.5
3.5              9.5
4        12              10.0
4.5              10.5
5        9               10.750
5.5              11.0
6        12
6.5
7        9
I am stuck on implementing a RecordReader that would provide the mapper with a year's worth of sales values, i.e. four quarters.
This is actually totally doable in the MapReduce paradigm; it does not have to be thought of as a 'sliding window'. Instead, think of the fact that each data point is relevant to at most four MA calculations, and remember that each call to the map function can emit more than one key-value pair. Here is pseudo-code:
First MR job:
map(quarter, sales)
    emit(quarter - 1.5, sales)
    emit(quarter - 0.5, sales)
    emit(quarter + 0.5, sales)
    emit(quarter + 1.5, sales)

reduce(quarter, list_of_sales)
    if (list_of_sales.length == 4):
        emit(quarter, average(list_of_sales))
    endif
Second MR job:
map(quarter, MA)
    emit(quarter - 0.5, MA)
    emit(quarter + 0.5, MA)

reduce(quarter, list_of_MA)
    if (list_of_MA.length == 2):
        emit(quarter, average(list_of_MA))
    endif
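A rough Hadoop translation of the first job might look like the sketch below. It assumes the input has already been converted to lines of the form period|sales with integer periods (1, 2, 3, ...), and it doubles the period index so the half-quarter keys stay integral; all class and field names here are illustrative:
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MovingAverageJob {

    public static class SpreadMapper
            extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\\|");
            int q2 = 2 * Integer.parseInt(parts[0].trim());   // doubled period index
            double sales = Double.parseDouble(parts[1].trim());
            // each sale feeds the four averages centered at q-1.5 .. q+1.5
            for (int d : new int[]{-3, -1, 1, 3}) {
                ctx.write(new IntWritable(q2 + d), new DoubleWritable(sales));
            }
        }
    }

    public static class AverageReducer
            extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<DoubleWritable> values, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0;
            int n = 0;
            for (DoubleWritable v : values) { sum += v.get(); n++; }
            if (n == 4) {                                     // full windows only
                ctx.write(key, new DoubleWritable(sum / 4));
            }
        }
    }
}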
To the best of my understanding, the moving average does not map nicely onto the MapReduce paradigm, since its calculation is essentially a 'sliding window' over sorted data, while MR processes non-intersecting ranges of sorted data.
The solution I see is the following: implement a custom partitioner able to produce two different partitionings over two runs. In each run your reducers will get different ranges of data and calculate the moving average where appropriate.
I will try to illustrate:
In the first run, the data for the reducers should be:
R1: Q1, Q2, Q3, Q4
R2: Q5, Q6, Q7, Q8
...
Here you will calculate the moving average for some of the quarters.
In the next run your reducers should get data like:
R1: Q1...Q6
R2: Q6...Q10
R3: Q10..Q14
and calculate the rest of the moving averages.
Then you will need to aggregate the results.
The idea of the custom partitioner is that it has two modes of operation: each time it divides the keys into equal ranges, but the second time with some shift. In pseudocode it looks like this:
partition = (key + SHIFT) / (MAX_KEY / numOfPartitions);
where:
SHIFT will be taken from the configuration.
MAX_KEY = maximum value of the key. I assume for simplicity that they start with zero.
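A minimal Hadoop sketch of this two-mode partitioner, assuming integer period keys and double values; the configuration property names ("ma.shift", "ma.maxkey") and the class name are illustrative, not from the original answer:
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class ShiftedRangePartitioner
        extends Partitioner<IntWritable, DoubleWritable> implements Configurable {

    private Configuration conf;
    private int shift;   // 0 in the first run, half a range in the second
    private int maxKey;  // maximum key value, assumed to start at zero

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        shift = conf.getInt("ma.shift", 0);
        maxKey = conf.getInt("ma.maxkey", 1);
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public int getPartition(IntWritable key, DoubleWritable value, int numPartitions) {
        int rangeSize = Math.max(1, maxKey / numPartitions);
        // same formula as the pseudocode, clamped to a valid partition index
        return Math.min(numPartitions - 1, (key.get() + shift) / rangeSize);
    }
}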
A RecordReader, IMHO, is not a solution, since it is limited to a specific split and cannot slide over a split's boundary.
Another solution would be to implement custom logic for splitting the input data (this is part of the InputFormat). It can be made to do two different slides, similar to the partitioning approach.

Interpolating Large Datasets On the Fly

I have a large data set of about 0.5 million records representing the USD/GBP exchange rate over the course of a given day.
I have an application that wants to be able to graph this data, or maybe a subset. For obvious reasons I do not want to plot 0.5 million points on my graph.
What I need is a smaller data set (100 points or so) which represents the given data as accurately as possible. Does anyone know of any interesting and performant ways this can be achieved?
Cheers, Karl
There are several statistical methods for reducing a large dataset to a smaller, easier to visualize dataset. It's not clear from your question what summary statistic you want. I've just assumed that you want to see how the exchange rate changes as a function of time, but perhaps you are interested in how often the exchange rate goes above a certain value, or some other statistic that I'm not considering.
Summarizing a trend over time
Here is an example using the lowess method in R (from the documentation on scatter plot smoothing):
> library(graphics)
# print out the first 10 rows of the cars dataset
> cars[1:10,]
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17
# plot the original data
> plot(cars, main = "lowess(cars)")
# fit a loess-smoothed line to the points
> lines(lowess(cars), col = 2)
# plot a finer-grained loess-smoothed line through the points
> lines(lowess(cars, f=.2), col = 3)
The parameter f controls how tightly the regression fits to your data. Use some thoughtfulness with this, as you want something that accurately fits your data without overfitting. Rather than speed and distance, you could plot the exchange rate versus time.
It's also straightforward to access the results of the smoothing. Here's how to do that:
> data = lowess( cars$speed, cars$dist )
> data
$x
[1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15 15 16 16 17 17 17 18 18 18 18 19 19
[38] 19 20 20 20 20 20 22 23 24 24 24 24 25
$y
[1] 4.965459 4.965459 13.124495 13.124495 15.858633 18.579691 21.280313 21.280313 21.280313 24.129277 24.129277
[12] 27.119549 27.119549 27.119549 27.119549 30.027276 30.027276 30.027276 30.027276 32.962506 32.962506 32.962506
[23] 32.962506 36.757728 36.757728 36.757728 40.435075 40.435075 43.463492 43.463492 43.463492 46.885479 46.885479
[34] 46.885479 46.885479 50.793152 50.793152 50.793152 56.491224 56.491224 56.491224 56.491224 56.491224 67.585824
[45] 73.079695 78.643164 78.643164 78.643164 78.643164 84.328698
The data object that you get back contains entries named x and y, which correspond to the x and y values passed into the lowess function. In this case, x and y represent speed and dist.
One thought is to use the DBMS to compress the data for you with an appropriate query, something along the lines of having it take a median for a specific range. A pseudo-query:
SELECT truncate_to_hour(rate_ts), median(rate) FROM exchange_rates
WHERE rate_ts >= start_ts AND rate_ts <= end_ts
GROUP BY truncate_to_hour(rate_ts)
ORDER BY truncate_to_hour(rate_ts)
Where truncate_to_hour is something appropriate to your DBMS. Or take a similar approach with some function that segments the time into unique blocks (such as rounding to the nearest 5-minute interval), or another aggregate function appropriate for the group in place of median. Given the complexity of the time-segmenting procedure and how your DBMS optimizes it, it may be more efficient to run the query on a temporary table with the segmented time value.
If you wanted to write your own, one obvious solution would be to break your record set into fixed number-of-points chunks, for which the value would be the average (mean, median, ... pick one). This has the probable advantage of being the fastest, and shows overall trends.
But it lacks the drama of price ticks. A better solution would probably involve looking for the inflection points, then selecting among them using sliding windows. This has the advantage of better displaying the actual events of the day, but will be slower.
Something like RRDTool would do what you need automatically - the tutorial should get you started, and drraw will graph the data.
I use this at work for things like error graphs, I don't need 1-minute resolution for a 6-month time period, only for the most recent few hours. After that I have 1-hour resolution for a few days, then 1-day resolution for a few months.
The naive approach is simply calculating an average per time interval corresponding to a pixel.
http://commons.wikimedia.org/wiki/File:Euro_exchange_rate_to_AUD.svg
This does not show fluctuations. I would suggest also calculating the standard deviation in each time interval and plotting that too (essentially making each plotted point taller than a single pixel). I could not locate an example, but I know that Gnuplot can do this (though it is not written in Java).
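A quick Java sketch of that per-pixel idea, assuming the samples are already in time order (the method and parameter names are mine, not from the answer):
static double[][] bucketStats(double[] rates, int buckets) {
    double[][] stats = new double[buckets][2];        // stats[i] = {mean, stddev}
    int per = (int) Math.ceil(rates.length / (double) buckets);
    for (int i = 0; i < buckets; i++) {
        int from = i * per;
        int to = Math.min(rates.length, from + per);
        int n = to - from;
        if (n <= 0) break;                            // fewer samples than buckets
        double sum = 0, sumSq = 0;
        for (int j = from; j < to; j++) {
            sum += rates[j];
            sumSq += rates[j] * rates[j];
        }
        double mean = sum / n;
        stats[i][0] = mean;
        stats[i][1] = Math.sqrt(Math.max(0, sumSq / n - mean * mean));
    }
    return stats;
}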
How about making an enumeration/iterator wrapper? I'm not familiar with Java, but it might look similar to this:
class MedianEnumeration implements Enumeration<Double>
{
    private Enumeration<Double> frameEnum;
    private int frameSize;

    MedianEnumeration(Enumeration<Double> e, int len) {
        frameEnum = e;
        frameSize = len;
    }

    public boolean hasMoreElements() {
        return frameEnum.hasMoreElements();
    }

    // Note: despite the class name, this returns the mean of each frame.
    public Double nextElement() {
        Double sum = frameEnum.nextElement();
        int i;
        for (i = 1; (i < frameSize) && frameEnum.hasMoreElements(); ++i) {
            sum += frameEnum.nextElement();
        }
        return (sum / i);
    }
}
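Hypothetical usage, reducing roughly 500,000 samples to about 100 plotted points (loadRates and plot are illustrative stand-ins; java.util.Collections, Enumeration, and List are assumed imported):
List<Double> rates = loadRates();                        // ~500,000 samples
Enumeration<Double> raw = Collections.enumeration(rates);
Enumeration<Double> reduced = new MedianEnumeration(raw, rates.size() / 100);
while (reduced.hasMoreElements()) {
    plot(reduced.nextElement());                         // one point per frame
}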
