How to make matrix multiplication faster and manageable? - java

I am trying to multiply 2 large matrices as efficiently as possible. On one hand I have a matrix with dimensions 8,000 x 20,000, and on the other hand one with dimensions 35,000,000 x 20,000. The 20,000 columns are in the same order and represent the same values in both matrices. Both matrices are very sparse and hold boolean (binary) values. By multiplying them, I am trying to get the total number of common 1s for each pair of rows.
I tried MATLAB for this, but it was not possible to multiply them due to an out-of-memory issue. So I partitioned the larger matrix into smaller chunks, let's say 1,000,000 x 200.
After this partitioning I managed to multiply them, but it took about 5 hours, even though MATLAB multi-threads this multiplication automatically.
I then loaded these matrices into my Java code. I was wondering whether there might be a faster way to do this. For example, would it make sense to use Hadoop from Java and do the processing there? Or is there any other suggestion?
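For concreteness, the per-row counting described above can be sketched with java.util.BitSet, which stores a binary row compactly and gives the common-1s count directly (a minimal example with tiny matrices; the real sizes would still need chunking as described):

```java
import java.util.BitSet;

public class BooleanRowOverlap {
    // Count common 1s between every row of A and every row of B.
    // Each BitSet holds one row; both matrices share the same column order.
    static int[][] commonCounts(BitSet[] a, BitSet[] b) {
        int[][] counts = new int[a.length][b.length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < b.length; j++) {
                BitSet overlap = (BitSet) a[i].clone(); // copy so the row is not destroyed
                overlap.and(b[j]);                      // keep only columns set in both rows
                counts[i][j] = overlap.cardinality();   // number of common 1s
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        BitSet r0 = new BitSet(); r0.set(0); r0.set(2); r0.set(4);
        BitSet s0 = new BitSet(); s0.set(2); s0.set(4); s0.set(5);
        int[][] c = commonCounts(new BitSet[]{r0}, new BitSet[]{s0});
        System.out.println(c[0][0]); // 2 (columns 2 and 4 are set in both rows)
    }
}
```

Because BitSet packs 64 columns per long word, the and/cardinality pass over 20,000 columns touches only ~313 words per row pair, which is usually much faster than boxed or double-based multiplication.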
Thanks in advance.

Related

Unexpected deviations of the linear search graph on an ordered table

I have implemented a simple linear search and plotted the results on a graph with the StdDraw library. I ran the search for a randomly generated number on tables of sizes from 1000 to 100000 elements, incrementing the size by 1 each time. The points on the graph represent the average time it took to find a random number in the given table, averaged over 1000 runs at the same table size.
However, there are large deviations visible on the graph which I do not know how to explain. Is it possible that this is due to interference from other background tasks requesting CPU time? Could the spikes be caused by poorly generated pseudorandom integers, because the nextInt() method is called within a really tiny time slice, resulting in similar (very big or very low) random integers?
(The red line represents the linear search and the blue one binary search. Ignore the latter.)
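For reference, a minimal version of such a measurement might look like the sketch below (hypothetical setup; the warm-up pass matters because JIT compilation kicking in mid-run is a common cause of early spikes, alongside GC pauses and OS scheduling):

```java
import java.util.Random;

public class SearchTiming {
    static int linearSearch(int[] table, int key) {
        for (int i = 0; i < table.length; i++) {
            if (table[i] == key) return i;
        }
        return -1;
    }

    // Average the search time over `runs` random keys drawn from the table.
    static double averageNanos(int[] table, int runs, Random rnd) {
        long total = 0;
        for (int r = 0; r < runs; r++) {
            int key = table[rnd.nextInt(table.length)];
            long start = System.nanoTime();
            linearSearch(table, key);
            total += System.nanoTime() - start;
        }
        return (double) total / runs;
    }

    public static void main(String[] args) {
        int[] table = new int[10_000];
        for (int i = 0; i < table.length; i++) table[i] = i;
        Random rnd = new Random(42);
        averageNanos(table, 1_000, rnd); // warm-up so the JIT compiles the search loop first
        System.out.println(averageNanos(table, 1_000, rnd));
    }
}
```

If the spikes survive a warm-up pass like this, background load is the more likely explanation; Random.nextInt itself is well distributed regardless of how quickly it is called.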

Android: Storing large fixed arrays in android studio

I am trying to use a 2D array to do some mapping in Android. Basically the task is to convert polar coordinates to Cartesian coordinates (both coordinates have to be integers). This conversion is used to pick a value at the calculated X and Y coordinates.
I have r and theta as input and need to return x and y. (int)(R + r*cos(theta)) and (int)(R - r*sin(theta)) give me the required values (since my original Cartesian matrix is 2R x 2R). But calculating them repeatedly causes a lot of overhead, so I decided to pre-calculate these values to avoid the computation. (These calculations add around 2-3 seconds of overhead when I run my code.)
i.e., instead of using Value[x][y], I can use
Value [ X[r][theta] ] [ Y[r][theta] ]
However, this conversion matrix/2D array is pretty large: it has 128 x 960 elements (my application has 128 radii and 960 segments). I keep getting a "code too large" error.
Can you suggest an easy way to implement this? That is, a way to store this mapping matrix somewhere it can be referenced without too much overhead.
Some people suggest using a database, but since I am new to Android programming that looks a little scary. Surely it can be done a lot simpler?
Currently I am storing my two mapping matrices as
int[][] X = new int[][]{
{229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356},
{next line},
.
.
.
{last line}
};
And using it as
getBinary(tempBitmap, X[j][i],Y[j][i]);
This would work if the X and Y matrices were smaller, but since they are gigantic, I can't do it. Please suggest a way to go about doing this.
The computation doesn't look like it should take a long time - is it because you are doing a large number of these calculations?
If a typical run of your application only uses a relatively small number of points, but does the conversion on them many times, then it would probably be best to do the calculation the first time that it's needed, and cache the result (in memory, if there isn't going to be a huge number of them).
If that's not going to work, then the next logical thing to do would indeed be to put the values into the database. This is a very common thing to do in an Android application, and there are a large number of tutorials available on the internet, in books, etc.
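A sketch of the compute-once-and-cache approach from the first suggestion (hypothetical class; each (r, theta) cell is computed on first use and reused afterwards, so the 2-3 second up-front cost disappears and the "code too large" limit is never hit):

```java
import java.util.HashMap;
import java.util.Map;

public class PolarCache {
    private final int bigR;                        // half the side of the 2R x 2R matrix
    private final Map<Long, int[]> cache = new HashMap<>();

    PolarCache(int bigR) { this.bigR = bigR; }

    // Returns {x, y} for the given polar cell, computing it only the first time.
    int[] toCartesian(int r, int thetaIndex, int segments) {
        long key = (long) r * segments + thetaIndex; // unique key per (r, theta) cell
        return cache.computeIfAbsent(key, k -> {
            double theta = 2 * Math.PI * thetaIndex / segments;
            int x = (int) (bigR + r * Math.cos(theta));
            int y = (int) (bigR - r * Math.sin(theta));
            return new int[]{x, y};
        });
    }
}
```

Usage would then mirror the question's lookup: `int[] xy = cache.toCartesian(r, i, 960); getBinary(tempBitmap, xy[0], xy[1]);`. At 128 x 960 entries the full cache stays well under a megabyte, so memory is not a concern here.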

Sampling numerical arrays in java

I have a set of time series data I would like to display on a line graph. The data is currently stored in an Oracle table and is sampled at 1 point per second. The question is how do I plot the data over a 6 month period? Is there a way to downsample the data once it has been returned from Oracle (this can be done in various charts, but I don't want to move the data over the network)? For example, if a query returns 10K points, how can I downsample this to 1K points and still keep the visual characteristics (peaks/valleys) of the 10K points on the line graph?
I looked at Apache Commons, but without knowing exactly what the statistical name for this is, I'm a bit at a loss.
The data I am sampling is indeed time series data such as page hits.
It sounds like what you want is to segment the 10K data points into 1K buckets -- the value of each of these buckets may be any statistic that makes sense for your data (sorry, without actual context it's hard to say). For example, if you want to spot the trend of the data, you might use the median to summarize the 10 points in each bucket. Apache Commons Math has helper functions for that. Then, with the 1K downsampled data points, you can plot the chart.
For example, if I have 10K data points of page load times, I might map that to 1K data points by taking a median over every 10 points -- that will tell me the most common load time within the range -- and plot that. Or I can use the max to find the maximum load time in each period.
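The bucketed reduction described above can be sketched as follows (median per bucket; assumes the point count divides evenly into the bucket count):

```java
import java.util.Arrays;

public class Downsampler {
    // Reduce `data` to `buckets` points, one median per equal-sized bucket.
    static double[] medianBuckets(double[] data, int buckets) {
        int size = data.length / buckets;           // points per bucket
        double[] out = new double[buckets];
        for (int b = 0; b < buckets; b++) {
            double[] bucket = Arrays.copyOfRange(data, b * size, (b + 1) * size);
            Arrays.sort(bucket);                    // median = middle of the sorted bucket
            out[b] = size % 2 == 1
                    ? bucket[size / 2]
                    : (bucket[size / 2 - 1] + bucket[size / 2]) / 2.0;
        }
        return out;
    }
}
```

Swapping the median line for `bucket[size - 1]` (after the sort) would give the per-bucket max instead, which preserves peaks at the cost of hiding valleys.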
There are two options: you can do as @Adrian Pang suggests and use time bins, which means you have bins with hard boundaries between them. This is perfectly fine, and it's called downsampling when you're working with a time series.
You can also use a smooth bin definition by applying a sliding-window average/convolution to the points. This will give you a time series at the same sampling rate as your original, but much smoother. Prominent examples are the sliding-window average (mean/median of all points in the window, equally weighted) and Gaussian convolution (a weighted average where the weights come from a Gaussian density curve).
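The sliding-window variant can be sketched like this (equally weighted mean; at the edges the window simply shrinks to whatever neighbours exist):

```java
public class SlidingWindow {
    // Replace each point with the mean of the points within `radius` positions of it.
    static double[] smooth(double[] data, int radius) {
        double[] out = new double[data.length];
        for (int i = 0; i < data.length; i++) {
            int from = Math.max(0, i - radius);              // clamp window at the edges
            int to = Math.min(data.length - 1, i + radius);
            double sum = 0;
            for (int j = from; j <= to; j++) sum += data[j];
            out[i] = sum / (to - from + 1);
        }
        return out;
    }
}
```

Note that unlike bucketing, this does not reduce the point count by itself; you would still thin the smoothed series (e.g. keep every 10th point) before plotting.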
My advice is to average the values over shorter time intervals. Make the length of the shorter interval dependent on the overall time range. If the overall time range is short enough, just display the raw data. E.g.:
overall = 1 year: let subinterval = 1 day
overall = 1 month: let subinterval = 1 hour
overall = 1 day: let subinterval = 1 minute
overall = 1 hour: no averaging, just use raw data
You will have to make some choices about where to shift from one subinterval to another, e.g., for overall = 5 months, is subinterval = 1 day or 1 hour?
My advice is to make a simple scheme so that it is easy for others to comprehend. Remember that the purpose of the plot is to help someone else (not you) understand the data. A simple averaging scheme will help get you to that goal.
If all you need is to reduce the points of your visualization without losing any visual information, I suggest using the code here. The tricky part of this approach is finding the correct threshold, where the threshold is the number of data points you want to have after the downsampling. The smaller the threshold, the more visual information you lose. However, going from 10K to 1K is feasible, since I have tried it with a similar amount of data.
As a side note, you should keep in mind:
The quality of your visualization depends on the number of points and the size (in pixels) of your chart, meaning that bigger charts need more data.
Further analysis may not return correct results if it is applied to the downsampled data -- or at least I haven't seen anyone prove the opposite.

matrix computation using hadoop mapreduce

I have a matrix with around 10000 rows. I wrote code that takes one row per iteration, does some long matrix computations, and returns one double per row of the matrix. Since the number of operations per row is large, running the code takes a long time. I'm thinking of implementing it using MapReduce, but I'm not sure whether that is possible. The main idea is to split the matrix rows across different nodes, run the jobs independently, and combine the outputs into a list of numbers. Based on my understanding, a mapper alone can do this job. Am I right? Is it possible? Or is there any better idea? Thanks in advance. By the way, the code is in Java.
This seems possible - some points for consideration:
You might want to run an identity mapper (one which passes each input record to the reducer) and do the row calculation in the reducer. Doing the calculation map-side will probably still cause all the calculations to be done on a single node (it's feasible that your 10000 row matrix is smaller than the input split size).
You'll want to run a large number of reducers to ensure the job is parallelized across your cluster nodes. The default partitioner will handle sending the input rows to different reducers (assuming your rows are not fixed width, in which case you should run a custom mapper that uses a counter as the output key, instead of the default byte offset of the input row).
To bring all the results back together, you'll need to run a second MR job with a single reducer.
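Before reaching for Hadoop, it's worth noting that if the 10000-row matrix fits in memory on one machine, the same row-parallel pattern is available in plain Java via parallel streams (a local sketch, not a Hadoop job; the `rowScore` body here is a stand-in for the long per-row computation from the question):

```java
import java.util.stream.IntStream;

public class RowParallel {
    // Stand-in for the expensive per-row computation from the question.
    static double rowScore(double[] row) {
        double sum = 0;
        for (double v : row) sum += v * v;
        return sum;
    }

    // Compute one double per row, spread across all available cores.
    static double[] scoreRows(double[][] matrix) {
        return IntStream.range(0, matrix.length)
                .parallel()                          // fork-join across CPU cores
                .mapToDouble(i -> rowScore(matrix[i]))
                .toArray();
    }
}
```

Hadoop only starts to pay off when one machine's cores are genuinely not enough, since job scheduling and data shuffling add substantial overhead of their own.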

How to Implement K-Means Clustering Algorithm for MFCC Features?

I got the features of some sound samples with the MFCC algorithm. I want to cluster them with K-Means. I have 70 frames, and every frame has 9 cepstral coefficients for one voice sample. That means I have something like a 70*9 matrix.
Let's assume that A, B and C are the voice records so
A is:
List<List<Double>> -> 70*9 array (I can use Vector instead of List)
and B and C have the same dimensions too.
I don't want to cluster each frame; I want to cluster each frame block (in my example, one block has 70 frames).
How can I implement it with K-Means in Java?
Here's where your knowledge of the problem domain becomes crucial. You might just use a distance between the 70*9 matrices, but you can probably do better. I don't know the particular features you mention, but some generic examples might be the average and standard deviation of the 70 values per feature. You're basically looking to reduce the number of dimensions, both to improve speed and to make the measure robust against simple transformations, like offsetting all values by one step.
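That reduction might be sketched like this: collapse each 70*9 frame block into one vector of per-coefficient mean and standard deviation, giving an 18-dimensional point per recording instead of 630 ("per coefficient" is one reading of the suggestion above; other summaries may suit the domain better):

```java
public class FrameSummary {
    // frames[i][c] = coefficient c of frame i.
    // Returns [mean_0..mean_{C-1}, std_0..std_{C-1}] for C coefficients.
    static double[] summarize(double[][] frames) {
        int coeffs = frames[0].length;
        double[] out = new double[2 * coeffs];
        for (int c = 0; c < coeffs; c++) {
            double sum = 0;
            for (double[] frame : frames) sum += frame[c];
            double mean = sum / frames.length;
            double var = 0;
            for (double[] frame : frames) var += (frame[c] - mean) * (frame[c] - mean);
            out[c] = mean;                                   // per-coefficient mean
            out[coeffs + c] = Math.sqrt(var / frames.length); // population std dev
        }
        return out;
    }
}
```

Each voice record (A, B, C, ...) then becomes a single fixed-length point that any k-means implementation can cluster directly.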
K-Means has some pretty tough assumptions on your data. I'm not convinced that your data is appropriate to run k-means on it.
K-means is designed for Euclidean distance, and there might be a more appropriate distance measure for your data.
K-means needs to be able to compute sensible means, which may not be appropriate for your data.
Many distance functions (and algorithms!) don't work well at 70*9 dimensions ("curse of dimensionality")
You need to know k beforehand.
Side note: keep away from Java generics for primitive types such as Double; the boxing kills performance. Use double[][].
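Following that advice, here is a minimal k-means on double[][] vectors (plain Lloyd's algorithm with squared Euclidean distance and naive first-k initialization; in practice a library implementation such as Apache Commons Math's KMeansPlusPlusClusterer would be the safer choice):

```java
public class KMeans {
    // Cluster `points` into k groups; returns a cluster label per point.
    static int[] cluster(double[][] points, int k, int iterations) {
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = points[c].clone(); // naive init: first k points
        int[] labels = new int[points.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: each point goes to its nearest centroid.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = 0;
                    for (int j = 0; j < points[i].length; j++) {
                        double diff = points[i][j] - centroids[c][j];
                        d += diff * diff;
                    }
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                labels[i] = best;
            }
            // Update step: move each centroid to the mean of its assigned points.
            for (int c = 0; c < k; c++) {
                double[] mean = new double[points[0].length];
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (labels[i] != c) continue;
                    for (int j = 0; j < mean.length; j++) mean[j] += points[i][j];
                    count++;
                }
                if (count > 0) {
                    for (int j = 0; j < mean.length; j++) mean[j] /= count;
                    centroids[c] = mean;
                }
            }
        }
        return labels;
    }
}
```

Feeding it the 18-dimensional per-record summaries (rather than raw 70*9 blocks) sidesteps the dimensionality concerns raised above.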
