TextRank Run time - java

I implemented TextRank in Java but it seems pretty slow. Does anyone know about its expected performance?
If it's not expected to be slow, could any of the following be the problem:
1) It didn't seem like there was a way in JGraphT to create an edge and add a weight to it at the same time, so I calculate the weight and, if it's > 0, I add an edge. I later recalculate the weights to add them while looping through the edges. Is that a terrible idea?
2) I'm using JGraphT. Is that a slow library?
3) Anything else I could do to make it faster?

It depends what you mean by "pretty slow". A bit of googling found this paragraph:
"We calculated the total time for RAKE and TextRank (as an average over 100 iterations) to extract keywords from the Inspec testing set of 500 abstracts, after the abstracts were read from files and loaded in memory. RAKE extracted keywords from the 500 abstracts in 160 milliseconds. TextRank extracted keywords in 1002 milliseconds, over 6 times the time of RAKE."
(See http://www.scribd.com/doc/51398390/11/Evaluating-ef%EF%AC%81ciency for the context.)
So from this, I infer that a decent TextRank implementation should be capable of extracting keywords from ~500 abstracts in ~1 second.
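On the question of recalculating weights: one way to avoid computing each weight twice is to compute it once, store it, and run the ranking loop over the stored values. The following is a minimal sketch of the weighted TextRank iteration over a plain weight matrix. It deliberately does not use JGraphT, and the class name, method name and parameters are hypothetical choices made for illustration.

```java
import java.util.Arrays;

final class TextRankSketch {
    /**
     * Weighted TextRank scores for a graph given as a weight matrix.
     * w[j][i] is the weight of the edge from j to i (0 means no edge);
     * for an undirected graph the matrix is symmetric.
     */
    static double[] rank(double[][] w, double damping, int maxIter, double tol) {
        int n = w.length;
        // Sum of outgoing weights per vertex, computed once up front.
        double[] outSum = new double[n];
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                outSum[j] += w[j][k];
            }
        }
        double[] score = new double[n];
        Arrays.fill(score, 1.0);
        for (int iter = 0; iter < maxIter; iter++) {
            double[] next = new double[n];
            double delta = 0.0;
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int j = 0; j < n; j++) {
                    if (w[j][i] > 0 && outSum[j] > 0) {
                        sum += w[j][i] / outSum[j] * score[j];
                    }
                }
                next[i] = (1 - damping) + damping * sum;
                delta = Math.max(delta, Math.abs(next[i] - score[i]));
            }
            score = next;
            if (delta < tol) {
                break; // scores have stopped changing
            }
        }
        return score;
    }
}
```

A call such as `TextRankSketch.rank(weights, 0.85, 200, 1e-6)` returns one score per vertex. The dense matrix is only reasonable for small graphs; for larger ones an adjacency list with stored weights would follow the same idea.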

Related

Cross Correlation: Android AudioRecord create sample data for TDoA

On one side with my Android smartphone I'm recording an audio stream using AudioRecord.read(). For the recording I'm using the following specs
SampleRate: 44100 Hz
MonoChannel
PCM-16Bit
size of the array I use for AudioRecord.read(): 100 (short array)
Using this small size allows me to read every 0.5 ms (mean value), so I can use this timestamp later for the multilateration (at least I think so :-) ). Maybe this will be obsolete if I can use cross correlation to determine the TDoA ?!? (see below)
On the other side I have three speakers emitting different sounds using the WebAudio API and the following specs
freq1: 17500 Hz
freq2: 18500 Hz
freq3: 19500 Hz
signal length: 200 ms + a fade in and fade out of the gain node of 5ms, so in sum 210ms
My goal is to determine the time difference of arrival (TDoA) between the emitted sounds. So in each iteration I read 100 samples from my AudioRecord buffer and then I want to determine the time difference (if I found one of my sounds). So far I've used a simple frequency filter (using FFT) to determine the TDoA, but this is really inaccurate in the real world.
So far I've found out that I can use cross correlation to determine the TDoA value even better (http://paulbourke.net/miscellaneous/correlate/ and some threads here on SO). Now my problem: at the moment I think I have to correlate the recorded signal (my short array) with a generated signal of each of my three sounds above. But I'm struggling to generate this signal. Using the code found at (http://repository.tudelft.nl/view/ir/uuid%3Ab6c16565-cac8-448d-a460-224617a35ae1/ section B1.1. genTone()) does not clearly solve my problem because this will generate an array way bigger than my recorded samples. And as far as I know, the cross correlation needs two arrays of the same size to work. So how can I generate a sample array?
Another question: is the thinking of how to determine the TDoA so far correct?
Here are some lessons I've learned the past days:
I can either use cross correlation (xcorr) or a frequency recognition technique to determine the TDoA. The latter is far less precise, so I focus on the xcorr.
I can obtain the TDoA by applying the xcorr to my recorded signal and two reference signals. E.g. my record has a length of 1000 samples. With the xcorr I recognize sound A at sample 500 and sound B at sample 600. So I know they have a time difference of 100 samples (which can be converted to seconds depending on the sample rate).
Therefore I generate a linear chirp (chirps are better than simple sine waves (see literature)) using this code found on SO. For an easy example and to check if my experiment seems to work, I save my record as well as my generated chirp sounds as .wav files (there are plenty of code examples showing how to do this). Then I use MATLAB as an easy way to calculate the xcorr: see here
Another point: "does the input of xcorr have to be the same size?" I'm not quite sure about this part, but I think this has to be done. We can achieve this by zero padding the two signals to the same length (preferably a power of two, so we can use the efficient radix-2 implementation of the FFT, and at least the sum of both lengths minus one, so the circular correlation does not wrap around) and then use the FFT to calculate the xcorr (see another link from SO)
I hope this is correct so far and covers some questions of other people :-)
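The steps above (generate a chirp, correlate it against the record, read off the sample offset) can be sketched in plain Java. This is not the SO or MATLAB code referred to above. It is a direct time-domain cross correlation that slides a short reference over a longer record, so the two arrays do not need the same length in this form. The class and method names, the chirp frequencies and the 5 ms duration are assumptions made for illustration.

```java
final class ChirpCorrelation {
    /** Linear chirp sweeping from f0 to f1 (Hz) over durationSec seconds. */
    static double[] linearChirp(double f0, double f1, double durationSec, int sampleRate) {
        int n = (int) Math.round(durationSec * sampleRate);
        double sweepRate = (f1 - f0) / durationSec; // Hz per second
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            double t = (double) i / sampleRate;
            out[i] = Math.sin(2 * Math.PI * (f0 * t + 0.5 * sweepRate * t * t));
        }
        return out;
    }

    /** Sample offset in signal at which reference correlates most strongly. */
    static int bestLag(double[] signal, double[] reference) {
        int bestLag = 0;
        double best = Double.NEGATIVE_INFINITY;
        for (int lag = 0; lag + reference.length <= signal.length; lag++) {
            double sum = 0.0;
            for (int i = 0; i < reference.length; i++) {
                sum += signal[lag + i] * reference[i];
            }
            if (sum > best) {
                best = sum;
                bestLag = lag;
            }
        }
        return bestLag;
    }

    /** Converts a difference in samples to seconds. */
    static double samplesToSeconds(int samples, int sampleRate) {
        return (double) samples / sampleRate;
    }
}
```

With the numbers from the example above, sounds found at samples 500 and 600 are 100 samples apart, which at 44100 Hz is about 2.27 ms. The sliding form costs time proportional to the product of the two lengths; for long records the zero-padded FFT approach is the usual way to speed it up.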

Sampling numerical arrays in java

I have a set of time series data that I would like to display on a line graph. The data is currently stored in an Oracle table and is sampled at 1 point per second. The question is how do I plot the data over a 6-month period? Is there a way to downsample the data once it has been returned from Oracle (this can be done in various charts, but I don't want to move the data over the network)? For example, if a query returns 10K points, how can I downsample this to 1K points and still keep the visual characteristics (peaks/valleys) of the 10K points in the line graph?
I looked at Apache Commons, but without knowing exactly what the statistical name for this is, I'm a bit at a loss.
The data I am sampling is indeed time series data such as page hits.
It sounds like what you want is to segment the 10K data points into 1K buckets -- the value of each one of these buckets may be any statistic that makes sense for your data (sorry, without actual context it's hard to say). For example, if you want to spot the trend of the data, you might want to use a median or percentile to summarize the 10 points in each bucket. Apache Commons Math has helper functions for that. Then, with the 1K downsampled data points, you can plot the chart.
For example, if I have 10K data points of page load times, I might map that to 1K data points by taking the median of every 10 points -- that will tell me the typical load time within the range -- and plot that. Or maybe I can use the max to find the maximum load time in the period.
There are two options: you can do as @Adrian Pang suggests and use time bins, which means you have bins and hard boundaries between them. This is perfectly fine, and it's called downsampling if you're working with a time series.
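The bucketing idea can be sketched in plain Java without Apache Commons Math. The class and method names below are hypothetical, and the statistic is passed in as a function so that median, max, or anything else can be swapped in.

```java
import java.util.Arrays;
import java.util.function.ToDoubleFunction;

final class BucketDownsampler {
    /** Splits values into bucketCount contiguous buckets and reduces each with statistic. */
    static double[] downsample(double[] values, int bucketCount,
                               ToDoubleFunction<double[]> statistic) {
        if (bucketCount <= 0 || bucketCount > values.length) {
            throw new IllegalArgumentException("bucketCount must be in 1..values.length");
        }
        double[] out = new double[bucketCount];
        for (int b = 0; b < bucketCount; b++) {
            // Integer arithmetic spreads any remainder evenly across buckets.
            int start = (int) ((long) b * values.length / bucketCount);
            int end = (int) ((long) (b + 1) * values.length / bucketCount);
            out[b] = statistic.applyAsDouble(Arrays.copyOfRange(values, start, end));
        }
        return out;
    }

    /** Median of a non-empty bucket (the copy passed in is sorted in place). */
    static double median(double[] bucket) {
        Arrays.sort(bucket);
        int mid = bucket.length / 2;
        return bucket.length % 2 == 1
                ? bucket[mid]
                : (bucket[mid - 1] + bucket[mid]) / 2.0;
    }

    /** Maximum of a non-empty bucket. */
    static double max(double[] bucket) {
        double m = bucket[0];
        for (double v : bucket) {
            m = Math.max(m, v);
        }
        return m;
    }
}
```

For 10K points and 1K buckets, `BucketDownsampler.downsample(points, 1000, BucketDownsampler::median)` gives the median of every 10 points, and passing `BucketDownsampler::max` instead keeps the peaks.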
You can also use a smooth bin definition by applying a sliding window average/function convolution to points. This will give you a time series at the same sampling rate as your original, but much smoother. Prominent examples are the sliding window average (mean/median of all points in the window, equally weighted average) and Gaussian convolution (weighted average where the weights come from a Gaussian density curve).
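A minimal sketch of the equally weighted sliding window average, assuming a centered window with an odd size that is simply truncated at the edges of the series. The class and method names are hypothetical.

```java
final class SlidingWindow {
    /** Centered moving average; the window shrinks at both ends of the series. */
    static double[] movingAverage(double[] values, int window) {
        if (window < 1 || window % 2 == 0) {
            throw new IllegalArgumentException("window must be a positive odd number");
        }
        int n = values.length;
        int half = window / 2;
        // Prefix sums make each window sum an O(1) lookup.
        double[] prefix = new double[n + 1];
        for (int i = 0; i < n; i++) {
            prefix[i + 1] = prefix[i] + values[i];
        }
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            int from = Math.max(0, i - half);
            int to = Math.min(n, i + half + 1); // exclusive
            out[i] = (prefix[to] - prefix[from]) / (to - from);
        }
        return out;
    }
}
```

The output has the same length as the input, so this smooths the series but does not reduce the number of points. A Gaussian convolution would replace the equal weights with weights taken from a Gaussian curve.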
My advice is to average the values over shorter time intervals. Make the length of the shorter interval dependent on the overall time range. If the overall time range is short enough, just display the raw data. E.g.:
overall = 1 year: let subinterval = 1 day
overall = 1 month: let subinterval = 1 hour
overall = 1 day: let subinterval = 1 minute
overall = 1 hour: no averaging, just use raw data
You will have to make some choices about where to shift from one subinterval to another, e.g., for overall = 5 months, is subinterval = 1 day or 1 hour?
My advice is to make a simple scheme so that it is easy for others to comprehend. Remember that the purpose of the plot is to help someone else (not you) understand the data. A simple averaging scheme will help get you to that goal.
If all you need is to reduce the number of points in your visualization without losing visual information, I suggest using the code here. The tricky part of this approach is finding the correct threshold, where the threshold is the number of data points you want to have after downsampling. The lower the threshold, the more visual information you lose. However, going from 10K to 1K is feasible; I have tried it with a similar amount of data.
As a side note, keep in mind:
The quality of your visualization depends on the number of points and the size (in pixels) of your chart, meaning that bigger charts need more data.
Any further analysis may not return correct results if it is applied to the downsampled data. Or at least I haven't seen anyone prove the opposite.

Drools Planner rule profiling

We are using Drools Planner 5.4.0.Final.
We want to profile our java application to understand if we can improve performance.
Is there a way to profile how much time a rule needs to be evaluated?
We use a lot of eval(....) and our "average calculate count per second" is nearly 37. Removing all eval(...) our "average calculate count per second" remains the same.
We already profiled the application and we saw most of the time is spent in doMove ... afterVariableChanged(...).
So we suspect some of our rules are inefficient, but we don't understand where the problem is.
Thanks!
A decent average calculate count per second is higher than 1000 (at least), a good one higher than 5000. Follow these steps in order:
1) First, I strongly recommend upgrading to 6.0.0.CR5. Just follow the upgrade recipe, which will guide you step by step in a few hours. That alone will double your average calculate count (and potentially far more), due to several improvements (selectors, constraint match system, ...).
2) Open the black box by enabling logging: first DEBUG, then TRACE. The logs can show if the moves are slow (= rules are slow) or the step initialization is slow (= you need JIT selection).
3) Use the stepLimit benchmark technique to find out which rule(s) are slow.
4) Use the benchmarker (if you aren't already) and play with JIT selection, late acceptance, etc. See those topics in the docs.

Generalized Load Balancing (GLB) using Linear Programming (LP)

In one of my projects I need to implement an algorithm for load balancing. In the general load balancing problem from CS theory (which is NP-hard), the task is to allocate M loads to N servers (M >> N) such that the maximum load on any one server is minimized. The case I am dealing with is a little more general: it has additional constraints of the form that a given job can only be assigned to certain servers (for example, job M_{i} has special security requirements and hence can be allocated/executed only on secure server N_{j}).
I looked at the Kleinberg/Tardos book and found a section (11.7) on the more general load balancing problem (load balancing with constraints), which is an exact match for my situation. The Generalized Load Balancing problem is converted from IP to LP, taking advantage of the fact that the LP can produce a fractional assignment of jobs to machines, which is then rounded, adding an additional O(MN) time to the process. This approximate solution is then shown to be within a factor of 2 of the minimum possible.
Can someone point me to some C/Java/Python/MATLAB code where this algorithm has been implemented? As the Kleinberg/Tardos book hardly gives any examples or sample pseudo/actual code, it is sometimes hard to internalize the algorithm completely. Also, for the linear programming part of the problem, what kind of implementation is suitable: Simplex or Interior Point? How much difference will it make when the complexity of this LP part is added to the problem (to the fractional re-assignment part)? Unfortunately, the Kleinberg/Tardos book is not very thorough in these aspects.
Some sample C/Java/Python/MATLAB code (or pointers to code) showing some real implementation of this complete algorithm would be greatly helpful.
Edit: The original paper is "David B. Shmoys, Éva Tardos: An approximation algorithm for the generalized assignment problem. Math. Program. 62: 461-474 (1993)"
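For reference, the LP relaxation described in the question can be written out as below. This is a sketch in my own notation, not copied from the book or the paper: t_i is the load of job i, S_i is the set of servers that job i is allowed to run on, x_{ij} is the amount of job i's load placed on server j, and L is the maximum load on any server. In the integer program each x_{ij} must be either 0 or t_i; the LP drops that restriction, which is what allows fractional assignments.

```latex
\begin{aligned}
\min\ & L \\
\text{s.t.}\ & \sum_{j \in S_i} x_{ij} = t_i && \text{for every job } i \\
& \sum_{i} x_{ij} \le L && \text{for every server } j \\
& x_{ij} \ge 0 && \text{for } j \in S_i \\
& x_{ij} = 0 && \text{for } j \notin S_i
\end{aligned}
```

The first constraint says every job's load is fully placed, the second bounds each server's total load by L, and the last encodes the "only on such and such server" restrictions.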
One way I did this was to balance according to the current load on each machine. So if there are three machines A, B and C, where A has a load of 10, B has a load of 5 and C has a load of 2, then the next task (which, let's say, has a load of 3) should go to C (3 + 2 = 5, lower than all other combinations). In case of equality, given that the task which starts first usually finishes first (at least most of the time), remove the oldest task from each machine and repeat the above process. Do this recursively.
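The greedy rule from this answer (send the next task to the machine with the lowest current load) can be sketched as follows. This is a simple heuristic, not the LP-rounding algorithm from the book. The `allowed` array is an addition of mine to cover the question's "only on certain servers" constraint, ties are broken by lowest index rather than by the oldest-task rule described above, and all names are hypothetical.

```java
final class GreedyBalancer {
    /**
     * Assigns a job to the least-loaded server it is allowed to run on.
     * Updates loads in place and returns the index of the chosen server.
     */
    static int assign(double[] loads, double jobLoad, boolean[] allowed) {
        int best = -1;
        for (int j = 0; j < loads.length; j++) {
            if (allowed[j] && (best == -1 || loads[j] < loads[best])) {
                best = j; // lowest current load so far; ties keep the lower index
            }
        }
        if (best == -1) {
            throw new IllegalArgumentException("job is not allowed on any server");
        }
        loads[best] += jobLoad;
        return best;
    }
}
```

With loads of 10, 5 and 2 for A, B and C and a task of load 3 allowed everywhere, the task goes to C, whose load becomes 5. If C is not allowed, it goes to B instead.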

Updating an item/document takes between 1-2 seconds in a small index

We have a small index - less than 1MB in size and covering roughly 10,000 documents. The only fields that are stored are quite short which explains the small index size.
After the documents are loaded into the index, an update of an existing document can take between 1 and 2 seconds (there's quite a variance in this range though). We've tried utilizing various best practices (such as those in the Lucene wiki) but can't find what's wrong. We've even gone ahead and are now using RAMDirectory to remove the possibility of IO being the problem.
Is this really the performance to expect?
UPDATE
As requested below, I'm adding some more details:
We're treating Lucene as a black box; we just measure the time it takes to reindex/update an object. We don't know what's going on inside.
The objects (or documents, in Lucene's terms) are quite small, with a total size of 2KB of data each.
A code snippet outlining your entire update procedure would help. Are you committing after each update? This is not necessary and for top performance you must use Near Realtime Readers. Newer Lucene versions have an NRTManager that handles most of the boilerplate involved.
In many cases the best practice is to commit only rarely or never (except when shutting down). If your service shuts down ungracefully, you lose your index, but even if you didn't, you'd have to rebuild it upon restart anyway to account for all the changes that happened in the meantime.
