Faster quantile calculations in the Perl Data Language (PDL)
I was writing data-intensive Perl code that relied heavily on PDL for some statistical calculations (estimating percentile points in some very BIG vectors, e.g. 100k to 1B elements), when
I noticed that PDL was taking an unusually long time to produce results compared to my experience in Python.
This happened irrespective of whether one used the pct or oddpct function from PDL::Ufunc.
The performance degradation had a very interesting quantitative aspect: if one asked PDL to return a single percentile, it did so very quickly;
but if one asked for more than one percentile, the time-to-solution increased linearly with the number of percentiles requested.
Looking at the source code of the pct function, it turns out to be implemented by calling the function pctover, which, according to the PDL documentation, “Broadcasts over its inputs.”
But what exactly is broadcasting? According to PDL::Broadcasting: “[broadcasting] can produce very compact and very fast PDL code by avoiding multiple nested for loops that C and BASIC users may be familiar with. The trouble is that it can take some getting used to, and new users may not appreciate the benefits of broadcasting.” Reading the relevant PDL examples and revisiting the NumPy documentation (which describes the same technique): broadcasting is how the library treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python (or Perl).
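To make the shape-matching idea concrete, here is a minimal NumPy sketch (illustrative only; the variable names are mine): a column vector and a row vector with different shapes combine into a full grid without any explicit Python loop.

```python
import numpy as np

# A (3, 1) column and a (4,) row have different shapes, but NumPy
# "broadcasts" the smaller array across the larger one; the result
# is a (3, 4) grid, with the looping done in C, not Python.
col = np.arange(3).reshape(3, 1)  # shape (3, 1): 0, 1, 2 as a column
row = np.arange(4)                # shape (4,):   0, 1, 2, 3 as a row
grid = col * 10 + row             # broadcast to shape (3, 4)
print(grid)                       # rows: [0..3], [10..13], [20..23]
```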
It seems that when one does something like:

```perl
use PDL::Lite;

my $very_big_ndarray = ... ; # code that constructs a HUGE PDL ndarray
my $pct = sequence(100)/100; # percentiles 0%, 1%, ..., 99%
my $pct_values = pct( $very_big_ndarray, $pct );
```

broadcasting effectively executes the single-percentile calculation once per requested percentile and concatenates the results.
The problem with broadcasting this operation is that the percentile calculation includes a VERY expensive step, namely the
sorting of $very_big_ndarray, before the (trivial) calculation of the percentile from the sorted values, as detailed in
Wikipedia. So when the percentile operation is broadcast by PDL, the sort is repeated for each
percentile value in $pct, leading to a catastrophic loss of performance!
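The same pathology can be reproduced outside PDL. The NumPy sketch below (variable names are mine) mimics with an explicit Python loop what the broadcast effectively does, one full sort per requested quantile, and checks that sorting once gives identical answers:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(50_000)
ps = np.arange(50) / 50  # 50 requested quantiles

# What broadcasting effectively does: one full O(n log n) sort
# per requested quantile -- cost grows linearly with len(ps).
naive = np.array([np.sort(data)[int(p * data.size)] for p in ps])

# Sort-once alternative: one O(n log n) sort, then O(1) per quantile.
s = np.sort(data)
fast = s[(ps * data.size).astype(int)]

assert np.array_equal(naive, fast)  # same answers, ~len(ps) fewer sorts
```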
How can we fix this? It turns out to be reasonably trivial: we need to reimplement the percentile function so that it does not broadcast, i.e. so that it sorts the data only once.
One of the simplest quantile functions to implement is the one based on the empirical cumulative distribution function (this corresponds to the Type 3 quantile in the
classification by Hyndman and Fan).
This one can be trivially implemented in Perl using PDL as:

```perl
use PDL::Lite;

sub quantile_type_3 {
    my ( $data, $pct ) = @_;
    my $sorted_data = $data->qsort;              # sort ONCE
    my $nelem       = $data->nelem;
    my $cum_ranks   = ( $pct * $nelem )->floor;  # zero-based ranks of the requested quantiles
    return $sorted_data->index($cum_ranks);      # look them up in the sorted data
}
```

(Note that the original draft indexed into $data rather than $sorted_data, which returns elements in their unsorted positions; indexing the sorted ndarray is the correct operation.)
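For readers more at home in NumPy, the same sort-once logic can be mirrored as follows (a sketch; the function name is mine, and I add a clamp so that a probability of exactly 1.0 does not index past the end):

```python
import numpy as np

def quantile_ecdf(data, pct):
    """Empirical-CDF quantile: sort once, then index at floor(p * n).

    `pct` may be a scalar or an array of probabilities in [0, 1].
    """
    s = np.sort(np.asarray(data))                       # sort ONCE
    ranks = np.floor(np.asarray(pct) * s.size).astype(int)
    return s[np.minimum(ranks, s.size - 1)]             # clamp p == 1.0

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
print(quantile_ecdf(x, [0.0, 0.5, 1.0]))  # -> [1. 3. 5.]
```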
(The other quantile types can be implemented just as trivially using affine operations, as explained in R’s documentation of the quantile function.)
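As an example of such an affine variant, here is a NumPy sketch of the Type 7 quantile (R’s default, which interpolates linearly between order statistics after a single sort); it should agree with np.quantile, whose default "linear" method implements the same definition:

```python
import numpy as np

def quantile_type_7(data, pct):
    """Type 7 quantile (R's default): affine interpolation between
    adjacent order statistics, computed after a single sort."""
    s = np.sort(np.asarray(data, dtype=float))      # sort ONCE
    h = (s.size - 1) * np.asarray(pct, dtype=float) # fractional rank
    lo = np.floor(h).astype(int)                    # order statistic below
    hi = np.minimum(lo + 1, s.size - 1)             # order statistic above
    return s[lo] + (h - lo) * (s[hi] - s[lo])       # affine blend

x = np.random.default_rng(1).standard_normal(1000)
ps = np.linspace(0, 1, 11)
assert np.allclose(quantile_type_7(x, ps), np.quantile(x, ps))
```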
To see how well this works, I wrote a Perl benchmark script that times
the built-in function pct and the quantile_type_3 function on synthetic data, and then calls a companion
R script to profile the nine quantile types and the three sort
methods in R on the same dataset.
I obtained the following performance figures on my old Xeon: the “de-broadcasted” version of the quantile function matches the performance of the R implementations, whereas the
broadcasting PDL version is roughly 100 times slower.
| Test | Iterations | Elements | Quantiles | Elapsed Time (s) |
|---|---|---|---|---|
| pct | 10 | 1,000,000 | 100 | 132.430 |
| quantile_type_3 | 10 | 1,000,000 | 100 | 1.320 |
| pct_R_1 | 10 | 1,000,000 | 100 | 1.290 |
| pct_R_2 | 10 | 1,000,000 | 100 | 1.281 |
| pct_R_3 | 10 | 1,000,000 | 100 | 1.274 |
| pct_R_4 | 10 | 1,000,000 | 100 | 1.283 |
| pct_R_5 | 10 | 1,000,000 | 100 | 1.290 |
| pct_R_6 | 10 | 1,000,000 | 100 | 1.286 |
| pct_R_7 | 10 | 1,000,000 | 100 | 1.233 |
| pct_R_8 | 10 | 1,000,000 | 100 | 1.309 |
| pct_R_9 | 10 | 1,000,000 | 100 | 1.291 |
| sort_quick | 10 | 1,000,000 | 100 | 1.220 |
| sort_shell | 10 | 1,000,000 | 100 | 1.758 |
| sort_radix | 10 | 1,000,000 | 100 | 0.924 |
As can be seen from the table, the sorting operation accounts for the bulk of the execution time of the quantile functions.
Two major take-home points:
1) Don’t be afraid to look under the hood/inside the black box when performance is surprisingly disappointing!
2) Be mindful of the hidden cost of broadcast operations in PDL, NumPy, or MATLAB.