Autocorrelation Detection and Mitigation

A benchmark session needs to be long enough to collect enough samples to calculate the CI at the desired confidence level. The more samples we have, the narrower the CI can be made. However, a crucial issue that is often overlooked in published benchmark results is the autocorrelation among samples. Autocorrelation is the cross-correlation of a sequence of measurements with itself at different points in time. Conceptually, a high autocorrelation means that previous data points can be used to predict future data points. The samples are then not independent, which invalidates the calculation of the CI no matter how large the sample size is.

Most measurements in computer systems are autocorrelated because of the stateful nature of computer systems. For instance, most computer systems have one or more schedulers, which allocate time slices to jobs. Measurements of such jobs are highly correlated when they are taken within a single time slice, and change significantly between time slices, unless the duration of a measurement unit is significantly longer than a time slice. The autocorrelation in the samples must be properly handled before we can go on to the next step and calculate the samples' CI.

Autocorrelation is measured by the autocorrelation coefficient of a sequence, which is the covariance between measurements of the same sequence taken \(\tau\) apart, normalized by the variance of the sequence:

\[R(\tau) = \frac{\operatorname{E}[(X_t - \mu)(X_{t+\tau} - \mu)]}{\sigma^2},\]

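where \(\tau\) is the time lag, \(\mu\) is the mean of the sequence, and \(\sigma^2\) is its variance. The autocorrelation coefficient is a number in the range \([-1,1]\), where \(-1\) means the sample data are perfectly negatively (inversely) correlated, \(1\) means they are perfectly positively correlated, and \(0\) means there is no linear correlation. In statistics, a coefficient within \([-0.1, 0.1]\) is deemed sufficient for declaring that the sample data have negligible autocorrelation [ferrari:78].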
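As a rough illustration of the formula, the coefficient can be estimated from a finite list of samples by replacing the expectation, \(\mu\), and \(\sigma^2\) with their sample counterparts. The function name `autocorrelation` and the choice of the common biased estimator (the lagged sum divided by the full sum of squared deviations) are assumptions made for this sketch, not necessarily what Pilot does internally.

```python
from statistics import fmean


def autocorrelation(samples, lag=1):
    """Estimate R(lag) for a sequence of measurements (hypothetical helper)."""
    n = len(samples)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(samples) - 1")
    mu = fmean(samples)
    # Sum of squared deviations from the mean (n times the sample variance).
    var_sum = sum((x - mu) ** 2 for x in samples)
    if var_sum == 0:
        raise ValueError("autocorrelation is undefined for a constant sequence")
    # Lagged products of deviations (about n times the sample autocovariance).
    cov_sum = sum((samples[t] - mu) * (samples[t + lag] - mu)
                  for t in range(n - lag))
    return cov_sum / var_sum


# A slowly rising sequence is strongly positively correlated at lag 1,
# while a strictly alternating one is strongly negatively correlated.
ramp = list(range(100))
alternating = [1 if i % 2 == 0 else -1 for i in range(100)]
print(autocorrelation(ramp), autocorrelation(alternating))
```

With this estimator the result always falls in \([-1,1]\), so it can be compared directly against the \([-0.1, 0.1]\) range mentioned above.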

Subsession analysis [ferrari:78] is a statistical method for handling autocorrelation in sample data. \(n\)-subsession analysis groups the test data into subsessions and combines every \(n\) consecutive samples into one new sample. Pilot calculates the autocorrelation coefficient of the measurement data after sanitizing it (for example, by removing non-stable phases), and gradually increases \(n\) until the autocorrelation coefficient falls within the desired range.
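The procedure can be sketched as a small loop. This is a minimal illustration under stated assumptions rather than Pilot's actual implementation: samples are combined by averaging, only the lag-1 coefficient is checked, \(n\) grows by one each round, and the search stops once fewer than `min_merged` combined samples would remain. All function and parameter names here are hypothetical.

```python
from statistics import fmean


def autocorrelation(samples, lag=1):
    """Sample autocorrelation coefficient at the given lag (biased estimator)."""
    mu = fmean(samples)
    var_sum = sum((x - mu) ** 2 for x in samples)
    if var_sum == 0:
        raise ValueError("autocorrelation is undefined for a constant sequence")
    cov_sum = sum((samples[t] - mu) * (samples[t + lag] - mu)
                  for t in range(len(samples) - lag))
    return cov_sum / var_sum


def merge_subsessions(samples, n):
    """Average every n consecutive samples; an incomplete tail is dropped."""
    usable = len(samples) - len(samples) % n
    return [fmean(samples[i:i + n]) for i in range(0, usable, n)]


def subsession_analysis(samples, threshold=0.1, min_merged=10):
    """Increase n until |R(1)| of the merged samples is within the threshold.

    Returns (n, merged_samples, coefficient), or None when the data run out
    before the coefficient reaches the desired range.
    """
    n = 1
    while len(samples) // n >= min_merged:
        merged = merge_subsessions(samples, n)
        r = autocorrelation(merged)
        if abs(r) <= threshold:
            return n, merged, r
        n += 1
    return None
```

A `None` result means the session was too short for the chosen threshold, which suggests collecting more samples before calculating the CI. When a result is returned, the CI would be calculated from the merged samples rather than the raw ones.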

[ferrari:78] Domenico Ferrari. Computer Systems Performance Evaluation. Prentice-Hall, 1978.