Conditionals on sequential data using numpy
It is well known that vectorized operations using numpy run much faster than equivalent loop-based operations. Sometimes, though, a problem is easier to think about in terms of loops than in terms of vectorized operations.
The problem
Suppose you need to check for a particular sequence in a list.
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'acceleration': [
        1, -3, -2, 7, -10, -3, -2, 8, 2, -1, 5, 10, 7, -3, -4, -6, -7, -9, 10,
        -4, -2, 3, 6, 8, 9, 12, -11, -9, -7, -3
    ]
})
The most straightforward way to think about it is to check for the condition using a for loop. For example, if you need to check whether the current value, the previous value, and the one before that are all negative, an easy way to do so is:
for i in range(2, len(data)):
    # True when the current value and the two before it are all negative.
    if (data.loc[i, 'acceleration'] < 0
            and data.loc[i - 1, 'acceleration'] < 0
            and data.loc[i - 2, 'acceleration'] < 0):
        data.loc[i, 'resultloop'] = 1
    else:
        data.loc[i, 'resultloop'] = 0
Using the code above, the column resultloop will hold a 1 if the condition is satisfied and a 0 otherwise; the first two rows, which the loop skips, are left as NaN. However, for long lists, this loop-based solution is very inefficient.
The vectorized solution
A much more efficient solution is to use vector operations. In this case, to do the ‘sequential check’, use pandas' shift method to shift the data by 1 and 2 rows. Then use numpy’s where function to do the negative-number check on the original and shifted values.
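# Compare only the acceleration column; comparing the whole DataFrame
# would also include the resultloop column added by the loop above.
data['resultvector'] = np.where(
    (data['acceleration'] < 0)
    & (data['acceleration'].shift(periods=1) < 0)
    & (data['acceleration'].shift(periods=2) < 0),
    1, 0)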
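To see why the shifted comparison works, it can help to look at what shift produces on a very short Series. The sketch below uses a made-up five-element Series named s (an assumption for illustration, not part of the data above) and lines up each value with the one and two rows before it.

```python
import pandas as pd

# Hypothetical toy Series, used only to illustrate shift.
s = pd.Series([-1, -2, -3, 4, -5])

prev1 = s.shift(periods=1)  # value one row earlier; the first row becomes NaN
prev2 = s.shift(periods=2)  # value two rows earlier; the first two rows become NaN

# Comparisons against NaN evaluate to False, so rows without two
# predecessors can never satisfy the condition.
all_negative = (s < 0) & (prev1 < 0) & (prev2 < 0)

print(pd.DataFrame({
    's': s,
    'prev1': prev1,
    'prev2': prev2,
    'all_negative': all_negative,
}))
```

Only the third row has a negative value preceded by two negative values, so it is the only row where all_negative is True.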
The result matches the loop-based solution on every row the loop visits (the first two rows, which lack two predecessors, are 0 here), but for longer Series the vectorized comparison will run much faster.
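One way to check both claims, equal results and faster execution, is to wrap each approach in a function and run them on the same data. The following is a minimal sketch under stated assumptions: the function names loop_check and vector_check, the synthetic random data, and the choice to initialize the loop's output to 0 are all introduced here for illustration and are not part of the original code.

```python
import timeit

import numpy as np
import pandas as pd


def loop_check(frame):
    """Row-by-row check, mirroring the loop-based solution."""
    out = frame.copy()
    # Assumption for this sketch: start every row at 0 so that the first
    # two rows, which the loop never visits, compare equal to the
    # vectorized result instead of being NaN.
    out['resultloop'] = 0
    for i in range(2, len(out)):
        if (out.loc[i, 'acceleration'] < 0
                and out.loc[i - 1, 'acceleration'] < 0
                and out.loc[i - 2, 'acceleration'] < 0):
            out.loc[i, 'resultloop'] = 1
    return out['resultloop'].to_numpy()


def vector_check(frame):
    """Same check using shift and np.where on the whole column at once."""
    acc = frame['acceleration']
    return np.where((acc < 0) & (acc.shift(1) < 0) & (acc.shift(2) < 0), 1, 0)


# Synthetic test data (hypothetical, not the article's data).
rng = np.random.default_rng(seed=0)
frame = pd.DataFrame({'acceleration': rng.integers(-10, 11, size=2_000)})

loop_result = loop_check(frame)
vector_result = vector_check(frame)
print('results match:', np.array_equal(loop_result, vector_result))

# Wall-clock time for a single run of each; absolute numbers depend on
# the machine and the pandas/numpy versions.
print('loop:  ', timeit.timeit(lambda: loop_check(frame), number=1))
print('vector:', timeit.timeit(lambda: vector_check(frame), number=1))
```

The timings are printed rather than asserted because they vary from machine to machine; increasing size makes the gap between the two approaches easier to see.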