Luis Alvergue

Controls engineer by training, transportation engineer by trade

Conditionals on sequential data using numpy

5 minutes
June 27, 2021

It is well known that vectorized operations in numpy run much faster than equivalent loop-based operations. Sometimes, though, a problem is easier to think about in terms of loops than in terms of vectorized operations.

The problem

Suppose you need to check for a particular pattern in sequential data, such as a column of acceleration values.

import pandas as pd
import numpy as np

data = pd.DataFrame({
    'acceleration': [
        1, -3, -2, 7, -10, -3, -2, 8, 2, -1, 5, 10, 7, -3, -4, -6, -7, -9, 10,
        -4, -2, 3, 6, 8, 9, 12, -11, -9, -7, -3
    ]
})

The most straightforward way to think about it is to check for the condition using a for loop. For example, if you need to check whether the current value, the previous value, and the one before that are all negative, an easy way to do so is:

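# Start at index 2 so that two previous values are always available
for i in range(2, len(data)):
    # True when the current value and the two before it are all negative
    if (data.loc[i, 'acceleration'] < 0
            and data.loc[i - 1, 'acceleration'] < 0
            and data.loc[i - 2, 'acceleration'] < 0):
        data.loc[i, 'resultloop'] = 1
    else:
        data.loc[i, 'resultloop'] = 0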

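With the code above, the resultloop column holds a 1 where the condition is satisfied and a 0 otherwise. The first two rows, which do not have two predecessors, are never assigned and stay NaN. For long lists, this loop-based solution is very inefficient, because every iteration performs several label lookups through .loc in Python.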

The vectorized solution

A much more efficient solution is to use vectorized operations. To do the ‘sequential check’, use pandas' shift method to shift the data by 1 and 2 rows, which lines up each value with its two predecessors. Then use numpy’s where function to test whether the original and the shifted Series are all negative.
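To make the shifting step concrete, here is a minimal sketch on a short, made-up Series (the names `s`, `aligned`, and `all_negative` and the values are illustrative and not part of the data above). `shift(periods=1)` moves every value down one row, so row i of the shifted Series holds the value from row i - 1. The vacated rows are filled with NaN, and any comparison with NaN evaluates to False.

```python
import pandas as pd

# Hypothetical example values, chosen only to illustrate shifting
s = pd.Series([-1, -2, -3, 4, -5])

# Line up each value with its two predecessors
aligned = pd.DataFrame({
    'current': s,
    'previous': s.shift(periods=1),   # value from one row earlier
    'two_back': s.shift(periods=2),   # value from two rows earlier
})

# Row-wise check: are all three aligned values negative?
all_negative = (aligned < 0).all(axis=1)
print(aligned.assign(all_negative=all_negative))
```

Only row 2 is flagged, since it is the only row whose current value and two predecessors are all negative. Applied to the acceleration data, the same idea becomes: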

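# Compare only the acceleration column, not the whole DataFrame,
# so that other columns (such as resultloop) do not enter the check
accel = data['acceleration']
data['resultvector'] = np.where(
    (accel < 0) & (accel.shift(periods=1) < 0) & (accel.shift(periods=2) < 0),
    1, 0)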

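From the third row onward, resultvector matches the loop-based resultloop. The first two rows differ: the loop leaves them unset (NaN), while the vectorized version gives 0 because comparisons with the NaN values introduced by shift evaluate to False. For longer Series, the vectorized version runs much faster.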
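One way to check both the equivalence and the speed difference is to run the two approaches on a larger input. The sketch below is an illustration rather than a benchmark: the helper names `flag_loop` and `flag_vector`, the random test data, and its size of 2,000 rows are all hypothetical choices, and timings will vary by machine and library version.

```python
import time

import numpy as np
import pandas as pd


def flag_loop(df):
    """Loop-based check; rows 0 and 1 are never assigned and stay NaN."""
    out = df.copy()
    for i in range(2, len(out)):
        if (out.loc[i, 'acceleration'] < 0
                and out.loc[i - 1, 'acceleration'] < 0
                and out.loc[i - 2, 'acceleration'] < 0):
            out.loc[i, 'result'] = 1
        else:
            out.loc[i, 'result'] = 0
    return out['result']


def flag_vector(df):
    """Vectorized check using shift and np.where."""
    a = df['acceleration']
    return pd.Series(
        np.where((a < 0) & (a.shift(1) < 0) & (a.shift(2) < 0), 1, 0),
        index=df.index)


# Hypothetical test data: 2,000 random integers between -10 and 10
rng = np.random.default_rng(0)
big = pd.DataFrame({'acceleration': rng.integers(-10, 11, size=2000)})

start = time.perf_counter()
loop_result = flag_loop(big)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vector_result = flag_vector(big)
vector_seconds = time.perf_counter() - start

# Compare from row 2 onward, where the loop has assigned a value
same = (loop_result.iloc[2:] == vector_result.iloc[2:]).all()
print(f'same from row 2 onward: {same}')
print(f'loop: {loop_seconds:.4f} s, vectorized: {vector_seconds:.4f} s')
```

The comparison starts at row 2 to sidestep the first two rows, which the loop never assigns.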