Exploring weather data: Iterative and recursive algorithms by the example of CountInInterval

In this notebook we want to explore weather data of the weather in Eindhoven. Specifically, we will develop a routine CountInInterval(A, p, q) that computes for an array A how many values are between p and q.

In [1]:
import csv

with open('weather-eindhoven.csv') as csvfile:
    reader = csv.reader(csvfile)
    next(reader)  # skip header row
    datatype = int
    date = []
    max_temperature = []
    for row in reader:
        date.append(datatype(row[1])) #column 1 (that is, the second) contains the date
        max_temperature.append(datatype(row[14])/10) # column 14 contains the max temperature times 10
In [2]:
def countInInterval(A, p, q):
    answer = 0
    for i in range(0, len(A)):
        if A[i]>= p and A[i]<= q: answer = answer+1
    return answer
In [3]:
print(countInInterval(max_temperature, -5, 5))
print(countInInterval(max_temperature, 20, 30))
1220
2963
In [4]:
%timeit countInInterval(max_temperature, -5, 5)
100 loops, best of 3: 29.6 ms per loop

Assume we want to use the countInInterval method frequently. Then it seems a good idea to invest some time into extra preprocessing, if in turn we can count faster. A simple approach is to sort the temperatures.

In [5]:
max_temperature.sort()

Now lets try to count using a recursive divide&conquer algorithm:

In [6]:
def countInInterval2(A, p, q, x=0, y=None):
    if (y == None): y = len(A)-1
    if (x<y):
        h = (x+y)//2
        if (A[h]<p): return countInInterval2(A, p, q, h+1, y)
        else: return countInInterval2(A, p, q, x, h)
    else: 
        count = 0
        while(A[x]>= p and A[x]<=q and x<len(A)):
            count = count+1
            x = x+1
        return count
In [9]:
countInInterval2(max_temperature, -5, 5)
Out[9]:
1220
In [10]:
%timeit countInInterval2(max_temperature, -5, 5)
100 loops, best of 3: 6.3 ms per loop

This this is faster than the iterative algorithm on the unsorted data, the algorithm above sometimes requires two recursive calls. This can be avoided:

In [11]:
def countInInterval3(A, p, q, x=0, y=None):
    if (y == None): y = len(A)-1
    if (x<y):
        h = (x+y)//2
        if (A[h]<p): 
            return countInInterval3(A, p, q, h+1, y)
        else: 
            if (A[h]>q): return countInInterval3(A, p, q, x, h)
            else: return countInInterval3(A, p, q, h+1, y) + countInInterval3(A, p, q, x, h)
    else: 
        if(A[x]>= p and A[x]<=q): return 1
        else: return 0
In [12]:
countInInterval3(max_temperature, -5, 5)
Out[12]:
1220
In [13]:
%timeit countInInterval3(max_temperature, -5, 5)
100 loops, best of 3: 18.7 ms per loop

This didn't work out that well. And there is also something unsatisfactory: If many values are counted the algorithm will also take long. We can avoid this. Essentially we just need two binary searches.

In [19]:
def binary_search_lower(A, v, x, y):
    if (x<y):
        h = (x+y)//2
        if (A[h]<v): return binary_search_lower(A, v, h+1, y)
        else: return binary_search_lower(A, v, x, h)
    else: 
        return x
    
def binary_search_upper(A, v, x, y):
    if (x<y):
        h = (x+y)//2
        if (A[h]<=v): return binary_search_upper(A, v, h+1, y)
        else: return binary_search_upper(A, v, x, h)
    else: 
        return x
    
def countInInterval4(A, p, q):
    return binary_search_upper(A, q, 0, len(A)-1) - binary_search_lower(A, p, 0, len(A)-1)
In [20]:
countInInterval4(max_temperature, -5, 5)
Out[20]:
1220
In [21]:
%timeit countInInterval4(max_temperature, -5, 5)
10000 loops, best of 3: 85.8 ┬Ás per loop
In [ ]: