Exploring weather data: Iterative and recursive algorithms by the example of CountInInterval

In this notebook we want to explore weather data of the weather in Eindhoven. Specifically, we will develop a routine CountInInterval(A, p, q) that computes for an array A how many values are between p and q.

In [3]:
import csv

with open('weather-eindhoven.csv') as csvfile:
    reader = csv.reader(csvfile)
    next(reader)  # skip header row
    datatype = int
    date = []
    max_temperature = []
    for row in reader:
        date.append(datatype(row[1])) #column 1 (that is, the second) contains the date
        max_temperature.append(datatype(row[14])/10) # column 14 contains the max temperature times 10
In [4]:
def countInInterval(A, p, q):
    answer = 0
    for i in range(0, len(A)):
        if A[i]>= p and A[i]<= q: answer = answer+1
    return answer

Before using this algorithm, lets prove the correctness of this algorithm using a loop invariant proof for the for-loop (lines 3-4)

Loop invariant: At the beginning of the iteration with index i, answer is the number of indices $0 < j < i$ with $p \leq A[j] \leq q$.

Initialization: Before the first iteration $i=0$, so there is actually no index $j$ with $0 < j < 0$. answer should be 0 and is 0.

Maintenance: Before the iteration with index $i$, the loop invariant gives us that answer is the number of indices $0 < j < i$ with $p \leq A[j] \leq q$. In the iteration with index $i$ there are two cases:

(1) If $A[i]$ is between $p$ and $q$ then answer is incremented by 1. Also the number of indices $j$ with $0 < j < i+1$ and $p \leq A[j] \leq q$ is by one larger than the corresponding number for $0 < j < i$. So the loop invariant holds again at the beginning of the next iteration.

(2) If $A[i]$ is not between $p$ and $q$ then answer stays unchanged. Also the number of indices $j$ with $0 < j < i+1$ and $p \leq A[j] \leq q$ is the same as the corresponding number for $0 < j < i$. So the loop invariant also in this case holds again at the beginning of the next iteration.

Termination: After the last iteration we have $i = len(A)$. Now the loop invariant gives us that answer is the number of indices $j$ with $0 < j < len(A)$ and $p \leq A[j] \leq q$. Since $A[0:len(A)]$ is all of $A$, this means that answer is the total number of indices with $p \leq A[j] \leq q$, which is the answer we want to return.

The data is over 32 years. Lets see how many cold and how many warm days a year had on average.

In [13]:
print(countInInterval(max_temperature, -5, 5)//32)
print(countInInterval(max_temperature, 20, 30)//32)
38
92
In [7]:
%timeit countInInterval(max_temperature, -5, 5)
100 loops, best of 3: 3.33 ms per loop

Assume we want to use the countInInterval method frequently. Then it seems a good idea to invest some time into extra preprocessing, if in turn we can count faster. A simple approach is to sort the temperatures.

In [8]:
max_temperature.sort()

Now lets try to count using a recursive divide&conquer algorithm:

In [9]:
def countInInterval2(A, p, q, x=0, y=None):
    if (y == None): y = len(A)-1
    if (x<y):
        h = (x+y)//2
        if (A[h]<p): return countInInterval2(A, p, q, h+1, y)
        else: return countInInterval2(A, p, q, x, h)
    else: 
        count = 0
        while(A[x]>= p and A[x]<=q and x<len(A)):
            count = count+1
            x = x+1
        return count
In [11]:
countInInterval2(max_temperature, -5, 5)//32
Out[11]:
38
In [14]:
%timeit countInInterval2(max_temperature, -5, 5)
1000 loops, best of 3: 827 µs per loop

This this is faster than the iterative algorithm on the unsorted data, the algorithm above sometimes requires two recursive calls. This can be avoided:

In [15]:
def countInInterval3(A, p, q, x=0, y=None):
    if (y == None): y = len(A)-1
    if (x<y):
        h = (x+y)//2
        if (A[h]<p): 
            return countInInterval3(A, p, q, h+1, y)
        else: 
            if (A[h]>q): return countInInterval3(A, p, q, x, h)
            else: return countInInterval3(A, p, q, h+1, y) + countInInterval3(A, p, q, x, h)
    else: 
        if(A[x]>= p and A[x]<=q): return 1
        else: return 0
In [17]:
countInInterval3(max_temperature, -5, 5)//32
Out[17]:
38
In [18]:
%timeit countInInterval3(max_temperature, -5, 5)
1000 loops, best of 3: 2.1 ms per loop

This didn't work out that well. And there is also something unsatisfactory: If many values are counted the algorithm will also take long. We can avoid this. Essentially we just need two binary searches.

In [19]:
def binary_search_lower(A, v, x, y):
    if (x<y):
        h = (x+y)//2
        if (A[h]<v): return binary_search_lower(A, v, h+1, y)
        else: return binary_search_lower(A, v, x, h)
    else: 
        return x
    
def binary_search_upper(A, v, x, y):
    if (x<y):
        h = (x+y)//2
        if (A[h]<=v): return binary_search_upper(A, v, h+1, y)
        else: return binary_search_upper(A, v, x, h)
    else: 
        return x
    
def countInInterval4(A, p, q):
    return binary_search_upper(A, q, 0, len(A)-1) - binary_search_lower(A, p, 0, len(A)-1)
In [22]:
countInInterval4(max_temperature, -5, 5)//32
Out[22]:
38
In [23]:
%timeit countInInterval4(max_temperature, -5, 5)
100000 loops, best of 3: 16.3 µs per loop

As we have seen, countInInterval runs with two binary searches much faster than by simply checking all data. What if we would want to find all or count the warm days with little or no rain? Of course we can simply check all data, but to get an efficient solution, we will need to store the data in a suitable data structure. We will see how to do this in a later lecture, namely the lecture on range searching.

In [ ]: