## Exploring weather data: Iterative and recursive algorithms by the example of CountInInterval¶

In this notebook we want to explore weather data of the weather in Eindhoven. Specifically, we will develop a routine CountInInterval(A, p, q) that computes for an array A how many values are between p and q.

In [3]:
import csv

with open('weather-eindhoven.csv') as csvfile:
datatype = int
date = []
max_temperature = []
date.append(datatype(row[1])) #column 1 (that is, the second) contains the date
max_temperature.append(datatype(row[14])/10) # column 14 contains the max temperature times 10

In [4]:
def countInInterval(A, p, q):
for i in range(0, len(A)):


Before using this algorithm, lets prove the correctness of this algorithm using a loop invariant proof for the for-loop (lines 3-4)

Loop invariant: At the beginning of the iteration with index i, answer is the number of indices $0 < j < i$ with $p \leq A[j] \leq q$.

Initialization: Before the first iteration $i=0$, so there is actually no index $j$ with $0 < j < 0$. answer should be 0 and is 0.

Maintenance: Before the iteration with index $i$, the loop invariant gives us that answer is the number of indices $0 < j < i$ with $p \leq A[j] \leq q$. In the iteration with index $i$ there are two cases:

(1) If $A[i]$ is between $p$ and $q$ then answer is incremented by 1. Also the number of indices $j$ with $0 < j < i+1$ and $p \leq A[j] \leq q$ is by one larger than the corresponding number for $0 < j < i$. So the loop invariant holds again at the beginning of the next iteration.

(2) If $A[i]$ is not between $p$ and $q$ then answer stays unchanged. Also the number of indices $j$ with $0 < j < i+1$ and $p \leq A[j] \leq q$ is the same as the corresponding number for $0 < j < i$. So the loop invariant also in this case holds again at the beginning of the next iteration.

Termination: After the last iteration we have $i = len(A)$. Now the loop invariant gives us that answer is the number of indices $j$ with $0 < j < len(A)$ and $p \leq A[j] \leq q$. Since $A[0:len(A)]$ is all of $A$, this means that answer is the total number of indices with $p \leq A[j] \leq q$, which is the answer we want to return.

The data is over 32 years. Lets see how many cold and how many warm days a year had on average.

In [13]:
print(countInInterval(max_temperature, -5, 5)//32)
print(countInInterval(max_temperature, 20, 30)//32)

38
92

In [7]:
%timeit countInInterval(max_temperature, -5, 5)

100 loops, best of 3: 3.33 ms per loop


Assume we want to use the countInInterval method frequently. Then it seems a good idea to invest some time into extra preprocessing, if in turn we can count faster. A simple approach is to sort the temperatures.

In [8]:
max_temperature.sort()


Now lets try to count using a recursive divide&conquer algorithm:

In [9]:
def countInInterval2(A, p, q, x=0, y=None):
if (y == None): y = len(A)-1
if (x<y):
h = (x+y)//2
if (A[h]<p): return countInInterval2(A, p, q, h+1, y)
else: return countInInterval2(A, p, q, x, h)
else:
count = 0
while(A[x]>= p and A[x]<=q and x<len(A)):
count = count+1
x = x+1
return count

In [11]:
countInInterval2(max_temperature, -5, 5)//32

Out[11]:
38
In [14]:
%timeit countInInterval2(max_temperature, -5, 5)

1000 loops, best of 3: 827 µs per loop


This this is faster than the iterative algorithm on the unsorted data, the algorithm above sometimes requires two recursive calls. This can be avoided:

In [15]:
def countInInterval3(A, p, q, x=0, y=None):
if (y == None): y = len(A)-1
if (x<y):
h = (x+y)//2
if (A[h]<p):
return countInInterval3(A, p, q, h+1, y)
else:
if (A[h]>q): return countInInterval3(A, p, q, x, h)
else: return countInInterval3(A, p, q, h+1, y) + countInInterval3(A, p, q, x, h)
else:
if(A[x]>= p and A[x]<=q): return 1
else: return 0

In [17]:
countInInterval3(max_temperature, -5, 5)//32

Out[17]:
38
In [18]:
%timeit countInInterval3(max_temperature, -5, 5)

1000 loops, best of 3: 2.1 ms per loop


This didn't work out that well. And there is also something unsatisfactory: If many values are counted the algorithm will also take long. We can avoid this. Essentially we just need two binary searches.

In [19]:
def binary_search_lower(A, v, x, y):
if (x<y):
h = (x+y)//2
if (A[h]<v): return binary_search_lower(A, v, h+1, y)
else: return binary_search_lower(A, v, x, h)
else:
return x

def binary_search_upper(A, v, x, y):
if (x<y):
h = (x+y)//2
if (A[h]<=v): return binary_search_upper(A, v, h+1, y)
else: return binary_search_upper(A, v, x, h)
else:
return x

def countInInterval4(A, p, q):
return binary_search_upper(A, q, 0, len(A)-1) - binary_search_lower(A, p, 0, len(A)-1)

In [22]:
countInInterval4(max_temperature, -5, 5)//32

Out[22]:
38
In [23]:
%timeit countInInterval4(max_temperature, -5, 5)

100000 loops, best of 3: 16.3 µs per loop


As we have seen, countInInterval runs with two binary searches much faster than by simply checking all data. What if we would want to find all or count the warm days with little or no rain? Of course we can simply check all data, but to get an efficient solution, we will need to store the data in a suitable data structure. We will see how to do this in a later lecture, namely the lecture on range searching.

In [ ]: