Linear Search¶

From timings to asymptotic analysis¶

We want to analyze the efficiency of algorithms. We want to do so by looking at the running time of an algorithm. This sounds both easy and difficult.

It seems easy, because we can simply check how many seconds an algorithm runs. To illustrate this, we look at how long it takes to check whether a given number $x$ occurs in a collection $A$ of numbers.

In [17]:
A = [10, 17, 22, 1, 9, 5]

In [18]:
%%timeit
11 in A

The slowest run took 9.11 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 421 ns per loop


So what does this number tell us about the algorithm? Not much yet. We can try to use it to compare against a different algorithm. We again use an algorithm that checks for a number, but this time the collection $A$ is stored differently (we will see more about the different Python data types in a later lecture; for now we can simply think of these as different data structures implementing 'in').

In [19]:
A = set([10, 17, 22, 1, 9, 5])

In [21]:
%%timeit
11 in A

The slowest run took 18.33 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 140 ns per loop


We observe that this algorithm ran faster (on my machine, this time). But there is little we can conclude from this: it only tells us something about the running time for a specific $A$, a specific $x$, and on my computer. Naturally we are interested in the running time for large instances, since we want to know how scalable the algorithms are. So let us test larger instances, again comparing the two data types: list and set.

In [22]:
A = list(range(2, 300))

In [23]:
%%timeit
1 in A

100000 loops, best of 3: 37.4 µs per loop

In [26]:
A = set(range(2, 300))

In [43]:
%%timeit
1 in A

The slowest run took 13.67 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 140 ns per loop


In preparation for plotting running times, let's obtain (nearly) the same time measurement as above in a different way. 'repeat=3' runs timeit three times, and by taking the minimum ('min') we take the best of 3. The rationale for taking the 'best of 3' is to ignore runs that are slowed down because my computer is busy with other processes. The result is the total time in seconds for $10^6$ loops, which corresponds to microseconds per loop.

In [51]:
import timeit
min(timeit.repeat(repeat=3, stmt='1 in A', setup='A = set(range(2, 300))'))

Out[51]:
0.12074233655766875
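To relate this number back to the per-loop times that %%timeit reported, we can divide the total by the number of loops. A small sketch, assuming timeit's default of $10^6$ executions per measurement:

```python
import timeit

# timeit executes the statement 1,000,000 times by default, so dividing the
# reported total (in seconds) by 10**6 gives the time per single execution.
total = min(timeit.repeat(repeat=3, stmt='1 in A', setup='A = set(range(2, 300))'))
per_loop = total / 10**6
print(per_loop)  # on the machine above this came out to roughly 1.2e-07 s, i.e. ~120 ns
```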

The difference between lists and sets seems to become more significant with increasing size of $A$: the list is now much slower, while the set did not really get slower at all. Without any attempt to explain the following code right now, we can also look at how the running time grows with increasing input size.

In [58]:
import timeit

def test_in(A, x):
    return x in A

x = range(1, 101, 10)
y = []
z = []
for n in x:
    A = list(range(0, n))
    y.append(min(timeit.repeat(lambda: test_in(A, 1000), repeat=3)))
    B = set(range(0, n))
    z.append(min(timeit.repeat(lambda: test_in(B, 1000), repeat=3)))

%matplotlib notebook

import matplotlib.pyplot as plt
list_plot, = plt.plot(x, y, 'r') # plot list 'in' in red
set_plot, = plt.plot(x, z, 'b')  # plot set 'in' in blue
plt.legend([list_plot, set_plot], ['list', 'set'])
plt.ylabel(r'running time (in $\mu$s)')
plt.xlabel('size of input')
plt.title("running time for 'in' on different data types")
plt.ylim(ymin=0)
plt.show()


In this plot we seem to see a very clear distinction. The running time of 'in' seems to grow linearly for lists, but does not seem to grow at all for sets. Thus the order of growth is different, and the order of growth is what really matters here. The beauty of the asymptotic analysis that we will discuss in the first lecture is that we can work with the order of growth without resorting to experiments. At the end of this lecture it should be clear why the running time of 'in' on lists (which for this purpose are very similar to arrays) is in $O(n)$. The running time of 'in' on sets, however, we will come back to much later in the course.
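One way to see the linear order of growth for lists without any timing at all is to count the basic operations directly. A minimal sketch (the helper `count_comparisons` is ours, purely for illustration):

```python
def count_comparisons(A, x):
    """Count the equality comparisons a linear scan of A makes for x."""
    comparisons = 0
    for element in A:
        comparisons += 1
        if element == x:
            break  # found x, no further comparisons needed
    return comparisons

# In the worst case (x not in A) every element is compared, so the number of
# operations, and hence the running time, grows linearly with the input size:
print([count_comparisons(list(range(n)), -1) for n in (10, 100, 1000)])  # [10, 100, 1000]
```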

Linear search in python¶

On the slides I use pseudo-code for linear search, since at this point I assume it is easier for you to read. But we can very easily rewrite the pseudo-code as Python code. Recall that the pseudo-code was

Linear-Search(A, n, x)

1. Set answer to Not-Found
2. For each index i, going from 1 to n, in order:
3.    If A[i] = x, then set answer to the value of i
4. Return the value of answer as the output

In Python this would read as follows (we will use -1 to represent Not-Found). Note that the index of the first element is 0 (and not 1). Further note that range(start, stop) includes the numbers from start to stop-1 (and not stop).
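As a quick sanity check of these two conventions:

```python
A = [10, 17, 22, 1, 9, 5]
print(A[0])                    # indices start at 0, so this is the first element: 10
print(list(range(0, 3)))       # the stop value is excluded: [0, 1, 2]
print(list(range(0, len(A))))  # exactly the valid indices of A: [0, 1, 2, 3, 4, 5]
```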

In [6]:
def linear_search(A, x):
    answer = -1  # -1 represents Not-Found
    for i in range(0, len(A)):
        if A[i] == x: answer = i
    return answer

In [7]:
linear_search([10, 5, 9, 9], 10)

Out[7]:
0
In [8]:
linear_search([10, 5, 9, 9], 9)

Out[8]:
3
In [9]:
linear_search([10, 5, 9, 9], 8)

Out[9]:
-1

Also the better linear search can be directly translated into python:

Better-Linear-Search(A, n, x)

1. For i = 1 to n:
2.    If A[i] = x, then return the value of i as the output
3. Return Not-Found as the output
In [76]:
def better_linear_search(A, x):
    for i in range(0, len(A)):
        if A[i] == x: return i
    return -1

In [11]:
better_linear_search([10, 5, 9, 9], 10)

Out[11]:
0
In [12]:
better_linear_search([10, 5, 9, 9], 9)

Out[12]:
2
In [13]:
better_linear_search([10, 5, 9, 9], 8)

Out[13]:
-1

Note that the algorithm is again correct, but produces a different output (here, the index of the first occurrence of 9 rather than the last)! To rule out such differences we would have to be even more careful when specifying the output. We can use the plotting code from above to experimentally evaluate the performance of the two algorithms, although we already know what to expect from the theoretical analysis.
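To make the difference concrete, consider the duplicate value 9 in the example input (both searches are repeated here so the snippet is self-contained):

```python
def linear_search(A, x):
    answer = -1
    for i in range(0, len(A)):
        if A[i] == x: answer = i  # keeps overwriting: remembers the last match
    return answer

def better_linear_search(A, x):
    for i in range(0, len(A)):
        if A[i] == x: return i    # stops immediately: returns the first match
    return -1

print(linear_search([10, 5, 9, 9], 9))         # 3 (last occurrence)
print(better_linear_search([10, 5, 9, 9], 9))  # 2 (first occurrence)
```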

In [14]:
x = range(1, 31, 10)
y_1 = []
y_2 = []
y_3 = []
y_4 = []

for n in x:
    A = list(range(0, n))
    y_1.append(timeit.timeit(lambda: linear_search(A, 1000)))
    y_2.append(timeit.timeit(lambda: linear_search(A, 0)))
    y_3.append(timeit.timeit(lambda: better_linear_search(A, 1000)))
    y_4.append(timeit.timeit(lambda: better_linear_search(A, 0)))

%matplotlib notebook

import matplotlib.pyplot as plt
plt.plot(x, y_1, 'r') # linear_search, x = 1000 not in A (worst case), red
plt.plot(x, y_2, 'g') # linear_search, x = 0 is the first element, green
plt.plot(x, y_3, 'b') # better_linear_search, worst case, blue
plt.plot(x, y_4, 'c') # better_linear_search, best case, cyan
plt.ylim(ymin=0)
plt.show()


As expected, the best case for better_linear_search is much better, namely seemingly $O(1)$. The worst case of both searches does indeed seem to be linear (although from experiments alone we cannot tell whether things could not get even worse). Asymptotic analysis to the rescue!

Linear search on data¶

Finally, let's actually use linear search on some data. We will use some weather data from the closest weather station, which is in Eindhoven. The file 'weather-eindhoven.csv' contains weather information for each day between 1985 and 2016. The variables are explained in 'weather-variables.txt'. Let's say we want to find the first day since 1985 for which the maximum temperature was exactly 30.0 degrees Celsius.

In [72]:
import csv

with open('weather-eindhoven.csv') as csvfile:
    date = []
    max_temperature = []
    for row in csv.reader(csvfile):
        if not row or not row[0].strip().isdigit():
            continue  # skip header lines, if any
        # the column indices and the 0.1-degree storage convention below are
        # assumptions; see 'weather-variables.txt' for the actual file layout
        date.append(int(row[1]))
        max_temperature.append(int(row[14]) / 10)

index = better_linear_search(max_temperature, 30)
date[index]

Out[72]:
19950720