Remember when we chatted about how crucial it is to pick the right data structure for quick membership checks? If you missed that, no worries, you can still catch up right here. Now, let's dive into Part 2 of our series and uncover more insights together!
After reading part 1, you might be wondering… what kind of wizardry is going on behind the scenes to achieve those kickass benchmarks?
To answer that, let's step back and ask a more fundamental question: what is the problem?
Chapter 1 – The potato in a bag of apples 🥔🍎
You’ve probably written a variation of the following logic many times:
user_name = input("Enter a user name: ")  # <acquire user name>
seen = ["James", "Emily", "Wei", "Amina", "Carlos"]
if user_name in seen:
    ...  # <do this>
else:
    ...  # <do that>
Classically, a computer compares the input against one array value at a time. So let's say you want to check whether the name "James" is in an array/list. How would the program behave?
The program needs to check each element of the list, one at a time, to tell whether the string "James" is a member of the array or not. In this simple case we were lucky and found it at the very first entry.
Now let's see what happens if we try with the string "Wei". The program will check: is it "James"? No! Is it "Emily"? No! Is it "Wei"? Yes!
What if we check whether the string "Svetlana" is in the list? We'll find that a naive program needs to scan the entire list before it can determine that "Svetlana" does not belong to it.
For a list of 5 items that's no big deal, but what if the input grows to 10 items? Or 100? Or 1,000,000?
That's the kind of performance degradation you get when you use a list/array for this kind of check, and that's only the lookup operation; we haven't even discussed insertion or deletion (spoiler alert: they take the same effort!)
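To make that cost concrete, here's a minimal sketch of the scan a naive membership test performs (Python's `in` operator on a list does essentially this walk, just implemented in C; the helper name `linear_contains` is mine):

```python
def linear_contains(items: list[str], target: str) -> bool:
    # Compare the target against one element at a time: O(n) worst case.
    for item in items:
        if item == target:
            return True
    return False

seen = ["James", "Emily", "Wei", "Amina", "Carlos"]
print(linear_contains(seen, "Wei"))       # True, found after 3 comparisons
print(linear_contains(seen, "Svetlana"))  # False, after scanning all 5 entries
```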
Meet `map_user_name`. It is a function that takes a user name as input and returns an integer that represents an index into an array (more on that in a second).
def map_user_name(user_name: str) -> int:
    match user_name:
        case "James":
            return 0
        case "Emily":
            return 1
        case "Wei":
            return 2
        case "Amina":
            return 3
        case "Carlos":
            return 4
Now let's rewrite the `seen` array as an array that holds a boolean value for each of the possible input values.
seen = [False, False, True, False, True]
Now, instead of having to iterate over all the items in the array until we find (or don't find!) the element in question, we can simply compute the numerical value of that input and check that index in the array directly.
For example, say we want to check whether a name has been seen before. The lookup would be much faster. This is how it would work:
Is the name "Emily" seen before?
- We call `map_user_name("Emily")` and get 1.
- We look up `seen[1]`, which returns False.
That means we haven't seen the user "Emily" yet.
Is the name "Carlos" seen before?
- We call `map_user_name("Carlos")` and get 4.
- We look up `seen[4]`, which returns True.
That means we actually have seen the user "Carlos" before.
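Putting the two pieces together, here's a minimal sketch of the whole lookup, reusing the `seen` array and `map_user_name` from above (the helper name `is_seen` is mine):

```python
def is_seen(user_name: str) -> bool:
    # One function call plus one array index: constant time.
    return seen[map_user_name(user_name)]

print(is_seen("Emily"))   # False -- seen[1]
print(is_seen("Carlos"))  # True  -- seen[4]
```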
That lookup is a constant-time operation!
Pretty wild, right?
We basically did some initial work, repurposed our data structure, and with some smart setup we avoided scanning the entire array; we just check an array index. Since indexing is a constant-time operation, the lookup takes the same time whether the array holds 3 elements or 3,000!
Limitations
The approach we just described doesn't scale well:
- The function that maps an input value to an array index is pretty rigid; we need to open it up and edit it whenever the array size or the returned indices change.
- We don't know the possible inputs beforehand in the first place! So creating an array of all possible inputs, with a mapping function that hardcodes every index, is completely impractical.
- Adding new values or deleting old ones is also pretty problematic with this approach. How would you handle a really large space of possible inputs? If you delete a value, what would happen?
Luckily, there are solutions to those concerns, but before that, let's talk about pigeons 🐦
Chapter 2 – About pigeons and holes
Let’s allow ourselves to get distracted for a second and talk about pigeonholes. In computer science, the pigeonhole principle usually refers to the fact that if you are trying to map a larger set of items into a smaller set of categories (or “pigeonholes”), then at least one category must contain more than one item.
Bear with me 🐻
When writing a computer program, you'll almost certainly need to look up an item in a group (or a set) of other items at some point.
As you've noticed by now, computational resources are not infinite, and as the program's input grows, you want the program to keep behaving efficiently in terms of both time and space.
So what if the list of inputs grows from 10 items to 100? To 1,000? To 1,000,000?
Is there a way for the time your program needs to handle 1,000 items to be more or less the same as for 100? Can you take on 10 bazillion new input elements to check against without the runtime getting 10 bazillion times slower, and without the `seen` set getting 10 bazillion times bigger in memory?
Here's where the pigeonhole principle becomes relevant.
Here's the reasoning: given that we have a limited number of memory slots (holes) and we're trying to place many inputs (pigeons), we'll end up with some inputs in the same slot.
So far so good? We've convinced ourselves that we can use our program to represent the whole universe of inputs, buuuttt how is that relevant to lookups and membership testing?
Don’t worry about that for now, we’ll get back to that, but before that …
Chapter 3 – Hashing: When Data Plays Musical Chairs! 🎶🪑
Remember when we said there are limitations to the way we keep items in the array and to how we get an array index for each possible input value?
Worry not! Hans Peter Luhn has got your back!
Back in the 50s he proposed two big related ideas:
- The idea of a hash function.
- The idea of hash tables.
Hash Tables
A hash table is a data structure that stores key-value pairs. It uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found. The brilliant idea here is that we don't actually need to create an array of all possible inputs; we can create a much smaller array and use it intelligently to deal with a much larger number of inputs.
Does that remind you of anything we talked about earlier in this article? Exactly! The pigeonhole principle!
We are mapping from a huge group of possibilities to a much, much smaller group of manageable items. In that case, we know that some elements will end up in the same slot!
Isn't that pretty bad? In our membership example, if I chose a `seen` array of size 3 instead of 5, I could end up with the values "Emily" and "Wei" in the same slot! Wouldn't inserting one of them overwrite the other?
This situation is called a collision 💥
How do we deal with collisions? We'll get back to that topic, but before that, let's talk about hash functions.
Hash Functions
A hash function is a generalized and flexible form of the `map_user_name` function we talked about before. In the context of hash tables, it is a mathematical function that takes an input (or 'key') and returns a fixed-size value. The output, typically called a 'hash code', is used to index into a hash table.
Key properties of a good hash function include:
- Deterministic: The same input always results in the same hash.
- Uniform Distribution: It should spread keys uniformly across the hash table to minimize collisions.
- Fast Computation: It should be quick to compute.
- Minimizes Collisions: Different inputs should hash to different values as much as possible.
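You can poke at the first property with Python's built-in `hash()` (a quick sketch; note that string hashes are randomized per interpreter run, so the exact values will differ between runs):

```python
# Deterministic: the same input gives the same hash within one run.
print(hash("James") == hash("James"))  # True

# A hash code is reduced to a valid slot index, e.g. with a modulo.
table_size = 8
print(hash("James") % table_size)  # some slot in 0..7
print(hash("Emily") % table_size)  # usually a different slot
```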
You can have a look yourself at a couple of well-known, so-called cryptographic hash functions by visiting the website https://www.md5hashgenerator.com/
The ones used for hash maps are normally simpler functions designed for speed. There are different choices of hash functions depending on what exactly you're trying to achieve. I will not include any mathematical references here to avoid scaring people off, but you can look them up next Halloween 🎃
Ideally, the hash function will assign each key to a unique bucket, but most hash function designs cause hash collisions since the size of the array is limited (as we've mentioned).
Collisions
A hash collision occurs when two different input values to a hash function produce the same hash value or output. This is the pigeonhole situation, and like pigeonholes, the simplest and earliest collision management strategy was to place all the values corresponding to the same hash key in the same slot. This strategy is called chaining ⛓️.
Chaining ⛓️
Chaining is a method used in hash tables to resolve collisions, where each slot of the table contains a linked list (or chain) of entries that have hashed to that index. When inserting a new key-value pair, if the hash function maps the key to an already occupied slot, the pair is added to the end of the list in that slot. When searching for or deleting an entry, the list in the relevant slot is traversed until the desired entry is found or the end of the list is reached. This method allows multiple entries to exist at the same hash table index, effectively managing collisions.
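Here's a minimal, illustrative sketch of chaining for a set-like table that stores only keys (the class and method names are mine, not a real library API):

```python
class ChainedHashSet:
    def __init__(self, size: int = 8):
        # One bucket per slot; each bucket is the "chain" for that index.
        self.buckets = [[] for _ in range(size)]

    def _index(self, key) -> int:
        return hash(key) % len(self.buckets)

    def add(self, key) -> None:
        bucket = self.buckets[self._index(key)]
        if key not in bucket:  # on collision, just append to the chain
            bucket.append(key)

    def __contains__(self, key) -> bool:
        # Traverse the chain in the relevant slot.
        return key in self.buckets[self._index(key)]

s = ChainedHashSet(size=3)  # tiny on purpose, to force collisions
for name in ["James", "Emily", "Wei", "Amina", "Carlos"]:
    s.add(name)
print("Wei" in s)       # True
print("Svetlana" in s)  # False
```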
As you can imagine, this is a pretty straightforward strategy, but it has an obvious pitfall: in the worst-case scenario (probably theoretical rather than practical), all (or most!) of the values map to the same slot! That effectively destroys the lookup speed and scalability, because we've sneakily started using a linked list instead of an array!
Open Addressing
Open addressing is a collision resolution method in hash tables where all elements are stored directly in the hash table array itself. When a collision occurs (i.e., two keys hash to the same index), open addressing seeks an alternative empty slot within the same table for the new element. This is done through various probing methods, such as linear probing, quadratic probing, or double hashing.
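A minimal sketch of insertion with linear probing (illustrative only; the function name is mine, and real implementations also handle deletions with tombstones, which this ignores):

```python
def probe_insert(table: list, key) -> int:
    """Insert key into a fixed-size open-addressed table; return its slot."""
    n = len(table)
    start = hash(key) % n
    for step in range(n):
        slot = (start + step) % n  # linear probing: try the next slot over
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            return slot
    raise RuntimeError("table is full")  # a real table would resize instead

table = [None] * 8
probe_insert(table, "Emily")
probe_insert(table, "Wei")
print(table)
```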
Chapter 4 – Python's Traffic Control ✈️🚦
Python has two data structures that use hashing/hash tables under the hood: `dicts` and `sets`. Sets represent an idea similar to a mathematical set (with operations such as union, intersection, difference, etc.), while dicts (dictionaries) store key-value pairs. Each key in a dictionary is unique and is used to store and retrieve the corresponding value.
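In day-to-day code they look like this (the names and numbers are just illustrative):

```python
seen = {"James", "Emily"}          # a set
seen.add("Wei")                    # insertion
print("Emily" in seen)             # membership test -> True
print(seen & {"Wei", "Amina"})     # intersection -> {'Wei'}

ages = {"James": 31, "Emily": 28}  # a dict: unique keys mapped to values
print(ages["Emily"])               # retrieval by key -> 28
```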
Let’s focus on sets because they are normally what you’d be using if you’re doing membership tests.
Collision Management
Python elegantly combines two open addressing strategies to deal with collisions. If you take a deep breath and head to the CPython set implementation, you’ll find the following in the documentation:
To improve cache locality, each probe inspects a series of consecutive nearby entries before moving on to probes elsewhere in memory. This leaves us with a hybrid of linear probing and randomized probing. The linear probing reduces the cost of hash collisions because consecutive memory accesses tend to be much cheaper than scattered probes. After LINEAR_PROBES steps, we then use more of the upper bits from the hash value and apply a simple linear congruential random number generator. This helps break-up long chains of collisions.
https://github.com/python/cpython/blob/main/Objects/setobject.c
Let’s try to translate that:
For lookup/insertion operations, Python locates the slot for a given key as follows.
- It calculates a so-called "initial probe index": the starting position in the hash table where the algorithm first tries to insert or locate an item. This index is crucial to the functioning of the hash table, as it determines where the search or insertion process begins. It is computed in two steps:
- Hash function: the item (or key) is first passed through a hash function, which computes a hash value based on the content of the item. The purpose of the hash function is to distribute items uniformly across the hash table.
- Modulo operation: the computed hash value is then used in a modulo operation with the size of the hash table. Specifically, the hash value is divided by the table size, and the remainder of this division is the initial probe index.
In simpler words:
hash_function(key) --> integer --> integer % array_size --> initial probe index

This initial index determines the first slot in the hash table that the algorithm will check or use. If the slot is empty (in the case of insertion) or contains the desired item (in the case of lookup), the operation completes. However, if there's a collision (i.e., the slot is occupied by a different item), the algorithm proceeds to use a probing sequence to find the next slot to check or use.
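In Python terms, roughly (a sketch; CPython table sizes are powers of two, so the modulo can be done with a bitwise AND against a mask):

```python
key = "Emily"
table_size = 8              # CPython keeps table sizes at powers of two
mask = table_size - 1

h = hash(key)                    # step 1: hash function
initial_probe_index = h & mask   # step 2: h % table_size, done with a mask
print(initial_probe_index)       # the first slot to inspect
```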
Now here's where it gets really interesting, buckle up!
If there's already an item in the mapped slot, Python (at least CPython) does the following:
- Linear Probing: The algorithm tries first to employ linear probing, where it inspects a series of consecutive nearby entries. This improves cache locality, as consecutive memory accesses are typically faster than scattered ones. This step helps reduce the cost of hash collisions.
- Transition to Randomized Probing: After a certain number of linear probes (9 times in the current implementation), the algorithm switches to a form of randomized probing. This method doesn’t just move to the next slot in order, but jumps around in a way that appears random. This helps spread out the items more evenly and reduces the chance of creating clusters of items. This step is designed to break up long chains of collisions, improving the performance in cases of many collisions.
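Here's a rough Python rendering of that probing pattern (a simplification, not a faithful port of the C code; for instance, CPython only runs the linear probes when they fit before the end of the table, while this sketch just wraps around; the constant names mirror `setobject.c`):

```python
LINEAR_PROBES = 9   # linear probes before jumping elsewhere
PERTURB_SHIFT = 5   # how fast upper hash bits are folded in

def probe_sequence(h: int, mask: int):
    """Yield candidate slots for hash h in a table of size mask + 1."""
    perturb = h
    i = h & mask
    while True:
        yield i
        for j in range(1, LINEAR_PROBES + 1):
            yield (i + j) & mask          # cheap, cache-friendly neighbors
        perturb >>= PERTURB_SHIFT         # bring in upper bits of the hash
        i = (i * 5 + 1 + perturb) & mask  # pseudo-random jump elsewhere

gen = probe_sequence(hash("Emily"), mask=31)
print([next(gen) for _ in range(12)])  # the first dozen slots to try
```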
Table Doubling
You might be wondering: since we're mapping an unknown number of inputs to a fixed space in memory, and we keep trying new indices in the array, won't we run out of slots at some point?
The answer is: indeed!
Python utilizes a technique called “Table Doubling” to dynamically resize the underlying array ! This technique is crucial for balancing between efficient space usage and providing room for future growth. Here’s how it typically works in Python for insertions:
- Initial Size: When a dynamic array (or a hash table) is created, it starts with an initial size.
- Adding Elements: As you add more elements to the structure, it begins to fill up. Python keeps track of the number of elements and the current capacity of the array or table.
- Load Factor: When the array or table reaches a certain fill threshold (for example, it becomes 2/3 or 3/4 full), it triggers a resizing operation. This threshold is set to ensure a good balance between performance (by avoiding too frequent resizes) and space efficiency (by not wasting too much space).
- Doubling the Size: When resizing happens, the size of the array or table is typically doubled. This is where the term “table doubling” comes from. By doubling the size, the data structure makes room for new elements while keeping the number of resizing operations relatively low.
- Copying Elements: During the resizing, existing elements are copied from the old array or table to the new, larger one. This step can be computationally expensive, but because it doesn’t happen very often (thanks to the doubling strategy), the overall performance remains efficient.
and for deletions:
- Deletions Reduce Load Factor: When items are deleted from a hash table, the number of occupied slots decreases, which in turn lowers the load factor (the ratio of the number of occupied slots to the total number of slots).
- Shrinking the Table: If enough items are deleted and the load factor falls below a certain threshold, the table might be resized to a smaller size. This threshold is generally lower than the one used for expanding the table. For example, while a hash table might grow when it’s 75% full, it might shrink when it drops below 25% full.
- Resizing Downwards: When the table is resized to a smaller size, a process similar to the expansion resizing occurs. The remaining items in the table are rehashed and placed into a new, smaller table. This helps to conserve memory when a large portion of the table is empty.
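To tie the insertion side together, here's a toy sketch of the grow-on-threshold behavior (the class name, the 2/3 threshold, and the use of plain linear probing are illustrative choices, not CPython's exact internals):

```python
class ToyHashSet:
    def __init__(self):
        self.size = 8
        self.used = 0
        self.slots = [None] * self.size

    def _place(self, key):
        i = hash(key) % self.size
        while self.slots[i] is not None and self.slots[i] != key:
            i = (i + 1) % self.size  # simple linear probing
        if self.slots[i] is None:
            self.slots[i] = key
            self.used += 1

    def _resize(self):
        old_keys = [k for k in self.slots if k is not None]
        self.size *= 2               # table doubling
        self.slots = [None] * self.size
        self.used = 0
        for k in old_keys:           # rehash every surviving key
            self._place(k)

    def add(self, key):
        if (self.used + 1) * 3 >= self.size * 2:  # resize at ~2/3 full
            self._resize()
        self._place(key)

s = ToyHashSet()
for name in ["James", "Emily", "Wei", "Amina", "Carlos", "Svetlana"]:
    s.add(name)
print(s.size)  # 16: doubled once the ~2/3 threshold was crossed
```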
Runtime
The efficiency of table doubling is often analyzed using a concept called “amortized analysis.” While individual insertions might be slow (especially when resizing occurs), when averaged over a large number of operations, the time per operation is still constant, or O(1), on average! This makes table doubling a very effective technique for managing lists and hash tables.
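A quick back-of-the-envelope version of that analysis: inserting n items one by one triggers resize copies of sizes 1, 2, 4, 8, … up to at most n, and that geometric series sums to less than 2n. So n insertions cost O(n) work in total, which is O(1) per insertion on average.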
In Python’s lists, dictionaries and sets, this table doubling concept ensures that as the number of items grows, the data structure can adapt its size efficiently, maintaining quick access times and reasonable space usage. This automatic resizing is a key feature that makes these structures so flexible and easy to use in Python.
Conclusion
In this article we covered the data structures and algorithms behind Python dicts and sets: hashing, hash functions, hash tables, collisions, probing, and table doubling.
By now, I hope you have a better understanding of why you should almost always choose a set over a list when performing membership tests in your code (especially for a large collection of items).
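As a parting sketch, you can verify the difference yourself with the standard `timeit` module (exact numbers depend on your machine):

```python
import timeit

n = 1_000_000
as_list = list(range(n))
as_set = set(as_list)

# Worst case: the item is absent, so the list must scan all n elements,
# while the set does a constant-time hash lookup.
print(timeit.timeit(lambda: -1 in as_list, number=10))  # seconds, O(n) each
print(timeit.timeit(lambda: -1 in as_set, number=10))   # seconds, O(1) each
```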