dict 对象，高效的关联式容器

Python 中的 dict 对象是一种 关联式容器 对象，用于保存由键 ( key )到值 ( value )的映射关系。借助关联式容器，程序可快速定位到与指定键相关联的值。dict 对象在 Python 程序中使用频率非常高，如果应用不当将严重影响程序的执行效率。

本节，我们先从 dict 对象常用操作入手，回顾它的 基本用法 ；接着结合其他内建容器对象，研究 dict 关键操作的 执行效率 ；最后以 dict 对象 内部结构 收尾，详细讲解其内部 哈希表 的实现要点，以及其中几个关键 性能考量 。相信通过本节学习，读者将对 dict 实现原理了如指掌，这对用好 dict 对象至关重要。

基本用法

我们用一个 dict 对象来保存培训班学员的成绩，先创建一个空对象：

1
2
3


>>> scores = {}
>>> scores
{}

那么，一个什么都不放的 dict 对象需要占用多少内存呢？根据前面章节，我们知道对象头部字段是必不可少的。可我们很惊讶地发现，一个空的 dict 对象居然要占用 248 字节的内存！

1
2
3


>>> import sys
>>> sys.getsizeof(scores)
248

这是为什么呢？后续我们将从 dict 内部的哈希表中寻找答案。现在我们接着回顾 dict 的基本用法。

现在将 jim 的成绩保存保存到 dict 对象中：

1
2
3
4
5


>>> scores['jim'] = 70
>>> scores
{'jim': 70}
>>> sys.getsizeof(scores)
248

数据插入后，我们发现 dict 对象内存使用量保存不变。看来， dict 对象也有一种类似 list 对象的 预分配机制 。

现在，接着存入 lily 、lucy 以及 tom 的成绩。我们发现， dict 还没达到扩容条件，内存大小保存不变：

1
2
3
4
5
6
7
8


>>> scores['lily'] = 75
>>> scores['lucy'] = 80
>>> scores['tom'] = 90
>>> scores['alice'] = 95
>>> scores
{'jim': 70, 'lily': 75, 'lucy': 80, 'tom': 90, 'alice': 95}
>>> sys.getsizeof(scores)
248

借助 dict 对象，我们可以快速检索出某位学员的成绩。例如，获取 tom 的成绩：

1
2


>>> scores['tom']
90

”快速“不是一个精确的形容词，到底多快呢？这里先给出答案，由于 dict 对象底层由哈希表实现，查找操作平均时间复杂度是 $O(1)$ 。当然了，在哈希不均匀的情况下，最坏时间复杂度是 $O(n)$ ，但一般情况下很难发生。

当然了，如果有某位学员(例如 lily)转学了，可通过 pop 方法将其剔除：

1
2
3
4


>>> scores.pop('lily')
75
>>> scores
{'jim': 70, 'lucy': 80, 'tom': 90, 'alice': 95}

哈希表结构决定了 dict 的删除操作也很快，平均时间复杂度也是 $O(1)$ 。实际上， dict 插入、删除、查找的平均时间复杂度都是 $O(1)$ ，最坏时间复杂度是 $O(n)$ 。因此，哈希函数的选择就至关重要，一个好的哈希函数应该将键尽可能均匀地映射到哈希空间中，最大限度地避免 哈希冲突 。

执行效率

我们知道 dict 对象搜索操作时间复杂度为 $O(1)$ ，远远好于 list 对象的 $O(n)$ 。这意味着什么了？为得到一个更准确、直观的感受，我们编写一个测试程序，分别测试不同规模 dict 、 list 对象完成 1000 次搜索所需的时间：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80


import random
import time

# 随机数生成器
randint = lambda: random.randint(-2**30+1, 2**30-1)

def count_targets(items, targets):
    '''
    计算目标对象出现个数
        items: 待搜索容器
        targets: 待搜索目标元素列表
    '''

    found = 0

    for target in targets:
        if target in items:
            found += 1

    return found

def generate_random_dict(n):
    '''
    生成随机数字典
    '''

    dict_items = {}
    while len(dict_items) < n:
        dict_items[randint()] = True

    return dict_items

def generate_random_list(n):
    '''
    生成随机数列表
    '''

    return [
        randint()
        for _ in range(0, n)
    ]

def test_for_scale(scale, targets):
    '''
    执行一个样例
        scale: 测试容器规模
        targets: 待搜索元素列表
    '''

    # 生成指定规模的随机数容器
    dict_items = generate_random_dict(scale)
    list_items = generate_random_list(scale)

    # 测试dict搜索所需时间
    start_ts = time.time()
    count_targets(dict_items, targets)
    dict_time = time.time() - start_ts

    # 测试list搜索所需时间
    start_ts = time.time()
    count_targets(list_items, targets)
    list_time = time.time() - start_ts

    # 打印结果
    print('Scale:', scale)
    print('Dict:', dict_time)
    print('List:', list_time)
    print()

def main():
    # 每次测试搜索1000次
    # 生成1000个随机数作为搜索目标
    targets = generate_random_list(1000)

    # 以不同规模运行测试函数
    for scale in [1000, 10000, 100000, 1000000]:
        test_for_scale(scale, targets)

if __name__ == '__main__':
    main()

测试程序代码逻辑并不复杂，请结合注释阅读理解，这里不再赘述。测试程序执行后，输出内容大致如下：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


Scale: 1000
Dict: 0.00012683868408203125
List: 0.03683590888977051

Scale: 10000
Dict: 0.00017213821411132812
List: 0.3484950065612793

Scale: 100000
Dict: 0.00021696090698242188
List: 3.6795387268066406

Scale: 1000000
Dict: 0.0003829002380371094
List: 48.04447102546692

我们将测试结果制作表格， dict 和 list 的表现一目了然：

容器规模	增长系数	dict消耗时间	dict增长系数	list消耗时间	list增长系数
1000	1	0.000129s	1	0.036s	1
10000	10	0.000172s	1.33	0.348s	9.67
100000	100	0.000216s	1.67	3.679s	102.19
1000000	1000	0.000382s	2.96	48.044s	1335.56

从表格中，我们看到：当容器规模增长 1000 倍， dict 搜索时间几乎保持不变，但 list 搜索时间增长了差不多 1000 倍。当规模达到 10 万时，1000 次 list 搜索花了接近一分钟时间，而 dict 只需 382 微秒！dict 对象完成一次搜索只需 0.382 微秒，也就是说一秒钟可以完成 200 多万次搜索！

dict 对象到底用了什么黑科技呢？接下来，我们一起从它的内部结构中寻找答案。点击 阅读原文，获取更多细节！

【Python源码剖析】系列文章首发于公众号【小菜学编程】，敬请关注：