Exercise 35: Sorting and Searching · Learn C the Hard Way (Chinese edition)

# Exercise 35: Sorting and Searching

> Original: [Exercise 35: Sorting And Searching](http://c.learncodethehardway.org/book/ex35.html)
>
> Translator: [飞龙](https://github.com/wizardforcel)

In this exercise I'm going to cover four sorting algorithms and one search algorithm. The sorting algorithms are quick sort, heap sort, merge sort, and radix sort. Once you've finished radix sort, I'll show you binary search.

However, I'm a lazy guy, and most C libraries on Unix-like systems already implement heap sort, quick sort, and merge sort, so you can just use them:

```c
#include <lcthw/darray_algos.h>
#include <stdlib.h>

int DArray_qsort(DArray *array, DArray_compare cmp)
{
    qsort(array->contents, DArray_count(array), sizeof(void *), cmp);
    return 0;
}

int DArray_heapsort(DArray *array, DArray_compare cmp)
{
    return heapsort(array->contents, DArray_count(array), sizeof(void *), cmp);
}

int DArray_mergesort(DArray *array, DArray_compare cmp)
{
    return mergesort(array->contents, DArray_count(array), sizeof(void *), cmp);
}
```

That's the whole implementation of the `darray_algos.c` file, and it works on most modern Unix systems. Note that only `qsort` is standard C; `heapsort` and `mergesort` are BSD extensions, so on Linux you will probably need libbsd to get them. Each of these sorts the untyped pointers stored in `contents` using the `DArray_compare` you give it. Let me show you the header file too:

```c
#ifndef darray_algos_h
#define darray_algos_h

#include <lcthw/darray.h>

typedef int (*DArray_compare)(const void *a, const void *b);

int DArray_qsort(DArray *array, DArray_compare cmp);

int DArray_heapsort(DArray *array, DArray_compare cmp);

int DArray_mergesort(DArray *array, DArray_compare cmp);

#endif
```

It's about the same size, which is what you should expect. Next you can see how these three functions are used in the unit test:

```c
#include "minunit.h"
#include <lcthw/darray_algos.h>

int testcmp(char **a, char **b)
{
    return strcmp(*a, *b);
}

DArray *create_words()
{
    DArray *result = DArray_create(0, 5);
    char *words[] = {"asdfasfd", "werwar", "13234", "asdfasfd", "oioj"};
    int i = 0;

    for(i = 0; i < 5; i++) {
        DArray_push(result, words[i]);
    }

    return result;
}

int is_sorted(DArray *array)
{
    int i = 0;

    for(i = 0; i < DArray_count(array) - 1; i++) {
        if(strcmp(DArray_get(array, i), DArray_get(array, i+1)) > 0) {
            return 0;
        }
    }

    return 1;
}

char *run_sort_test(int (*func)(DArray *, DArray_compare), const char *name)
{
    DArray *words = create_words();
    mu_assert(!is_sorted(words), "Words should start not sorted.");

    debug("--- Testing %s sorting algorithm", name);
    int rc = func(words, (DArray_compare)testcmp);
    mu_assert(rc ==
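            0, "sort failed");
    mu_assert(is_sorted(words), "didn't sort it");

    DArray_destroy(words);

    return NULL;
}

char *test_qsort()
{
    return run_sort_test(DArray_qsort, "qsort");
}

char *test_heapsort()
{
    return run_sort_test(DArray_heapsort, "heapsort");
}

char *test_mergesort()
{
    return run_sort_test(DArray_mergesort, "mergesort");
}

char * all_tests()
{
    mu_suite_start();

    mu_run_test(test_qsort);
    mu_run_test(test_heapsort);
    mu_run_test(test_mergesort);

    return NULL;
}

RUN_TESTS(all_tests);
```

The thing to notice is the definition of `testcmp` on line 4, which stumped me for a whole day. You have to use `char **` rather than `char *`, because `qsort` gives you pointers to the pointers stored in the `contents` array. The reason is that `qsort` walks the array and hands your comparison function a pointer to each element. Since what I store in `contents` is pointers, you get pointers to pointers.

With that, you've just implemented three difficult sorting algorithms in about 20 lines each. You could stop here, but part of this book is learning how these algorithms work, so the extra credit involves implementing them yourself.

## Radix Sort and Binary Search

Since you're going to implement quick sort, heap sort, and merge sort on your own, I'm going to show you a popular algorithm called radix sort. Its usefulness is fairly narrow, since it only sorts arrays of integers, and it seems to work like magic. Here I'm going to create a special data structure called `RadixMap` that maps one integer to another.

Here's the header file for the new algorithm, which also contains the data structure:

```c
#ifndef _radixmap_h
#define _radixmap_h

#include <stddef.h>
#include <stdint.h>

typedef union RMElement {
    uint64_t raw;
    struct {
        uint32_t key;
        uint32_t value;
    } data;
} RMElement;

typedef struct RadixMap {
    size_t max;
    size_t end;
    uint32_t counter;
    RMElement *contents;
    RMElement *temp;
} RadixMap;

RadixMap *RadixMap_create(size_t max);

void RadixMap_destroy(RadixMap *map);

void RadixMap_sort(RadixMap *map);

RMElement *RadixMap_find(RadixMap *map, uint32_t key);

int RadixMap_add(RadixMap *map, uint32_t key, uint32_t value);

int RadixMap_delete(RadixMap *map, RMElement *el);

#endif
```

You can see that it has many of the same operations as the `Dynamic Array` or `List` data structures. The difference is that I only deal with fixed-size 32-bit `uint32_t` integers. I'm also going to introduce you to a new C concept called the `union`.

## C Unions

A union is a way to refer to the same piece of memory in several different ways. You define one just like a `struct`, except that every member shares the same region of memory. You can think of a union as a picture of the memory, with all of its differently colored members overlaid on it. Unions are used either to save memory or to convert chunks of memory between formats.

The first use is to implement "variant types": you create a struct with a type "tag", and inside it a union that holds each of the possible types. To convert between formats of memory, you just define the two layouts as members and access whichever one is right.

First let me show you how to build a variant type with C unions:

```c
#include <stdio.h>

typedef enum {
    TYPE_INT,
    TYPE_FLOAT,
    TYPE_STRING,
} VariantType;

struct Variant {
    VariantType type;
    union {
        int as_integer;
        float as_float;
        char *as_string;
    } data;
};

typedef struct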
Variant Variant;

void Variant_print(Variant *var)
{
    /* The tag tells us which member of the union is currently valid. */
    switch(var->type) {
        case TYPE_INT:
            printf("INT: %d\n", var->data.as_integer);
            break;
        case TYPE_FLOAT:
            printf("FLOAT: %f\n", var->data.as_float);
            break;
        case TYPE_STRING:
            printf("STRING: %s\n", var->data.as_string);
            break;
        default:
            printf("UNKNOWN TYPE: %d", var->type);
    }
}

int main(int argc, char *argv[])
{
    Variant a_int = {.type = TYPE_INT, .data.as_integer = 100};
    Variant a_float = {.type = TYPE_FLOAT, .data.as_float = 100.34};
    Variant a_string = {.type = TYPE_STRING, .data.as_string = "YO DUDE!"};

    Variant_print(&a_int);
    Variant_print(&a_float);
    Variant_print(&a_string);

    // here's how you access them
    a_int.data.as_integer = 200;
    a_float.data.as_float = 2.345;
    a_string.data.as_string = "Hi there.";

    Variant_print(&a_int);
    Variant_print(&a_float);
    Variant_print(&a_string);

    return 0;
}
```

You'll find this in many implementations of dynamic languages. The code first defines tagged variants for all of the language's base types, and then there's usually a generic `object` tag for the types you create. The advantage is that a `Variant` only takes up the space of the `VariantType type` tag plus the largest member of the union (and any padding), because C layers the members of `Variant.data` on top of one another. They overlap, and only enough space is reserved to hold the largest one.

In the `radixmap.h` file I created the `RMElement` union to convert chunks of memory between types. Here I want to store a fixed-size `uint64_t` integer for sorting purposes, but I also want two `uint32_t` integers for the `key` and `value` pair of the data. With a union I can access the same memory in the two ways I need.

## The Implementation

Next is the actual implementation of these operations for `RadixMap`:

```c
/*
 * Based on code by Andre Reinald then heavily modified by Zed A. Shaw.
 */

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <lcthw/radixmap.h>
#include <lcthw/dbg.h>

RadixMap *RadixMap_create(size_t max)
{
    RadixMap *map = calloc(1, sizeof(RadixMap));
    check_mem(map);

    map->contents = calloc(max + 1, sizeof(RMElement));
    check_mem(map->contents);

    map->temp = calloc(max + 1, sizeof(RMElement));
    check_mem(map->temp);

    map->max = max;
    map->end = 0;

    return map;
error:
    return NULL;
}

void RadixMap_destroy(RadixMap *map)
{
    if(map) {
        free(map->contents);
        free(map->temp);
        free(map);
    }
}

/* Byte number y of the object that x points to. */
#define ByteOf(x,y) (((uint8_t *)(x))[(y)])

static inline void radix_sort(short offset, uint64_t max, uint64_t *source, uint64_t *dest)
{
    uint64_t count[256] = {0};
    uint64_t *cp = NULL;
    uint64_t *sp = NULL;
    uint64_t *end = NULL;
    uint64_t s = 0;
    uint64_t c = 0;

    // count occurrences of every byte value
    for (sp = source, end = source + max; sp < end; sp++) {
        count[ByteOf(sp, offset)]++;
    }

    // transform count into index by summing elements and storing into same array
    for (s = 0, cp = count, end = count + 256; cp < end; cp++) {
        c = *cp;
        *cp = s;
        s += c;
    }

    // fill dest with the right values in the right place
    for (sp = source, end = source + max; sp < end; sp++) {
        cp = count + ByteOf(sp, offset);
        dest[*cp] = *sp;
        ++(*cp);
    }
}

void RadixMap_sort(RadixMap *map)
{
    uint64_t *source = &map->contents[0].raw;
    uint64_t *temp = &map->temp[0].raw;

    // four passes, one per byte of the 32-bit key, swapping buffers each time
    radix_sort(0, map->end, source, temp);
    radix_sort(1, map->end, temp, source);
    radix_sort(2, map->end, source, temp);
    radix_sort(3, map->end, temp, source);
}

RMElement *RadixMap_find(RadixMap *map, uint32_t to_find)
{
    int low = 0;
    int high = map->end - 1;
    RMElement *data = map->contents;

    while (low <= high) {
        int middle = low + (high - low)/2;
        uint32_t key = data[middle].data.key;

        if (to_find < key) {
            high = middle - 1;
        } else if (to_find > key) {
            low = middle + 1;
        } else {
            return &data[middle];
        }
    }

    return NULL;
}

int RadixMap_add(RadixMap *map, uint32_t key, uint32_t value)
{
    check(key < UINT32_MAX, "Key can't be equal to UINT32_MAX.");

    RMElement element = {.data = {.key = key, .value = value}};
    check(map->end + 1 < map->max, "RadixMap is full.");

    map->contents[map->end++] = element;

    RadixMap_sort(map);

    return 0;

error:
    return -1;
}

int RadixMap_delete(RadixMap *map, RMElement *el)
{
    check(map->end > 0, "There is nothing to delete.");
    check(el != NULL, "Can't delete a NULL element.");

    el->data.key = UINT32_MAX;

    if(map->end > 1) {
        // don't bother resorting a map of 1 length
        RadixMap_sort(map);
    }

    map->end--;

    return 0;
error:
    return -1;
}
```

As usual, type it in and get it working with the unit test, then I'll explain it. Pay special attention to the `radix_sort` function, since the way I've implemented it is very particular.

```c
#include "minunit.h"
#include <lcthw/radixmap.h>
#include <time.h>

static int make_random(RadixMap *map)
{
    size_t i = 0;

    for (i = 0; i < map->max - 1; i++) {
        // shift as unsigned so a large rand() result cannot overflow a signed int
        uint32_t key = (uint32_t)rand() | ((uint32_t)rand() << 16);
        check(RadixMap_add(map, key, i) == 0, "Failed to add key %u.", key);
    }

    return i;

error:
    return 0;
}

static int check_order(RadixMap *map)
{
    RMElement d1, d2;
    unsigned int i = 0;

    // only signal errors if any (should not be)
    for (i = 0; map->end > 0 && i < map->end-1; i++) {
        d1 = map->contents[i];
        d2 = map->contents[i+1];

        if(d1.data.key > d2.data.key) {
            debug("FAIL:i=%u, key: %u, value: %u, equals max? %d\n", i, d1.data.key, d1.data.value,
                    d2.data.key == UINT32_MAX);
            return 0;
        }
    }

    return 1;
}

static int test_search(RadixMap *map)
{
    unsigned i = 0;
    RMElement *d = NULL;
    RMElement *found = NULL;

    for(i = map->end / 2; i < map->end; i++) {
        d = &map->contents[i];
        found = RadixMap_find(map, d->data.key);
        check(found != NULL, "Didn't find %u at %u.", d->data.key, i);
        check(found->data.key == d->data.key, "Got the wrong result: %p:%u looking for %u at %u",
                found, found->data.key, d->data.key, i);
    }

    return 1;
error:
    return 0;
}

// test for big number of elements
static char *test_operations()
{
    size_t N = 200;

    RadixMap *map = RadixMap_create(N);
    mu_assert(map != NULL, "Failed to make the map.");
    mu_assert(make_random(map), "Didn't make a random fake radix map.");

    RadixMap_sort(map);
    mu_assert(check_order(map), "Failed to properly sort the RadixMap.");

    mu_assert(test_search(map), "Failed the search test.");
    mu_assert(check_order(map), "RadixMap didn't stay sorted after search.");

    while(map->end > 0) {
        RMElement *el = RadixMap_find(map, map->contents[map->end / 2].data.key);
        mu_assert(el != NULL, "Should get a result.");

        size_t old_end = map->end;

        mu_assert(RadixMap_delete(map, el) == 0, "Didn't delete it.");
        mu_assert(old_end - 1 == map->end, "Wrong size after delete.");

        // test that the end is now the old value, but uint32 max so it trails off
        mu_assert(check_order(map), "RadixMap didn't stay sorted after delete.");
    }

    RadixMap_destroy(map);

    return NULL;
}

char *all_tests()
{
    mu_suite_start();
    srand(time(NULL));

    mu_run_test(test_operations);

    return NULL;
}

RUN_TESTS(all_tests);
```

I shouldn't have to explain much about the test. It simply simulates putting random integers into the `RadixMap` and makes sure you can reliably get them back out. Not too interesting.

Most of the operations in `radixmap.c` are easy to understand if you read the code. Here's a description of what each basic function does and how it works:

+ `RadixMap_create`: As usual, I allocate the memory needed for the structure defined in `radixmap.h`. I'll use `temp` and `contents` later when I get to `radix_sort`.
+ `RadixMap_destroy`: Again, this just destroys what was created.
+ `radix_sort`: The soul of this data structure. I'll explain what it does in the next section.
+ `RadixMap_sort`: Uses the `radix_sort` function to actually sort `contents`.
+ `RadixMap_find`: Uses the binary search algorithm to find the `key` you give it. I'll explain how it works shortly.
+ `RadixMap_add`:
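  Using the `RadixMap_sort` function, this adds the `key` and `value` at the end and then simply sorts again so that everything is in order. Once the contents are sorted, `RadixMap_find` works properly, because it's a binary search.
+ `RadixMap_delete`: Works like `RadixMap_add`, except that it "deletes" an element by setting its key to the maximum unsigned 32-bit integer, `UINT32_MAX`. This means you can't use that value as a legitimate key, but it makes deleting elements easy. Just set it and sort, and the element moves to the end, where decrementing `end` drops it.

Study the code I've described. That leaves `RadixMap_sort`, `radix_sort`, and `RadixMap_find` to understand.

## RadixMap_find and Binary Search

I'll start with how binary search is implemented. Binary search is a simple algorithm that most people understand intuitively. In fact, you can take a deck of playing cards (or cards with numbers on them) and do it by hand. Here's how the function works, and how binary search works in general:

+ Set an upper and a lower bound based on the size of the array.
+ Get the middle element between the two bounds.
+ If the key is less than this element's value, it must be before it, so move the upper bound to just below the middle (`middle - 1` in the code).
+ If the key is greater than this element's value, it must be after it, so move the lower bound to just above the middle (`middle + 1` in the code).
+ If the key equals this element's value, you've found it and can stop.
+ Keep looping until the upper and lower bounds cross each other. If you fall out of the loop, the key wasn't found.

What you're effectively doing is guessing where the `key` might be by picking the middle value and comparing against it. Because the data is sorted, you know the `key` has to be either before or after it, which cuts the search space in half. You then keep searching until you either find it or the bounds cross and you've exhausted the search space.

## RadixMap_sort and radix_sort

Radix sort is easy to understand if you first simulate it by hand. The algorithm takes advantage of the fact that numbers are written as a sequence of digits, ordered from "least significant" to "most significant". It picks the numbers apart digit by digit and stores them in buckets, and when it has processed every digit, the numbers come out sorted. At first it seems like magic, and looking at the code it sure does, but try doing it by hand.

To work through the algorithm, first write down a set of three-digit numbers in random order, say 223, 912, 275, 100, 633, 120, and 380.

+ Place the numbers into buckets by their ones digit, keeping the order in which you encounter them: `[100, 120, 380], [912], [223, 633], [275]`.
+ Now go through the numbers in each bucket, in order, and sort them by the tens digit: `[100], [912], [120, 223], [633], [275], [380]`.
+ Now each bucket contains numbers sorted by the ones and then the tens digit. I go through them in this order and place them into the final buckets by hundreds digit: `[100, 120], [223, 275], [380], [633], [912]`.
+ At this point every number is sorted by hundreds, tens, and ones, and if I go through each bucket in order I get the final sorted result: `100, 120, 223, 275, 380, 633,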
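912`.

Make sure you repeat this a few times so you understand how it works. It really is a slick algorithm, and best of all it works on numbers of any size, so you can sort fairly large numbers because you only ever handle one digit at a time.

In my case the "digits" are individual 8-bit bytes, so I need 256 buckets to store the distribution of the numbers by byte. I also need a way to store that without using too much space. If you look at `radix_sort`, the first thing I do is build a `count` histogram so I know how often each byte value occurs at the given `offset`.

Once I know the count for each of the 256 byte values, I can use the counts as distribution points into the target array. For example, if there are 10 elements whose byte is 0x00, I know they go in the first 10 slots of the target array. This gives me an index for where each value goes in the target array, which is what the second `for` loop in `radix_sort` computes.

Finally, once I know where they go in the target array, I just go through every element of the `source` array, look at its byte for the current `offset`, and place the value in order into its slot. The `ByteOf` macro helps keep the code clean, since there's a bit of pointer hackery involved, but the end result is that when the last `for` loop finishes, all of the integers have been placed into buckets by their byte.

What's interesting is how I use this in `RadixMap_sort` to sort these 64-bit integers by just their first 32 bits. Remember how I put the key and the value into the `RMElement` union? That means that to sort this array by key, I only need to sort on the first 4 bytes (32 bits / 8 bits per byte) of each integer. Treating bytes 0 through 3 as least to most significant assumes a little-endian CPU.

If you look at `RadixMap_sort`, you'll see that I grab convenient pointers to `contents` and `temp` for the source and target arrays, and then call `radix_sort` four times. On each call I swap the source and target arrays and move on to the next byte. When I'm done, `radix_sort` has finished its job and the final result is in `contents`.

## How to Improve It

This implementation has one big disadvantage: every sort goes over the entire array four times. It does this quickly, but it would be better to limit the amount of sorting based on the size of what actually needs to be sorted. There are two ways to improve this implementation:

+ Use a binary search to find the minimum position for the new element, then sort only from that position to the end. You find the position, put the new element at the end, and sort just the range between them. Most of the time this shrinks the sort range considerably.
+ Keep track of the largest key currently in use, and then sort only enough digits to handle that key. You can also keep track of the smallest number and sort only the bytes needed for the range. To do this, you'll have to start caring about the CPU's integer byte order (endianness).

## Extra Credit

+ Implement quick sort, heap sort, and merge sort, and provide a `#define` that lets others choose between the two (the standard library and your implementation), or create a second set of functions with different names. Use the technique I taught you: read the Wikipedia page for each algorithm, then implement it from the pseudocode.
+ Compare the performance of your implementations with the standard library's.
+ Use these sorting functions to create a `DArray_sort_add` that adds elements to a `DArray` and then sorts the array.
+ Write a `DArray_find` that uses the binary search algorithm from `RadixMap_find` and a `DArray_compare` to find elements in a sorted `DArray`.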
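
As a starting point for the last extra credit item, here is a minimal sketch of the same binary search applied to a sorted array of pointers with a `qsort`-style comparison function. It is only an illustration under some assumptions: the names `ptr_array_find`, `compare_fn`, and `str_ptr_cmp` are hypothetical and not part of the book's library, and it works on a plain `void **` array, where a real `DArray_find` would take a `DArray *` instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same shape as DArray_compare in darray_algos.h (hypothetical name). */
typedef int (*compare_fn)(const void *a, const void *b);

/*
 * Hypothetical helper: binary search over an array of pointers that is
 * already sorted according to cmp. Returns the index of a matching slot,
 * or -1 if nothing matches.
 */
int ptr_array_find(void **contents, int count, const void *target, compare_fn cmp)
{
    assert(contents != NULL || count <= 0);

    int low = 0;
    int high = count - 1;

    while (low <= high) {
        int middle = low + (high - low) / 2;
        /* Like qsort, give the comparator pointers to the slots,
         * so it sees pointers to pointers. */
        int rc = cmp(&target, &contents[middle]);

        if (rc < 0) {
            high = middle - 1;   /* target sorts before the middle slot */
        } else if (rc > 0) {
            low = middle + 1;    /* target sorts after the middle slot */
        } else {
            return middle;
        }
    }

    return -1;
}

/* Comparator for arrays of C strings, equivalent in spirit to testcmp. */
int str_ptr_cmp(const void *a, const void *b)
{
    return strcmp(*(char *const *)a, *(char *const *)b);
}
```

To try it, fill a `void *` array with strings that are already in `strcmp` order and call `ptr_array_find(words, count, "oioj", str_ptr_cmp)`. Passing `&target` rather than `target` keeps the comparator contract identical to the one `qsort` uses, so the same function can serve both sorting and searching.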