练习46：三叉搜索树 · 笨办法学C

# 练习46：三叉搜索树 > 原文：[Exercise 46: Ternary Search Tree](http://c.learncodethehardway.org/book/ex46.html) > 译者：[飞龙](https://github.com/wizardforcel) 我打算向你介绍的最后一种数据结构就是三叉搜索树（`TSTree`），它和`BSTree`很像，除了它有三个分支，`low`、`equal`和`high`。它的用法和`BStree`以及`Hashmap`基本相同，用于储存键值对的数据，但是它通过键中的独立字符来控制。这使得`TSTree`具有一些`BStree`和`Hashmap`不具备的功能。 `TSTree`的工作方式是，每个键都是字符串，根据字符串中字符的等性，通过构建或者遍历一棵树来进行插入。首先由根节点开始，观察每个节点的字符，如果小于、等于或大于则去往相应的方向。你可以参考这个头文件： ```c #ifndef _lcthw_TSTree_h #define _lctwh_TSTree_h #include <stdlib.h> #include <lcthw/darray.h> typedef struct TSTree { char splitchar; struct TSTree *low; struct TSTree *equal; struct TSTree *high; void *value; } TSTree; void *TSTree_search(TSTree *root, const char *key, size_t len); void *TSTree_search_prefix(TSTree *root, const char *key, size_t len); typedef void (*TSTree_traverse_cb)(void *value, void *data); TSTree *TSTree_insert(TSTree *node, const char *key, size_t len, void *value); void TSTree_traverse(TSTree *node, TSTree_traverse_cb cb, void *data); void TSTree_destroy(TSTree *root); #endif ``` `TSTree`拥有下列成员： splitchar 树中该节点的字符。 low 小于`splitchar`的分支。 equal 等于`splitchar`的分支。 high 大于`splitchar`的分支。 value 这个节点上符合当前`splitchar`的值的集合。你可以看到这个实现中含有下列操作： search 为特定`key`寻找值的典型操作。 search_prefix 寻找第一个以`key`为前缀的值，这是你不能轻易使用`BSTree` 或 `Hashmap` 完成的操作。 insert 将`key`根据每个字符拆分，并把它插入到树中。 traverse 遍历整颗树，使你能够收集或分析所包含的所有键和值。唯一缺少的操作就是`TSTree_delete`，这是因为它是一个开销很大的操作，比`BSTree_delete`大得多。当我使用`TSTree`结构时，我将它们视为常量数据，我打算遍历许多次，但是永远不会移除任何东西。它们对于这样的操作会很快，但是不适于需要快速插入或删除的情况。为此我会使用`Hashmap`因为它优于`BSTree`和`TSTree`。 `TSTree`的实现非常简单，但是第一次可能难以理解。我会在你读完之后拆分它。 ```c #include <stdlib.h> #include <stdio.h> #include <assert.h> #include <lcthw/dbg.h> #include <lcthw/tstree.h> static inline TSTree *TSTree_insert_base(TSTree *root, TSTree *node, const char *key, size_t len, void *value) { if(node == NULL) { node = (TSTree *) calloc(1, sizeof(TSTree)); if(root == NULL) { root = node; } node->splitchar = *key; } if(*key < node->splitchar) { node->low = TSTree_insert_base(root, node->low, key, len, value); } else if(*key == node->splitchar) { if(len > 1) { node->equal = TSTree_insert_base(root, node->equal, key+1, len - 1, value); } else { assert(node->value == NULL && "Duplicate insert into tst."); node->value = value; } } else { node->high = TSTree_insert_base(root, node->high, key, len, value); } return node; } TSTree *TSTree_insert(TSTree *node, const char *key, size_t len, void *value) { return TSTree_insert_base(node, node, key, len, value); } void *TSTree_search(TSTree *root, const char *key, size_t len) { TSTree *node = root; size_t i = 0; while(i < len && node) { if(key[i] < node->splitchar) { node = node->low; } else if(key[i] == node->splitchar) { i++; if(i < len) node = node->equal; } else { node = node->high; } } if(node) { return node->value; } else { return NULL; } } void *TSTree_search_prefix(TSTree *root, const char *key, size_t len) { if(len == 0) return NULL; TSTree *node = root; TSTree *last = NULL; size_t i = 0; while(i < len && node) { if(key[i] < node->splitchar) { node = node->low; } else if(key[i] == node->splitchar) { i++; if(i < len) { if(node->value) last = node; node = node->equal; } } else { node = node->high; } } node = node ? node : last; // traverse until we find the first value in the equal chain // this is then the first node with this prefix while(node && !node->value) { node = node->equal; } return node ? node->value : NULL; } void TSTree_traverse(TSTree *node, TSTree_traverse_cb cb, void *data) { if(!node) return; if(node->low) TSTree_traverse(node->low, cb, data); if(node->equal) { TSTree_traverse(node->equal, cb, data); } if(node->high) TSTree_traverse(node->high, cb, data); if(node->value) cb(node->value, data); } void TSTree_destroy(TSTree *node) { if(node == NULL) return; if(node->low) TSTree_destroy(node->low); if(node->equal) { TSTree_destroy(node->equal); } if(node->high) TSTree_destroy(node->high); free(node); } ``` 对于`TSTree_insert`，我使用了相同模式的递归结构，其中我创建了一个小型函数，它调用真正的递归函数。我对此并不做任何检查，但是你应该为之添加通常的防御性编程策略。要记住的一件事，就是它使用了一些不同的设计，这里并没有单独的`TSTree_create`函数，如果你将`node`传入为`NULL`，它会新建一个，然后返回最终的值。这意味着我需要为你分解`TSTree_insert_base`，使你理解插入操作。 tstree.c:10-18 像我提到的那样，如果函数接收到`NULL`，我需要创建节点，并且将`*key`（当前字符）赋值给它。这用于当我插入键时来构建树。 tstree.c:20-21 当`*key`小于`splitchar`时，选择`low`分支。 tstree.c:22 如果`splitchar`相等，我就要进一步确定等性。这会在我刚刚创建这个节点时发生，所以这里我会构建这棵树。 tstree.c:23-24 仍然有字符串需要处理，所以向下递归`equal`分支，并且移动到下一个`*key`字符。 tstree.c:26-27 这是最后一个字符的情况，所以我将值设置好。我编写了一个`assert`来避免重复。 tstree.c:29-30 最后的情况是`*key`大于`splitchar`，所以我需要向下递归`high`分支。这个数据结构的`key`实际上带有一些特性，我只会在`splitchar`相等时递增所要分析的字符。其它两种情况我只会继续遍历整个树，直到碰到了相等的字符，我才会递归处理下一个字符。这一操作使它对于找不到键的情况是非常快的。我可以传入一个不存在的键，简单地遍历一些`high`和`low`节点，直到我碰到了末尾并且知道这个键不存在。我并不需要处理键的每个字符，或者树的每个节点。一旦你理解了这些，之后来分析`TSTree_search`如何工作： tstree.c:46 我并不需要递归处理整棵树，只需要使用`while`循环和当前的`node`节点。 tstree.c:47-48 如果当前字符小于节点中的`splitchar`，则选择`low`分支。 tstree.c:49-51 如果相等，自增`i`并且选择`equal`分支，只要不是最后一个字符。这就是`if(i < len)`所做的，使我不会越过最后的`value`。 tstree.c:52-53 否则我会选择`high`分支，由于当前字符更大。 tstree.c:57-61 循环结束后如果`node`不为空，那么返回它的`value`，否则返回`NULL`。这并不难以理解，并且你可以看到`TSTree_search_prefix`函数用了几乎相同的算法。唯一的不同就是我并不试着寻找精确的匹配，而是可找到的最长前缀。我在相等时跟踪`last`节点来实现它，并且在搜索循环结束之后，遍历这个节点直到发现`value`。观察`TSTree_search_prefix`，你就会开始明白`TSTree`相对`BSTree` 和 `Hashmap`在查找操作上的另一个优点。给定一个长度为X的键，你可以在X步内找到任何键，但是也可以在X步加上额外的N步内找到第一个前缀，取决于匹配的键有多长。如果树中最长的键是十个字符，那么你就可以在10步之内找到任意的前缀。更重要的是，你可以通过对键的每个字符只比较一次来实现。相比之下，使用`BSTree`执行相同操作，你需要在`BSTree`的每一个可能匹配的节点中检查两个字符串是否有共同的前缀。这对于寻找键，或者检查键是否存在（`TSTree_search`）是相同的。你需要将每个字符与`BSTree`中的大多数字符对比，来确认是否匹配。 `Hashamp`对于寻找前缀更加糟糕，因为你不能够仅仅计算前缀的哈希值。你基本上不能高效在`Hashmap`中实现它，除非数据类似URL可以被解析。即使这样你还是需要遍历`Hashmap`的所有节点。 > 译者注：二叉树和三叉树在搜索时都是走其中的一支，但由于二叉树中每个节点储存字符串，而三叉树储存的是字符。所以三叉树的整个搜索过程相当于一次字符串比较，而二叉树的每个节点都需要一次字符串比较。三叉树堆叠储存字符串使搜索起来更方便。 > 至于哈希表，由于字符串整体和前缀计算出来的哈希值差别很大，所以按前缀搜索时，哈希的优势完全失效，所以只能改为暴力搜索，效果比二叉树还要差。最后的两个函数应该易于分析，因为它们是典型的遍历和销毁操作，你已经在其它数据结构中看到过了。最后，我编写了简单的单元测试，来确保我所做的全部东西正确。 ```c #include "minunit.h" #include <lcthw/tstree.h> #include <string.h> #include <assert.h> #include <lcthw/bstrlib.h> TSTree *node = NULL; char *valueA = "VALUEA"; char *valueB = "VALUEB"; char *value2 = "VALUE2"; char *value4 = "VALUE4"; char *reverse = "VALUER"; int traverse_count = 0; struct tagbstring test1 = bsStatic("TEST"); struct tagbstring test2 = bsStatic("TEST2"); struct tagbstring test3 = bsStatic("TSET"); struct tagbstring test4 = bsStatic("T"); char *test_insert() { node = TSTree_insert(node, bdata(&test1), blength(&test1), valueA); mu_assert(node != NULL, "Failed to insert into tst."); node = TSTree_insert(node, bdata(&test2), blength(&test2), value2); mu_assert(node != NULL, "Failed to insert into tst with second name."); node = TSTree_insert(node, bdata(&test3), blength(&test3), reverse); mu_assert(node != NULL, "Failed to insert into tst with reverse name."); node = TSTree_insert(node, bdata(&test4), blength(&test4), value4); mu_assert(node != NULL, "Failed to insert into tst with second name."); return NULL; } char *test_search_exact() { // tst returns the last one inserted void *res = TSTree_search(node, bdata(&test1), blength(&test1)); mu_assert(res == valueA, "Got the wrong value back, should get A not B."); // tst does not find if not exact res = TSTree_search(node, "TESTNO", strlen("TESTNO")); mu_assert(res == NULL, "Should not find anything."); return NULL; } char *test_search_prefix() { void *res = TSTree_search_prefix(node, bdata(&test1), blength(&test1)); debug("result: %p, expected: %p", res, valueA); mu_assert(res == valueA, "Got wrong valueA by prefix."); res = TSTree_search_prefix(node, bdata(&test1), 1); debug("result: %p, expected: %p", res, valueA); mu_assert(res == value4, "Got wrong value4 for prefix of 1."); res = TSTree_search_prefix(node, "TE", strlen("TE")); mu_assert(res != NULL, "Should find for short prefix."); res = TSTree_search_prefix(node, "TE--", strlen("TE--")); mu_assert(res != NULL, "Should find for partial prefix."); return NULL; } void TSTree_traverse_test_cb(void *value, void *data) { assert(value != NULL && "Should not get NULL value."); assert(data == valueA && "Expecting valueA as the data."); traverse_count++; } char *test_traverse() { traverse_count = 0; TSTree_traverse(node, TSTree_traverse_test_cb, valueA); debug("traverse count is: %d", traverse_count); mu_assert(traverse_count == 4, "Didn't find 4 keys."); return NULL; } char *test_destroy() { TSTree_destroy(node); return NULL; } char * all_tests() { mu_suite_start(); mu_run_test(test_insert); mu_run_test(test_search_exact); mu_run_test(test_search_prefix); mu_run_test(test_traverse); mu_run_test(test_destroy); return NULL; } RUN_TESTS(all_tests); ``` ## 优点和缺点 `TSTree`可以用于实现一些其它实用的事情： + 除了寻找前缀，你可以反转插入的所有键，之后通过后缀来寻找。我使用它来寻找主机名称，因为我想要找到`*.learncodethehardway.com`，所以如果我反向来寻找，会更快匹配到它们。 + 你可以执行“模糊”搜索，其中你可以收集所有与键的大多数字符相似的节点，或者使用其它算法用于搜索近似的匹配。 + 你可以寻找所有中间带有特定部分的键。我已经谈论了`TSTree`能做的一些事情，但是它们并不总是最好的数据结构。`TSTree`的缺点在于： + 像我提到过的那样，删除操作非常麻烦。它们适用于需要快速检索并且从不移除的操作。如果你需要删除，可以简单地将`value`置空，之后当树过大时周期性重构它。 + 与`BSTree`和`Hashmap`相比，它在相同的键上使用了大量的空间。它对于键中的每个字符都使用了完整的节点。它对于短的键效果更好，但如果你在`TSTree`中放入一大堆东西，它会变得很大。 + 它们也不适合处理非常长的键，然而“长”是主观的词，所以应当像通常一样先进行测试。如果你尝试储存一万个字符的键，那么应当使用`Hashmap`。 ## 如何改进像通常一样，浏览代码，使用防御性的先决条件、断言，并且检查每个函数来改进。下面是一些其他的改进方案，但是你并不需要全部实现它们： + 你可以使用`DArray`来允许重复的`value`值。 + 因为我提到删除非常困难，但是你可以通过将值设为`NULL`来模拟，使值能够高效被删除。 + 目前还不能获取到所有匹配指定前缀的值，我会让你在附加题中实现它。 + 有一些其他得更复杂的算法会比它要好。查询前缀数组、前缀树和基数树的资料。 ## 附加题 + 实现`TSTree_collect`返回`DArray`包含所有匹配指定前缀的键。 + 实现`TSTree_search_suffix`和`TSTree_insert_suffix`，实现后缀搜索和插入。 + 使用`valgrind`来查看与`BSTree` 和 `Hashmap`相比，这个结构使用了多少内存来储存数据。