10-散列 · guava

# 10-散列 # 10-散列 [原文链接](http://code.google.com/p/guava-libraries/wiki/HashingExplained) [译文链接](http://ifeve.com/google-guava-hashing) **译者：**沈义扬 ## 概述 Java内建的散列码\[hash code\]概念被限制为32位，并且没有分离散列算法和它们所作用的数据，因此很难用备选算法进行替换。此外，使用Java内建方法实现的散列码通常是劣质的，部分是因为它们最终都依赖于JDK类中已有的劣质散列码。 Object.hashCode往往很快，但是在预防碰撞上却很弱，也没有对分散性的预期。这使得它们很适合在散列表中运用，因为额外碰撞只会带来轻微的性能损失，同时差劲的分散性也可以容易地通过再散列来纠正（Java中所有合理的散列表都用了再散列方法）。然而，在简单散列表以外的散列运用中，Object.hashCode几乎总是达不到要求——因此，有了[com.google.common.hash](http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/hash/package-summary.html)包。 ## 散列包的组成在这个包的Java doc中，我们可以看到很多不同的类，但是文档中没有明显地表明它们是怎样一起配合工作的。在介绍散列包中的类之前，让我们先来看下面这段代码范例： ``` <pre class="calibre11">``` HashFunction hf = Hashing.md5(); HashCode hc = hf.newHasher() .putLong(id) .putString(name, Charsets.UTF_8) .putObject(person, personFunnel) .hash(); ``` ``` ### HashFunction [HashFunction](http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/hash/HashFunction.html)是一个单纯的（引用透明的）、无状态的方法，它把任意的数据块映射到固定数目的位指，并且保证相同的输入一定产生相同的输出，不同的输入尽可能产生不同的输出。 ### Hasher HashFunction的实例可以提供有状态的[Hasher](http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/hash/Hasher.html)，Hasher提供了流畅的语法把数据添加到散列运算，然后获取散列值。Hasher可以接受所有原生类型、字节数组、字节数组的片段、字符序列、特定字符集的字符序列等等，或者任何给定了Funnel实现的对象。 Hasher实现了PrimitiveSink接口，这个接口为接受原生类型流的对象定义了fluent风格的API ### Funnel Funnel描述了如何把一个具体的对象类型分解为原生字段值，从而写入PrimitiveSink。比如，如果我们有这样一个类： ``` <pre class="calibre11">``` class Person { final int id; final String firstName; final String lastName; final int birthYear; } ``` ``` 它对应的Funnel实现可能是： ``` <pre class="calibre11">``` Funnel<Person> personFunnel = new Funnel<Person>() { @Override public void funnel(Person person, PrimitiveSink into) { into .putInt(person.id) .putString(person.firstName, Charsets.UTF_8) .putString(person.lastName, Charsets.UTF_8) .putInt(birthYear); } } ``` ``` 注：putString(“abc”, Charsets.UTF\_8).putString(“def”, Charsets.UTF\_8)完全等同于putString(“ab”, Charsets.UTF\_8).putString(“cdef”, Charsets.UTF\_8)，因为它们提供了相同的字节序列。这可能带来预料之外的散列冲突。增加某种形式的分隔符有助于消除散列冲突。 ### HashCode 一旦Hasher被赋予了所有输入，就可以通过[hash()](http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/hash/Hasher.html#hash%28%29)方法获取[HashCode](http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/hash/HashCode.html)实例（多次调用hash()方法的结果是不确定的）。HashCode可以通过[asInt()](http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/hash/HashCode.html#asInt%28%29)、[asLong()](http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/hash/HashCode.html#asLong%28%29)、[asBytes()](http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/hash/HashCode.html#asBytes%28%29)方法来做相等性检测，此外，[writeBytesTo(array, offset, maxLength)](http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/hash/HashCode.html#writeBytesTo%28byte%5B%5D,%20int,%20int%29)把散列值的前maxLength字节写入字节数组。 ## 布鲁姆过滤器\[BloomFilter\] 布鲁姆过滤器是哈希运算的一项优雅运用，它可以简单地基于Object.hashCode()实现。简而言之，布鲁姆过滤器是一种概率数据结构，它允许你检测某个对象是一定不在过滤器中，还是可能已经添加到过滤器了。[布鲁姆过滤器的维基页面](http://en.wikipedia.org/wiki/Bloom_filter)对此作了全面的介绍，同时我们推荐github中的一个[教程](http://billmill.org/bloomfilter-tutorial/)。 Guava散列包有一个内建的布鲁姆过滤器实现，你只要提供Funnel就可以使用它。你可以使用[create(Funnel funnel, int expectedInsertions, double falsePositiveProbability)](http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/hash/BloomFilter.html#create%28com.google.common.hash.Funnel,%20int,%20double%29)方法获取[BloomFilter<T>](http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/hash/BloomFilter.html)，缺省误检率\[falsePositiveProbability\]为3%。BloomFilter<T>提供了[boolean mightContain(T)](http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/hash/BloomFilter.html#mightContain%28T%29) 和[void put(T)](http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/hash/BloomFilter.html#put%28T%29)，它们的含义都不言自明了。 ``` <pre class="calibre11">``` BloomFilter<Person> friends = BloomFilter.create(personFunnel, 500, 0.01); for(Person friend : friendsList) { friends.put(friend); } // 很久以后 if (friends.mightContain(dude)) { //dude不是朋友还运行到这里的概率为1% //在这儿，我们可以在做进一步精确检查的同时触发一些异步加载 } ``` ``` ## Hashing类 Hashing类提供了若干散列函数，以及运算HashCode对象的工具方法。 ### 已提供的散列函数 [`md5()`](http://docs.guava-libraries.googlecode.com/git-history/release12/javadoc/com/google/common/hash/Hashing.html#md5%28%29)[`murmur3_128()`](http://docs.guava-libraries.googlecode.com/git-history/release12/javadoc/com/google/common/hash/Hashing.html#murmur3_128%28%29)[`murmur3_32()`](http://docs.guava-libraries.googlecode.com/git-history/release12/javadoc/com/google/common/hash/Hashing.html#murmur3_32%28%29)[`sha1()`](http://docs.guava-libraries.googlecode.com/git-history/release12/javadoc/com/google/common/hash/Hashing.html#sha1%28%29)[`sha256()`](http://docs.guava-libraries.googlecode.com/git-history/release12/javadoc/com/google/common/hash/Hashing.html#sha256%28%29)[`sha512()`](http://docs.guava-libraries.googlecode.com/git-history/release12/javadoc/com/google/common/hash/Hashing.html#sha512%28%29)[`goodFastHash(int bits)`](http://docs.guava-libraries.googlecode.com/git-history/release12/javadoc/com/google/common/hash/Hashing.html#goodFastHash%28int%29)### HashCode运算 **方法****描述**[`HashCode combineOrdered( Iterable<HashCode>)`](http://docs.guava-libraries.googlecode.com/git-history/release12/javadoc/com/google/common/hash/Hashing.html#combineOrdered%28java.lang.Iterable%29)以有序方式联接散列码，如果两个散列集合用该方法联接出的散列码相同，那么散列集合的元素可能是顺序相等的[`HashCode combineUnordered( Iterable<HashCode>)`](http://docs.guava-libraries.googlecode.com/git-history/release12/javadoc/com/google/common/hash/Hashing.html#combineUnordered%28java.lang.Iterable%29)以无序方式联接散列码，如果两个散列集合用该方法联接出的散列码相同，那么散列集合的元素可能在某种排序下是相等的[`int consistentHash( HashCode, int buckets)`](http://docs.guava-libraries.googlecode.com/git-history/release12/javadoc/co%E2%80%A6le/common/hash/Hashing.html#consistentHash%28com.google.common.hash.HashCode,%20int%29)为给定的”桶”大小返回一致性哈希值。当”桶”增长时，该方法保证最小程度的一致性哈希值变化。详见[一致性哈希](http://en.wikipedia.org/wiki/Consistent_hashing)。