12.5. Parsers · PostgreSQL 9.3 Documentation (Chinese edition)

# 12.5\. Parsers

Text search parsers are responsible for splitting raw document text into _tokens_ and identifying each token's type, where the set of possible types is defined by the parser itself. Note that a parser does not modify the text at all; it only identifies plausible word boundaries. Because of this limited scope, there is less need for application-specific custom parsers than there is for custom dictionaries. At present PostgreSQL provides just one built-in parser, which has proved useful for a wide range of applications.

The built-in parser is named `pg_catalog.default`. It recognizes 23 token types, shown in [Table 12-1](#calibre_link-1145).

**Table 12-1\. Default Parser's Token Types**

| Alias | Description | Example |
| --- | --- | --- |
| `asciiword` | Word, all ASCII letters | `elephant` |
| `word` | Word, all letters | `mañana` |
| `numword` | Word, letters and digits | `beta1` |
| `asciihword` | Hyphenated word, all ASCII | `up-to-date` |
| `hword` | Hyphenated word, all letters | `lógico-matemática` |
| `numhword` | Hyphenated word, letters and digits | `postgresql-beta1` |
| `hword_asciipart` | Hyphenated word part, all ASCII | `postgresql` in the context `postgresql-beta1` |
| `hword_part` | Hyphenated word part, all letters | `lógico` or `matemática` in the context `lógico-matemática` |
| `hword_numpart` | Hyphenated word part, letters and digits | `beta1` in the context `postgresql-beta1` |
| `email` | Email address | `foo@example.com` |
| `protocol` | Protocol head | `http://` |
| `url` | URL | `example.com/stuff/index.html` |
| `host` | Host | `example.com` |
| `url_path` | URL path | `/stuff/index.html`, in the context of a URL |
| `file` | File or path name | `/usr/local/foo.txt`, if not within a URL |
| `sfloat` | Scientific notation | `-1.234e56` |
| `float` | Decimal notation | `-1.234` |
| `int` | Signed integer | `-1234` |
| `uint` | Unsigned integer | `1234` |
| `version` | Version number | `8.3.0` |
| `tag` | XML tag | `<a href="dictionaries.html">` |
| `entity` | XML entity | `&amp;` |
| `blank` | Space symbols | (any whitespace or punctuation not otherwise recognized) |

> **Note:** The parser's notion of a "letter" is determined by the database's locale setting, specifically `lc_ctype`. Words containing only the basic ASCII letters are reported as a separate token type, since it is sometimes useful to distinguish them. In most European languages, token types `word` and `asciiword` should be treated alike.
>
> `email` does not support all valid email characters as defined by RFC 5322. Specifically, the only non-alphanumeric characters supported for email user names are period, dash, and underscore.

It is possible for the parser to produce overlapping tokens from the same piece of text. As an example, a hyphenated word will be reported both as the entire word and as each component:

```
SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
      alias      |               description                |     token
-----------------+------------------------------------------+---------------
 numhword        | Hyphenated word, letters and digits      | foo-bar-beta1
 hword_asciipart | Hyphenated word part, all ASCII          | foo
 blank           | Space symbols                            | -
 hword_asciipart | Hyphenated word part, all ASCII          | bar
 blank           | Space symbols                            | -
 hword_numpart   | Hyphenated word part, letters and digits | beta1
```

This behavior is desirable since it allows searches to work for both the whole compound word and for its components. Here is another instructive example:

```
SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
  alias   |  description  |            token
----------+---------------+------------------------------
 protocol | Protocol head | http://
 url      | URL           | example.com/stuff/index.html
 host     | Host          | example.com
 url_path | URL path      | /stuff/index.html
```
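
The token types in Table 12-1 can also be read back from a running server, which is a convenient way to check what the parser does with a given input. The following is a minimal sketch, assuming a session on a PostgreSQL release with built-in text search and the built-in `default` parser; `ts_token_type` and `ts_parse` are the standard parser-testing functions, while the sample string and the table aliases `t` and `tt` are chosen here purely for illustration.

```sql
-- List every token type the built-in parser can report,
-- with its numeric ID, alias, and description.
SELECT tokid, alias, description
FROM ts_token_type('default')
ORDER BY tokid;

-- Parse an illustrative string and attach the alias to each raw token.
-- ts_parse returns only (tokid, token), so the alias comes from the join.
-- Row order is not guaranteed after the join.
SELECT t.tokid, tt.alias, t.token
FROM ts_parse('default', 'foo-bar-beta1') AS t
JOIN ts_token_type('default') AS tt USING (tokid);
```

Unlike `ts_debug`, these two functions exercise only the parser and do not consult any text search configuration or dictionaries, which can help isolate whether an unexpected result comes from tokenization or from later processing.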