# [`tokenize`](#module-tokenize "tokenize: Lexical scanner for Python source code.") --- Tokenizer for Python source

**Source code:** [Lib/tokenize.py](https://github.com/python/cpython/tree/3.7/Lib/tokenize.py)

The [`tokenize`](#module-tokenize "tokenize: Lexical scanner for Python source code.") module provides a lexical scanner for Python source code, implemented in Python. The scanner in this module returns comments as tokens as well, making it useful for implementing "pretty-printers", including colorizers for on-screen displays.

To simplify token stream handling, all [operator](../reference/lexical_analysis.xhtml#operators) and [delimiter](../reference/lexical_analysis.xhtml#delimiters) tokens and [`Ellipsis`](constants.xhtml#Ellipsis "Ellipsis") are returned using the generic [`OP`](token.xhtml#token.OP "token.OP") token type. The exact type can be determined by checking the `exact_type` property on the [named tuple](../glossary.xhtml#term-named-tuple) returned from [`tokenize.tokenize()`](#tokenize.tokenize "tokenize.tokenize").

## Tokenizing Input

The primary entry point is a [generator](../glossary.xhtml#term-generator):

`tokenize.tokenize(readline)`

The [`tokenize()`](#tokenize.tokenize "tokenize.tokenize") generator requires one argument, *readline*, which must be a callable object that provides the same interface as the [`io.IOBase.readline()`](io.xhtml#io.IOBase.readline "io.IOBase.readline") method of file objects. Each call to the function should return one line of input as bytes.

The generator produces 5-tuples with these members: the token type; the token string; a 2-tuple `(srow, scol)` of ints specifying the row and column where the token begins in the source; a 2-tuple `(erow, ecol)` of ints specifying the row and column where the token ends in the source; and the line on which the token was found. The line passed (the last tuple item) is the *logical* line; continuation lines are included. The 5-tuple is returned as a [named tuple](../glossary.xhtml#term-named-tuple) with the field names `type string start end line`.

The returned [named tuple](../glossary.xhtml#term-named-tuple) has an additional property named `exact_type` that contains the exact operator type for [`OP`](token.xhtml#token.OP "token.OP") tokens. For all other token types `exact_type` equals the named tuple `type` field.

Changed in version 3.1: Added support for named tuples.

Changed in version 3.3: Added support for `exact_type`.

[`tokenize()`](#tokenize.tokenize "tokenize.tokenize") determines the source encoding of the file by looking for a UTF-8 BOM or encoding cookie, according to [**PEP 263**](https://www.python.org/dev/peps/pep-0263).

All constants from the [`token`](token.xhtml#module-token "token: Constants representing terminal nodes of the parse tree.") module are also exported from [`tokenize`](#module-tokenize "tokenize: Lexical scanner for Python source code.").
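As a quick illustration of the named tuple fields and the `exact_type` property described above, here is a minimal sketch; the source string and variable names are purely illustrative and not part of the module's documentation:

```
import token
from io import BytesIO
from tokenize import tokenize, OP

source = b"x = [1, 2, 3]\n"
for tok in tokenize(BytesIO(source).readline):
    if tok.type == OP:
        # tok.type is always OP here; exact_type identifies the specific
        # operator or delimiter, e.g. EQUAL, LSQB, COMMA, RSQB.
        print(tok.string, token.tok_name[tok.exact_type])
```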
Another function is provided to reverse the tokenization process. This is useful for creating tools that tokenize a script, modify the token stream, and write back the modified script.

`tokenize.untokenize(iterable)`

Converts tokens back into Python source code. The *iterable* must return sequences with at least two elements, the token type and the token string. Any additional sequence elements are ignored.

The reconstructed script is returned as a single string. The result is guaranteed to tokenize back to match the input so that the conversion is lossless and round-trips are assured. The guarantee applies only to the token type and token string, as the spacing between tokens (column positions) may change.

It returns bytes, encoded using the [`ENCODING`](token.xhtml#token.ENCODING "token.ENCODING") token, which is the first token sequence output by [`tokenize()`](#tokenize.tokenize "tokenize.tokenize").

[`tokenize()`](#tokenize.tokenize "tokenize.tokenize") needs to detect the encoding of source files it tokenizes. The function it uses to do this is available:

`tokenize.detect_encoding(readline)`

The [`detect_encoding()`](#tokenize.detect_encoding "tokenize.detect_encoding") function is used to detect the encoding that should be used to decode a Python source file. It requires one argument, *readline*, in the same way as the [`tokenize()`](#tokenize.tokenize "tokenize.tokenize") generator.

It will call *readline* a maximum of twice, and return the encoding used (as a string) and a list of any lines (not decoded from bytes) it has read in.

It detects the encoding from the presence of a UTF-8 BOM or an encoding cookie as specified in [**PEP 263**](https://www.python.org/dev/peps/pep-0263). If both a BOM and a cookie are present, but disagree, a [`SyntaxError`](exceptions.xhtml#SyntaxError "SyntaxError") will be raised. Note that if the BOM is found, `'utf-8-sig'` will be returned as the encoding.

If no encoding is specified, then the default of `'utf-8'` will be returned.

Use [`open()`](#tokenize.open "tokenize.open") to open Python source files: it uses [`detect_encoding()`](#tokenize.detect_encoding "tokenize.detect_encoding") to detect the file encoding.

`tokenize.open(filename)`

Open a file in read-only mode using the encoding detected by [`detect_encoding()`](#tokenize.detect_encoding "tokenize.detect_encoding").

New in version 3.2.
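The following minimal sketch shows how these two helpers fit together; the file name `example.py` is hypothetical:

```
import tokenize

# detect_encoding() expects a readline callable that yields bytes.
with open('example.py', 'rb') as f:
    encoding, first_lines = tokenize.detect_encoding(f.readline)
print(encoding)  # e.g. 'utf-8', or 'utf-8-sig' if a BOM was found

# tokenize.open() performs the same detection and returns a text-mode
# file object already decoded with that encoding.
with tokenize.open('example.py') as f:
    source = f.read()
```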
*exception* `tokenize.TokenError`

Raised when either a docstring or expression that may be split over several lines is not completed anywhere in the file, for example:

```
"""Beginning of docstring
```

or:

```
[1, 2, 3
```

Note that unclosed single-quoted strings do not cause an error to be raised. They are tokenized as [`ERRORTOKEN`](token.xhtml#token.ERRORTOKEN "token.ERRORTOKEN"), followed by the tokenization of their contents.

## Command-Line Usage

New in version 3.3.

The [`tokenize`](#module-tokenize "tokenize: Lexical scanner for Python source code.") module can be executed as a script from the command line. It is as simple as:

```
python -m tokenize [-e] [filename.py]
```

The following options are accepted:

- `-h`, `--help`: show this help message and exit
- `-e`, `--exact`: display token names using the exact type

If `filename.py` is specified its contents are tokenized to stdout. Otherwise, tokenization is performed on stdin.

## Examples

Example of a script rewriter that transforms float literals into Decimal objects:

```
from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP
from io import BytesIO

def decistmt(s):
    """Substitute Decimals for floats in a string of statements.

    >>> from decimal import Decimal
    >>> s = 'print(+21.3e-5*-.1234/81.7)'
    >>> decistmt(s)
    "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"

    The format of the exponent is inherited from the platform C library.
    Known cases are "e-007" (Windows) and "e-07" (not Windows).  Since
    we're only showing 12 digits, and the 13th isn't close to 5, the
    rest of the output should be platform-independent.

    >>> exec(s)  #doctest: +ELLIPSIS
    -3.21716034272e-0...7

    Output from calculations with Decimal should be identical across all
    platforms.

    >>> exec(decistmt(s))
    -3.217160342717258261933904529E-7
    """
    result = []
    g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
    for toknum, tokval, _, _, _ in g:
        if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
            result.extend([
                (NAME, 'Decimal'),
                (OP, '('),
                (STRING, repr(tokval)),
                (OP, ')')
            ])
        else:
            result.append((toknum, tokval))
    return untokenize(result).decode('utf-8')
```

Example of tokenizing from the command line. The script:

```
def say_hello():
    print("Hello, World!")

say_hello()
```

will be tokenized to the following output where the first column is the range of the line/column coordinates where the token is found, the second column is the name of the token, and the final column is the value of the token (if any):

```
$ python -m tokenize hello.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'def'
1,4-1,13:           NAME           'say_hello'
1,13-1,14:          OP             '('
1,14-1,15:          OP             ')'
1,15-1,16:          OP             ':'
1,16-1,17:          NEWLINE        '\n'
2,0-2,4:            INDENT         '    '
2,4-2,9:            NAME           'print'
2,9-2,10:           OP             '('
2,10-2,25:          STRING         '"Hello, World!"'
2,25-2,26:          OP             ')'
2,26-2,27:          NEWLINE        '\n'
3,0-3,1:            NL             '\n'
4,0-4,0:            DEDENT         ''
4,0-4,9:            NAME           'say_hello'
4,9-4,10:           OP             '('
4,10-4,11:          OP             ')'
4,11-4,12:          NEWLINE        '\n'
5,0-5,0:            ENDMARKER      ''
```

The exact token type names can be displayed using the [`-e`](#cmdoption-tokenize-e) option:

```
$ python -m tokenize -e hello.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'def'
1,4-1,13:           NAME           'say_hello'
1,13-1,14:          LPAR           '('
1,14-1,15:          RPAR           ')'
1,15-1,16:          COLON          ':'
1,16-1,17:          NEWLINE        '\n'
2,0-2,4:            INDENT         '    '
2,4-2,9:            NAME           'print'
2,9-2,10:           LPAR           '('
2,10-2,25:          STRING         '"Hello, World!"'
2,25-2,26:          RPAR           ')'
2,26-2,27:          NEWLINE        '\n'
3,0-3,1:            NL             '\n'
4,0-4,0:            DEDENT         ''
4,0-4,9:            NAME           'say_hello'
4,9-4,10:           LPAR           '('
4,10-4,11:          RPAR           ')'
4,11-4,12:          NEWLINE        '\n'
5,0-5,0:            ENDMARKER      ''
```
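The same listing can also be produced programmatically. A minimal sketch, assuming `hello.py` is the file shown above; the output format only approximates the column alignment of the command-line tool:

```
import token
import tokenize

with open('hello.py', 'rb') as f:
    for tok in tokenize.tokenize(f.readline):
        srow, scol = tok.start
        erow, ecol = tok.end
        # Like the -e option, exact_type names the specific operator tokens.
        print(f"{srow},{scol}-{erow},{ecol}:",
              token.tok_name[tok.exact_type], repr(tok.string))
```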