# Loading and Preparing the text8 Dataset

Now we perform the same loading and preprocessing steps with the text8 dataset:

```py
from datasetslib.text8 import Text8

text8 = Text8()
# downloads the data, converts words to ids, and converts the files to a list of ids
text8.load_data()
print('Train:', text8.part['train'][0:5])
print('Vocabulary Length = ', text8.vocab_len)
```

We find that the vocabulary length is approximately 254,000 words:

```py
Train: [5233, 3083, 11, 5, 194]
Vocabulary Length = 253854
```

Some tutorials manipulate this data by keeping only the most frequent words, truncating the vocabulary to a size of 10,000 words. However, we use the complete dataset from the first file of the text8 dataset, together with the complete vocabulary (a sketch of that truncation approach appears at the end of this section).

Prepare the CBOW pairs (see the pairing sketch at the end of this section):

```py
text8.skip_window = 2
text8.reset_index_in_epoch()
# in CBOW the input is the context words and the output is the target word
y_batch, x_batch = text8.next_batch_cbow()

print('The CBOW pairs : context,target')
for i in range(5 * text8.skip_window):
    print('(', [text8.id2word[x_i] for x_i in x_batch[i]],
          ',', y_batch[i], text8.id2word[y_batch[i]], ')')
```

The output is:

```py
The CBOW pairs : context,target
( ['anarchism', 'originated', 'a', 'term'] , 11 as )
( ['originated', 'as', 'term', 'of'] , 5 a )
( ['as', 'a', 'of', 'abuse'] , 194 term )
( ['a', 'term', 'abuse', 'first'] , 1 of )
( ['term', 'of', 'first', 'used'] , 3133 abuse )
( ['of', 'abuse', 'used', 'against'] , 45 first )
( ['abuse', 'first', 'against', 'early'] , 58 used )
( ['first', 'used', 'early', 'working'] , 155 against )
( ['used', 'against', 'working', 'class'] , 127 early )
( ['against', 'early', 'class', 'radicals'] , 741 working )
```

Prepare the skip-gram pairs (again, see the sketch at the end of this section):

```py
text8.skip_window = 2
text8.reset_index_in_epoch()
# in skip-gram the input is the target word and the output is the context word
x_batch, y_batch = text8.next_batch()

print('The skip-gram pairs : target,context')
for i in range(5 * text8.skip_window):
    print('(', x_batch[i], text8.id2word[x_batch[i]],
          ',', y_batch[i], text8.id2word[y_batch[i]], ')')
```

The output is:

```py
The skip-gram pairs : target,context
( 11 as , 5233 anarchism )
( 11 as , 3083 originated )
( 11 as , 5 a )
( 11 as , 194 term )
( 5 a , 3083 originated )
( 5 a , 11 as )
( 5 a , 194 term )
( 5 a , 1 of )
( 194 term , 11 as )
( 194 term , 5 a )
```
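For reference, here is a minimal sketch of the vocabulary truncation that those other tutorials perform. `truncate_vocabulary` is a hypothetical helper introduced here for illustration, not part of `datasetslib`, and it assumes `words` is a plain Python list of word strings:

```py
import collections

def truncate_vocabulary(words, vocab_size=10000):
    # hypothetical helper: keep the (vocab_size - 1) most frequent words
    # and map every other word to a single UNK token with id 0
    counts = [('UNK', -1)]
    counts.extend(collections.Counter(words).most_common(vocab_size - 1))
    word2id = {word: i for i, (word, _) in enumerate(counts)}
    ids = [word2id.get(word, 0) for word in words]
    return ids, word2id

# usage (hypothetical): ids, word2id = truncate_vocabulary(list_of_word_strings)
```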
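To make the CBOW pairing concrete, here is a minimal sketch of the logic that `next_batch_cbow` appears to implement for the pairs printed above; `cbow_pairs` is a hypothetical name, and the library's actual batching internals may differ:

```py
def cbow_pairs(ids, skip_window=2):
    # for every position with a full window on both sides, the input is the
    # 2 * skip_window surrounding ids and the output is the center (target) id
    pairs = []
    for i in range(skip_window, len(ids) - skip_window):
        context = ids[i - skip_window:i] + ids[i + 1:i + skip_window + 1]
        pairs.append((context, ids[i]))
    return pairs

# with the first five training ids printed above:
# cbow_pairs([5233, 3083, 11, 5, 194]) == [([5233, 3083, 5, 194], 11)]
```

This reproduces the first printed pair: the context `['anarchism', 'originated', 'a', 'term']` predicts the target `as` (id 11).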
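The skip-gram pairing inverts this relationship. The following sketch, with the hypothetical name `skip_gram_pairs`, matches the (target, context) pairs printed above, though again the library may batch them differently internally:

```py
def skip_gram_pairs(ids, skip_window=2):
    # each center (target) id is emitted once per surrounding id within the
    # window, producing 2 * skip_window (target, context) pairs per position
    pairs = []
    for i in range(skip_window, len(ids) - skip_window):
        for j in range(i - skip_window, i + skip_window + 1):
            if j != i:
                pairs.append((ids[i], ids[j]))
    return pairs

# with the first five training ids printed above:
# skip_gram_pairs([5233, 3083, 11, 5, 194]) ==
#     [(11, 5233), (11, 3083), (11, 5), (11, 194)]
```

Note that each target word therefore appears in multiple training pairs, one per context word, which is why the printed output repeats `11 as` four times before moving on to `5 a`.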