Hugging Face Hub · PHP/Python/Frontend/Linux Study Notes

[TOC]

> [Official site](https://huggingface.co/models?pipeline_tag=question-answering&sort=trending)

> [Official tutorial](https://huggingface.co/docs/transformers/main/en/index)

> [Tutorial](http://fancyerii.github.io/2021/05/11/huggingface-transformers-1/#%E4%BD%BF%E7%94%A8pipeline)

## Overview

The `transformers` package, installed with pip, makes it easy to call Hugging Face models.

Transformers aims to:

* help NLP researchers work with large-scale transformer models
* help industry practitioners fine-tune models and deploy them to production
* help engineers download pretrained models and solve real problems

Its design principles include:

* Ease of use
    * There are only three main classes: [configuration](https://huggingface.co/transformers/main_classes/configuration.html), [models](https://huggingface.co/transformers/main_classes/model.html) and [tokenizer](https://huggingface.co/transformers/main_classes/tokenizer.html).
    * Every model is loaded through the same from\_pretrained() function; transformers handles downloading, caching and every other detail of loading a model. All of these models are hosted on [Hugging Face Models](https://huggingface.co/models).
    * On top of these three classes, the higher-level pipeline and Trainer/TFTrainer let you run prediction and fine-tuning with less code.
    * It is therefore not a basic neural network library for building a Transformer step by step. Instead, it wraps common Transformer models as building blocks that are easy to use from PyTorch or TensorFlow.
* Staying as close as possible to the original authors' implementations
    * Each model has at least one example that reproduces results similar to the original paper.
    * The code follows the original implementations as closely as possible, so some of it does not read very naturally.

## Main concepts

* **Model** classes such as BertModel, covering 30+ PyTorch models (torch.nn.Module) and the corresponding TensorFlow models (tf.keras.Model).
* **Config** classes such as BertConfig, which store a model's (hyper)parameters. We usually do not need to construct one ourselves: if the model is not being modified, the corresponding config is used automatically when the model is created.
* **Tokenizer** classes such as BertTokenizer, which store the vocabulary and related data and convert strings into sequences of IDs.

All three kinds of objects can be constructed from a name or a directory with from\_pretrained(), and saved with save\_pretrained().

## Installation

```
pip install transformers
```

For CPU-only support, Transformers and a deep learning library can be installed with a single command:

```
# Transformers and PyTorch
pip install 'transformers[torch]'
# Transformers and TensorFlow 2.0
pip install 'transformers[tf-cpu]'
# Transformers and Flax
pip install 'transformers[flax]'
```

Check that the installation works:

```
python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('we love you'))"
```

### Cache settings

Pretrained models are downloaded and cached locally in `~/.cache/huggingface/hub`.

## pipeline

| **Task** | **Description** | **Modality** | **Pipeline** |
| --- | --- | --- | --- |
| Text classification | Assigns a label to a given text sequence | NLP | `pipeline(task="sentiment-analysis")` |
| Text generation | Generates text from a given prompt | NLP | `pipeline(task="text-generation")` |
| Named entity recognition | Assigns a label to each token in a sequence (person, organization, location, etc.) | NLP | `pipeline(task="ner")` |
| Question answering | Extracts the answer from the text, given a context and a question | NLP | `pipeline(task="question-answering")` |
| Fill-mask | Predicts the correct masked token in a sequence | NLP | `pipeline(task="fill-mask")` |
| Summarization | Generates a summary of a text sequence or document | NLP | `pipeline(task="summarization")` |
| Translation | Translates text from one language into another | NLP | `pipeline(task="translation")` |
| Image classification | Assigns a label to an image | Computer vision | `pipeline(task="image-classification")` |
| Image segmentation | Assigns a label to every individual pixel of an image (supports semantic, panoptic and instance segmentation) | Computer vision | `pipeline(task="image-segmentation")` |
| Object detection | Predicts the bounding boxes and classes of objects in an image | Computer vision | `pipeline(task="object-detection")` |
| Audio classification | Assigns a label to an audio file | Audio | `pipeline(task="audio-classification")` |
| Automatic speech recognition | Transcribes the speech in an audio file into text | Audio | `pipeline(task="automatic-speech-recognition")` |
| Visual question answering | Given an image and a question, answers the question about the image | Multimodal | `pipeline(task="vqa")` |

## Quick start

### Downloading a model and calling it locally

### Translation

The pipeline task name has the form `translation_xx_to_yy`; here we translate from English to Chinese.

```
from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer

# AutoModelWithLMHead is deprecated; AutoModelForSeq2SeqLM is the matching class for translation models
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
translation = pipeline("translation_en_to_zh", model=model, tokenizer=tokenizer)

text = "The Home of Machine Learning"
translated_text = translation(text, max_length=40)[0]['translation_text']
print(translated_text)
# 机器学习之家
```

### Question answering

#### roberta-base-chinese-extractive-qa model

```
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model = AutoModelForQuestionAnswering.from_pretrained('uer/roberta-base-chinese-extractive-qa')
tokenizer = AutoTokenizer.from_pretrained('uer/roberta-base-chinese-extractive-qa')
QA = pipeline('question-answering', model=model, tokenizer=tokenizer)

# The question asks who wrote the poem; the answer is a span of the context
QA_input = {
    'question': "著名诗歌《假如生活欺骗了你》的作者是",
    'context': "普希金从那里学习人民的语言，吸取了许多有益的养料，这一切对普希金后来的创作产生了很大的影响。这两年里，普希金创作了不少优秀的作品，如《囚徒》、《致大海》、《致凯恩》和《假如生活欺骗了你》等几十首抒情诗，叙事诗《努林伯爵》，历史剧《鲍里斯·戈都诺夫》，以及《叶甫盖尼·奥涅金》前六章。"
}
qa = QA(QA_input, max_length=100)
print(qa)
# {'score': 0.9766427278518677, 'start': 0, 'end': 3, 'answer': '普希金'}
```

#### luhua/chinese_pretrain_mrc_roberta_wwm_ext_large model

```
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline

model_name = "chinese_pretrain_mrc_roberta_wwm_ext_large"  # or "chinese_pretrain_mrc_macbert_large"

# Use in Transformers
tokenizer = AutoTokenizer.from_pretrained(f"luhua/{model_name}")
model = AutoModelForQuestionAnswering.from_pretrained(f"luhua/{model_name}")
QA = pipeline('question-answering', model=model, tokenizer=tokenizer)

# The question ("Who is Qian Zhongshu?") cannot be answered from this context
QA_input = {
    'question': "钱钟书是谁",
    'context': "普希金从那里学习人民的语言，吸取了许多有益的养料，这一切对普希金后来的创作产生了很大的影响。这两年里，普希金创作了不少优秀的作品，如《囚徒》、《致大海》、《致凯恩》和《假如生活欺骗了你》等几十首抒情诗，叙事诗《努林伯爵》，历史剧《鲍里斯·戈都诺夫》，以及《叶甫盖尼·奥涅金》前六章。"
}
qa = QA(QA_input, max_length=100)
print(qa)
# {'score': 0.0037305462174117565, 'start': 66, 'end': 76, 'answer': '《囚徒》、《致大海》'}
```
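
The second result above has a score of roughly 0.004: by default an extractive question-answering pipeline still returns some span of the context, even when the context does not contain the answer. One way to handle this is to discard low-scoring results. The snippet below is a minimal sketch of that idea, not part of transformers: the helper name `accept_answer` and the cutoff `0.5` are assumptions made for illustration, and the two dictionaries are copied from the outputs printed above so that it runs without downloading a model.

```python
from typing import Optional

# Hypothetical cutoff chosen for illustration; tune it on your own data.
MIN_SCORE = 0.5


def accept_answer(result: dict, min_score: float = MIN_SCORE) -> Optional[str]:
    """Return the answer text if its score reaches min_score, otherwise None."""
    if result["score"] >= min_score:
        return result["answer"]
    return None


# Outputs copied from the two question-answering examples above.
pushkin_result = {"score": 0.9766427278518677, "start": 0, "end": 3, "answer": "普希金"}
unrelated_result = {"score": 0.0037305462174117565, "start": 66, "end": 76, "answer": "《囚徒》、《致大海》"}

print(accept_answer(pushkin_result))    # 普希金
print(accept_answer(unrelated_result))  # None
```

`start` and `end` are character offsets into the context, so `context[start:end]` gives back the `answer` string.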