Tesseract · Python Web Scraping

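**Tesseract:** Tesseract is an OCR (Optical Character Recognition) engine that converts images of text into machine-readable text. Its development has been sponsored by Google, and it is widely regarded as one of the most accurate open-source OCR libraries. Besides its high recognition accuracy, it is also flexible: it can be trained to recognize almost any font.

**Installing Tesseract in different environments:**

(1) Install the pieces that match your environment:

- Windows: download the installer via https://github.com/tesseract-ocr/tesseract
- Python: install the pytesseract wrapper with `pip install pytesseract` (this is only a wrapper, so the Tesseract engine itself must also be installed)
- Ubuntu: `sudo apt install tesseract-ocr`

(2) After installation, the `tesseract` executable must be reachable through the PATH environment variable on every platform. On macOS and Linux this is normally set up during installation. On Windows, add the directory that contains `tesseract.exe` to PATH.

(3) Once PATH is configured, open cmd and run the command below to check the version. If it runs normally, the installation succeeded.

```shell
C:\Users\Administrator>D:\Tesseract-OCR\tesseract --version
tesseract 4.00.00alpha
 leptonica-1.74.1
```

**Recognizing an image from the command line:**

```shell
# tesseract <image path> <output file base name>
# The recognized text is written to d.txt
tesseract demo.png d
```

To recognize images containing Chinese text, download the language data files from https://github.com/tesseract-ocr/tessdata and place them in the `tessdata` directory.

**Recognizing an image from code:**

```python
import pytesseract
from PIL import Image

# Path to the Tesseract executable (needed when it is not on PATH)
pytesseract.pytesseract.tesseract_cmd = r'D:\Tesseract-OCR\tesseract.exe'
# Point Tesseract at the language data directory; the option name is required
tessdata_dir_config = r'--tessdata-dir "D:\Tesseract-OCR\tessdata"'

image = Image.open(r'F:\demo\demo.jpg')
# Prints the characters recognized in the CAPTCHA image
print(pytesseract.image_to_string(image, lang='eng', config=tessdata_dir_config))
```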
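
The command-line and Python usages above can also be combined by calling the `tesseract` executable directly from Python, without pytesseract. The following is a minimal sketch under stated assumptions: the helper names `build_tesseract_command` and `run_ocr` are hypothetical and not part of any library, and the sketch assumes the `tesseract` executable is on PATH. It relies only on the standard CLI form `tesseract <image> <outputbase> -l <lang>`, where the special output base `stdout` prints the result instead of writing a `.txt` file.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base="stdout", lang="eng",
                            tessdata_dir=None, executable="tesseract"):
    """Build the argument list for a tesseract CLI call (hypothetical helper)."""
    cmd = [executable, str(image_path), output_base, "-l", lang]
    if tessdata_dir is not None:
        # Optional: tell Tesseract where the *.traineddata files live
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def run_ocr(image_path, lang="eng", tessdata_dir=None, executable="tesseract"):
    """Run tesseract on one image and return the recognized text (hypothetical helper)."""
    if shutil.which(executable) is None:
        raise RuntimeError(
            f"{executable!r} was not found; install Tesseract and add it to PATH"
        )
    cmd = build_tesseract_command(image_path, "stdout", lang, tessdata_dir, executable)
    # check=True raises CalledProcessError if tesseract exits with an error
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

For example, `run_ocr("demo.png", lang="chi_sim")` would attempt to recognize Simplified Chinese, provided `chi_sim.traineddata` has been downloaded into the tessdata directory. Passing the command as a list rather than a single string avoids shell quoting problems with paths that contain spaces.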