colly [11.9k] · Study notes on PHP/Python/front-end/Linux and more

[TOC]

* Colly GitHub: [https://github.com/gocolly/colly](https://github.com/gocolly/colly)
* Colly documentation: [https://pkg.go.dev/github.com...](https://pkg.go.dev/github.com/gocolly/colly/v2?tab=doc)
* Crawlab GitHub: [https://github.com/crawlab-te...](https://github.com/crawlab-team/crawlab)
* Crawlab website: [https://crawlab.cn](https://crawlab.cn/)
* Crawlab demo: [https://demo-pro.crawlab.cn](https://demo-pro.crawlab.cn/)

## Features

* Clean API
* Fast (>1k requests/sec on a single core)
* Manages request delays and maximum concurrency per domain
* Automatic cookie and session handling
* Sync/async/parallel scraping
* Distributed scraping
* Caching
* Automatic encoding of non-unicode responses
* Robots.txt support
* Google App Engine support

## Visualization with Crawlab

```
» crawlab upload go.mod go.sum baidu_spider.go uploaded successfully
```

On the spider detail page in Crawlab, enter the execute command `go run baidu_spider.go` and click Run.

## Examples

### Official example

```
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	// Find and visit all links
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	// Log every request before it is sent
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.Visit("http://go-colly.org/")
}
```

### Baidu top trending topics

```
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	// Print the title text of each trending entry
	c.OnHTML(".title-content", func(e *colly.HTMLElement) {
		fmt.Printf("%+v\n", e.ChildText(".title-content-title"))
	})

	// Log every request before it is sent
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.Visit("http://www.baidu.com")
}

// Output:
//杜特尔特称优先采购中俄疫苗
//黄子韬悼念爸爸
//云南瑞丽全员核酸检测
//2020百度世界大会
//狗不理解除与王府井店加盟方合作
//关晓彤老虎菜拌魔芋粉
```

### For more examples, see the official examples

https://github.com/gocolly/colly/tree/master/_examples
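
A note on the official example above: it passes `e.Attr("href")` straight to `e.Request.Visit`, even though many hrefs on a page are relative. This works because colly resolves the href against the URL of the page being processed before it sends the request. The sketch below illustrates only that resolution step, using the Go standard library. `resolveLink` is a hypothetical helper written for this note, not part of colly's API, and colly's own handling may differ in details.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink is a hypothetical helper (not part of colly). It resolves an
// href found on a page against that page's URL and returns the absolute URL.
func resolveLink(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// ResolveReference applies the standard relative-reference rules:
	// absolute hrefs are returned unchanged, relative ones are joined to base.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	// Example hrefs are assumptions chosen for illustration.
	hrefs := []string{"/docs/", "articles/", "https://github.com/gocolly/colly"}
	for _, href := range hrefs {
		abs, err := resolveLink("http://go-colly.org/", href)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Println(abs)
	}
}
```

Inside a real callback, colly's `e.Request.AbsoluteURL(e.Attr("href"))` provides the resolved URL if it needs to be logged or filtered before visiting.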