5.3.1 MongoDB存储 · python3爬虫笔记

### 1.相关&安装官方文档:[https://docs.mongodb.com/manual/reference/operator/query/](https://docs.mongodb.com/manual/reference/operator/query/) [安装文档](/1kai-fa-huan-jing-pei-zhi/15-cun-chu-ku-de-an-zhuang/152-pymongode-an-zhuang.md) ### 2.连接数据库通过pymongo库中的MongoClient，连接MongoDB数据库，有两种连接方式: 第一种: ``` import pymongo client = pymongo.MongoClient(host='localhost',port=27017) ``` 第二种: ``` import pymongo client = pymongo.MongoClient('mongodb://localhost:27017') ``` ### 3.指定数据库 ``` # 在MongoDB中不需要去创建数据库，指定数据库名称，调用会自定生成相应的数据库 db = client.test # 等价于 # db = client["test"] ``` ### 4.指定集合 MongoDB 的每个数据库包含了许多集合 Collection，类似与关系型数据库中的表指定集合和指定数据库的操作是一样的 ``` collection = db.students # collection = db["students"] ``` ### 5.插入数据 ``` student = { "name":"angle", "age":20, } # 通过调用集合的insert()方法插入数据 result = collection.insert(student) print(result) ``` insert 方法会在执行后返回的 \_id 值，\_id值是每一条数据的唯一标识，如果没有显示指明\_id，MongoDB会自动产生一个ObjectId类型的\_id属性运行结果: ``` 5b684a54bd880b468471dccf ``` 如果有多个值，可以以列表形式写入 ``` import pymongo client = pymongo.MongoClient(host='localhost',port=27017) # 在MongoDB中不需要去创建数据库，指定数据库名称，调用会自定生成相应的数据库 db = client.test # db = client["test"] collection = db.students # collection = db["students"] student1 = { "name":"angle", "age":20, } student2 = { "name":"angle", "age":20, } result = collection.insert([student1,student2]) print(result) ``` 运行结果: ``` [ObjectId('5b684e83bd880b4408713464'), ObjectId('5b684e83bd880b4408713465')] ``` 注意在python3中，insert方法已经不再被推荐使用，现在官方推荐使用insert\_one和insert\_many方法 * insert\_one:插入一条数据 * insert\_many:插入多条数据 ``` # 插入单挑数据 result = collection.insert_one(student) print(result) ``` 运行结果: ``` <pymongo.results.InsertOneResult object at 0x000002744113BF48> ``` 返回结果和 insert 方法不同，返回的是InsertOneResult 对象，可以调用其 inserted\_id 属性获取 \_id ``` # 插入多条数据 result = collection.insert_many([student1,student2]) print(result) # 通过调用inserted_ids属性获取插入数据的_id的列表 print(result.inserted_ids) ``` 运行结果: ``` <pymongo.results.InsertManyResult object at 0x000001BBFFE2AF88> [ObjectId('5b684f75bd880b2228c1fd23'), ObjectId('5b684f75bd880b2228c1fd24')] ``` ### 6. 查询 {#6-查询} * find\_one:查询得到单个结果 * find:返回一个生成器对象 ``` result = collection.find_one({"name":"angle"}) print(type(result)) print(result) ``` 运行结果: ``` <class 'dict'> {'_id': ObjectId('5b684a54bd880b468471dccf'), 'name': 'angle', 'age': 20} ``` 可以根据ObjectId查询，但是需要导入bson库的ObjectId ``` from bson.objectid import ObjectId result = collection.find_one({'_id': ObjectId('5b684a54bd880b468471dccf')}) print(type(result)) print(result) ``` 运行结果: ``` <class 'dict'> {'_id': ObjectId('5b684a54bd880b468471dccf'), 'name': 'angle', 'age': 20} ``` 对于多条数据的查询 ``` results = collection.find({"name":"angle"}) print(results) for result in results: print(result) ``` 运行结果: ``` <pymongo.cursor.Cursor object at 0x0000022AED824518> {'_id': ObjectId('5b684a54bd880b468471dccf'), 'name': 'angle', 'age': 20} {'_id': ObjectId('5b684e83bd880b4408713464'), 'name': 'angle', 'age': 20} {'_id': ObjectId('5b684e83bd880b4408713465'), 'name': 'angle', 'age': 20} {'_id': ObjectId('5b684f1dbd880b49a85d9e9e'), 'name': 'angle', 'age': 20} {'_id': ObjectId('5b684f75bd880b2228c1fd23'), 'name': 'angle', 'age': 20} {'_id': ObjectId('5b684f75bd880b2228c1fd24'), 'name': 'angle', 'age': 20} ``` 在查询时，可以使用条件查询，例如:查询age小于20的数据 ``` # 添加数据 student= { "name":"miku", "age":18, } result = collection.insert_one(student) ``` 条件语句通过以字典形式书写: ``` {'$lt':20} ``` ``` results = collection.find({'age':{'$lt':20}}) for result in results: print(result) ``` 运行结果: ``` {'_id': ObjectId('5b68513cbd880b4dd0e128c7'), 'name': 'miku', 'age': 18} ``` 比较符号 | 符号 | 含义 | 示例 | | :--- | :--- | :--- | | $lt | 小于 | `{'age': {'$lt': 20}}` | | $gt | 大于 | `{'age': {'$gt': 20}}` | | $lte | 小于等于 | `{'age': {'$lte': 20}}` | | $gte | 大于等于 | `{'age': {'$gte': 20}}` | | $ne | 不等于 | `{'age': {'$ne': 20}}` | | $in | 在范围内 | `{'age': {'$in': [20, 23]}}` | | $nin | 不在范围内 | `{'age': {'$nin': [20, 23]}}` | | $regex | 正则匹配 | {'name':{'$regex':'^a.\*'}}（匹配以a开头的字符串） | 更详细的官方文档:[https://docs.mongodb.com/manual/reference/operator/query/](https://docs.mongodb.com/manual/reference/operator/query/) 功能符号 | 符号 | 含义 | 示例 | 示例含义 | | :--- | :--- | :--- | :--- | | $regex | 匹配正则 | `{'name': {'$regex': '^M.*'}}` | name 以 M开头 | | $exists | 属性是否存在 | `{'name': {'$exists': True}}` | name 属性存在 | | $type | 类型判断 | `{'age': {'$type': 'int'}}` | age 的类型为 int | | $mod | 数字模操作 | `{'age': {'$mod': [5, 0]}}` | 年龄模 5 余 0 | | $text | 文本查询 | `{'$text': {'$search': 'Mike'}}` | text 类型的属性中包含 Mike 字符串 | | $where | 高级条件查询 | `{'$where': 'obj.fans_count == obj.follows_count'}` | 自身粉丝数等于关注数 | | $inc | 加法 | {'$inc':{'age':1}} | 自身年龄加1 | ### 7.计数统计查询结有多少条数据，可以调用count方法 ``` # 统计所有数据数目 count = collection.find().count() print(count) ``` 可以统计符合某个条件的数据有多少条数目 ``` count = collection.find({'age':{'$lt':20}}).count() print(count) ``` ### 8.排序可以调用sort方法进行排序，并传入如下参数可指定升序或降序 * pymongo.ASCENDING:升序 * pymongo,DESCENDING:降序 ``` # 升序 results = collection.find().sort('name',pymongo.ASCENDING) print([result['name'] for result in results]) ``` 运行结果: ``` ['angle', 'angle', 'angle', 'angle', 'angle', 'angle', 'miku', 'miku'] ``` ### 9.偏移 skip$n$方法:向后偏移n个，获取第n+1及其后的数据 ``` results = collection.find().sort('name',pymongo.ASCENDING).skip(2) print([result['name'] for result in results]) ``` 运行结果: ``` ['angle', 'angle', 'angle', 'angle', 'miku', 'miku'] ``` 使用limit限制指定要取的结果个数 limit$n$:限制只取n个数据 ``` results = collection.find().sort('name',pymongo.ASCENDING).skip(2).limit(3) print([result['name'] for result in results]) ``` 运行结果: ``` ['angle', 'angle', 'angle'] ``` 注意:在数据超多时，不要使用偏移量，应该使用\_id来进行筛选 ``` from bson.objectid import ObjectId collection.find({'_id': {'$gt': ObjectId('5b684a54bd880b468471dccf')}}) ``` ### 10.更新 update方法:指定更新条件和更新的数据，对数据进行更新 ``` # 根据条件先筛选出数据 condition = {"age":{"$lt":20}} student = collection.find_one(condition) # 修改数据 student['age'] = 100 # 把原条件和修改后的数据传入，完成数据的更新 result = collection.update(condition,student) print(result) ``` 运行结果: ``` {'n': 1, 'nModified': 1, 'ok': 1.0, 'updatedExisting': True} ``` 返回结果为字典形式，ok为执行成，nModified:为影响的数据条数使用$set操作符对数据进行更新，$set操作符，只更新student字典内存在的字段，如果student原还有其他字段则不会更新，也不会删除，如果不用$set操作符的话，之前的数据全部被student字典替换，如果原先存在其他的字段则会被删除 ``` result = collection.update(condition,{'$set':student}) ``` 注意update方法已经不再被推荐使用了，目前推荐使用update\_one和update\_many方法，用法更加严格，第二个参数需要使用$类型操作符作为字典的键名 ``` condition = {"name":"miku"} student = collection.find_one(condition) student['age'] = 50 result = collection.update_one(condition,{'$set':student}) print(result) # matched_count:获取匹配的数据条数 # modified_count:获取影响的数据条数 print(result.matched_count, result.modified_count) ``` 运行结果: ``` <pymongo.results.UpdateResult object at 0x0000028E1A94AE88> 1 1 ``` 更新多条数据 ``` # 所有符合年龄大于20的，筛选出来后，再加上10，然后更新age字段数据 condition = {"age":{"$gt":20}} result = collection.update_many(condition,{'$inc':{'age':10}}) print(result) print(result.matched_count,result.modified_count) ``` ### 11.删除 remove$condition$方法:指定删除条件，所有符合的数据都会被删除数据 ``` condition = {"age":{"$gt":20}} result = collection.remove(condition) print(result) ``` 运行结果: ``` {'n': 2, 'ok': 1.0} ``` n:表示删除逇数目，ok表示删除成功和上面一样remove已经不被推荐使用，目前使用delete\_one和delete\_many方法 * delete\_one:删除一条数据 * delete\_many:删除多条数据 ``` result = collection.delete_one({'name':'angle'}) print(result) print(result.deleted_count) result = collection.delete_many({'name':'angle'}) print(result.deleted_count) ``` 运行结果: ``` <pymongo.results.DeleteResult object at 0x0000024241D3AE88> 1 4 ``` ### 12. 更多 {#12-更多} 另外 PyMongo 还提供了一些组合方法，如find\_one\_and\_delete、find\_one\_and\_replace、find\_one\_and\_update，就是查找后删除、替换、更新操作，用法与上述方法基本一致。另外还可以对索引进行操作，如 create\_index、create\_indexes、drop\_index 等。详细用法可以参见官方文档：[http://api.mongodb.com/python/current/api/pymongo/collection.html](http://api.mongodb.com/python/current/api/pymongo/collection.html)。另外还有对数据库、集合本身以及其他的一些操作，在这不再一一讲解，可以参见官方文档：[http://api.mongodb.com/python/current/api/pymongo/](http://api.mongodb.com/python/current/api/pymongo/)。