Crawlers and Weibo Data Storage: Distributed Databases and Their Applications - PingCAP


Reader submission 1254 2023-04-07


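Distributed crawler system

A simple distributed crawler

What a distributed crawler is for: 1. Working around the target site's limit on request frequency per IP

2. Using more bandwidth to increase download speed

3. Distributed storage and backup for a large-scale system

4. The ability to scale the data

Deploy the multi-process crawler to multiple hosts

Point the database address in every crawler's configuration at one shared server

Configure the database to accept requests only from specific source IPs

Configure the firewall to allow remote connections on the database port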
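The source-IP restriction above is normally configured in the database or the firewall itself, but the rule it enforces can be sketched in a few lines. The networks below are hypothetical placeholders, not values from the original setup.

```python
import ipaddress

# Hypothetical allow-list of crawler hosts; replace with your own addresses.
ALLOWED_SOURCES = [
    ipaddress.ip_network("10.0.0.0/24"),      # assumed crawler subnet
    ipaddress.ip_network("192.168.1.50/32"),  # assumed single extra host
]


def is_allowed(source_ip: str) -> bool:
    """Return True if source_ip belongs to one of the allowed networks."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in ALLOWED_SOURCES)
```

Every crawler host would need an address inside one of these networks; any other source is rejected before it reaches the database.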

Distributed crawler system: crawler

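Distributed storage

Characteristics of raw crawler data storage

1. Small files: large numbers of KB-sized files

2. A very large number of files

3. Written once, incrementally, and very rarely modified

4. Sequential reads

5. Concurrent file reads and writes

6. Scalable

Google FS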

HDFS

Distributed, scalable, portable file system

Written in Java

Not fully POSIX-compliant

Replication: 3 copies by default

Designed for immutable files

Files are cached and split into chunks; the chunk size is 64 MB (the default in older Hadoop releases; Hadoop 2 and later default to 128 MB)
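To see how the chunk size and replication factor interact with the many small files a crawler produces, here is a rough back-of-the-envelope sketch. It assumes that a file smaller than one chunk still needs a chunk entry of its own but only occupies its real size on disk; the function name and the sample figures are illustrative.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB chunk size, as stated above
REPLICATION = 3                # default number of copies, as stated above


def hdfs_footprint(file_sizes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (block_count, raw_bytes) for files of the given sizes in bytes."""
    # A file smaller than one block still needs a block entry of its own;
    # larger files are split into several blocks.
    blocks = sum(math.ceil(size / block_size) for size in file_sizes)
    # A block only occupies as much disk as the data it holds, per replica.
    raw_bytes = sum(file_sizes) * replication
    return blocks, raw_bytes


# One million 10 KB pages versus the same data packed into a single file.
small = hdfs_footprint([10 * 1024] * 1_000_000)
packed = hdfs_footprint([10 * 1024 * 1_000_000])
```

Both layouts use the same raw disk space, but the unpacked one needs a million block entries instead of 153, which is why very small files are usually packed or stored in a database layered on top of HDFS.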

Python hdfs module

Installation: pip install hdfs

Method        Description
read()        Read a file
write()       Write a file
delete()      Remove a file or directory from HDFS
rename()      Move a file or folder
download()    Download a file or folder from HDFS and save it locally
list()        Return names of files contained in a remote folder
makedirs()    Create a remote directory, recursively if necessary
resolve()     Return absolute, normalized path, with special markers expanded
upload()      Upload a file or directory to HDFS
walk()        Depth-first walk of remote filesystem
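As an illustration of how a few of these methods fit together, the following sketch lists every stored HTML page below a directory. The helper name and the default root are assumptions; `client` is expected to behave like an `hdfs.InsecureClient`.

```python
import posixpath


def list_html_files(client, root="/htmls"):
    """Collect the full HDFS paths of all .html files below root."""
    # Make sure the directory exists (parents are created as needed).
    client.makedirs(root)
    found = []
    # walk() yields (path, dirs, files) tuples, much like os.walk().
    for path, _dirs, files in client.walk(root):
        for name in files:
            if name.endswith(".html"):
                found.append(posixpath.join(path, name))
    return sorted(found)
```

With a real cluster the client would be created first, for example `InsecureClient('http://namenode:50070', user='user')`, where the host, port, and user are placeholders.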

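Storing to HDFS

from hdfs import InsecureClient
from hdfs.util import HdfsError

# WebHDFS endpoint of the NameNode; [host], [port] and the user are placeholders
hdfs_client = InsecureClient('http://[host]:[port]', user='user')

try:
    # filename and html_page come from the crawler; encoding allows writing str
    with hdfs_client.write('/htmls/mfw/%s.html' % filename, encoding='utf-8') as writer:
        writer.write(html_page)
except HdfsError as error:
    print(error)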

HBase

Runs on top of HDFS

Column-oriented database

Can store very large volumes of raw data

Key-value model

HDFS and HBase

HBase

HBase is a column-oriented database, and the tables in it are sorted by row key. The table schema defines only column families, which hold key-value pairs. A table can have multiple column families, and each column family can have any number of columns. Subsequent column values are stored contiguously on disk. Each cell value in the table has a timestamp. In short, in HBase:

1. A table is a collection of rows

2. A row is a collection of column families

3. A column family is a collection of columns

4. A column is a collection of key-value pairs
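The nesting described above can be made concrete with a toy in-memory model. This is only a sketch of the data model, not of how HBase stores or serves data; the class, the row keys, and the family names are invented for illustration.

```python
class TinyTable:
    """Toy model: row -> column family -> column -> {timestamp: value}."""

    def __init__(self, families):
        self.families = set(families)  # the schema fixes only the families
        self.rows = {}

    def put(self, row, family, column, value, timestamp):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        cells = self.rows.setdefault(row, {}).setdefault(family, {})
        cells.setdefault(column, {})[timestamp] = value

    def get(self, row, family, column):
        """Return the value with the newest timestamp, or None."""
        versions = self.rows.get(row, {}).get(family, {}).get(column, {})
        return versions[max(versions)] if versions else None

    def row_keys(self):
        """Row keys in sorted order, mirroring a table sorted by row."""
        return sorted(self.rows)


pages = TinyTable(families=["content", "meta"])
pages.put("com.example/b", "content", "html", "<html>v1</html>", timestamp=1)
pages.put("com.example/b", "content", "html", "<html>v2</html>", timestamp=2)
pages.put("com.example/a", "meta", "depth", 1, timestamp=1)
```

Columns can be added freely inside a declared family, and reading a cell returns its most recent version.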

Distributed crawler system: storage

Distributed crawler: database

MongoDB

Schema-less: MongoDB is a document database in which one collection holds different documents. The number of fields, content, and size of a document can differ from one document to another.

The structure of a single object is clear.

No complex joins.

Deep query-ability: MongoDB supports dynamic queries on documents using a document-based query language that is nearly as powerful as SQL.

Ease of scale-out: MongoDB is easy to scale.

No conversion or mapping of application objects to database objects is needed.

Installation

Download

Setup

mkdir mongodb

tar xzvf mongodb-linux-x86_64-amazon-3.4.2.tgz -C mongodb

Client

mongo

MongoDB

db.collection.findOneAndUpdate(filter, update, options)

Updates one document that satisfies the specified query criteria and returns it; by default the returned copy is the document as it was before the update.

If several documents match, the first one in natural order is chosen, which usually means insertion order.

The find and the update are performed atomically.

MongoClient (PyMongo) method:

db.spider.mfw.find_one_and_update()
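Because the find and the update happen atomically, this call can hand each pending URL to exactly one crawler process. A minimal sketch, assuming each document carries a `status` field with the hypothetical values `"new"` and `"downloading"`, and that `collection` is a PyMongo collection such as `db.spider.mfw`:

```python
def dequeue_url(collection):
    """Atomically claim one pending URL document, or return None."""
    return collection.find_one_and_update(
        {"status": "new"},                    # filter: any unclaimed URL
        {"$set": {"status": "downloading"}},  # update: mark it as claimed
    )
```

Two workers calling `dequeue_url` at the same moment receive different documents, or `None` once nothing is pending.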

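Database types

Redis Overview

An in-memory database built on a key-value model

Supports rich data structures (Memcached supports only a few types)

Supports replication, which makes clustering possible (Memcached servers do not replicate or coordinate with each other; any distribution is handled by the client)

Every command executes atomically, and commands can be grouped into transactions (Memcached offers atomicity only per command, with no multi-command transactions)

Can persist data to disk (Memcached cannot)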

Redis Environment Setup

Download

$ tar xzf redis-3.2.7.tar.gz

$ cd redis-3.2.7

$ make

Start the server and the CLI

$ nohup src/redis-server &

$ src/redis-cli

Test it

redis> set foo bar

redis> get foo

"bar"

Python Redis

Installation

$ sudo pip install redis

Sample Code

>>> import redis

>>> r = redis.StrictRedis(host='localhost', port=6379, db=0)

>>> r.set('foo', 'bar')

True

>>> r.get('foo')

b'bar'

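Optimizing MongoDB

With the URL used as _id, an index is created on it by default, and maintaining an index carries extra cost.

Keep the index as simple as possible; URLs are on the long side.

The find_one() call in dequeueUrl does not use an index and scans the collection. It is still fast at first because the scan stops at the first match, but once the number of already-downloaded documents becomes very large the scan gets expensive, so consider whether this can be optimized further.

Inserts are very frequent: each page leads to several hundred inserts. At depth=3 the base number of pages is in the millions, so the existence checks on insert reach the hundreds of millions; consider a more efficient way to perform the check.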
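One possible way to keep the _id index small despite long URLs is to use a fixed-length hash of the URL as the key. The sketch below uses MD5 purely as an assumption; the function names and document fields are illustrative, and hash collisions, while improbable, are not ruled out.

```python
import hashlib


def url_key(url: str) -> str:
    """Map a URL of any length to a fixed 32-character hex key."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def make_document(url: str, depth: int) -> dict:
    # The hash serves as _id; the full URL is kept as an ordinary field.
    return {"_id": url_key(url), "url": url, "depth": depth, "status": "new"}
```

Inserting a document whose _id already exists fails with a duplicate-key error, so the existence check and the insert can be a single operation.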

Mongo with Redis

status: either create an index on it, or keep documents of each status in different collections

Code Snippet
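One way the two stores could be combined, sketched under assumptions: a Redis set answers the "have we seen this URL?" question in memory, and MongoDB is only touched for URLs that are new. The key name, the document fields, and the function name are hypothetical; `redis_client` is expected to behave like a `redis.StrictRedis` instance and `collection` like a PyMongo collection.

```python
SEEN_KEY = "seen_urls"  # hypothetical name of the Redis set


def enqueue_if_new(redis_client, collection, url, depth):
    """Insert url into MongoDB only if Redis has not seen it before."""
    # SADD returns the number of members actually added: 1 if new, 0 if known.
    if redis_client.sadd(SEEN_KEY, url) == 1:
        collection.insert_one({"_id": url, "depth": depth, "status": "new"})
        return True
    return False
```

Since SADD is atomic, two crawler processes that discover the same link at the same time cannot both insert it.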
