Introduction

Our first crawler starts from a snippet of HTML. Later posts will cover crawling by URL, crawling behind a simulated login, and so on ^_^

Third-party libraries we need

BeautifulSoup
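Beautiful Soup is a Python library whose main job is extracting data from web pages (HTML or XML documents).
It works with the parser of your choice and gives you idiomatic ways to navigate, search, and modify the parsed document.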
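As a rough sketch of those three operations (navigating, searching, modifying), here is a tiny example. The HTML string is made up for illustration, and it uses Python's built-in 'html.parser' so that nothing beyond bs4 itself is needed.

```python
from bs4 import BeautifulSoup

# A tiny hypothetical document, parsed with the built-in 'html.parser'.
soup = BeautifulSoup('<div><p class="note">Hello <b>world</b></p></div>', "html.parser")

# Navigating: step through the tree by tag name.
bold = soup.div.p.b
print(bold.string)       # world
print(bold.parent.name)  # p

# Searching: look a tag up by name and class attribute.
note = soup.find("p", class_="note")
print(note.get_text())   # Hello world

# Modifying: change the text and an attribute in place.
bold.string = "Beautiful Soup"
note["class"] = "greeting"
print(note.get_text())   # Hello Beautiful Soup
```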
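Installing Beautiful Soup
1. With pip: pip install bs4 (this pulls in the beautifulsoup4 package, which you can also install directly by that name).
2. In PyCharm: File -> Settings -> Project Interpreter -> click the plus button on the right -> search for bs4 in the window that pops up and install it.
The example in this post passes 'lxml' as the parser. lxml is a separate package, so install it the same way (pip install lxml), or use Python's built-in 'html.parser' instead.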
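To confirm the install worked, something like the following can help. It is only a sketch using the standard library; is_installed is a helper name made up for this post, not part of Beautiful Soup.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


# "bs4" is the import name of Beautiful Soup; "lxml" backs the 'lxml' parser.
for name in ("bs4", "lxml"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either one shows up as missing, check that pip and your script are using the same interpreter (in PyCharm, the one selected under Project Interpreter).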

Example

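The code below builds a BeautifulSoup object so the library can parse the HTML for us. Once the HTML is parsed, we can pull out the parts that are useful to us.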

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
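soup = BeautifulSoup(html_doc, 'lxml')  # create the BeautifulSoup object using the lxml parser
find = soup.find('p')  # find() returns the first <p> tag in the document
print("find's return type is ", type(find))  # print the type of the return value
print("find's content is", find)  # print the tag that find() returned
print("find's Tag Name is ", find.name)  # print the tag's name
print("find's Attribute(class) is ", find['class'])  # print the value of the tag's class attribute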

The output is as follows:

find's return type is  <class 'bs4.element.Tag'>
find's content is <p class="title"><b>The Dormouse's story</b></p>
find's Tag Name is  p
find's Attribute(class) is  ['title']
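The example above only uses find, which stops at the first match. As a small follow-up sketch, find_all collects every matching tag, which is usually what a crawler wants when gathering links. This version reuses part of the same HTML and assumes the built-in 'html.parser' so it runs without lxml.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns every matching tag, not just the first one.
links = soup.find_all("a", class_="sister")
for link in links:
    # Attributes are read like dictionary keys; get_text() gives the visible text.
    print(link["id"], link.get_text(), link["href"])

# find returns None when nothing matches, so check before using the result.
missing = soup.find("table")
print(missing is None)
```

Note that class is multi-valued in HTML, which is why find['class'] printed a list (['title']) above, while href and id come back as plain strings.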

Closing

It has been a long time since I last wrote a blog post, so I'll take it slowly.