Python Web Scraping: A Worked Example of Scraping Page Content with BeautifulSoup and Storing It in a Database

After studying Python web scraping, I put together a fairly complete scraper example that I would like to share.

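Target URL: http://www.tipdm.com/cpzx/index.jhtml
Task: scrape the subheading, introduction, and hyperlink of each entry in the site's Product Center and store them in a database.
Database: MySQL. The database and table are created, and the rows inserted, directly from the code.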
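
As a small sketch of the data this task produces, each scraped entry can be viewed as a three-field record. The `Product` class name and the sample values are hypothetical and only illustrate the shape; the script itself stores each entry as a plain list.

```python
from typing import NamedTuple


class Product(NamedTuple):
    """One scraped entry; the class name is illustrative, not from the original script."""
    title: str  # subheading text, used as the primary key of the table
    intro: str  # short introduction shown under the subheading
    link: str   # href of the subheading link


# Hypothetical sample values, only to show the shape of a record.
sample = Product(title='Example product', intro='A short description.', link='/cpzx/example.jhtml')

# A NamedTuple is still a tuple, so it can be passed directly as query parameters.
print(tuple(sample))
```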

To keep it short, here is the code:


    import requests
    import pymysql
    from bs4 import BeautifulSoup
    
    
    def get_html_text(url):
        """Fetch a page and return its HTML as text."""
        # Send a browser-like User-Agent with the request.
        headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36'
        }
        html_result = requests.get(url=url, headers=headers, timeout=10)
        return html_result.text
    
    
    def get_title_link_intro(html_text_list):
        """Extract [title, intro, link] from every product block on every page."""
        result_list = []
        for html_text in html_text_list:
            result_bs = BeautifulSoup(html_text, 'lxml')
            # Each div.con under #t248 holds one product entry.
            search_con = result_bs.select('#t248 > div > div.con')
            for i_con in search_con:
                title_link = i_con.select('h1>a')[0]
                result_list.append([
                    title_link.text,              # title
                    i_con.select('div')[0].text,  # intro
                    title_link.attrs['href'],     # link
                ])
        return result_list
    
    
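    def connect_mysql():
        """Connect to the local MySQL server; return None on failure."""
        try:
            # Replace the credentials with those of your own MySQL server.
            connect = pymysql.connect(host='localhost', user='root', password='795247',
                                      port=3306, charset='utf8mb4')
            print('Connected to the database')
            return connect
        except pymysql.MySQLError:
            print('Failed to connect to the database')
            return None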
    
    
    def mk_DB_base(connect: pymysql.connections.Connection):
        """Create the pzkdb database and the products table if they do not exist."""
        cursor = connect.cursor()
        sql_crdDB_newdb = 'CREATE DATABASE IF NOT EXISTS pzkdb'
        sql_use = 'USE pzkdb;'
        # The utf8mb4_0900_ai_ci collation requires MySQL 8.0 or later.
        sql_crdTB_products = '''CREATE TABLE IF NOT EXISTS products(
                                 `title` varchar(255) ,
                                 `intro` varchar(255)  NULL DEFAULT NULL,
                                 `link` varchar(255)  NULL DEFAULT NULL,
                                 primary key (`title`)
                                ) ENGINE = InnoDB CHARACTER SET = utf8mb4 COLLATE = utf8mb4_0900_ai_ci ROW_FORMAT = Dynamic;
                                '''
        cursor.execute(sql_crdDB_newdb)
        cursor.execute(sql_use)
        cursor.execute(sql_crdTB_products)
        connect.commit()
        return True
    
    
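    def into_sql(connect: pymysql.connections.Connection, values_list, table_name='products'):
        # Insert one record into the table.
        # values_list is a one-dimensional sequence with three items:
        # title, intro (description), link (the URL behind the title).
        try:
            cursor = connect.cursor()
            # The table name cannot be a query parameter, so it is formatted in;
            # the three values are passed as parameters and escaped by pymysql.
            sql_insert = f'insert into {table_name} values(%s,%s,%s)'
            cursor.execute(sql_insert, values_list)
            print('Inserted one record:', values_list)
            return True
        except pymysql.err.IntegrityError:
            # The title is the primary key, so a repeated title lands here.
            print('This record already exists')
            return None
        except pymysql.MySQLError as err:
            print('Insert failed:', err)
            return None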
    
    
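    def into_list(connect: pymysql.connections.Connection, values_list):
        # Insert every record of a two-dimensional list into the table.
        # values_list is a list of records; each record has three items:
        # title, intro (description), link (the URL behind the title).
        try:
            for index_list in values_list:
                into_sql(connect=connect, values_list=index_list)
            print('All records inserted')
            return True
        except pymysql.MySQLError:
            return None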
    
    
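    if __name__ == '__main__':
        html_text_list = []
        # Fetch list pages 1 to 4 of the Product Center.
        for i in range(1, 5):
            html_text_list.append(get_html_text(url='http://www.tipdm.com/cpzx/index_' + str(i) + '.jhtml'))
        result = get_title_link_intro(html_text_list)
        connect = connect_mysql()
        # connect_mysql() returns None on failure, so only continue with a live connection.
        if connect is not None:
            mk_DB_base(connect)
            into_list(connect, result)
            connect.commit()
            connect.close()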
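
The insert logic in `into_sql` relies on the primary key on `title` to reject repeated entries. The sketch below shows the same pattern in isolation. As an assumption made only so it runs without a MySQL server, it swaps in the standard-library `sqlite3` module with an in-memory database; the `insert_product` helper and the sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption: used only
# so the sketch runs without a database server).
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE IF NOT EXISTS products (title TEXT PRIMARY KEY, intro TEXT, link TEXT)')


def insert_product(conn, row):
    """Insert one (title, intro, link) row; return False if the title already exists."""
    try:
        # sqlite3 uses ? placeholders; pymysql uses %s for the same purpose.
        conn.execute('INSERT INTO products VALUES (?, ?, ?)', row)
        return True
    except sqlite3.IntegrityError:
        # Raised when the primary key (title) is already present.
        return False


rows = [
    ('Product A', 'Intro A', '/a.jhtml'),
    ('Product B', 'Intro B', '/b.jhtml'),
    ('Product A', 'Intro A again', '/a2.jhtml'),  # duplicate title
]
results = [insert_product(conn, row) for row in rows]
conn.commit()

count = conn.execute('SELECT COUNT(*) FROM products').fetchone()[0]
print(results, count)
```

Passing the values as query parameters instead of formatting them into the SQL string lets the driver handle quoting, which matters here because scraped text can contain quotes.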

The extraction step can also be written with XPath:


    from lxml import etree


    def get_title_link_intro(html_text_list):
        """Extract [title, intro, link] from every product block using XPath."""
        result_list = []
        for html_text in html_text_list:
            result_lxml = etree.HTML(html_text, etree.HTMLParser(encoding='utf-8'))
            search_con = result_lxml.xpath('//div[@class="con"]')
            for i_con in search_con:
                # xpath() returns a list for node selections, which cannot be
                # inserted as a column value; string(...) returns the text of
                # the first match as a plain str instead.
                result_list.append([
                    i_con.xpath('string(h1/a)'),        # title
                    i_con.xpath('string(div[1])'),      # intro
                    i_con.xpath('string(h1/a/@href)'),  # link
                ])
        return result_list
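
The pattern used above is to select each container block first and then read the fields with paths relative to that block. The sketch below shows that pattern on a tiny document. As an assumption made so it runs with the standard library only, it uses `xml.etree.ElementTree`, which supports just a subset of XPath and needs well-formed markup, rather than lxml; the `SAMPLE` markup is hypothetical and only imitates the structure the selectors expect.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet imitating the structure of the product list.
SAMPLE = """
<html><body>
  <div class="con">
    <h1><a href="/cpzx/a.jhtml">Product A</a></h1>
    <div>Intro A</div>
  </div>
  <div class="con">
    <h1><a href="/cpzx/b.jhtml">Product B</a></h1>
    <div>Intro B</div>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)
records = []
# Step 1: select every container block.
for con in root.findall(".//div[@class='con']"):
    # Step 2: read the fields with paths relative to the block.
    anchor = con.find('h1/a')
    records.append([anchor.text, con.findtext('div'), anchor.get('href')])

print(records)
```

Real pages are rarely well-formed XML, which is why the scraper itself parses with lxml's HTML parser.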

Personally, I find XPath a bit easier to understand.

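PS: I am a university student majoring in big data technology and still learning. I will keep posting the practical tasks I complete along the way so that we can learn together; feedback is very welcome.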
