
Scraping Web Page Information with Python

I. A Quick Look at HTML Pages

1. Recommended browser:

Use Chrome; its Inspect Element panel shows both the HTML code and the CSS styles.

2. What a page is made of:

A page has three main layers: JavaScript handles behavior, HTML handles structure, and CSS handles presentation. A local copy usually mirrors this as three parts: html + images + css.

3. Common tags and structure

 <div></div>                    marks out a region
 <div class="aasdf"></div>      attaches a style class
 <p>wowiji</p>                  paragraph text
 <li></li>                      list item
 <img>                          image
 <h1></h1> ... <h6></h6>        six heading levels
 <a href=""></a>                hyperlink

Tags can be nested inside one another.

4. Hands-on: build a page

Tool: PyCharm

Files: sample.html and main.css

Layout: header (title bar + navigation bar), content (main body), footer

5. Page preview

6. HTML source

 <!DOCTYPE html>
 <html lang="en">
 <head>
     <meta charset="UTF-8">
     <title>The blah</title>
     <link rel="stylesheet" type="text/css" href="main.css">
 </head>
 <body>
     <div class="header">
     <img src="images/blah.png">
     <ul class="nav">
         <li><a href="#">Home</a></li>
         <li><a href="#">Site</a></li>
         <li><a href="#">Other</a></li>
     </ul>
     </div>
     <div class="main-content">
     <h2>Article</h2>
     <ul class="article">
         <li>
             <img src="images/0001.jpg" width="100" height="90">
             <h3><a href="#">The blah</a></h3>
             <p>Say something</p>
         </li>
         <li>
             <img src="images/0002.jpg" width="100" height="90">
             <h3><a href="#">The blah</a></h3>
             <p>Say something</p>
         </li>
         <li>
             <img src="images/0003.jpg" width="100" height="90">
             <h3><a href="#">The blah</a></h3>
             <p>Say something</p>
         </li>
         <li>
             <img src="images/0004.jpg" width="100" height="90">
             <h3><a href="#">The blah</a></h3>
             <p>Say something</p>
         </li>
     </ul>
     </div>
     <div class="footer">
     <p>@xumeng</p>
     </div>
 </body>
 </html>

7. CSS source

 body {
     padding: 0 0 0 0;
     background-color: #ffffff;
     background-image: url(images/bg3-dark.jpg);
     background-position: top left;
     background-repeat: no-repeat;
     background-size: cover;
     font-family: Helvetica, Arial, sans-serif;
 }
 .main-content {
     width: 500px;
     padding: 20px 20px 20px 20px;
     border: 1px solid #dddddd;
     border-radius: 25px;
     margin: 30px auto 0 auto;
     background: #f1f1f1;
     -webkit-box-shadow: 0 0 22px 0 rgba(50, 50, 50, 1);
     -moz-box-shadow:    0 0 22px 0 rgba(50, 50, 50, 1);
     box-shadow:         0 0 22px 0 rgba(50, 50, 50, 1);
 }
 .main-content p {
     line-height: 26px;
 }
 .main-content h2 {
     color: dimgray;
 }
 .nav {
     padding-left: 0;
     margin: 5px 0 20px 0;
     text-align: center;
 }
 .nav li {
     display: inline;
     padding-right: 10px;
 }
 .nav li:last-child {
     padding-right: 0;
 }
 .header {
     padding: 10px 10px 10px 10px;
 }
 .header a {
     color: #ffffff;
 }
 .header img {
     display: block;
     margin: 0 auto 0 auto;
 }
 .header h1 {
     text-align: center;
 }
 .article {
     list-style-type: none;
     padding: 0;
 }
 .article li {
     border: 1px solid #f6f8f8;
     background-color: #ffffff;
     height: 90px;
 }
 .article h3 {
     border-bottom: 0;
     margin-bottom: 5px;
 }
 .article a {
     color: #37a5f0;
     text-decoration: none;
 }
 .article img {
     float: left;
     padding-right: 11px;
 }
 .footer {
     margin-top: 20px;
 }
 .footer p {
     color: #aaaaaa;
     text-align: center;
     font-weight: bold;
     font-size: 12px;
     font-style: italic;
     text-transform: uppercase;
 }
 .post {
     padding-bottom: 2em;
 }
 .post-title {
     font-size: 2em;
     color: #222;
     margin-bottom: 0.2em;
 }
 .post-avatar {
     border-radius: 50px;
     float: right;
     margin-left: 1em;
 }
 .post-description {
     font-family: Georgia, "Cambria", serif;
     color: #444;
     line-height: 1.8em;
 }
 .post-meta {
     color: #999;
     font-size: 90%;
     margin: 0;
 }
 .post-category {
     margin: 0 0.1em;
     padding: 0.3em 1em;
     color: #fff;
     background: #999;
     font-size: 80%;
 }
 .post-category-design {
     background: #5aba59;
 }
 .post-category-pure {
     background: #4d85d1;
 }
 .post-category-yui {
     background: #8156a7;
 }
 .post-category-js {
     background: #df2d4f;
 }
 .post-images {
     margin: 1em 0;
 }
 .post-image-meta {
     margin-top: -3.5em;
     margin-left: 1em;
     color: #fff;
     text-shadow: 0 1px 1px #333;
 }

8. Note:

There are ten images in total. Mind the relative paths: the CSS, HTML, and images folders sit in the same directory.

Note to self: this project lives at F:\Python实战:四周实现爬虫系统\作业代码\第一周\上课_1

II. Parsing Elements in a Local File

1. HTML source of the file to parse

 <html>
 <head>
     <link rel="stylesheet" type="text/css" href="new_blah.css">
 </head>
 <body>
     <div class="header">
     <img src="images/blah.png">
     <ul class="nav">
         <li><a href="#">Home</a></li>
         <li><a href="#">Site</a></li>
         <li><a href="#">Other</a></li>
     </ul>
     </div>
     <div class="main-content">
     <h2>Article</h2>
     <ul class="articles">
         <li>
             <img src="images/0001.jpg" width="100" height="91">
             <div class="article-info">
                 <h3><a href="www.sample.com">Sardinia's top 10 beaches</a></h3>
                 <p class="meta-info">
                     <span class="meta-cate">fun</span>
                     <span class="meta-cate">Wow</span>
                 </p>
                 <p class="description">white sands and turquoise waters</p>
             </div>
             <div class="rate">
                 <span class="rate-score">4.5</span>
             </div>
         </li>
         <li>
             <img src="images/0002.jpg" width="100" height="91">
             <div class="article-info">
                 <h3><a href="www.sample.com">How to get tanned</a></h3>
                 <p class="meta-info">
                     <span class="meta-cate">butt</span><span class="meta-cate">NSFW</span>
                 </p>
                 <p class="description">hot bikini girls on beach</p>
             </div>
             <div class="rate">
                 <img src="images/Fire.png" width="18" height="18">
                 <span class="rate-score">5.0</span>
             </div>
         </li>
         <li>
             <img src="images/0003.jpg" width="100" height="91">
             <div class="article-info">
                 <h3><a href="www.sample.com">How to be an Aussie beach bum</a></h3>
                 <p class="meta-info">
                     <span class="meta-cate">sea</span>
                 </p>
                 <p class="description">To make the most of your visit</p>
             </div>
             <div class="rate">
                 <span class="rate-score">3.5</span>
             </div>
         </li>
         <li>
             <img src="images/0004.jpg" width="100" height="91">
             <div class="article-info">
                 <h3><a href="www.sample.com">Summer's cheat sheet</a></h3>
                 <p class="meta-info">
                     <span class="meta-cate">bay</span>
                     <span class="meta-cate">boat</span>
                     <span class="meta-cate">beach</span>
                 </p>
                 <p class="description">choosing a beach in Cape Cod</p>
             </div>
             <div class="rate">
                 <span class="rate-score">3.0</span>
             </div>
         </li>
     </ul>
     </div>
     <div class="footer">
     <p>© Mugglecoding</p>
     </div>
 </body>
 </html>

2. CSS of the page to parse

 body {
     padding: 0 0 0 0;
     background-color: #ffffff;
     background-image: url(images/bg3-dark.jpg);
     background-position: top left;
     background-repeat: no-repeat;
     background-size: cover;
     font-family: Helvetica, Arial, sans-serif;
 }
 .main-content {
     width: 500px;
     padding: 20px 20px 20px 20px;
     border: 1px solid #dddddd;
     border-radius: 15px;
     margin: 30px auto 0 auto;
     background: #fdffff;
     -webkit-box-shadow: 0 0 22px 0 rgba(50, 50, 50, 1);
     -moz-box-shadow:    0 0 22px 0 rgba(50, 50, 50, 1);
     box-shadow:         0 0 22px 0 rgba(50, 50, 50, 1);
 }
 .main-content p {
     line-height: 26px;
 }
 .main-content h2 {
     color: #585858;
 }
 .articles {
     list-style-type: none;
     padding: 0;
 }
 .articles img {
     float: left;
     padding-right: 11px;
 }
 .articles li {
     border-top: 1px solid #F1F1F1;
     background-color: #ffffff;
     height: 90px;
     clear: both;
 }
 .articles h3 {
     margin: 0;
 }
 .articles a {
     color: #585858;
     text-decoration: none;
 }
 .articles p {
     margin: 0;
 }
 .article-info {
     float: left;
     display: inline-block;
     margin: 8px 0 8px 0;
 }
 .rate {
     float: right;
     display: inline-block;
     margin: 35px 20px 35px 20px;
 }
 .rate-score {
     font-size: 18px;
     font-weight: bold;
     color: #585858;
 }
 .rate-score-hot {
 }
 .meta-info {
 }
 .meta-cate {
     margin: 0 0.1em;
     padding: 0.1em 0.7em;
     color: #fff;
     background: #37a5f0;
     font-size: 20%;
     border-radius: 10px;
 }
 .description {
     color: #cccccc;
 }
 .nav {
     padding-left: 0;
     margin: 5px 0 20px 0;
     text-align: center;
 }
 .nav li {
     display: inline;
     padding-right: 10px;
 }
 .nav li:last-child {
     padding-right: 0;
 }
 .header {
     padding: 10px 10px 10px 10px;
 }
 .header a {
     color: #ffffff;
 }
 .header img {
     display: block;
     margin: 0 auto 0 auto;
 }
 .header h1 {
     text-align: center;
 }
 .footer {
     margin-top: 20px;
 }
 .footer p {
     color: #aaaaaa;
     text-align: center;
     font-weight: bold;
     font-size: 12px;
     font-style: italic;
     text-transform: uppercase;
 }

3. Parsing steps

(1) Parse the page with BeautifulSoup

(2) Describe where the target elements sit

(3) Extract the information from the tags and load it into a container for easy querying
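The three steps can be sketched end to end on a tiny inline document (the HTML string below is a made-up stand-in for the local file):

```python
from bs4 import BeautifulSoup

html = '''
<ul class="articles">
  <li><h3><a href="#">First</a></h3></li>
  <li><h3><a href="#">Second</a></h3></li>
</ul>
'''
# Step 1: parse the page
soup = BeautifulSoup(html, 'html.parser')
# Step 2: describe where the targets sit, as a CSS selector
links = soup.select('ul.articles > li > h3 > a')
# Step 3: pull the text out of the tags into a container
titles = [a.get_text() for a in links]
print(titles)  # ['First', 'Second']
```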

4. Parsing the page with BeautifulSoup

(1) Code

The standard call is soup = BeautifulSoup(html, 'lxml'): the first argument is the page content, the second the parser. The parsers BeautifulSoup supports are html.parser, lxml, lxml-xml (for XML), and html5lib.

 from bs4 import BeautifulSoup

 with open('F:/Python实战:四周实现爬虫系统/作业代码/第一周/上课_2/web/new_index.html','r') as wb_data:
     Soup = BeautifulSoup(wb_data,'lxml')
     print(Soup)

(2) Error 1:

can't import beautifulsoup

The BeautifulSoup library is not installed. Fix it from the command line:

pip install bs4

(3) Error 2:

bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

The lxml parser is not installed. Fix it from the command line:

pip install lxml
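As a fallback while lxml is unavailable, the parser that ships with Python itself also works; a minimal sketch:

```python
from bs4 import BeautifulSoup

html = '<html><body><p>hello</p></body></html>'
# 'html.parser' is part of the standard library, so it never raises
# FeatureNotFound; 'lxml' is faster but must be installed separately.
soup = BeautifulSoup(html, 'html.parser')
print(soup.p.get_text())  # hello
```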

(4) Result

    <html>
    <head>
    <link href="new_blah.css" rel="stylesheet" type="text/css"/>
    </head>
    <body>
    <div class="header">
    <img src="images/blah.png"/>
    <ul class="nav">
    <li><a href="#">Home</a></li>
    <li><a href="#">Site</a></li>
    <li><a href="#">Other</a></li>
    </ul>
    </div>
    <div class="main-content">
    <h2>Article</h2>
    <ul class="articles">
    <li>
    <img height="91" src="images/0001.jpg" width="100"/>
    <div class="article-info">
    <h3><a href="www.sample.com">Sardinia's top 10 beaches</a></h3>
    <p class="meta-info">
    <span class="meta-cate">fun</span>
    <span class="meta-cate">Wow</span>
    </p>
    <p class="description">white sands and turquoise waters</p>
    </div>
    <div class="rate">
    <span class="rate-score">4.5</span>
    </div>
    </li>
    <li>
    <img height="91" src="images/0002.jpg" width="100"/>
    <div class="article-info">
    <h3><a href="www.sample.com">How to get tanned</a></h3>
    <p class="meta-info">
    <span class="meta-cate">butt</span><span class="meta-cate">NSFW</span>
    </p>
    <p class="description">hot bikini girls on beach</p>
    </div>
    <div class="rate">
    <img height="18" src="images/Fire.png" width="18"/>
    <span class="rate-score">5.0</span>
    </div>
    </li>
    <li>
    <img height="91" src="images/0003.jpg" width="100"/>
    <div class="article-info">
    <h3><a href="www.sample.com">How to be an Aussie beach bum</a></h3>
    <p class="meta-info">
    <span class="meta-cate">sea</span>
    </p>
    <p class="description">To make the most of your visit</p>
    </div>
    <div class="rate">
    <span class="rate-score">3.5</span>
    </div>
    </li>
    <li>
    <img height="91" src="images/0004.jpg" width="100"/>
    <div class="article-info">
    <h3><a href="www.sample.com">Summer's cheat sheet</a></h3>
    <p class="meta-info">
    <span class="meta-cate">bay</span>
    <span class="meta-cate">boat</span>
    <span class="meta-cate">beach</span>
    </p>
    <p class="description">choosing a beach in Cape Cod</p>
    </div>
    <div class="rate">
    <span class="rate-score">3.0</span>
    </div>
    </li>
    </ul>
    </div>
    <div class="footer">
    <p>© Mugglecoding</p>
    </div>
    </body>
    </html>

5. Describing where to scrape

Positions are described with CSS selectors. To get one: select the element on the page → right-click → Inspect → right-click the highlighted node → Copy → Copy selector.

 # source
 from bs4 import BeautifulSoup

 with open('F:/Python实战:四周实现爬虫系统/作业代码/第一周/上课_2/web/new_index.html','r') as wb_data:
     Soup = BeautifulSoup(wb_data,'lxml')
     #print(Soup)
     print("获取第一张照片")
     #images = Soup.select('body > div.main-content > ul > li:nth-child(1) > img')
     # the copied selector above raises an error here; adjust it as the message suggests
     image1 = Soup.select('body > div.main-content > ul > li:nth-of-type(1) > img')
     print(image1)
     print("获取所有照片")
     # to get every photo, drop the positional part of the selector
     images = Soup.select('body > div.main-content > ul > li > img')
     # pick out the other pieces of information
     title = Soup.select('body > div.main-content > ul > li > div.article-info > h3 > a')
     score = Soup.select('body > div.main-content > ul > li > div.rate > span')
     selector = Soup.select('body > div.main-content > ul > li > div.article-info > p.meta-info > span')
     description = Soup.select('body > div.main-content > ul > li > div.article-info > p.description')
     print(images,title,score,selector,description,sep='\n----------------------------------\n')
    # output
    获取第一张照片
    [<img height="91" src="images/0001.jpg" width="100"/>]
    获取所有照片
    [<img height="91" src="images/0001.jpg" width="100"/>, <img height="91" src="images/0002.jpg" width="100"/>, <img height="91" src="images/0003.jpg" width="100"/>, <img height="91" src="images/0004.jpg" width="100"/>]
    ----------------------------------
    [<a href="www.sample.com">Sardinia's top 10 beaches</a>, <a href="www.sample.com">How to get tanned</a>, <a href="www.sample.com">How to be an Aussie beach bum</a>, <a href="www.sample.com">Summer's cheat sheet</a>]
    ----------------------------------
    [<span class="rate-score">4.5</span>, <span class="rate-score">5.0</span>, <span class="rate-score">3.5</span>, <span class="rate-score">3.0</span>]
    ----------------------------------
    [<span class="meta-cate">fun</span>, <span class="meta-cate">Wow</span>, <span class="meta-cate">butt</span>, <span class="meta-cate">NSFW</span>, <span class="meta-cate">sea</span>, <span class="meta-cate">bay</span>, <span class="meta-cate">boat</span>, <span class="meta-cate">beach</span>]
    ----------------------------------
    [<p class="description">white sands and turquoise waters</p>, <p class="description">hot bikini girls on beach</p>, <p class="description">To make the most of your visit</p>, <p class="description">choosing a beach in Cape Cod</p>]</span>

6. Filtering out the relevant information

 # print the results for every category
 from bs4 import BeautifulSoup

 with open('F:/Python实战:四周实现爬虫系统/作业代码/第一周/上课_2/web/new_index.html','r') as wb_data:
     Soup = BeautifulSoup(wb_data,'lxml')
     images = Soup.select('body > div.main-content > ul > li > img')
     titles = Soup.select('body > div.main-content > ul > li > div.article-info > h3 > a')
     scores = Soup.select('body > div.main-content > ul > li > div.rate > span')
     #selecs = Soup.select('body > div.main-content > ul > li > div.article-info > p.meta-info > span')
     selecs = Soup.select('body > div.main-content > ul > li > div.article-info > p.meta-info')
     descrs = Soup.select('body > div.main-content > ul > li > div.article-info > p.description')

 for title,image,desc,selec,score in zip(titles,images,descrs,selecs,scores):
     data = {
         #'selec': selec.get_text(),
         'selec': list(selec.stripped_strings),  # every string under the child elements
         'title': title.get_text(),
         'image': image.get('src'),
         'desc': desc.get_text(),
         'score': score.get_text()
     }
     print(data)
    # output
    {'selec': ['fun', 'Wow'], 'title': "Sardinia's top 10 beaches", 'image': 'images/0001.jpg', 'desc': 'white sands and turquoise waters', 'score': '4.5'}
    {'selec': ['butt', 'NSFW'], 'title': 'How to get tanned', 'image': 'images/0002.jpg', 'desc': 'hot bikini girls on beach', 'score': '5.0'}
    {'selec': ['sea'], 'title': 'How to be an Aussie beach bum', 'image': 'images/0003.jpg', 'desc': 'To make the most of your visit', 'score': '3.5'}
    {'selec': ['bay', 'boat', 'beach'], 'title': "Summer's cheat sheet", 'image': 'images/0004.jpg', 'desc': 'choosing a beach in Cape Cod', 'score': '3.0'}
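The commented-out get_text() and the stripped_strings actually used differ in shape; on a single meta-info paragraph (inline stand-in HTML):

```python
from bs4 import BeautifulSoup

p = BeautifulSoup('<p class="meta-info"><span>fun</span> <span>Wow</span></p>',
                  'html.parser').p
print(p.get_text())              # 'fun Wow'  -- one flattened string
print(list(p.stripped_strings))  # ['fun', 'Wow'] -- one entry per text node
```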
 # print the articles scoring above 3
 from bs4 import BeautifulSoup

 info = []
 with open('F:/Python实战:四周实现爬虫系统/作业代码/第一周/上课_2/web/new_index.html','r') as wb_data:
     Soup = BeautifulSoup(wb_data,'lxml')
     images = Soup.select('body > div.main-content > ul > li > img')
     titles = Soup.select('body > div.main-content > ul > li > div.article-info > h3 > a')
     scores = Soup.select('body > div.main-content > ul > li > div.rate > span')
     #selecs = Soup.select('body > div.main-content > ul > li > div.article-info > p.meta-info > span')
     selecs = Soup.select('body > div.main-content > ul > li > div.article-info > p.meta-info')
     descrs = Soup.select('body > div.main-content > ul > li > div.article-info > p.description')

 for title,image,desc,selec,score in zip(titles,images,descrs,selecs,scores):
     data = {
         #'selec': selec.get_text(),
         'selec': list(selec.stripped_strings),  # every string under the child elements
         'title': title.get_text(),
         'image': image.get('src'),
         'desc': desc.get_text(),
         'score': score.get_text()
     }
     info.append(data)

 for i in info:
     if float(i['score']) > 3:
         print(i['title'], i['score'])
    # output:
    Sardinia's top 10 beaches 4.5
    How to get tanned 5.0
    How to be an Aussie beach bum 3.5

III. Scraping a Real Page

Scraping TripAdvisor with Requests + BeautifulSoup

1. How the client and the server exchange data

(1) The HTTP protocol

Clicking a link sends a request to the server; a GET request looks like:

GET /page_one.html HTTP/1.1
Host: www.sample.com

Displaying the page is the response (carrying a status_code).

To watch this traffic: right-click → Inspect → Network

HTTP/1.0 methods: GET, POST, HEAD

HTTP/1.1 methods: GET, POST, HEAD, OPTIONS, PUT, CONNECT, TRACE, DELETE
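The request line above can be assembled by hand to see exactly what goes over the wire; a sketch (www.sample.com is the placeholder host from the example):

```python
method, path, version, host = 'GET', '/page_one.html', 'HTTP/1.1', 'www.sample.com'
# an HTTP request is just text: a request line, headers, then a blank line
request = f'{method} {path} {version}\r\nHost: {host}\r\n\r\n'
print(repr(request))
```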

(2) Install

pip install requests

2. Steps for parsing a real page

(1) Send the request with requests

(2) Fetch the whole page

 from bs4 import BeautifulSoup
 import requests

 url = 'https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'

 wb_data = requests.get(url, timeout=500)
 soup = BeautifulSoup(wb_data.text, 'lxml')
 print(soup)

(3) Describe the position of the element to scrape

 # the selector for one particular title
 from bs4 import BeautifulSoup
 import requests

 url = 'https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'

 wb_data = requests.get(url, timeout=500)
 soup = BeautifulSoup(wb_data.text, 'lxml')
 titles = soup.select('#taplc_attraction_coverpage_attraction_0 > div:nth-of-type(4) > div > div > div.shelf_item_container > div:nth-of-type(1) > div.poi > div > div.item.name > a')
 print(titles)

Result:

    [<a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="4|poi|272517" data-tpid="20" data-tpp="Attractions" href="/Attraction_Review-g60763-d272517-Reviews-Conservatory_Garden-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">温室花园</a>]

(4) Scrape every image of a characteristic size

 # scrape all images declared at the characteristic width
 from bs4 import BeautifulSoup
 import requests

 url = 'https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'

 wb_data = requests.get(url, timeout=500)
 soup = BeautifulSoup(wb_data.text, 'lxml')
 imgs = soup.select('img[width="200"]')
 print(imgs)
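What makes img[width="200"] work is that select() understands CSS attribute selectors: it keeps only images whose width attribute has exactly that value. A self-contained sketch:

```python
from bs4 import BeautifulSoup

html = '<img width="200" src="big.jpg"><img width="50" src="thumb.jpg">'
soup = BeautifulSoup(html, 'html.parser')
# only the image declared at width="200" survives the selector
srcs = [img.get('src') for img in soup.select('img[width="200"]')]
print(srcs)  # ['big.jpg']
```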

(5) Traverse the results into dictionaries

 # traverse the results as dictionaries
 from bs4 import BeautifulSoup
 import requests

 url = 'https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'

 wb_data = requests.get(url, timeout=500)
 soup = BeautifulSoup(wb_data.text, 'lxml')
 imgs = soup.select('img[width="200"]')
 titles = soup.select('#taplc_attraction_coverpage_attraction_0 > div > div > div > div.shelf_item_container > div:nth-of-type(1) > div.poi > div > div.item.name > a')

 for title, img in zip(titles, imgs):
     data = {
         'title': title.get_text(),
         'img': img.get('src'),
     }
     print(data)

3. Skipping the login step by passing headers in the request

 from bs4 import BeautifulSoup
 import requests
 import time

 url_saves = 'http://www.tripadvisor.com/Saves#37685322'
 url = 'https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'
 urls = ['https://cn.tripadvisor.com/Attractions-g60763-Activities-oa{}-New_York_City_New_York.html#ATTRACTION_LIST'.format(str(i)) for i in range(30,930,30)]

 # fill in your own browser's User-Agent and logged-in Cookie here
 headers = {
     'User-Agent': '',
     'Cookie': ''
 }


 def get_attractions(url, data=None):
     wb_data = requests.get(url)
     time.sleep(4)
     soup = BeautifulSoup(wb_data.text, 'lxml')
     titles = soup.select('div.property_title > a[target="_blank"]')
     imgs = soup.select('img[width="160"]')
     cates = soup.select('div.p13n_reasoning_v2')

     if data is None:
         for title, img, cate in zip(titles, imgs, cates):
             data = {
                 'title': title.get_text(),
                 'img': img.get('src'),
                 'cate': list(cate.stripped_strings),
             }
             print(data)


 def get_favs(url, data=None):
     wb_data = requests.get(url, headers=headers)
     soup = BeautifulSoup(wb_data.text, 'lxml')
     titles = soup.select('a.location-name')
     imgs = soup.select('div.photo > div.sizedThumb > img.photo_image')
     metas = soup.select('span.format_address')

     if data is None:
         for title, img, meta in zip(titles, imgs, metas):
             data = {
                 'title': title.get_text(),
                 'img': img.get('src'),
                 'meta': list(meta.stripped_strings)
             }
             print(data)


 for single_url in urls:
     get_attractions(single_url)

4. Anti-scraping measures

Just open Inspect → switch to the mobile view → parse that version instead (the site's protection is not very strict).
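In requests terms, "viewing as mobile" just means sending a phone browser's User-Agent header; the UA string below is an illustrative example, not one taken from the site:

```python
# an illustrative iPhone User-Agent; any real mobile browser string works
mobile_headers = {
    'User-Agent': ('Mozilla/5.0 (iPhone; CPU iPhone OS 13_0 like Mac OS X) '
                   'AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'),
}
# the mobile page would then be fetched with:
# wb_data = requests.get(url, headers=mobile_headers)
print('iPhone' in mobile_headers['User-Agent'])  # True
```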

IV. Fetching Dynamic, Asynchronously Loaded Data

1. Asynchronous loading

The page keeps loading new content without changing pages: JavaScript requests it in batches, separately from the initial page load.

2. Spotting the async data

Inspect → Network → XHR

Name: a new entry appears for each batch that loads → the dynamic request URL (page=x)

The Response brings back a fresh set of div tags, links included.
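Once the page=x pattern shows up in the XHR tab, the batch URLs can be generated directly instead of scrolling for them; a sketch:

```python
# build the paginated request URLs from the pattern seen in the XHR tab
base = 'https://knewone.com/discover?page={}'
urls = [base.format(n) for n in range(1, 4)]
print(urls)  # ...page=1, ...page=2, ...page=3
```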

3. Code

 from bs4 import BeautifulSoup
 import requests
 import time

 url = 'https://knewone.com/discover?page='

 def get_page(url, data=None):
     wb_data = requests.get(url)
     soup = BeautifulSoup(wb_data.text, 'lxml')
     imgs = soup.select('a.cover-inner > img')
     titles = soup.select('section.content > h4 > a')
     links = soup.select('section.content > h4 > a')

     if data is None:
         for img, title, link in zip(imgs, titles, links):
             data = {
                 'img': img.get('src'),
                 'title': title.get('title'),
                 'link': link.get('href')
             }
             print(data)

 # drive the page numbers ourselves
 def get_more_pages(start, end):
     for one in range(start, end):
         get_page(url + str(one))
         time.sleep(2)

 get_more_pages(1, 10)

V. Assignment: Scrape Product Listings

 from bs4 import BeautifulSoup
 import requests
 import time

 url = 'http://bj.58.com/pingbandiannao/24604629984324x.shtml'

 wb_data = requests.get(url)
 soup = BeautifulSoup(wb_data.text, 'lxml')


 def get_links_from(who_sells):
     urls = []
     list_view = 'http://bj.58.com/pbdn/{}/pn2/'.format(str(who_sells))
     wb_data = requests.get(list_view)
     soup = BeautifulSoup(wb_data.text, 'lxml')
     for link in soup.select('td.t a.t'):
         urls.append(link.get('href').split('?')[0])
     return urls


 def get_views_from(url):
     id = url.split('/')[-1].strip('x.shtml')
     api = 'http://jst1.58.com/counter?infoid={}'.format(id)
     # this is 58.com's view-counter API; if query APIs are new to you,
     # the Sina Weibo API documentation is a gentle introduction
     js = requests.get(api)
     views = js.text.split('=')[-1]
     return views
     # print(views)


 def get_item_info(who_sells=0):
     urls = get_links_from(who_sells)
     for url in urls:
         wb_data = requests.get(url)
         soup = BeautifulSoup(wb_data.text, 'lxml')
         data = {
             'title': soup.title.text,
             'price': soup.select('.price')[0].text,
             'area': list(soup.select('.c_25d')[0].stripped_strings) if soup.find_all('span', 'c_25d') else None,
             'date': soup.select('.time')[0].text,
             'cate': '个人' if who_sells == 0 else '商家',  # individual seller vs merchant
             # 'views': get_views_from(url)
         }
         print(data)


 # get_links_from(1)

 get_item_info()
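One caveat in get_views_from above: str.strip('x.shtml') treats its argument as a set of characters to remove from both ends, not as a literal suffix. It happens to leave these all-digit ids intact, but it would silently misbehave on any id that begins or ends with one of the characters x . s h t m l. Splitting explicitly is safer:

```python
url = 'http://bj.58.com/pingbandiannao/24604629984324x.shtml'
# strip() removes characters from a set, not a literal suffix:
fragile = url.split('/')[-1].strip('x.shtml')   # works here only by luck
# cutting at the '.' and dropping the trailing 'x' is explicit:
item_id = url.split('/')[-1].split('.')[0].rstrip('x')
print(item_id)  # 24604629984324
```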
