A simple Python crawler: Selenium WebDriver (PhantomJS) and BeautifulSoup

Posted on 2019-05-28 in python

1. Selenium

https://docs.seleniumhq.org/

https://phantomjs.org/

https://github.com/ariya/phantomjs

Selenium is a browser automation framework. It uses WebDriver to simulate user actions inside a browser and works with Chrome, Firefox, IE, and other browsers.

This post uses PhantomJS, a scriptable headless browser that runs in the background without opening a window.

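Development of the PhantomJS project is currently suspended.
Selenium recommends the headless mode of Chrome or Firefox instead.
However, when chromedriver starts in headless mode, a blank console window still pops up, which is annoying, so this post sticks with PhantomJS.
Note that webdriver.PhantomJS is deprecated in Selenium 3 and was removed in Selenium 4, so the code in this post needs Selenium 3.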

import time

from bs4 import BeautifulSoup
from selenium import webdriver

# # ====== chrome config ==========
# # chromedriver download: http://npm.taobao.org/mirrors/chromedriver/
# # or: scoop install chromedriver
# chrome_options = webdriver.ChromeOptions()
# chrome_options.headless = True  # run without a browser window
#
# chromedriver_path = r'C:\xxx\scoop\apps\chromedriver\current\chromedriver'
# driver = webdriver.Chrome(executable_path=chromedriver_path, options=chrome_options)

# ====== phantomjs config ========
# Install: scoop install phantomjs

phantomjs_path = r'C:\xxx\scoop\apps\PhantomJS\current\phantomjs'
driver = webdriver.PhantomJS(executable_path=phantomjs_path)

driver.get('https://www.baidu.com')
time.sleep(1)

print(driver.title)
print(driver.page_source)

# quit() closes every window and stops the PhantomJS process,
# so a separate close() call is not needed
driver.quit()
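
The fixed time.sleep(1) above is a guess at how long the page needs. A small polling helper can stop waiting as soon as the page reports that it has loaded. The sketch below is only an illustration: wait_for_ready is a hypothetical name that is not part of Selenium, and it assumes nothing about the driver except that it has an execute_script method. Selenium also ships its own WebDriverWait class in selenium.webdriver.support.ui for this kind of waiting.

```python
import time


def wait_for_ready(driver, timeout=10.0, poll=0.2):
    """Poll document.readyState until it is 'complete' or the timeout expires.

    `driver` is any object with an execute_script(js) method, such as a
    Selenium WebDriver. Returns True if the page reported 'complete' in
    time, False otherwise. (Hypothetical helper, not a Selenium API.)
    """
    deadline = time.monotonic() + timeout
    while True:
        if driver.execute_script('return document.readyState') == 'complete':
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

It would be called right after driver.get(...), for example wait_for_ready(driver, timeout=5). A readyState of 'complete' only covers the initial document, so content loaded later by JavaScript may still need an extra wait.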

2. BeautifulSoup

driver.get(url)

content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')  # alternative parser: 'lxml'

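# Match the first div whose class list includes "title"
# (a div with class="title main" also matches)
div = soup.find('div', {'class': 'title'})

# A list matches a div whose class list includes any of the listed values
div = soup.find('div', {'class': ['title', ' ']})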

# Get the title attribute of the <a> child tag inside the div
title = div.a.get('title')

# Get the text of the <a> child tag inside the div
text = div.a.text

# Match all div tags whose class list includes "sound"
sound_list = soup.find_all('div', {'class': ['sound', ' ']})
for sound in sound_list:
    print(sound)

# Match the first div whose class list includes "title" and whose id is "sound"
div = soup.find('div', {'class': ['title', ' '], 'id': 'sound'})
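
To see how these lookups behave without starting a browser, they can be tried on a small HTML string. The markup below is made up for this illustration and is not from any real page. It shows that a plain string already matches a div with several classes, and one possible way to require the class to be exactly "title".

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration.
html = '''
<div class="title main"><a title="First" href="#">first link</a></div>
<div class="title" id="sound"><a title="Second" href="#">second link</a></div>
<div class="sound big">clip 1</div>
<div class="sound">clip 2</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# A plain string matches any div whose class list includes 'title',
# so the first div (class="title main") is returned.
first = soup.find('div', {'class': 'title'})

# The list form returns the same tag here, because ' ' never equals a class name.
same = soup.find('div', {'class': ['title', ' ']})

# To require the class list to be exactly ['title'], filter with a function.
exact = soup.find(lambda tag: tag.name == 'div' and tag.get('class') == ['title'])

# Combine class and id, then read the attribute and text of the <a> child.
by_id = soup.find('div', {'class': 'title', 'id': 'sound'})
link_title = by_id.a.get('title')
link_text = by_id.a.text

# All divs whose class list includes 'sound'.
sounds = soup.find_all('div', {'class': 'sound'})
```

BeautifulSoup treats class as a multi-valued attribute, which is why first['class'] comes back as a list rather than a single string.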

3. Simulating page turning in the browser with Selenium (scrolling)

# == loading all pages ==
driver.get(url)

page_num = 10
for i in range(page_num):
    driver.execute_script('window.scrollBy(0, document.body.scrollHeight)')
    time.sleep(3)

content = driver.page_source
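
The loop above scrolls a fixed number of times, which can stop too early or keep scrolling after the page has finished loading. One alternative, sketched here under the assumption that the page grows taller each time new content is appended, is to scroll until document.body.scrollHeight stops changing. The helper name scroll_until_stable and its parameters are hypothetical, chosen for this example.

```python
import time


def scroll_until_stable(driver, pause=3.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any object with an execute_script(js) method, such as a
    Selenium WebDriver. Returns the number of scroll rounds performed.
    (Hypothetical helper, not a Selenium API.)
    """
    last_height = driver.execute_script('return document.body.scrollHeight')
    for rounds in range(1, max_rounds + 1):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
        time.sleep(pause)  # give the page time to load the next batch
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            return rounds  # nothing new was appended, so stop
        last_height = new_height
    return max_rounds
```

After the call, driver.page_source can be read as before. The max_rounds limit keeps the loop from running forever on pages that never stop loading, and pause plays the same role as the time.sleep(3) in the original loop.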