Automating the Web Using Selenium and Python

WasiUllah Khan

April 16, 2019

CPOL

Automated parsing of Pakwheels using Selenium and Python.

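Introduction

Selenium is a tool for automating web browsers. Developers use it to test their websites automatically instead of testing them by hand. It drives the site the same way a user would on their own system. Although it is best known for website testing, countless boring and repetitive web tasks can and should be automated with it as well.

Let's get started.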

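Illustration

The program opens the browser and browses to pakwheels.com.
It then navigates to their Used Cars section and enters the following queries:
- Car make or model: Honda Civic
- City: Islamabad
- Year range: 2008 - 2012
- Price range: 10 - 18 lakh
The results page showed 29 ads on its main page that matched the query.
The program collected the links to those 29 ads.
It saves the required data to an Excel spreadsheet.
Finally, it closes the browser.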

Technologies Used

Selenium: version 3.141.0
xlwt: version 1.3.0
Python: version 3.6.5

Prerequisites

The latest version of Firefox
Python 3.6.5 installed

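Setup

Install geckodriver

Note

The following command is for the Homebrew package manager.
geckodriver should be placed in /usr/local/bin/; otherwise, you must specify the path to where it is located.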
```
$ brew install geckodriver
```
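
To check whether geckodriver can actually be found before launching the browser, something like the following sketch may help. It uses only the standard library; `find_geckodriver` and its `extra_dirs` parameter are hypothetical names introduced for this example and are not part of the project.

```python
import shutil


def find_geckodriver(extra_dirs=()):
    """Return the full path to geckodriver, or None if it cannot be found.

    Searches the directories on PATH first, then any extra directories given.
    """
    found = shutil.which("geckodriver")
    if found:
        return found
    for directory in extra_dirs:
        candidate = shutil.which("geckodriver", path=directory)
        if candidate:
            return candidate
    return None


driver_path = find_geckodriver(extra_dirs=["/usr/local/bin"])
if driver_path is None:
    print("geckodriver not found; install it or pass its location explicitly")
else:
    print("geckodriver found at", driver_path)
```

With Selenium 3.141.0, the version used here, a path found this way can be handed to the driver as `webdriver.Firefox(executable_path=driver_path, options=options)`.
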
requirements.txt: lists all the dependencies needed to run the program.
```
$ pip3 install -r requirements.txt
```
Run
```
$ python3 app.py
```

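Code Overview

In the initialization block, I import all the required modules and packages, create the Selenium objects such as driver and options, and create an Excel workbook with the help of xlwt.

Note

I have enabled headless mode because it uses fewer resources.
To disable it, comment out options.add_argument('-headless').

The source code is as follows: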

from selenium import webdriver
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select
import xlwt

options = webdriver.FirefoxOptions()
options.add_argument('-headless')  # To disable headless mode, comment this line
driver = webdriver.Firefox(options=options)  # Launch Firefox with the options above
workbook = xlwt.Workbook()  # Workbook that will hold the scraped data
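
Instead of commenting a line in and out, the headless switch could be driven by an environment variable. The sketch below only decides which arguments to use, so it runs without Selenium; `firefox_arguments` and the `SHOW_BROWSER` variable are hypothetical names invented for this example.

```python
import os


def firefox_arguments(environ=None):
    """Return the Firefox command-line arguments to use for this run."""
    environ = os.environ if environ is None else environ
    # SHOW_BROWSER is a hypothetical variable name chosen for this sketch.
    show_browser = environ.get("SHOW_BROWSER", "0") == "1"
    return [] if show_browser else ["-headless"]


print(firefox_arguments({}))                     # ['-headless']
print(firefox_arguments({"SHOW_BROWSER": "1"}))  # []
```

Each returned argument would then be passed to `options.add_argument(...)` before the driver is created.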

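The program has three functions in total.

navigation()

The following function navigates to the results page for used "Honda Civic" cars. It:

Opens the browser and browses to pakwheels
Waits for the popup to appear and closes it
Clicks on Used Cars
Enters the queries
Clicks on search
Goes to the main page of the query results

The source code is as follows: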

"""
    - The function below, when called,
      opens up the browser and goes to www.pakwheels.com and then navigates to their
      used cars section.
      
    - It then inserts queries into their query bar such as:
      Car Make Model
      City
      Price Range
      From - To Year
      
    - Finally it clicks on the search button and the page with the inserted criteria opens 
"""

def navigation():
    driver.get("https://www.pakwheels.com")
    # Wait until the popup's cancel button is clickable, then dismiss the popup
    WebDriverWait(driver, 500).until(
        EC.element_to_be_clickable((By.ID, 'onesignal-popover-cancel-button')))
    driver.find_element(By.ID, 'onesignal-popover-cancel-button').click()
    time.sleep(5)  # Fixed pause before continuing
    driver.find_element(By.LINK_TEXT, 'Used Cars').click()
    driver.find_element(By.ID, 'more_option').click()
    # Car make/model, then the city from the dropdown
    driver.find_element(By.NAME, 'home-query').send_keys('Honda Civic')
    driver.find_element(By.CLASS_NAME, 'chzn-single').click()
    driver.find_element(By.ID, 'UsedCity_chzn_o_4').click()
    # Price range: open the filter, fill both ends, close it again
    pr_range = driver.find_element(By.ID, 'pr-range-filter')
    pr_range.click()
    driver.find_element(By.ID, 'pr_from').send_keys('10')
    driver.find_element(By.ID, 'pr_to').send_keys('18')
    pr_range.click()
    # Year range
    yr_from = Select(driver.find_element(By.ID, 'YearFrom'))
    yr_to = Select(driver.find_element(By.ID, 'YearTo'))
    yr_from.select_by_value('2008')
    yr_to.select_by_value('2012')
    driver.find_element(By.ID, 'used-cars-search-btn').click()
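
The `WebDriverWait(...).until(...)` call in navigation() keeps checking a condition until it holds or the timeout expires. To make that idea concrete, here is a simplified stand-in written in plain Python. It is not Selenium's implementation, and `wait_until`, `popup_ready`, and `attempts` are hypothetical names used only for this sketch.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)


# Stand-in for "is the popup button clickable yet?": true on the third check.
attempts = {"count": 0}


def popup_ready():
    attempts["count"] += 1
    return attempts["count"] >= 3


wait_until(popup_ready, timeout=2.0, poll_interval=0.01)
print("condition met after", attempts["count"], "checks")  # 3 checks
```

The same polling approach is usually a more reliable choice than a fixed `time.sleep(...)`, because it continues as soon as the page is ready.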

get_car_links()

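I call the navigation() function inside this function so that Selenium can navigate to the required page.

The function does the following:

Creates an empty list called links.
Gets all the anchor tags of the Honda Civic ads.
Through a for loop, appends the href of each anchor tag to the links list.
Finally, returns the list.

The source code is as follows: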

'''

    - This function gets all the required links.
    
    - First, it gets all the anchor tags that hold the required links
      (class = "car-name ad-detail-path")
    
    - Then, through the for-loop, it reads the href of each anchor tag
      one by one, appends it to a list called links,
      and returns that list.

'''

def get_car_links():
    navigation()  # Calling the navigation func in get_car_links func
    links = []
    elems = driver.find_elements(By.XPATH, '//a[@class = "car-name ad-detail-path"]')
    for elem in elems:
        links.append(elem.get_attribute('href'))

    return links
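
The selection logic in get_car_links() can be tried offline, without a browser. The sketch below uses the standard library's `html.parser` on made-up markup. `AdLinkParser` and `sample_html` are hypothetical, and the real results page is certainly more complex than this sample.

```python
from html.parser import HTMLParser


class AdLinkParser(HTMLParser):
    """Collect href values from <a> tags whose class attribute matches exactly."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        if attributes.get("class") == self.wanted_class and attributes.get("href"):
            self.links.append(attributes["href"])


# Made-up markup for illustration only.
sample_html = """
<ul>
  <li><a class="car-name ad-detail-path" href="/used-cars/example-ad-1">Ad 1</a></li>
  <li><a class="other-link" href="/about">About</a></li>
  <li><a class="car-name ad-detail-path" href="/used-cars/example-ad-2">Ad 2</a></li>
</ul>
"""

parser = AdLinkParser("car-name ad-detail-path")
parser.feed(sample_html)
print(parser.links)  # ['/used-cars/example-ad-1', '/used-cars/example-ad-2']
```

Like the XPath in get_car_links(), this compares the whole class attribute, so an anchor with extra or reordered classes would not match. One difference is that Selenium's `get_attribute('href')` returns the resolved absolute URL, whereas this parser returns the attribute text as written.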

scrape_output()

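The last function takes the list of links and does the following:

Creates a sheet in the workbook
Adds the headers for the data that will go into the workbook
Loops over the links with a for loop and, for each one,
- Opens the link
- Scrapes the data
  - Phone number
  - Mileage
  - Year
  - Transmission
  - Price
  - Engine capacity
  - Color
  - Registration city
- Writes the scraped data to the workbook together with the ad's link.
These steps repeat until every link in the list has been processed.
Finally, it saves the workbook as output.xls.

The source code is as follows: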

'''
  - The last function takes the list returned by the function above.
  - A new workbook is created and a sheet is added by the name of Report.
  - The function scrapes the required data:
        phone number
        mileage
        car_year
        transmission
        price
        engine_capacity
        color
        registration
    and finally writes it to the worksheet with the ads' link as well.
'''

def scrape_output(links):
    # Header row in bold
    style = xlwt.easyxf('font: bold 1')
    sheet1 = workbook.add_sheet('Report')
    sheet1.write(0, 0, 'Phone Number', style)
    sheet1.write(0, 1, 'Mileage', style)
    sheet1.write(0, 2, 'Year', style)
    sheet1.write(0, 3, 'Transmission', style)
    sheet1.write(0, 4, 'Price', style)
    sheet1.write(0, 5, 'Engine Capacity', style)
    sheet1.write(0, 6, 'Color', style)
    sheet1.write(0, 7, 'Registration', style)
    sheet1.write(0, 8, 'Link', style)
    row = 1
    for link in links:
        driver.get(link)
        # Click the button that reveals the phone number
        driver.find_element(
            By.XPATH,
            '//button[@class = "btn btn-large btn-block btn-success phone_number_btn"]'
        ).click()
        phone_number = driver.find_element(
            By.XPATH,
            '//*[@id="scrollToFixed"]/div[2]/div[1]/button[1]/span').text
        mileage = driver.find_element(
            By.XPATH,
            '//*[@id="scroll_car_info"]/table/tbody/tr/td[2]/p').text
        car_year = driver.find_element(
            By.XPATH,
            '/html/body/div[2]/section[2]/div/div[2]/div[1]/div/table/tbody/tr/td[1]/p').text
        transmission = driver.find_element(
            By.XPATH,
            '//*[@id="scroll_car_info"]/table/tbody/tr/td[4]/p').text
        price = driver.find_element(
            By.XPATH,
            '//*[@id="scrollToFixed"]/div[2]/div[1]/div/strong').text
        engine_capacity = driver.find_element(
            By.CSS_SELECTOR,
            '#scroll_car_detail > li:nth-child(8)').text
        color = driver.find_element(
            By.CSS_SELECTOR,
            '#scroll_car_detail > li:nth-child(4)').text
        registration = driver.find_element(
            By.CSS_SELECTOR,
            '#scroll_car_detail > li:nth-child(2)').text

        # One spreadsheet row per ad, with the ad's link in the last column
        sheet1.write(row, 0, phone_number)
        sheet1.write(row, 1, mileage)
        sheet1.write(row, 2, car_year)
        sheet1.write(row, 3, transmission)
        sheet1.write(row, 4, price)
        sheet1.write(row, 5, engine_capacity)
        sheet1.write(row, 6, color)
        sheet1.write(row, 7, registration)
        sheet1.write(row, 8, link)
        row += 1
    workbook.save('output.xls')
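
The repeated `sheet1.write(...)` calls could also be expressed as a loop over the columns. The following is only a sketch of that idea. `HEADERS`, `write_row`, and `RecordingSheet` are hypothetical names, and `RecordingSheet` is a stand-in that mimics just the `write(row, column, value, style)` call so the example runs without xlwt.

```python
HEADERS = ["Phone Number", "Mileage", "Year", "Transmission", "Price",
           "Engine Capacity", "Color", "Registration", "Link"]


def write_row(sheet, row_index, values, style=None):
    """Write one row of values, one per column, starting at column 0."""
    for column, value in enumerate(values):
        if style is None:
            sheet.write(row_index, column, value)
        else:
            sheet.write(row_index, column, value, style)


class RecordingSheet:
    """Stand-in for an xlwt worksheet that just remembers what was written."""

    def __init__(self):
        self.cells = {}

    def write(self, row, column, value, style=None):
        self.cells[(row, column)] = value


sheet = RecordingSheet()
write_row(sheet, 0, HEADERS, style="bold")   # header row
write_row(sheet, 1, ["n/a"] * len(HEADERS))  # placeholder data row
print(sheet.cells[(0, 1)], "|", len(sheet.cells), "cells written")
# Mileage | 18 cells written
```

In the real program, `sheet1` and the xlwt style object would be passed in place of the stand-ins.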

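Now for the last part of the code.

In the try block, I chain get_car_links() and scrape_output() together: the list returned by the first is passed to the second.

The finally block then closes the browser, whether or not the scraping succeeded.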

try:
    print('Starting')
    car_links = get_car_links()
    scrape_output(car_links)
finally:
    print('Done')
    driver.quit()
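
To see why the browser is closed even when scraping fails, the try/finally pattern can be exercised with a stand-in driver. `FakeDriver`, `run`, and `failing_job` are hypothetical names for this sketch. Nothing here touches Selenium.

```python
class FakeDriver:
    """Stand-in for the Selenium driver; only tracks whether quit() was called."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


def run(driver, job):
    """Run job(driver) and always quit the driver afterwards."""
    try:
        return job(driver)
    finally:
        driver.quit()


def failing_job(driver):
    raise RuntimeError("simulated scraping failure")


fake = FakeDriver()
try:
    run(fake, failing_job)
except RuntimeError as error:
    print("job failed:", error)   # job failed: simulated scraping failure
print("browser closed:", fake.closed)  # browser closed: True
```

Without the finally block, an exception partway through the links would leave a headless Firefox process running in the background.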

Note

This project can also be viewed on my GitHub repository.