Web Browser Automation with Node.js

Sam__Khan

April 29, 2018

CPOL

Automating Craigslist parsing with Node.js and Selenium

Download source code - 4.3 KB

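Introduction

Selenium is a suite of tools that enables cross-platform web browser automation. It is widely used for automated testing of websites and web apps, but its use is not limited to testing: other frequent, boring, repetitive and time-consuming web activities can and should be automated as well.

This post cuts straight to the chase and shows how to automate a given use case with one of Selenium's components, WebDriver. Read on.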

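Use Case

Fetch the used Honda Civic ads posted on Los Angeles Craigslist and collate the relevant information in a spreadsheet

Navigate to the Craigslist Los Angeles page (https://losangeles.craigslist.org/)
Under the "for sale" section, click the "cars+trucks" link
On the next page, click the "BY-OWNER ONLY" link
On the next page, type "Honda civic" in the "MAKE AND MODEL" textbox; a link will appear, click it
- The main search page will display 120 ads.
- Go through the first 5 ads and scrape the following fields
  - Title
  - Transmission
  - Fuel type
  - Odometer
  - Ad link
- Collate the scraped information in a spreadsheet (the number of ads processed, 5, can be changed with a small tweak to the code; the code comments clearly point out where to change it)
- After the first 5 ads have been processed, save the spreadsheet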

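Technologies Used

Selenium WebDriver: It helps automate browsers (Chrome, Firefox, Internet Explorer, Safari, etc.). It "drives" the browser natively, the way a user would on their own system. For this implementation I chose the Firefox (geckodriver) webdriver.
Node.js: The programming language of choice here is JavaScript, and the runtime is Node.js.
Exceljs: A utility used to read/write/create/manipulate Excel spreadsheets.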

Setup

Install geckodriver
```
$ npm install -g geckodriver
```
package.json: it already lists exceljs and selenium-webdriver as dependencies
Install the dependencies from package.json
```
$ npm install
```
Run
```
$ node app.js
```

Code Overview

The initialization block contains the usual stuff: it creates the selenium objects (webdriver, By, until, firefox, firefoxOptions and driver) along with the excel object (via `require('exceljs')`).

/*
    Initializing and building the selenium webdriver with firefox options
    along with the exceljs object that will later be used to create the 
    spreadsheet
*/

const webdriver = require('selenium-webdriver'),
    By = webdriver.By,
    until = webdriver.until;

const firefox = require('selenium-webdriver/firefox');

const firefoxOptions = new firefox.Options();

/*
    Path to FF bin
*/
firefoxOptions.setBinary('/Applications/Firefox.app/Contents/MacOS/firefox-bin');
/*
    Uncomment the following line to enable headless browsing
*/
//firefoxOptions.headless();


const driver = new webdriver.Builder()
    .forBrowser('firefox')
    .setFirefoxOptions(firefoxOptions)
    .build();

const excel = require('exceljs')
/*
    End of initialization
*/

Note: to enable headless browsing (no browser window pops up when this option is on), uncomment the following line

/*
    Uncomment the following line to enable headless browsing
*/
//firefoxOptions.headless();
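
Rather than editing the source each time, the headless switch could be driven by an environment variable. The sketch below is only an illustration: the `HEADLESS` variable name and the `applyHeadless` helper are assumptions and not part of the original app. It calls the same `headless()` method the article uses on the options object; newer releases of selenium-webdriver may expose headless mode differently, so check the version pinned in package.json.

```javascript
/*
    Hypothetical helper (not in the original app): turns on headless mode
    when the HEADLESS environment variable is set to "1" or "true".
    `options` is expected to be the firefoxOptions object created above.
*/
function applyHeadless(options, env = process.env) {
    const flag = String(env.HEADLESS || '').toLowerCase()
    const enabled = flag === '1' || flag === 'true'
    if (enabled) {
        options.headless()
    }
    // Report the decision so the caller can log it if desired
    return enabled
}
```

With this in place, `applyHeadless(firefoxOptions)` would be called before the driver is built, and the app started with `HEADLESS=1 node app.js`.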

The rest of the code consists of three async functions

getcarlinks

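The following function retrieves the ad links on the first page (120 of them) and returns them in an array. Here is a further logical breakdown of the function

LA Craigslist main page ->
cars+trucks ->
By-Owner Only ->
auto make model = "honda civic"
On the main search page, collect all car ad links and return them as an array

Source code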

/*
    The following method retrieves the ad links on the first page, 120 of them
    LA Craigslist main page -> 
        cars+trucks -> 
        By-Owner Only -> 
        auto make model = "honda civic"
*/
async function getcarlinks() {

    await driver.get('https://losangeles.craigslist.org/')
    await driver.findElement(By.linkText('cars+trucks')).click()
    await driver.findElement(By.linkText('BY-OWNER ONLY')).click()
    await driver.findElement(By.name('auto_make_model')).sendKeys('honda civic')
    /*
        It is important to note that once the string "honda civic" is typed 
        into the auto_make_model textbox, it turns into a link that needs to be 
        clicked in order for the honda civic specific ads page to load. The 
        following call handles the click once the string "honda civic" 
        has turned into a link
    */
    await driver.wait(until.elementLocated(By.linkText('honda civic')), 50000)
        .then(
            elem => elem.click()
        )
    
    /*
        class 'result-info' helps in retrieving all those webelements that contain 
        the car ad link
    */
    let elems = await driver.findElements(By.className('result-info'))
    /*
        further parsing of the webelements to obtain the anchor ('a') tags
    */
    let linktagarr = await Promise.all(elems.map(
        async anelem => await anelem.findElements(By.tagName('a'))
    ))

    /*
        parse the actual links off the anchor tags into an array and return 
        the array
    */
    return await Promise.all(
        linktagarr.map(
            async anhref => await anhref[0].getAttribute('href')
        )
    )
}
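
One thing to watch for in `getcarlinks` is that the results page may still be loading when `findElements` runs right after the click. Below is a minimal sketch of one way to guard against that, assuming the same `result-info` class and using WebDriver's `until.elementsLocated` condition. The `collectAdLinks` name, its parameters and the 10-second timeout are illustrative choices, not part of the original code.

```javascript
/*
    Hypothetical variant of the link-collection step. driver, By and until
    are the objects created in the initialization block; they are passed in
    as parameters so the function does not depend on module-level state.
*/
async function collectAdLinks(driver, By, until, timeoutMs = 10000) {
    // Wait until at least one 'result-info' element is present
    const elems = await driver.wait(
        until.elementsLocated(By.className('result-info')),
        timeoutMs
    )
    const links = []
    for (const elem of elems) {
        const anchors = await elem.findElements(By.tagName('a'))
        // Skip result blocks that unexpectedly contain no anchor tag
        if (anchors.length > 0) {
            links.push(await anchors[0].getAttribute('href'))
        }
    }
    return links
}
```

The sequential loop trades a little speed for simplicity; the `Promise.all` approach in the original collects the same hrefs concurrently.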

processlinks

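This function:

Is passed the car links array obtained by the function above (getcarlinks).
Sets up a new workbook
Adds a new worksheet to the workbook, named 'CL Links Sheet'
Adds these columns to the worksheet: Sr Num, Title, Transmission, Fuel, Odometer and link to the car's ad page
For each link in the links array, up to the 5th element (otherwise the app would take a long time to process all 120 links; this limit can be changed to whatever number is deemed feasible), it does the following:
- Increments the sr (Sr Num) field in the spreadsheet
- 'Gets' (fetches) the given link
- Inside each ad page, looks for these: title, transmission, fuel, odometer and the link
- Adds a new row with the fetched/furnished info
- After processing the given links, saves the spreadsheet as: output.xlsx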
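
The five-ad cap described above can also be expressed by trimming the array before the loop rather than testing the index inside it. A small sketch of that idea follows; `MAX_ADS` and `limitLinks` are names made up for this illustration and do not appear in the app.

```javascript
/*
    Hypothetical helper: returns at most `max` links from the front of
    the array without modifying the original array.
*/
const MAX_ADS = 5

function limitLinks(links, max = MAX_ADS) {
    return links.slice(0, max)
}

// Example with seven placeholder links
const sample = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
console.log(limitLinks(sample))     // [ 'a', 'b', 'c', 'd', 'e' ]
console.log(limitLinks(sample, 2))  // [ 'a', 'b' ]
```

Keeping the limit in a single named constant makes it easier to find when the number needs to change.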

Source code

/*
    The following method:
    - Is passed a car links array
    - Sets up a new workbook
    - Adds a new worksheet to the workbook, named 'CL Links Sheet'
    - These columns are added to the worksheet: Sr Num, Title, Transmission, 
        Fuel, Odometer and link to the car's ad page
    - for each link in the links array all the way till 5 elements (otherwise 
    the app will take a long time to process all the 120 links, this setting 
    can be changed however to whichever number is deemed feasible), it does the 
    following:
        - Increments the sr (Sr Num) field in the spreadsheet 
        - 'gets' the given link
        - Inside each ad page, look for these: title, transmission, Fuel,  
            Odometer and the link
        - Add a new row with the fetched/furnished info
    - After processing the given links, it saves the spreadsheet with this 
        name: output.xlsx
    
*/

async function processlinks(links) {
    /* 
        init workbook, worksheet and the columns
    */
    const workbook = new excel.Workbook()
    let worksheet = workbook.addWorksheet('CL Links Sheet')
    worksheet.columns = [
        { header: 'Sr Num', key: 'sr', width: 5 },
        { header: 'Title', key: 'title', width: 25 },
        { header: 'Transmission', key: 'transmission', width: 25 },
        { header: 'Fuel', key: 'fuel', width: 25 },
        { header: 'Odometer', key: 'odometer', width: 25 },
        { header: 'link', key: 'link', width: 150 }
    ]

    /*
        end init
    */

    for (let [index, link] of links.entries()) {
        /*
            The following if condition limits the number of links to be processed.
            If removed, the loop will process all 120 links
        */
        if (index < 5) {
            let row = {}
            row.sr = ++index
            row.link = link
            await driver.get(link)
            let elems = await driver.findElements(By.className('attrgroup'))
            /*
                There are only two elements/sections that match the 'attrgroup' 
                className search criterion: the first one contains the title 
                info and the other contains the info related to the remaining 
                fields: transmission, fuel and odometer.
                As there are always going to be two attrgroup elements, 
                I have used the elems indexes directly rather than applying a 
                loop to iterate over the array
            */
            if (elems.length === 2) {
                /*
                    fetching row.title from elems[0]
                */
                row.title = await elems[0].findElement(By.tagName('span')).getText()
                /*
                    gathering the remaining spans from elems[1] index. These 
                    span tags contain the pieces of information we are looking for
                */
                let otherspans = await elems[1].findElements(By.tagName('span'))

                /*
                    Looping over each span and fetching the values associated with 
                    transmission, fuel and odometer
                */
                for (const aspan of otherspans) {
                    let text = await aspan.getText()
                    /*
                        An example of a span's text:
                            Odometer: 16000
                        The value is the piece after ':'.
                        The following regex separates the value from the 
                        complete string and leaves the result in an array
                        (or null when the text contains no ':')
                    */
                    let aspanval = text.match(/(?<=:).*/)
                    if (!aspanval) {
                        // No 'label: value' shape, nothing to extract
                        continue
                    }
                    if (text.toUpperCase().includes('TRANSMISSION')) {
                        row.transmission = aspanval.pop().trim()
                    }
                    else if (text.toUpperCase().includes('FUEL')) {
                        row.fuel = aspanval.pop().trim()
                    }
                    else if (text.toUpperCase().includes('ODOMETER')) {
                        row.odometer = aspanval.pop().trim()
                    }
                }
            }
            /*
                The given row is now furnished. It's time to add it to the 
                worksheet
            */
            worksheet.addRow(row).commit()
        }
    }
    /*
        All the rows in the worksheet are now furnished. Save the workbook now
    */
    await workbook.xlsx.writeFile('output.xlsx')
}
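
The span-parsing step inside the loop can be pulled out into a small pure function, which makes it easy to try without a browser. The sketch below assumes every span of interest has the `Label: value` shape shown in the code comment. `parseAttribute` is a hypothetical name, and it splits on the first colon instead of using the lookbehind regex.

```javascript
/*
    Hypothetical helper: splits a span's text such as 'Odometer: 16000'
    into a lower-cased label and a trimmed value. Returns null when the
    text contains no ':' at all.
*/
function parseAttribute(text) {
    const pos = text.indexOf(':')
    if (pos === -1) {
        return null
    }
    return {
        label: text.slice(0, pos).trim().toLowerCase(),
        value: text.slice(pos + 1).trim()
    }
}

console.log(parseAttribute('Odometer: 16000'))   // { label: 'odometer', value: '16000' }
console.log(parseAttribute('2012 honda civic'))  // null
```

Inside the loop, the returned `label` could then be compared against 'transmission', 'fuel' and 'odometer' to fill the matching row field.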

startprocessing

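This function chains getcarlinks and processlinks by calling them in sequence (under the hood, JS chains these async functions with promises). It is called to start the app; in other words, it is the entry function.

Source code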

/*
    The following method chains the getcarlinks and processlinks methods 
    by calling them in a sequence (under the hood, JS chains these 
    functions with promises)
*/

async function startprocessing() {
    try {

        let carlinks = await getcarlinks();
        await processlinks(carlinks);
        console.log('Finished processing')
        await driver.quit()
    }
    catch (err) {
        console.log('Exception occurred while processing, details are: ', err)
        await driver.quit()
    }
}

/*
    Starting the engines 
*/
startprocessing()
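
Because `driver.quit()` has to run on both the success and the failure path, the same flow can also be written with a `finally` block. This is a sketch under the assumption that the three steps are passed in as functions; `runWithCleanup` and its parameter names are illustrative and not part of the original app.

```javascript
/*
    Hypothetical variant of startprocessing. getLinks, processLinks and
    quit are functions supplied by the caller; quit is always invoked,
    whether the processing succeeds or throws.
*/
async function runWithCleanup(getLinks, processLinks, quit) {
    try {
        const links = await getLinks()
        await processLinks(links)
        console.log('Finished processing')
        return true
    } catch (err) {
        console.log('Exception occurred while processing, details are: ', err)
        return false
    } finally {
        // Runs after either branch above, so the browser is always closed
        await quit()
    }
}
```

In the app this could be invoked as `runWithCleanup(getcarlinks, processlinks, () => driver.quit())`.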

That's it. You can download the attached source code to test the app and extend its functionality to better suit your needs. You can also find the code on my GitHub page.

Important Links

Selenium website: https://www.seleniumhq.org/
WebDriver documentation: https://www.seleniumhq.org/docs/03_webdriver.jsp
WebDriver JS GitHub page: https://github.com/SeleniumHQ/selenium/wiki/WebDriverJs
Exceljs on npm: https://npmjs.net.cn/package/exceljs
Headless mode, MDN documentation: https://mdn.org.cn/en-US/Firefox/Headless_mode
Mozilla Geckodriver GitHub page: https://github.com/mozilla/geckodriver
You can also find this project in my GitHub repository: https://github.com/xeektech/samplenodeprojects/tree/master/craigslistparser