Chaining MLOps Pipelines with Jenkins Workflows

Sergio Virahonda

May 14, 2021

CPOL

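In the previous article, we set up Jenkins. In this one, we build the Jenkins workflows themselves.

In a previous series of articles, we explained how to code the scripts that our group of Docker containers executes as part of a CI/CD MLOps pipeline. In this series, we set up a Google Kubernetes Engine (GKE) cluster to deploy these containers.

This article series assumes that you are familiar with the basics of deep learning, DevOps, Jenkins, and Kubernetes.

In the previous article of this series, we configured Jenkins to help us chain the Docker containers into an actual pipeline, where the containers are automatically built, pushed, and run in the right order. In this article, we'll build the following Jenkins workflows (the steps required to implement the Jenkins pipelines).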

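If a push is detected in the AutomaticTraining-CodeCommit repository (Pipeline 1 in the diagram below), immediately pull the code and build a container with it (2), push the image to Google Container Registry (GCR), and use that image to start a training job in Google Kubernetes Engine (GKE). Once training has finished, push the trained model to our GCS/testing registry. Next, pull the AutomaticTraining-UnitTesting repository and build a container with it (3). Follow the same process to test the model previously saved in the model testing registry. Send a notification with the pipeline result. If the result is positive, start the semi-automated deployment to production (4), then deploy the container that serves as the prediction service API (5) and, optionally, the interface (6).
If a push is detected in the AutomaticTraining-Dataset repository (Pipeline 2 in the diagram below), immediately pull it, then pull the AutomaticTraining-DataCommit repository, which stores the code used to build this container (2). Use the container to retrain the model in GCS (if the previous pipeline has already been triggered), and save the model again once it reaches a certain performance metric. Afterwards, trigger the unit testing step mentioned above (3), and then repeat the cycle (4).
If a push is detected in the AutomaticTraining-UnitTesting repository (Pipeline 3 in the diagram below), pull it and build a container with it (3) to test the model that resides in the GCS/testing registry. The purpose of this pipeline is to let data scientists integrate new tests into the recently deployed model without repeating the earlier workflows.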
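
The trigger chain described above can be summarized in a few lines of Python. This is only an illustrative model, not part of the project: the dictionary and function names are hypothetical, and the sketch covers only the automatic triggers. The semi-automated deployment to production (4) is left out because it is not started automatically.

```python
# Hypothetical model of the automatic trigger chain described above.
# Keys are the repositories whose pushes start a workflow.
REPO_TO_WORKFLOW = {
    "AutomaticTraining-CodeCommit": 1,
    "AutomaticTraining-Dataset": 2,
    "AutomaticTraining-UnitTesting": 3,
}

# Workflow that starts automatically when the key workflow succeeds.
# Workflow 3 has no automatic successor: deployment (4) is semi-automated.
DOWNSTREAM = {1: 3, 2: 3}


def workflows_run_for_push(repo):
    """Return the workflow numbers that run, in order, after a push to `repo`."""
    chain = []
    workflow = REPO_TO_WORKFLOW.get(repo)
    while workflow is not None:
        chain.append(workflow)
        workflow = DOWNSTREAM.get(workflow)
    return chain


if __name__ == "__main__":
    for repo in REPO_TO_WORKFLOW:
        print(repo, "->", workflows_run_for_push(repo))
```

A push to the code or dataset repository therefore always ends with the unit testing workflow, while a push to the unit testing repository runs that workflow alone.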

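Building Jenkins Workflows

To get our 3 Jenkins pipelines, we need to develop 6 underlying workflows, each a Jenkins pipeline script that executes a particular task. Let's build workflows 1, 2, 3, and 5. Workflow 4 is somewhat more complex, so we'll discuss it in the next article.

Workflow 1

On the Jenkins dashboard, select New Item, give the item a name, select Pipeline, and then click OK.

On the next page, select Configure from the left-hand menu. In the Build Triggers section, select the GitHub hook trigger for GITScm polling checkbox. This allows the workflow to be triggered by GitHub pushes.

Scroll down and paste the following script, which handles this workflow's execution, into the Pipeline section.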

properties([pipelineTriggers([githubPush()])])
pipeline {
    agent any
    environment {
        PROJECT_ID = 'automatictrainingcicd'
        CLUSTER_NAME = 'training-cluster'
        LOCATION = 'us-central1-a'
        CREDENTIALS_ID = 'AutomaticTrainingCICD'
    }
    stages {
        stage('Cloning our GitHub repo') {
          steps {
            checkout([
              $class: 'GitSCM',
              branches: [[name: 'main']],
              userRemoteConfigs: [[
                url: 'https://github.com/sergiovirahonda/AutomaticTraining-CodeCommit.git',
                credentialsId: '',
              ]]
             ])
           }
        }
        stage('Building and pushing image to GCR') {
            steps {
                script {
                    docker.withRegistry('https://gcr.io', 'gcr:AutomaticTrainingCICD') {
                        app = docker.build('automatictrainingcicd/code-commit:latest')
                        app.push("latest")
                    }
                }
            }
        }
        stage('Deploying to GKE') {
            steps{
                step([$class: 'KubernetesEngineBuilder', projectId: env.PROJECT_ID, clusterName: env.CLUSTER_NAME, location: env.LOCATION, manifestPattern: 'pod.yaml', credentialsId: env.CREDENTIALS_ID, verifyDeployments: true])
            }
        }
    }
    post {
        unsuccessful {
            echo 'The Jenkins pipeline execution has failed.'
            emailext body: "The '${env.JOB_NAME}' job has failed during its execution. Check the logs for more information.", recipientProviders: [[$class: 'DevelopersRecipientProvider'], [$class: 'RequesterRecipientProvider']], subject: 'A Jenkins pipeline execution has failed.'
        }
        success {
            echo 'The Jenkins pipeline execution has ended successfully, triggering the next one.'
            build job: 'AutomaticTraining-UnitTesting', propagate: true, wait: false
        }
    }
}

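Let's look at the key components of the code above. properties([pipelineTriggers([githubPush()])]) indicates that the workflow is triggered by a push to the repository mentioned in the code. environment defines the environment variables used during the workflow's execution. These are the variables needed to work with GCP. The stage('Cloning our GitHub repo') stage pulls the AutomaticTraining-CodeCommit repository and defines it as the repository that triggers the workflow's execution. The stage('Building and pushing image to GCR') stage builds the container from the Dockerfile downloaded from that repository and pushes it to GCR. The stage('Deploying to GKE') stage builds a Kubernetes job on GKE using the container image recently pushed to GCR (as defined in the pod.yaml file, which is also downloaded from the repository). If the job completes successfully, it triggers the AutomaticTraining-UnitTesting workflow (3); otherwise, it notifies the product owner by email.

Workflow 2

Follow the same steps as for Workflow 1. Enter the following script to build this pipeline.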
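
To connect the build stage with the deployment stage, it helps to see the full name under which the image ends up in the registry. Inside docker.withRegistry, the pushed image is addressed with the registry host in front of the name given to docker.build. The Python helper below is a hypothetical illustration of that composition (it is not part of the project), and it assumes that a manifest such as pod.yaml refers to the image by this full reference.

```python
def gcr_image_reference(image, registry_host="gcr.io"):
    """Return the full reference of an image pushed to the registry.

    `image` is the name passed to docker.build, for example
    "project/name:tag". If no tag is given, "latest" is assumed,
    which matches the app.push("latest") call in the script.
    """
    # Only look for a tag in the last path component of the name.
    if ":" not in image.rsplit("/", 1)[-1]:
        image = image + ":latest"
    return f"{registry_host}/{image}"


if __name__ == "__main__":
    print(gcr_image_reference("automatictrainingcicd/code-commit:latest"))
```

For the script above, this yields gcr.io/automatictrainingcicd/code-commit:latest.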

properties([pipelineTriggers([githubPush()])])
pipeline {
    agent any
    environment {
        PROJECT_ID = 'automatictrainingcicd'
        CLUSTER_NAME = 'training-cluster'
        LOCATION = 'us-central1-a'
        CREDENTIALS_ID = 'AutomaticTrainingCICD'
    }
    stages {
        stage('Webhook trigger received. Cloning 1st repository.') {
          steps {
            checkout([
              $class: 'GitSCM',
              branches: [[name: 'main']],
              userRemoteConfigs: [[
                url: 'https://github.com/sergiovirahonda/AutomaticTraining-Dataset.git',
                credentialsId: '',
              ]]
             ])
           }
        }
        stage('Cloning GitHub repo that contains Dockerfile.') {
            steps {
                git url: 'https://github.com/sergiovirahonda/AutomaticTraining-DataCommit.git', branch: 'main'
            }
        }
        stage('Building and pushing image') {
            steps {
                script {
                    docker.withRegistry('https://gcr.io', 'gcr:AutomaticTrainingCICD') {
                        app = docker.build('automatictrainingcicd/data-commit:latest')
                        app.push("latest")
                    }
                }
            }
        }
        stage('Deploying to GKE') {
            steps{
                step([$class: 'KubernetesEngineBuilder', projectId: env.PROJECT_ID, clusterName: env.CLUSTER_NAME, location: env.LOCATION, manifestPattern: 'pod.yaml', credentialsId: env.CREDENTIALS_ID, verifyDeployments: true])
            }
        }
    }
    post {
        unsuccessful {
            echo 'The Jenkins pipeline execution has failed.'
            emailext body: "The '${env.JOB_NAME}' job has failed during its execution. Check the logs for more information.", recipientProviders: [[$class: 'DevelopersRecipientProvider'], [$class: 'RequesterRecipientProvider']], subject: 'A Jenkins pipeline execution has failed.'
        }
        success {
            echo 'The Jenkins pipeline execution has ended successfully, triggering the next one.'
            build job: 'AutomaticTraining-UnitTesting', propagate: true, wait: false
        }
    }
}

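The script above performs almost the same process as Workflow 1, except that it builds the container from the code in the AutomaticTraining-DataCommit repository. Also, this pipeline is triggered by any push to the AutomaticTraining-Dataset repository. Finally, if it succeeds, it triggers the AutomaticTraining-UnitTesting workflow (3).

Workflow 3

This workflow unit tests the model that resides in the GCS/testing registry. It is triggered when changes are pushed to the AutomaticTraining-UnitTesting repository. It can also be triggered by Workflow 1 or Workflow 2. The script that builds this pipeline is as follows.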

properties([pipelineTriggers([githubPush()])])
pipeline {
    agent any
    environment {
        PROJECT_ID = 'automatictrainingcicd'
        CLUSTER_NAME = 'training-cluster'
        LOCATION = 'us-central1-a'
        CREDENTIALS_ID = 'AutomaticTrainingCICD'
    }
    stages {
        stage('Awaiting for previous training to be completed.'){
            steps{
                echo "Initializing prudential time"
                sleep(1200)
                echo "Ended"
            }
        }
        stage('Cloning our GitHub repo') {
          steps {
            checkout([
              $class: 'GitSCM',
              branches: [[name: 'main']],
              userRemoteConfigs: [[
                url: 'https://github.com/sergiovirahonda/AutomaticTraining-UnitTesting.git',
                credentialsId: '',
              ]]
             ])
           }
        }
        stage('Building and pushing image') {
            steps {
                script {
                    docker.withRegistry('https://gcr.io', 'gcr:AutomaticTrainingCICD') {
                        app = docker.build('automatictrainingcicd/unit-testing:latest')
                        app.push("latest")
                    }
                }
            }
        }
        stage('Deploying to GKE') {
            steps{
                step([$class: 'KubernetesEngineBuilder', projectId: env.PROJECT_ID, clusterName: env.CLUSTER_NAME, location: env.LOCATION, manifestPattern: 'pod.yaml', credentialsId: env.CREDENTIALS_ID, verifyDeployments: true])
            }
        }
    }
    post {
        unsuccessful {
            echo 'The Jenkins pipeline execution has failed.'
            emailext body: "The '${env.JOB_NAME}' job has failed during its execution. Check the logs for more information.", recipientProviders: [[$class: 'DevelopersRecipientProvider'], [$class: 'RequesterRecipientProvider']], subject: 'A Jenkins pipeline execution has failed.'
        }
        success {
            echo 'The Jenkins pipeline execution has ended successfully, check the GCP logs for more information.'
        }
    }
}

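The script above differs only slightly from the previous ones. It waits 1,200 seconds (20 minutes) before starting, to give the training job (from Workflow 1 or 2) enough time to finish. Also, once the model unit testing is done, it emails the result to the product owner to let them know whether they need to start the semi-automated deployment to production (4).

Workflow 5

This workflow pulls the code from the AutomaticTraining-PredictionAPI repository (5) and builds a container that enables the prediction service. That container loads the model from the production registry (where it has been copied by the semi-automated deployment to production (4)) and answers POST requests with predictions in JSON format. The workflow script is as follows.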

pipeline {
    agent any
    environment {
        PROJECT_ID = 'automatictrainingcicd'
        CLUSTER_NAME = 'training-cluster'
        LOCATION = 'us-central1-a'
        CREDENTIALS_ID = 'AutomaticTrainingCICD'
    }
    stages {
        stage('Cloning our Git') {
            steps {
                git url: 'https://github.com/sergiovirahonda/AutomaticTraining-PredictionAPI.git', branch: 'main'
            }
        }
        stage('Building and deploying image') {
            steps {
                script {
                    docker.withRegistry('https://gcr.io', 'gcr:AutomaticTrainingCICD') {
                        app = docker.build('automatictrainingcicd/prediction-api')
                        app.push("latest")
                    }
                }
            }
        }
        stage('Deploying to GKE') {
            steps{
                step([$class: 'KubernetesEngineBuilder', projectId: env.PROJECT_ID, clusterName: env.CLUSTER_NAME, location: env.LOCATION, manifestPattern: 'pod.yaml', credentialsId: env.CREDENTIALS_ID, verifyDeployments: true])
            }
        }
    }
    post {
        unsuccessful {
            echo 'The Jenkins pipeline execution has failed.'
            emailext body: "The '${env.JOB_NAME}' job has failed during its execution. Check the logs for more information.", recipientProviders: [[$class: 'DevelopersRecipientProvider'], [$class: 'RequesterRecipientProvider']], subject: 'A Jenkins pipeline execution has failed.'
        }
        success {
            echo 'The Jenkins pipeline execution has ended successfully, triggering the next one.'
            build job: 'AutomaticTraining-Interface', propagate: true, wait: false
        }
    }
}

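Finally, the script triggers a workflow called AutomaticTraining-Interface. This is a "bonus" that provides a web interface for end users. We won't discuss it in this series; however, you can find all the relevant files in the workflow's repository.

Triggering Workflows with GitHub Webhooks

If you run Jenkins locally in a private network, you need to install SocketXP or a similar service. It exposes the Jenkins server, reachable at https://:8080, to the outside world, including GitHub.

To install SocketXP, select your operating system type and follow the instructions on the website.

To create a secure tunnel for Jenkins, run this command.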
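
As for the prediction service that Workflow 5 deploys, a client talks to it by sending a POST request and reading a JSON response. The sketch below shows what such a client might look like using only the Python standard library. The endpoint URL and the payload shape ("features") are assumptions made for illustration, since this article does not show the API's address or input schema.

```python
import json
import urllib.request

# Hypothetical endpoint: replace with the address of the deployed service.
PREDICTION_URL = "http://prediction-api.example.com/predict"


def build_prediction_request(features, url=PREDICTION_URL):
    """Build a POST request carrying a JSON body (assumed payload shape)."""
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def request_prediction(features, url=PREDICTION_URL):
    """Send the request and decode the JSON prediction from the response."""
    request = build_prediction_request(features, url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it possible to inspect the method, headers, and body without a running service.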

socketxp connect https://:8080

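The response will give you a public URL.

To trigger workflows on the local Jenkins server, we need to create GitHub webhooks. Go to the repository that triggers the workflow and select Settings > Webhooks > Add webhook. Paste the public URL into the Payload URL field, append "/github-webhook/" to it, and then click Add webhook.

Once this is done, you should be able to trigger the workflow by pushing new code to the corresponding repository.

Configure webhooks for the AutomaticTraining-CodeCommit, AutomaticTraining-Dataset, and AutomaticTraining-UnitTesting repositories so that continuous integration is triggered properly in our pipelines.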
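
The Payload URL described above is simply the tunnel's public URL with the "/github-webhook/" path appended. A small, hypothetical helper (not part of the project) makes the rule explicit and avoids a doubled slash when the public URL already ends with one. The host name in the example is a placeholder, not a real tunnel address.

```python
def github_webhook_payload_url(public_url):
    """Append the "/github-webhook/" path to the tunnel's public URL."""
    # Strip any trailing slash first so the result never contains "//" before the path.
    return public_url.rstrip("/") + "/github-webhook/"


if __name__ == "__main__":
    # Placeholder public URL; use the one printed by your tunnel service.
    print(github_webhook_payload_url("https://my-tunnel.example.com"))
```

The same value goes into the Payload URL field of each of the three repositories.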

Next Steps

In the next article, we'll develop a script for the semi-automated deployment to production, which will complete our project. Stay tuned!