Building a Predictive Maintenance Model with ML.NET






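A Notebook implemented entirely on the .NET platform using C# Jupyter Notebook and Daany
Summary
This C# Notebook continues the previous blog post, “Predictive Maintenance on .NET Platform”.
The Notebook is implemented entirely on the .NET platform using **C# Jupyter Notebook** and **Daany**, a C# data analytics library. There are small differences between this Notebook and the one on the official Azure gallery portal, but in most cases the code follows the steps defined there.
The Notebook shows how to use **.NET Jupyter Notebook** together with **Daany.DataFrame** and **ML.NET** to prepare data and build a predictive maintenance model on the .NET platform.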
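Description
In the previous post, we analyzed five data sets containing `telemetry`, `data`, `errors`, `maintenance`, and `failure` information for 100 machines. The data were transformed and analyzed to create the data set used to build the predictive maintenance machine learning model.
Once all features have been created from the data sets, the final step is to create the label column, which describes whether a machine will fail in the next 24 hours due to a failure of `component1`, `component2`, `component3`, or `component4`, or whether it will keep working. In this part, we carry out the machine learning portion of the task and train a model that predicts whether a machine will fail within the next 24 hours or will operate normally during that period.
The model we are going to build is a multi-class classification model, because it has five values to predict:
- `component1`
- `component2`
- `component3`
- `component4`, or
- `none`, meaning the machine will keep working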
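ML.NET Framework as the Training Library
To train the model, we use ML.NET, Microsoft's open-source machine learning framework for the .NET platform. First, some preparation is needed:
- the required NuGet packages
- a set of `using` statements and the code that formats the output
At the beginning of this Notebook, we installed several NuGet packages needed to complete it. The following code shows the `using` statements and the function that formats data from the `DataFrame`.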
//using Microsoft.ML.Data;
using XPlot.Plotly;
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;
//
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms;
using Microsoft.ML.Trainers.LightGbm;
//
using Daany;
using Daany.Ext;
//DataFrame formatter
using Microsoft.AspNetCore.Html;
Formatter<DataFrame>.Register((df, writer) =>
{
var headers = new List<IHtmlContent>();
headers.Add(th(i("index")));
headers.AddRange(df.Columns.Select(c => (IHtmlContent) th(c)));
//renders the rows
var rows = new List<List<IHtmlContent>>();
var take = 20;
//
for (var i = 0; i < Math.Min(take, df.RowCount()); i++)
{
var cells = new List<IHtmlContent>();
cells.Add(td(df.Index[i]));
foreach (var obj in df[i]){
cells.Add(td(obj));
}
rows.Add(cells);
}
var t = table(
thead(
headers),
tbody(
rows.Select(
r => tr(r))));
writer.Write(t);
}, "text/html");
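Once the NuGet packages are installed and the `using` statements are defined, we define the classes required to create the ML.NET pipeline.
The `PrMaintenanceClass` class contains the features (properties) we built in the previous post. We need them to define the features in the ML.NET pipeline. The second class, `PrMaintenancePrediction`, is used for prediction and model evaluation.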
class PrMaintenancePrediction
{
[ColumnName("PredictedLabel")]
public string failure { get; set; }
}
class PrMaintenanceClass
{
public DateTime datetime { get; set; }
public int machineID { get; set; }
public float voltmean_3hrs { get; set; }
public float rotatemean_3hrs { get; set; }
public float pressuremean_3hrs { get; set; }
public float vibrationmean_3hrs { get; set; }
public float voltstd_3hrs { get; set; }
public float rotatestd_3hrs { get; set; }
public float pressurestd_3hrs { get; set; }
public float vibrationstd_3hrs { get; set; }
public float voltmean_24hrs { get; set; }
public float rotatemean_24hrs { get; set; }
public float pressuremean_24hrs { get; set; }
public float vibrationmean_24hrs { get; set; }
public float voltstd_24hrs { get; set; }
public float rotatestd_24hrs { get; set; }
public float pressurestd_24hrs { get; set; }
public float vibrationstd_24hrs { get; set; }
public float error1count { get; set; }
public float error2count { get; set; }
public float error3count { get; set; }
public float error4count { get; set; }
public float error5count { get; set; }
public float sincelastcomp1 { get; set; }
public float sincelastcomp2 { get; set; }
public float sincelastcomp3 { get; set; }
public float sincelastcomp4 { get; set; }
public string model { get; set; }
public float age { get; set; }
public string failure { get; set; }
}
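Now that the class types are defined, we implement the pipeline for this ML model. First, we create an `MLContext` with a fixed seed so that anyone running this Notebook can reproduce the model. Then we load the data and split it into a training set and a test set.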
MLContext mlContext= new MLContext(seed:88888);
var strPath="data/final_dataFrame.csv";
var mlDF= DataFrame.FromCsv(strPath);
//
//split the data frame into training and testing parts
//split at 2015-08-01 01:00:00:
//rows up to and including that timestamp are used for training, later rows for testing
var trainDF = mlDF.Filter("datetime", new DateTime(2015, 08, 1, 1, 0, 0),
FilterOperator.LessOrEqual);
var testDF = mlDF.Filter("datetime", new DateTime(2015, 08, 1, 1, 0, 0),
FilterOperator.Greather);
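The split above is chronological rather than random, so no row from the test period can leak into training. As a rough illustration of the same idea without Daany, the following sketch performs the split with plain LINQ. The `rows` list and its values are made up for the example and are not taken from the real data set.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample rows (timestamp, label); not taken from the real data set.
var rows = new List<(DateTime Timestamp, string Failure)>
{
    (new DateTime(2015, 7, 31, 22, 0, 0), "none"),
    (new DateTime(2015, 8, 1, 1, 0, 0), "comp1"),
    (new DateTime(2015, 8, 1, 2, 0, 0), "none"),
    (new DateTime(2015, 9, 15, 6, 0, 0), "comp2"),
};

// Same cut-off timestamp as in the notebook code above.
var cutOff = new DateTime(2015, 8, 1, 1, 0, 0);

// Rows at or before the cut-off go to training, later rows to testing,
// mirroring FilterOperator.LessOrEqual and FilterOperator.Greather.
var train = rows.Where(r => r.Timestamp <= cutOff).ToList();
var test = rows.Where(r => r.Timestamp > cutOff).ToList();

Console.WriteLine($"train: {train.Count}, test: {test.Count}"); // train: 2, test: 2
```

Because the two predicates are exact complements, every row lands in exactly one of the two sets.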
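The summary of the training set is shown in the following table:
Similarly, the test set has the following summary:
Once the data are loaded into application memory, we can prepare the ML.NET pipeline. The pipeline starts by converting the `Daany.DataFrame` type into an `IDataView` collection, which is done with the `LoadFromEnumerable` method.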
//Load daany:DataFrame into ML.NET pipeline
public static IDataView loadFromDataFrame(MLContext mlContext,Daany.DataFrame df)
{
IDataView dataView = mlContext.Data.LoadFromEnumerable<PrMaintenanceClass>(df.GetEnumerator<PrMaintenanceClass>(oRow =>
{
//convert row object array into PrMaintenanceClass row
var ooRow = oRow;
var prRow = new PrMaintenanceClass();
prRow.datetime = (DateTime)ooRow["datetime"];
prRow.machineID = (int)ooRow["machineID"];
prRow.voltmean_3hrs = Convert.ToSingle(ooRow["voltmean_3hrs"]);
prRow.rotatemean_3hrs = Convert.ToSingle(ooRow["rotatemean_3hrs"]);
prRow.pressuremean_3hrs = Convert.ToSingle(ooRow["pressuremean_3hrs"]);
prRow.vibrationmean_3hrs = Convert.ToSingle(ooRow["vibrationmean_3hrs"]);
prRow.voltstd_3hrs = Convert.ToSingle(ooRow["voltsd_3hrs"]);
prRow.rotatestd_3hrs = Convert.ToSingle(ooRow["rotatesd_3hrs"]);
prRow.pressurestd_3hrs = Convert.ToSingle(ooRow["pressuresd_3hrs"]);
prRow.vibrationstd_3hrs = Convert.ToSingle(ooRow["vibrationsd_3hrs"]);
prRow.voltmean_24hrs = Convert.ToSingle(ooRow["voltmean_24hrs"]);
prRow.rotatemean_24hrs = Convert.ToSingle(ooRow["rotatemean_24hrs"]);
prRow.pressuremean_24hrs = Convert.ToSingle(ooRow["pressuremean_24hrs"]);
prRow.vibrationmean_24hrs = Convert.ToSingle(ooRow["vibrationmean_24hrs"]);
prRow.voltstd_24hrs = Convert.ToSingle(ooRow["voltsd_24hrs"]);
prRow.rotatestd_24hrs = Convert.ToSingle(ooRow["rotatesd_24hrs"]);
prRow.pressurestd_24hrs = Convert.ToSingle(ooRow["pressuresd_24hrs"]);
prRow.vibrationstd_24hrs = Convert.ToSingle(ooRow["vibrationsd_24hrs"]);
prRow.error1count = Convert.ToSingle(ooRow["error1count"]);
prRow.error2count = Convert.ToSingle(ooRow["error2count"]);
prRow.error3count = Convert.ToSingle(ooRow["error3count"]);
prRow.error4count = Convert.ToSingle(ooRow["error4count"]);
prRow.error5count = Convert.ToSingle(ooRow["error5count"]);
prRow.sincelastcomp1 = Convert.ToSingle(ooRow["sincelastcomp1"]);
prRow.sincelastcomp2 = Convert.ToSingle(ooRow["sincelastcomp2"]);
prRow.sincelastcomp3 = Convert.ToSingle(ooRow["sincelastcomp3"]);
prRow.sincelastcomp4 = Convert.ToSingle(ooRow["sincelastcomp4"]);
prRow.model = (string)ooRow["model"];
prRow.age = Convert.ToSingle(ooRow["age"]);
prRow.failure = (string)ooRow["failure"];
//
return prRow;
}));
return dataView;
}
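The conversion above uses `Convert.ToSingle` for the numeric columns instead of a direct `(float)` cast. One plausible reason is that row values arrive boxed as `object`, and a boxed value can only be unboxed to its exact type. The sketch below illustrates the difference; the `row` dictionary and its values are hypothetical, and it assumes numeric cells may be boxed as `double` or `int`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical row: values are boxed as object, as they would be when a
// data frame hands back a row of mixed column types.
var row = new Dictionary<string, object>
{
    ["machineID"] = 1,
    ["voltmean_3hrs"] = 170.03d,   // boxed double, not float
    ["error1count"] = 0,           // boxed int
};

// Convert.ToSingle inspects the runtime type, so it accepts double and int alike.
float volt = Convert.ToSingle(row["voltmean_3hrs"]);
float errors = Convert.ToSingle(row["error1count"]);

// A direct cast only unboxes to the exact boxed type, so this throws.
bool castFailed = false;
try
{
    _ = (float)row["voltmean_3hrs"];
}
catch (InvalidCastException)
{
    castFailed = true;
}

Console.WriteLine($"volt={volt}, errors={errors}, castFailed={castFailed}");
```

The same reasoning applies to the `(int)` and `(string)` casts in the method: they only succeed when the boxed value already has exactly that type.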
Load the data sets into application memory:
//Convert the training and test data frames into IDataView objects
var trainData = loadFromDataFrame(mlContext, trainDF);
var testData = loadFromDataFrame(mlContext, testDF);
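Before training starts, the data must be processed so that every non-numeric column is encoded as a numeric one. We also need to define which columns make up the `Features` and which one is the label. For this purpose, we define the `PrepareData` method.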
public static IEstimator<ITransformer> PrepareData(MLContext mlContext)
{
//map the text label column (failure) to a key column named Label
IEstimator<ITransformer> dataPipeline =
mlContext.Transforms.Conversion.MapValueToKey
(outputColumnName: "Label", inputColumnName: nameof(PrMaintenanceClass.failure))
//one hot encode the model column
.Append(mlContext.Transforms.Categorical.OneHotEncoding
("model",outputKind: OneHotEncodingEstimator.OutputKind.Indicator))
//define features column
.Append(mlContext.Transforms.Concatenate("Features",
//
nameof(PrMaintenanceClass.voltmean_3hrs), nameof(PrMaintenanceClass.rotatemean_3hrs),
nameof(PrMaintenanceClass.pressuremean_3hrs),nameof(PrMaintenanceClass.vibrationmean_3hrs),
nameof(PrMaintenanceClass.voltstd_3hrs), nameof(PrMaintenanceClass.rotatestd_3hrs),
nameof(PrMaintenanceClass.pressurestd_3hrs), nameof(PrMaintenanceClass.vibrationstd_3hrs),
nameof(PrMaintenanceClass.voltmean_24hrs),nameof(PrMaintenanceClass.rotatemean_24hrs),
nameof(PrMaintenanceClass.pressuremean_24hrs),
nameof(PrMaintenanceClass.vibrationmean_24hrs),
nameof(PrMaintenanceClass.voltstd_24hrs),nameof(PrMaintenanceClass.rotatestd_24hrs),
nameof(PrMaintenanceClass.pressurestd_24hrs),nameof(PrMaintenanceClass.vibrationstd_24hrs),
nameof(PrMaintenanceClass.error1count), nameof(PrMaintenanceClass.error2count),
nameof(PrMaintenanceClass.error3count), nameof(PrMaintenanceClass.error4count),
nameof(PrMaintenanceClass.error5count), nameof(PrMaintenanceClass.sincelastcomp1),
nameof(PrMaintenanceClass.sincelastcomp2),nameof(PrMaintenanceClass.sincelastcomp3),
nameof(PrMaintenanceClass.sincelastcomp4),
nameof(PrMaintenanceClass.model), nameof(PrMaintenanceClass.age) ));
return dataPipeline;
}
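As can be seen, the method converts the label column `failure`, a plain text column, into a categorical column that holds a numeric representation of each distinct class, called `Keys`.
Now that the data transformation is complete, we define the `Train` method, which sets up the ML algorithm and its hyperparameters and runs the training process. When called, it returns the trained model.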
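To make the two encodings used in `PrepareData` more concrete, here is a minimal plain C# sketch of what value-to-key mapping and one-hot (indicator) encoding do conceptually. It does not call ML.NET, the sample values are hypothetical, and it assumes keys are assigned in order of first occurrence starting at 1.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical label values; the real ones come from the failure column.
var labels = new[] { "none", "comp4", "none", "comp1", "comp4" };

// Value-to-key: each distinct value gets a 1-based key in order of first appearance.
var keyOf = new Dictionary<string, uint>();
foreach (var label in labels)
{
    if (!keyOf.ContainsKey(label))
        keyOf[label] = (uint)keyOf.Count + 1;
}
var keys = labels.Select(l => keyOf[l]).ToArray();
Console.WriteLine(string.Join(", ", keys)); // 1, 2, 1, 3, 2

// One-hot (indicator) encoding: one slot per distinct category, a single 1 per row.
var models = new[] { "model3", "model1", "model3" };   // hypothetical values
var categories = models.Distinct().ToList();
var oneHot = models
    .Select(m => categories.Select(c => c == m ? 1f : 0f).ToArray())
    .ToArray();
foreach (var vector in oneHot)
    Console.WriteLine(string.Join(" ", vector));
// 1 0
// 0 1
// 1 0
```

The label is turned into a single key because the trainer needs one class index per row, whereas the `model` feature is expanded into an indicator vector so that no artificial ordering is implied between machine models.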
//train method
static public TransformerChain<ITransformer> Train(MLContext mlContext, IDataView preparedData)
{
var transformationPipeline=PrepareData(mlContext);
//settings hyper parameters
var options = new LightGbmMulticlassTrainer.Options();
options.FeatureColumnName = "Features";
options.LearningRate = 0.005;
options.NumberOfIterations = 2000;
//the original code set NumberOfLeaves twice (70, then 50); only the last value takes effect
options.NumberOfLeaves = 50;
options.UnbalancedSets = true;
//
var boost = new DartBooster.Options();
boost.XgboostDartMode = true;
boost.MaximumTreeDepth = 25;
options.Booster = boost;
// Define LightGbm algorithm estimator
IEstimator<ITransformer> lightGbm = mlContext.MulticlassClassification.Trainers.LightGbm(options);
//train the ML model
TransformerChain<ITransformer> model = transformationPipeline.Append(lightGbm).Fit(preparedData);
//return trained model for evaluation
return model;
}
Training Process and Model Evaluation
Now that all the required methods are in place, the main program is structured as follows:
//prepare data transformation pipeline
var dataPipeline = PrepareData(mlContext);
//print prepared data
var pp = dataPipeline.Fit(trainData);
var transformedData = pp.Transform(trainData);
//train the model
var model = Train(mlContext, trainData);
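Once the `Train` method returns the model, the evaluation phase begins. To evaluate the model, we run a full evaluation on both the training and test data.
Model Evaluation with the Training Data Set
The evaluation is run on the training data set first and then on the test data set.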
//evaluate train set
var predictions = model.Transform(trainData);
var metricsTrain = mlContext.MulticlassClassification.Evaluate(predictions);
ConsoleHelper.PrintMultiClassClassificationMetrics("TRAIN DataSet", metricsTrain);
ConsoleHelper.ConsoleWriteHeader("Train DataSet Confusion Matrix ");
ConsoleHelper.ConsolePrintConfusionMatrix(metricsTrain.ConfusionMatrix);
Model evaluation output:
************************************************************
* Metrics for TRAIN DataSet multi-class classification model
*-----------------------------------------------------------
AccuracyMacro = 0.9603, a value between 0 and 1, the closer to 1, the better
AccuracyMicro = 0.999, a value between 0 and 1, the closer to 1, the better
LogLoss = 0.0015, the closer to 0, the better
LogLoss for class 1 = 0, the closer to 0, the better
LogLoss for class 2 = 0.088, the closer to 0, the better
LogLoss for class 3 = 0.0606, the closer to 0, the better
************************************************************
Train DataSet Confusion Matrix
###############################
Confusion table
||========================================
PREDICTED || none | comp4 | comp1 | comp2 | comp3 | Recall
TRUTH ||========================================
none || 165 371 | 0 | 0 | 0 | 0 | 1.0000
comp4 || 0 | 772 | 16 | 25 | 11 | 0.9369
comp1 || 0 | 8 | 884 | 26 | 4 | 0.9588
comp2 || 0 | 31 | 22 | 1 097 | 8 | 0.9473
comp3 || 0 | 13 | 4 | 8 | 576 | 0.9584
||========================================
Precision ||1.0000 |0.9369 |0.9546 |0.9490 |0.9616 |
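As can be seen, the model predicts the correct value in most cases on the training data set. Now let's see how it predicts data that were not part of the training process.
Model Evaluation with the Test Data Set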
//evaluate test set
var testPrediction = model.Transform(testData);
var metricsTest = mlContext.MulticlassClassification.Evaluate(testPrediction);
ConsoleHelper.PrintMultiClassClassificationMetrics("Test Dataset", metricsTest);
ConsoleHelper.ConsoleWriteHeader("Test DataSet Confusion Matrix ");
ConsoleHelper.ConsolePrintConfusionMatrix(metricsTest.ConfusionMatrix);
************************************************************
* Metrics for Test Dataset multi-class classification model
*-----------------------------------------------------------
AccuracyMacro = 0.9505, a value between 0 and 1, the closer to 1, the better
AccuracyMicro = 0.9986, a value between 0 and 1, the closer to 1, the better
LogLoss = 0.0033, the closer to 0, the better
LogLoss for class 1 = 0.0012, the closer to 0, the better
LogLoss for class 2 = 0.1075, the closer to 0, the better
LogLoss for class 3 = 0.1886, the closer to 0, the better
************************************************************
Test DataSet Confusion Matrix
##############################
Confusion table
||========================================
PREDICTED || none | comp4 | comp1 | comp2 | comp3 | Recall
TRUTH ||========================================
none || 120 313 | 6 | 15 | 0 | 0 | 0.9998
comp4 || 1 | 552 | 10 | 17 | 4 | 0.9452
comp1 || 2 | 14 | 464 | 24 | 24 | 0.8788
comp2 || 0 | 39 | 0 | 835 | 16 | 0.9382
comp3 || 0 | 4 | 0 | 0 | 412 | 0.9904
||========================================
Precision ||1.0000 |0.8976 |0.9489 |0.9532 |0.9035 |
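The `AccuracyMicro` and `AccuracyMacro` values reported above can be reproduced from the confusion matrix alone. The following sketch recomputes them in plain C#, assuming micro accuracy is the share of correctly classified rows overall and macro accuracy is the unweighted mean of the per-class recalls. The numbers are copied from the test confusion matrix; the variable names are illustrative.

```csharp
using System;
using System.Linq;

// Test-set confusion matrix copied from the output above.
// Rows are true classes, columns are predicted classes,
// both in the order none, comp4, comp1, comp2, comp3.
var matrix = new long[][]
{
    new long[] { 120313, 6, 15, 0, 0 },
    new long[] { 1, 552, 10, 17, 4 },
    new long[] { 2, 14, 464, 24, 24 },
    new long[] { 0, 39, 0, 835, 16 },
    new long[] { 0, 4, 0, 0, 412 },
};

// Recall per class: correct predictions on the diagonal divided by the row total.
var recall = matrix.Select((r, i) => (double)r[i] / r.Sum()).ToArray();

// Micro accuracy pools all rows; macro accuracy averages the per-class recalls.
double micro = (double)matrix.Select((r, i) => r[i]).Sum() / matrix.Sum(r => r.Sum());
double macro = recall.Average();

Console.WriteLine(FormattableString.Invariant($"micro = {micro:F4}, macro = {macro:F4}"));
// micro = 0.9986, macro = 0.9505
```

The gap between the two figures comes from class imbalance: the `none` class dominates the row count and pulls micro accuracy up, while macro accuracy gives the weaker `comp1` recall (0.8788) the same weight as every other class.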
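We can see that on the test data set the model has an overall (micro) accuracy of 99.86% and an average per-class (macro) accuracy of 95.05%. The complete Notebook for this article can be found here.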