Predictive Maintenance on the .NET Platform





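Summary
The complete .NET Jupyter Notebook for this article is available at: https://github.com/bhrnjica/notebooks/blob/master/PrM_DataPrep_Daany.DataFrame.ipynb
This article is based on the Azure AI Gallery article "Predictive Maintenance Modelling Guide", which also provides the dataset used here.
However, this notebook is implemented entirely on the .NET platform, using:
- C# Jupyter Notebook: the Jupyter Notebook experience with C# and .NET
- ML.NET: Microsoft's open source machine learning framework, and
- Daany: an open source DAta ANalYtics library, available as a NuGet package.
There are small differences between this notebook and the one on the official Azure Gallery portal, but for the most part the code follows the same steps. The purpose of this notebook is to demonstrate how to use **.NET Jupyter Notebook** with Daany.DataFrame and ML.NET to prepare data and build a predictive maintenance model on the .NET platform. First, though, let's look at what predictive maintenance is and why it matters.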
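Quick Introduction to Predictive Maintenance
Simply put, predictive maintenance is a technique for predicting when a machine part will fail in the near future, so that the part can be replaced under a maintenance plan before it fails and halts production. Predictive maintenance can improve the production process and increase productivity. Handled successfully, it lets us achieve the following goals:
- Reduce the operational risk of mission-critical equipment
- Control maintenance costs by enabling just-in-time maintenance operations
- Discover patterns associated with various maintenance problems
- Provide key performance indicators (KPIs)
The following figure shows the different types of maintenance in production.
Data Collection for Predictive Maintenance
To apply this technique we need a variety of data from production, including but not limited to:
- Telemetry data from the observed machines (vibration, voltage, temperature, etc.)
- Error and log data associated with each machine
- Failure data, such as when a particular component was replaced
- Quality and accuracy data, machine properties, model, age, etc.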
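The 3 Steps of Predictive Maintenance
In general, every predictive maintenance technique should follow these 3 main steps:
- Collect data: gather all available descriptive, historical, and real-time data, usually by means of IoT devices, various loggers, technical documentation, etc.
- Predict failures: transform the collected data into a machine-learning-ready dataset and build a machine learning model that predicts failures of machine components in production.
- React: knowing which components will fail in the near future, we can start the replacement process so that a component is replaced before it fails and production is not interrupted.
Predicting Failures
This article covers part of the second step: data preparation. To predict failures in the production process, a series of data transformations, cleaning, feature engineering, and feature selection steps must be carried out to get the data ready for building a machine learning model. Data preparation plays a crucial role in model building, because its quality is directly reflected in the accuracy and reliability of the model.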
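Software Requirements
This article walks through the complete data preparation process. The whole process was carried out with the following tools:
- .NET Core 3.1: the latest version of the .NET platform at the time of writing
- .NET Jupyter Notebook: the .NET implementation of the popular Jupyter Notebook
- ML.NET: Microsoft's open source machine learning framework for the .NET platform, and
- Daany: the DAta ANalYtics library, available on GitHub and as a NuGet package.
Notebook Preparation
To complete this task we need to install several NuGet packages and add several using directives. The following code block shows the using directives, together with additional code related to notebook output formatting.
Note: NuGet packages must be installed in the first cell of the notebook, otherwise the notebook will not work as expected. Hopefully this will change once the final version is released.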
//using Microsoft.ML.Data;
using XPlot.Plotly;
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;
//using statement of Daany package
using Daany;
using Daany.MathStuff;
using Daany.Ext;
//
using Microsoft.ML;
//DataFrame formatter
using Microsoft.AspNetCore.Html;
Formatter<DataFrame>.Register((df, writer) =>
{
    //renders the header row
    var headers = new List<IHtmlContent>();
    headers.Add(th(i("index")));
    headers.AddRange(df.Columns.Select(c => (IHtmlContent) th(c)));
    //renders the rows (at most the first 20)
    var rows = new List<List<IHtmlContent>>();
    var take = 20;
    //
    for (var i = 0; i < Math.Min(take, df.RowCount()); i++)
    {
        var cells = new List<IHtmlContent>();
        cells.Add(td(df.Index[i]));
        foreach (var obj in df[i])
        {
            cells.Add(td(obj));
        }
        rows.Add(cells);
    }
    var t = table(
        thead(headers),
        tbody(rows.Select(r => tr(r))));
    writer.Write(t);
}, "text/html");
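Downloading the Data
To start preparing the data, we first need the data itself. It is hosted in Azure Blob storage and is maintained as part of the Azure Gallery article.
Once the data has been downloaded from Blob storage, it is not downloaded again; the local copy is used from then on.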
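The download-once behaviour can be sketched roughly as follows. This is only an illustration, not the notebook's own code: the helper name EnsureLocalCopy, the injected fetch delegate, and the example.invalid URL are all hypothetical, and the real Blob location should be taken from the Azure Gallery article.

```csharp
using System;
using System.IO;

// Hypothetical helper: returns the local path and downloads only when the file is missing.
// `fetch` is injected so the caller decides how the bytes are retrieved
// (for example, through an HttpClient call in a real notebook).
string EnsureLocalCopy(string url, string localPath, Func<string, byte[]> fetch)
{
    if (File.Exists(localPath))
        return localPath;                      // reuse the local copy

    var dir = Path.GetDirectoryName(localPath);
    if (!string.IsNullOrEmpty(dir))
        Directory.CreateDirectory(dir);        // no-op when the folder already exists

    File.WriteAllBytes(localPath, fetch(url)); // first call: download and store
    return localPath;
}

// Usage with a stand-in fetcher that counts how often it is called.
var calls = 0;
Func<string, byte[]> fakeFetch = u => { calls++; return new byte[] { 1, 2, 3 }; };

var target = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"), "PdM_demo.csv");
EnsureLocalCopy("https://example.invalid/PdM_demo.csv", target, fakeFetch);
EnsureLocalCopy("https://example.invalid/PdM_demo.csv", target, fakeFetch);
Console.WriteLine($"fetch was called {calls} time(s)"); // fetch was called 1 time(s)
```

Passing the fetcher in as a delegate keeps the caching logic separate from the network call, so the same check can be reused for all five files.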
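Data
The data we use for predictive maintenance falls into the following groups:
- telemetry: historical data on machine behaviour (voltage, vibration, etc.)
- errors: data on machine warnings and errors
- maint: data on machine replacements and maintenance
- machines: descriptive information about the machines
- failures: dates on which a specific machine was stopped because of a component failure
We load all of the files so that the data can be fully prepared for the training process. The following code sample loads the data into application memory.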
%%time
//Load ALL 5 data frame files
//DataFrame Cols: datetime,machineID,volt,rotate,pressure,vibration
var telemetry = DataFrame.FromCsv("data/PdM_telemetry.csv", dformat: "yyyy-mm-dd hh:mm:ss");
var errors = DataFrame.FromCsv("data/PdM_errors.csv", dformat: "yyyy-mm-dd hh:mm:ss");
var maint = DataFrame.FromCsv("data/PdM_maint.csv", dformat: "yyyy-mm-dd hh:mm:ss");
var failures = DataFrame.FromCsv("data/PdM_failures.csv", dformat: "yyyy-mm-dd hh:mm:ss");
var machines = DataFrame.FromCsv("data/PdM_machines.csv", dformat: "yyyy-mm-dd hh:mm:ss");
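Telemetry
The first data source is the machine telemetry. It consists of voltage, rotation, pressure, and vibration measurements collected in real time, hourly, from 100 machines. The data covers the year 2015. The first 10 records of the dataset are shown below.
The next cell shows a description of the whole dataset. We have nearly one million telemetry records, which is a good starting point for the analysis.
To visualize the telemetry data, we can select a few of the columns and plot them.
Errors
Error data is among the most important information in any predictive maintenance system. Errors are non-breaking events recorded while the machine is still running. Error dates and times are rounded to the nearest hour because the telemetry data is collected hourly.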
errors.Head()
//count number of errors
var barValue = errors["errorID"].GroupBy(v => v)
.OrderBy(group => group.Key)
.Select(group => Tuple.Create(group.Key, group.Count()));
//Plot Errors data
var chart = Chart.Plot(
new Graph.Bar()
{
x = barValue.Select(x=>x.Item1),
y = barValue.Select(x=>x.Item2),
// mode = "markers",
}
);
var layout = new XPlot.Plotly.Layout.Layout()
{ title = "Error distribution",
xaxis=new XPlot.Plotly.Graph.Xaxis() { title="Error name" },
yaxis = new XPlot.Plotly.Graph.Yaxis() { title = "Error Count" } };
//put layout into chart
chart.WithLayout(layout);
display(chart)
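Maintenance
Maintenance is the next predictive maintenance (PrM) component; it tells us about scheduled and unscheduled maintenance. The maintenance table contains records that correspond both to regular component inspections and to failures. A record is added to the table when a component is replaced, either during a scheduled inspection or because of a breakdown. Records created because of a breakdown are called failures. The maintenance data covers 2014 and 2015.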
maint.Head()
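Machines
This dataset contains information about the 100 machines that are the subject of the predictive maintenance analysis. The information consists of the model type and the machine age. The following chart shows the distribution of machine age by model.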
//Distribution of models across age
var d1 = machines.Filter("model", "model1", FilterOperator.Equal)["age"]
.GroupBy(g => g).Select(g=>(g.Key,g.Count()));
var d2 = machines.Filter("model", "model2", FilterOperator.Equal)["age"]
.GroupBy(g => g).Select(g=>(g.Key,g.Count()));
var d3 = machines.Filter("model", "model3", FilterOperator.Equal)["age"]
.GroupBy(g => g).Select(g=>(g.Key,g.Count()));
var d4 = machines.Filter("model", "model4", FilterOperator.Equal)["age"]
.GroupBy(g => g).Select(g=>(g.Key,g.Count()));
//define bars
var b1 = new Graph.Bar(){ x = d1.Select(x=>x.Item1),y = d1.Select(x=>x.Item2),name = "model1"};
var b2 = new Graph.Bar(){ x = d2.Select(x=>x.Item1),y = d2.Select(x=>x.Item2),name = "model2"};
var b3 = new Graph.Bar(){ x = d3.Select(x=>x.Item1),y = d3.Select(x=>x.Item2),name = "model3"};
var b4 = new Graph.Bar(){ x = d4.Select(x=>x.Item1),y = d4.Select(x=>x.Item2),name = "model4"};
//Plot machine data
var chart = Chart.Plot(new[] {b1,b2,b3,b4});
var layout = new XPlot.Plotly.Layout.Layout()
{ title = "Components Replacements",barmode="stack",
xaxis=new XPlot.Plotly.Graph.Xaxis() { title="Machine Age" },
yaxis = new XPlot.Plotly.Graph.Yaxis() { title = "Count" } };
//put layout into chart
chart.WithLayout(layout);
display(chart)
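Failures
The failures data represents component replacements caused by machine failures. Once a failure occurs, the machine stops. This is the key difference between errors and failures.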
failures.Head()
//count number of failures
var falValues = failures["failure"].GroupBy(v => v)
.OrderBy(group => group.Key)
.Select(group => Tuple.Create(group.Key, group.Count()));
//Plot Failure data
var chart = Chart.Plot(
new Graph.Bar()
{
x = falValues.Select(x=>x.Item1),
y = falValues.Select(x=>x.Item2),
// mode = "markers",
}
);
var layout = new XPlot.Plotly.Layout.Layout()
{ title = "Failure Distribution across machines",
xaxis=new XPlot.Plotly.Graph.Xaxis() { title="Component Name" },
yaxis = new XPlot.Plotly.Graph.Yaxis() { title = "Number of components replaced" } };
//put layout into chart
chart.WithLayout(layout);
display(chart)
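Feature Engineering
This section covers several feature engineering methods used to create features from machine attributes.
Lag Features from Telemetry
First, we create several lag features from the telemetry data, since telemetry is classic time series data.
Next, for every 3 hours, we compute the rolling mean and standard deviation of the telemetry data over the preceding 3-hour lag window.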
//prepare rolling aggregation for each column for the current (last) values
var agg_curent = new Dictionary<string, Aggregation>()
{
{ "datetime", Aggregation.Last }, { "volt", Aggregation.Last },
{ "rotate", Aggregation.Last },
{ "pressure", Aggregation.Last },{ "vibration", Aggregation.Last }
};
//prepare rolling aggregation for each column for average values
var agg_mean = new Dictionary<string, Aggregation>()
{
{ "datetime", Aggregation.Last }, { "volt", Aggregation.Avg },
{ "rotate", Aggregation.Avg },
{ "pressure", Aggregation.Avg },{ "vibration", Aggregation.Avg }
};
//prepare rolling aggregation for each column for std values
var agg_std = new Dictionary<string, Aggregation>()
{
{ "datetime", Aggregation.Last }, { "volt", Aggregation.Std },
{ "rotate", Aggregation.Std },
{ "pressure", Aggregation.Std },{ "vibration", Aggregation.Std }
};
//group Telemetry data by machine ID
var groupedTelemetry = telemetry.GroupBy("machineID");
//calculate rolling mean for grouped data for each 3 hours
var _3AvgValue = groupedTelemetry.Rolling(3, 3, agg_mean)
.Create(("machineID", null),
("datetime", null),("volt", "voltmean_3hrs"),
("rotate", "rotatemean_3hrs"),
("pressure", "pressuremean_3hrs"),
("vibration", "vibrationmean_3hrs"));
//show head of the newly generated table
_3AvgValue.Head()
//calculate rolling std for grouped data for each 3 hours
var _3StdValue = groupedTelemetry.Rolling(3, 3, agg_std)
.Create(("machineID", null), ("datetime", null),
("volt", "voltsd_3hrs"), ("rotate", "rotatesd_3hrs"),
("pressure", "pressuresd_3hrs"), ("vibration", "vibrationsd_3hrs"));
//show head of the new generated table
_3StdValue.Head()
To capture longer-term effects, we also compute the rolling mean and standard deviation over a 24-hour lag window.
//calculate rolling avg and std for each 24 hours
var _24AvgValue = groupedTelemetry.Rolling(24, 3, agg_mean)
.Create(("machineID", null), ("datetime", null),
("volt", "voltmean_24hrs"), ("rotate", "rotatemean_24hrs"),
("pressure", "pressuremean_24hrs"),
("vibration", "vibrationmean_24hrs"));
var _24StdValue = groupedTelemetry.Rolling(24, 3, agg_std)
.Create(("machineID", null), ("datetime", null),
("volt", "voltsd_24hrs"), ("rotate", "rotatesd_24hrs"),
("pressure", "pressuresd_24hrs"),
("vibration", "vibrationsd_24hrs"));
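To make the windowing idea concrete, here is a small library-free sketch of a rolling mean and standard deviation. It reflects our reading of the text (one output row every `step` hours, computed over the trailing `window` hours) and is not a description of how Daany implements Rolling. The function names, the made-up voltage values, and the choice of the sample standard deviation are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample standard deviation (n - 1 in the denominator); an assumption,
// since the text does not say which variant the library uses.
double SampleStd(IReadOnlyList<double> xs)
{
    if (xs.Count < 2) return double.NaN;
    var mean = xs.Average();
    var sumSq = xs.Sum(x => (x - mean) * (x - mean));
    return Math.Sqrt(sumSq / (xs.Count - 1));
}

// Emits one (End, Mean, Std) triple every `step` rows,
// each computed over the trailing `window` rows ending at index End.
List<(int End, double Mean, double Std)> Rolling(IReadOnlyList<double> values, int window, int step)
{
    var result = new List<(int End, double Mean, double Std)>();
    for (var end = window - 1; end < values.Count; end += step)
    {
        var slice = values.Skip(end - window + 1).Take(window).ToList();
        result.Add((end, slice.Average(), SampleStd(slice)));
    }
    return result;
}

// Hourly voltage readings for one machine (made-up numbers).
var volt = new double[] { 170, 172, 174, 168, 171, 173, 169 };
foreach (var r in Rolling(volt, window: 3, step: 3))
    Console.WriteLine($"row {r.End}: mean={r.Mean:F2}, sd={r.Std:F2}");
```

With `window: 24, step: 3` the same function would give the longer-term variant: one row every 3 hours, each summarizing the preceding 24 hours. In the notebook this is done per machine, which is why the telemetry is grouped by machineID first.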
Merging Telemetry Features
Once the rolling lag features are computed, we can merge them into a single data frame.
//before merge all features create set of features
//from the current values for every 3 or 24 hours
DataFrame _1CurrentValue = groupedTelemetry.Rolling(3, 3, agg_curent)
.Create(("machineID", null), ("datetime", null),
("volt", null), ("rotate", null),
("pressure", null), ("vibration", null));
Now that we have the base data frame, we merge the previously computed data frames into it.
//merge all telemetry data frames into one
var mergeCols= new string[] { "machineID", "datetime" };
var df1 = _1CurrentValue.Merge
(_3AvgValue, mergeCols, mergeCols, JoinType.Left, suffix: "df1");
var df2 = df1.Merge(_24AvgValue, mergeCols, mergeCols, JoinType.Left, suffix: "df2");
var df3 = df2.Merge(_3StdValue, mergeCols, mergeCols, JoinType.Left, suffix: "df3");
var df4 = df3.Merge(_24StdValue, mergeCols, mergeCols, JoinType.Left, suffix: "df4");
At the end of the merge process, we select the relevant columns.
//select final dataset for the telemetry
var telDF = df4["machineID","datetime","volt","rotate", "pressure", "vibration",
"voltmean_3hrs","rotatemean_3hrs","pressuremean_3hrs","vibrationmean_3hrs",
"voltmean_24hrs","rotatemean_24hrs","pressuremean_24hrs","vibrationmean_24hrs",
"voltsd_3hrs", "rotatesd_3hrs","pressuresd_3hrs","vibrationsd_3hrs",
"voltsd_24hrs", "rotatesd_24hrs","pressuresd_24hrs","vibrationsd_24hrs"];
//remove NANs
var telemetry_final = telDF.DropNA();
The top 5 rows of the final telemetry data now look as follows.
telemetry_final.Head()
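For readers less familiar with left joins on a composite key, the merge-then-drop pattern used above can be sketched with plain LINQ. This is only an illustration with made-up rows and hypothetical tuple field names; it does not use Daany's Merge or DropNA.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Base rows: one per (machineID, datetime) with the current reading.
var current = new List<(int MachineId, DateTime Time, double Volt)>
{
    (1, new DateTime(2015, 1, 1, 9, 0, 0), 170.0),
    (1, new DateTime(2015, 1, 1, 12, 0, 0), 168.0),
};

// Feature rows: a 24-hour mean that only exists for the later timestamp.
var mean24 = new List<(int MachineId, DateTime Time, double VoltMean24)>
{
    (1, new DateTime(2015, 1, 1, 12, 0, 0), 171.5),
};

// Left join: keep every base row, attach the feature when the key matches,
// otherwise leave it as NaN (the rows a later drop-missing step would remove).
var merged = current
    .GroupJoin(mean24,
               c => (c.MachineId, c.Time),
               m => (m.MachineId, m.Time),
               (c, ms) => (c.MachineId, c.Time, c.Volt,
                           VoltMean24: ms.Select(m => m.VoltMean24)
                                         .DefaultIfEmpty(double.NaN)
                                         .First()))
    .ToList();

var complete = merged.Where(r => !double.IsNaN(r.VoltMean24)).ToList();
Console.WriteLine($"{merged.Count} merged rows, {complete.Count} without missing values");
// prints "2 merged rows, 1 without missing values"
```

The early rows of each machine have no full 24-hour history, which is why missing values appear after the join and are dropped before the final telemetry table is produced.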
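Lag Features from Errors
Unlike telemetry, which has numerical values, the error data has categorical values that denote the type of error that occurred at a timestamp. We aggregate the error categories by counting how many errors of each type occurred within a lag window.
First, the errors are encoded with one-hot encoding.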
var mlContext = new MLContext(seed:2019);
//One Hot Encoding of error column
var encodedErr = errors.EncodeColumn(mlContext, "errorID");
//sum duplicated errors by machine and date
var errors_aggs = new Dictionary<string, Aggregation>();
errors_aggs.Add("error1", Aggregation.Sum);
errors_aggs.Add("error2", Aggregation.Sum);
errors_aggs.Add("error3", Aggregation.Sum);
errors_aggs.Add("error4", Aggregation.Sum);
errors_aggs.Add("error5", Aggregation.Sum);
//group and sum duplicated errors
encodedErr = encodedErr.GroupBy(new string[]
{ "machineID", "datetime" }).Aggregate(errors_aggs);
//
encodedErr = encodedErr.Create(("machineID", null), ("datetime", null),
("error1", "error1sum"), ("error2", "error2sum"),
("error3", "error3sum"), ("error4", "error4sum"),
("error5", "error5sum"));
encodedErr.Head()
// align errors with telemetry datetime values so that we can calculate aggregations
var er = telemetry.Merge(encodedErr,mergeCols, mergeCols, JoinType.Left, suffix: "error");
//
er = er["machineID","datetime", "error1sum",
"error2sum", "error3sum", "error4sum", "error5sum"];
//fill missing values with 0
er.FillNA(0);
er.Head()
//count the number of errors of different types in the last 24 hours, for every 3 hours
//define aggregation
var errors_aggs1 = new Dictionary<string, Aggregation>()
{
{ "datetime", Aggregation.Last },{ "error1sum", Aggregation.Sum },
{ "error2sum", Aggregation.Sum },
{ "error3sum", Aggregation.Sum },{ "error4sum", Aggregation.Sum },
{ "error5sum", Aggregation.Sum }
};
//count the number of errors of different types in the last 24 hours, for every 3 hours
var eDF = er.GroupBy(new string[] { "machineID"}).Rolling(24, 3, errors_aggs1);
//
var newdf= eDF.DropNA();
var errors_final = newdf.Create(("machineID", null), ("datetime", null),
("error1sum", "error1count"), ("error2sum", "error2count"),
("error3sum", "error3count"),
("error4sum", "error4count"), ("error5sum", "error5count"));
errors_final.Head()
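The two ideas in this section, one-hot encoding and counting errors per type in a trailing 24-hour window, can be illustrated without any library. The log entries below are made up, and the helper names OneHot and CountLast24h are hypothetical; the notebook itself relies on ML.NET and Daany for these steps.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var errorTypes = new[] { "error1", "error2", "error3", "error4", "error5" };

// Raw error log for one machine (made-up values).
var log = new List<(DateTime Time, string ErrorId)>
{
    (new DateTime(2015, 1, 1, 3, 0, 0), "error1"),
    (new DateTime(2015, 1, 1, 20, 0, 0), "error3"),
    (new DateTime(2015, 1, 2, 1, 0, 0), "error1"),
};

// One-hot encoding: each error becomes a vector with a single 1.
int[] OneHot(string errorId) =>
    errorTypes.Select(t => t == errorId ? 1 : 0).ToArray();

// Counts per error type in the 24 hours ending at `at` (start exclusive, end inclusive).
int[] CountLast24h(DateTime at) =>
    log.Where(e => e.Time > at.AddHours(-24) && e.Time <= at)
       .Select(e => OneHot(e.ErrorId))
       .Aggregate(new int[errorTypes.Length],
                  (acc, v) => acc.Zip(v, (a, b) => a + b).ToArray());

var counts = CountLast24h(new DateTime(2015, 1, 2, 3, 0, 0));
Console.WriteLine(string.Join(", ", errorTypes.Zip(counts, (t, c) => $"{t}count={c}")));
// prints "error1count=1, error2count=0, error3count=1, error4count=0, error5count=0"
```

Summing one-hot vectors is what turns a categorical column into per-type counts; the first error in the log falls just outside the window and is therefore not counted.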
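Time Since Last Replacement
Since the main task here is to create relevant features that produce a high-quality dataset for the machine learning part, one good feature is the number of replacements of each component in the last 3 months, which captures the replacement frequency.
In addition, we can calculate how much time has passed since a component was last replaced. This should correlate better with component failures, because the longer a component is in use, the more degradation is expected. First, we encode the maintenance table.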
//One Hot Encoding of comp column
var encMaint = maint.EncodeColumn(mlContext, "comp");
encMaint.Head()
//create separate data frames in order to calculate proper time since last replacement
DataFrame dfComp1 = encMaint.Filter("comp1", 1, FilterOperator.Equal)["machineID", "datetime"];
DataFrame dfComp2 = encMaint.Filter("comp2", 1, FilterOperator.Equal)["machineID", "datetime"];
DataFrame dfComp3 = encMaint.Filter("comp3", 1, FilterOperator.Equal)["machineID", "datetime"];
DataFrame dfComp4 = encMaint.Filter("comp4", 1, FilterOperator.Equal)["machineID", "datetime"];
dfComp4.Head()
//from telemetry data create a helper data frame so we can calculate
//additional columns from the maintenance data frame
var compData = telemetry_final.Create(("machineID", null), ("datetime", null));
%%time
//calculate new set of columns so that we have information
//the time since last replacement of each component separately
var newCols= new string[]{"sincelastcomp1","sincelastcomp2","sincelastcomp3","sincelastcomp4"};
var calcValues= new object[4];
//perform calculation
compData.AddCalculatedColumns(newCols,(row, i)=>
{
var machineId = Convert.ToInt32(row["machineID"]);
var date = Convert.ToDateTime(row["datetime"]);
var maxDate1 = dfComp1.Filter("machineID", machineId, FilterOperator.Equal)["datetime"]
.Where(x => (DateTime)x <= date).Select(x=>(DateTime)x).Max();
var maxDate2 = dfComp2.Filter("machineID", machineId, FilterOperator.Equal)["datetime"]
.Where(x => (DateTime)x <= date).Select(x=>(DateTime)x).Max();
var maxDate3 = dfComp3.Filter("machineID", machineId, FilterOperator.Equal)["datetime"]
.Where(x => (DateTime)x <= date).Select(x=>(DateTime)x).Max();
var maxDate4 = dfComp4.Filter("machineID", machineId, FilterOperator.Equal)["datetime"]
.Where(x => (DateTime)x <= date).Select(x=>(DateTime)x).Max();
//perform calculation
calcValues[0] = (date - maxDate1).TotalDays;
calcValues[1] = (date - maxDate2).TotalDays;
calcValues[2] = (date - maxDate3).TotalDays;
calcValues[3] = (date - maxDate4).TotalDays;
return calcValues;
});
Wall time: 178708.9764ms
var maintenance_final = compData;
maintenance_final.Head()
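The cell above filters the maintenance table again for every telemetry row, which helps explain the long wall time. As a possible alternative, and only as a sketch, the replacement dates of each machine and component could be sorted once and then searched with a binary search. The dates and the helper name DaysSinceLast below are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Replacement dates for one component of one machine, sorted once up front.
var replacements = new List<DateTime>
{
    new DateTime(2014, 11, 1),
    new DateTime(2015, 2, 15),
    new DateTime(2015, 6, 1),
};
replacements.Sort();

// Days since the most recent replacement at or before `at`;
// null when the component has no earlier replacement on record.
double? DaysSinceLast(DateTime at)
{
    var idx = replacements.BinarySearch(at);
    // On a miss, BinarySearch returns the bitwise complement of the insertion point.
    var last = idx >= 0 ? idx : ~idx - 1;
    return last >= 0 ? (at - replacements[last]).TotalDays : (double?)null;
}

Console.WriteLine(DaysSinceLast(new DateTime(2015, 3, 1)));         // 14 (days after 2015-02-15)
Console.WriteLine(DaysSinceLast(new DateTime(2014, 1, 1)) is null); // True
```

Unlike a Max() call over an empty sequence, which throws, this version returns null when no earlier replacement exists. In the dataset that case should be rare, since the maintenance records reach back into 2014 while the telemetry covers 2015.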
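Machine Features
The machines dataset contains descriptive information about the machines, such as the model type and the age (years in service).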
machines.Head()
Merging Features into the Final ML-Ready Dataset
As the last step of feature engineering, we merge all features into one dataset.
var merge2Cols=new string[]{"machineID"};
var fdf1= telemetry_final.Merge(errors_final, mergeCols, mergeCols,JoinType.Left, suffix: "er");
var fdf2 = fdf1.Merge(maintenance_final, mergeCols,mergeCols,JoinType.Left, suffix: "mn");
var features_final = fdf2.Merge(machines, merge2Cols,merge2Cols,JoinType.Left, suffix: "ma");
features_final= features_final["datetime", "machineID",
"voltmean_3hrs", "rotatemean_3hrs", "pressuremean_3hrs", "vibrationmean_3hrs",
"voltsd_3hrs", "rotatesd_3hrs", "pressuresd_3hrs", "vibrationsd_3hrs",
"voltmean_24hrs", "rotatemean_24hrs", "pressuremean_24hrs", "vibrationmean_24hrs",
"voltsd_24hrs","rotatesd_24hrs", "pressuresd_24hrs", "vibrationsd_24hrs",
"error1count", "error2count", "error3count", "error4count", "error5count",
"sincelastcomp1", "sincelastcomp2", "sincelastcomp3", "sincelastcomp4",
"model", "age"];
//
features_final.Head();
DataFrame.ToCsv("data/final_features.csv", features_final);
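Defining the Label Column
In predictive maintenance, the label should be the probability that a machine will fail in the near future because of the failure of a specific component. If we take 24 hours as the horizon for this problem, label construction consists of adding a new column to the feature dataset that indicates whether a given machine will fail within the next 24 hours because of a component failure.
Accordingly, we define the label as a categorical variable with the following values:
- none: the machine will not fail in the next 24 hours
- comp1 to comp4: the machine will fail in the next 24 hours because of the failure of that component
Since we can experiment with label construction by applying different conditions, we can implement a method that accepts several parameters and so defines a generic problem.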
failures.Describe(false)
//construct the label column, which indicates whether the current machine will
//fail in the next `predTime` (24 hours as default) due to the failure of a certain component.
//create final data frame from feature df
var finalDf = new DataFrame(features_final);
//group failures by machineID and datetime
string[] cols = new string[] { "machineID" , "datetime"};
var failDfgrp = failures.GroupBy(cols);
//Add failure column to finalDF
var rV = new object[] { "none" };
finalDf.AddCalculatedColumns(new string[]{"failure"}, (object[] row, int i) => rV);
//create new data frame from featuresDF by grouping machineID and datetime
var featureDfGrouped = finalDf["datetime","machineID", "failure"].GroupBy(cols);
//now look at every failure and determine whether the machine will fail in the next 24 hours
//in case two or more components failed for the same machine, add a new row to the df
var failureDfExt = featureDfGrouped.Transform((xdf) =>
{
    //extract the row from featureDfGrouped
    var xdfRow = xdf[0].ToList();
    var refDate = (DateTime)xdfRow[0];
    var machineID = (int)xdfRow[1];
    //now look if the failure contains the machineID
    if(failDfgrp.Group2.ContainsKey(machineID))
    {
        //get the date and calculate total hours
        var dff = failDfgrp.Group2[machineID];
        foreach (var dfff in dff)
        {
            for (int i = 0; i < dfff.Value.RowCount(); i++)
            {
                //"datetime","machineID","failure"
                var frow = dfff.Value[i].ToList();
                var dft = (DateTime)frow[0];
                //if the failure happens within the next 24 hours, set the component
                //in the failure column
                var totHours = (dft - refDate).TotalHours;
                if (totHours <= 24 && totHours >=0)
                {
                    if (xdf.RowCount() > i)
                        xdf["failure", i] = frow[2];
                    else//in case two components failed for the same machine
                        //at the same time, add a new row with the new component name
                    {
                        var r = xdf[0].ToList();
                        r[2] = frow[2];
                        xdf.AddRow(r);
                    }
                }
            }
        }
    }
    return xdf;
});
//Now merge extended failure Df with featureDF
var final_dataframe = finalDf.Merge(failureDfExt, cols, cols,JoinType.Left, "fail");
//define final set of columns
final_dataframe = final_dataframe["datetime", "machineID",
"voltmean_3hrs", "rotatemean_3hrs", "pressuremean_3hrs", "vibrationmean_3hrs",
"voltsd_3hrs", "rotatesd_3hrs", "pressuresd_3hrs", "vibrationsd_3hrs",
"voltmean_24hrs", "rotatemean_24hrs", "pressuremean_24hrs", "vibrationmean_24hrs",
"voltsd_24hrs", "rotatesd_24hrs", "pressuresd_24hrs", "vibrationsd_24hrs",
"error1count", "error2count", "error3count", "error4count", "error5count",
"sincelastcomp1", "sincelastcomp2", "sincelastcomp3", "sincelastcomp4",
"model", "age", "failure_fail"];
//rename column
final_dataframe.Rename(("failure_fail", "failure"));
//save the file data frame to disk
DataFrame.ToCsv("data/final_dataFrame.csv",final_dataframe);
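Stripped of the data frame plumbing, the labelling rule is small. The following sketch shows it for a single machine with made-up failure records. The function name Label and its horizonHours parameter are hypothetical, and unlike the notebook code above, which adds an extra row when two components fail at the same time, this simplified version keeps only the earliest failure in the window.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Failure records for one machine (made-up values).
var failureLog = new List<(DateTime Time, string Component)>
{
    (new DateTime(2015, 1, 5, 6, 0, 0), "comp4"),
    (new DateTime(2015, 3, 6, 6, 0, 0), "comp1"),
};

// Label for a feature row at `rowTime`: the component that fails within
// the next `horizonHours` hours, or "none" when no failure is that close.
// If several failures fall in the window, this sketch keeps the earliest one.
string Label(DateTime rowTime, double horizonHours = 24)
{
    var hit = failureLog
        .Where(f => (f.Time - rowTime).TotalHours >= 0 &&
                    (f.Time - rowTime).TotalHours <= horizonHours)
        .OrderBy(f => f.Time)
        .Select(f => f.Component)
        .FirstOrDefault();
    return hit ?? "none";
}

Console.WriteLine(Label(new DateTime(2015, 1, 4, 9, 0, 0)));   // comp4: failure is 21 hours ahead
Console.WriteLine(Label(new DateTime(2015, 1, 4, 3, 0, 0)));   // none: failure is 27 hours ahead
Console.WriteLine(Label(new DateTime(2015, 1, 5, 9, 0, 0)));   // none: failure already happened
```

Making the horizon a parameter is one way to experiment with different label definitions, for example a 48-hour warning window, without touching the rest of the pipeline.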
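The Final Data Frame
Let's see what final_dataframe looks like. It contains 30 columns, most of them numerical. The model column is categorical and should be encoded when preparing the machine learning part.
The label column, failure, is also categorical and contains 5 different categories: none, comp1, comp2, comp3, and comp4. We can also see that the dataset is imbalanced: there are 2785705 none rows, while all other categories together account for 5923 rows. This is a typical imbalanced dataset, and we should be careful when evaluating the model, because a model that always returns none would achieve an accuracy of more than 97%.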
final_dataframe.Describe(false)
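The imbalance warning is easy to check with a few lines of arithmetic. The sketch below uses the class counts quoted in the text; the variable names are ours, and the "always none" model is a hypothetical baseline rather than anything trained in this article.

```csharp
using System;

// Class counts quoted in the text above.
long noneRows = 2785705;
long failureRows = 5923;

// Accuracy of a trivial model that predicts "none" for every row.
double baselineAccuracy = (double)noneRows / (noneRows + failureRows);

// Recall of that model on the failure classes: it never predicts one.
double failureRecall = 0.0 / failureRows;

Console.WriteLine($"baseline accuracy: {baselineAccuracy:P2}, failure recall: {failureRecall:P0}");
```

A model has to be judged against this baseline, and per-class metrics such as recall or precision on comp1 to comp4 are more informative here than overall accuracy.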
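In the next part, we will implement the training and evaluation of the predictive maintenance model. The complete notebook for this blog post is available at the link given at the top of this article.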