强大的 CSV 文档包装库
我需要一个工具来在我的 .NET 程序和 R 程序之间交换数据,是的,这个强大的库让事情变得更好!
下载 DocumentFormat.Csv-noexe.zip
为什么选择 CSV?
- 首先,CSV 文档只是一个格式化的纯文本文件,可以被任何程序轻松读取和写入。
- 其次,如果我们只需要一个数据文件,而不是带有特定样式数据的格式良好的 Excel 文件,那么 CSV 文档是跨平台软件的最佳选择,因为 MS Office Excel 在 LINUX 上不起作用,并且在 LINUX Mono 环境下的 .NET 程序不容易在不修改源代码的情况下使用 OpenOffice 的 Excel 库 C++ 接口。
- 第三,作为一名生物研究员,R 程序是我们最常用的实验数据统计软件,当我们开发一个 .NET 程序通过 RDotNET 在 LINUX/Windows 上与 R 进行交互时,CSV 文档是 R 和我开发的工具之间数据交换的最佳选择。使用 Visual Basic 操作 R,通过 CSV 数据交换在 R 中创建复杂对象非常方便。
Call DataCollection.SaveTo(CsvPath)
R = R << Push(String.Format(“ExperimentData <- read.csv(“”{0}””)”, CsvPath))
R operation statements, balabalabala....
图 1. CSV 文档的优势
本库中的重要组件
CSV 文件对象的实现
CsvFile 对象
CsvFile 对象实例代表文件系统中的一个 CSV 文件。在您的代码中创建一个 CSV 文件对象非常容易。
Dim CsvFile As Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.File = Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.File.Load(Path:=CsvPath)
Csv.File 类的 Load 函数定义如下:
Public Shared Function Load(Path As String, Optional encoding As System.Text.Encoding = Nothing) As File If encoding Is Nothing Then encoding = System.Text.Encoding.Default End If Dim Csv = CType(System.IO.File.ReadAllLines(Path, encoding:=encoding), File) Csv.File = Path Return Csv End Function
用于将 CSV 对象中的数据保存到文件系统的 save 方法。
''' <summary>
'''
''' </summary>
''' <param name="Path"></param>
''' <param name="LazySaved">Optional, this is for the consideration of performance and memory consumption.
''' When a data file is very large, then you may encounter a out of memory exception on a 32 bit platform,
''' then you should set this parameter to True to avoid this problem. Defualt is False for have a better
''' performance.
''' (当估计到文件的数据量很大的时候,请使用本参数,以避免内存溢出致使应用程序崩溃,默认为False,不开启缓存)
''' </param>
''' <remarks>当目标保存路径不存在的时候,会自动创建文件夹</remarks>
Public Sub Save(Optional Path As String = "", Optional LazySaved As Boolean = False, Optional encoding As System.Text.Encoding = Nothing)
这里有一个编码的可选参数;这是为了避免在不同操作系统上出现中文乱码问题,例如 Windows 默认使用 ANSI,而 LINUX 默认使用 Unicode。
这是 Csv.File 类主要的 Csv.File 数据结构定义。
''' <summary>
''' A comma character seperate table file that can be read and write in the EXCEL.(一个能够被Excel程序所读取的表格文件)
''' </summary>
''' <remarks></remarks>
Public Class File : Implements Generic.IEnumerable(Of Csv.File.Row)
Implements Generic.IList(Of DocumentFormat.Csv.File.Row)
''' <summary>
''' First line in the table is the column name definition line.
''' </summary>
''' <remarks></remarks>
Protected Friend _InnerListTable As List(Of Row) = New List(Of Row)
以及 Csv.File 类中定义的一些用于数据操作的有用方法。
''' <summary>
''' Add a data row collection into this Csv file object instance and then return the total row number after the add operation.
''' (向CSV文件之中批量添加行记录,之后返回当前所打开的文件在添加纪录之后的总行数)
''' </summary>
''' <param name="RowCollection"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Function AppendRange(RowCollection As Generic.IEnumerable(Of Row)) As Long
''' <summary>
''' Get a data row in the specific row number, when a row is not exists in current csv file then the function will return a empty row.
''' (当目标对象不存在的时候,会返回一个空行对象)
''' </summary>
''' <param name="line"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Function Get_DataRow(line As Integer) As Row
''' <summary>
''' Using the content in a specific column as the target for search using a specific keyword, and then return all of the rows that have the query keyword.
''' (以指定的列中的内容搜索关键词,并返回检索成功的行的集合)
''' </summary>
''' <param name="KeyWord"></param>
''' <param name="Column"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Function FindAtColumn(KeyWord As String, Column As Integer) As Row()
''' <summary>
''' Get and set the string content in a specific table cell.(设置或者获取某一个指定的单元格中的字符串内容)
''' </summary>
''' <param name="X"></param>
''' <param name="Y"></param>
''' <value></value>
''' <returns></returns>
''' <remarks></remarks>
Public Property Cell(X As Integer, Y As Integer) As String
''' <summary>
''' Delete all of the row that meet the delete condition.(将所有满足条件的行进行删除)
''' </summary>
''' <param name="Condition"></param>
''' <remarks></remarks>
Public Function Remove(Condition As Func(Of Row, Boolean)) As Row()
''' <summary>
''' Remove the item in a specific row collection.
''' </summary>
''' <param name="RowCollection"></param>
''' <remarks></remarks>
Public Sub RemoveRange(RowCollection As Generic.IEnumerable(Of Row))
''' <summary>
''' Insert a new empty line of row data before the specific row number.(在指定列标号的列之前插入一行空列)
''' </summary>
''' <param name="column"></param>
''' <remarks></remarks>
Public Function InsertEmptyColumnBefore(column As Integer) As Integer
Row 对象
''' <summary>
''' A line of data in the csv file.(Csv表格文件之中的一行)
''' </summary>
''' <remarks></remarks>
Public Class Row : Implements Generic.IEnumerable(Of String)
Implements Generic.IList(Of System.String)
''' <summary>
''' 本行对象中的所有的单元格的数据集合
''' </summary>
''' <remarks></remarks>
Protected Friend _InnerColumnList As List(Of String) = New List(Of String)
如何将字符串数据解析为 Row 对象
正如我们所见,Row 对象主要由一个字符串集合组成,因此我们只需要将字符串分割成一个字符串集合,而一个神奇的正则表达式可以轻松完成这项工作。
''' <summary>
''' A regex expression string that use for split the line text.
''' </summary>
''' <remarks></remarks>
Protected Friend Const SplitRegxExpression As String = "[" & vbTab & ",](?=(?:[^""]|""[^""]*"")*$)"
这是解析工作的代码实现细节。
''' <summary>
''' Row parsing into column tokens
''' </summary>
''' <param name="Line"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Shared Widening Operator CType(Line As String) As Row
Dim Row = Regex.Split(Line, SplitRegxExpression)
For i As Integer = 0 To Row.Count - 1
If Not String.IsNullOrEmpty(Row(i)) Then
If Row(i).First = """"c AndAlso Row(i).Last = """"c Then
Row(i) = Mid(Row(i), 2, Len(Row(i)) - 2)
End If
End If
Next
Return Row
End Operator
以及 Row 数据操作的一些主要方法。
''' <summary>
''' Get the cell data in a specific column number. if the column is not exists in this row then will return a empty string.
''' (获取某一列中的数据,若该列不存在则返回空字符串)
''' </summary>
''' <param name="Index"></param>
''' <returns></returns>
''' <remarks></remarks>
Default Public Property Column(Index As Integer) As String Implements IList(Of String).Item
''' <summary>
''' is this row object contains any data?
''' </summary>
''' <value></value>
''' <returns></returns>
''' <remarks></remarks>
Public ReadOnly Property IsNullOrEmpty As Boolean
''' <summary>
''' insert the data into a spercific column
''' </summary>
''' <param name="value"></param>
''' <param name="column"></param>
''' <returns>仅为LINQ查询使用的一个无意义的值</returns>
''' <remarks></remarks>
Public Function InsertAt(value As String, column As Integer) As Integer
''' <summary>
''' Takes the data in the specific number of columns, if columns is not exists in this row object, then a part of returned data will be the empty string.
''' </summary>
''' <param name="Count"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Function Takes(Count As Integer) As String()
''' <summary>
''' Takes the data in the specific column index collection, if the column is not exists in the row object, then a part of the returned data will be the empty string.
''' </summary>
''' <param name="Cols"></param>
''' <param name="retNullable">(当不存在数据的时候是否返回空字符串,默认返回空字符串)</param>
''' <returns></returns>
''' <remarks></remarks>
Public Function Takes(Cols As Integer(), Optional retNullable As Boolean = True) As String()
Public Function AddRange(values As Generic.IEnumerable(Of String)) As Integer
Public Sub Add(columnValue As String) Implements ICollection(Of String).Add
导入操作
由于 CSV 文件格式使用逗号字符分隔单元格数据,但有些文本文件不使用逗号作为分隔符,它可能使用 TAB 或几个空格字符,那么我们可以导入目标文本文件中的数据来创建一个 CSV 文件对象,就像 Excel 应用程序的数据导入操作一样。
DataImports 模块中定义了两种数据导入操作。
''' <summary>
''' Imports the data in a well formatted text file using a specific delimiter, default delimiter is comma character.
''' </summary>
''' <param name="txtPath">The file path for the data imports text file.(将要进行数据导入的文本文件)</param>
''' <returns></returns>
''' <remarks></remarks>
Public Function [Imports](txtPath As String, Optional delimiter As Char = ","c, Optional encoding As System.Text.Encoding = Nothing) As Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.File
Dim Expression As String = String.Format(SplitRegxExpression, delimiter)
If encoding Is Nothing Then
encoding = System.Text.Encoding.Default
End If
Dim Lines As String() = IO.File.ReadAllLines(txtPath, encoding)
Dim LQuery = (From line As String In Lines Select RowParsing(line, Expression)).ToArray
Return LQuery
End Function
''' <summary>
''' Imports the data in a well formatted text file using the fix length as the data separate method.
''' </summary>
''' <param name="txtPath"></param>
''' <param name="length">固定的列字符数的宽度</param>
''' <returns></returns>
''' <remarks></remarks>
Public Function FixLengthImports(txtPath As String, Optional length As Integer = 10, Optional encoding As System.Text.Encoding = Nothing) As Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.File
If encoding Is Nothing Then
encoding = System.Text.Encoding.Default
End If
Dim Lines As String() = IO.File.ReadAllLines(txtPath, encoding)
Dim LQuery = (From line As String In Lines Select RowParsing(line, length:=length)).ToArray
Return LQuery
End Function
令人惊叹的反射包装器
我在这库中的这部分工作是最强大和最令人惊叹的!它让我的编码工作更快乐,我相信它也会让您更快乐!让我们来看一个代码示例!所有这些包装器操作都定义在 Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection 命名空间中。
它是如何工作的?
反射包装器操作基于 .NET 中的反射操作。实现此功能有三个步骤。
图 2. 包装器操作的工作流程
第一步:创建数据架构
必需的自定义属性
Namespace Csv.Reflection
<AttributeUsage(AttributeTargets.Property, allowmultiple:=True, inherited:=False)>
Public Class ColumnAttribute : Inherits Attribute
Protected Friend _ElementDataType As Type = Type.String
Protected Friend _Name As String
Protected Friend _bindProperty As System.Reflection.PropertyInfo
Sub New(Name As String, Optional Type As Type = Type.String)
''' <summary>
''' 并不建议使用本Csv属性来储存大量的文本字符串,极容易出错
''' </summary>
''' <remarks></remarks>
<AttributeUsage(AttributeTargets.Property, allowmultiple:=True, inherited:=False)>
Public Class ArrayColumn : Inherits Attribute
Protected Friend _Delimiter As String
Protected Friend _ElementDataType As ColumnAttribute.Type
Protected Friend _Name As String
Protected Friend _bindProperty As System.Reflection.PropertyInfo
Sub New(Name As String, Optional Delimiter As String = "; ", Optional ElementDataType As ColumnAttribute.Type = ColumnAttribute.Type.String)
用于创建数据架构的两个函数。
Private Shared Function GetArrayColumns(ItemTypeProperties As System.Reflection.PropertyInfo()) As Reflection.ArrayColumn()
Dim LQuery = (From [PropertyInfo] As System.Reflection.PropertyInfo
In ItemTypeProperties
Let attrs As Object() = [PropertyInfo].GetCustomAttributes(Reflection.ArrayColumn.TypeInfo, inherit:=False)
Where Not attrs.IsNullOrEmpty
Select DirectCast(attrs.First, Reflection.ArrayColumn).Bind(PropertyInfo)).ToArray
Return LQuery
End Function
Private Shared Function GetColumns(Type As System.Type) As Reflection.ColumnAttribute()
Dim ItemTypeProperties = Type.GetProperties(BindingFlags.Public Or BindingFlags.Instance)
Dim [Property] As PropertyInfo
Dim ColumnList As List(Of Reflection.ColumnAttribute) = New List(Of ColumnAttribute)
For i As Integer = 0 To ItemTypeProperties.Length - 1
[Property] = ItemTypeProperties(i)
Dim ColumnAttrList = (From attr In [Property].GetCustomAttributes(ColumnAttribute.TypeInfo, inherit:=False) Select DirectCast(attr, ColumnAttribute)).ToArray
If Not ColumnAttrList.IsNullOrEmpty Then
For idx As Integer = 0 To ColumnAttrList.Count - 1
Dim columnAttr = ColumnAttrList(idx)
If Not Len(columnAttr._Name) > 0 Then
columnAttr._Name = [Property].Name
End If
ColumnList.Add(columnAttr.Bind([Property]))
Next
End If
Next
Return ColumnList.ToArray
End Function
由于只有一些基本数据类型可以直接转换为字符串或将字符串转换为目标类型,因此我定义了一个类型枚举来指定目标属性的数据类型,这段代码比我们直接使用反射获取属性类型信息能提高程序性能。
Public Enum Type
[Object] = -1
[Integer] = 0
[Long]
[String]
[Double]
[DateTime]
[Bool]
End Enum
第二步:数据转换
在上面通过反射操作创建数据架构后,我们就可以在数据对象和 Row 对象之间进行数据类型转换。
这是数据转换方法。
Friend Shared ReadOnly CTypers As System.Func(Of String, Object)() = {
Function(s As String) CInt(Val(s)),
Function(s As String) CType(Val(s), Long),
Function(s As String) s,
Function(s As String) Val(s),
Function(s As String) CType(s, DateTime),
Function(s As String) CType(s, Boolean)}
Public Shared Function Convert(s As String, DataType As Type) As Object
Return CTypers(DataType)(s)
End Function
还有一个问题是包装器操作可以转换数据数组对象,所以我创建了这个方法用于将数组转换为字符串行,并将字符串数组转换为对象数组。
Public Function CreateObject(cellData As String) As String()
Dim Tokens As String() = Split(cellData, _Delimiter)
Return Tokens
End Function
Private ReadOnly Property DelimiterLength As Integer
Get
Return Len(_Delimiter)
End Get
End Property
Public Function CreateObject(Of T)(DataCollection As Generic.IEnumerable(Of T)) As String
Dim Type As System.Type = GetType(T)
Dim sBuilder As StringBuilder = New StringBuilder(1024)
If Not (Type.IsValueType OrElse Type = GetType(String)) Then
Call Console.WriteLine("[WARNNING] DataType ""{0}"" is not a validated value type, trying to get string data from its Object.ToString() method!", Type.FullName)
End If
Dim StringCollection As String() = (From item In DataCollection Let _create = Function() item.ToString Select _create()).ToArray
For Each item In StringCollection
Call sBuilder.AppendFormat("{0}{1}", item, _Delimiter)
Next
Call sBuilder.Remove(sBuilder.Length - DelimiterLength, DelimiterLength)
Return sBuilder.ToString
End Function
有一个问题是我们在将集合数据传递给 Property.SetValue 方法时,我们解析的是从文本文件中加载的字符串数组,但目标属性的数据类型可能是 Double 数组,那么如果我们直接将这个对象集合参数传递给 Property.SetValue 方法,就会得到一个数据类型不匹配的异常,所以我使用了 lambda 表达式来解决这个问题。
Protected Friend Shared ReadOnly FillObjects As Dictionary(Of Reflection.ColumnAttribute.Type, System.Action(Of PropertyInfo, Object, String())) =
New Dictionary(Of ColumnAttribute.Type, Action(Of PropertyInfo, Object, String()))
From {
{ColumnAttribute.Type.Bool, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
Call bindProperty.SetValue(FilledObject, (From strData As String In strValues Select CType(strData, Boolean)).ToArray, Nothing)},
{ColumnAttribute.Type.DateTime, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
Call bindProperty.SetValue(FilledObject, (From strData As String In strValues Select CType(strData, DateTime)).ToArray, Nothing)},
{ColumnAttribute.Type.Double, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
Call bindProperty.SetValue(FilledObject, (From strData As String In strValues Select CType(strData, Double)).ToArray, Nothing)},
{ColumnAttribute.Type.Integer, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
Call bindProperty.SetValue(FilledObject, (From strData As String In strValues Select CType(strData, Integer)).ToArray, Nothing)},
{ColumnAttribute.Type.Long, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
Call bindProperty.SetValue(FilledObject, (From strData As String In strValues Select CType(strData, Long)).ToArray, Nothing)},
{ColumnAttribute.Type.String, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
Call bindProperty.SetValue(FilledObject, strValues, Nothing)}}
所有包装器操作代码都可以在 Csv.Reflection.Reflector 类中找到。
''' <summary>
''' Method for load a csv data file into a specific type of object collection.
''' </summary>
''' <typeparam name="ItemType"></typeparam>
''' <param name="Explicit">当本参数值为False的时候,所有的简单属性值都将被解析出来,而忽略掉其是否带有<see cref="Csv.Reflection.ColumnAttribute"></see>自定义属性</param>
''' <param name="Path"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Shared Function Load(Of ItemType)(Path As String, Optional Explicit As Boolean = True, Optional encoding As System.Text.Encoding = Nothing) As List(Of ItemType)
''' <summary>
''' Save the specifc type object collection into the csv data file
''' </summary>
''' <typeparam name="ItemType"></typeparam>
''' <param name="Collection"></param>
''' <param name="Explicit"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Shared Function Save(Of ItemType)(Collection As Generic.IEnumerable(Of ItemType), Optional Explicit As Boolean = True) As Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.File
上面这两个方法有一个可选参数 Explicit,表示 True 时只解析具有 column 自定义属性的属性,False 时解析所有属性,即使它们没有 column 自定义属性。我只是想让事情变得更好,更容易控制。这就像在进行 XML 序列化时,所有没有 xml serialization 自定义属性的属性也可以被序列化到 XML 文件中。
我还创建了两个扩展方法,以便更容易进行 IO 操作。
''' <summary>
''' Load a csv data file document using a specific object type.(将某一个Csv数据文件加载仅一个特定类型的对象集合中)
''' </summary>
''' <typeparam name="T"></typeparam>
''' <param name="Path"></param>
''' <param name="explicit"></param>
''' <param name="encoding"></param>
''' <returns></returns>
''' <remarks></remarks>
<Extension> Public Function LoadCsv(Of T)(Path As String, Optional explicit As Boolean = True, Optional encoding As System.Text.Encoding = Nothing) As List(Of T)
Return Csv.Reflection.Reflector.Load(Of T)(Path, explicit, encoding)
End Function
<Extension> Public Sub SaveTo(Of T)(Collection As Generic.IEnumerable(Of T), path As String, Optional explicit As Boolean = True, Optional encoding As System.Text.Encoding = Nothing)
Call Csv.Reflection.Reflector.Save(Collection, explicit).Save(path, False, encoding)
End Sub
一个简单的例子:如何使用?
如果我们有一个 .NET 程序中具有这种数据结构的项目数据集合,在预处理后,将其传递给 R 程序并创建 R 中的复杂结构对象,我们该如何做?最神奇的事情是使用本库中的反射包装器来完成这项工作:
就像我们在 .NET 程序中有这样的示例数据结构,并且目标属性上已经定义了自定义属性。
Public Class ExampleExperimentData
Public Property Id As String
<Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection.Column("RPKM", DataVisualization.DocumentFormat.Csv.Reflection.ColumnAttribute.Type.Double)>
Public Property ExpressionRPKM As Double
<Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection.ArrayColumn("tags")>
Public Property Tags As String()
End Class
那么目标 CSV 数据文件将被加载到一个特定类型的数据集合对象中!事情就是这么简单和智能!
Dim CsvPath As String = "csv/data/file/path"
Dim DataCollection = CsvPath.LoadCsv(Of ExampleExperimentData)(explicit:=True)
你看,就像读取数据一样,将数据集合对象中的数据保存也很简单!
Call DataCollection.SaveTo(CsvPath, explicit:=True)
一个简单的测试示例
Imports Microsoft.VisualBasic.DataVisualization.DocumentFormat.Extensions
Module TestMain
Sub Main()
Dim DataCollection As ExampleExperimentData() = New ExampleExperimentData() {
New ExampleExperimentData With {.Id = "GeneId_0001", .ExpressionRPKM = 0, .Tags = New String() {"Up", "Regulator"}},
New ExampleExperimentData With {.Id = "GeneId_0002", .ExpressionRPKM = 1, .Tags = New String() {"Up", "PathwayA"}},
New ExampleExperimentData With {.Id = "GeneId_0003", .ExpressionRPKM = 2, .Tags = New String() {"Down", "Virulence"}}}
Dim CsvPath As String = "./TestData.csv"
Call DataCollection.SaveTo(CsvPath, explicit:=False)
DataCollection = Nothing
DataCollection = CsvPath.LoadCsv(Of ExampleExperimentData)(explicit:=False).ToArray
Dim File = Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection.Reflector.Save(Of ExampleExperimentData)(DataCollection, Explicit:=False)
For Each row In File
Call Console.WriteLine(row.ToString)
Next
Console.WriteLine("Press any key to continute...")
Console.Read()
End Sub
Public Class ExampleExperimentData
Public Property Id As String
<Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection.Column("RPKM", DataVisualization.DocumentFormat.Csv.Reflection.ColumnAttribute.Type.Double)>
Public Property ExpressionRPKM As Double
<Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection.ArrayColumn("tags")>
Public Property Tags As String()
End Class
End Module
图 3. 测试示例控制台输出和自动生成的 CSV 文档